TY - JOUR
T1 - The Feasibility of Automated Topic Analysis
T2 - An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification
AU - Tseng, Yuen Hsien
N1 - Publisher Copyright: © 2020, Journal of Educational Media and Library Sciences. All Rights Reserved.
PY - 2020
Y1 - 2020
AB - Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. In library science applications the categories are normally topical, although they can be any labels suitable for an application. Thus, TC often requires topical analysis, which relies on human knowledge. In recent decades, however, machine learning (ML) techniques have been applied to TC for efficiency, provided that a sufficient number of training texts is available for each category. In real-world cases, the number of texts (documents) per category is often highly skewed for a given TC task. This leads to the problem of predicting labels for small categories, which is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of ML inspired by human neural networks. This study aims to evaluate whether DL techniques are feasible for this problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese, and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets, but is still not feasible for highly skewed TC tasks. The other three DL methods (CNN, RCNN, fastText) do not show any advantage over traditional methods such as SVM on the five TC tasks, even though they capture extra language knowledge in their pretrained word representations. To facilitate future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, for verification and further study at https://github.com/SamTseng/Chinese_Skewed_TxtClf.
KW - Deep learning
KW - Performance evaluation
KW - Real-world corpus
KW - Text categorization
UR - http://www.scopus.com/inward/record.url?scp=85101770865&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101770865&partnerID=8YFLogxK
U2 - 10.6120/JoEMLS.202003_57(1).0047.RS.CE
DO - 10.6120/JoEMLS.202003_57(1).0047.RS.CE
M3 - Article
AN - SCOPUS:85101770865
SN - 1013-090X
VL - 57
SP - 121
EP - 144
JO - Journal of Educational Media and Library Sciences
JF - Journal of Educational Media and Library Sciences
IS - 1
ER -