TY - GEN
T1 - An Innovative BERT-Based Readability Model
AU - Tseng, Hou-Chiang
AU - Chen, Hsueh-Chih
AU - Chang, Kuo-En
AU - Sung, Yao-Ting
AU - Chen, Berlin
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - Readability refers to the degree to which a given text (article) can be understood by readers. When reading a text with high readability, readers achieve better comprehension and learning retention. However, it has been a long-standing challenge to develop effective readability prediction models that can automatically and accurately assess the readability of a given text. When building readability prediction models for the Chinese language, word segmentation ambiguity is a thorny problem that inevitably arises in the pre-processing of texts. In view of this, we present in this paper a novel readability prediction approach for the Chinese language, building on the recently proposed Bidirectional Encoder Representations from Transformers (BERT) model, which can capture both syntactic and semantic information of a text directly from its character-level representation. With a BERT-based readability prediction model that takes consecutive character-level representations as its input, we effectively assess the readability of a given text without the need to perform error-prone word segmentation. We empirically evaluate the performance of our BERT-based readability prediction model on a benchmark task by comparing it with a strong baseline that utilizes a celebrated classification model (named fastText) in conjunction with word-level representations. The results demonstrate that the BERT-based model with character-level representations performs on par with the fastText-based model with word-level representations, yielding an accuracy of 78.45% on average. This finding also offers the promise of conducting readability assessment of Chinese texts directly from character-level representations.
AB - Readability refers to the degree to which a given text (article) can be understood by readers. When reading a text with high readability, readers achieve better comprehension and learning retention. However, it has been a long-standing challenge to develop effective readability prediction models that can automatically and accurately assess the readability of a given text. When building readability prediction models for the Chinese language, word segmentation ambiguity is a thorny problem that inevitably arises in the pre-processing of texts. In view of this, we present in this paper a novel readability prediction approach for the Chinese language, building on the recently proposed Bidirectional Encoder Representations from Transformers (BERT) model, which can capture both syntactic and semantic information of a text directly from its character-level representation. With a BERT-based readability prediction model that takes consecutive character-level representations as its input, we effectively assess the readability of a given text without the need to perform error-prone word segmentation. We empirically evaluate the performance of our BERT-based readability prediction model on a benchmark task by comparing it with a strong baseline that utilizes a celebrated classification model (named fastText) in conjunction with word-level representations. The results demonstrate that the BERT-based model with character-level representations performs on par with the fastText-based model with word-level representations, yielding an accuracy of 78.45% on average. This finding also offers the promise of conducting readability assessment of Chinese texts directly from character-level representations.
KW - BERT
KW - Readability
KW - Representation learning
KW - Text classification
KW - fastText
UR - http://www.scopus.com/inward/record.url?scp=85076753876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076753876&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-35343-8_32
DO - 10.1007/978-3-030-35343-8_32
M3 - Conference contribution
AN - SCOPUS:85076753876
SN - 9783030353421
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 301
EP - 308
BT - Innovative Technologies and Learning - 2nd International Conference, ICITL 2019, Proceedings
A2 - Rønningsbakk, Lisbet
A2 - Wu, Ting-Ting
A2 - Sandnes, Frode Eika
A2 - Huang, Yueh-Min
PB - Springer
T2 - 2nd International Conference on Innovative Technologies and Learning, ICITL 2019
Y2 - 2 December 2019 through 5 December 2019
ER -