An Innovative BERT-Based Readability Model

Hou Chiang Tseng, Hsueh Chih Chen, Kuo En Chang, Yao Ting Sung, Berlin Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Readability refers to the degree of difficulty with which a given text (article) can be understood by readers. When readers read a highly readable text, they achieve better comprehension and retention of learning. However, developing effective readability prediction models that can automatically and accurately assess the readability of a given text has been a long-standing challenge. When building readability prediction models for Chinese, word segmentation ambiguity is a thorny problem that inevitably arises during text pre-processing. In view of this, we present a novel readability prediction approach for Chinese, built on the recently proposed Bidirectional Encoder Representations from Transformers (BERT) model, which can capture both syntactic and semantic information of a text directly from its character-level representation. With a BERT-based readability prediction model that takes consecutive character-level representations as input, we can assess the readability of a given text without performing error-prone word segmentation. We empirically evaluate our BERT-based model on a benchmark task, comparing it with a strong baseline that pairs a celebrated classification model, fastText, with word-level representations. The results demonstrate that the BERT-based model with character-level representations performs on par with the fastText-based model with word-level representations, yielding an average accuracy of 78.45%. This finding also offers the promise of assessing the readability of Chinese text directly from character-level representations.
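To make the approach concrete, below is a minimal sketch of a character-level BERT readability classifier of the kind the abstract describes, using the Hugging Face transformers library. The checkpoint name (bert-base-chinese), the number of readability levels, and the inference details are illustrative assumptions; the paper does not specify them here.

```python
# A minimal sketch of a character-level BERT readability classifier,
# in the spirit of the model the abstract describes. The checkpoint,
# NUM_LEVELS, and max length are assumptions for illustration only.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_LEVELS = 6  # hypothetical number of readability levels (e.g., grade bands)

# "bert-base-chinese" tokenizes Chinese text character by character,
# so no word segmentation is needed before feeding the model.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_LEVELS
)
model.eval()

def predict_readability(text: str) -> int:
    """Return the predicted readability level of a Chinese text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, NUM_LEVELS)
    return int(logits.argmax(dim=-1))

print(predict_readability("這是一段用來測試可讀性模型的中文文字。"))
```

For contrast, here is a minimal sketch of the word-level fastText baseline. Unlike the BERT model, it requires the text to be segmented into words first (here with jieba, an assumed segmenter; the paper's segmenter is not named), which is precisely the error-prone pre-processing step the character-level model avoids. The training-file format and hyperparameters are illustrative.

```python
# A minimal sketch of the word-level fastText baseline, assuming jieba
# for Chinese word segmentation. A real corpus would contain one
# "__label__<level> <segmented text>" line per training text.
import fasttext
import jieba

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("__label__3 " + " ".join(jieba.cut("這是一段範例文字。")) + "\n")

model = fasttext.train_supervised(input="train.txt", epoch=25, dim=100)
labels, probs = model.predict(" ".join(jieba.cut("另一段待評估的文字。")))
print(labels[0], probs[0])
```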

Original language: English
Title of host publication: Innovative Technologies and Learning - 2nd International Conference, ICITL 2019, Proceedings
Editors: Lisbet Rønningsbakk, Ting-Ting Wu, Frode Eika Sandnes, Yueh-Min Huang
Publisher: Springer
Pages: 301-308
Number of pages: 8
ISBN (Print): 9783030353421
DOI: 10.1007/978-3-030-35343-8_32
Publication status: Published - 2019 Jan 1
Event: 2nd International Conference on Innovative Technologies and Learning, ICITL 2019 - Tromsø, Norway
Duration: 2019 Dec 2 - 2019 Dec 5

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11937 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd International Conference on Innovative Technologies and Learning, ICITL 2019
Country: Norway
City: Tromsø
Period: 19/12/2 - 19/12/5

Keywords

  • BERT
  • fastText
  • Readability
  • Representation learning
  • Text classification

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Tseng, H. C., Chen, H. C., Chang, K. E., Sung, Y. T., & Chen, B. (2019). An Innovative BERT-Based Readability Model. In L. Rønningsbakk, T-T. Wu, F. E. Sandnes, & Y-M. Huang (Eds.), Innovative Technologies and Learning - 2nd International Conference, ICITL 2019, Proceedings (pp. 301-308). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11937 LNCS). Springer. https://doi.org/10.1007/978-3-030-35343-8_32
