Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning

Yao-Ting Sung, Ju Ling Chen, Ji Her Cha, Hou Chiang Tseng, Tao Hsing Chang, Kuo-En Chang

Research output: Contribution to journal (article)

12 Citations (Scopus)

Abstract

Multilevel linguistic features have been proposed for discourse analysis, but there have been few applications of multilevel linguistic features to readability models and also few validations of such models. Most traditional readability formulae are based on generalized linear models (GLMs; e.g., discriminant analysis and multiple regression), but these models have to comply with certain statistical assumptions about data properties and include all of the data in formulae construction without pruning the outliers in advance. The use of such readability formulae tends to produce a low text classification accuracy, while using a support vector machine (SVM) in machine learning can enhance the classification outcome. The present study constructed readability models by integrating multilevel linguistic features with SVM, which is more appropriate for text classification. Taking the Chinese language as an example, this study developed 31 linguistic features as the predicting variables at the word, semantic, syntax, and cohesion levels, with grade levels of texts as the criterion variable. The study compared four types of readability models by integrating unilevel and multilevel linguistic features with GLMs and an SVM. The results indicate that adopting a multilevel approach in readability analysis provides a better representation of the complexities of both texts and the reading comprehension process.
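
The comparison described in the abstract (unilevel versus multilevel features, each paired with a GLM or an SVM) can be laid out as a 2 x 2 design. The sketch below only enumerates that design and shows how per-level features might be flattened into one input vector. It is not the authors' implementation: the feature names and values are hypothetical (the abstract does not list the 31 features), and using the word level as the unilevel set is an assumption made for illustration.

```python
from itertools import product

# The four linguistic levels named in the abstract.
LEVELS = ("word", "semantic", "syntax", "cohesion")

# Hypothetical feature values for one text, grouped by level.
# These names are illustrative only and are not taken from the paper.
text_features = {
    "word": {"char_count": 120.0, "difficult_word_ratio": 0.08},
    "semantic": {"content_word_count": 45.0},
    "syntax": {"mean_sentence_length": 14.5},
    "cohesion": {"pronoun_count": 6.0, "conjunction_count": 4.0},
}


def feature_vector(features, levels):
    """Flatten the features of the selected levels into one ordered list.

    Levels are taken in the order given; within a level, features are
    ordered by name so the vector layout is stable across texts.
    """
    return [
        value
        for level in levels
        for _, value in sorted(features[level].items())
    ]


# Assumption: "unilevel" is represented here by the word level alone;
# "multilevel" uses all four levels.
feature_sets = {"unilevel": ("word",), "multilevel": LEVELS}
model_families = ("GLM", "SVM")

# Two feature sets x two model families = four model types.
designs = [
    {
        "features": name,
        "model": family,
        "n_inputs": len(feature_vector(text_features, feature_sets[name])),
    }
    for name, family in product(feature_sets, model_families)
]

for design in designs:
    print(design)
```

In a real study each design would be fitted on texts labelled with grade levels (the criterion variable) and the four models compared on classification accuracy; that fitting step is omitted here.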

Original language: English
Pages (from-to): 340-354
Number of pages: 15
Journal: Behavior Research Methods
Volume: 47
Issue number: 2
DOI: 10.3758/s13428-014-0459-x
Publication status: Published - 2015 Jun 1

Fingerprint

Linguistics
Discriminant Analysis
Machine Learning
Linguistic Features
Readability
Semantics
Reading
Linear Models
Language
Support Vector Machine

Keywords

  • Linguistic features
  • Multilevel
  • Readability
  • Support vector machine
  • Validity

ASJC Scopus subject areas

  • Experimental and Cognitive Psychology
  • Developmental and Educational Psychology
  • Arts and Humanities (miscellaneous)
  • Psychology (miscellaneous)
  • Psychology (all)

Cite this

Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning. / Sung, Yao-Ting; Chen, Ju Ling; Cha, Ji Her; Tseng, Hou Chiang; Chang, Tao Hsing; Chang, Kuo-En.

In: Behavior Research Methods, Vol. 47, No. 2, 01.06.2015, p. 340-354.

@article{628c2ab50f7640bf94e944a4894746e7,
title = "Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning",
abstract = "Multilevel linguistic features have been proposed for discourse analysis, but there have been few applications of multilevel linguistic features to readability models and also few validations of such models. Most traditional readability formulae are based on generalized linear models (GLMs; e.g., discriminant analysis and multiple regression), but these models have to comply with certain statistical assumptions about data properties and include all of the data in formulae construction without pruning the outliers in advance. The use of such readability formulae tends to produce a low text classification accuracy, while using a support vector machine (SVM) in machine learning can enhance the classification outcome. The present study constructed readability models by integrating multilevel linguistic features with SVM, which is more appropriate for text classification. Taking the Chinese language as an example, this study developed 31 linguistic features as the predicting variables at the word, semantic, syntax, and cohesion levels, with grade levels of texts as the criterion variable. The study compared four types of readability models by integrating unilevel and multilevel linguistic features with GLMs and an SVM. The results indicate that adopting a multilevel approach in readability analysis provides a better representation of the complexities of both texts and the reading comprehension process.",
keywords = "Linguistic features, Multilevel, Readability, Support vector machine, Validity",
author = "Sung, Yao-Ting and Chen, {Ju Ling} and Cha, {Ji Her} and Tseng, {Hou Chiang} and Chang, {Tao Hsing} and Chang, Kuo-En",
year = "2015",
month = "6",
day = "1",
doi = "10.3758/s13428-014-0459-x",
language = "English",
volume = "47",
pages = "340--354",
journal = "Behavior Research Methods",
issn = "1069-9384",
publisher = "Springer New York",
number = "2",
}

TY - JOUR
T1 - Constructing and validating readability models
T2 - the method of integrating multilevel linguistic features with machine learning
AU - Sung, Yao-Ting
AU - Chen, Ju Ling
AU - Cha, Ji Her
AU - Tseng, Hou Chiang
AU - Chang, Tao Hsing
AU - Chang, Kuo-En
PY - 2015/6/1
Y1 - 2015/6/1

AB - Multilevel linguistic features have been proposed for discourse analysis, but there have been few applications of multilevel linguistic features to readability models and also few validations of such models. Most traditional readability formulae are based on generalized linear models (GLMs; e.g., discriminant analysis and multiple regression), but these models have to comply with certain statistical assumptions about data properties and include all of the data in formulae construction without pruning the outliers in advance. The use of such readability formulae tends to produce a low text classification accuracy, while using a support vector machine (SVM) in machine learning can enhance the classification outcome. The present study constructed readability models by integrating multilevel linguistic features with SVM, which is more appropriate for text classification. Taking the Chinese language as an example, this study developed 31 linguistic features as the predicting variables at the word, semantic, syntax, and cohesion levels, with grade levels of texts as the criterion variable. The study compared four types of readability models by integrating unilevel and multilevel linguistic features with GLMs and an SVM. The results indicate that adopting a multilevel approach in readability analysis provides a better representation of the complexities of both texts and the reading comprehension process.

KW - Linguistic features
KW - Multilevel
KW - Readability
KW - Support vector machine
KW - Validity
UR - http://www.scopus.com/inward/record.url?scp=84897120854&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84897120854&partnerID=8YFLogxK
U2 - 10.3758/s13428-014-0459-x
DO - 10.3758/s13428-014-0459-x
M3 - Article
C2 - 24687843
AN - SCOPUS:84897120854
VL - 47
SP - 340
EP - 354
JO - Behavior Research Methods
JF - Behavior Research Methods
SN - 1069-9384
IS - 2
ER -