TY - JOUR
T1 - Text summarization using a trainable summarizer and latent semantic analysis
AU - Yeh, Jen Yuan
AU - Ke, Hao Ren
AU - Yang, Wei Pang
AU - Meng, I. Heng
N1 - Funding Information:
The research was supported by the Software Technology for Advanced Network Application project of Institute for Information Industry and sponsored by MOEA, ROC.
Copyright:
Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2005/1
Y1 - 2005/1
N2 - This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA+T.R.M.). The first is a trainable summarizer that takes into account several features, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA+T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were evaluated at several compression rates on a corpus of 100 political articles. At a compression rate of 30%, the average F-measure was 49% for MCBA, 52% for MCBA+GA, and 44% and 40% for LSA+T.R.M. at the single-document and corpus levels, respectively.
AB - This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA+T.R.M.). The first is a trainable summarizer that takes into account several features, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA+T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were evaluated at several compression rates on a corpus of 100 political articles. At a compression rate of 30%, the average F-measure was 49% for MCBA, 52% for MCBA+GA, and 44% and 40% for LSA+T.R.M. at the single-document and corpus levels, respectively.
KW - Corpus-based approach
KW - Latent semantic analysis
KW - Text relationship map
KW - Text summarization
UR - http://www.scopus.com/inward/record.url?scp=4744366943&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=4744366943&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2004.04.003
DO - 10.1016/j.ipm.2004.04.003
M3 - Article
AN - SCOPUS:4744366943
VL - 41
SP - 75
EP - 95
JO - Information Processing and Management
JF - Information Processing and Management
SN - 0306-4573
IS - 1
ER -