Text summarization using a trainable summarizer and latent semantic analysis

Jen Yuan Yeh, Hao-Ren Ke, Wei Pang Yang, I. Heng Meng

Research output: Contribution to journal › Article

157 Citations (Scopus)

Abstract

This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA+T.R.M.). The first is a trainable summarizer that generates summaries from several features: position, positive keyword, negative keyword, centrality, and resemblance to the title. It exploits two new ideas: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses the semantic sentence representation to construct a semantic text relationship map (T.R.M.). We evaluate LSA+T.R.M. both on single documents and at the corpus level to investigate the competence of LSA in text summarization. Both approaches were evaluated at several compression rates on a corpus of 100 political articles. At a compression rate of 30%, the average F-measures were 49% for MCBA, 52% for MCBA+GA, and 44% and 40% for LSA+T.R.M. at the single-document and corpus levels, respectively.
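As a rough illustration of the trainable-summarizer idea, the sketch below scores each sentence as a weighted sum of the five features named in the abstract, keeps the top-scoring sentences for a given compression rate, and computes an F-measure against a reference extract. The abstract does not give the actual score function, so the linear form, the negative weight on the negative-keyword feature, the reading of compression rate as the fraction of sentences kept, and all feature values and weights are assumptions made for illustration only. A genetic algorithm as mentioned in the abstract would search over `weights` to maximize the F-measure on training documents; that search is not shown here.

```python
# Minimal sketch, NOT the paper's implementation: feature names follow the
# abstract, but the linear score, the weights, and the data are hypothetical.

def score(features, weights):
    """Weighted sum of one sentence's feature values (assumed linear form)."""
    return sum(weights[name] * value for name, value in features.items())

def summarize(sentence_features, weights, compression_rate):
    """Return indices of the top-scoring sentences, in document order.

    compression_rate is assumed to be the fraction of sentences kept.
    """
    n = len(sentence_features)
    k = max(1, round(n * compression_rate))
    ranked = sorted(range(n),
                    key=lambda i: score(sentence_features[i], weights),
                    reverse=True)
    return sorted(ranked[:k])

def f_measure(selected, reference):
    """Harmonic mean of precision and recall over selected sentence indices."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Hypothetical feature values for a five-sentence document.
names = ["position", "positive_keyword", "negative_keyword",
         "centrality", "title_resemblance"]
rows = [
    [1.0, 0.6, 0.0, 0.5, 0.8],
    [0.8, 0.1, 0.4, 0.2, 0.1],
    [0.6, 0.7, 0.0, 0.6, 0.5],
    [0.4, 0.2, 0.1, 0.3, 0.2],
    [0.2, 0.0, 0.5, 0.1, 0.0],
]
sentences = [dict(zip(names, row)) for row in rows]

# Illustrative weights; negative keywords are assumed to lower the score.
weights = {"position": 1.0, "positive_keyword": 1.0, "negative_keyword": -1.0,
           "centrality": 1.0, "title_resemblance": 1.0}

summary = summarize(sentences, weights, compression_rate=0.4)
reference = [0, 3]  # hypothetical human-selected sentences
print(summary, f_measure(summary, reference))
```

With these made-up numbers, sentences 0 and 2 score highest, so the summary shares one of its two sentences with the reference extract and the F-measure is 0.5.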

Original language: English
Pages (from-to): 75-95
Number of pages: 21
Journal: Information Processing and Management
Volume: 41
Issue number: 1
DOI: 10.1016/j.ipm.2004.04.003
Publication status: Published - 2005 Jan 1

Fingerprint

Semantics
Genetic algorithms
Latent semantic analysis
Text summarization
Key words
Compression

Keywords

  • Corpus-based approach
  • Latent semantic analysis
  • Text relationship map
  • Text summarization

ASJC Scopus subject areas

  • Information Systems
  • Media Technology
  • Computer Science Applications
  • Management Science and Operations Research
  • Library and Information Sciences

Cite this

Text summarization using a trainable summarizer and latent semantic analysis. / Yeh, Jen Yuan; Ke, Hao-Ren; Yang, Wei Pang; Meng, I. Heng.

In: Information Processing and Management, Vol. 41, No. 1, 01.01.2005, p. 75-95.

Yeh, Jen Yuan ; Ke, Hao-Ren ; Yang, Wei Pang ; Meng, I. Heng. / Text summarization using a trainable summarizer and latent semantic analysis. In: Information Processing and Management. 2005 ; Vol. 41, No. 1. pp. 75-95.
@article{791b83d3275047019f1e92cecf605719,
title = "Text summarization using a trainable summarizer and latent semantic analysis",
abstract = "This paper proposes two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. approach (LSA+T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significances of different sentence positions, and (2) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA+T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30{\%}, an average f-measure of 49{\%} for MCBA, 52{\%} for MCBA+GA, 44{\%} and 40{\%} for LSA+T.R.M. in single-document and corpus level were achieved respectively.",
keywords = "Corpus-based approach, Latent semantic analysis, Text relationship map, Text summarization",
author = "Yeh, {Jen Yuan} and Ke, Hao-Ren and Yang, {Wei Pang} and Meng, {I. Heng}",
year = "2005",
month = "1",
day = "1",
doi = "10.1016/j.ipm.2004.04.003",
language = "English",
volume = "41",
pages = "75--95",
journal = "Information Processing and Management",
issn = "0306-4573",
publisher = "Elsevier Limited",
number = "1",
}

TY - JOUR

T1 - Text summarization using a trainable summarizer and latent semantic analysis

AU - Yeh, Jen Yuan

AU - Ke, Hao-Ren

AU - Yang, Wei Pang

AU - Meng, I. Heng

PY - 2005/1/1

Y1 - 2005/1/1

AB - This paper proposes two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. approach (LSA+T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significances of different sentence positions, and (2) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA+T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30%, an average f-measure of 49% for MCBA, 52% for MCBA+GA, 44% and 40% for LSA+T.R.M. in single-document and corpus level were achieved respectively.

KW - Corpus-based approach

KW - Latent semantic analysis

KW - Text relationship map

KW - Text summarization

UR - http://www.scopus.com/inward/record.url?scp=4744366943&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4744366943&partnerID=8YFLogxK

U2 - 10.1016/j.ipm.2004.04.003

DO - 10.1016/j.ipm.2004.04.003

M3 - Article

AN - SCOPUS:4744366943

VL - 41

SP - 75

EP - 95

JO - Information Processing and Management

JF - Information Processing and Management

SN - 0306-4573

IS - 1

ER -