Automatic thesaurus generation for Chinese documents

Research output: Contribution to journal > Article

43 Citations (Scopus)

Abstract

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.
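
The accumulation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the abstract does not give the association weight formula, so a Dice-style coefficient over sentence-level co-occurrence is assumed here, and the function names, the input layout (each document as a list of per-sentence keyword lists), and the threshold value are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def doc_pair_weights(sentences):
    """Association weights for keyword pairs within a single document.

    `sentences` is a list of keyword lists, one list per sentence.
    The Dice-style formula is an assumption made for illustration;
    the abstract does not specify the actual weighting.
    """
    term_freq = Counter()   # number of sentences containing each keyword
    pair_freq = Counter()   # number of sentences containing both keywords
    for keywords in sentences:
        unique = sorted(set(keywords))
        term_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        pair: 2 * n / (term_freq[pair[0]] + term_freq[pair[1]])
        for pair, n in pair_freq.items()
    }


def accumulate_similarities(documents, threshold=0.5):
    """Sum per-document weights larger than `threshold` over all documents."""
    totals = defaultdict(float)
    for sentences in documents:
        for pair, weight in doc_pair_weights(sentences).items():
            if weight > threshold:
                totals[pair] += weight
    return dict(totals)


if __name__ == "__main__":
    # Two toy documents; keywords are assumed to be already extracted.
    docs = [
        [["thesaurus", "keyword"], ["thesaurus", "keyword", "lexicon"]],
        [["thesaurus", "keyword"], ["lexicon"]],
    ]
    print(accumulate_similarities(docs, threshold=0.7))
    # prints {('keyword', 'thesaurus'): 2.0}
```

Because weights are computed one document at a time and only those above the threshold are kept, no corpus-wide co-occurrence matrix is needed in this sketch, which is one plausible reading of why such a scheme could be fast.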

Original language: English
Pages (from-to): 1130-1138
Number of pages: 9
Journal: Journal of the American Society for Information Science and Technology
Volume: 53
Issue number: 13
DOI: 10.1002/asi.10146
Publication status: Published - 2002 Nov 1

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

Automatic thesaurus generation for Chinese documents. / Tseng, Yuen-Hsien.

In: Journal of the American Society for Information Science and Technology, Vol. 53, No. 13, 01.11.2002, p. 1130-1138.

@article{79c0d7d33f4d46908d1b77ec85658f52,
title = "Automatic thesaurus generation for Chinese documents",
abstract = "This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33{\%} keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3{\%} are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.",
author = "Yuen-Hsien Tseng",
year = "2002",
month = "11",
day = "1",
doi = "10.1002/asi.10146",
language = "English",
volume = "53",
pages = "1130--1138",
journal = "Journal of the American Society for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "13",
}

TY  - JOUR
T1  - Automatic thesaurus generation for Chinese documents
AU  - Tseng, Yuen-Hsien
PY  - 2002/11/1
Y1  - 2002/11/1
AB  - This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.
UR  - http://www.scopus.com/inward/record.url?scp=0036852821&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0036852821&partnerID=8YFLogxK
U2  - 10.1002/asi.10146
DO  - 10.1002/asi.10146
M3  - Article
VL  - 53
SP  - 1130
EP  - 1138
JO  - Journal of the American Society for Information Science and Technology
JF  - Journal of the American Society for Information Science and Technology
SN  - 2330-1635
IS  - 13
ER  - 