Automatic thesaurus generation for Chinese documents

Yuen Hsien Tseng*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

51 Citations (Scopus)

Abstract

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is presented first. Experiments showed that, on average, 33% of the keywords the algorithm identified in each document were unknown to a lexicon of 123,226 terms. Only 8.3% of these unregistered words are illegal. Keywords extracted from each document are then filtered for term association analysis. Association weights larger than a threshold are accumulated over all documents to yield the final term-pair similarities. Compared with previous studies, this method drastically speeds up thesaurus generation while achieving a similar level of term relatedness.
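The accumulation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weight formula, so the Dice-style coefficient over sentence-level co-occurrence used here is an assumption, as are the function names `document_pair_weights` and `accumulate_similarities`, the `threshold` default, and the input layout (each document as a list of sentences, each sentence as a list of already-extracted keywords). It is not the author's implementation.

```python
from collections import defaultdict
from itertools import combinations


def document_pair_weights(sentences):
    """Return a weight for each keyword pair that co-occurs in one document.

    `sentences` is a list of keyword lists, one list per sentence.
    ASSUMPTION: the Dice coefficient stands in for the association
    weight, which the abstract does not define.
    """
    # Map each keyword to the set of sentence indices it appears in.
    occurrences = defaultdict(set)
    for index, keywords in enumerate(sentences):
        for keyword in keywords:
            occurrences[keyword].add(index)

    weights = {}
    for a, b in combinations(sorted(occurrences), 2):
        shared = len(occurrences[a] & occurrences[b])
        if shared:
            total = len(occurrences[a]) + len(occurrences[b])
            weights[(a, b)] = 2 * shared / total
    return weights


def accumulate_similarities(documents, threshold=0.5):
    """Sum per-document weights above `threshold` over all documents.

    ASSUMPTION: 0.5 is an arbitrary illustrative threshold.
    """
    similarities = defaultdict(float)
    for sentences in documents:
        for pair, weight in document_pair_weights(sentences).items():
            if weight > threshold:
                similarities[pair] += weight
    return dict(similarities)


if __name__ == "__main__":
    # Two toy documents with English placeholder keywords.
    documents = [
        [["thesaurus", "keyword"], ["thesaurus", "keyword"], ["lexicon"]],
        [["thesaurus", "keyword"], ["thesaurus"], ["lexicon", "keyword"]],
    ]
    for pair, score in sorted(accumulate_similarities(documents).items()):
        print(pair, round(score, 3))
```

Because weights are computed within a single document and only then summed, each document can be processed independently, which is consistent with the speed-up the abstract claims; the exact mechanism in the article may differ.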

Original language: English
Pages (from-to): 1130-1138
Number of pages: 9
Journal: Journal of the American Society for Information Science and Technology
Volume: 53
Issue number: 13
Publication status: Published - 2002 Nov
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence
