Fast co-occurrence thesaurus construction for Chinese news

Research output: Contribution to journal › Conference article

3 Citations (Scopus)

Abstract

This paper reports an approach to automatic thesaurus construction for Chinese news articles. An effective Chinese word segmentation and keyword extraction algorithm is first presented. For each document, an average of 33% keywords unknown to a lexicon of 123,226 terms can be identified. The extraction error rate is 3.6%. Keywords extracted from each document are then further filtered for term association analysis by a modified Dice coefficient formula. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method not only speeds up the thesaurus generation process drastically, but also achieves a similar percentage level of term relatedness.
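The association step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the modified Dice formula, so the standard Dice coefficient is used in its place; the sentence is assumed to be the co-occurrence unit; and the function names, the input layout, and the threshold value of 0.5 are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dice(pair_count, count_a, count_b):
    """Standard Dice coefficient (a stand-in for the paper's modified formula)."""
    total = count_a + count_b
    return 2.0 * pair_count / total if total else 0.0


def build_thesaurus(documents, threshold=0.5):
    """Accumulate per-document association weights into term-pair similarities.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of already-extracted keywords (assumed layout).
    """
    similarities = defaultdict(float)
    for sentences in documents:
        term_count = defaultdict(int)  # sentences in this document containing a term
        pair_count = defaultdict(int)  # sentences in this document containing both terms
        for sentence in sentences:
            terms = sorted(set(sentence))  # sorted so each pair has one canonical order
            for term in terms:
                term_count[term] += 1
            for a, b in combinations(terms, 2):
                pair_count[(a, b)] += 1
        # Only weights above the threshold contribute to the final similarity.
        for (a, b), n in pair_count.items():
            weight = dice(n, term_count[a], term_count[b])
            if weight > threshold:
                similarities[(a, b)] += weight
    return dict(similarities)
```

Because weights are computed within one document at a time and only then summed, the sketch never has to build a corpus-wide co-occurrence matrix, which is one plausible reading of why a per-document approach could be fast.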

Original language: English
Pages (from-to): 853-858
Number of pages: 6
Journal: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics
Volume: 2
Publication status: Published - 2001 Dec 1
Event: 2001 IEEE International Conference on Systems, Man and Cybernetics - Tucson, AZ, United States
Duration: 2001 Oct 7 to 2001 Oct 10

Keywords

  • Chinese
  • Co-occurrence analysis
  • Co-occurrence thesaurus
  • Unknown word identification
  • Word segmentation

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Hardware and Architecture

Cite this

Fast co-occurrence thesaurus construction for Chinese news. / Tseng, Yuen-Hsien.

In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Vol. 2, 01.12.2001, p. 853-858.

@article{238e377f6f9146598f1b600b8ca94554,
title = "Fast co-occurrence thesaurus construction for Chinese news",
abstract = "This paper reports an approach to automatic thesaurus construction for Chinese news articles. An effective Chinese word segmentation and keyword extraction algorithm is first presented. For each document, an average of 33{\%} keywords unknown to a lexicon of 123,226 terms can be identified. The extraction error rate is 3.6{\%}. Keywords extracted from each document are then further filtered for term association analysis by a modified Dice coefficient formula. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method not only speeds up the thesaurus generation process drastically, but also achieves a similar percentage level of term relatedness.",
keywords = "Chinese, Co-occurrence analysis, Co-occurrence thesaurus, Unknown word identification, Word segmentation",
author = "Yuen-Hsien Tseng",
year = "2001",
month = "12",
day = "1",
language = "English",
volume = "2",
pages = "853--858",
journal = "Proceedings of the IEEE International Conference on Systems, Man and Cybernetics",
issn = "0884-3627",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR

T1 - Fast co-occurrence thesaurus construction for Chinese news

AU - Tseng, Yuen-Hsien

PY - 2001/12/1

Y1 - 2001/12/1

N2 - This paper reports an approach to automatic thesaurus construction for Chinese news articles. An effective Chinese word segmentation and keyword extraction algorithm is first presented. For each document, an average of 33% keywords unknown to a lexicon of 123,226 terms can be identified. The extraction error rate is 3.6%. Keywords extracted from each document are then further filtered for term association analysis by a modified Dice coefficient formula. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method not only speeds up the thesaurus generation process drastically, but also achieves a similar percentage level of term relatedness.

AB - This paper reports an approach to automatic thesaurus construction for Chinese news articles. An effective Chinese word segmentation and keyword extraction algorithm is first presented. For each document, an average of 33% keywords unknown to a lexicon of 123,226 terms can be identified. The extraction error rate is 3.6%. Keywords extracted from each document are then further filtered for term association analysis by a modified Dice coefficient formula. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method not only speeds up the thesaurus generation process drastically, but also achieves a similar percentage level of term relatedness.

KW - Chinese

KW - Co-occurrence analysis

KW - Co-occurrence thesaurus

KW - Unknown word identification

KW - Word segmentation

UR - http://www.scopus.com/inward/record.url?scp=0035723482&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035723482&partnerID=8YFLogxK

M3 - Conference article

VL - 2

SP - 853

EP - 858

JO - Proceedings of the IEEE International Conference on Systems, Man and Cybernetics

JF - Proceedings of the IEEE International Conference on Systems, Man and Cybernetics

SN - 0884-3627

ER -