Design and prototype of a large-scale and fully sense-tagged corpus

Sue Jin Ker*, Chu Ren Huang, Jia Fei Hong, Shi Yin Liu, Hui Ling Jian, I. Li Su, Shu Kai Hsieh

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Sense-tagged corpora play a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet the lack of such a corpus remains a critical deficiency at the current stage. This paper aims to design a large-scale Chinese full-text sense-tagged corpus containing over 110,000 words. The Academia Sinica Balanced Corpus of Modern Chinese (also called the Sinica Corpus) serves as the tagging target, and 56 full texts are extracted from it. Preparation for automatic sense tagging draws on N-gram statistics and collocation information, combining machine learning techniques with a probability model. To achieve high precision, the output of automatic sense tagging is then revised manually.

Original language: English
Title of host publication: Large-Scale Knowledge Resources
Subtitle of host publication: Construction and Application - Third International Conference on Large-Scale Knowledge Resources, LKR 2008, Proceedings
Pages: 186-193
Number of pages: 8
Publication status: Published - 2008
Externally published: Yes
Event: 3rd International Conference on Large-Scale Knowledge Resources, LKR 2008 - Tokyo, Japan
Duration: 2008 Mar 3 → 2008 Mar 5

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4938 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd International Conference on Large-Scale Knowledge Resources, LKR 2008
Country/Territory: Japan
City: Tokyo
Period: 2008/03/03 → 2008/03/05

Keywords

  • Bootstrap Method
  • Natural Language Processing
  • Sense Tagged Corpus
  • Word Sense Disambiguation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
