Quality assurance of automatic annotation of very large corpora: A study based on heterogeneous tagging systems

Chu Ren Huang, Lung Hao Lee, Wei Guang Qu, Jia Fei Hong, Shiwen Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

We propose a set of heuristics for efficiently improving the annotation quality of very large corpora. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The resulting corpus-based mapping between POS tags can serve as a basis for contrasting the grammatical systems of the PRC and Taiwan, and as a basic model for mapping between the CKIP and ICL tagging systems for any data.

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 2725-2729
Number of pages: 5
ISBN (Electronic): 2951740840, 9782951740846
Publication status: Published - 2008
Externally published: Yes
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 2008 May 28 - 2008 May 30

Publication series

Name: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

Conference

Conference: 6th International Conference on Language Resources and Evaluation, LREC 2008
Country/Territory: Morocco
City: Marrakech
Period: 2008/05/28 - 2008/05/30

ASJC Scopus subject areas

  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
  • Education
