Verifying a Chinese collection for text categorization

Yuen Hsien Tseng*, William John Teahan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method. Experiments showed that effectiveness was not affected by the confusing documents.
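
The abstract does not spell out how the retrieval-based check works, so the following is only a minimal sketch of one way such a check might look, not the authors' method. It assumes character-bigram term vectors (which avoid the need for a Chinese word segmenter), cosine similarity as the retrieval score, and an arbitrary threshold of 0.8. The names `char_bigrams`, `cosine`, and `find_suspects` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams, ignoring whitespace (assumed features)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_suspects(docs, threshold=0.8):
    """Treat each document as a query against all later documents.

    docs: list of (doc_id, label, text) tuples.
    Returns (id_a, id_b, similarity, same_label) for every pair whose
    similarity reaches the threshold. A pair with same_label == False is a
    candidate label inconsistency; the threshold is an assumed value.
    """
    vectors = [(doc_id, label, char_bigrams(text)) for doc_id, label, text in docs]
    suspects = []
    for i, (id_a, label_a, vec_a) in enumerate(vectors):
        for id_b, label_b, vec_b in vectors[i + 1:]:
            sim = cosine(vec_a, vec_b)
            if sim >= threshold:
                suspects.append((id_a, id_b, round(sim, 3), label_a == label_b))
    return suspects
```

The pairwise loop is quadratic in the number of documents; a collection of realistic size would more plausibly use an inverted index to retrieve only the top-ranked candidates for each query document.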

Original language: English
Title of host publication: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery (ACM)
Pages: 556-557
Number of pages: 2
ISBN (Print): 1581138814, 9781581138818
Publication status: Published - 2004
Externally published: Yes
Event: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Sheffield, United Kingdom
Duration: 2004 Jul 25 to 2004 Jul 29

Country/Territory: United Kingdom
City: Sheffield
Period: 2004/07/25 to 2004/07/29

Keywords

  • Chinese collection
  • Consistency verification
  • Duplicate detection

ASJC Scopus subject areas

  • General Engineering
