Verifying a Chinese collection for text categorization

Yuen Hsien Tseng, William John Teahan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method. Experiments showed that effectiveness was not affected by the confusing documents.
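The abstract does not spell out how the retrieval-based check works, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character-bigram term vectors (which sidestep Chinese word segmentation), cosine similarity, and a hypothetical threshold of 0.8; the function names and the `(doc_id, label, text)` layout are illustrative. Each document is compared against every other one, which stands in for using it as a query; a real system would more likely use an inverted index.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bigrams(text):
    """Character-bigram counts; avoids needing a Chinese word segmenter."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(n * v[g] for g, n in u.items() if g in v)
    norm = sqrt(sum(n * n for n in u.values())) * sqrt(sum(n * n for n in v.values()))
    return dot / norm if norm else 0.0


def find_suspects(docs, threshold=0.8):
    """docs: list of (doc_id, label, text). Returns flagged pairs.

    A pair scoring at or above the (assumed) threshold is reported as a
    "duplicate" when the labels agree and "label-inconsistent" otherwise.
    """
    vectors = [(doc_id, label, bigrams(text)) for doc_id, label, text in docs]
    flagged = []
    for (id_a, lab_a, vec_a), (id_b, lab_b, vec_b) in combinations(vectors, 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            kind = "duplicate" if lab_a == lab_b else "label-inconsistent"
            flagged.append((id_a, id_b, kind, round(score, 3)))
    return flagged


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    sample = [
        ("d1", "sports", "中国队赢得了比赛"),
        ("d2", "sports", "中国队赢得了比赛"),
        ("d3", "politics", "中国队赢得了比赛。"),
        ("d4", "finance", "股市今天大幅上涨"),
    ]
    for pair in find_suspects(sample):
        print(pair)
```

On the toy input, d1 and d2 are flagged as duplicates, while d3, which differs only by a trailing full stop but carries a different label, is flagged as label-inconsistent against both. The threshold trades recall against false alarms and would need tuning on a real collection.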

Original language: English
Title of host publication: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: K. Jarvelin, J. Allen, P. Bruza, M. Sanderson
Pages: 556-557
Number of pages: 2
Publication status: Published - 2004 Nov 25
Event: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Sheffield, United Kingdom
Duration: 2004 Jul 25 to 2004 Jul 29

Publication series

Name: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Other

Other: Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Country: United Kingdom
City: Sheffield
Period: 04/7/25 to 04/7/29

Fingerprint

Labels
Experiments

Keywords

  • Chinese collection
  • Consistency verification
  • Duplicate detection

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Tseng, Y. H., & Teahan, W. J. (2004). Verifying a Chinese collection for text categorization. In K. Jarvelin, J. Allen, P. Bruza, & M. Sanderson (Eds.), Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 556-557). (Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval).

@inproceedings{1b767b070ce549718832e675871dd719,
title = "Verifying a Chinese collection for text categorization",
abstract = "This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method. Experiments showed that effectiveness was not affected by the confusing documents.",
keywords = "Chinese collection, Consistency verification, Duplicate detection",
author = "Tseng, {Yuen Hsien} and Teahan, {William John}",
year = "2004",
month = "11",
day = "25",
language = "English",
isbn = "1581138814",
series = "Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval",
pages = "556--557",
editor = "K. Jarvelin and J. Allen and P. Bruza and M. Sanderson",
booktitle = "Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

TY - GEN

T1 - Verifying a Chinese collection for text categorization

AU - Tseng, Yuen Hsien

AU - Teahan, William John

PY - 2004/11/25

Y1 - 2004/11/25

N2 - This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method. Experiments showed that effectiveness was not affected by the confusing documents.

KW - Chinese collection

KW - Consistency verification

KW - Duplicate detection

UR - http://www.scopus.com/inward/record.url?scp=8644246878&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=8644246878&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:8644246878

SN - 1581138814

T3 - Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

SP - 556

EP - 557

BT - Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

A2 - Jarvelin, K.

A2 - Allen, J.

A2 - Bruza, P.

A2 - Sanderson, M.

ER -