TY - GEN
T1 - Design and prototype of a large-scale and fully sense-tagged corpus
AU - Ker, Sue Jin
AU - Huang, Chu Ren
AU - Hong, Jia Fei
AU - Liu, Shi Yin
AU - Jian, Hui Ling
AU - Su, I. Li
AU - Hsieh, Shu Kai
PY - 2008
Y1 - 2008
N2 - Sense-tagged corpora play a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet the lack of such a corpus remains a critical deficiency at the current stage. This paper aims to design a large-scale, full-text Chinese sense-tagged corpus containing over 110,000 words. The Academia Sinica Balanced Corpus of Modern Chinese (also known as the Sinica Corpus) serves as the tagging target, and 56 full texts are extracted from it. Using N-gram statistics and collocation information, the preparatory work for automatic sense tagging is planned to combine machine learning techniques with a probabilistic model. To achieve high precision, the output of automatic sense tagging requires manual revision.
AB - Sense-tagged corpora play a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet the lack of such a corpus remains a critical deficiency at the current stage. This paper aims to design a large-scale, full-text Chinese sense-tagged corpus containing over 110,000 words. The Academia Sinica Balanced Corpus of Modern Chinese (also known as the Sinica Corpus) serves as the tagging target, and 56 full texts are extracted from it. Using N-gram statistics and collocation information, the preparatory work for automatic sense tagging is planned to combine machine learning techniques with a probabilistic model. To achieve high precision, the output of automatic sense tagging requires manual revision.
KW - Bootstrap Method
KW - Natural Language Processing
KW - Sense Tagged Corpus
KW - Word Sense Disambiguation
UR - http://www.scopus.com/inward/record.url?scp=40549101852&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40549101852&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-78159-2_18
DO - 10.1007/978-3-540-78159-2_18
M3 - Conference contribution
AN - SCOPUS:40549101852
SN - 3540781587
SN - 9783540781585
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 186
EP - 193
BT - Large-Scale Knowledge Resources
T2 - 3rd International Conference on Large-Scale Knowledge Resources, LKR 2008
Y2 - 3 March 2008 through 5 March 2008
ER -