摘要
Automatic extraction of grammatical knowledge from corpora has been one of the ultimate goals and challenges of corpus linguistics. We present in this paper one of the approaches to this challenge in Chinese corpus linguistics by introducing our recent work using the Sketch Engine (SkE, also known as Word Sketch Engine) platform to automatically extract grammatical relations from PoS-annotated Chinese corpora. The SkE approach requires both giga-word size corpora and comprehensive lexico-grammatical information of the language in question. On the one hand, corpus size is crucial as the automatic extraction of grammatical relations requires enough instances of the relation pairs, which in turn require an exponential jump from the million-word size corpus for observation of single lexical items. On the other hand, lexico-grammatical information is crucial to the identification of potential relational pairs based on local context. The quality of such extraction is dependent on the quality of available lexico-grammatical knowledge. We show that a comprehensive lexical grammar, based on Information-based Case Grammar (Chen & Huang 1990) and covering over 40 thousand verbs greatly help the accuracy and recall of grammatical relation detection. The paper concludes by underlining the importance of integrating existing grammatical information to meet the challenge of automatic extraction of grammatical knowledge from large corpora.
原文 | 英語 |
---|---|
頁(從 - 到) | 192-221 |
頁數 | 30 |
期刊 | Journal of Chinese Linguistics Monograph Series |
發行號 | 25 |
出版狀態 | 已發佈 - 2015 |