FROM CORPUS TO GRAMMAR: AUTOMATIC EXTRACTION OF GRAMMATICAL RELATIONS FROM ANNOTATED CORPUS

黃 居仁 (Chu-Ren Huang), 洪 嘉馡 (Jia-Fei Hong), 馬 偉雲 (Wei-Yun Ma), 石 穆 (Petr Simon)

Research output: Contribution to journal (Article)

Abstract

Automatic extraction of grammatical knowledge from corpora has been one of the ultimate goals and challenges of corpus linguistics. In this paper we present one approach to this challenge in Chinese corpus linguistics: our recent work using the Sketch Engine (SkE, also known as Word Sketch Engine) platform to automatically extract grammatical relations from PoS-annotated Chinese corpora. The SkE approach requires both gigaword-size corpora and comprehensive lexico-grammatical information about the language in question. On the one hand, corpus size is crucial because automatic extraction of grammatical relations requires enough instances of each relation pair, which in turn requires an exponential jump from the million-word corpora that suffice for observing single lexical items. On the other hand, lexico-grammatical information is crucial for identifying potential relational pairs from local context, so the quality of the extraction depends on the quality of the available lexico-grammatical knowledge. We show that a comprehensive lexical grammar, based on Information-based Case Grammar (Chen & Huang 1990) and covering over 40 thousand verbs, greatly helps the accuracy and recall of grammatical relation detection. The paper concludes by underlining the importance of integrating existing grammatical information to meet the challenge of automatic extraction of grammatical knowledge from large corpora.
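
The idea of identifying candidate relation pairs from local context in a PoS-annotated corpus, and of needing enough instances of each pair, can be illustrated with a minimal sketch. This is not the Sketch Engine's own grammar formalism or the authors' implementation. The tag names ("V", "N", "ASP"), the function names, the look-ahead window, and the frequency threshold are all simplifying assumptions made for illustration, and the three-sentence corpus is a toy example.

```python
from collections import Counter


def extract_verb_object_pairs(sentences, max_gap=2):
    """Count candidate (verb, noun) pairs found in local context.

    A noun is paired with a verb when it follows the verb with at most
    `max_gap` intervening tokens and no other verb in between.
    Tags starting with "V" / "N" are a hypothetical, simplified tagset.
    """
    counts = Counter()
    for sentence in sentences:
        for i, (word, tag) in enumerate(sentence):
            if not tag.startswith("V"):
                continue
            # Look ahead over the verb's local right context.
            for next_word, next_tag in sentence[i + 1 : i + 2 + max_gap]:
                if next_tag.startswith("V"):
                    break  # another verb intervenes; stop searching
                if next_tag.startswith("N"):
                    counts[(word, next_word)] += 1
                    break  # pair the verb with the nearest noun only
    return counts


def frequent_pairs(counts, min_count=2):
    """Keep only pairs observed at least `min_count` times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy PoS-tagged corpus: each sentence is a list of (word, tag) tuples.
corpus = [
    [("我", "N"), ("吃", "V"), ("了", "ASP"), ("飯", "N")],
    [("他", "N"), ("吃", "V"), ("飯", "N")],
    [("她", "N"), ("讀", "V"), ("書", "N")],
]

pairs = extract_verb_object_pairs(corpus)
reliable = frequent_pairs(pairs, min_count=2)
```

With a frequency threshold in place, a pair seen only once is discarded, which hints at why pair-level extraction needs far more text than observing single words. A pattern that only inspects coarse tags will also over-generate, which is where richer lexico-grammatical information about each verb becomes useful.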
Original language: English
Pages (from-to): 192-221
Number of pages: 30
Journal: Journal of Chinese Linguistics Monograph Series
Issue number: 25
Publication status: Published - 2015

Keywords

  • Mandarin Chinese
  • Grammatical knowledge
  • Automatic extraction
  • Lexical grammar
  • Sketch Engine
  • 漢語
  • 語法知識
  • 自動抽取
  • 詞彙語法
  • 速描引擎
