Development and validation of a corpus for machine humor comprehension

Yuen Hsien Tseng, Wun Syuan Wu, Chia Yueh Chang, Hsueh Chih Chen, Wei Lun Hsu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

This work developed a Chinese humor corpus containing 3,365 jokes collected from over 40 sources. Each joke was labeled by only one annotator with five levels of funniness, eight skill sets of humor, and six dimensions of intent. To validate the manual labels, we trained SVM (Support Vector Machine) and BERT (Bidirectional Encoder Representations from Transformers) models on half of the corpus (labeled by one annotator) to predict the skill and intent labels of the other half (labeled by the other annotator). Based on two assumptions that a valid manually labeled corpus should satisfy, our results supported the validity of the skill and intent labels. For the funniness label, the validation results showed only a marginal correlation between the corpus labels and user feedback ratings, which implies that funniness level is a harder annotation problem. The contribution of this work is twofold: 1) a Chinese humor corpus labeled with humor skills, intents, and funniness, which allows machines to learn more intricate humor framing, effect, and amusement level in order to predict and respond in proper context (https://github.com/SamTseng/Chinese_Humor_MultiLabeled); and 2) an approach for verifying whether a minimally human-labeled corpus is valid, which facilitates the validation of low-resource corpora.

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 1346-1352
Number of pages: 7
ISBN (electronic): 9791095546344
Publication status: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 2020 May 11 to 2020 May 16

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 2020/05/11 to 2020/05/16

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
