TY - GEN
T1 - MatDC
T2 - 25th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2020
AU - Tseng, Yu Hsiang
AU - Hsieh, Shu Kai
AU - Lian, Richard
AU - Chiang, Chiung Yu
AU - Chang, Yu Lin
AU - Chang, Li Ping
AU - Hsieh, Ji Lung
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/12
Y1 - 2020/12
N2 - Modern conversational agents require high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced Chinese dialogue dataset with full semantic annotations. The dataset features linguistic variations for given user intents and fully annotated semantic slots. The MatDC dataset was completely human-edited, and its curation comprises two stages. First, in the template design stage, domain editors construct schemas and compose ten dialogues between agents and users based on the back-end database. Second, in the dialogue rewrite stage, rewriters generate sentential variations for each template, under the constraint that the normalized slot values are kept unchanged. The underlying methodology of MatDC is open to extension and adaptable to different domains. To demonstrate the applicability of the dataset, we build a dialogue agent with a conventional pipeline architecture. We expect the MatDC dataset to provide additional training data and a testing ground for dialogue agent studies.
KW - conversational agent
KW - language resources
KW - semantic annotations
KW - task-oriented dialogue dataset
UR - http://www.scopus.com/inward/record.url?scp=85103812885&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103812885&partnerID=8YFLogxK
U2 - 10.1109/TAAI51410.2020.00038
DO - 10.1109/TAAI51410.2020.00038
M3 - Conference contribution
AN - SCOPUS:85103812885
T3 - Proceedings - 25th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2020
SP - 165
EP - 170
BT - Proceedings - 25th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 December 2020 through 5 December 2020
ER -