TY - GEN
T1 - An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling
AU - Yan, Bi Cheng
AU - Wu, Meng Che
AU - Hung, Hsiao Tsung
AU - Chen, Berlin
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT). Most of the existing MDD approaches focus on dealing with categorical errors (viz. one canonical phone is substituted by another one, aside from those mispronunciations caused by deletions or insertions). However, accurate detection and diagnosis of non-categorial or distortion errors (viz. approximating L2 phones with L1 (first-language) phones, or erroneous pronunciations in between) still seems out of reach. In view of this, we propose to conduct MDD with a novel end-to-end automatic speech recognition (E2E-based ASR) approach. In particular, we expand the original L2 phone set with their corresponding anti-phone set, making the E2E-based MDD approach have a better capability to take in both categorical and non-categorial mispronunciations, aiming to provide better mispronunciation detection and diagnosis feedback. Furthermore, a novel transfer-learning paradigm is devised to obtain the initial model estimate of the E2E-based MDD system without resource to any phonological rules. Extensive sets of experimental results on the L2-ARCTIC dataset show that our best system can outperform the existing E2E baseline system and pronunciation scoring based method (GOP) in terms of the F1-score, by 11.05% and 27.71%, respectively.
AB - Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT). Most of the existing MDD approaches focus on dealing with categorical errors (viz. one canonical phone is substituted by another one, aside from those mispronunciations caused by deletions or insertions). However, accurate detection and diagnosis of non-categorial or distortion errors (viz. approximating L2 phones with L1 (first-language) phones, or erroneous pronunciations in between) still seems out of reach. In view of this, we propose to conduct MDD with a novel end-to-end automatic speech recognition (E2E-based ASR) approach. In particular, we expand the original L2 phone set with their corresponding anti-phone set, making the E2E-based MDD approach have a better capability to take in both categorical and non-categorial mispronunciations, aiming to provide better mispronunciation detection and diagnosis feedback. Furthermore, a novel transfer-learning paradigm is devised to obtain the initial model estimate of the E2E-based MDD system without resource to any phonological rules. Extensive sets of experimental results on the L2-ARCTIC dataset show that our best system can outperform the existing E2E baseline system and pronunciation scoring based method (GOP) in terms of the F1-score, by 11.05% and 27.71%, respectively.
KW - Anti-phone model
KW - Computer-assisted pronunciation training (CAPT)
KW - End-to-end ASR
KW - Mispronunciation detection and diagnosis (MDD)
UR - http://www.scopus.com/inward/record.url?scp=85098200883&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098200883&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1616
DO - 10.21437/Interspeech.2020-1616
M3 - Conference contribution
AN - SCOPUS:85098200883
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3032
EP - 3036
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -