TY - GEN
T1 - Exploring the Integration of E2E ASR and Pronunciation Modeling for English Mispronunciation Detection
AU - Wang, Hsin Wei
AU - Yan, Bi Cheng
AU - Hsu, Yung Chang
AU - Chen, Berlin
N1 - Publisher Copyright:
© 2021 ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing. All rights reserved.
PY - 2021
Y1 - 2021
N2 - There has been increasing demand to develop effective computer-assisted pronunciation training (CAPT) systems, which can provide feedback on mispronunciations and help second-language (L2) learners improve their speaking proficiency through repeated practice. Due to the shortage of non-native speech for training the automatic speech recognition (ASR) module of a CAPT system, the corresponding mispronunciation detection performance is often affected by imperfect ASR. In view of this, we put forward a two-stage mispronunciation detection method in this paper. In the first stage, the speech uttered by an L2 learner is processed by an end-to-end ASR module to produce N-best phone sequence hypotheses. In the second stage, these hypotheses are fed into a pronunciation model that seeks to faithfully predict the phone sequence most likely pronounced by the learner, so as to improve the performance of mispronunciation detection. Empirical experiments conducted on an English benchmark dataset seem to confirm the utility of our method.
AB - There has been increasing demand to develop effective computer-assisted pronunciation training (CAPT) systems, which can provide feedback on mispronunciations and help second-language (L2) learners improve their speaking proficiency through repeated practice. Due to the shortage of non-native speech for training the automatic speech recognition (ASR) module of a CAPT system, the corresponding mispronunciation detection performance is often affected by imperfect ASR. In view of this, we put forward a two-stage mispronunciation detection method in this paper. In the first stage, the speech uttered by an L2 learner is processed by an end-to-end ASR module to produce N-best phone sequence hypotheses. In the second stage, these hypotheses are fed into a pronunciation model that seeks to faithfully predict the phone sequence most likely pronounced by the learner, so as to improve the performance of mispronunciation detection. Empirical experiments conducted on an English benchmark dataset seem to confirm the utility of our method.
KW - End-to-End Speech Recognition
KW - Mispronunciation Detection and Diagnosis
KW - N-best Rescoring
UR - http://www.scopus.com/inward/record.url?scp=85127434474&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127434474&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85127434474
T3 - ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing
SP - 124
EP - 131
BT - ROCLING 2021 - Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing
A2 - Lee, Lung-Hao
A2 - Chang, Chia-Hui
A2 - Chen, Kuan-Yu
PB - The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
T2 - 33rd Conference on Computational Linguistics and Speech Processing, ROCLING 2021
Y2 - 15 October 2021 through 16 October 2021
ER -