TY - GEN
T1 - Investigating Low-Distortion Speech Enhancement with Discrete Cosine Transform Features for Robust Speech Recognition
AU - Tsao, Yu Sheng
AU - Hung, Jeih Weih
AU - Ho, Kuan Hsun
AU - Chen, Berlin
N1 - Publisher Copyright:
© 2022 Asia-Pacific Signal and Information Processing Association (APSIPA).
PY - 2022
Y1 - 2022
N2 - This study investigates constructing low-distortion utterances with a front-end speech enhancement (SE) network to benefit downstream automatic speech recognition (ASR) systems. With the dual-path Transformer network (DPTNet) as the SE archetype, we make effective use of short-time discrete cosine transform (STDCT) features as the input to the mask-estimation network. Furthermore, we jointly optimize a spectral-distance loss and a perceptual loss when training the components of the proposed SE model, so as to enhance the input utterances without introducing significant distortion. Extensive evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, which contain stationary and non-stationary noises, respectively. The results show that the proposed SE method yields competitive perceptual metric scores on SE while achieving significantly lower word error rates (WERs) on ASR relative to several state-of-the-art methods. Notably, the proposed SE method performs remarkably well on the VoiceBank-QUT ASR task, confirming its strong generalization to unseen scenarios.
AB - This study investigates constructing low-distortion utterances with a front-end speech enhancement (SE) network to benefit downstream automatic speech recognition (ASR) systems. With the dual-path Transformer network (DPTNet) as the SE archetype, we make effective use of short-time discrete cosine transform (STDCT) features as the input to the mask-estimation network. Furthermore, we jointly optimize a spectral-distance loss and a perceptual loss when training the components of the proposed SE model, so as to enhance the input utterances without introducing significant distortion. Extensive evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, which contain stationary and non-stationary noises, respectively. The results show that the proposed SE method yields competitive perceptual metric scores on SE while achieving significantly lower word error rates (WERs) on ASR relative to several state-of-the-art methods. Notably, the proposed SE method performs remarkably well on the VoiceBank-QUT ASR task, confirming its strong generalization to unseen scenarios.
UR - http://www.scopus.com/inward/record.url?scp=85146306910&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85146306910&partnerID=8YFLogxK
U2 - 10.23919/APSIPAASC55919.2022.9980038
DO - 10.23919/APSIPAASC55919.2022.9980038
M3 - Conference contribution
AN - SCOPUS:85146306910
T3 - Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
SP - 131
EP - 136
BT - Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
Y2 - 7 November 2022 through 10 November 2022
ER -