Discriminative autoencoders for acoustic modeling

Ming Han Yang, Hung Shin Lee, Yu Ding Lu, Kuan Yu Chen, Yu Tsao, Berlin Chen, Hsin Min Wang

Research output: Contribution to journal › Conference article › peer-review

5 Citations (Scopus)


Speech data typically contain information irrelevant to automatic speech recognition (ASR), such as speaker variability and channel/environmental noise, lurking deep within acoustic features. This unwanted information is entangled with the phonetic content and hinders the development of an ASR system. In this paper, we propose a new framework based on autoencoders for acoustic modeling in ASR. Unlike other variants of autoencoder neural networks, our framework is able to isolate phonetic components from a speech utterance by simultaneously taking two kinds of objectives into consideration. The first relates to the minimization of reconstruction errors and helps learn the most salient and useful properties of the data. The second functions in the middlemost code layer, where the categorical distribution of the context-dependent phone states is estimated for phoneme discrimination and the derivation of acoustic scores, the proximity relationship among utterances spoken by the same speaker is preserved, and the intra-utterance noise is modeled and abstracted away. We describe the implementation of the discriminative autoencoders for training tri-phone acoustic models and present TIMIT phone recognition results, which demonstrate that our proposed method outperforms the conventional DNN-based approach.
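The dual-objective idea in the abstract can be illustrated with a minimal forward-pass sketch: an encoder maps an acoustic feature frame to a code layer, from which one branch reconstructs the input (objective 1) and another estimates a categorical distribution over phone states (objective 2), with the two losses combined in a weighted sum. All dimensions, weights, and the mixing weight `alpha` below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): feature vector size,
# middlemost code layer size, and number of context-dependent phone states.
FEAT_DIM, CODE_DIM, NUM_STATES = 40, 32, 10

# Randomly initialized weights for a one-layer encoder/decoder sketch.
W_enc = rng.normal(scale=0.1, size=(CODE_DIM, FEAT_DIM))
W_dec = rng.normal(scale=0.1, size=(FEAT_DIM, CODE_DIM))
W_cls = rng.normal(scale=0.1, size=(NUM_STATES, CODE_DIM))

def forward(x):
    """Encode one feature frame, then decode it and classify its phone state."""
    code = np.tanh(W_enc @ x)            # middlemost code layer
    recon = W_dec @ code                 # reconstruction branch (objective 1)
    logits = W_cls @ code                # phone-state branch (objective 2)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over phone states
    return code, recon, probs

def joint_loss(x, state, alpha=0.5):
    """Weighted sum of reconstruction error and phone-state cross-entropy."""
    _, recon, probs = forward(x)
    mse = np.mean((recon - x) ** 2)
    xent = -np.log(probs[state] + 1e-12)
    return alpha * mse + (1 - alpha) * xent

x = rng.normal(size=FEAT_DIM)            # one frame of acoustic features
loss = joint_loss(x, state=3)
print(float(loss))
```

In training, minimizing this joint loss by gradient descent would push the code layer to retain what is needed for reconstruction while shaping it for phone-state discrimination; the paper's additional speaker-proximity and noise-modeling terms are omitted here for brevity.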

Pages (from-to): 3557-3561
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2017
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 - 24 Aug 2017

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

