TY - GEN
T1 - Modulation spectrum augmentation for robust speech recognition
AU - Yan, Bi Cheng
AU - Liu, Shih Hung
AU - Chen, Berlin
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/11/15
Y1 - 2019/11/15
N2 - Data augmentation is a crucial mechanism for increasing the diversity of training data in order to avoid overfitting and improve the robustness of statistical models in various applications. In the context of automatic speech recognition (ASR), a recent trend has been to develop effective methods to augment training speech data by warping or masking utterances based on their waveforms or spectrograms. Extending this line of research, we explore novel ways to generate augmented training speech data, in comparison to existing state-of-the-art approaches. The main contribution of this paper is at least twofold. First, we propose to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner. This intermediate representation can be embodied in different modulation domains by performing the discrete Fourier transform (DFT) along either the time axis or the component axis of a cepstral feature vector sequence. Second, we develop a two-stage augmentation approach, which successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences, to further enhance robustness. A series of experiments is carried out on the Aurora-4 database and task, in conjunction with a typical DNN-HMM based ASR system. The proposed augmentation method that conducts warping in the component-axis modulation domain of cepstral feature vector sequences can yield word error rate reductions (WERR) of 17.6% and 0.69% for the clean- and multi-condition training settings, respectively. In addition, the proposed two-stage augmentation method can at best achieve a WERR of 1.13% when using the multi-condition training setup.
AB - Data augmentation is a crucial mechanism for increasing the diversity of training data in order to avoid overfitting and improve the robustness of statistical models in various applications. In the context of automatic speech recognition (ASR), a recent trend has been to develop effective methods to augment training speech data by warping or masking utterances based on their waveforms or spectrograms. Extending this line of research, we explore novel ways to generate augmented training speech data, in comparison to existing state-of-the-art approaches. The main contribution of this paper is at least twofold. First, we propose to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner. This intermediate representation can be embodied in different modulation domains by performing the discrete Fourier transform (DFT) along either the time axis or the component axis of a cepstral feature vector sequence. Second, we develop a two-stage augmentation approach, which successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences, to further enhance robustness. A series of experiments is carried out on the Aurora-4 database and task, in conjunction with a typical DNN-HMM based ASR system. The proposed augmentation method that conducts warping in the component-axis modulation domain of cepstral feature vector sequences can yield word error rate reductions (WERR) of 17.6% and 0.69% for the clean- and multi-condition training settings, respectively. In addition, the proposed two-stage augmentation method can at best achieve a WERR of 1.13% when using the multi-condition training setup.
KW - Data augmentation
KW - Modulation spectra
KW - Robustness
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85123040597&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123040597&partnerID=8YFLogxK
U2 - 10.1145/3373477.3373695
DO - 10.1145/3373477.3373695
M3 - Conference contribution
AN - SCOPUS:85123040597
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the International Conference on Advanced Information Science and System, AISS 2019
PB - Association for Computing Machinery
T2 - 2019 International Conference on Advanced Information Science and System, AISS 2019
Y2 - 15 November 2019 through 17 November 2019
ER -