Discriminative autoencoders for acoustic modeling

Ming Han Yang, Hung Shin Lee, Yu Ding Lu, Kuan Yu Chen, Yu Tsao, Berlin Chen, Hsin Min Wang

Research output: Contribution to journal › Conference article

3 Citations (Scopus)

Abstract

Speech data typically contain information irrelevant to automatic speech recognition (ASR), such as speaker variability and channel/environmental noise, lurking deep within acoustic features. This unwanted information is entangled with the phonetic content and hinders the development of an ASR system. In this paper, we propose a new framework based on autoencoders for acoustic modeling in ASR. Unlike other variants of autoencoder neural networks, our framework is able to isolate the phonetic components of a speech utterance by taking two kinds of objectives into consideration simultaneously. The first is the minimization of reconstruction errors, which helps the model learn the most salient and useful properties of the data. The second operates in the middlemost code layer, where the categorical distribution over context-dependent phone states is estimated for phoneme discrimination and the derivation of acoustic scores, the proximity relationships among utterances spoken by the same speaker are preserved, and the intra-utterance noise is modeled and abstracted away. We describe the implementation of the discriminative autoencoders for training tri-phone acoustic models and present TIMIT phone recognition results, which demonstrate that the proposed method outperforms the conventional DNN-based approach.
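The joint training objective described above can be sketched as a single loss combining reconstruction and phone-state discrimination. Below is a minimal illustration in PyTorch (an assumed framework; the paper does not specify one), in which the middlemost code layer is split into a phonetic part, trained to predict context-dependent phone states, and a residual part meant to absorb speaker and noise variability, while the decoder reconstructs the input from the full code. All layer sizes, the number of phone states, and the loss weighting are illustrative assumptions, and the speaker-proximity and intra-utterance-noise terms mentioned in the abstract are omitted here.

```python
# Minimal sketch only: a bottleneck split into a "phonetic" code (fed to a
# phone-state classifier) and a residual code, trained with a joint loss of
# reconstruction error + phone-state cross-entropy. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeAutoencoder(nn.Module):
    def __init__(self, feat_dim=440, phone_dim=512, resid_dim=128, n_states=2000):
        super().__init__()
        code_dim = phone_dim + resid_dim
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )
        # Classifier over context-dependent phone states, attached only to the
        # phonetic part of the code; its posteriors serve as acoustic scores.
        self.state_classifier = nn.Linear(phone_dim, n_states)
        self.phone_dim = phone_dim

    def forward(self, x):
        code = self.encoder(x)
        phonetic = code[:, :self.phone_dim]        # phonetic component of the code
        recon = self.decoder(code)                 # reconstruction from the full code
        logits = self.state_classifier(phonetic)   # phone-state scores
        return recon, logits

def joint_loss(model, feats, state_labels, alpha=1.0):
    """Reconstruction objective plus discriminative phone-state objective."""
    recon, logits = model(feats)
    return F.mse_loss(recon, feats) + alpha * F.cross_entropy(logits, state_labels)
```

At decoding time, only the encoder and the phone-state classifier would be needed to produce per-frame acoustic scores, analogous to a conventional DNN acoustic model.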

Original language: English
Pages (from-to): 3557-3561
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
ISSN: 2308-457X
DOI: 10.21437/Interspeech.2017-221
Publication status: Published - 2017 Jan 1
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 - 2017 Aug 24

Keywords

  • Acoustic modeling
  • Automatic speech recognition
  • Deep neural networks
  • Discriminative autoencoders

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Discriminative autoencoders for acoustic modeling. / Yang, Ming Han; Lee, Hung Shin; Lu, Yu Ding; Chen, Kuan Yu; Tsao, Yu; Chen, Berlin; Wang, Hsin Min.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2017-August, 2017, p. 3557-3561. DOI: 10.21437/Interspeech.2017-221
