An Empirical Study on Transformer-Based End-to-End Speech Recognition with Novel Decoder Masking

Shi Yan Weng, Hsuan Sheng Chiu, Berlin Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The attention-based encoder-decoder modeling paradigm has achieved impressive success on a wide variety of speech and language processing tasks. For automatic speech recognition (ASR), this paradigm takes advantage of the innate ability of neural networks to learn a direct and streamlined mapping from an input sequence to an output sequence, without any prior knowledge such as audio-text alignments or pronunciation lexicons. An ASR model built on this paradigm, however, inevitably faces the issue of inadequate generalization, especially when it is not trained with huge amounts of speech data. In view of this, we propose in this paper a decoder-masking-based training approach for end-to-end (E2E) ASR models, taking inspiration from the celebrated speech input augmentation (viz. SpecAugment) and masked language modeling (viz. BERT). During the training phase, we randomly replace some portions of the decoder's historical input with the symbol [mask] to encourage the decoder to robustly output a correct token even when parts of its decoding history are masked. The proposed approach is instantiated with the top-of-the-line transformer-based E2E ASR model. Extensive experiments conducted on two benchmark datasets (viz. Librispeech960h and TedLium2) seem to demonstrate the efficacy of our approach in relation to some existing E2E ASR systems.
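
The decoder masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the masked portions are chosen, so the independent per-token masking, the `mask_prob` value, the choice to always keep the start-of-sequence token, and the names `mask_decoder_history` and `MASK_ID` are all assumptions made for illustration.

```python
import random

MASK_ID = 1  # hypothetical vocabulary id reserved for the [mask] symbol


def mask_decoder_history(tokens, mask_id=MASK_ID, mask_prob=0.15, rng=None):
    """Return a copy of the decoder input with some tokens replaced by mask_id.

    Assumptions (not taken from the paper): each position is masked
    independently with probability mask_prob, and position 0 (the
    start-of-sequence token) is never masked.
    """
    rng = rng or random.Random()
    masked = list(tokens)  # copy so the caller's sequence is left untouched
    for i in range(1, len(masked)):
        if rng.random() < mask_prob:
            masked[i] = mask_id
    return masked


# Teacher-forcing setup: the decoder input is the shifted target sequence.
# Only the input (history) is corrupted; the prediction targets stay intact,
# so the decoder must still predict the correct token at every step.
SOS, EOS = 0, 2                      # hypothetical special-token ids
transcript = [17, 42, 8, 99]         # hypothetical token ids of a transcript
decoder_input = [SOS] + transcript   # history fed to the decoder
targets = transcript + [EOS]         # tokens the decoder should output

masked_input = mask_decoder_history(
    decoder_input, mask_prob=0.3, rng=random.Random(0)
)
```

The masking would be applied only during training; at inference time the decoder sees its own unmasked history.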

Original language: English
Title of host publication: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 518-522
Number of pages: 5
ISBN (Electronic): 9789881476890
Publication status: Published - 2021
Event: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Tokyo, Japan
Duration: 2021 Dec 14 → 2021 Dec 17

Publication series

Name: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings

Conference

Conference: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Country/Territory: Japan
City: Tokyo
Period: 2021/12/14 → 2021/12/17

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Instrumentation
