An Effective Contextualized Automatic Speech Recognition Approach Leveraging Self-Supervised Phoneme Features

Li Ting Pai*, Yi Cheng Wang, Bi Cheng Yan, Hsin Wei Wang, Jia Liang Lu, Chi Han Lin, Juan Wei Xu, Berlin Chen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Years of research on end-to-end automatic speech recognition (E2E ASR) have produced systems that now perform robustly in everyday applications such as voice assistants and transcription services. However, E2E ASR struggles to recognize domain-specific phrases, such as keywords or named entities. To address this, contextualized ASR (CASR) has been developed to improve keyword recognition accuracy by incorporating specific contextual information, represented by a keyword list, into the ASR model. Despite their effectiveness, CASR systems still fall short in distinguishing between similar-sounding keywords and in generalizing to uncommon keyword pronunciations. To overcome these obstacles, previous studies have focused primarily on enriching keyword representations by integrating keyword phoneme features, derived from a simple sequence encoder, with keyword grapheme features. However, such phoneme representations are insufficient, because human pronunciation varies across contexts, involving phenomena such as linking and other pronunciation variations. In this paper, we argue that integrating more fine-grained phoneme features is instrumental to accurate keyword recognition in CASR. To this end, we propose leveraging a self-supervised learning (SSL) phoneme encoder to provide more subtle phonemic details of keywords, effectively addressing these variations and alleviating the phonetic confusion between keywords. A series of experiments conducted on the SlideSpeech benchmark dataset demonstrates the effectiveness of our approach in alleviating keyword phonemic confusion and enhancing out-of-domain keyword recognition.

Original language: English
Title of host publication: APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350367331
Publication status: Published - 2024
Event: 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024 - Macau, China
Duration: 2024 Dec 3 → 2024 Dec 6

Publication series

Name: APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024

Conference

Conference: 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024
Country/Territory: China
City: Macau
Period: 2024/12/03 → 2024/12/06

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Hardware and Architecture
  • Signal Processing
