Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
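The abstract describes synchronizing ASR transcript timestamps with speaker diarization output to produce speaker-labeled segments, but does not spell out the alignment procedure. A minimal sketch of one common approach (assigning each ASR segment to the diarization turn with maximum temporal overlap) might look like the following; all function and variable names here are illustrative assumptions, not the paper's actual implementation:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the temporal overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def align_asr_to_diarization(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose diarization turn
    overlaps it the most; segments with no overlapping turn get None.

    asr_segments:  list of (start, end, text) tuples from the ASR model
    speaker_turns: list of (start, end, speaker_id) tuples from diarization
    Returns:       list of (start, end, text, speaker_id) tuples
    """
    labeled = []
    for start, end, text in asr_segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in speaker_turns:
            ov = overlap(start, end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((start, end, text, best_speaker))
    return labeled

# Illustrative example: two ASR segments, two diarization turns.
asr = [(0.0, 1.5, "hello"), (1.6, 3.0, "hi there")]
turns = [(0.0, 1.4, "A"), (1.4, 3.2, "B")]
print(align_asr_to_diarization(asr, turns))
```

The maximum-overlap rule is a simple heuristic; actual systems may split segments at speaker-change boundaries or align at the word level instead.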

Original language: English
Title of host publication: Proceedings of 2025 International Conference on Asian Language Processing, IALP 2025
Editors: Lei Wang, Rong Tong, Sarah Flora Samson Juan, Yanfeng Lu, Ping Ping Tan, Suhaila Saee, Minghui Dong
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 85-90
Number of pages: 6
ISBN (Electronic): 9798331589790
Publication status: Published - 2025
Event: 29th International Conference on Asian Language Processing, IALP 2025 - Sarawak, Malaysia
Duration: 2025 Aug 4 – 2025 Aug 6

Publication series

Name: Proceedings of 2025 International Conference on Asian Language Processing, IALP 2025

Conference

Conference: 29th International Conference on Asian Language Processing, IALP 2025
Country/Territory: Malaysia
City: Sarawak
Period: 2025/08/04 – 2025/08/06

Keywords

  • Automatic Speech Recognition
  • Speaker Diarization
  • Speech Emotion Recognition

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Signal Processing

