TY - GEN
T1 - Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
AU - Wang, Hsuan Yu
AU - Lee, Pei Ying
AU - Chen, Berlin
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
KW - Automatic Speech Recognition
KW - Speaker Diarization
KW - Speech Emotion Recognition
UR - https://www.scopus.com/pages/publications/105018061186
U2 - 10.1109/IALP68296.2024.11156415
DO - 10.1109/IALP68296.2024.11156415
M3 - Conference contribution
AN - SCOPUS:105018061186
T3 - Proceedings of 2025 International Conference on Asian Language Processing, IALP 2025
SP - 85
EP - 90
BT - Proceedings of 2025 International Conference on Asian Language Processing, IALP 2025
A2 - Wang, Lei
A2 - Tong, Rong
A2 - Juan, Sarah Flora Samson
A2 - Lu, Yanfeng
A2 - Tan, Ping Ping
A2 - Saee, Suhaila
A2 - Dong, Minghui
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th International Conference on Asian Language Processing, IALP 2025
Y2 - 4 August 2025 through 6 August 2025
ER -