Exploring Branchformer-Based End-to-End Speaker Diarization with Speaker-Wise VAD Loss

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Speaker diarization involves partitioning an audio stream into segments according to the identity of the speaker. The end-to-end neural diarization model with encoder-decoder based attractors (EEND-EDA) can handle overlapping speech and has shown promising performance compared to traditional methods. However, EEND-EDA fails to identify the number of speakers accurately. To address this limitation, we first replace the Transformer encoder in EEND-EDA with the Branchformer encoder. Additionally, we introduce a speaker-wise VAD loss (SAD Loss) to the self-attention mechanism of the Branchformer encoder, thereby improving the model's ability to distinguish different speakers. Extensive experimental results on the Mini-Librispeech dataset and the simulated Sim2spk dataset suggest that our approach outperforms existing strong baselines by a substantial margin, achieving an improvement of more than 15% in Diarization Error Rate (DER). We will release the source code on GitHub for future research.
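
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of a frame-level DER computation. It is an illustration only, not the authors' evaluation code: the function name `frame_level_der`, the list-of-lists input format, and the brute-force search over speaker permutations are all assumptions made here, and both inputs are assumed to have the same number of frames and speaker channels. Standard scoring tools typically work on time-stamped segments and may apply a forgiveness collar, which this sketch omits.

```python
from itertools import permutations


def frame_level_der(ref, hyp):
    """Illustrative frame-level DER (hypothetical helper, not from the paper).

    ref, hyp: lists of frames; each frame is a list of 0/1 speaker activities.
    Both are assumed to have the same length and the same number of speakers.
    """
    n_spk = len(ref[0])
    # Denominator: total number of active reference speaker-frames.
    total_ref = sum(sum(frame) for frame in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    best_errors = None
    # Speaker labels are arbitrary, so score every mapping of hypothesis
    # channels to reference channels and keep the best one.
    for perm in permutations(range(n_spk)):
        errors = 0
        for r, h in zip(ref, hyp):
            h_perm = [h[p] for p in perm]
            n_ref, n_hyp = sum(r), sum(h_perm)
            n_correct = sum(a & b for a, b in zip(r, h_perm))
            miss = max(0, n_ref - n_hyp)          # missed speech
            false_alarm = max(0, n_hyp - n_ref)   # false-alarm speech
            confusion = min(n_ref, n_hyp) - n_correct  # wrong speaker
            errors += miss + false_alarm + confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors

    return best_errors / total_ref


# Toy example: two speakers, four frames, including one overlapped frame.
reference = [[1, 0], [1, 0], [0, 1], [1, 1]]
hypothesis = [[0, 1], [0, 1], [1, 0], [1, 1]]  # same activity, labels swapped
print(frame_level_der(reference, hypothesis))
```

Because the hypothesis in the toy example differs from the reference only by a swap of speaker labels, the permutation search maps it back and no errors are counted. The overlapped last frame shows why the metric is accumulated per speaker and per frame, which matters for overlap-aware models such as EEND-EDA.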

Original language: English
Title of host publication: 2024 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024 - Proceedings
Editors: Ming-Hsiang Su, Jui-Feng Yeh, Yuan-Fu Liao, Chi-Chun Lee, Yu Taso
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331506032
DOIs
Publication status: Published - 2024
Event: 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024 - Hsinchu, Taiwan
Duration: 2024 Oct 17 → 2024 Oct 19

Publication series

Name: 2024 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024 - Proceedings

Conference

Conference: 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024
Country/Territory: Taiwan
City: Hsinchu
Period: 2024/10/17 → 2024/10/19

Keywords

  • auxiliary loss
  • branchformer
  • end-to-end neural diarization
  • multi-head attention
  • speaker diarization

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Library and Information Sciences
  • Language and Linguistics
