Speaker Conditional Sinc-Extractor for Personal VAD

En Lun Yu, Kuan Hsun Ho, Jeih Weih Hung, Shih Chieh Huang, Berlin Chen

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)

Abstract

This study explores a novel application of Sinc-convolution to Personal Voice Activity Detection (PVAD). The Sinc-Extractor (SE) network, developed for PVAD, learns the cutoff frequencies and band gains of sinc functions to extract acoustic features. Additionally, the speaker-conditional SE (SCSE) module incorporates speaker information from high-dimensional d-vectors into low-dimensional acoustic features. SE-PVAD and Vanilla PVAD have a similar model size and computing load, while SCSE-PVAD is more compact with shorter inference time, as it excludes the speaker embedding. Evaluated with concatenated utterances from the LibriSpeech corpus, SE-PVAD significantly outperforms Vanilla PVAD. SCSE-PVAD matches Vanilla PVAD's performance while reducing input feature dimensionality and network complexity. Thus, SCSE-PVAD can function like a typical VAD, accepting only acoustic features, which makes it suitable for low-resource wearable devices.
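The abstract's core idea, learning the cutoff frequencies and band gains of sinc filters rather than using fixed filterbanks, can be illustrated with a minimal NumPy sketch. This is not the paper's SE network; it only shows the standard sinc-convolution construction (as in SincNet-style layers), where each band-pass kernel is the difference of two windowed sinc low-pass filters, and the function names and parameter values here are illustrative.

```python
import numpy as np

def sinc_bandpass_kernels(low_hz, high_hz, gains, kernel_size=101, sample_rate=16000):
    """Build band-pass FIR kernels as differences of two windowed sinc
    low-pass filters. low_hz/high_hz play the role of the learnable cutoff
    frequencies and gains the per-band amplitudes (fixed here for clarity)."""
    # Symmetric time axis in seconds, centered at 0
    t = np.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sample_rate
    window = np.hamming(kernel_size)  # smooth truncation of the ideal sinc
    kernels = []
    for lo, hi, g in zip(low_hz, high_hz, gains):
        # An ideal low-pass with cutoff f (Hz) has impulse response 2f*sinc(2f*t);
        # subtracting two of them yields a band-pass between lo and hi.
        band = (2 * hi * np.sinc(2 * hi * t) - 2 * lo * np.sinc(2 * lo * t)) * window
        band = band / np.max(np.abs(band))  # peak-normalize before applying the gain
        kernels.append(g * band)
    return np.stack(kernels)

def sinc_extract(wave, kernels):
    """Convolve the waveform with each band kernel to get per-band features."""
    return np.stack([np.convolve(wave, k, mode="same") for k in kernels])
```

In a trainable layer, `low_hz`, `high_hz`, and `gains` would be parameters updated by backpropagation, so the front end learns which frequency bands matter for the PVAD task.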

Original language: English
Pages (from-to): 2115-2119
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
Publication status: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 2024 Sept 1 - 2024 Sept 5

Keywords

  • personalized voice activity detection
  • sinc-convolution
  • voice activity detection

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
