Abstract
This study explores a novel application of Sinc-convolution to Personal Voice Activity Detection (PVAD). The Sinc-Extractor (SE) network, developed for PVAD, learns the cutoff frequencies and band gains of sinc functions to extract acoustic features. Additionally, the speaker-conditional SE (SCSE) module incorporates speaker information from high-dimensional d-vectors into low-dimensional acoustic features. SE-PVAD and Vanilla PVAD have similar model sizes and computational loads, while SCSE-PVAD is more compact and has a shorter inference time because it excludes the speaker embedding from its input. Evaluated on concatenated utterances from the LibriSpeech corpus, SE-PVAD significantly outperforms Vanilla PVAD. SCSE-PVAD matches Vanilla PVAD's performance while reducing the input feature dimensionality and network complexity. Thus, SCSE-PVAD can operate like a conventional VAD, accepting only acoustic features, which makes it suitable for low-resource wearable devices.
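To illustrate the sinc-convolution idea the abstract refers to (band-pass filters parameterized by learnable cutoff frequencies, in the style of SincNet, here with an added per-band gain to mirror the SE description), the following is a minimal PyTorch sketch. The layer name, initialization, kernel size, and gain parameterization are illustrative assumptions, not the authors' SE network.

```python
import torch
import torch.nn as nn


class SincBandpassConv(nn.Module):
    """Sketch of a sinc-convolution layer: each output channel is a
    band-pass filter defined only by learnable low/high cutoff
    frequencies plus a learnable band gain (illustrative, not the
    paper's Sinc-Extractor)."""

    def __init__(self, out_channels=40, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Spread initial cutoffs across the spectrum (assumed initialization).
        low = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        self.low_hz = nn.Parameter(low)                         # learnable low cutoffs
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))  # learnable bandwidths
        self.gain = nn.Parameter(torch.ones(out_channels))      # learnable band gains
        # Symmetric time axis (in samples) and Hamming window as fixed buffers.
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):  # x: (batch, 1, time)
        # Keep cutoffs in a valid frequency range.
        low = torch.abs(self.low_hz).clamp(min=1.0)
        high = (low + torch.abs(self.band_hz)).clamp(max=self.sample_rate / 2 - 1.0)
        # Band-pass impulse response = difference of two sinc low-pass filters.
        t = self.n / self.sample_rate                 # (kernel_size,)
        f_low = low.unsqueeze(1)                      # (out_channels, 1)
        f_high = high.unsqueeze(1)
        hp = 2 * f_high * torch.sinc(2 * f_high * t)  # torch.sinc(x) = sin(pi x)/(pi x)
        lp = 2 * f_low * torch.sinc(2 * f_low * t)
        filters = (hp - lp) * self.window             # windowed band-pass kernels
        filters = filters * self.gain.unsqueeze(1)    # apply learnable band gains
        filters = filters.unsqueeze(1)                # (out_channels, 1, kernel_size)
        return nn.functional.conv1d(x, filters, padding=self.kernel_size // 2)


# Usage: one second of 16 kHz audio -> 40-channel learned filterbank output.
wave = torch.randn(2, 1, 16000)
feats = SincBandpassConv()(wave)
print(feats.shape)  # torch.Size([2, 40, 16000])
```

Because each filter is fully determined by two cutoff frequencies and a gain, the front end has far fewer parameters than a free 1-D convolution of the same kernel size, which is consistent with the abstract's emphasis on compactness for low-resource devices.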
| Original language | English |
|---|---|
| Pages (from-to) | 2115-2119 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| Publication status | Published - 2024 |
| Event | 25th Interspeech Conference 2024 - Kos Island, Greece Duration: 2024 Sept 1 → 2024 Sept 5 |
Keywords
- personalized voice activity detection
- sinc-convolution
- voice activity detection
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation