TY - JOUR
T1 - Speaker Conditional Sinc-Extractor for Personal VAD
AU - Yu, En Lun
AU - Ho, Kuan Hsun
AU - Hung, Jeih Weih
AU - Huang, Shih Chieh
AU - Chen, Berlin
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - This study explores Sinc-convolution's novel application in Personal Voice Activity Detection (PVAD). The Sinc-Extractor (SE) network, developed for PVAD, learns cutoff frequencies and band gains of sinc functions to extract acoustic features. Additionally, the speaker conditional SE (SCSE) module incorporates speaker information from high-dimensional d-vectors into low-dimensional acoustic features. SE-PVAD and Vanilla PVAD have similar model size and computing load, while SCSE-PVAD is more compact with shorter inference time as it excludes speaker embedding. Evaluated with concatenated utterances from the LibriSpeech corpus, SE-PVAD outperforms Vanilla PVAD significantly. SCSE-PVAD matches Vanilla PVAD's performance but reduces input feature dimensionality and network complexity. Thus, SCSE-PVAD can function like a typical VAD, accepting only acoustic features, making it suitable for low-resource wearable devices.
AB - This study explores Sinc-convolution's novel application in Personal Voice Activity Detection (PVAD). The Sinc-Extractor (SE) network, developed for PVAD, learns cutoff frequencies and band gains of sinc functions to extract acoustic features. Additionally, the speaker conditional SE (SCSE) module incorporates speaker information from high-dimensional d-vectors into low-dimensional acoustic features. SE-PVAD and Vanilla PVAD have similar model size and computing load, while SCSE-PVAD is more compact with shorter inference time as it excludes speaker embedding. Evaluated with concatenated utterances from the LibriSpeech corpus, SE-PVAD outperforms Vanilla PVAD significantly. SCSE-PVAD matches Vanilla PVAD's performance but reduces input feature dimensionality and network complexity. Thus, SCSE-PVAD can function like a typical VAD, accepting only acoustic features, making it suitable for low-resource wearable devices.
KW - personalized voice activity detection
KW - sinc-convolution
KW - voice activity detection
UR - https://www.scopus.com/pages/publications/85214846707
UR - https://www.scopus.com/pages/publications/85214846707#tab=citedBy
U2 - 10.21437/Interspeech.2024-365
DO - 10.21437/Interspeech.2024-365
M3 - Conference article
AN - SCOPUS:85214846707
SN - 2308-457X
SP - 2115
EP - 2119
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -