Abstract
In this paper, we propose Flexible Dynamic Encoder RNN (FDE-RNN), a model that can switch between voice activity detection (VAD) and personalized voice activity detection (PVAD) without incurring redundant resource consumption. In static PVAD modeling, performing VAD typically requires either merging categories or omitting speaker embeddings, which often leaves a model that is far too large to be practical for VAD tasks. In contrast, FDE-RNN adapts by removing the personalization module when functioning as a VAD, significantly reducing resource demands. On PVAD tasks, FDE-RNN uses dynamic neural networks with a gating-based skipping mechanism that lets it bypass redundant computations during non-speech segments, further reducing computational cost. Extensive experiments demonstrate that FDE-RNN outperforms prior approaches on both PVAD and VAD tasks in terms of overall performance. Notably, when functioning as a VAD, FDE-RNN uses only 30% of the parameters required by competitive models, underscoring its efficiency and scalability.
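The two ideas in the abstract, dropping the personalization module in VAD mode and skipping computation on non-speech frames, can be illustrated with a toy sketch. This is not the FDE-RNN architecture: the energy-based gate, the cosine-similarity "personalization" step, and every name and threshold below (`detect`, `gate_threshold`, `speaker_threshold`) are assumptions chosen only to make the control flow concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def frame_energy(frame):
    """Mean squared value of a frame; a hypothetical stand-in for a learned gate."""
    return sum(x * x for x in frame) / len(frame)


def detect(frames, enrollment=None, gate_threshold=0.1, speaker_threshold=0.7):
    """Label each frame and count how often the costly speaker step runs.

    enrollment=None means plain VAD: the personalization step is never called.
    With an enrollment vector (PVAD), frames rejected by the gate skip that step.
    """
    labels = []
    personalization_calls = 0
    for frame in frames:
        # Gate: frames judged non-speech bypass all further computation.
        if frame_energy(frame) < gate_threshold:
            labels.append("non-speech")
            continue
        # VAD mode: no personalization module at all.
        if enrollment is None:
            labels.append("speech")
            continue
        # PVAD mode: the speaker comparison runs only on gated-in frames.
        personalization_calls += 1
        if cosine(frame, enrollment) >= speaker_threshold:
            labels.append("target-speech")
        else:
            labels.append("non-target-speech")
    return labels, personalization_calls


frames = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.01, 0.02]]
pvad_labels, pvad_calls = detect(frames, enrollment=[1.0, 0.1])
vad_labels, vad_calls = detect(frames)
```

In this toy run, the speaker comparison executes for two of the four frames in PVAD mode and for none in VAD mode, which mirrors the kind of saving the abstract describes without reproducing the reported figures.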
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 5793-5797 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Publication status | Published - 2025 |
| Event | 26th Interspeech Conference 2025 - Rotterdam, Netherlands; Duration: 2025 Aug 17 → 2025 Aug 21 |
Keywords
- Dynamic Neural Networks
- Personalized Voice Activity Detection
- Voice Activity Detection
ASJC Scopus subject areas
- Software
- Signal Processing
- Language and Linguistics
- Modelling and Simulation
- Human-Computer Interaction