TY - JOUR
T1 - A Comparative Experimental Study on Simple Features and Lightweight Models for Voice Activity Detection in Noisy Environments
AU - Su, Bo Yu
AU - Chen, Berlin
AU - Huang, Shih Chieh
AU - Hung, Jeih Weih
N1 - Publisher Copyright: © 2026 by the authors.
PY - 2026/1
Y1 - 2026/1
N2 - This work presents a comparative study of voice activity detection in noise using simple acoustic features and relatively compact recurrent models within a controlled MATLAB-based framework. For each utterance, 9 baseline spectral-plus-periodicity features, MFCCs, and FBANKs are extracted and passed to several lightweight BiLSTM-based networks, either alone or preceded by a 1D CNN layer. The main experiments are carried out at a fixed SNR to separate the influence of the network structure and the feature type, and an additional series with four SNR levels is used to assess whether the same performance trends hold when the SNR varies. The results show that adding a compact CNN front-end before the BiLSTM consistently improves detection scores, that MFCCs generally outperform the baseline spectral–periodicity features and often give better recall/F1 than FBANKs for the considered lightweight models, and that (Formula presented.) +BiLSTM with 13-dimensional MFCCs offers a favorable trade-off between accuracy, robustness across SNRs, and model size. Because all conditions share a single MATLAB implementation with fixed noise types, SNR values, and evaluation metrics, this work is positioned as a benchmark and practical guideline publication for noise-robust, resource-constrained VAD, rather than as a proposal of a completely new deep-learning architecture.
AB - This work presents a comparative study of voice activity detection in noise using simple acoustic features and relatively compact recurrent models within a controlled MATLAB-based framework. For each utterance, 9 baseline spectral-plus-periodicity features, MFCCs, and FBANKs are extracted and passed to several lightweight BiLSTM-based networks, either alone or preceded by a 1D CNN layer. The main experiments are carried out at a fixed SNR to separate the influence of the network structure and the feature type, and an additional series with four SNR levels is used to assess whether the same performance trends hold when the SNR varies. The results show that adding a compact CNN front-end before the BiLSTM consistently improves detection scores, that MFCCs generally outperform the baseline spectral–periodicity features and often give better recall/F1 than FBANKs for the considered lightweight models, and that (Formula presented.) +BiLSTM with 13-dimensional MFCCs offers a favorable trade-off between accuracy, robustness across SNRs, and model size. Because all conditions share a single MATLAB implementation with fixed noise types, SNR values, and evaluation metrics, this work is positioned as a benchmark and practical guideline publication for noise-robust, resource-constrained VAD, rather than as a proposal of a completely new deep-learning architecture.
KW - convolutional neural network
KW - noise robustness
KW - speech enhancement
KW - voice activity detection
UR - https://www.scopus.com/pages/publications/105029085518
UR - https://www.scopus.com/pages/publications/105029085518#tab=citedBy
U2 - 10.3390/electronics15020263
DO - 10.3390/electronics15020263
M3 - Article
AN - SCOPUS:105029085518
SN - 2079-9292
VL - 15
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 2
M1 - 263
ER -