Employing median filtering to enhance the complex-valued acoustic spectrograms in modulation domain for noise-robust speech recognition

Hsin Ju Hsieh, Berlin Chen, Jeih Weih Hung

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In this paper, we propose to apply median filtering (MF) in the modulation domain of the complex-valued acoustic spectrogram in order to alleviate the effect of noise on speech signals and thereby improve noise robustness. Median filtering is well known for its excellent capability of removing speckle noise from data while preserving the embedded sharp contrasts. When median filtering is applied to the temporal modulation spectrum, i.e., the Fourier transform of either the real or the imaginary acoustic spectrogram along the time axis, we find that the mismatch caused by noise can be significantly reduced, and the resulting speech features are more noise-robust and provide better recognition accuracy than the original unprocessed features. In particular, the proposed method possesses three explicit merits. First, the median filtering operation substantially alleviates the outliers in the modulation spectrum that are probably caused by noise. Second, because the real and imaginary acoustic spectrograms are processed individually, the proposed method avoids the knotty speech-noise cross-term problem that conventional acoustic spectral enhancement methods usually encounter when noise reduction is unavoidable. Third, the median filtering process is completely unsupervised and requires no prior information about the clean speech or the noise. All evaluation experiments are conducted on two databases: the connected-digit Aurora-2 database and the medium-vocabulary Aurora-4 database. The recognition results demonstrate that the proposed MF-based method achieves performance competitive with, or better than, that of many state-of-the-art noise-robustness methods, including histogram equalization (HEQ), mean and variance normalization (MVN), MVN plus ARMA filtering (MVA), temporal structure normalization (TSN), and the advanced front-end (AFE).
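
The following is a minimal sketch, based only on the description in the abstract, of how modulation-domain median filtering of a complex-valued spectrogram might look; it is not the authors' implementation. The frame length, hop size, and median-filter kernel width are illustrative choices, and the decision to filter the magnitude of each modulation spectrum while retaining its phase is an assumption; the paper additionally derives recognition features from the enhanced spectrogram, which is not shown here.

```python
# Hypothetical sketch of modulation-domain median filtering (MF) for a
# complex-valued acoustic spectrogram; parameter values are illustrative.
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import median_filter

def mf_modulation_enhance(signal, fs, nperseg=400, noverlap=240, kernel=5):
    """Median-filter the temporal modulation spectra of the real and
    imaginary parts of the acoustic spectrogram, then resynthesize."""
    # Complex-valued acoustic spectrogram: shape (freq_bins, frames)
    _, _, spec = stft(signal, fs=fs, nperseg=nperseg, noverlap=noverlap)

    enhanced = []
    for part in (spec.real, spec.imag):
        # Temporal modulation spectrum: DFT of each acoustic-frequency
        # channel along the time (frame) axis
        mod = np.fft.fft(part, axis=1)
        mag, phase = np.abs(mod), np.angle(mod)
        # Median filtering along the modulation-frequency axis, applied
        # independently to each acoustic-frequency channel (assumed design)
        mag_mf = median_filter(mag, size=(1, kernel), mode="nearest")
        # Recombine with the original modulation phase and return to the
        # frame domain; keep the real part, since the FFT input was real
        enhanced.append(np.fft.ifft(mag_mf * np.exp(1j * phase), axis=1).real)

    # Reassemble the complex-valued spectrogram and resynthesize a waveform
    spec_enhanced = enhanced[0] + 1j * enhanced[1]
    _, out = istft(spec_enhanced, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return out
```

Because the real and imaginary spectrograms are filtered separately and no speech or noise model is estimated, the procedure stays unsupervised, consistent with the third merit claimed in the abstract.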

Original language: English
Title of host publication: Proceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016
Editors: Hsin-Min Wang, Qingzhi Hou, Yuan Wei, Tan Lee, Jianguo Wei, Lei Xie, Hui Feng, Jianwu Dang
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509042937
DOI: 10.1109/ISCSLP.2016.7918403
Publication status: Published - 2017 May 2
Event: 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016 - Tianjin, China
Duration: 2016 Oct 17 - 2016 Oct 20

Publication series

Name: Proceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016

Other

Other: 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016
Country: China
City: Tianjin
Period: 16/10/17 - 16/10/20


Keywords

  • Automatic speech recognition
  • Feature extraction
  • Median filter
  • Modulation spectrum
  • Noise robustness
  • Principal component analysis

ASJC Scopus subject areas

  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Linguistics and Language

Cite this

Hsieh, H. J., Chen, B., & Hung, J. W. (2017). Employing median filtering to enhance the complex-valued acoustic spectrograms in modulation domain for noise-robust speech recognition. In H-M. Wang, Q. Hou, Y. Wei, T. Lee, J. Wei, L. Xie, H. Feng, & J. Dang (Eds.), Proceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016 [7918403] (Proceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCSLP.2016.7918403
