Abstract
Owing to the enormous progress in deep learning, speech enhancement (SE) techniques have shown promising efficacy and play a pivotal role as a front end to automatic speech recognition (ASR) systems, where they mitigate the effects of noise. Across a wide variety of deep neural network (DNN) approaches, several studies have suggested that the phase information of speech signals is key to successful SE. In this study, we continue our previous work and put forward a novel Cross-domain Time-reversal Enhancement NETwork (CD-TENET). CD-TENET leverages the time-reversed version of a speech signal together with two features that capture the signal's phase information in the time domain and the frequency domain, respectively, to improve SE performance for noise-robust ASR. Extensive experiments conducted on the Voicebank-DEMAND benchmark dataset show that CD-TENET can not only recover the original speech effectively but also improve SE and ASR performance simultaneously. More surprisingly, the proposed CD-TENET method offers a marked relative word error rate (WER) reduction of 43% on the test set of scenarios contaminated with unseen noises, compared with a strong baseline trained in the multi-condition setting.
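The two quantities the abstract relies on, the time-reversed signal and the relative WER reduction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `time_reverse` and `relative_wer_reduction`, the synthetic sine-wave input, and the WER values are all hypothetical and chosen only for illustration. It assumes NumPy is available.

```python
import numpy as np


def time_reverse(waveform):
    """Return the waveform flipped along its time (last) axis."""
    return np.flip(np.asarray(waveform), axis=-1)


def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction: (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2 * np.pi * 440 * t)

reversed_speech = time_reverse(speech)

# Reversing twice restores the original signal exactly.
restored = time_reverse(reversed_speech)

# For a real signal, time reversal leaves the magnitude spectrum unchanged
# and alters only the phase.
same_magnitude = np.allclose(
    np.abs(np.fft.rfft(speech)), np.abs(np.fft.rfft(reversed_speech))
)

# Hypothetical WER values (NOT taken from the paper): 20.0 % down to 15.0 %.
example_reduction = relative_wer_reduction(20.0, 15.0)
```

Because reversal preserves the magnitude spectrum and changes only the phase, a reversed copy of an utterance carries the same spectral energy with different phase structure, which is one plausible reading of why it pairs naturally with phase-aware features.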
Original language | English |
---|---|
Journal | IEEE Multimedia |
DOIs | |
Publication status | Accepted/In press - 2021 |
ASJC Scopus subject areas
- Software
- Signal Processing
- Media Technology
- Hardware and Architecture
- Computer Science Applications