Search Results (1-2 of 2)


Chen, Jitong
On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. We are therefore motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades.

Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem in which one estimates the ideal mask from noisy speech. The three key components of supervised speech separation are learning machines, acoustic features and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses the generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation, including ASR features, speaker recognition features and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is desired for noise-dependent speech separation: when tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there exists a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.
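For illustration, the ideal ratio mask used as a training target above admits a very compact implementation. The sketch below assumes access to the premixed clean speech and noise magnitudes of a training mixture in some T-F representation (an STFT-style magnitude spectrogram is assumed here purely for convenience; the dissertation works with gammatone-domain features, and all names are illustrative rather than taken from the author's code).

    import numpy as np

    def ideal_ratio_mask(speech_mag, noise_mag):
        """Ideal ratio mask: a per T-F unit gain in [0, 1].

        speech_mag, noise_mag: magnitude spectrograms of the premixed clean
        speech and noise, same shape (freq x time). Energies are used below;
        the exact exponent varies across formulations.
        """
        s2 = speech_mag ** 2
        n2 = noise_mag ** 2
        return s2 / (s2 + n2 + 1e-12)  # small epsilon avoids division by zero

    def apply_mask(noisy_mag, mask):
        """Suppress noise by scaling each T-F unit of the noisy magnitudes."""
        return noisy_mag * mask

During training, a learning machine is fit to predict this mask from features of the noisy mixture; at test time the estimated mask is applied to the noisy representation as above.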
Besides noise generalization, speaker generalization is critical for many applications where the target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from confusion between target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of the target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs and unseen speakers.
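A minimal sketch of an LSTM-based mask estimator in the spirit of the model described above is given below. PyTorch is assumed for concreteness; layer sizes, feature dimensions and names are illustrative and not taken from the dissertation.

    import torch
    import torch.nn as nn

    class LSTMMaskEstimator(nn.Module):
        """Sequence model mapping noisy acoustic features to a ratio-mask estimate."""

        def __init__(self, feat_dim=64, hidden_dim=256, num_layers=2):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_dim,
                                num_layers=num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, feat_dim)

        def forward(self, feats):
            # feats: (batch, time, feat_dim) features of the noisy mixture
            h, _ = self.lstm(feats)
            return torch.sigmoid(self.out(h))  # mask values in [0, 1]

    # Training would minimize, e.g., the MSE between estimated and ideal masks:
    # loss = nn.functional.mse_loss(model(noisy_feats), ideal_mask)

The recurrence over time is what lets such a model carry information about the target speaker across frames, which is the property credited above for the improved speaker generalization.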

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization;

Narayanan, Arun
Computational auditory scene analysis and robust automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic speech recognition (ASR) has made great strides over the last decade, producing acceptable performance in relatively 'clean' conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improving robustness is to perform speech separation before doing ASR. Most current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation; in auditory perception, for example, speech schemas are known to help improve segregation. An underlying theme of this dissertation is the advocacy of a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation.

CASA is largely motivated by the principles that guide human auditory 'scene analysis'. An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech-dominated and noise-dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin.

We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs) and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows a broad agreement with human performance, which is rather surprising.

Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs recognition by directly classifying binary masks corresponding to words and phonemes. The method is evaluated on an isolated digit recognition task and a phone classification task. Despite the dramatic reduction of speech information encoded in a binary mask compared to a typical ASR feature front-end, the proposed system performs surprisingly well. The second approach is a novel framework that performs speech separation and ASR in a unified fashion. Separation is performed via masking using an estimated IBM, and ASR is performed using standard cepstral features. Most systems perform these tasks sequentially: separation followed by recognition. The proposed framework, which we call the bidirectional speech decoder, unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On a medium-large vocabulary speech recognition task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.

Supervised classification-based speech separation has recently shown a lot of promise. We perform an in-depth evaluation of such techniques as a front-end for noise-robust ASR.
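As a concrete illustration of the ideal binary mask discussed above and its use as a masking front-end, the sketch below labels each T-F unit as speech- or noise-dominated by thresholding the local SNR. The 0 dB local criterion and the names are assumptions for illustration, not values taken from the dissertation.

    import numpy as np

    def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
        """1 where the local SNR exceeds the local criterion (in dB), else 0.

        speech_energy, noise_energy: premixed energies per T-F unit
        (freq x time) of the clean speech and the noise.
        """
        local_snr_db = 10.0 * np.log10((speech_energy + 1e-12) /
                                       (noise_energy + 1e-12))
        return (local_snr_db > lc_db).astype(np.float32)

    def mask_for_asr(noisy_spectrogram, ibm):
        """Masking front-end: keep speech-dominated units and zero out the
        rest before computing recognizer features from the masked result."""
        return noisy_spectrogram * ibm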
Comparing the performance of supervised binary and ratio mask estimators, we observe that ratio masking significantly outperforms binary masking when it comes to ASR. Consequently, we propose a separation front-end that consists of two stages. The first stage removes additive noise via ratio time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: a non-linear function is learned that maps the masked spectral features to their clean counterparts. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions.

We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks (DNNs) and hidden Markov models. Results show that dFDLR consistently improves performance in all test conditions. We explore alternative ways of using the output of speech separation to improve ASR performance with DNN-based acoustic models; apart from its use as a front-end, we propose using speech separation to provide smooth estimates of speech and noise, which are then passed as additional features. Finally, we develop a unified framework that jointly improves separation and ASR under a supervised learning framework. Our systems obtain state-of-the-art results on two widely used medium-large vocabulary noisy ASR corpora: Aurora-4 and CHiME-2.
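The two-stage separation front-end described above can be sketched roughly as follows. This is a PyTorch sketch under the assumption that both stages operate on spectral magnitude features; the dimensions, names and the mapping network are illustrative, not the dissertation's exact architecture.

    import torch
    import torch.nn as nn

    class SpectralMapper(nn.Module):
        """Stage 2: non-linear mapping from masked features to clean features."""

        def __init__(self, feat_dim=64, hidden_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, feat_dim),
            )

        def forward(self, masked_feats):
            return self.net(masked_feats)

    def enhance(noisy_feats, estimated_ratio_mask, mapper):
        """Stage 1: suppress additive noise with an estimated ratio mask.
        Stage 2: map the masked features toward their clean counterparts to
        reduce channel mismatch and the distortions introduced by masking."""
        masked = noisy_feats * estimated_ratio_mask
        return mapper(masked)

    # The mapper would be trained against parallel clean features, e.g.:
    # loss = nn.functional.mse_loss(mapper(masked), clean_feats)

The enhanced features would then be passed to the recognizer's acoustic model in place of (or alongside) the noisy ones.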

Committee:

DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Automatic speech recognition; noise robustness; computational auditory scene analysis; binary masking; ratio masking; mask estimation; deep neural networks; acoustic modeling; speech separation; speech enhancement; noisy ASR; CHiME-2; Aurora-4;