Speech is essential for human communication as it not only delivers messages but also expresses emotions. In real environments, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. Therefore, we are motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades.
Speech separation can be achieved by estimating the ideal binary mask (IBM) or ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem where one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses generalization of supervised speech separation.
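The two masks described above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes access to the premixed speech and noise energies in each T-F unit, and the function name `ideal_masks`, the 0 dB local criterion `lc_db`, and the square-root compression `beta=0.5` are illustrative choices that this abstract does not specify.

```python
import numpy as np

def ideal_masks(speech_energy, noise_energy, lc_db=0.0, beta=0.5):
    """Compute (IBM, IRM) from premixed speech and noise T-F energies.

    lc_db (local SNR criterion) and beta (compression exponent) are
    assumed defaults chosen for illustration.
    """
    eps = np.finfo(float).eps  # guards against division by zero and log(0)
    local_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    # IBM: keep a T-F unit only if its local SNR exceeds the criterion.
    ibm = (local_snr_db > lc_db).astype(float)
    # IRM: soft gain in [0, 1] based on the speech share of the total energy.
    irm = (speech_energy / (speech_energy + noise_energy + eps)) ** beta
    return ibm, irm

# Toy energies arranged as (frequency channels x time frames).
speech = np.array([[4.0, 1.0, 0.0], [9.0, 1.0, 16.0]])
noise = np.array([[1.0, 4.0, 1.0], [1.0, 1.0, 0.0]])
ibm, irm = ideal_masks(speech, noise)
```

In the supervised setting, a learning machine is trained to predict such a mask from features of the noisy mixture alone, since the premixed signals are unavailable at test time.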
We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation. The list includes ASR features, speaker recognition features and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.
Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there exists a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.
Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.
Besides noise generalization, speaker generalization is critical for many applications where target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization. The performance on seen speakers degrades as additional speakers are added for training. Such a DNN confuses target speech with interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs and unseen speakers.