Search Results (1 - 6 of 6 Results)


Hu, Ke. Speech Segregation in Background Noise and Competing Speech
Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

In real-world listening environments, the speech reaching our ears is often accompanied by acoustic interference such as environmental sounds, music, or another voice. Noise distorts speech and poses a substantial difficulty for many applications, including hearing aid design and automatic speech recognition. Monaural speech segregation refers to the problem of separating speech from a single recording and is widely regarded as a challenging problem. Significant progress has been made in recent decades, but the challenge remains.

This dissertation addresses monaural speech segregation from different types of interference. First, we study the problem of unvoiced speech segregation, which has received less attention than voiced speech segregation, probably because of its difficulty. We propose to utilize segregated voiced speech to assist unvoiced speech segregation. Specifically, we remove all periodic signals, including voiced speech, from the noisy input and then estimate noise energy in unvoiced intervals using noise-dominant time-frequency units in neighboring voiced intervals. The estimated interference is used by a subtraction stage to extract unvoiced segments, which are then grouped by either simple thresholding or classification. We demonstrate that the proposed system performs substantially better than speech enhancement methods.
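The subtraction stage described above can be pictured with the following minimal sketch, assuming the mixture and the noise estimate are given as time-frequency (T-F) energies on a common filterbank grid; the function name, threshold rule, and placeholder data are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def extract_unvoiced_segments(mixture_energy, noise_energy, snr_threshold_db=0.0):
    """Subtract estimated noise energy from the mixture and keep T-F units
    whose residual-to-noise ratio exceeds a threshold (a simple-thresholding
    stand-in for the grouping step).

    mixture_energy, noise_energy: arrays of shape (channels, frames)
    Returns a binary mask marking candidate unvoiced-speech units.
    """
    residual = np.maximum(mixture_energy - noise_energy, 1e-12)   # spectral subtraction
    local_snr_db = 10.0 * np.log10(residual / np.maximum(noise_energy, 1e-12))
    return (local_snr_db > snr_threshold_db).astype(np.uint8)

# Example usage with random placeholder energies (64 channels, 100 frames).
rng = np.random.default_rng(0)
mix = rng.random((64, 100)) + 0.5
noise = 0.5 * rng.random((64, 100))
mask = extract_unvoiced_segments(mix, noise)
```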

Interference can be a nonspeech signal or another voice. Cochannel speech refers to a mixture of two speech signals. Cochannel speech separation is often addressed by model-based methods, which assume known speaker identities and pretrained speaker models. To address this speaker-dependency limitation, we propose an unsupervised approach to cochannel speech separation. We employ a tandem algorithm to perform simultaneous grouping of speech and develop an unsupervised clustering method to group simultaneous streams across time. The proposed clustering objective measures the speaker difference of each hypothesized grouping and incorporates pitch constraints. For unvoiced speech segregation, we employ an onset/offset-based analysis for segmentation and then divide the segments into unvoiced-voiced and unvoiced-unvoiced portions for separation. We show that this method achieves considerable SNR gains over a range of input SNR conditions and, despite its unsupervised nature, performs competitively with model-based and speaker-independent methods.
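The sequential grouping step can be illustrated with a toy two-speaker clustering sketch. This is a simplified stand-in, assuming each simultaneous stream has already been summarized by a feature vector (for example, an averaged cepstrum); the objective used here, the distance between group centroids, only imitates the idea of measuring speaker difference and ignores the pitch constraints used in the dissertation.

```python
import itertools
import numpy as np

def group_streams_two_speakers(stream_features):
    """Exhaustively assign each simultaneous stream to one of two speakers,
    keeping the assignment whose group centroids are farthest apart
    (a toy measure of speaker difference)."""
    n = len(stream_features)
    best_labels, best_score = None, -np.inf
    for labels in itertools.product([0, 1], repeat=n):
        labels = np.array(labels)
        if labels.min() == labels.max():        # both groups must be non-empty
            continue
        c0 = stream_features[labels == 0].mean(axis=0)
        c1 = stream_features[labels == 1].mean(axis=0)
        score = np.linalg.norm(c0 - c1)         # hypothesized speaker difference
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels

# Example: 6 streams, each summarized by a 13-dimensional feature vector.
feats = np.random.default_rng(1).normal(size=(6, 13))
print(group_streams_two_speakers(feats))
```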

In cochannel speech separation, speaker identities are sometimes known and clean utterances of each speaker are readily available. We can thus describe the speakers with models to assist separation. One issue in model-based cochannel speech separation is generalization to different signal levels. We propose an iterative algorithm that jointly separates the speech signals and estimates the input SNR. We employ hidden Markov models to describe speaker acoustic characteristics and temporal dynamics. Initially, unadapted speaker models are used to segregate the two speech signals, and the separated signals are used to estimate the input SNR. The estimated SNR is then used to adapt the speaker models for re-estimating the speech signals. The two steps iterate until convergence. Systematic evaluations show that our iterative method improves segregation performance significantly and converges relatively quickly. Compared with related model-based methods, it is computationally simpler and performs better under a number of input SNR conditions, in terms of both SNR gains and hit minus false-alarm rates.
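The alternation between separation and SNR estimation can be summarized with the skeleton below. It is only a sketch: the separation, SNR estimation, and adaptation steps are toy stand-ins so the code runs, whereas the actual method uses HMM speaker models.

```python
import numpy as np

# Toy stand-ins so the sketch runs; the actual steps use HMM speaker models.
def separate_with_models(mixture, models):
    g = models["gain"]                        # relative gain of speaker 1 (toy)
    return g * mixture, (1.0 - g) * mixture   # crude energy split, illustration only

def estimate_snr(est1, est2):
    return 10.0 * np.log10(np.sum(est1 ** 2) / max(np.sum(est2 ** 2), 1e-12))

def adapt_models(models, snr_db):
    models["gain"] = 1.0 / (1.0 + 10.0 ** (-snr_db / 20.0))  # toy gain adaptation
    return models

def iterative_separation(mixture, models, max_iters=10, tol=0.1):
    """Alternate model-based separation and input-SNR estimation until convergence."""
    snr_db = None
    est1 = est2 = mixture
    for _ in range(max_iters):
        est1, est2 = separate_with_models(mixture, models)
        new_snr_db = estimate_snr(est1, est2)
        if snr_db is not None and abs(new_snr_db - snr_db) < tol:
            break
        snr_db = new_snr_db
        models = adapt_models(models, snr_db)
    return est1, est2, snr_db

mix = np.random.default_rng(2).normal(size=16000)
print(iterative_separation(mix, {"gain": 0.5})[2])
```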

Committee:

DeLiang Wang (Committee Chair); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member)

Subjects:

Computer Science

Keywords:

Monaural Speech Separation; CASA; Unvoiced Speech; Nonspeech Interference; Cochannel Speech Separation; Unsupervised Clustering; Model-based Method; Iterative Estimation

Han, Kun. Supervised Speech Separation And Processing
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
In real-world environments, speech often occurs simultaneously with acoustic interference such as background noise or reverberation. The interference usually has adverse effects on speech perception and degrades the performance of many speech applications, including automatic speech recognition and speaker identification. Monaural speech separation and processing aim to separate or analyze speech from interference based on only one recording. Although significant progress has been made on this problem, it remains a widely recognized challenge. Unlike traditional signal processing, this dissertation addresses speech separation and processing using machine learning techniques.

We first propose a classification approach to estimate the ideal binary mask (IBM), which is considered a main goal of sound separation in computational auditory scene analysis (CASA). We employ support vector machines (SVMs) to classify time-frequency (T-F) units as either target-dominant or interference-dominant. A rethresholding method is incorporated to improve classification results and maximize hit minus false-alarm rates. Systematic evaluations show that the proposed approach produces accurate IBM estimates.

In a supervised learning framework, generalization to conditions different from those seen in training is very important. We then present methods that require only a small training corpus and can generalize to unseen conditions. The system utilizes SVMs to learn classification cues and then employs a rethresholding technique to estimate the IBM. A distribution fitting method is introduced to generalize to unseen signal-to-noise ratio (SNR) conditions, and voice activity detection based adaptation is used to generalize to unseen noise conditions. In addition, we propose a novel metric learning method to learn invariant speech features in the kernel space. The learned features encode speech-related information and generalize to unseen noise conditions. Experiments show that the proposed approaches produce high-quality IBM estimates under unseen conditions.

Besides background noise, room reverberation is another major source of signal degradation in real environments. Reverberation combined with background noise is particularly disruptive for speech perception and for many applications. We perform dereverberation and denoising using supervised learning. A deep neural network (DNN) is trained to directly learn a spectral mapping from the spectrogram of corrupted speech to that of clean speech. The spectral mapping approach substantially attenuates the distortion caused by reverberation and background noise, leading to improvements in predicted speech intelligibility and quality scores, as well as in speech recognition rates.

Pitch is one of the most important characteristics of speech signals. Although pitch tracking has been studied for decades, it is still challenging to estimate pitch from speech in the presence of strong noise. We estimate pitch using supervised learning, where probabilistic pitch states are directly learned from noisy speech data. We investigate two alternative neural networks for modeling the pitch state distribution given the observations: a feedforward DNN and a recurrent deep neural network (RNN). Both DNNs and RNNs produce accurate probabilistic outputs of pitch states, which are then connected into pitch contours by Viterbi decoding. Experiments show that the proposed algorithms are robust to different noise conditions.
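The IBM-estimation stage described above can be pictured with the following sketch, assuming precomputed per-unit acoustic features and ground-truth dominance labels; scikit-learn's SVC serves as a generic stand-in for the SVM classifier, and the rethresholding loop simply sweeps the decision threshold to maximize hit minus false-alarm (HIT-FA) rate. The features and labels are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_train = rng.normal(size=(2000, 16))          # per-T-F-unit features (placeholder)
y_train = (X_train[:, 0] > 0).astype(int)      # 1 = target-dominant (placeholder labels)
X_dev = rng.normal(size=(500, 16))             # held-out set for rethresholding
y_dev = (X_dev[:, 0] > 0).astype(int)

svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
scores = svm.decision_function(X_dev)          # signed distances to the hyperplane

def hit_minus_fa(mask, ibm):
    hit = np.mean(mask[ibm == 1] == 1) if np.any(ibm == 1) else 0.0
    fa = np.mean(mask[ibm == 0] == 1) if np.any(ibm == 0) else 0.0
    return hit - fa

# Rethresholding: sweep the decision threshold on held-out data and keep
# the threshold that maximizes HIT-FA, then binarize to get the estimated IBM.
thresholds = np.linspace(scores.min(), scores.max(), 50)
best_t = max(thresholds, key=lambda t: hit_minus_fa((scores > t).astype(int), y_dev))
estimated_ibm = (scores > best_t).astype(int)
```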

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member)

Subjects:

Computer Science

Keywords:

Supervised learning; Speech separation; Speech processing; Machine learning; Deep Learning; Pitch estimation; Speech Dereverberation; Deep neural networks; Support vector machines

Wang, Yuxuan. Supervised Speech Separation Using Deep Neural Networks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to its numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among the many techniques, speech separation using a single microphone is the most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades, but its success has been limited thus far.

Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly suitable for this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs.

We start by presenting a comparative study on acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), which is a primary goal in computational auditory scene analysis. We find that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation.

DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions. We demonstrate that our system can improve speech intelligibility for hearing-impaired listeners. Furthermore, by considering the structure in the IBM, we show how to improve IBM estimation by employing sequence training and by optimizing a speech intelligibility predictor.

The IBM is used as the training target in previous work due to its simplicity. DNN based separation is not limited to binary masking, however, and choosing a suitable training target is clearly important. We study the performance of a number of targets and find that ratio masking can be preferable, and that T-F masking in general outperforms spectral mapping. In addition, we propose a new target that encodes structure into ratio masks.

Generalization to noises not seen during training is key to supervised separation. A simple and effective way to improve generalization is to train on multiple noisy conditions. Along this line, we demonstrate that the noise mismatch problem can be well remedied by large-scale training. This important result substantiates the practicability of DNN based supervised separation.

Aside from speech intelligibility, perceptual quality is also important. In the last part of the dissertation, we propose a new DNN architecture that directly reconstructs the time-domain clean speech signal. The resulting system significantly improves objective speech quality over standard mask estimators.
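To make the mask-estimation formulation concrete, here is a minimal sketch of a feedforward DNN trained to predict a ratio mask from acoustic features, using PyTorch with random placeholder data; the layer sizes, features, and training setup are illustrative assumptions, not the configuration used in the dissertation.

```python
import torch
import torch.nn as nn

feat_dim, n_channels = 64, 64                   # feature and mask dimensions (assumed)
model = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_channels), nn.Sigmoid(),  # ratio mask values lie in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch: per-frame features and the corresponding ideal ratio mask.
features = torch.randn(256, feat_dim)
ideal_ratio_mask = torch.rand(256, n_channels)

for _ in range(5):                              # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(features), ideal_ratio_mask)
    loss.backward()
    optimizer.step()

# At test time, the predicted mask scales the noisy T-F representation.
noisy_tf = torch.rand(256, n_channels)
enhanced_tf = model(features).detach() * noisy_tf
```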

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality

Jin, Zhaozhang. Monaural Speech Segregation in Reverberant Environments
Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

Room reverberation is a major source of signal degradation in real environments. While listeners excel in "hearing out" a target source from sound mixtures in noisy and reverberant conditions, simulating this perceptual ability remains a fundamental challenge. The goal of this dissertation is to build a computational auditory scene analysis (CASA) system that separates target voiced speech from its acoustic background in reverberant environments. A supervised learning approach to pitch-based grouping of reverberant speech is proposed, followed by a robust multipitch tracking algorithm based on a hidden Markov model (HMM) framework. Finally, a monaural CASA system for reverberant speech segregation is designed by combining the supervised learning approach and the multipitch tracker.

Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. Assuming that the true target pitch is known, our first study leads to a novel supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given the observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The model trained with this objective function yields significantly better T-F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers.
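The unit-labeling idea, mapping pitch-based features of a T-F unit to the posterior probability that the unit is target dominant, can be sketched as follows. Logistic regression from scikit-learn is used here as a simple stand-in for the learned mapping (the dissertation trains its model with a custom SNR-related objective), and the features and labels are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
pitch_features = rng.normal(size=(5000, 6))        # pitch-based features per T-F unit
target_dominant = (pitch_features[:, 0] + 0.2 * rng.normal(size=5000) > 0).astype(int)

mapper = LogisticRegression(max_iter=1000).fit(pitch_features, target_dominant)

# Posterior probability of each unit being target dominant, then a 0.5 labeling.
posteriors = mapper.predict_proba(pitch_features)[:, 1]
unit_labels = (posteriors > 0.5).astype(int)
```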

Multipitch tracking in real environments is critical for speech signal processing. Determining pitch in both reverberant and noisy conditions is another difficult task. In the second study, we propose a robust algorithm for multipitch tracking in the presence of background noise and room reverberation. A new channel selection method is utilized to extract periodicity features. We derive pitch scores for each pitch state, which estimate the likelihoods of the observed periodicity features given pitch candidates. An HMM integrates these pitch scores and searches for the best pitch state sequence. Our algorithm can reliably detect single and double pitch contours in noisy and reverberant conditions.
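The HMM integration step amounts to Viterbi decoding over pitch states given per-frame pitch scores. Below is a generic log-domain Viterbi sketch, assuming the pitch scores and transition probabilities are already available; the state definitions and probabilities are placeholders rather than the dissertation's actual model.

```python
import numpy as np

def viterbi(log_scores, log_trans, log_init):
    """Find the most likely pitch-state sequence.
    log_scores: (T, S) per-frame log-likelihoods of each pitch state
    log_trans:  (S, S) log transition probabilities
    log_init:   (S,)   log initial state probabilities
    """
    T, S = log_scores.shape
    delta = log_init + log_scores[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_trans            # (previous state, next state)
        backptr[t] = np.argmax(trans, axis=0)
        delta = trans[backptr[t], np.arange(S)] + log_scores[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):                    # backtrack the best sequence
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy example: 100 frames, 30 pitch states (e.g., state 0 could mean "unvoiced").
rng = np.random.default_rng(5)
scores = np.log(rng.random((100, 30)) + 1e-9)
trans = np.log(np.full((30, 30), 1.0 / 30))
init = np.log(np.full(30, 1.0 / 30))
contour = viterbi(scores, trans, init)
```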

Building on the first two studies, we propose a CASA approach to monaural segregation of reverberant voiced speech, which performs multipitch tracking of reverberant mixtures and supervised classification. Speech and nonspeech models are separately trained, and each learns to map pitch-based features to the posterior probability of a T-F unit being dominated by the source with the given pitch estimate. Because interference can be either speech or nonspeech, a likelihood ratio test is introduced to select the correct model for labeling corresponding T-F units. Experimental results show that the proposed system performs robustly in different types of interference and various reverberant conditions, and has a significant advantage over existing systems.
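The model-selection step, deciding whether the interference under a given pitch estimate is better explained by the speech model or the nonspeech model, is essentially a likelihood ratio test. A minimal sketch is shown below, assuming each model exposes a per-unit log-likelihood; here the models are faked with Gaussian densities over a scalar feature, which is only a placeholder for the trained models.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-unit feature and two hypothetical interference models.
rng = np.random.default_rng(6)
unit_feature = rng.normal(loc=0.8, size=200)          # placeholder observations
speech_model = norm(loc=1.0, scale=0.5)               # stand-in for the speech model
nonspeech_model = norm(loc=-1.0, scale=0.5)           # stand-in for the nonspeech model

# Likelihood ratio test over the segment: a positive sum favors the speech model.
log_ratio = (speech_model.logpdf(unit_feature).sum()
             - nonspeech_model.logpdf(unit_feature).sum())
use_speech_model = log_ratio > 0.0
```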

Committee:

DeLiang Wang, PhD (Advisor); Eric Fosler-Lussier, PhD (Committee Member); Mikhail Belkin, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

computational auditory scene analysis; monaural segregation; multipitch tracking; pitch determination algorithm; room reverberation; speech separation; supervised learning

Chen, Jitong. On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication, as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. We are therefore motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades.

Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem in which one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features, and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses the generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation, including ASR features, speaker recognition features, and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there is a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.

Besides noise generalization, speaker generalization is critical for many applications where the target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: performance on seen speakers degrades as additional speakers are added for training, and such a DNN suffers from confusion between target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of the target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs, and unseen speakers.
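The two training targets mentioned above can be made concrete with the following sketch, which computes the IBM and one common form of the IRM from premixed speech and noise T-F energies; the square-root IRM definition and the 0 dB local criterion are common choices assumed here, not necessarily the ones used in the dissertation.

```python
import numpy as np

def ideal_masks(speech_energy, noise_energy, lc_db=0.0):
    """Compute the ideal binary mask (IBM) and an ideal ratio mask (IRM)
    from premixed speech and noise T-F energies of the same shape."""
    local_snr_db = 10.0 * np.log10(
        np.maximum(speech_energy, 1e-12) / np.maximum(noise_energy, 1e-12))
    ibm = (local_snr_db > lc_db).astype(np.float32)   # keep speech-dominant units
    irm = np.sqrt(speech_energy / (speech_energy + noise_energy + 1e-12))  # soft gain
    return ibm, irm

# Example with random placeholder energies (64 channels, 200 frames).
rng = np.random.default_rng(7)
speech = rng.random((64, 200))
noise = rng.random((64, 200))
ibm, irm = ideal_masks(speech, noise)
```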

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization

Narayanan, Arun. Computational auditory scene analysis and robust automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic speech recognition (ASR) has made great strides over the last decade, producing acceptable performance in relatively 'clean' conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improving robustness is to perform speech separation before doing ASR. Most current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation. In auditory perception, for example, speech schemas are known to help improve segregation. An underlying theme of this dissertation is to advocate a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation. CASA is largely motivated by the principles that guide human auditory 'scene analysis'. An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech-dominated and noise-dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin.

We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows broad agreement with human performance, which is rather surprising.

Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs recognition by directly classifying binary masks corresponding to words and phonemes. The method is evaluated on an isolated digit recognition task and a phone classification task. Despite the dramatic reduction of speech information encoded in a binary mask compared to a typical ASR feature front-end, the proposed system performs surprisingly well. The second approach is a novel framework that performs speech separation and ASR in a unified fashion. Separation is performed via masking using an estimated IBM, and ASR is performed using standard cepstral features. Most systems perform these tasks sequentially: separation followed by recognition. The proposed framework, which we call the bidirectional speech decoder, unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On a medium-large vocabulary speech recognition task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.

Supervised classification based speech separation has recently shown a lot of promise. We perform an in-depth evaluation of such techniques as a front-end for noise-robust ASR. Comparing the performance of supervised binary and ratio mask estimators, we observe that ratio masking significantly outperforms binary masking for ASR. Consequently, we propose a separation front-end that consists of two stages. The first stage removes additive noise via ratio time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: a non-linear function is learned that maps the masked spectral features to their clean counterparts. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks (DNNs) and hidden Markov models. Results show that dFDLR consistently improves performance in all test conditions.

We explore alternative ways of using the output of speech separation to improve ASR performance with DNN based acoustic models. Apart from its use as a front-end, we propose using speech separation to provide smooth estimates of speech and noise, which are then passed as additional features. Finally, we develop a unified framework that jointly improves separation and ASR under a supervised learning framework. Our systems obtain state-of-the-art results on two widely used medium-large vocabulary noisy ASR corpora: Aurora-4 and CHiME-2.
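The two-stage front-end described above, ratio masking to remove additive noise followed by a learned mapping from masked features to clean features, can be sketched as below. The mask estimator and the mapping network are untrained PyTorch placeholders with assumed sizes, intended only to show how the stages compose, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

n_channels = 64  # spectral feature dimension (assumed)

# Stage 1 (placeholder): estimate a ratio mask from noisy spectral features.
mask_estimator = nn.Sequential(
    nn.Linear(n_channels, 512), nn.ReLU(),
    nn.Linear(512, n_channels), nn.Sigmoid(),
)

# Stage 2 (placeholder): map masked features to their clean counterparts,
# compensating for channel mismatch and distortions introduced by masking.
feature_mapper = nn.Sequential(
    nn.Linear(n_channels, 512), nn.ReLU(),
    nn.Linear(512, n_channels),
)

def two_stage_front_end(noisy_features):
    mask = mask_estimator(noisy_features)        # ratio T-F masking
    masked = mask * noisy_features               # suppress additive noise
    return feature_mapper(masked)                # enhanced features for the ASR back end

enhanced = two_stage_front_end(torch.rand(100, n_channels))  # 100 frames of features
```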

Committee:

DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Automatic speech recognition; noise robustness; computational auditory scene analysis; binary masking; ratio masking; mask estimation; deep neural networks; acoustic modeling; speech separation; speech enhancement; noisy ASR; CHiME-2; Aurora-4