Search Results (1 - 8 of 8 Results)

Chen, Jitong. On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. We are therefore motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades. Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem in which one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features, and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation. The list includes ASR features, speaker recognition features, and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there exists a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.

Besides noise generalization, speaker generalization is critical for many applications where target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: the performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from the confusion of target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs, and unseen speakers.
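To make the two training targets concrete, the sketch below computes an IBM and an IRM from premixed clean speech and noise using numpy. The STFT front-end, the 0 dB local criterion, and the square-root form of the IRM are illustrative assumptions; the dissertation itself works with gammatone-domain (cochleagram) representations.

    import numpy as np
    from scipy.signal import stft

    def ideal_masks(speech, noise, fs=16000, nperseg=320, lc_db=0.0):
        # T-F representations of the premixed clean speech and noise.
        # An STFT stands in here for the gammatone front-end.
        _, _, S = stft(speech, fs=fs, nperseg=nperseg)
        _, _, N = stft(noise, fs=fs, nperseg=nperseg)
        ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
        local_snr_db = 10.0 * np.log10((ps + 1e-12) / (pn + 1e-12))
        ibm = (local_snr_db > lc_db).astype(float)   # keep speech-dominant units
        irm = np.sqrt(ps / (ps + pn + 1e-12))        # soft gain between 0 and 1
        return ibm, irm

    # Toy example with random signals just to exercise the function.
    rng = np.random.default_rng(0)
    ibm, irm = ideal_masks(rng.standard_normal(16000), 0.5 * rng.standard_normal(16000))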

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization

Jin, Zhaozhang. Monaural Speech Segregation in Reverberant Environments
Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

Room reverberation is a major source of signal degradation in real environments. While listeners excel in "hearing out" a target source from sound mixtures in noisy and reverberant conditions, simulating this perceptual ability remains a fundamental challenge. The goal of this dissertation is to build a computational auditory scene analysis (CASA) system that separates target voiced speech from its acoustic background in reverberant environments. A supervised learning approach to pitch-based grouping of reverberant speech is proposed, followed by a robust multipitch tracking algorithm based on a hidden Markov model (HMM) framework. Finally, a monaural CASA system for reverberant speech segregation is designed by combining the supervised learning approach and the multipitch tracker.

Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. Assuming that the true target pitch is known, our first study leads to a novel supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The model trained using this objective function yields significantly better T-F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers.

Multipitch tracking in real environments is critical for speech signal processing. Determining pitch in both reverberant and noisy conditions is another difficult task. In the second study, we propose a robust algorithm for multipitch tracking in the presence of background noise and room reverberation. A new channel selection method is utilized to extract periodicity features. We derive pitch scores for each pitch state, which estimate the likelihoods of the observed periodicity features given pitch candidates. An HMM integrates these pitch scores and searches for the best pitch state sequence. Our algorithm can reliably detect single and double pitch contours in noisy and reverberant conditions.
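The HMM search over pitch states can be illustrated with a standard Viterbi decoder. The sketch below assumes per-frame log pitch scores and a log transition matrix as inputs and tracks a single pitch for clarity; the dissertation's model jointly handles up to two simultaneous pitches.

    import numpy as np

    def viterbi_pitch_track(log_scores, log_trans):
        """Most likely pitch-state sequence via the Viterbi algorithm.

        log_scores : (T, S) per-frame log pitch scores over S states
                     (e.g. pitch candidates plus an 'unvoiced' state).
        log_trans  : (S, S) log transition probabilities between states.
        """
        T, S = log_scores.shape
        delta = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        delta[0] = log_scores[0]
        for t in range(1, T):
            cand = delta[t - 1][:, None] + log_trans        # (prev, cur)
            back[t] = np.argmax(cand, axis=0)
            delta[t] = cand[back[t], np.arange(S)] + log_scores[t]
        path = np.zeros(T, dtype=int)
        path[-1] = int(np.argmax(delta[-1]))
        for t in range(T - 2, -1, -1):
            path[t] = back[t + 1, path[t + 1]]
        return path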

Building on the first two studies, we propose a CASA approach to monaural segregation of reverberant voiced speech, which performs multipitch tracking of reverberant mixtures and supervised classification. Speech and nonspeech models are separately trained, and each learns to map pitch-based features to the posterior probability of a T-F unit being dominated by the source with the given pitch estimate. Because interference can be either speech or nonspeech, a likelihood ratio test is introduced to select the correct model for labeling corresponding T-F units. Experimental results show that the proposed system performs robustly in different types of interference and various reverberant conditions, and has a significant advantage over existing systems.
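The model-selection step can be pictured as a simple log-likelihood ratio test. In the sketch below, speech_model and nonspeech_model are hypothetical stand-ins for the trained speech and nonspeech classifiers; the log_likelihood() and posterior() methods are placeholder names, not the dissertation's actual interface.

    def label_tf_units(features, speech_model, nonspeech_model):
        # Pick whichever interference model explains the pitch-based
        # features better, then let that model label the T-F units.
        llr = (speech_model.log_likelihood(features)
               - nonspeech_model.log_likelihood(features))
        model = speech_model if llr > 0.0 else nonspeech_model  # likelihood ratio test
        return model.posterior(features) > 0.5                  # True = target-dominant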

Committee:

DeLiang Wang, PhD (Advisor); Eric Fosler-Lussier, PhD (Committee Member); Mikhail Belkin, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

computational auditory scene analysis; monaural segregation; multipitch tracking; pitch determination algorithm; room reverberation; speech separation; supervised learning

Shao, Yang. Sequential organization in computational auditory scene analysis
Doctor of Philosophy, The Ohio State University, 2007, Computer and Information Science

A human listener's ability to organize the time-frequency (T-F) energy of the same sound source into a single stream is termed auditory scene analysis (ASA). Computational auditory scene analysis (CASA) seeks to organize sound based on ASA principles. This dissertation presents a systematic effort on sequential organization in CASA. The organization goal is to group T-F segments from the same speaker that are separated in time into a single stream.

This dissertation proposes a speaker-model-based sequential organization framework that shows better grouping performance than feature-based methods. Specifically, a computational objective is derived for sequential grouping in the context of speaker recognition for multi-talker mixtures. This formulation leads to a grouping system that searches for the optimal grouping of separated speech segments. A hypothesis pruning method is then proposed that significantly reduces search space and time while achieving performance close to that of exhaustive search. Evaluations show that the proposed system improves both grouping performance and speech recognition accuracy. The proposed system is then extended to handle multi-talker as well as non-speech intrusions using generic models. The system is further extended to deal with noisy inputs from unknown speakers. It employs a speaker quantization method that extracts generic models from a large speaker space. The resulting grouping performance is only moderately lower than that with known speaker models.
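The grouping objective can be sketched as a search over all assignments of separated segments to speaker models. In the illustration below, stream_score is an assumed scoring function that evaluates a grouped stream under one speaker model; the brute-force enumeration is exactly what the proposed hypothesis pruning is designed to avoid and is shown only to make the objective concrete.

    import itertools
    import numpy as np

    def best_grouping(segments, speaker_models, stream_score):
        # Try every assignment of segments to speakers and keep the one
        # that maximizes the total stream score under the speaker models.
        best_score, best_assign = -np.inf, None
        for assign in itertools.product(range(len(speaker_models)), repeat=len(segments)):
            score = 0.0
            for j, model in enumerate(speaker_models):
                stream = [seg for seg, a in zip(segments, assign) if a == j]
                score += stream_score(stream, model)
            if score > best_score:
                best_score, best_assign = score, assign
        return best_assign, best_score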

In addition, this dissertation presents a systematic effort in robust speaker recognition. A novel usable speech extraction method is proposed that significantly improves recognition performance. A general solution is proposed for speaker recognition under additive-noise conditions. Novel speaker features are derived from auditory filtering, and are used in conjunction with an uncertainty decoder that accounts for mismatch introduced in CASA front-end processing. Evaluations show that the proposed system achieves significant performance improvement over the use of typical speaker features and a state-of-the-art robust front-end processor for noisy speech.

Committee:

DeLiang Wang (Advisor)

Subjects:

Computer Science

Keywords:

Sequential Organization; Sequential Grouping; Auditory Scene Analysis; Computational Auditory Scene Analysis; Speech Organization; Robust Speaker Recognition; Auditory Feature; Speaker Quantization

Wang, Yuxuan. Supervised Speech Separation Using Deep Neural Networks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among many techniques, speech separation using a single microphone is most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades. However, its success has been limited thus far. Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly well suited to this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs.

We start by presenting a comparative study on acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), which is a primary goal in computational auditory scene analysis. We find that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation.

DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions. We demonstrate that our system can improve speech intelligibility for hearing-impaired listeners. Furthermore, by considering the structure in the IBM, we show how to improve IBM estimation by employing sequence training and optimizing a speech intelligibility predictor.

The IBM is used as the training target in previous work due to its simplicity. DNN-based separation is not limited to binary masking, and choosing a suitable training target is obviously important. We study the performance of a number of targets and find that ratio masking can be preferable, and that T-F masking in general outperforms spectral mapping. In addition, we propose a new target that encodes structure into ratio masks.

Generalization to noises not seen during training is key to supervised separation. A simple and effective way to improve generalization is to train on multiple noisy conditions. Along this line, we demonstrate that the noise mismatch problem can be well remedied by large-scale training. This important result substantiates the practicality of DNN-based supervised separation.

Aside from speech intelligibility, perceptual quality is also important. In the last part of the dissertation, we propose a new DNN architecture that directly reconstructs the time-domain clean speech signal. The resulting system significantly improves objective speech quality over standard mask estimators.
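As a rough illustration of DNN-based mask estimation, the following PyTorch sketch trains a small feedforward network to map frame-level acoustic features to a ratio mask. The layer sizes, feature dimension, number of frequency channels, and MSE loss are placeholder choices, not the dissertation's exact architecture or training setup.

    import torch
    import torch.nn as nn

    class MaskEstimator(nn.Module):
        # Feedforward mask estimator: acoustic features in, ratio mask out.
        def __init__(self, feat_dim=246, num_channels=64, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_channels), nn.Sigmoid(),  # mask in [0, 1]
            )

        def forward(self, x):
            return self.net(x)

    model = MaskEstimator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()                  # regress toward the ideal ratio mask

    features = torch.randn(32, 246)         # dummy batch of frame-level features
    target_mask = torch.rand(32, 64)        # dummy ideal ratio mask targets
    opt.zero_grad()
    loss = loss_fn(model(features), target_mask)
    loss.backward()
    opt.step()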

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality

Narayanan, Arun. Computational auditory scene analysis and robust automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic speech recognition (ASR) has made great strides over the last decade, producing acceptable performance in relatively 'clean' conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improve robustness is to perform speech separation before doing ASR. Most current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation. For example, in auditory perception, speech schemas have been known to help improve segregation. An underlying theme of this dissertation is the advocacy of a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation.

CASA is largely motivated by the principles that guide human auditory scene analysis. An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech-dominated and noise-dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin. We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows broad agreement with human performance, which is rather surprising.

Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs recognition by directly classifying binary masks corresponding to words and phonemes. The method is evaluated on an isolated digit recognition task and a phone classification task. Despite the dramatic reduction of speech information encoded in a binary mask compared to a typical ASR feature front-end, the proposed system performs surprisingly well. The second approach is a novel framework that performs speech separation and ASR in a unified fashion. Separation is performed via masking using an estimated IBM, and ASR is performed using standard cepstral features. Most systems perform these tasks in a sequential fashion: separation followed by recognition. The proposed framework, which we call the bidirectional speech decoder, unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On a medium-large vocabulary speech recognition task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.

Supervised classification based speech separation has shown a lot of promise recently. We perform an in-depth evaluation of such techniques as a front-end for noise-robust ASR. Comparing the performance of supervised binary and ratio mask estimators, we observe that ratio masking significantly outperforms binary masking when it comes to ASR. Consequently, we propose a separation front-end that consists of two stages. The first stage removes additive noise via ratio time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: a non-linear function is learned that maps the masked spectral features to their clean counterparts. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks (DNNs) and hidden Markov models. Results show that dFDLR consistently improves performance in all test conditions.

We explore alternative ways of using the output of speech separation to improve ASR performance with DNN-based acoustic models. Apart from its use as a front-end, we propose using speech separation to provide smooth estimates of speech and noise, which are then passed as additional features. Finally, we develop a unified framework that jointly improves separation and ASR under a supervised learning framework. Our systems obtain state-of-the-art results on two widely used medium-large vocabulary noisy ASR corpora: Aurora-4 and CHiME-2.
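The two-stage front-end can be summarized in a few lines. In the sketch below, ratio_mask is assumed to come from a trained mask estimator and spectral_mapper is a placeholder for the learned nonlinear mapping to clean features; both names are illustrative rather than the dissertation's actual code.

    import numpy as np

    def two_stage_frontend(noisy_power, ratio_mask, spectral_mapper):
        # Stage 1: suppress additive noise by applying the estimated ratio
        # mask to the noisy T-F power spectrogram.
        masked = ratio_mask * noisy_power
        # Stage 2: spectral_mapper is an assumed learned nonlinear function
        # (e.g. a small DNN) that maps masked log-spectral features toward
        # their clean counterparts, addressing channel mismatch and the
        # distortions introduced by masking.
        return spectral_mapper(np.log(masked + 1e-8))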

Committee:

DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Automatic speech recognition; noise robustness; computational auditory scene analysis; binary masking; ratio masking; mask estimation; deep neural networks; acoustic modeling; speech separation; speech enhancement; noisy ASR; CHiME-2; Aurora-4

Srinivasan, Soundararajan. Integrating computational auditory scene analysis and automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2006, Biomedical Engineering
Speech perception studies indicate that the robustness of human speech recognition is primarily due to our ability to segregate a target sound source from other interferences. This perceptual process of auditory scene analysis (ASA) is of two types, primitive and schema-driven. This dissertation investigates several aspects of integrating computational ASA (CASA) and automatic speech recognition (ASR). While bottom-up CASA is used as a front-end for ASR to improve its robustness, ASR is used to provide top-down information to enhance primitive segregation.

Listeners are able to restore masked phonemes by utilizing lexical context. We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode masked speech and activates word templates via dynamic time warping. A systematic evaluation shows that the model restores both voiced and unvoiced phonemes with high spectral quality.

Missing-data ASR requires a binary mask from bottom-up CASA that identifies speech-dominant time-frequency regions of a noisy mixture. We propose a two-pass system that performs segregation and recognition in tandem. First, an n-best lattice, consistent with bottom-up speech separation, is generated. Second, the lattice is re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently.

By combining CASA and ASR, we present a model that simulates listeners' ability to attend to a target speaker when speech is degraded by energetic and informational masking. Missing-data ASR is used to account for energetic masking, and the output degradation of CASA is used to model informational masking. The model successfully simulates several quantitative aspects of listener performance.

The degradation in the output of CASA-based front-ends leads to uncertain ASR inputs. We estimate feature uncertainties in the spectral domain and transform them into the cepstral domain via nonlinear regression. The estimated uncertainty substantially improves recognition accuracy.

We also investigate the effect of vocabulary size on conventional and missing-data ASR. Based on binaural cues, we extract the speech signal using a Wiener filter for conventional ASR and estimate a binary mask for missing-data ASR. We find that while missing-data ASR outperforms conventional ASR on a small vocabulary task, the relative performance reverses on a larger vocabulary task.
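The core of missing-data ASR is scoring each frame against an acoustic model using only the reliable (speech-dominant) dimensions identified by the binary mask. The sketch below shows this marginalization for a single diagonal-Gaussian HMM state; it simply drops unreliable dimensions, which is a simplification of common refinements such as bounded marginalization.

    import numpy as np
    from scipy.stats import norm

    def marginal_log_likelihood(obs, reliable_mask, state_mean, state_var):
        # Score one spectral frame against one HMM state's diagonal-Gaussian
        # clean-speech model using only the dimensions the binary mask marks
        # as reliable (speech-dominant); unreliable dimensions are ignored.
        r = reliable_mask.astype(bool)
        return norm.logpdf(obs[r], loc=state_mean[r],
                           scale=np.sqrt(state_var[r])).sum()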

Committee:

DeLiang Wang (Advisor)

Keywords:

Computational auditory scene analysis (CASA); Robust automatic speech recognition; Speech segregation; Phonemic restoration; Top-down analysis; Binaural processing; Uncertainty decoding; Multitalker perception

Woodruff, John F. Integrating Monaural and Binaural Cues for Sound Localization and Segregation in Reverberant Environments
Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

The problem of segregating a sound source of interest from an acoustic background has been extensively studied due to applications in hearing prostheses, robust speech/speaker recognition and audio information retrieval. Computational auditory scene analysis (CASA) approaches the segregation problem by utilizing grouping cues involved in the perceptual organization of sound by human listeners. Binaural processing, where input signals resemble those that enter the two ears, is of particular interest in the CASA field. The dominant approach to binaural segregation has been to derive spatially selective filters in order to enhance the signal in a direction of interest. As such, the problems of sound localization and sound segregation are closely tied. While spatial filtering has been widely utilized, substantial performance degradation is incurred in reverberant environments and, more fundamentally, segregation cannot be performed without sufficient spatial separation between sources.

This dissertation addresses the problems of binaural localization and segregation in reverberant environments by integrating monaural and binaural cues. Motivated by research in psychoacoustics and by developments in monaural CASA processing, we first develop a probabilistic framework for joint localization and segregation of voiced speech. Pitch cues are used to group sound components across frequency over continuous time intervals. Time-frequency regions resulting from this partial organization are then localized by integrating binaural cues, which enhances robustness to reverberation, and grouped across time based on the estimated locations. We demonstrate that this approach outperforms voiced segregation based on either monaural or binaural analysis alone. We also demonstrate substantial performance gains in terms of multisource localization, particularly for distant sources in reverberant environments and low signal-to-noise ratios. We then develop a binaural system for joint localization and segregation of an unknown and time-varying number of sources that is more flexible and requires less prior information than our initial system. This framework incorporates models trained jointly on pitch and azimuth cues, which improves performance and naturally deals with both voiced and unvoiced speech. Experimental results show that the proposed approach outperforms existing two-microphone systems in spite of less prior information.
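One way to picture the integration of monaural and binaural cues is a simple probabilistic fusion per T-F unit, assuming the two cues are conditionally independent. The sketch below is a deliberately simplified stand-in for the dissertation's framework; the pitch-based and azimuth-based log likelihoods and the azimuth prior are assumed inputs.

    import numpy as np

    def combine_cues(log_p_pitch, log_p_azimuth, log_prior):
        # log_p_pitch   : (U,) log likelihood that each unit belongs to the
        #                 target, from pitch-based (monaural) grouping
        # log_p_azimuth : (U, A) log likelihood of each unit under each
        #                 candidate azimuth (binaural cue)
        # log_prior     : (A,) log prior over azimuths
        joint = log_p_azimuth + log_prior                  # (U, A)
        best_azimuth = int(np.argmax(joint.sum(axis=0)))   # localize from all units
        target_score = log_p_pitch + joint[:, best_azimuth]  # fuse cues per unit
        return best_azimuth, target_score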

We also consider how the computational goal of CASA-based segregation should be defined in reverberant environments. The ideal binary mask (IBM) has been established as a main goal of CASA. While the IBM is defined unambiguously in anechoic conditions, in reverberant environments there is some flexibility in how one might define the target signal itself and therefore, ambiguity is introduced to the notion of the IBM. Due to the perceptual distinction between early and late reflections, we introduce the reflection boundary as a parameter to the IBM definition to allow target reflections to be divided into desirable and undesirable components. We conduct a series of intelligibility tests with normal hearing listeners to compare alternative IBM definitions. Results show that it is vital for the IBM definition to account for the energetic effect of early target reflections, and that late target reflections should be characterized as noise.
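The reflection-boundary idea can be made concrete by splitting the room impulse response at the boundary and treating only the direct sound plus early reflections as target. The sketch below assumes the impulse response starts at the direct sound and uses a 50 ms boundary and 0 dB local criterion purely as illustrative values.

    import numpy as np
    from scipy.signal import stft, fftconvolve

    def reflection_boundary_ibm(dry, rir, noise, fs=16000,
                                boundary_ms=50.0, lc_db=0.0):
        # Split the RIR: direct sound plus early reflections form the target,
        # late reflections are lumped with the noise. The RIR is assumed to
        # start at the direct sound; noise is at least as long as dry.
        split = int(boundary_ms * 1e-3 * fs)
        early, late = rir.copy(), rir.copy()
        early[split:] = 0.0
        late[:split] = 0.0
        target = fftconvolve(dry, early)[:len(dry)]
        interference = fftconvolve(dry, late)[:len(dry)] + noise[:len(dry)]
        _, _, T = stft(target, fs=fs, nperseg=320)
        _, _, I = stft(interference, fs=fs, nperseg=320)
        snr_db = 10.0 * np.log10((np.abs(T) ** 2 + 1e-12) / (np.abs(I) ** 2 + 1e-12))
        return (snr_db > lc_db).astype(float)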

Committee:

DeLiang Wang, PhD (Advisor); Mikhail Belkin, PhD (Committee Member); Eric Fosler-Lussier, PhD (Committee Member); Nicoleta Roman, PhD (Committee Member)

Subjects:

Acoustics; Artificial Intelligence; Computer Science; Electrical Engineering

Keywords:

computational auditory scene analysis; speech segregation; sound localization; binaural; monaural; ideal binary masking; speech intelligibility

Roman, Nicoleta. Auditory-based algorithms for sound segregation in multisource and reverberant environments
Doctor of Philosophy, The Ohio State University, 2005, Computer and Information Science
At a cocktail party, we can selectively attend to a single voice and filter out other interferences. This perceptual ability has motivated a new field of study known as computational auditory scene analysis (CASA), which aims to build speech separation systems that incorporate auditory principles. The psychological process of figure-ground segregation suggests that the target signal should be segregated as foreground while the remaining stimuli are treated as background. Accordingly, the computational goal of CASA should be to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. This dissertation investigates four aspects of CASA processing: location-based speech segregation, binaural tracking of multiple moving sources, binaural sound segregation in reverberation, and monaural segregation of reverberant speech.

For localization, the auditory system utilizes the interaural time difference (ITD) and interaural intensity difference (IID) between the ears. We observe that, within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes in ITD and IID, resulting in a characteristic clustering. Consequently, we propose a supervised learning approach to estimate the ideal binary mask. A systematic evaluation shows that the resulting system produces masks very close to the ideal binary ones and yields large speech intelligibility improvements.

In realistic environments, source motion requires consideration. Binaural cues are strongly correlated with locations in T-F units dominated by one source, resulting in channel-dependent conditional probabilities. Consequently, we propose a multi-channel method to integrate these probabilities in order to compute the likelihood function in a target space. Finally, a hidden Markov model is employed to form continuous tracks and automatically detect the number of active sources.

Reverberation affects the ITD and IID cues. We therefore propose a binaural segregation system that combines target cancellation through adaptive filtering with a binary decision rule to estimate the ideal binary mask. A major advantage of the proposed system is that it imposes no restrictions on the interfering sources. Quantitative evaluations show that our system outperforms related beamforming approaches.

Psychoacoustic evidence suggests that monaural processing plays a vital role in segregation. It is known that reverberation smears the harmonicity of speech signals. We therefore propose a two-stage separation system that combines inverse filtering of the target room impulse response with pitch-based segregation. As a result of the first stage, the harmonicity of a signal arriving from the target direction is partially restored while signals arriving from other locations are further smeared, which leads to improved segregation and considerable signal-to-noise ratio gains.
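For reference, the two binaural cues can be extracted per T-F unit roughly as follows: ITD from the lag of the cross-correlation peak between the band-filtered ear signals, and IID from their energy ratio in dB. The plausible ITD range and the plain (unnormalized) cross-correlation used here are simplifying assumptions.

    import numpy as np

    def binaural_features(left, right, fs=16000, max_itd_s=0.001):
        # left, right: equal-length band-filtered ear signals for one unit.
        n = len(right)
        full = np.correlate(left, right, mode="full")       # lags -(n-1)..(n-1)
        lags = np.arange(-(n - 1), n)
        keep = np.abs(lags) <= int(max_itd_s * fs)          # plausible ITD range
        itd = lags[keep][np.argmax(full[keep])] / fs        # seconds
        iid = 10.0 * np.log10((np.sum(left ** 2) + 1e-12) /
                              (np.sum(right ** 2) + 1e-12))  # dB
        return itd, iid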

Committee:

DeLiang Wang (Advisor)

Keywords:

computational auditory scene analysis (CASA); binaural speech segregation; monaural speech segregation; robust automatic speech segregation; adaptive filtering; room impulse response; reverberation