Search Results (1 - 5 of 5 Results)

Howard, Shaun Michael. Deep Learning for Sensor Fusion
Master of Sciences (Engineering), Case Western Reserve University, 2017, EECS - Computer and Information Sciences
The use of multiple sensors in modern vehicular applications is necessary to provide a complete view of the surroundings for advanced driver assistance systems (ADAS) and automated driving. The fusion of these sensors provides increased certainty in the recognition, localization, and prediction of surroundings. A deep learning-based sensor fusion system is proposed to fuse two independent, multi-modal sensor sources. This system is shown to successfully learn the complex capabilities of an existing state-of-the-art sensor fusion system and to generalize well to new sensor fusion datasets. It achieves high precision and recall with minimal confusion after training on several million examples of labeled multi-modal sensor data. It is robust, has a sustainable training time, and offers real-time response on a deep learning PC with a single NVIDIA GeForce GTX 980 Ti graphics processing unit (GPU).
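
As a rough illustration of the multi-stream fusion idea summarized in the abstract (the thesis itself evaluates feedforward, GRU, and LSTM variants), the following PyTorch sketch fuses two sensor streams with separate encoders and a shared classification head. The input dimensions, layer widths, and the matched-pair output are illustrative assumptions, not the thesis implementation.

```python
# Minimal two-stream sensor-fusion sketch (assumed dimensions, not the thesis code).
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, cam_dim=8, radar_dim=8, hidden=64):
        super().__init__()
        # One encoder per sensor stream (e.g., camera and radar object tracks).
        self.cam_encoder = nn.Sequential(nn.Linear(cam_dim, hidden), nn.ReLU())
        self.radar_encoder = nn.Sequential(nn.Linear(radar_dim, hidden), nn.ReLU())
        # Fusion head: concatenate the stream embeddings and classify.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, cam_feats, radar_feats):
        fused = torch.cat([self.cam_encoder(cam_feats),
                           self.radar_encoder(radar_feats)], dim=-1)
        # Probability that the camera and radar tracks describe the same object.
        return self.head(fused)

model = TwoStreamFusion()
scores = model(torch.randn(4, 8), torch.randn(4, 8))  # batch of 4 hypothetical track pairs
```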

Committee:

Dr. Wyatt Newman (Committee Chair); Dr. M. Cenk Cavusoglu (Committee Member); Dr. Michael Lewicki (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

deep learning; sensor fusion; deep neural networks; advanced driver assistance systems; automated driving; multi-stream neural networks; feedforward; multilayer perceptron; recurrent; gated recurrent unit; long short-term memory; camera; radar

Chen, Jitong. On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. Therefore, we are motivated to develop speech separation algorithms to improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades. Speech separation can be achieved by estimating the ideal binary mask (IBM) or ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem where one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features, and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses generalization of supervised speech separation. We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation. The list includes ASR features, speaker recognition features, and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs. Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there exists a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility. Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises. Besides noise generalization, speaker generalization is critical for many applications where target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization. The performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from confusion of the target speech with interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs, and unseen speakers.
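
The IBM and IRM described above have closed-form definitions given premixed clean speech and noise, so they can be sketched directly. The numpy snippet below assumes T-F energy spectrograms of the speech and noise components, a 0 dB local SNR criterion for the IBM, and the common square-root form of the IRM; these choices are assumptions for illustration, not the dissertation's exact configuration.

```python
# Illustrative computation of the ideal binary mask (IBM) and ideal ratio mask (IRM)
# from premixed speech and noise T-F energies (assumed inputs and thresholds).
import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """1 for speech-dominant T-F units, 0 for noise-dominant ones."""
    local_snr_db = 10.0 * np.log10((speech_energy + 1e-12) / (noise_energy + 1e-12))
    return (local_snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(speech_energy, noise_energy):
    """Soft gain in [0, 1] applied to each T-F unit to suppress noise."""
    return np.sqrt(speech_energy / (speech_energy + noise_energy + 1e-12))

# Toy example: random "spectrograms" with shape (time frames, frequency channels).
S = np.random.rand(100, 64)
N = np.random.rand(100, 64)
ibm = ideal_binary_mask(S, N)
irm = ideal_ratio_mask(S, N)
```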

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization;

Wang, Yuxuan. Supervised Speech Separation Using Deep Neural Networks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among many techniques, speech separation using a single microphone is the most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades, but its success has been limited thus far. Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly suited to this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs. We start by presenting a comparative study on acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), which is a primary goal in computational auditory scene analysis. We find that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation. DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions. We demonstrate that our system can improve speech intelligibility for hearing-impaired listeners. Furthermore, by considering the structure in the IBM, we show how to improve IBM estimation by employing sequence training and optimizing a speech intelligibility predictor. The IBM is used as the training target in previous work due to its simplicity. DNN-based separation is not limited to binary masking, and choosing a suitable training target is clearly important. We study the performance of a number of targets and find that ratio masking can be preferable, and that T-F masking in general outperforms spectral mapping. In addition, we propose a new target that encodes structure into ratio masks. Generalization to noises not seen during training is key to supervised separation. A simple and effective way to improve generalization is to train on multiple noisy conditions. Along this line, we demonstrate that the noise mismatch problem can be well remedied by large-scale training. This important result substantiates the practicality of DNN-based supervised separation. Aside from speech intelligibility, perceptual quality is also important. In the last part of the dissertation, we propose a new DNN architecture that directly reconstructs the time-domain clean speech signal. The resulting system significantly improves objective speech quality over standard mask estimators.
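
To make the supervised masking formulation concrete, here is a hedged PyTorch sketch of a frame-level mask estimator: a feedforward DNN maps acoustic features of noisy speech to a ratio mask and is trained against the ideal mask with a mean-squared-error loss. The feature and mask dimensions, network depth, and optimizer settings are assumptions for illustration and do not reproduce the dissertation's systems.

```python
# Sketch of supervised mask estimation: noisy-speech features -> T-F mask.
import torch
import torch.nn as nn

feat_dim, mask_dim = 246, 64   # assumed per-frame feature and mask sizes
net = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, mask_dim), nn.Sigmoid())  # mask values constrained to [0, 1]
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(noisy_features, ideal_mask):
    # One supervised update toward the ideal mask target.
    opt.zero_grad()
    loss = loss_fn(net(noisy_features), ideal_mask)
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch: 32 frames of features and their ideal ratio mask targets.
loss = train_step(torch.randn(32, feat_dim), torch.rand(32, mask_dim))
```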

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality

Han, Kun. Supervised Speech Separation and Processing
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
In real-world environments, speech often occurs simultaneously with acoustic interference, such as background noise or reverberation. The interference usually has adverse effects on speech perception and results in performance degradation in many speech applications, including automatic speech recognition and speaker identification. Monaural speech separation and processing aim to separate or analyze speech from interference based on only one recording. Although significant progress has been made on this problem, it remains a widely recognized challenge. Unlike traditional signal processing, this dissertation addresses the speech separation and processing problems using machine learning techniques. We first propose a classification approach to estimate the ideal binary mask (IBM), which is considered a main goal of sound separation in computational auditory scene analysis (CASA). We employ support vector machines (SVMs) to classify time-frequency (T-F) units as either target-dominant or interference-dominant. A rethresholding method is incorporated to improve classification results and maximize the hit minus false-alarm rate. Systematic evaluations show that the proposed approach produces accurate estimated IBMs. In a supervised learning framework, the issue of generalization to conditions different from those seen in training is very important. We then present methods that require only a small training corpus and can generalize to unseen conditions. The system utilizes SVMs to learn classification cues and then employs a rethresholding technique to estimate the IBM. A distribution fitting method is introduced to generalize to unseen signal-to-noise ratio conditions, and voice activity detection based adaptation is used to generalize to unseen noise conditions. In addition, we propose a novel metric learning method to learn invariant speech features in the kernel space. The learned features encode speech-related information and can generalize to unseen noise conditions. Experiments show that the proposed approaches produce high-quality IBM estimates under unseen conditions. Besides background noise, room reverberation is another major source of signal degradation in real environments. Reverberation, when combined with background noise, is particularly disruptive for speech perception and many applications. We perform dereverberation and denoising using supervised learning. A deep neural network (DNN) is trained to directly learn a spectral mapping from the spectrogram of corrupted speech to that of clean speech. The spectral mapping approach substantially attenuates the distortion caused by reverberation and background noise, leading to improvements in predicted speech intelligibility and quality scores, as well as in speech recognition rates. Pitch is one of the most important characteristics of speech signals. Although pitch tracking has been studied for decades, it is still challenging to estimate pitch from speech in the presence of strong noise. We estimate pitch using supervised learning, where probabilistic pitch states are directly learned from noisy speech data. We investigate two alternative neural networks for modeling the pitch state distribution given the observations: a feedforward DNN and a recurrent neural network (RNN). Both produce accurate probabilistic outputs of pitch states, which are then connected into pitch contours by Viterbi decoding. Experiments show that the proposed algorithms are robust to different noise conditions.
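
The final step of the pitch-tracking approach described above, connecting frame-level probabilistic pitch states into contours with Viterbi decoding, can be sketched compactly. In the numpy example below, the DNN/RNN posteriors are simulated with random values and the transition model is a simple Gaussian penalty on pitch-state jumps; both the number of pitch states and the transition width are assumptions, not the dissertation's settings.

```python
# Viterbi decoding over frame-level pitch-state posteriors (assumed transition model).
import numpy as np

def viterbi_pitch_track(state_posteriors, jump_std=2.0):
    """state_posteriors: (frames, states) probabilities from a DNN/RNN.
    Returns the most likely pitch-state sequence."""
    T, S = state_posteriors.shape
    states = np.arange(S)
    # Log transition scores: Gaussian penalty on pitch-state jumps between frames.
    log_trans = -0.5 * ((states[:, None] - states[None, :]) / jump_std) ** 2
    log_obs = np.log(state_posteriors + 1e-12)

    delta = log_obs[0].copy()
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (previous state, current state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy posteriors: 50 frames over 68 pitch states (e.g., quantized F0 bins).
post = np.random.dirichlet(np.ones(68), size=50)
contour = viterbi_pitch_track(post)
```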

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member)

Subjects:

Computer Science

Keywords:

Supervised learning; Speech separation; Speech processing; Machine learning; Deep Learning; Pitch estimation; Speech Dereverberation; Deep neural networks; Support vector machines

Narayanan, Arun. Computational auditory scene analysis and robust automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic speech recognition (ASR) has made great strides over the last decade, producing acceptable performance in relatively 'clean' conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improving robustness is to perform speech separation before doing ASR. Most current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation. For example, in auditory perception, speech schemas are known to help improve segregation. An underlying theme of this dissertation is the advocacy of a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation. CASA is largely motivated by the principles that guide human auditory 'scene analysis'. An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech-dominated and noise-dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin. We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows a broad agreement with human performance, which is rather surprising. Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs recognition by directly classifying binary masks corresponding to words and phonemes. The method is evaluated on an isolated digit recognition task and a phone classification task. Despite the dramatic reduction of speech information encoded in a binary mask compared to a typical ASR feature front-end, the proposed system performs surprisingly well. The second approach is a novel framework that performs speech separation and ASR in a unified fashion. Separation is performed via masking using an estimated IBM, and ASR is performed using standard cepstral features. Most systems perform these tasks sequentially: separation followed by recognition. The proposed framework, which we call the bidirectional speech decoder, unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On a medium-large vocabulary speech recognition task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM. Supervised classification-based speech separation has shown a lot of promise recently. We perform an in-depth evaluation of such techniques as a front-end for noise-robust ASR. Comparing the performance of supervised binary and ratio mask estimators, we observe that ratio masking significantly outperforms binary masking when it comes to ASR. Consequently, we propose a separation front-end that consists of two stages. The first stage removes additive noise via ratio time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: a non-linear function is learned that maps the masked spectral features to their clean counterparts. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks (DNNs) and hidden Markov models. Results show that dFDLR consistently improves performance in all test conditions. We explore alternative ways of using the output of speech separation to improve ASR performance with DNN-based acoustic models. Apart from its use as a front-end, we propose using speech separation to provide smooth estimates of speech and noise, which are then passed to the recognizer as additional features. Finally, we develop a unified framework that jointly improves separation and ASR under a supervised learning framework. Our systems obtain state-of-the-art results on two widely used medium-large vocabulary noisy ASR corpora: Aurora-4 and CHiME-2.
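
A minimal sketch of the two-stage separation front-end described above, assuming frame-level spectral features: stage one estimates a ratio mask and applies it to the noisy features, and stage two learns a non-linear mapping from the masked features to clean-like features before they are handed to the ASR acoustic model. Feature dimensionality, network sizes, and the specific feature type are illustrative assumptions, not the dissertation's configuration.

```python
# Two-stage front-end sketch: ratio masking followed by a learned clean-feature mapping.
import torch
import torch.nn as nn

class TwoStageFrontEnd(nn.Module):
    def __init__(self, feat_dim=40, hidden=512):
        super().__init__()
        # Stage 1: ratio mask estimator (values in [0, 1]).
        self.mask_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.Sigmoid())
        # Stage 2: mapping from masked features to clean-like features,
        # intended to absorb channel mismatch and masking artifacts.
        self.map_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, noisy_feats):
        masked = noisy_feats * self.mask_net(noisy_feats)
        return self.map_net(masked)  # features passed on to the ASR acoustic model

frontend = TwoStageFrontEnd()
clean_like = frontend(torch.randn(16, 40))  # 16 hypothetical frames of 40-d features
```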

Committee:

DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Automatic speech recognition; noise robustness; computational auditory scene analysis; binary masking; ratio masking; mask estimation; deep neural networks; acoustic modeling; speech separation; speech enhancement; noisy ASR; CHiME-2; Aurora-4