Search Results (1 - 4 of 4 Results)

Chen, Jitong. On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. Therefore, we are motivated to develop speech separation algorithms to improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades. Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem where one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation. The list includes ASR features, speaker recognition features and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there exists a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.

Besides noise generalization, speaker generalization is critical for many applications where target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: the performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from confusion between target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over DNNs. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs and unseen speakers.
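As a rough illustration of the masking targets described in this abstract, the sketch below computes an IBM and an IRM from the time-frequency energies of clean speech and noise. This is a minimal sketch assuming NumPy magnitude spectrograms are already available; the function name, the local criterion default, and the epsilon constants are illustrative choices, not taken from the dissertation.

import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0):
    """Compute the ideal binary mask (IBM) and ideal ratio mask (IRM)
    from time-frequency (T-F) magnitude spectra of clean speech and noise.

    speech_mag, noise_mag: arrays of shape (freq_bins, frames)
    lc_db: local SNR criterion in dB for the IBM decision (illustrative default).
    """
    eps = 1e-12
    speech_energy = speech_mag ** 2
    noise_energy = noise_mag ** 2

    # IBM: keep T-F units where speech dominates (local SNR above the criterion),
    # discard the rest
    local_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    ibm = (local_snr_db > lc_db).astype(np.float32)

    # IRM: a soft gain per T-F unit that attenuates noise-dominant units
    irm = np.sqrt(speech_energy / (speech_energy + noise_energy + eps))
    return ibm, irm

In supervised separation these masks serve as training targets: a learning machine is trained to predict them from features of the noisy mixture alone, since clean speech and noise are not available at test time.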

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization;

Wang, Yuxuan. Supervised Speech Separation Using Deep Neural Networks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among many techniques, speech separation using a single microphone is the most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades, but its success has been limited thus far. Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly suitable for this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs.

We start by presenting a comparative study of acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), which is a primary goal in computational auditory scene analysis. We find that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation.

DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs, and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions. We demonstrate that our system can improve speech intelligibility for hearing-impaired listeners. Furthermore, by considering the structure in the IBM, we show how to improve IBM estimation by employing sequence training and optimizing a speech intelligibility predictor.

The IBM is used as the training target in previous work due to its simplicity. However, DNN-based separation is not limited to binary masking, and choosing a suitable training target is important. We study the performance of a number of targets and find that ratio masking can be preferable, and that T-F masking in general outperforms spectral mapping. In addition, we propose a new target that encodes structure into ratio masks.

Generalization to noises not seen during training is key to supervised separation. A simple and effective way to improve generalization is to train on multiple noisy conditions. Along this line, we demonstrate that the noise mismatch problem can be well remedied by large-scale training. This important result substantiates the practicability of DNN-based supervised separation.

Aside from speech intelligibility, perceptual quality is also important. In the last part of the dissertation, we propose a new DNN architecture that directly reconstructs the time-domain clean speech signal. The resulting system significantly improves objective speech quality over standard mask estimators.
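To make the T-F masking pipeline described above concrete, the sketch below applies an estimated soft mask to a noisy mixture and resynthesizes a time-domain signal. It is a minimal sketch assuming librosa for the STFT/ISTFT; estimate_mask is a hypothetical stand-in for a trained DNN mask estimator and is not part of the dissertation.

import numpy as np
import librosa

def enhance(noisy, estimate_mask, n_fft=512, hop_length=256):
    """Apply an estimated T-F mask to a noisy mixture and resynthesize speech.

    noisy: 1-D time-domain signal
    estimate_mask: callable mapping a magnitude spectrogram (freq_bins, frames)
                   to a mask of the same shape with values in [0, 1];
                   a placeholder for a trained DNN mask estimator.
    """
    # Complex spectrogram of the mixture
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop_length)

    # Predict a soft mask from the magnitude and apply it to the complex
    # spectrogram, keeping the noisy phase (common in mask-based enhancement)
    mask = estimate_mask(np.abs(spec))
    enhanced_spec = mask * spec

    # Back to the time domain
    return librosa.istft(enhanced_spec, hop_length=hop_length, length=len(noisy))

Spectral mapping, by contrast, would train the network to output the clean magnitude spectrum directly rather than a mask; the abstract reports that masking targets generally work better.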

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality

Anderson, Jill M. Lateralization Effects of Brainstem Responses and Middle Latency Responses to a Complex Tone and Speech Syllable
PhD, University of Cincinnati, 2011, Allied Health Sciences: Communication Sciences and Disorders

Background: Previous human auditory brainstem response (ABR) studies have suggested that the right-ear auditory network preferentially processes a spectrotemporally complex speech syllable, while the left-ear auditory network preferentially processes spectral stimuli devoid of temporal structure. Human cortical studies also suggest lateralization effects for spectral versus temporal stimuli. However, it remains unclear whether the reported brainstem lateralization effects are due to the spectrotemporal content or the higher-order lexical content of the evoking speech stimulus. Also, the lateralization effects observed at the cortical level in late evoked auditory potentials are based upon responses obtained well after the stimulus has arrived at the auditory cortices (~100 ms). Lateralization effects for spectrotemporally complex stimuli are unknown upon first arrival at the auditory cortices, or in the auditory middle latency Pa response, which occurs approximately 30 ms post-stimulus.

Purpose: The purpose of this study was to gain a better understanding of how the human auditory processing system encodes spectrotemporally complex acoustic stimuli from subcortical levels to first arrival at the bilateral cortices.

Research Design: This study is a comparative analysis of both brainstem frequency following responses (FFRs) and cortical auditory middle latency responses (AMLRs) to spectrotemporally complex speech and spectrally complex nonspeech stimuli evoked from right and left ear stimulation in normal hearing adult females.

Study Sample: ABR and AMLR responses elicited by a spectrotemporally complex speech stimulus /da/ and a spectrally complex nonspeech stimulus were obtained in a group of twenty right-handed normal hearing adult females.

Data Collection and Analysis: Electrophysiological brainstem FFRs and AMLRs were recorded using a 40 ms synthesized speech syllable /da/ presented both forwards and backwards in addition to a 40 ms complex tone. Monaural ipsilateral FFRs and AMLRs were obtained with insert earphones at an intensity of 80 dB SPL.

Results: There were no significant differences in the right or left ear evoked FFRs to the complex tone or the speech stimulus played either forwards or backwards. However, the left ear AMLRs to the speech syllable played both in the forwards and backwards mode were significantly earlier than those obtained from the right ear.

Conclusions: The results from this study do not support previous findings of a subcortical right ear advantage (REA) for any portion of the synthetic syllable /da/, and suggest that the subcortical neural network does not process short-duration, spectrotemporally complex acoustic stimuli differently based upon the spectral, temporal or lexical content of the stimulus. However, the AMLR results suggest that the neural mechanisms generating the AMLR Pa response react earlier to the speech syllable, played both forwards and backwards, during left ear stimulation. It may be deduced that the earlier Pa responses to left ear stimulation are due to the prosodic acoustic features of both the forwards and backwards syllable that are absent in the spectrally complex tone.

Committee:

Fawen Zhang, PhD (Committee Chair); James Eliassen, PhD (Committee Member); Robert Keith, PhD (Committee Member); Peter Scheifele, PhD (Committee Member)

Subjects:

Audiology

Keywords:

Speech evoked auditory middle latency responses; Speech evoked auditory brainstem responses; Auditory middle latency responses to a complex tone; Frequency following responses; Lateralization to prosodic acoustic features

Diekema, Emily D. Acoustic Measurements of Clear Speech Cue Fade in Adults with Idiopathic Parkinson Disease
Master of Science (MS), Bowling Green State University, 2016, Communication Disorders
The purpose of this study was to examine the potential fade in the effects of a clear speech cue on selected acoustic features of Parkinsonian speech as participants read a passage. Participants were 12 adults with idiopathic Parkinson disease (mean age = 73 years), reading a passage with the instructions to "Produce the items as clearly as possible, as if I am having trouble hearing or understanding you." The effects of clear speech were measured using speech rate, articulation rate, fundamental frequency, variation in fundamental frequency, the intensity difference between stressed and unstressed syllables, and the intensity change from the beginning of the passage to the end. Results indicated that the clear speech cue broke down early in the reading, as suggested by an increase in speech and articulation rates, a decrease in fundamental frequency standard deviation, and an overall decrease in intensity. There was a negligible decrease in average fundamental frequency, and the intensity difference between the two syllables of "rainbow" was maintained near the beginning and end of the reading. These findings suggest that some prosodic aspects (laryngeal, short-term respiratory) may reflect maintenance of the clear speech cue or general stability, whereas more global aspects of speech over time (long-term articulation, long-term respiratory control) suggest a failure to maintain the clear speech cue, or relatively little response to it.
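As a rough sketch of how two of the measures above could be extracted from a recorded passage reading, the code below estimates mean fundamental frequency with its standard deviation and the overall intensity change from the beginning to the end of the recording. It assumes librosa for pitch tracking and RMS energy; the pitch range, hop size, and one-second analysis windows are illustrative choices, not the procedure used in the thesis.

import numpy as np
import librosa

def f0_and_intensity_change(path, fmin=75.0, fmax=400.0):
    """Estimate mean and SD of fundamental frequency, and the change in RMS
    intensity (dB) between the first and last second of a recording."""
    y, sr = librosa.load(path, sr=None)

    # Fundamental frequency via probabilistic YIN; unvoiced frames come back as NaN
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    voiced = f0[~np.isnan(f0)]
    f0_mean, f0_sd = float(np.mean(voiced)), float(np.std(voiced))

    # Frame-by-frame RMS intensity converted to dB
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    db = 20.0 * np.log10(rms + 1e-12)

    # Mean level of the last second minus the first second
    # (a negative value indicates an overall decrease in intensity)
    frames_per_sec = max(1, sr // hop)
    intensity_change = float(np.mean(db[-frames_per_sec:]) - np.mean(db[:frames_per_sec]))

    return f0_mean, f0_sd, intensity_change

Speech and articulation rates would additionally require syllable counts and pause detection, which are not shown here.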

Committee:

Ronald C. Scherer (Advisor); Alexander M. Goberman (Committee Member); Jason A. Whitfield (Committee Member)

Subjects:

Acoustics; Speech Therapy

Keywords:

cue fade; parkinson; parkinsons disease; acoustic features; speech rate; articulation rate; percent pause time; intensity; rainbow passage; fundamental frequency; clear speech;