Search Results (1 - 6 of 6 Results)

Chen, Jitong. On Generalization of Supervised Speech Separation
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Speech is essential for human communication: it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. We are therefore motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades. Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones; similarly, the IRM adjusts the gain of each T-F unit to suppress noise. Speech separation can thus be treated as a supervised learning problem in which one estimates the ideal mask from noisy speech. The three key components of supervised speech separation are learning machines, acoustic features, and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses the generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low-SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation, including ASR features, speaker recognition features, and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is needed for noise-dependent speech separation: when tested on the same noise type, a learning machine must generalize to unseen noise segments. For nonstationary noises, there is a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques that expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and leads to improved objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model achieves performance similar to that of noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.

Besides noise generalization, speaker generalization is critical for many applications where the target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: the performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from confusion between target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over DNNs. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs, and unseen speakers.
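As a rough illustration of the ideal masks described above (a minimal sketch, not code from the dissertation), the following Python snippet computes an IBM and an IRM from premixed speech and noise. The STFT front end, the sampling rate, and the 0 dB local-SNR criterion are assumptions; the dissertation itself works with gammatone-domain (cochleagram) representations.

```python
# Illustrative sketch: IBM and IRM computed from premixed speech and noise,
# as available during supervised training.
import numpy as np
from scipy.signal import stft

def ideal_masks(speech, noise, fs=16000, nperseg=512, lc_db=0.0):
    """Return (IBM, IRM) over a T-F representation of the mixture."""
    n = min(len(speech), len(noise))
    _, _, S = stft(speech[:n], fs=fs, nperseg=nperseg)   # speech T-F representation
    _, _, N = stft(noise[:n], fs=fs, nperseg=nperseg)    # noise T-F representation
    speech_energy = np.abs(S) ** 2
    noise_energy = np.abs(N) ** 2 + 1e-12                # avoid division by zero
    local_snr_db = 10.0 * np.log10((speech_energy + 1e-12) / noise_energy)
    ibm = (local_snr_db > lc_db).astype(float)           # keep speech-dominant units
    irm = np.sqrt(speech_energy / (speech_energy + noise_energy))  # per-unit gain
    return ibm, irm
```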

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; speech intelligibility; computational auditory scene analysis; mask estimation; supervised learning; deep neural networks; acoustic features; noise generalization; SNR generalization; speaker generalization

Wang, Yuxuan. Supervised Speech Separation Using Deep Neural Networks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among many techniques, speech separation using a single microphone is the most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades, yet its success has been limited thus far. Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly well suited to this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs.

We start by presenting a comparative study of acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), a primary goal in computational auditory scene analysis. We find that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation.

DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions, and we demonstrate that it can improve speech intelligibility for hearing-impaired listeners. Furthermore, by considering the structure in the IBM, we show how to improve IBM estimation by employing sequence training and by optimizing a speech intelligibility predictor.

The IBM is used as the training target in previous work due to its simplicity. DNN-based separation is not limited to binary masking, however, and choosing a suitable training target is clearly important. We study the performance of a number of targets and find that ratio masking can be preferable, and that T-F masking in general outperforms spectral mapping. In addition, we propose a new target that encodes structure into ratio masks.

Generalization to noises not seen during training is key to supervised separation. A simple and effective way to improve generalization is to train on multiple noisy conditions. Along this line, we demonstrate that the noise mismatch problem can be well remedied by large-scale training. This important result substantiates the practicability of DNN-based supervised separation.

Aside from speech intelligibility, perceptual quality is also important. In the last part of the dissertation, we propose a new DNN architecture that directly reconstructs the time-domain clean speech signal. The resulting system significantly improves objective speech quality over standard mask estimators.
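To make the "mask estimation as supervised learning" formulation concrete, here is a minimal sketch of a feedforward mask estimator trained with a mean-squared-error loss on ratio-mask targets. The use of PyTorch, the feature dimension, the layer sizes, and the learning rate are placeholder assumptions, not the configurations used in the dissertation.

```python
# Minimal sketch: a feedforward network mapping noisy-speech features to
# ratio-mask values, trained with MSE. Dimensions are placeholders.
import torch
import torch.nn as nn

feat_dim, n_freq = 246, 257            # assumed feature and mask dimensions
model = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_freq), nn.Sigmoid()   # mask values lie in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features, target_mask):
    """One supervised update: predict the mask from features of noisy speech."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), target_mask)
    loss.backward()
    optimizer.step()
    return loss.item()
```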

Committee:

DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality

Verbsky, Babette L. Effects of Conventional Passive Earmuffs, Uniformly Attenuating Passive Earmuffs, and Hearing Aids on Speech Intelligibility in Noise
Doctor of Philosophy, The Ohio State University, 2002, Speech and Hearing Science
Occupational hearing conservation regulations neither address speech intelligibility in noise for normal-hearing and hearing-impaired workers nor comment on the safety of hearing aid use by hearing-impaired workers. Do certain types of hearing protection devices (HPDs) allow better speech intelligibility than others? Would using hearing aids together with earmuffs provide better speech intelligibility for hearing-impaired workers? Is this method of accommodation safe? To answer these questions, a method for evaluating speech intelligibility with HPDs was developed through a series of pilot tests. The test method allows evaluation of both normal-hearing and hearing-impaired listeners. Speech intelligibility for normal-hearing listeners who wore uniformly attenuating earmuffs was significantly better than for the same listeners wearing conventional earmuffs. Hearing-impaired listeners were tested with each type of earmuff and while wearing their own hearing aids in combination with each earmuff. Unlike the normal-hearing group, the hearing-impaired group did not exhibit better speech intelligibility with the uniformly attenuating earmuffs than with the conventional earmuffs. However, earmuffs worn in combination with hearing aids allowed significantly better speech intelligibility than either earmuff alone. To determine the safety of hearing aid use under earmuffs, a model was developed to predict occupational noise exposure for the aided-protected worker. Data from real-ear measurements with an acoustic mannequin were found to agree with the model predictions.

Committee:

Lawrence Feth (Advisor)

Keywords:

hearing conservation; hearing aids; earmuffs; speech intelligibility in noise

Leopold, Sarah Yoho. Factors Influencing the Prediction of Speech Intelligibility
Doctor of Philosophy, The Ohio State University, 2016, Speech and Hearing Science
The three manuscripts presented here examine the relative importance of various 'critical bands' of speech, as well as their susceptibility to the corrupting influence of background noise. In the first manuscript, band-importance functions derived using a novel technique are compared to the standard functions given by the Speech Intelligibility Index (ANSI, 1997). The functions derived with the novel technique show a complex 'microstructure' not present in previous functions, possibly indicating increased accuracy of the new method. In the second manuscript, this same technique is used to examine the effects of individual talkers and types of speech material on the shape of the band-importance functions. Results indicate a strong effect of speech material but a smaller effect of talker. In addition, the use of ten talkers of different genders appears to greatly diminish any effect of individual talker. In the third manuscript, the susceptibility to noise of individual critical bands of speech was determined by systematically varying the signal-to-noise ratio in each band and finding the signal-to-noise ratio that produced a criterion decrement in intelligibility for each band. Results from this study indicate that noise susceptibility is not equal across bands, as has been assumed. Further, noise susceptibility appears to be independent of the relative importance of each band. Implications for future applications of these data are discussed.
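For context on the band-importance functions discussed above, the Speech Intelligibility Index is, at its core, a weighted sum of per-band audibilities. The sketch below shows that computation with placeholder inputs; it illustrates the general form only, not the derivation procedure used in these manuscripts.

```python
# Sketch of an SII-style prediction: a weighted sum of band audibilities.
# The band-importance weights and audibility values are placeholders.
import numpy as np

def sii_like_index(importance, audibility):
    """importance: band-importance function (nonnegative, sums to 1).
    audibility: per-band audibility in [0, 1], reduced by noise or filtering."""
    importance = np.asarray(importance, dtype=float)
    audibility = np.clip(np.asarray(audibility, dtype=float), 0.0, 1.0)
    return float(np.sum(importance * audibility))
```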

Committee:

Eric Healy, Ph.D. (Advisor); Rachael Frush Holt, Ph.D. (Committee Member); DeLiang Wang, Ph.D. (Committee Member)

Subjects:

Audiology

Keywords:

Speech; Hearing; Intelligibility; Hearing Loss; Speech Intelligibility Index; Articulation Index; Band Importance Functions

Soni, Jasminkumar B. Determining the Effect of Speaker's Gender and Speech Synthesis on Callsign Acquisition Test (CAT) Results
Master of Science in Engineering (MSEgr), Wright State University, 2009, Industrial and Human Factors Engineering
Effective and efficient speech communication is one of the leading factors in the success of battlefield operations. With increasing gender diversity in the military services, it is important to assess the effectiveness of both male and female voices in communication systems. The purpose of this research was to determine the effect of the speaker's voice (male versus female) on speech intelligibility (SI) performance on the Callsign Acquisition Test (CAT). In addition, the effects of synthesized speech were evaluated. The CAT is a new SI test developed for military use. A group of 21 listeners with normal hearing participated in the study. Each participant listened to four different CAT lists (male and female natural recorded speech, and male and female synthetic speech) at two signal-to-noise ratios. White noise was used as the masker, and the speech files were mixed at signal-to-noise ratios of -12 dB and -15 dB: each wordlist was played at 50 dB or 53 dB and mixed with white noise at 65 dB. Each listener completed a total of 8 tests presented in random order. Testing was performed in a sound-treated booth with loudspeakers. Test results demonstrated that male speech yielded higher SI scores than female speech, and natural speech yielded higher scores than synthetic speech. Statistical analysis further showed that female speech, the -15 dB SNR, synthetic speech, and the combination of female speech and synthetic speech all had significant effects on CAT results in the presence of white noise. All tests used a significance level of alpha = 0.5.
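As an illustration of how such mixtures are typically constructed (a sketch under assumptions, not the exact procedure used in this study), the following Python function scales a noise signal so that the mixture reaches a target SNR:

```python
# Sketch: mix a speech signal with masking noise at a target SNR (in dB).
# Level calibration, alignment, and clipping handling are omitted.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = noise[: len(speech)]                  # assume noise is at least as long
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

# e.g. mixtures at the two SNRs used in the study (signals are placeholders):
# mixture_12 = mix_at_snr(speech, white_noise, -12.0)
# mixture_15 = mix_at_snr(speech, white_noise, -15.0)
```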

Committee:

Misty Blue, Ph.D. (Advisor); Yan Liu, Ph.D. (Committee Member); Blair Rowley, Ph.D. (Committee Member)

Subjects:

Acoustics; Biomedical Research; Education; Engineering; Industrial Engineering

Keywords:

Callsign Acquisition Test; Speech Intelligibility

Woodruff, John F. Integrating Monaural and Binaural Cues for Sound Localization and Segregation in Reverberant Environments
Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

The problem of segregating a sound source of interest from an acoustic background has been extensively studied due to applications in hearing prostheses, robust speech/speaker recognition, and audio information retrieval. Computational auditory scene analysis (CASA) approaches the segregation problem by utilizing the grouping cues involved in the perceptual organization of sound by human listeners. Binaural processing, where the input signals resemble those that enter the two ears, is of particular interest in the CASA field. The dominant approach to binaural segregation has been to derive spatially selective filters that enhance the signal in a direction of interest; as such, the problems of sound localization and sound segregation are closely tied. While spatial filtering has been widely utilized, substantial performance degradation is incurred in reverberant environments and, more fundamentally, segregation cannot be performed without sufficient spatial separation between sources.

This dissertation addresses the problems of binaural localization and segregation in reverberant environments by integrating monaural and binaural cues. Motivated by research in psychoacoustics and by developments in monaural CASA processing, we first develop a probabilistic framework for joint localization and segregation of voiced speech. Pitch cues are used to group sound components across frequency over continuous time intervals. Time-frequency regions resulting from this partial organization are then localized by integrating binaural cues, which enhances robustness to reverberation, and grouped across time based on the estimated locations. We demonstrate that this approach outperforms voiced segregation based on either monaural or binaural analysis alone. We also demonstrate substantial gains in multisource localization performance, particularly for distant sources in reverberant environments and at low signal-to-noise ratios. We then develop a binaural system for joint localization and segregation of an unknown and time-varying number of sources that is more flexible and requires less prior information than our initial system. This framework incorporates models trained jointly on pitch and azimuth cues, which improves performance and naturally handles both voiced and unvoiced speech. Experimental results show that the proposed approach outperforms existing two-microphone systems despite using less prior information.
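As a generic illustration of a binaural cue of the kind integrated above (not the probabilistic framework developed in the dissertation), the sketch below estimates an interaural time difference from left- and right-ear signals using GCC-PHAT; the maximum-lag limit and the FFT-based implementation are assumptions.

```python
# Generic illustration of a binaural cue: estimate the interaural time
# difference (ITD) between left and right signals via GCC-PHAT.
import numpy as np

def itd_gcc_phat(left, right, fs, max_itd_s=1e-3):
    n = len(left) + len(right)
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_lag = int(max_itd_s * fs)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # lags in [-max, +max]
    lag = int(np.argmax(np.abs(cc))) - max_lag
    return lag / fs                           # ITD in seconds
```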

We also consider how the computational goal of CASA-based segregation should be defined in reverberant environments. The ideal binary mask (IBM) has been established as a main goal of CASA. While the IBM is defined unambiguously in anechoic conditions, in reverberant environments there is some flexibility in how one defines the target signal itself, which introduces ambiguity into the notion of the IBM. Motivated by the perceptual distinction between early and late reflections, we introduce the reflection boundary as a parameter of the IBM definition, allowing target reflections to be divided into desirable and undesirable components. We conduct a series of intelligibility tests with normal-hearing listeners to compare alternative IBM definitions. Results show that it is vital for the IBM definition to account for the energetic effect of early target reflections, and that late target reflections should be characterized as noise.
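The reflection-boundary idea can be sketched as follows, under stated assumptions: the room impulse response is split at the boundary, the early portion of the reverberated target is treated as signal, and the late portion is lumped with interference when computing the binary mask. The 50 ms default boundary, the STFT front end, and the 0 dB criterion are placeholders, not the parameter values evaluated in the dissertation.

```python
# Sketch of an IBM that treats early target reflections as signal and late
# reflections plus interference as noise, via a reflection-boundary split of
# the room impulse response (RIR). Filterbank details are omitted.
import numpy as np
from scipy.signal import fftconvolve, stft

def reflection_boundary_ibm(dry_target, rir, interference, fs,
                            boundary_ms=50.0, lc_db=0.0):
    rir = np.asarray(rir, dtype=float)
    split = int(fs * boundary_ms / 1000.0)
    rir_early, rir_late = rir.copy(), rir.copy()
    rir_early[split:] = 0.0                   # direct sound + early reflections
    rir_late[:split] = 0.0                    # late reverberation
    early = fftconvolve(dry_target, rir_early)
    late = fftconvolve(dry_target, rir_late)
    n = min(len(early), len(interference))
    _, _, E = stft(early[:n], fs=fs)
    _, _, U = stft(late[:n] + interference[:n], fs=fs)   # undesired energy
    snr_db = 10.0 * np.log10((np.abs(E) ** 2 + 1e-12) / (np.abs(U) ** 2 + 1e-12))
    return (snr_db > lc_db).astype(float)
```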

Committee:

DeLiang Wang, PhD (Advisor); Mikhail Belkin, PhD (Committee Member); Eric Fosler-Lussier, PhD (Committee Member); Nicoleta Roman, PhD (Committee Member)

Subjects:

Acoustics; Artificial Intelligence; Computer Science; Electrical Engineering

Keywords:

computational auditory scene analysis; speech segregation; sound localization; binaural; monaural; ideal binary masking; speech intelligibility