Search Results

(Total results 13)

  • 1. Hu, Ke Speech Segregation in Background Noise and Competing Speech

    Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

    In real-world listening environments, speech reaching our ears is often accompanied by acoustic interference such as environmental sounds, music, or another voice. Noise distorts speech and poses a substantial difficulty for many applications, including hearing aid design and automatic speech recognition. Monaural speech segregation refers to the problem of separating speech based on only one recording and is widely regarded as a challenging problem. In recent decades, significant progress has been made on this problem, but the challenge remains. This dissertation addresses monaural speech segregation from different types of interference. First, we study the problem of unvoiced speech segregation, which is less studied than voiced speech segregation, probably due to its difficulty. We propose to utilize segregated voiced speech to assist unvoiced speech segregation. Specifically, we remove all periodic signals, including voiced speech, from the noisy input and then estimate noise energy in unvoiced intervals using noise-dominant time-frequency units in neighboring voiced intervals. The estimated interference is used by a subtraction stage to extract unvoiced segments, which are then grouped by either simple thresholding or classification. We demonstrate that the proposed system performs substantially better than speech enhancement methods. Interference can be nonspeech signals or other voices. Cochannel speech refers to a mixture of two speech signals. Cochannel speech separation is often addressed by model-based methods, which assume known speaker identities and pretrained speaker models. To address this speaker-dependency limitation, we propose an unsupervised approach to cochannel speech separation. We employ a tandem algorithm to perform simultaneous grouping of speech and develop an unsupervised clustering method to group simultaneous streams across time. The proposed objective function for clustering measures the speaker difference of each hypothesized grouping and incorporates pitch (open full item for complete abstract)

    Committee: DeLiang Wang (Committee Chair); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member) Subjects: Computer Science
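    The subtraction stage described in this abstract can be illustrated with a minimal sketch: given a mixture magnitude spectrogram and a noise magnitude estimate obtained from neighboring voiced intervals, the noise is subtracted and the residual is thresholded into candidate unvoiced units. The function names, spectral floor, and threshold value below are illustrative assumptions, not details from the dissertation.

```python
import numpy as np

def subtract_noise(mixture_mag, noise_mag_est, floor=0.002):
    """Minimal spectral-subtraction sketch: remove an estimated noise magnitude
    from the mixture magnitude spectrogram and floor the result.
    mixture_mag, noise_mag_est: (freq, time) nonnegative arrays."""
    residual = mixture_mag - noise_mag_est
    return np.maximum(residual, floor * mixture_mag)

def label_unvoiced_units(residual_mag, mixture_mag, theta=0.5):
    """Simple thresholding (one of the two grouping options mentioned):
    keep T-F units whose residual energy dominates the mixture."""
    ratio = residual_mag**2 / np.maximum(mixture_mag**2, 1e-12)
    return ratio > theta   # boolean mask of candidate unvoiced units
```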
  • 2. Lakandri, Abishek Exploring GANs With Conv-TasNet: Adversarial Training for Speech Separation

    Master of Science (MS), Ohio University, 2024, Computer Science (Engineering and Technology)

    Generative Adversarial Networks (GANs) were initially developed for computer vision tasks and have shown impressive capabilities in enhancing the performance of various AI solutions. In the field of speech signal processing, numerous studies have utilized GAN models for speech recognition and enhancement. However, the effect of GANs on speech separation has not been extensively explored. This thesis proposes a comprehensive framework to study and compare the efficacy and efficiency of various GAN variants for the task of speech separation. Within our framework, we employ Conv-TasNet as the common generator and propose deep Convolutional Neural Networks as the discriminator in different GAN setups. The GAN models are designed with diverse objective functions, including the original GAN, Least Squares GAN (LSGAN), and MetricGAN. Results from the experiments demonstrate the improved performance of our GAN-enhanced Conv-TasNet models. Among all the models evaluated, MetricGAN appears to be the most effective variant, demonstrating significant scale-invariant signal-to-noise ratio improvement (SI-SNRi) over the baseline model.

    Committee: Jundong Liu (Advisor); Li Xu (Committee Member); Avinash Karanth (Committee Member); Lonnie Welch (Committee Member) Subjects: Computer Science
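    For reference, the SI-SNRi figure reported above is conventionally computed as the scale-invariant SNR of the separated signal minus that of the unprocessed mixture. The sketch below follows the standard definition; it is not code from the thesis.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between an estimated and a reference signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) /
                         (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mixture, ref):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mixture, ref)
```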
  • 3. Sun, Tao Time-domain Deep Neural Networks for Speech Separation

    Doctor of Philosophy (PhD), Ohio University, 2022, Electrical Engineering & Computer Science (Engineering and Technology)

    Speech separation separates the speech of interest from background noise (speech enhancement) or interfering speech (speaker separation). While the human auditory system has extraordinary speech separation capabilities, designing artificial models with similar functions has proven to be very challenging. Recently, waveform-domain deep neural networks (DNNs) have become the dominant approach for speech separation, with great success. Improving speech quality and intelligibility is a primary goal of speech separation. Integrating human speech elements into waveform DNNs has proven to be a simple yet effective strategy to boost the objective performance (including speech quality and intelligibility) of speech separation models. In this dissertation, three solutions are proposed to integrate human speech elements into waveform speech separation solutions in an effective manner. First, we propose a knowledge-assisted framework that integrates pretrained self-supervised speech representations to boost the performance of speech enhancement networks. To enhance output intelligibility, we design auxiliary perceptual loss functions that rely on speech representations pretrained on large datasets, to ensure the denoised network outputs sound like clean human speech. Our second solution is for speaker separation, where we design a speaker-conditioned model that adopts a pretrained speaker identification model to generate speaker embeddings with rich speech information. Our third solution takes a different approach to improving speaker separation. To suppress information of non-target speakers in auxiliary-loss based solutions, we introduce a loss function that maximizes the distance between speech representations of the separated speech and clean speech from non-target speakers. In this dissertation, we also address a practical issue in frame-based DNN speech enhancement solutions: frame stitching, where the input context observed by a network is often limited, resulting (open full item for complete abstract)

    Committee: Jundong Liu (Advisor); Razvan Bunescu (Committee Member); Li Xu (Committee Member); Avinash Karanth (Committee Member); Martin J. Mohlenkamp (Committee Member); Jeffrey Dill (Committee Member) Subjects: Computer Science
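    The auxiliary perceptual loss mentioned above can be sketched, under assumptions, as a waveform-level term plus a distance between pretrained representations of the clean and denoised signals. Here `pretrained_encoder` is a hypothetical placeholder for a frozen self-supervised model and the weighting is illustrative; the dissertation's actual loss formulation may differ.

```python
import numpy as np

def perceptual_loss(denoised, clean, pretrained_encoder, weight=1.0):
    """Sketch of an auxiliary perceptual loss: a waveform-level term plus a
    distance between pretrained representations of clean and denoised speech.
    `pretrained_encoder` is a placeholder for a frozen pretrained model that
    maps a waveform to a (frames, dims) feature array."""
    signal_term = np.mean((denoised - clean) ** 2)
    rep_denoised = pretrained_encoder(denoised)
    rep_clean = pretrained_encoder(clean)
    perceptual_term = np.mean((rep_denoised - rep_clean) ** 2)
    return signal_term + weight * perceptual_term
```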
  • 4. Tan, Ke Convolutional and recurrent neural networks for real-time speech separation in the complex domain

    Doctor of Philosophy, The Ohio State University, 2021, Computer Science and Engineering

    Speech signals are usually distorted by acoustic interference in daily listening environments. Such distortions severely degrade speech intelligibility and quality for human listeners, and make many speech-related tasks, such as automatic speech recognition and speaker identification, very difficult. The use of deep learning has led to tremendous advances in speech enhancement over the last decade. It has been increasingly important to develop deep learning based real-time speech enhancement systems due to the prevalence of many modern smart devices that require real-time processing. The objective of this dissertation is to develop real-time speech enhancement algorithms to improve intelligibility and quality of noisy speech. Our study starts by developing a strong convolutional neural network (CNN) for monaural speech enhancement. The key idea is to systematically aggregate temporal contexts through dilated convolutions, which significantly expand receptive fields. Our experimental results suggest that the proposed model consistently outperforms a feedforward deep neural network (DNN), a unidirectional long short-term memory (LSTM) model and a bidirectional LSTM model in terms of objective speech intelligibility and quality metrics. Although significant progress has been made on deep learning based speech enhancement, most existing studies only exploit magnitude-domain information and enhance the magnitude spectra. We propose to perform complex spectral mapping with a gated convolutional recurrent network (GCRN). Such an approach simultaneously enhances magnitude and phase of speech. Evaluation results show that the proposed GCRN substantially outperforms an existing CNN for complex spectral mapping. Moreover, the proposed approach yields significantly better results than magnitude spectral mapping and complex ratio masking. Achieving strong enhancement performance typically requires a large DNN, making it difficult to deploy such speech enhancement syst (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member) Subjects: Computer Science; Engineering
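    To make the magnitude-versus-complex distinction above concrete, the sketch below contrasts applying a magnitude mask (noisy phase reused) with applying a complex ratio mask (magnitude and phase both modified). These are standard formulations from the literature; in the dissertation the masks or spectra would be produced by a network such as the GCRN rather than computed from known clean speech.

```python
import numpy as np

def apply_magnitude_mask(noisy_stft, mag_mask):
    """Magnitude-domain enhancement: scale |Y| and reuse the noisy phase."""
    return mag_mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))

def ideal_complex_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    """Complex ratio mask M = S / Y, so that M * Y recovers both magnitude and phase."""
    return clean_stft / (noisy_stft + eps)

def apply_complex_mask(noisy_stft, complex_mask):
    """Complex-domain enhancement: magnitude and phase are modified together."""
    return complex_mask * noisy_stft
```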
  • 5. Wang, Zhong-Qiu Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Microphone arrays are widely deployed in modern speech communication systems. With multiple microphones, spatial information is available in addition to spectral cues to improve speech enhancement, speaker separation and robust automatic speech recognition (ASR) in noisy-reverberant environments. Conventionally, multi-microphone beamforming followed by monaural post-filtering has been the dominant approach to multi-channel speech enhancement. This approach requires accurate estimates of the target direction and of the power spectral density and covariance matrices of speech and noise. Such estimation algorithms usually cannot achieve satisfactory accuracy in noisy and reverberant conditions. Recently, riding on the development of deep neural networks (DNNs), time-frequency (T-F) masking and spectral mapping based approaches have been established as the mainstream methodology for monaural (single-channel) speech separation, including speech enhancement and speaker separation. This dissertation investigates deep learning based microphone array processing and its application to speech separation, localization, and robust ASR. We start our work by exploring various ways of integrating speech enhancement and acoustic modeling for single-channel robust ASR. We propose a training framework that jointly trains enhancement frontends, filterbanks and backend acoustic models. We also apply sequence-discriminative training for sequence modeling and run-time unsupervised adaptation to deal with mismatches between training and testing. One essential aspect of multi-channel processing is sound localization. We utilize deep learning based T-F masking to identify T-F units dominated by the target speaker and use only these T-F units for speaker localization, as they contain much cleaner phases that are informative for localization. This approach dramatically improves the robustness of conventional cross-correlation, beamforming and subspace based approaches to speaker localization in noisy-reverberant (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Robert Agunga (Other) Subjects: Computer Engineering; Computer Science
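    One common way to realize the mask-guided localization idea described above is to weight the cross-power spectrum by a T-F mask before GCC-PHAT. The sketch below is a generic mask-weighted GCC-PHAT time-difference-of-arrival estimator for a two-microphone pair; it is not necessarily the exact estimator used in the dissertation, and the variable names are illustrative.

```python
import numpy as np

def masked_gcc_phat_tdoa(stft_ch1, stft_ch2, mask, fs, n_fft):
    """Estimate the time difference of arrival (TDOA, in seconds) between two
    channels using GCC-PHAT, with the cross-power spectrum weighted by a T-F
    mask that emphasizes target-dominant units.
    stft_ch1, stft_ch2, mask: (n_fft // 2 + 1, frames); mask values in [0, 1]."""
    cross = stft_ch1 * np.conj(stft_ch2)                              # cross-power spectrum
    pooled = np.sum(mask * cross / (np.abs(cross) + 1e-12), axis=1)   # PHAT weight + pooling
    cc = np.fft.irfft(pooled, n=n_fft)                                # lag-domain correlation
    cc = np.roll(cc, n_fft // 2)                                      # center zero lag
    lag = np.argmax(np.abs(cc)) - n_fft // 2
    return lag / fs
```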
  • 6. Chen, Jitong On Generalization of Supervised Speech Separation

    Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering

    Speech is essential for human communication as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. Therefore, we are motivated to develop speech separation algorithms to improve intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades. Speech separation can be achieved by estimating the ideal binary mask (IBM) or ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. As such, speech separation can be treated as a supervised learning problem where one estimates the ideal mask from noisy speech. Three key components of supervised speech separation are learning machines, acoustic features and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses generalization of supervised speech separation. We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation. The list includes ASR features, speaker recognition features and speech separation features. In addition, we propose the Multi-Resolution Cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member) Subjects: Computer Science; Engineering
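    For concreteness, the two training targets named in this abstract can be computed from premixed clean speech and noise magnitudes with the standard formulas below; the local criterion and the exponent are conventional illustrative choices, not values taken from the dissertation.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds a criterion (in dB), else 0."""
    local_snr_db = 20 * np.log10((speech_mag + 1e-12) / (noise_mag + 1e-12))
    return (local_snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: soft gain based on the speech-to-mixture energy ratio in each T-F unit."""
    speech_energy = speech_mag ** 2
    noise_energy = noise_mag ** 2
    return (speech_energy / (speech_energy + noise_energy + 1e-12)) ** beta
```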
  • 7. Wang, Yuxuan Supervised Speech Separation Using Deep Neural Networks

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    Speech is crucial for human communication. However, speech communication for both humans and automatic devices can be negatively impacted by background noise, which is common in real environments. Due to numerous applications, such as hearing prostheses and automatic speech recognition, separation of target speech from sound mixtures is of great importance. Among many techniques, speech separation using a single microphone is most desirable from an application standpoint. The resulting monaural speech separation problem has been a central problem in speech processing for several decades. However, its success has been limited thus far. Time-frequency (T-F) masking is a proven way to suppress background noise. With T-F masking as the computational goal, speech separation reduces to a mask estimation problem, which can be cast as a supervised learning problem. This opens speech separation to a plethora of machine learning techniques. Deep neural networks (DNNs) are particularly suitable for this problem due to their strong representational capacity. This dissertation presents a systematic effort to develop monaural speech separation systems using DNNs. We start by presenting a comparative study on acoustic features for supervised separation. In this relatively early work, we use a support vector machine as the classifier to predict the ideal binary mask (IBM), which is a primary goal in computational auditory scene analysis. We found that traditional speech and speaker recognition features can actually outperform previously used separation features. Furthermore, we present a feature selection method to systematically select complementary features. The resulting feature set is used throughout the dissertation. DNNs have shown success across a range of tasks. We then study IBM estimation using DNNs, and show that it is significantly better than previous systems. Once properly trained, the system generalizes reasonably well to unseen conditions. We demonstrate that our sy (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Eric Healy (Committee Member) Subjects: Computer Science; Engineering
  • 8. Han, Kun Supervised Speech Separation And Processing

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    In real-world environments, speech often occurs simultaneously with acoustic interference, such as background noise or reverberation. The interference usually leads to adverse effects on speech perception, and results in performance degradation in many speech applications, including automatic speech recognition and speaker identification. Monaural speech separation and processing aim to separate or analyze speech from interference based on only one recording. Although significant progress has been made on this problem, it remains widely regarded as a challenging problem. Unlike traditional signal processing, this dissertation addresses the speech separation and processing problems using machine learning techniques. We first propose a classification approach to estimate the ideal binary mask (IBM), which is considered a main goal of sound separation in computational auditory scene analysis (CASA). We employ support vector machines (SVMs) to classify time-frequency (T-F) units as either target-dominant or interference-dominant. A rethresholding method is incorporated to improve classification results and maximize hit minus false-alarm rates. Systematic evaluations show that the proposed approach produces accurate estimated IBMs. In a supervised learning framework, the issue of generalization to conditions different from those in training is very important. We then present methods that require only a small training corpus and can generalize to unseen conditions. The system utilizes SVMs to learn classification cues and then employs a rethresholding technique to estimate the IBM. A distribution fitting method is introduced to generalize to unseen signal-to-noise ratio conditions and voice activity detection based adaptation is used to generalize to unseen noise conditions. In addition, we propose to use a novel metric learning method to learn invariant speech features in the kernel space. The learned features encode speech-related information and can generalize to unseen noise (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member) Subjects: Computer Science
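    The rethresholding step described above maximizes the hit minus false-alarm (HIT-FA) rate. A minimal sketch of that criterion and of a simple threshold sweep over classifier scores follows; the grid of candidate thresholds is an illustrative assumption, not the procedure from the dissertation.

```python
import numpy as np

def hit_minus_fa(pred_mask, ideal_mask):
    """HIT: fraction of target-dominant units correctly kept; FA: fraction of
    interference-dominant units wrongly kept. Both masks are 0/1 arrays."""
    hit = pred_mask[ideal_mask == 1].mean() if np.any(ideal_mask == 1) else 0.0
    fa = pred_mask[ideal_mask == 0].mean() if np.any(ideal_mask == 0) else 0.0
    return hit - fa

def rethreshold(scores, ideal_mask, candidates=np.linspace(-1.0, 1.0, 201)):
    """Pick the decision threshold on classifier scores that maximizes HIT-FA."""
    best = max(candidates,
               key=lambda t: hit_minus_fa((scores > t).astype(float), ideal_mask))
    return best, (scores > best).astype(float)
```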
  • 9. Narayanan, Arun Computational auditory scene analysis and robust automatic speech recognition

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    Automatic speech recognition (ASR) has made great strides over the last decade, producing acceptable performance in relatively "clean" conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improving robustness is to perform speech separation before doing ASR. Most of the current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation. For example, in auditory perception, speech schemas have been known to help improve segregation. An underlying theme of this dissertation is the advocation of a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation. CASA is largely motivated by the principles that guide human auditory "scene analysis". An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech-dominated and noise-dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin. We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs) and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows that there is broad agreement with human performance, which is rather surprising. Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member) Subjects: Computer Science; Engineering
  • 10. Medaramitta, Raveendra Evaluating the Performance of Using Speaker Diarization for Speech Separation of In-Person Role-Play Dialogues

    Master of Science in Computer Engineering (MSCE), Wright State University, 2021, Computer Engineering

    Development of professional communication skills, such as motivational interviewing, often requires experiential learning through expert instructor-guided role-plays between the trainee and a standard patient/actor. Due to the growing demand for such skills in practice, e.g., for health care providers in the management of mental health challenges, chronic conditions, substance misuse disorders, etc., there is an urgent need to improve the efficacy and scalability of such role-play based experiential learning, which is often bottlenecked by the time-consuming performance assessment process. To address this challenge, WSU is developing ReadMI (Real-time Assessment of Dialogue in Motivational Interviewing), a mobile AI solution that aims to provide automated performance assessment based on automatic speech recognition (ASR) and natural language processing (NLP). The main goal of this thesis research is to investigate current commercially available speaker diarization capabilities and evaluate their performance in separating the speech of the trainee and the standard patient/actor in an in-person role-play training environment, where crosstalk could interfere with the operation and performance of ReadMI. Specifically, this thesis research has: 1) identified the major commercially available speaker diarization systems, such as those from Google, Amazon, IBM, and Rev.ai; 2) designed and implemented corresponding evaluation systems that integrate these commercially available cloud services for operation in in-person role-play training environments; and 3) completed an experimental study that evaluated and compared the performance of the speaker diarization services from Google and Amazon. The main finding of this thesis is that current speaker diarization capabilities alone cannot provide sufficient performance for our particular use case when integrated into ReadMI for operation in in-person role-play training environments. But this thesis research potentially provides a clear baseline reference (open full item for complete abstract)

    Committee: Yong Pei Ph.D. (Committee Co-Chair); Paul J. Hershberger Ph.D. (Committee Co-Chair); Jack S. Jean Ph.D. (Committee Member) Subjects: Computer Engineering; Computer Science
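    A simplified frame-level view of the evaluation problem above: with two speakers, a diarization hypothesis can be scored against a reference by trying both speaker-label mappings and taking the lower error, since a diarization system's speaker indices are arbitrary. The sketch below omits overlap handling, collars, and missed/false-alarm speech that a full diarization error rate computation would include, and it is not the evaluation code used in the thesis.

```python
import numpy as np
from itertools import permutations

def two_speaker_confusion_rate(ref_labels, hyp_labels):
    """Frame-level speaker-confusion rate for a two-party dialogue.
    ref_labels, hyp_labels: integer arrays (0 or 1), one label per frame,
    same length. The hypothesis labeling is permuted to best match the
    reference before counting errors."""
    ref = np.asarray(ref_labels)
    hyp = np.asarray(hyp_labels)
    best_err = 1.0
    for perm in permutations([0, 1]):
        mapped = np.array([perm[h] for h in hyp])
        best_err = min(best_err, float(np.mean(mapped != ref)))
    return best_err
```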
  • 11. Liu, Yuzhou Deep CASA for Robust Pitch Tracking and Speaker Separation

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Speech is the most important means of human communication. In real environments, speech is often corrupted by acoustic interference, including noise, reverberation and competing speakers. Such interference leads to adverse effects on audition, and degrades the performance of speech applications. Inspired by the principles of human auditory scene analysis (ASA), computational auditory scene analysis (CASA) addresses speech separation in two main steps: segmentation and grouping. With noisy speech decomposed into a matrix of time-frequency (T-F) units, segmentation organizes T-F units into segments, each of which corresponds to a contiguous T-F region and is supposed to originate from the same source. Two types of grouping are then performed. Simultaneous grouping aggregates segments overlapping in time into simultaneous streams. In sequential grouping, simultaneous streams are grouped across time into distinct sources. As a traditional speech separation approach, CASA has been successfully applied in various speech-related tasks. In this dissertation, we revisit conventional CASA methods and perform related tasks from a deep learning perspective. As an intrinsic characteristic of speech, pitch serves as a primary cue in many CASA systems. A reliable estimate of pitch is important not only for extracting harmonic patterns at a frame level, but also for streaming voiced speech in sequential grouping. Based on the type of interference, we can divide pitch tracking into two categories: single pitch tracking in noise and multi-pitch tracking. Pitch tracking in noise is challenging as the harmonic structure of speech can be severely contaminated. To recover the missing harmonic patterns, we propose to use long short-term memory (LSTM) recurrent neural networks (RNNs) to model sequential dynamics. Two architectures are investigated. The first one is a conventional LSTM that utilizes recurrent connections to model temporal dynamics. The second one is two-level time-frequency (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Alan Ritter (Committee Member) Subjects: Computer Science; Engineering
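    For context, the frame-level harmonic cue that such pitch trackers improve upon in noise can be illustrated with a conventional autocorrelation pitch estimate. The sketch below is a textbook baseline with an illustrative voicing threshold, not the LSTM-based method proposed in the dissertation.

```python
import numpy as np

def frame_pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate one frame's pitch (Hz) from the autocorrelation peak within a
    plausible lag range; return 0.0 when the frame looks unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # nonnegative lags
    lo, hi = int(fs / fmax), int(fs / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + np.argmax(ac[lo:hi])
    # Simple voicing decision: peak must be a sizable fraction of the energy.
    return fs / lag if ac[lag] > 0.3 * ac[0] else 0.0
```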
  • 12. Delfarah, Masood Deep learning methods for speaker separation in reverberant conditions

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Speech separation refers to the problem of separating target speech from acoustic interference such as background noise, room reverberation and other speakers. An effective solution to this problem can improve the speech intelligibility of human listeners and the performance of speech processing systems. Speaker separation is one kind of speech separation in which the interfering source is also human speech. This dissertation addresses the speaker separation problem in reverberant environments. The goal is to increase the speech intelligibility of hearing-impaired and normal-hearing listeners in those conditions. Speaker separation is traditionally approached using model-based methods such as Gaussian mixture models (GMMs) or hidden Markov models (HMMs). These methods are unable to generalize to challenging cases with unseen speakers or nonstationary noise. We employ supervised learning for the speaker separation problem. The idea is inspired by studies that introduced deep neural networks (DNNs) to speech-nonspeech separation. In this approach, training data is used to learn a mapping function from noisy speech features to an ideal time-frequency (T-F) mask. We start this study by investigating an extensive set of acoustic features extracted in adverse conditions. DNNs are used as the learning machine, and separation performance is evaluated using standard objective speech intelligibility metrics. Separation performance is systematically evaluated in both nonspeech and speech interference, a variety of signal-to-noise ratios (SNRs), reverberation times, and direct-to-reverberant energy ratios. We construct feature combination sets using a sequential floating forward selection algorithm, and combined features outperform individual ones. Next, we address the problem of separating two-talker mixtures in reverberant conditions. We employ recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks t (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member) Subjects: Artificial Intelligence; Computer Engineering; Computer Science
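    The feature combination sets mentioned above are built with a sequential floating forward selection algorithm. The sketch below shows a simplified plain forward selection (without the floating backward step), where `evaluate` is a hypothetical stand-in for the separation-performance score used in the dissertation.

```python
def forward_select(features, evaluate, max_size=4):
    """Greedy forward selection: repeatedly add the feature that most improves
    evaluate(selected_set). `features` is a list of feature names and `evaluate`
    is a callable returning a score (higher is better). A floating variant
    would also try removing previously selected features after each addition."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        score, feat = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break
        selected.append(feat)
        best_score = score
    return selected, best_score
```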
  • 13. Jin, Zhaozhang Monaural Speech Segregation in Reverberant Environments

    Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

    Room reverberation is a major source of signal degradation in real environments. While listeners excel in "hearing out" a target source from sound mixtures in noisy and reverberant conditions, simulating this perceptual ability remains a fundamental challenge. The goal of this dissertation is to build a computational auditory scene analysis (CASA) system that separates target voiced speech from its acoustic background in reverberant environments. A supervised learning approach to pitch-based grouping of reverberant speech is proposed, followed by a robust multipitch tracking algorithm based on a hidden Markov model (HMM) framework. Finally, a monaural CASA system for reverberant speech segregation is designed by combining the supervised learning approach and the multipitch tracker. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. Assuming that the true target pitch is known, our first study leads to a novel supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target-dominant given the observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The model trained using this objective function yields significantly better T-F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers. Multipitch tracki (open full item for complete abstract)

    Committee: DeLiang Wang PhD (Advisor); Eric Fosler-Lussier PhD (Committee Member); Mikhail Belkin PhD (Committee Member) Subjects: Computer Science