Time-domain Deep Neural Networks for Speech Separation


2022, Doctor of Philosophy (PhD), Ohio University, Electrical Engineering & Computer Science (Engineering and Technology).
Speech separation separates the speech of interest from background noise (speech enhancement) or from interfering speech (speaker separation). While the human auditory system has extraordinary speech separation capabilities, designing artificial models with similar functions has proven very challenging. Recently, waveform-based deep neural networks (DNNs) have become the dominant approach to speech separation, with great success. Improving speech quality and intelligibility is a primary goal of speech separation tasks. Integrating human speech elements into waveform DNNs has proven to be a simple yet effective strategy for boosting the objective performance (including speech quality and intelligibility) of speech separation models. In this dissertation, three solutions are proposed to integrate human speech elements into waveform speech separation solutions in an effective manner. First, we propose a knowledge-assisted framework that integrates pretrained self-supervised speech representations to boost the performance of speech enhancement networks. To enhance output intelligibility, we design auxiliary perceptual loss functions that rely on speech representations pretrained on large datasets, ensuring that the denoised network outputs sound like clean human speech. Our second solution targets speaker separation: we design a speaker-conditioned model that adopts a pretrained speaker identification model to generate speaker embeddings with rich speech information. Our third solution takes a different approach to improving speaker separation. To suppress information from non-target speakers in auxiliary-loss-based solutions, we introduce a loss function that maximizes the distance between the speech representations of separated speech and the clean speech of non-target speakers.
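As a rough illustration of the auxiliary perceptual loss idea described above, the sketch below compares a network output and its clean reference in the feature space of a pretrained encoder. The `feature_extractor` here is a toy stand-in (a fixed random projection of overlapping frames), and its frame size and hop are illustrative assumptions, not the pretrained self-supervised models used in the dissertation.

```python
import numpy as np

def feature_extractor(wave, dim=8):
    """Toy stand-in for a frozen pretrained speech encoder:
    a fixed random projection of overlapping 160-sample frames.
    Purely illustrative -- not an actual self-supervised model."""
    rng = np.random.default_rng(0)  # fixed "pretrained" weights
    W = rng.standard_normal((dim, 160))
    frames = np.lib.stride_tricks.sliding_window_view(wave, 160)[::80]
    return frames @ W.T  # (n_frames, dim) representations

def perceptual_loss(enhanced, clean):
    """Auxiliary loss: distance between the representations of the
    network output and the clean reference, encouraging outputs
    that 'sound like' clean speech in the encoder's feature space."""
    fe, fc = feature_extractor(enhanced), feature_extractor(clean)
    return float(np.mean((fe - fc) ** 2))

# Usage: the loss vanishes for a perfect output and grows with noise.
clean = np.sin(0.01 * np.arange(1600))
noisy = clean + 0.3 * np.random.default_rng(1).standard_normal(1600)
print(perceptual_loss(clean, clean), perceptual_loss(noisy, clean))
```

In practice such a loss is added to a waveform-domain training objective and the encoder is kept frozen, so gradients shape the enhancement network rather than the representation.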
In this dissertation, we also address a practical issue in frame-based DNN speech enhancement solutions: frame stitching. Because the input context a network can observe is often limited, boundary discontinuities arise in the network outputs. We use a recurrent neural network (RNN) to connect depthwise fully convolutional networks (FCNs), allowing temporal information to propagate along the networks across individual frames. Our FCN + RNN model demonstrates an excellent smoothing effect on short frames, enabling speech enhancement systems with very short delays.
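The FCN + RNN idea can be caricatured as follows: each frame is processed by a shared per-frame network, while a recurrent hidden state carries temporal context across frame boundaries, smoothing the stitched output. The `rnn_stitch` function below, with its toy weights and dimensions, is an illustrative assumption, not the dissertation's actual architecture.

```python
import numpy as np

def rnn_stitch(frames, hidden_dim=4):
    """Toy sketch of frame-wise processing with a recurrent state:
    frames are handled one at a time (low latency), but the hidden
    state h propagates temporal information across frame boundaries,
    mitigating stitching discontinuities."""
    rng = np.random.default_rng(0)  # fixed toy weights
    n_samples = frames.shape[1]
    Wx = 0.1 * rng.standard_normal((hidden_dim, n_samples))
    Wh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
    Wo = 0.1 * rng.standard_normal((n_samples, hidden_dim))
    h = np.zeros(hidden_dim)
    out = []
    for x in frames:                  # one short frame at a time
        h = np.tanh(Wx @ x + Wh @ h)  # recurrent state links frames
        out.append(x + Wo @ h)        # residual per-frame output
    return np.stack(out)

# Usage: output has the same frame layout as the input,
# and each output frame depends on earlier frames via the state.
frames = np.random.default_rng(2).standard_normal((5, 16))
print(rnn_stitch(frames).shape)
```

The key design point is that the per-frame network stays causal and cheap, while the recurrent connection supplies the cross-frame context a single short frame cannot provide.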
Jundong Liu (Advisor)
Razvan Bunescu (Committee Member)
Li Xu (Committee Member)
Avinash Karanth (Committee Member)
Martin J. Mohlenkamp (Committee Member)
Jeffrey Dill (Committee Member)
101 p.

Recommended Citations


  • Sun, T. (2022). Time-domain Deep Neural Networks for Speech Separation [Doctoral dissertation, Ohio University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1647344440927022

    APA Style (7th edition)

  • Sun, Tao. Time-domain Deep Neural Networks for Speech Separation. 2022. Ohio University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1647344440927022.

    MLA Style (8th edition)

  • Sun, Tao. "Time-domain Deep Neural Networks for Speech Separation." PhD diss., Ohio University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1647344440927022

    Chicago Manual of Style (17th edition)