
Search Results

(Total results 2)

  • 1. Taherian, Hassan. Multi-channel Conversational Speaker Separation and Diarization

    Doctor of Philosophy, The Ohio State University, 2024, Computer Science and Engineering

    Our daily conversations often occur in acoustic environments filled with background noise, reverberation, and competing speech. In such settings, the performance of speech processing systems drastically declines, as they are typically designed to process clean speech. To address this challenge, speaker separation is employed to segregate speech signals. For real-world applications, speaker separation must be talker-independent to accommodate speakers that are not included in the training data. This dissertation focuses on talker-independent speaker separation in conversational or meeting environments, in single- and multi-microphone scenarios. Conversational speaker separation systems are required to process long audio recordings and handle overlapping speech from a variable number of speakers. Current methods utilize continuous speaker separation (CSS), which divides an audio stream into short, partially overlapped segments of 2-3 seconds, each containing up to two speakers. CSS employs a talker-independent speaker separation model based on deep neural networks (DNNs) to process each segment. Training a talker-independent model requires that each output layer of the DNN be associated with a distinct speaker in the mixture; ambiguity in speaker assignment would lead to conflicting gradients during training. To ensure talker independence, the CSS separation model is trained with permutation invariant training (PIT), which explores all possible output-speaker permutations (an illustrative PIT loss sketch follows these results). Another approach to processing conversational speech combines speaker separation with diarization. Speaker diarization is designed to determine "who spoke when" within an audio stream, and when used in conjunction with speaker separation, it enables the creation of a distinct, clean audio stream for each speaker. This process is closely related to speaker recognition, which seeks to identify "who is speaking." This dissertation begins by investigating the impact of single- and multi-ch (open full item for complete abstract)

    Committee: Donald Williamson (Committee Member); Eric Fosler-Lussier (Committee Member); DeLiang Wang (Advisor). Subjects: Artificial Intelligence; Computer Science; Electrical Engineering
  • 2. Liu, Yuzhou. Deep CASA for Robust Pitch Tracking and Speaker Separation

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Speech is the most important means of human communication. In real environments, speech is often corrupted by acoustic interference, including noise, reverberation, and competing speakers. Such interference leads to adverse effects on audition and degrades the performance of speech applications. Inspired by the principles of human auditory scene analysis (ASA), computational auditory scene analysis (CASA) addresses speech separation in two main steps: segmentation and grouping. With noisy speech decomposed into a matrix of time-frequency (T-F) units, segmentation organizes T-F units into segments, each of which corresponds to a contiguous T-F region and is supposed to originate from the same source. Two types of grouping are then performed. Simultaneous grouping aggregates segments overlapping in time into simultaneous streams. In sequential grouping, simultaneous streams are grouped across time into distinct sources. As a traditional speech separation approach, CASA has been successfully applied in various speech-related tasks. In this dissertation, we revisit conventional CASA methods and perform related tasks from a deep learning perspective. As an intrinsic characteristic of speech, pitch serves as a primary cue in many CASA systems. A reliable estimate of pitch is important not only for extracting harmonic patterns at the frame level, but also for streaming voiced speech in sequential grouping. Based on the type of interference, we can divide pitch tracking into two categories: single pitch tracking in noise and multi-pitch tracking. Pitch tracking in noise is challenging, as the harmonic structure of speech can be severely contaminated. To recover the missing harmonic patterns, we propose to use long short-term memory (LSTM) recurrent neural networks (RNNs) to model sequential dynamics (an illustrative LSTM pitch-tracking skeleton follows these results). Two architectures are investigated. The first one is a conventional LSTM that utilizes recurrent connections to model temporal dynamics. The second one is a two-level time-frequency (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Alan Ritter (Committee Member). Subjects: Computer Science; Engineering
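
The Taherian abstract describes permutation invariant training (PIT), in which every output-speaker permutation is scored and the best-scoring one drives the gradient. Below is a minimal sketch of such a loss, assuming a two-speaker setup, time-domain signals, and mean squared error; the function name, tensor shapes, and error metric are illustrative assumptions, not details taken from the dissertation.

```python
import itertools

import torch


def pit_mse_loss(estimates: torch.Tensor, references: torch.Tensor) -> torch.Tensor:
    """Permutation invariant training (PIT) loss sketch.

    estimates, references: (batch, num_speakers, num_samples).
    Scores every output-speaker permutation with MSE and keeps the best
    one per utterance, so the model may emit speakers in any order.
    """
    num_speakers = estimates.shape[1]
    per_perm_losses = []
    for perm in itertools.permutations(range(num_speakers)):
        permuted = estimates[:, list(perm), :]  # reorder model outputs
        per_perm_losses.append(((permuted - references) ** 2).mean(dim=(1, 2)))
    # Stack to (num_permutations, batch), take the minimum over permutations,
    # then average over the batch.
    return torch.stack(per_perm_losses).min(dim=0).values.mean()


# Example: a batch of 4 two-speaker mixtures, 16000-sample (1 s) clips.
estimates = torch.randn(4, 2, 16000, requires_grad=True)
references = torch.randn(4, 2, 16000)
pit_mse_loss(estimates, references).backward()
```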
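
The Liu abstract proposes LSTM recurrent networks for pitch tracking in noise. The skeleton below sketches one plausible formulation, framing pitch tracking as per-frame classification over quantized pitch states plus an unvoiced class; the feature dimension, number of pitch bins, layer sizes, and bidirectional configuration are assumptions for illustration, not the dissertation's actual architectures.

```python
import torch
import torch.nn as nn


class LSTMPitchTracker(nn.Module):
    """Frame-level pitch tracker sketch: an LSTM reads a sequence of spectral
    feature frames and classifies each frame into one of `num_pitch_bins`
    quantized pitch states plus an extra 'unvoiced' class."""

    def __init__(self, feat_dim: int = 161, hidden_size: int = 256,
                 num_pitch_bins: int = 68):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_pitch_bins + 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        hidden_states, _ = self.lstm(frames)
        return self.classifier(hidden_states)  # per-frame pitch logits


# Example: 8 noisy utterances, 300 frames of 161-dim magnitude spectra.
model = LSTMPitchTracker()
logits = model(torch.randn(8, 300, 161))  # shape: (8, 300, 69)
```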