Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering
Speech is the most important means of human communication. In real environments, speech is often corrupted by acoustic interference, including noise, reverberation and competing speakers. Such interference adversely affects audition and degrades the performance of speech applications. Inspired by the principles of human auditory scene analysis (ASA), computational auditory scene analysis (CASA) addresses speech separation in two main steps: segmentation and grouping. With noisy speech decomposed into a matrix of time-frequency (T-F) units, segmentation organizes T-F units into segments, each of which corresponds to a contiguous T-F region and is supposed to originate from the same source. Two types of grouping are then performed. Simultaneous grouping aggregates segments that overlap in time into simultaneous streams. In sequential grouping, simultaneous streams are grouped across time into distinct sources. As a traditional speech separation approach, CASA has been successfully applied in various speech-related tasks. In this dissertation, we revisit conventional CASA methods, and perform related tasks from a deep learning perspective.
As an intrinsic characteristic of speech, pitch serves as a primary cue in many CASA systems. A reliable estimate of pitch is important not only for extracting harmonic patterns at a frame level, but also for streaming voiced speech in sequential grouping. Based on the type of interference, we can divide pitch tracking into two categories: single pitch tracking in noise and multi-pitch tracking.
Pitch tracking in noise is challenging as the harmonic structure of speech can be severely contaminated. To recover the missing harmonic patterns, we propose to use long short-term memory (LSTM) recurrent neural networks (RNNs) to model sequential dynamics. Two architectures are investigated. The first is a conventional LSTM that utilizes recurrent connections to model temporal dynamics. The second one is two-level time-frequency ...
Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Alan Ritter (Committee Member)
Subjects: Computer Science; Engineering