Search Results

(Total results 518)
  • 1. Johnson, Eric Improving Speech Intelligibility Without Sacrificing Environmental Sound Recognition

    Doctor of Philosophy, The Ohio State University, 2022, Speech and Hearing Science

    The three manuscripts presented here examine concepts related to speech perception in noise and ways to overcome poor speech intelligibility without depriving listeners of environmental sound recognition. Because of hearing-impaired (HI) listeners' auditory deficits, there is a substantial need for speech-enhancement (noise reduction) technology. Recent advancements in deep learning have resulted in algorithms that significantly improve the intelligibility of speech in noise, but in order to be suitable for real-world applications such as hearing aids and cochlear implants, these algorithms must be causal, talker independent, corpus independent, and noise independent. Manuscript 1 involves human-subjects testing of a novel, time-domain-based algorithm that fulfills these fundamental requirements. Algorithm processing resulted in significant intelligibility improvements for both HI and normal-hearing (NH) listener groups in each signal-to-noise ratio (SNR) and noise type tested. In Manuscript 2, the range of speech-to-background ratios (SBRs) over which NH and HI listeners can accurately perform both speech and environmental recognition was determined. Separate groups of NH listeners were tested in conditions of selective and divided attention. A single group of HI listeners was tested in the divided attention experiment. Psychometric functions were generated for each listener group and task type. It was found that both NH and HI listeners are capable of high speech intelligibility and high environmental sound recognition over a range of speech-to-background ratios. The range and location of optimal speech-to-background ratios differed across NH and HI listeners. The optimal speech-to-background ratio also depended on the type of environmental sound present. 
Conventional deep-learning algorithms for speech enhancement target maximum intelligibility by removing as much noise as possible while maintaining the essential characteristics of the target speech signal (open full item for complete abstract)

    Committee: Eric Healy (Advisor); Rachael Holt (Committee Member); DeLiang Wang (Committee Member) Subjects: Acoustics; Artificial Intelligence; Audiology; Behavioral Sciences; Communication; Computer Engineering; Health Sciences
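The SNR conditions described in the abstract above are conventionally created by scaling the noise relative to the speech before mixing. A minimal NumPy sketch of that step (the function name and the random stand-in signals are illustrative, not from the dissertation):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1 s speech signal at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Mixtures prepared this way are what a causal, talker- and noise-independent enhancement algorithm would then process.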
  • 2. Tahamtan, Mahdi The Aerodynamic, Glottographic, and Acoustic Effects of Clear Speech.

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2022, Communication Disorders

    This dissertation investigated aerodynamic, glottographic, and acoustic differences between habitual and clear speech. Nine normal-speaking individuals (five cis female, four cis male) were asked to read six short sentences in four reading conditions: habitual reading, habitual reading while holding a mask to the face to capture airflow and oral air pressure, clear reading, and clear reading while holding the mask to the face. Mask-off conditions in both habitual and clear reading manners were used for acoustic analyses, and mask-on conditions were used for aerodynamic and glottographic analyses. The instruction for eliciting habitual speech was “Read each sentence as if you are talking with a friend across the table.” The instruction for eliciting clear speech was “Read the sentences as clearly as possible by enunciating well, as if someone is having trouble understanding you.” Acoustic and time-related results indicated that from habitual to clear speech: (1) sentence duration increased, (2) speaking rate decreased, (3) duration of stressed vowels and unvoiced fricatives increased, (4) voice onset time increased for some unvoiced plosives, (5) stop gap duration increased, (6) fundamental frequency did not change except for two stressed vowels in female speakers for which fo increased, and (7) intensity of stressed vowels and stop consonants increased, but not for unvoiced fricatives (except for /ʃ/). Aerodynamic results indicated that from habitual to clear speech, there was greater (1) oral air pressure, (2) average airflow, (3) total air volume, and (4) peak flow during the release of the voiceless bilabial stop, suggesting the influence of greater subglottal pressure. In contrast, there was little to no change in glottal dynamics such as EGG width, EGG height, EGG contact and open quotients, and glottal airflow timing measures.
In this study, it might be inferred that clear speech was a phenomenon that is more related to subglottal pressure and oral cavity kinem (open full item for complete abstract)

    Committee: Ronald Scherer Ph.D. (Committee Chair); Steven Boone M.F.A. (Other); Brent Archer Ph.D. (Committee Member); Jason Whitfield Ph.D. (Committee Member) Subjects: Acoustics; Speech Therapy
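Two of the simplest measures reported above, sentence duration and speaking rate, relate directly: rate is syllables per unit time. A small sketch with hypothetical durations (the numbers are invented for illustration, not taken from the study):

```python
def speaking_rate(n_syllables, duration_s):
    """Speaking rate in syllables per second."""
    return n_syllables / duration_s

# Hypothetical readings of the same six-syllable sentence:
habitual_rate = speaking_rate(6, 1.5)   # habitual reading
clear_rate = speaking_rate(6, 2.4)      # clear reading: longer duration, slower rate
```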
  • 3. Wasiuk, Peter The Importance of Glimpsed Audibility for Speech-In-Speech Recognition

    Doctor of Philosophy, Case Western Reserve University, 2022, Communication Sciences

    Purpose: Speech recognition in the presence of competing speech can be challenging, and individuals vary considerably in their ability to accomplish this complex auditory-cognitive task. Speech-in-speech recognition can vary due to factors that are intrinsic to the listener, such as hearing status and cognitive abilities, or due to differences in the short-term audibility of the target speech. The primary goal of the current experiments was to characterize the effects of glimpsed target audibility and intrinsic listener variables on speech-in-speech recognition. Methods: Three experiments were conducted to evaluate the effects of glimpsed target audibility, intrinsic listener variables, and acoustic-perceptual difference cues on speech-in-speech and speech-in-noise recognition. Listeners were young adults (18 to 28 years) with normal hearing. Speech recognition was measured in two stages in each experiment. In Stage 1, speech reception thresholds were measured adaptively to estimate the signal-to-noise ratio (SNR) associated with 50% correct keyword recognition for each listener in each stimulus condition. In Stage 2, keyword recognition was measured at a fixed SNR in each stimulus condition. All participants completed a battery of cognitive measures that assessed central abilities related to masked-speech recognition. The proportion of audible target glimpses for each target+masker keyword stimulus presented in the fixed-SNR testing was measured using a computational glimpsing model of speech recognition. Results: Variability in both speech-in-speech and speech-in-noise recognition depended critically on the proportion of audible target glimpses available in the target+masker mixture, even across stimuli presented at the same global SNR. Glimpsed target audibility requirements for successful speech recognition varied systematically as a function of informational masking.
Young adult listeners required a greater proportion of audibl (open full item for complete abstract)

    Committee: Lauren Calandruccio (Committee Chair); Christopher Burant (Committee Member); Barbara Lewis (Committee Member); Robert Greene (Committee Member) Subjects: Audiology; Behavioral Sciences; Experimental Psychology
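The "proportion of audible target glimpses" used in the experiments above is commonly computed by comparing target and masker power in each time-frequency unit against a local criterion. A hedged NumPy sketch of that idea (the 3 dB criterion and the toy spectrograms are assumptions, not values from the dissertation):

```python
import numpy as np

def glimpse_proportion(target_tf, masker_tf, lc_db=3.0):
    """Fraction of time-frequency (T-F) units whose local SNR meets
    the local criterion, i.e. units 'glimpsed' through the masker."""
    local_snr_db = 10 * np.log10(target_tf / masker_tf)
    return float(np.mean(local_snr_db >= lc_db))

rng = np.random.default_rng(1)
target = rng.random((64, 100)) + 1e-8   # toy T-F power grids (freq x time)
masker = rng.random((64, 100)) + 1e-8
gp = glimpse_proportion(target, masker)
```

Two stimuli at the same global SNR can yield very different glimpse proportions, which is the dissociation the experiments above exploit.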
  • 4. Zhao, Yan Deep learning methods for reverberant and noisy speech enhancement

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    In daily listening environments, the speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions can be detrimental to speech intelligibility and quality, and also pose a serious problem for many speech-related applications, including automatic speech and speaker recognition. The objective of this dissertation is to enhance speech signals distorted by reverberation and noise, to benefit both human communications and human-machine interaction. Different from traditional signal processing approaches, we employ deep learning approaches to perform reverberant-noisy speech enhancement. Our study starts with speech dereverberation without background noise. Reverberation consists of sound wave reflections from various surfaces in an enclosed space. This means the reverberant signal at any time step includes the damped and delayed past signals. To explore such relationships at different time steps, we utilize a self-attention mechanism as a pre-processing module to produce dynamic representations. With these enhanced representations, we propose a temporal convolutional network (TCN) based speech dereverberation algorithm. Systematic evaluations demonstrate the effectiveness of the proposed algorithm in a wide range of reverberant conditions. Then we propose a deep learning based time-frequency (T-F) masking algorithm to address both reverberation and noise. Specifically, a deep neural network (DNN) is trained to estimate the ideal ratio mask (IRM), in which the anechoic-clean speech is considered as the desired signal. The enhanced speech is obtained by applying the estimated mask to the reverberant-noisy speech. Listening tests show that the proposed algorithm can improve speech intelligibility for hearing-impaired (HI) listeners substantially, and also benefit normal-hearing (NH) listeners. 
Considering the different natures of reverberation and noise, we propose to perform speech enhancement using a two-stage (open full item for complete abstract)

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Eric Healy (Committee Member) Subjects: Computer Science; Engineering
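The ideal ratio mask (IRM) mentioned above is defined per time-frequency unit from the clean-speech and noise powers. A minimal sketch (the toy power values are illustrative):

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power):
    """IRM per T-F unit: sqrt of the speech fraction of total power,
    running from 0 (noise-dominated) to 1 (speech-dominated)."""
    return np.sqrt(speech_power / (speech_power + noise_power))

# Toy T-F power grids (frequency x time):
speech_p = np.array([[4.0, 1.0], [0.25, 9.0]])
noise_p = np.ones((2, 2))
irm = ideal_ratio_mask(speech_p, noise_p)
# Enhancement applies the (estimated) mask to the noisy magnitude spectrogram.
```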
  • 5. Mental, Rebecca Using Realistic Visual Biofeedback for the Treatment of Residual Speech Sound Errors

    Doctor of Philosophy, Case Western Reserve University, 2018, Communication Sciences

    Purpose: Although most children with speech sound disorders are able to remediate their errors, some individuals have errors that persist into late childhood and even adulthood. These individuals are considered to have residual speech sound errors (RSSEs), and they are at risk for social, academic, and employment difficulties. Most individuals with RSSEs have participated in years of traditional speech therapy with little success. Visual biofeedback provides an alternative method of treatment that may be what finally allows these individuals to remediate their errors. This study utilized Opti-Speech, a visual biofeedback software that uses electromagnetic articulography to create a three-dimensional rendering of the tongue that moves in real time with the participant's own tongue, for the remediation of RSSEs. Method: This single subject multiple baseline design included 18 participants (11 males and 7 females) who ranged from 8 to 22 years of age. Speech sounds addressed in treatment included "r", "s", "sh", "ch", and "l". Participants attended an average of three baseline sessions and ten treatment sessions that utilized Opti-Speech visual biofeedback, and returned for a two-month follow-up. Results: Perceptual measures were based on generalization to untreated words. Eleven of the 18 participants were able to make clinically significant improvements for their target sound by their final treatment session, and 11 of 16 participants who returned for follow-up measures had made clinically significant improvement on their target sound. When final session perceptual ratings were compared to follow-up, eight of the nine participants who presented with clinically significant improvement for their target sound were able to maintain their progress or presented with significantly improved speech sound skills. However, generalization was not seen at the sentence level.
When considered as a group, clinically significant improvements were seen overal (open full item for complete abstract)

    Committee: Jennell Vick Ph.D. (Committee Chair); Barbara Lewis Ph.D. (Committee Member); Elizabeth Short Ph.D. (Committee Member); Gregory Lee Ph.D. (Committee Member); Parrill Fey Ph.D. (Committee Member) Subjects: Speech Therapy
  • 6. Frazer, Brittany Approximating Subglottal Pressure from Oral Pressure: A Methodological Study

    Master of Science (MS), Bowling Green State University, 2014, Communication Disorders/Speech-Language Pathology

    The most frequently used method to estimate subglottal pressure noninvasively is to have a person smoothly utter CVCV strings such that the subglottal pressure remains nearly constant throughout the utterance of the string, as in smoothly saying /p:i:p:i:p:i:/, and an oral pressure transducer is used to estimate the subglottal air pressure during the vowels by measuring the oral pressures during the consonants. The current investigation sought to determine the accuracy of estimates of subglottal pressure for various conditions, namely, whether or not the subjects are trained in the use of a standard utterance, increasing syllable rate, using a voiced /b/ instead of a voiceless /p/ initial syllable, adding a lip or velar leak, or using a two syllable production instead of a single syllable production. Ten subjects (five males and five females) volunteered for this study (results for three males and three females are reported here). The subglottal pressure was estimated from the oral pressure during lip occlusion, and the syllable rate and lip closed quotient (the duration the lips are closed divided by the syllable duration) were obtained for all subjects. Lip leak, velar leak, and lack of time to equilibrate air pressure throughout the airway caused estimates of subglottal pressure to be inaccurate. A wide range of syllable rates provided relatively accurate results. In addition, the use of the voiced initial consonant /b/ and the two-syllable word "peeper" appeared to create acceptable estimates of subglottal pressure from oral pressure. Training improved the consistency of the oral pressure profiles and thus the assurance in estimating the subglottal pressure. Numerous pressure profile shapes during lip occlusion are discussed.

    Committee: Ronald Scherer Ph.D. (Advisor); John Folkins Ph.D. (Committee Member); Alexander Goberman Ph.D. (Committee Member) Subjects: Speech Therapy
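The estimation method studied above reduces, at its core, to reading oral-pressure peaks during lip occlusion and averaging them, plus timing quotients such as the lip closed quotient. A small sketch with invented values (all numbers are hypothetical, not the thesis data):

```python
def estimate_subglottal_pressure(oral_peaks_cmH2O):
    """Average the oral-pressure plateaus measured during /p/ occlusions;
    with an equilibrated airway this approximates subglottal pressure."""
    return sum(oral_peaks_cmH2O) / len(oral_peaks_cmH2O)

def closed_quotient(closed_ms, syllable_ms):
    """Lip closed quotient: lip-closure duration over syllable duration."""
    return closed_ms / syllable_ms

psub = estimate_subglottal_pressure([7.8, 8.1, 8.0])  # hypothetical peaks
cq = closed_quotient(90, 300)
```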
  • 7. Van Jura, Matthew The Costs of Staying Neutral: How Midlevel Student Affairs Professionals Navigate the Personal and Professional Tensions Associated with Campus Free Speech Events

    Doctor of Philosophy, The Ohio State University, 2021, Educational Studies

    Midlevel student affairs professionals are integral to supporting the mission of higher education institutions. These professionals work closely with a diverse array of campus stakeholders, helping to implement strategy and facilitate information throughout the organization. Yet the midlevel nature of their role can be a source of frustration for these professionals. Despite their talent and expertise, midlevel student affairs professionals often feel as though they have few opportunities to provide input on the policies they are asked to implement and enforce (Donaldson & Rosser, 2007; Rosser, 2004; Wilson et al., 2016). In recent years, many scholars have explored tensions associated with free speech events on college campuses (Ben-Porath, 2017; Chemerinsky & Gillman, 2017; Morse, 2017; Palfrey, 2017). Few, however, have studied this topic from the perspective of midlevel student affairs professionals. This is an oversight because midlevel professionals comprise the majority of staff in student affairs organizations (M. B. Cooper & Boice‐Pardee, 2011). Furthermore, the midlevel nature of their position within the campus hierarchy suggests that these individuals can illuminate tensions and conflicting priorities associated with campus free speech events in ways that have been previously unseen. The purpose of this grounded theory study was to illustrate how midlevel student affairs professionals navigate the personal and professional tensions that arise through their involvement with campus free speech events.
Research questions included: 1) What policies and practices inform the ways in which midlevel student affairs professionals navigate campus free speech events?; 2) In what ways do campus free speech events create conflict for midlevel student affairs professionals concerning their professional roles and individual values?; and 3) How do systems of power shape the ways in which midlevel student affairs professionals negotiate these tensions that arise th (open full item for complete abstract)

    Committee: Susan Jones (Advisor); Tatiana Suspitsyna (Committee Co-Chair); Ann Allen (Committee Member) Subjects: Education Policy; Higher Education Administration
  • 8. Somasundaram, Arunachalam A facial animation model for expressive audio-visual speech

    Doctor of Philosophy, The Ohio State University, 2006, Computer and Information Science

    Expressive facial speech animation is a challenging topic of great interest to the computer graphics community. Adding emotions to audio-visual speech animation is very important for realistic facial animation. The complexity of neutral visual speech synthesis is mainly attributed to co-articulation. Co-articulation is the phenomenon due to which the facial pose of the current segment of speech is affected by the neighboring segments of speech. The inclusion of emotions and fluency effects in speech adds to that complexity because of the corresponding shape and timing modifications brought about in speech. Speech is often accompanied by supportive visual prosodic elements such as motion of the head, eyes, and eyebrows, which improve the intelligibility of speech, and they need to be synthesized. In this dissertation, we present a technique to modify input neutral audio and synthesize visual speech incorporating effects of emotion and fluency. Visemes, which are the visual counterparts of phonemes, are used to animate speech. We motion capture 3-D facial motion and extract facial muscle positions of expressive visemes. Our expressive visemes capture the pose of the entire face. The expressive visemes are blended using a novel constraint-based co-articulation technique that can easily accommodate the effects of emotion. We also present a visual prosody model for emotional speech, based on motion capture data, that exhibits non-verbal behaviors such as eyebrow motion and overall head motion.

    Committee: Richard Parent (Advisor) Subjects: Computer Science
  • 9. Bonaventura, Patrizia Invariant patterns in articulatory movements

    Doctor of Philosophy, The Ohio State University, 2003, Speech and Hearing Science

    The purpose of the study is to discover an effective method of characterizing movement patterns of the crucial articulator as the function of an abstract syllable magnitude and the adjacent boundary, and at the same time to investigate effects of prosodic control on utterance organization. In particular, the speed of movement when a flesh-point on the tongue blade or the lower lip crosses a selected position relative to the occlusion plane is examined. The time of such crossing provides an effective measure of syllable timing and syllable duration according to previous work. In the present work, using a very limited vocabulary with only a few consonants and one vowel as the key speech materials, effects of contrastive emphasis on demisyllabic movement patterns were studied. The theoretical framework for this analysis is the C/D model of speech production in relation to the concept of an invariant part of selected articulatory movements. The results show evidence in favor of the existence of ‘iceberg’ patterns, but a linear dependence of slope on the total excursion of the demisyllabic movement, instead of the approximate constancy of the threshold crossing speed as suggested in the original proposal of the ‘iceberg’, has been found. Accordingly, a revision of the original concept of ‘iceberg’ seems necessary. This refinement is consistent with the C/D model assumption on ‘prominence control’ that the syllable magnitude determines the movement amplitude, accompanying directly related syllable duration change. In this assumption, the movement of a consonantal component should also be proportional to syllable magnitude. The results suggest, however, systematic outliers deviating from the linear dependence of movement speed on excursion. This deviation may be caused by the effect of the immediately following boundary, often referred to as phrase-final elongation.

    Committee: Osamu Fujimura (Advisor) Subjects:
  • 10. Clopton, Sara Articulation Errors in Childhood Apraxia of Speech

    Master of Arts, Case Western Reserve University, 2008, Communication Sciences

    The purpose of this study was to characterize articulation errors of children with Childhood Apraxia of Speech (CAS) by type and position within the syllable. Errors made by children with CAS were compared against errors of peers with isolated, non-apraxic speech sound disorders (SSD) and combined speech and language disorder (SL) at preschool- and school-age. Results suggested that CAS was different from the other disorders with different profiles at the two ages. Between-group comparisons at same-age stages showed that a) preschool-aged children with CAS made more substitutions of onset consonants than comparison groups and b) school-aged children with CAS made more omissions of coda consonants than comparison groups. A subset of children with CAS was followed longitudinally. Results suggested that speech improved from preschool- to school-age, with substitutions decreasing and the percentage of consonants correct increasing. Despite improvement, coda omission appeared to be a salient characteristic of CAS at school-age.

    Committee: Patrizia Bonaventura PhD (Committee Chair); Angela Ciccia PhD (Committee Member); Stacy Williams PhD (Committee Member); Barbara Lewis PhD (Committee Member); Lisa Freebairn MA (Committee Member) Subjects: Communication; Linguistics; Speech Therapy
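Two of the measures behind the comparisons above, error counts by type and syllable position and the percentage of consonants correct (PCC), can be tallied very simply. A sketch with invented transcription codes (not data from the thesis):

```python
from collections import Counter

def percent_consonants_correct(n_correct, n_total):
    """PCC: percentage of consonants produced correctly."""
    return 100.0 * n_correct / n_total

# Hypothetical coded errors as (error_type, syllable_position) pairs:
errors = [("substitution", "onset"), ("substitution", "onset"),
          ("omission", "coda"), ("omission", "coda"), ("omission", "coda")]
tally = Counter(errors)
pcc = percent_consonants_correct(38, 50)
```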
  • 11. Ivan, Trevor A Framing Analysis of News Coverage Related to Litigation Connected to Online Student Speech That Originates Off-Campus

    MA, Kent State University, 2013, College of Communication and Information / School of Media and Journalism

    Responding to a growth in technology, young people often turn to social media and online communication as their primary means of expression and interaction. However, some of the content students create and post while at home can negatively affect the school environment. School administrators have, at times, disciplined students for their off-campus online speech. This act has raised legal questions about how much control schools can and should possess over speech that originates away from the school's physical boundaries. Some students and their families have sued their respective school districts when they perceive an overreach in school authority for such discipline. Despite this issue's gravity among First Amendment scholars and advocates, the general public probably has little direct experience with these legal questions beyond what it learns through news reports. Because news is a basic social learning tool, the way journalists present information can profoundly affect the public's understanding of any given issue. This study examined how the news media portrayed four court cases pertinent to this issue: Layshock v. Hermitage School District, J. S. v. Blue Mountain School District, Doninger v. Niehoff, and Kowalski v. Berkeley County Schools. The researcher used textual analysis to investigate the frames found in 76 news stories by examining the way journalists presented the following items: legal context, the actions of the student litigant, the actions of school administrators, and the online speech itself that initially led to school discipline.

    Committee: Candace Bowen M.A. (Advisor); Mark Goodman J.D. (Advisor); Danielle Coombs Ph.D. (Committee Member) Subjects: Education; Educational Leadership; Journalism; Legal Studies; Mass Media
  • 12. Li, Sarah Expanding Articulatory Information Interpreted from Ultrasound Imaging

    PhD, University of Cincinnati, 2024, Engineering and Applied Science: Biomedical Engineering

    Ultrasound imaging provides tongue shape information useful for remediating speech sound disorders, which affect 5% of children and cause long-term deficits in social health and employment in adulthood. However, ultrasound imaging can be difficult to interpret for clinicians and individuals, limiting the understanding of articulatory data and ultrasound biofeedback therapy speech outcomes. This dissertation includes three studies that use different approaches to address and investigate guidelines for improving interpretation of tongue articulation in ultrasound images during speech production. One difficulty is that tongue shapes can be challenging to compare due to their complexity and the fast pace of articulatory movements during speech. To approach this problem, tongue movement was represented as displacement trajectories of tongue parts, and support-vector machine classification models were trained to identify patterns that differentiate accurate versus misarticulated productions of the word “are.” A linear combination of tongue dorsum and blade movement was shown to achieve a classification accuracy of 85%. The resulting simpler representation of tongue movement accuracy would aid interpretation of ultrasound images during biofeedback by allowing easy comparison to movement targets. Another source of difficulty is the articulatory information missing from ultrasound images, such as tongue tip shadowed by sublingual air or by bone, as well as possible confusion between parasagittal and midsagittal tongue contours. By using a novel approach of simulating ultrasound wave propagation in tongue shapes segmented from MRI, ultrasound images were simulated from known /r/ tongue shapes. Simulations from 23 speakers indicated that tongue shapes in the middle of the continuum between bunched and retroflex /r/ had the longest portion of anterior tongue not visible in ultrasound images. 
Simulations of parasagittal and midsagittal images from 10 speakers su (open full item for complete abstract)

    Committee: T. Douglas Mast Ph.D. (Committee Chair); Steven M. Lulich Ph.D. (Committee Member); Jing Tang Ph.D. (Committee Member); Suzanne Boyce Ph.D. (Committee Member) Subjects: Biomedical Engineering
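The 85%-accurate classifier described above is a support-vector machine over tongue-part displacement features. As a self-contained illustration, here is a linear SVM trained by subgradient descent on the hinge loss, on synthetic two-feature data standing in for dorsum and blade displacement (the data, features, and hyperparameters are all invented for illustration):

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.
    Labels y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:       # margin violated: hinge subgradient
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                            # only the regularizer pulls on w
                w -= lr * lam * w
    return w, b

rng = np.random.default_rng(0)
# Hypothetical two-feature summaries of dorsum and blade displacement:
accurate = rng.normal([1.0, 0.5], 0.2, size=(40, 2))   # accurate productions
misartic = rng.normal([0.2, 1.2], 0.2, size=(40, 2))   # misarticulated productions
X = np.vstack([accurate, misartic])
y = np.array([1] * 40 + [-1] * 40)

w, b = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

In practice one would use a library implementation (e.g. scikit-learn's `SVC`) and held-out productions rather than training-set accuracy.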
  • 13. Smith, Erika Speech-Language Pathologists' Feelings and Attitudes Towards the Use of Apps in a School-Based Setting

    Master of Arts, Miami University, 2021, Speech Pathology and Audiology

    Thousands of technological apps have emerged in the past decade. Little research has been done to examine how apps are used by speech-language pathologists (SLPs), their effectiveness, and SLPs' feelings regarding their use. SLPs must consider current research as a principle of evidence-based practice when integrating technology into speech and language service delivery. The current study investigates SLPs' patterns of app use and their feelings about such use in a school setting. This study aims to uncover correlations between app use and these feelings, as well as considerations made by SLPs prior to implementing apps in their sessions. A survey was distributed to school-based SLPs in Ohio, yielding 69 valid responses. Results showed 77% of SLPs reported using apps in their treatment sessions. SLPs reported generally positive feelings regarding the use of apps. SLPs considered factors such as age, cognitive ability, and disorder of the students with whom they are using apps. For the SLPs who reported not using apps, the most common reasons were personal preference and price. Results of this study carry clinical implications for evidence-based practice as the age of technology continues to develop. These results warrant future research on the efficacy and effectiveness of apps in school settings.

    Committee: Arnold Olszewski Ph.D., CCC-SLP (Advisor); Amber Franklin Ph.D., CCC-SLP (Committee Member); Gerard Poll Ph.D., CCC-SLP (Committee Member) Subjects: Speech Therapy
  • 14. Spencer, Caroline Neural Mechanisms of Intervention in Residual Speech Sound Disorder

    PhD, University of Cincinnati, 2021, Allied Health Sciences: Communication Sciences and Disorders

    In typical child and adult speakers, speech generation requires coordinated activation of a network of inferior frontal, temporal, and subcortical brain regions to carry out multiple linguistic and speech motor processes. However, a portion of children who exhibit speech sound errors in development persist in these errors beyond age 9, which can lead to broader, long-term consequences in scholastic achievement, literacy, and social-emotional well-being. The goal of this project was to investigate the neural underpinnings of residual speech sound disorder (RSSD) and its remediation through a speech therapy program. In Study 1, I investigated the neural activity of children with RSSD in comparison to children with typically-developing speech (TD) at baseline (Time 1). I anticipated observing significant differences between the RSSD and TD groups. However, in a whole-brain analysis (at p<0.05 and with Bonferroni corrections for multiple comparisons), I did not observe statistically significant differences in activation on either the SRT-Early Sounds or SRT-Late Sounds. In Study 2, I followed up with a region-of-interest approach of activation at Time 1 and Time 2. I did not detect any significant differences across task, group, or time comparisons. While this finding was not expected, it implies that, when task performance is similar, children with RSSD do not show differences in neural activity from their typical peers. I also explored the relationship between change in activation and progress in therapy. I found that children with RSSD who made more progress in therapy tended to show a decrease in activation in the left visual association cortex on the SRT-Late Sounds (R2=0.78). The left visual association cortex is not a core component of the speech production network but may indicate differences in the children's reliance on sensorimotor integration or internal speech visualization processes.
Using a seed-to-voxel approach, I also explored function (open full item for complete abstract)

    Committee: Suzanne Boyce Ph.D. (Committee Chair); Edwin Maas Ph.D. (Committee Member); Jonathan Preston Ph.D. (Committee Member); Erin Redle Ph.D. (Committee Member); Jennifer Vannest Ph.D. (Committee Member) Subjects: Speech Therapy
  • 15. Wang, Zhong-Qiu Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Microphone arrays are widely deployed in modern speech communication systems. With multiple microphones, spatial information is available in addition to spectral cues to improve speech enhancement, speaker separation and robust automatic speech recognition (ASR) in noisy-reverberant environments. Conventionally, multi-microphone beamforming followed by monaural post-filtering is the dominant approach for multi-channel speech enhancement. This approach requires an accurate estimate of target direction, and power spectral density and covariance matrices of speech and noise. Such estimation algorithms usually cannot achieve satisfactory accuracy in noisy and reverberant conditions. Recently, riding on the development of deep neural networks (DNN), time-frequency (T-F) masking and spectral mapping based approaches have been established as the mainstream methodology for monaural (single-channel) speech separation, including speech enhancement and speaker separation. This dissertation investigates deep learning based microphone array processing and its application to speech separation and localization, and robust ASR. We start our work by exploring various ways of integrating speech enhancement and acoustic modeling for single-channel robust ASR. We propose a training framework that jointly trains enhancement frontends, filterbanks and backend acoustic models. We also apply sequence-discriminative training for sequence modeling and run-time unsupervised adaptation to deal with training and testing mismatches. One essential aspect of multi-channel processing is sound localization. We utilize deep learning based T-F masking to identify T-F units dominated by target speaker and only use these T-F units for speaker localization, as they contain much cleaner phases that are informative for localization. 
This approach dramatically improves the robustness of conventional cross-correlation, beamforming and subspace based approaches for speaker localization in noisy-reverberant (open full item for complete abstract)
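    The mask-guided localization idea above can be sketched as a mask-weighted GCC-PHAT. In this illustrative numpy toy (the function and variable names are mine, not the dissertation's), the phase-transformed cross-power spectrum of a two-microphone pair is pooled over frames, with each T-F unit weighted by a 0-1 mask estimating target dominance, before searching for the peak lag:

```python
import numpy as np

def masked_gcc_phat(X1, X2, mask):
    """Estimate the signed sample delay of mic 2 relative to mic 1.

    X1, X2: complex one-sided STFTs (frames x bins); mask: 0-1 weights
    marking T-F units believed to be dominated by the target speaker.
    """
    cross = np.conj(X1) * X2                          # cross-power spectrum
    phat = cross / np.maximum(np.abs(cross), 1e-12)   # phase transform
    pooled = np.sum(mask * phat, axis=0)              # pool reliable units
    cc = np.fft.irfft(pooled)                         # back to the lag domain
    lag = int(np.argmax(cc))
    n = cc.shape[0]
    return lag if lag < n // 2 else lag - n           # wrap to signed lag
```

    In clean conditions every unit contributes; the point of the mask is that in noise and reverberation only target-dominated units, whose phases are relatively clean, are allowed to vote.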

    Committee: DeLiang Wang (Advisor); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member); Robert Agunga (Other) Subjects: Computer Engineering; Computer Science
  • 16. Oriti, Taylor Narrative Abilities in Preschool Children with Childhood Apraxia of Speech, Speech Sound Disorder, and Language Impairment

    Master of Arts, Case Western Reserve University, 2020, Communication Sciences

    Purpose: The primary aims of this study were to examine narrative skills in children with childhood apraxia of speech (CAS) compared to children with speech sound disorder with and without language impairment (SSD+LI, SSD-only). Method: Participants were preschool-aged children with diagnosed CAS, SSD-only, and SSD+LI. Diagnoses were confirmed by a certified speech-language pathologist using standardized speech and language testing. Participants completed a narrative retell task with the Fox and Bear story. Performance in narrative microstructure, macrostructure, and comprehension was compared across the three groups with analysis of variance. Results: Participants with CAS told narratives that contained fewer story sequence items and more limited vocabulary. Analysis revealed slight differences in expressive language skills between participants with CAS and SSD+LI. Conclusions: Children with CAS experience deficits in later literacy predictors. Intervention for children with CAS should focus on expressive language skills in addition to speech sound production.

    Committee: Lewis Barbara PhD (Committee Chair); Mental Rebecca CCC-SLP, PhD. (Committee Member); Short Elizabeth PhD (Committee Member) Subjects: Early Childhood Education; Language; Literacy; Speech Therapy
  • 17. Bagchi, Deblin Transfer learning approaches for feature denoising and low-resource speech recognition

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Automatic speech recognition has become part and parcel of everyday life. Since the early 2000s, deep neural networks connected to hidden Markov models (DNN-HMMs) have single-handedly pushed the performance of clean speech recognition systems to human level. Since then, simple feedforward architectures have evolved into more sophisticated ones, like convolutional neural networks, which can correlate complex patterns to phones, and recurrent neural networks, which use horizontal connections to better utilize past (and future) context. These modern neural networks have pushed the boundaries of automatic speech recognition, lowering word error rates drastically. However, the improvement in performance comes at the cost of longer training times and slower decoding, because these neural networks are bulky, i.e., they have a large number of trainable parameters compared to feedforward neural networks. They are also intensely data-driven, i.e., high accuracy can only be achieved with a large set of training examples. The straightforwardness and simplicity of the feedforward architecture make it a powerful contender for real-time speech recognition. A growing body of research transfers knowledge from a cumbersome, high-complexity "teacher" network to a simpler "student" network of lower complexity. The main focus is to make feedforward neural networks imitate the behavior of convolutional or recurrent neural networks. In the course of this dissertation, I walk through results on knowledge transfer from recurrent and convolutional neural nets to feedforward neural nets in the realm of speech enhancement and multilingual speech recognition. Spectral mapping is a form of speech denoising which explicitly maps noisy speech to clean speech. In the first part of this dissertation, I describe a plug-and-play spectral mapping system that can be used as a front-end feature denoiser for any speech recognition system. 
The feat (open full item for complete abstract)
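    The teacher-student transfer described in this abstract is commonly realized as a distillation loss: the student is trained against the teacher's temperature-softened posteriors as well as the hard labels. A minimal numpy sketch (the temperature and mixing weight are illustrative hyperparameters, not values from the dissertation):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft (teacher-matching) cross-entropy plus hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -np.mean(np.sum(p_teacher * log_p_student, axis=-1)) * T * T
    log_q = np.log(softmax(student_logits) + 1e-12)
    hard = -np.mean(log_q[np.arange(len(labels)), labels])
    return alpha * soft + (1.0 - alpha) * hard
```

    The T*T factor is the usual rescaling that keeps the soft term's gradient magnitude comparable as the temperature changes.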

    Committee: Eric Fosler-Lussier (Advisor); DeLiang Wang (Committee Member); Micha Elsner (Committee Member) Subjects: Acoustics; Computer Engineering; Computer Science
  • 18. Rajashekar, Raksha Speech Enabled Navigation in Virtual Environments

    Master of Science in Computer Engineering (MSCE), Wright State University, 2019, Computer Engineering

    Navigating a virtual environment with traditional input devices such as mice, joysticks, and keyboards offers limited maneuverability and is time consuming. While working in a virtual environment, changing parameters to obtain the desired visualization requires manually entering parameter values into an algorithm and testing the outcomes. This thesis presents an alternative user interface that reduces user effort while navigating within the virtual environment. The interface is an Android application designed to accept spoken commands. This speech-enabled user interface, termed the Speech Navigation Application (SNA), lets users speak the commands they wish to see enacted in the virtual environment and change parameters to meet their needs. The idea behind the project was to minimize the effort needed to change the parameters of any visualization in order to obtain the desired view. The thesis explains in detail the design, implementation, and evaluation of the system, which is analyzed by simulating the working prototype in the DIVE.

    Committee: Thomas Wischgoll Ph.D. (Advisor); Yong Pei Ph.D. (Committee Member); John Gallagher Ph.D. (Committee Member) Subjects: Computer Engineering; Computer Science
  • 19. Jett, Brandi The role of coarticulation in speech-on-speech recognition

    Master of Arts, Case Western Reserve University, 2019, Communication Sciences

    Listeners take advantage of linkage variables (e.g., talker voice and appropriate syntax) to help recognize target speech amid competing background talkers. One potential linkage variable is coarticulation. Two experiments were conducted to investigate the role of coarticulation in speech-on-speech recognition. Experiment 1 indicated a significant main effect of coarticulation in the target speech, although coarticulation did not benefit the listener. It was unclear, however, how differences in local signal-to-noise ratio (SNR) across keywords affected these results. In Experiment 2, local SNR was controlled across keyword position. Results indicated no effect of coarticulation for the target speech regardless of local SNR. However, a significant main effect of masker type was observed, with listeners benefiting from a speech-shaped noise (SSN) masker compared to a two-talker masker. The data suggest that local intensity level plays a role in speech recognition; however, the importance of coarticulation for improved speech-in-speech recognition is not evident.
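    Local SNR in the sense used here, the target-to-masker level difference measured over a short stretch such as a single keyword, reduces to a per-segment power ratio. A minimal numpy sketch (the segment boundaries and function name are illustrative, not from the thesis):

```python
import numpy as np

def local_snr_db(target, masker, segments):
    """Per-segment SNR in dB; segments is a list of (start, end) sample indices."""
    out = []
    for start, end in segments:
        p_target = np.mean(target[start:end] ** 2)  # target power in segment
        p_masker = np.mean(masker[start:end] ** 2)  # masker power in segment
        out.append(10.0 * np.log10(p_target / p_masker))
    return np.array(out)
```

    Controlling local SNR across keyword position, as in Experiment 2, amounts to scaling the signals so this quantity is constant for every keyword segment.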

    Committee: Lauren Calandruccio Ph.D., CCC-A (Advisor); Angela Ciccia Ph.D., CCC-SLP (Committee Member); Barbara Lewis Ph.D., CCC-SLP (Committee Member) Subjects: Speech Therapy
  • 20. Vasko, Jordan Speech Intelligibility and Quality Resulting from an Ideal Quantized Mask

    Master of Arts, The Ohio State University, 2017, Speech and Hearing Science

    Speech recognition in noise presents a significant challenge for individuals with hearing loss, and current technologies to remedy this problem are limited. Recently developed machine learning algorithms have, however, proved to be promising solutions to this problem, as they have been able to segregate speech from noise to significantly improve its intelligibility for both normal-hearing and hearing-impaired listeners. The following paper introduces a novel segregation method to be employed by such machine-learning algorithms. The intelligibility and quality of speech-noise mixtures processed via this method were evaluated for normal-hearing listeners. The proposed approach was shown to produce speech intelligibility and quality that were comparable to those produced by the best current technique. Because this approach also has characteristics that may make implementation easier, it potentially represents a better approach than other existing methods.
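    A quantized mask in the spirit of this abstract can be sketched as an ideal ratio mask whose values are snapped to a small set of evenly spaced levels; the thesis's exact quantization scheme is not given in this abstract, so treat this numpy sketch as illustrative:

```python
import numpy as np

def ideal_ratio_mask(S, N):
    """Energy-based ideal ratio mask from target (S) and noise (N) magnitude
    spectrograms; some variants take a square root of this ratio."""
    return S**2 / (S**2 + N**2 + 1e-12)

def quantize_mask(mask, n_levels=4):
    """Snap each T-F unit to the nearest of n_levels evenly spaced values
    in [0, 1] (one plausible quantization; the thesis's scheme may differ)."""
    levels = np.linspace(0.0, 1.0, n_levels)
    idx = np.round(mask * (n_levels - 1)).astype(int)
    return levels[idx]
```

    With only a handful of levels the mask becomes cheap to estimate and transmit, which is one reason a quantized mask may be easier to implement than a continuous-valued one.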

    Committee: Eric Healy Ph.D. (Advisor); DeLiang Wang Ph.D. (Committee Member); Rachael Frush Holt Ph.D. (Committee Member) Subjects: Acoustics; Audiology