Search Results (1 - 20 of 20 Results)

Cesene, Daniel Fredrick. The Completeness of the Electronic Medical Record with the Implementation of Speech Recognition Technology
Master of Health and Human Services, Youngstown State University, 2014, Department of Health Professions
The advent of the electronic medical record (EMR) has transformed the process of clinical documentation. When combined with speech recognition technology (SRT), EMR completeness has increased relative to methodologies without this technology. This research examined chart audit completion scores of physicians and scribes working within four Northeastern Ohio Emergency Services departments. SPSS® Statistics was used to perform a repeated-measures analysis with paired-samples t tests comparing mean completion scores one month before versus six months after SRT implementation. The mean completion score of pre-SRT implementation with and without the assistance of scribes was 5.5 (sd = .8) and the mean completion score of post-SRT implementation without the assistance of scribes was 6.0 (sd = .9), indicating a significant increase from pre- to post-SRT implementation (t(17) = -3.9, p < .05). The mean completion score of pre-SRT implementation without the assistance of scribes was 5.0 (sd = 1.1) and the mean completion score of post-SRT implementation without the assistance of scribes was 6.0 (sd = .9), also indicating a significant increase from pre- to post-SRT implementation (t(17) = -4.7, p < .05). These analyses supported a strong statistical probability that the completeness scores of physicians utilizing SRT will exceed the total completeness scores of physicians and scribes not using this technology. Subsequently, the null hypotheses were rejected in support of the alternative hypotheses, which concluded: 1) The completeness of the EMR will at least remain the same or improve with the implementation of SRT. 2) The completeness of the EMR will at least remain the same or improve when speech recognition technology is used without scribe utilization.
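As an illustration of the paired-samples t-test comparison described above, the following minimal Python sketch runs the same kind of pre- versus post-SRT comparison; the score lists are made-up placeholder values, not data from this study.

    # Illustrative paired-samples t test, analogous to the pre- vs post-SRT
    # comparison above. The scores below are hypothetical, not study data.
    from scipy import stats

    pre_srt  = [5.1, 5.6, 4.8, 5.9, 5.3, 5.7, 5.0, 5.4]   # hypothetical chart completion scores
    post_srt = [6.0, 6.2, 5.8, 6.4, 5.9, 6.1, 5.7, 6.0]   # same physicians after SRT rollout

    t_stat, p_value = stats.ttest_rel(pre_srt, post_srt)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")          # p < .05 would reject the null hypothesis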

Committee:

Joseph Lyons, PhD (Advisor); Ronald Chordas, PhD (Committee Member); Richard Rogers, PhD (Committee Member)

Subjects:

Audiology; Communication; Health Care; Health Care Management; Health Education; Information Systems; Information Technology; Medicine; Nursing

Keywords:

speech recognition technology; speech recognition; voice recognition; nursing; EMR; electronic medical record; electronic health record; transcription services; scribes; medical transcription; emergency room physicians; emergency services

Mohapatra, Prateeti. Deriving Novel Posterior Feature Spaces for Conditional Random Field-Based Phone Recognition
Master of Science, The Ohio State University, 2009, Computer Science and Engineering

Conditional Random Fields (CRFs) are undirected graphical models that can be used to define the joint probability distribution over label sequences given a set of observation sequences to be labeled. A key advantage of CRFs is their great flexibility to include a wide variety of non-independent features of the input. Faced with this freedom, an important question remains: what features should be used?
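For reference, a linear-chain CRF of the kind used here defines the probability of a label sequence y given an observation sequence x as follows (standard textbook form, not notation taken from this thesis); the feature functions f_k are precisely the open design choice raised above:

    \[
      p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
      \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
      \qquad
      Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
    \]

In this thesis, MLP posterior estimates of phone classes or phonological features serve as such input features.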

This thesis describes two techniques for deriving novel features for use in Conditional Random Fields-based phone recognition, extending previous techniques that incorporated multiclass posteriors of phone classes or phonological features estimated by Multi-Layer Perceptrons.

The first technique investigates the integration of suprasegmental knowledge into the MLP classification system that is part of the CRF recognizer. CRFs are used to integrate MLP posterior estimates, particularly of phonological features or phonetic classes, which stand in as representations of the acoustics; this thesis shows that incorporating suprasegmental information as part of the MLP classification system augments the acoustic space in a beneficial way for phonological feature based CRF models. TIMIT phone recognition experiments show a small but statistically significant improvement due to both techniques.

The second experiment combines phonological feature scores from two different systems, yielding a statistically significant improvement in Conditional Random Field-based TIMIT phone recognition, despite a standalone system based on the added features performing significantly worse. We then explore the reasons for this improvement by examining different representations of phonological attribute classifiers, in terms of what they are classifying (binary versus n-ary features), the feature definition, the training paradigm, and the representation of scoring functions. The analysis leads to the conclusion that training on different databases provides robustness, and that binary-ness, feature definition, and score representation do not contribute to the improvement in performance.

Committee:

Eric Fosler-Lussier (Advisor); Chris Brew (Committee Member)

Subjects:

Computer Science

Keywords:

Speech Recognition; Feature Combination; Suprasegmental Information

Muhtar, Abdullahi M. A microcomputer-based digit recognition system
Master of Science (MS), Ohio University, 1984, Electrical Engineering & Computer Science (Engineering and Technology)


Committee:

Harold Klock (Advisor)

Keywords:

Speech Recognition; Mic Preamplifier; Bandpass Filter Stage

Dehdari, Jonathan. A Neurophysiologically-Inspired Statistical Language Model
Doctor of Philosophy, The Ohio State University, 2014, Linguistics
We describe a statistical language model having components that are inspired by electrophysiological activities in the brain. These components correspond to important language-relevant event-related potentials measured using electroencephalography. We relate neural signals involved in local- and long-distance grammatical processing, as well as local- and long-distance lexical processing to statistical language models that are scalable, cross-linguistic, and incremental. We develop a novel language model component that unifies n-gram, skip, and trigger language models into a generalized model inspired by the long-distance lexical event-related potential (N400). We evaluate this model in textual and speech recognition experiments, showing consistent improvements over 4-gram modified Kneser-Ney language models for large-scale textual datasets in English, Arabic, Croatian, and Hungarian.
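The abstract does not spell out the unified model's form. Purely as an illustration of how n-gram, skip, and trigger components can be combined, a generic linear interpolation (not the thesis's actual formulation, and with arbitrary context lengths) looks like:

    \[
      P(w_t \mid h) \;=\;
      \lambda_1 \, P_{\text{n-gram}}(w_t \mid w_{t-2}, w_{t-1})
      \;+\; \lambda_2 \, P_{\text{skip}}(w_t \mid w_{t-3}, w_{t-1})
      \;+\; \lambda_3 \, P_{\text{trigger}}(w_t \mid \{ w_i : i < t-3 \}),
      \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
    \]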

Committee:

William Schuler (Advisor); Eric Fosler-Lussier (Committee Member); Per Sederberg (Committee Member)

Subjects:

Artificial Intelligence; Cognitive Psychology; Computer Science; Linguistics; Neurosciences; Technology

Keywords:

statistical language model; language modeling; speech recognition; ASR; ERP; ELAN; N400

Youngdahl, Carla L. The Development of Auditory “Spectral Attention Bands” in Children
Doctor of Philosophy, The Ohio State University, 2015, Speech and Hearing Science
This study seeks to further our understanding of auditory development by investigating “spectral attention bands” (the spectral region of attention for an expected target) and the ability to integrate or segregate information across frequency bands in children. The ability to attend to a target signal and discriminate speech from noise is of special importance in children. On a daily basis children must listen and attend to important auditory information in noisy classroom environments. A comparison of spectral attention bandwidth in children and adults might clarify where aspects of processing/listening efficiency break down. The current three experiments investigate the shape of spectral attention bands in children aged 5 to 8 as compared to adults and indicate that the spectral attending listening strategy may affect the understanding of speech in noise. This study indicates that children do in fact listen differently than adults, using less efficient listening strategies that may leave them more susceptible to noise. This study also shows that between the ages of 5 and 8, substantial refinements in listening strategies occur, producing a shift toward more adult-like performance in the older children.

Committee:

Eric Healy (Advisor); Rachael Holt (Committee Member); Allison Ellawadi (Committee Member)

Subjects:

Acoustics

Keywords:

acoustics; psychoacoustics; speech perception; speech recognition; speech psychoacoustics; spectral; spectral attention; development; frequency attending; auditory; auditory attention

Emeeshat, Janah S. Isolated Word Speech Recognition System for Children with Down Syndrome
Master of Science in Engineering, Youngstown State University, 2017, Department of Electrical and Computer Engineering
Automatic speech recognition by machine is one of the most effective methods for man-machine communication. Because the speech waveform is nonlinear and time-variant, speech recognition requires a significant amount of intelligence and fault tolerance in the pattern recognition algorithms. The objective of this work was to develop an isolated word speech recognition system that helps children with Down syndrome communicate with others almost normally. These children are delayed in the use of meaningful speech and slower to acquire a useful vocabulary due to their large tongue and other factors. In this thesis, single words were collected from a 10-year-old child with Down syndrome. The same words were also collected from a typically developing child of the same age to compare their speech features. The children's voices were first recorded with a mobile phone and then recorded at a sample rate of 48 kHz using a unidirectional microphone, at 24 bits per sample and with one recording channel. The proposed model was based on the Fast Fourier Transform (FFT) as a feature extraction technique. The FFT transforms the sampled data from the time domain to the frequency domain to investigate features of spoken words and of different children.
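A minimal Python sketch of the FFT-based feature extraction step described above: a recorded word is split into frames, windowed, and transformed to the frequency domain. The frame length, hop size, and file names are illustrative assumptions, not parameters from the thesis.

    # FFT feature extraction sketch (assumed framing parameters, not the thesis's).
    import numpy as np

    def fft_features(samples, frame_len=2048):
        """Return magnitude spectra for consecutive, half-overlapping frames.
        `samples` is a 1-D NumPy array of audio samples."""
        hop = frame_len // 2
        window = np.hanning(frame_len)
        frames = [samples[i:i + frame_len]
                  for i in range(0, len(samples) - frame_len + 1, hop)]
        return np.array([np.abs(np.fft.rfft(window * f)) for f in frames])

    # Example (hypothetical file): compare spectra of the same word from two children.
    # word = np.fromfile("word_child.raw", dtype=np.int16).astype(np.float64)
    # spectra = fft_features(word)          # shape: (num_frames, frame_len // 2 + 1)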

Committee:

Frank Li, PhD (Advisor); Philip Munro, PhD (Committee Member); Faramarz Mossayebi, PhD (Committee Member)

Subjects:

Special Education; Speech Therapy

Keywords:

Speech Recognition; Down syndrome

Srinivasan, Soundararajan. Integrating computational auditory scene analysis and automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2006, Biomedical Engineering
Speech perception studies indicate that robustness of human speech recognition is primarily due to our ability to segregate a target sound source from other interferences. This perceptual process of auditory scene analysis (ASA) is of two types, primitive and schema-driven. This dissertation investigates several aspects of integrating computational ASA (CASA) and automatic speech recognition (ASR). While bottom-up CASA is used as a front-end for ASR to improve its robustness, ASR is used to provide top-down information to enhance primitive segregation. Listeners are able to restore masked phonemes by utilizing lexical context. We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode masked speech and activates word templates via dynamic time warping. A systematic evaluation shows that the model restores both voiced and unvoiced phonemes with a high spectral quality. Missing-data ASR requires a binary mask from bottom-up CASA that identifies speech-dominant time-frequency regions of a noisy mixture. We propose a two-pass system that performs segregation and recognition in tandem. First, an n-best lattice, consistent with bottom-up speech separation, is generated. Second, the lattice is re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently. By combining CASA and ASR, we present a model that simulates listeners' ability to attend to a target speaker when degraded by energetic and informational masking. Missing-data ASR is used to account for energetic masking and the output degradation of CASA is used to model informational masking. The model successfully simulates several quantitative aspects of listener performance. The degradation in the output of CASA-based front-ends leads to uncertain ASR inputs. We estimate feature uncertainties in the spectral domain and transform them into the cepstral domain via nonlinear regression. The estimated uncertainty substantially improves recognition accuracy. We also investigate the effect of vocabulary size on conventional and missing-data ASRs. Based on binaural cues, we extract the speech signal using a Wiener filter for conventional ASR and estimate a binary mask for missing-data ASR. We find that while missing-data ASR outperforms conventional ASR on a small vocabulary task, the relative performance reverses on a larger vocabulary task.
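A minimal sketch of the binary-mask idea underlying the missing-data front-end described above: mark time-frequency units where speech dominates noise. The 0 dB local-SNR criterion and the function name are common conventions assumed here, not details taken from this dissertation.

    # Ideal binary mask (IBM) sketch: 1 = speech-dominant time-frequency unit.
    import numpy as np

    def ideal_binary_mask(speech_tf, noise_tf, lc_db=0.0):
        """speech_tf, noise_tf: magnitude spectrograms of the premixed signals."""
        local_snr_db = 20 * np.log10(speech_tf / (noise_tf + 1e-12) + 1e-12)
        return (local_snr_db > lc_db).astype(np.float32)

    # A missing-data recognizer would then treat units where the mask is 0 as unreliable:
    # masked_mixture = ideal_binary_mask(S, N) * mixture_tf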

Committee:

DeLiang Wang (Advisor)

Keywords:

Computational auditory scene analysis (CASA); Robust automatic speech recognition; Speech segregation; Phonemic restoration; Top-down analysis; Binaural processing; Uncertainty decoding; Multitalker perception

Cho, Jaeyoun. Speech enhancement using microphone array
Doctor of Philosophy, The Ohio State University, 2005, Electrical Engineering

Speech enhancement is one of the most important issues in the communication and signal processing area. It typically refers to the suppression of additive noise rather than convolutive noise, since additive noise is easier to separate from the speech. However, research on speech enhancement has so far had difficulty enhancing speech or separating it from background noise, mostly because speech is too non-stationary to be modelled.

In the literature, two speech enhancement techniques stand out as the state of the art. One is spectral subtraction, the most popular of all single-channel speech enhancement techniques, and the other is beamforming, spatial and temporal filtering with a microphone array. Spectral subtraction is attractive because it needs only one channel, its method of removing noise is quite simple, and the processed output confirms its effectiveness in improving the signal-to-noise ratio (SNR). Beamforming is an emerging technique in speech enhancement that can simply form a beam toward a speaker and enhance the speech he or she utters. Nevertheless, spectral subtraction has a critical weakness in that it generates unavoidable distortion, so-called musical noise, which is annoying to the human ear, and beamforming cannot enhance speech sufficiently without a large number of microphones.
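For concreteness, here is a minimal Python sketch of basic magnitude spectral subtraction, the single-channel technique described above. The over-subtraction factor, spectral floor, and noise-estimation step are assumptions for illustration, not parameters from this dissertation.

    # Basic magnitude spectral subtraction sketch (assumed parameters).
    import numpy as np

    def spectral_subtraction(noisy_spec, noise_spec, alpha=2.0, floor=0.02):
        """noisy_spec: magnitude spectrogram of noisy speech, shape (frames, bins);
        noise_spec: average magnitude spectrum estimated from a speech-free segment."""
        cleaned = noisy_spec - alpha * noise_spec          # subtract the noise estimate
        return np.maximum(cleaned, floor * noisy_spec)     # spectral floor limits musical noise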

The proposed hybrid method combines spectral subtraction and beamforming to enhance the quality and intelligibility of speech. Combining these methods offers several advantages. Musical noise, the major drawback of spectral subtraction, can be smoothed by beamforming. The speech quality, which is only slightly improved by beamforming with a limited number of microphones, can be enhanced further by spectral subtraction. In addition, it is shown that existing spectral subtraction methods can be improved by using psychoacoustic effects, and that voice activity detection using the microphone array can be made more robust.

On several measures of performance evaluation, the proposed hybrid method is shown to output speech of better quality and intelligibility than either spectral subtraction or beamforming alone.

Committee:

Ashok Krishnamurthy (Advisor)

Keywords:

speech enhancement; speech recognition; spectral subtraction; beamforming; microphone array; psychoacoustics

Narayanan, Arun. Computational auditory scene analysis and robust automatic speech recognition
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic speech recognition (ASR) has made great strides over the last decade producing acceptable performance in relatively `clean' conditions. As a result, it is becoming a mainstream technology. But for a system to be useful in everyday conditions, it has to deal with distorting factors like background noise, room reverberation, and recording channel characteristics. A popular approach to improve robustness is to perform speech separation before doing ASR. Most of the current systems treat speech separation and speech recognition as two independent, isolated tasks. But just as separation helps improve recognition, recognition can potentially influence separation. For example, in auditory perception, speech schemas have been known to help improve segregation. An underlying theme of this dissertation is the advocation of a closer integration of these two tasks. We address this in the context of computational auditory scene analysis (CASA), including supervised speech separation. CASA is largely motivated by the principles that guide human auditory `scene analysis'. An important computational goal of CASA systems is to estimate the ideal binary mask (IBM). The IBM identifies speech dominated and noise dominated regions in a time-frequency representation of a noisy signal. Processing noisy signals using the IBM improves ASR performance by a large margin. We start by studying the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs) and vocabulary sizes. Our results show that the mere pattern of the IBM carries important phonetic information. Akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB. In fact, our study shows that there is broad agreement with human performance which is rather surprising. Given the important role that binary mask patterns play, we develop two novel systems that incorporate this information to improve ASR. The first system performs recognition by directly classifying binary masks corresponding to words and phonemes. The method is evaluated on an isolated digit recognition and a phone classification task. Despite dramatic reduction of speech information encoded in a binary mask compared to a typical ASR feature frontend, the proposed system performs surprisingly well. The second approach is a novel framework that performs speech separation and ASR in a unified fashion. Separation is performed via masking using an estimated IBM, and ASR is performed using the standard cepstral features. Most systems perform these tasks in a sequential fashion: separation followed by recognition. The proposed framework, which we call bidirectional speech decoder, unifies these two stages. It does this by using multiple IBM estimators each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On a medium-large vocabulary speech recognition task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM. Supervised classification based speech separation has shown a lot of promise recently. We perform an in-depth evaluation of such techniques as a front-end for noise-robust ASR. 
Comparing performance of supervised binary and ratio mask estimators, we observe that ratio masking significantly outperforms binary masking when it comes to ASR. Consequently, we propose a separation front-end that consists of two stages. The first stage removes additive noise via ratio time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: A non-linear function is learned that maps the masked spectral features to their clean counterpart. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks (DNNs) and hidden Markov models. Results show that dFDLR consistently improves performance in all test conditions. We explore alternative ways to using the output of speech separation to improve ASR performance when using DNN based acoustic models. Apart from its use as a frontend, we propose using speech separation for providing smooth estimates of speech and noise which are then passed as additional features. Finally, we develop a unified framework that jointly improves separation and ASR under a supervised learning framework. Our systems obtain the state-of-the-art results in two widely used medium-large vocabulary noisy ASR corpora: Aurora-4 and CHiME-2.
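The ratio mask contrasted with binary masking above has a commonly used energy-ratio form; a minimal sketch of that standard definition (assumed here, not necessarily the exact variant used in this dissertation) follows.

    # Ideal ratio mask (IRM) sketch: a soft, energy-ratio alternative to the binary mask.
    import numpy as np

    def ideal_ratio_mask(speech_tf, noise_tf, beta=0.5):
        """speech_tf, noise_tf: magnitude spectrograms of the premixed signals."""
        speech_energy = speech_tf ** 2
        noise_energy = noise_tf ** 2
        return (speech_energy / (speech_energy + noise_energy + 1e-12)) ** beta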

Committee:

DeLiang Wang (Advisor); Mikhail Belkin (Committee Member); Eric Fosler-Lussier (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Automatic speech recognition; noise robustness; computational auditory scene analysis; binary masking; ratio masking; mask estimation; deep neural networks; acoustic modeling; speech separation; speech enhancement; noisy ASR; CHiME-2; Aurora-4

He, Yanzhang. Segmental Models with an Exploration of Acoustic and Lexical Grouping in Automatic Speech Recognition
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Developing automatic speech recognition (ASR) technologies is of significant importance to facilitate human-machine interactions. The main challenges for ASR development have been centered around designing appropriate statistical models and the modeling targets associated with them. These challenges exist throughout the ASR probabilistic transduction pipeline that aggregates information from the bottom up: in acoustic modeling, hidden Markov models (HMMs) are used to map the observed speech signals to phoneme targets in a frame-by-frame fashion, suffering from the well-known frame conditional independence assumption and the inability to integrate long-span features; in lexical modeling, phonemes are grouped into a sequence of vocabulary words as a meaningful sentence, where out-of-vocabulary (OOV) words cannot be easily accounted for. The main goal of this dissertation is to apply segmental models - a family of structured prediction models for sequence segmentation and labeling - to tackle these problems by introducing innovative intermediate-level structures into the ASR pipeline via acoustic and lexical grouping. On the acoustic side, we explore discriminative segmental models to overcome some of the limitations of frame-level HMMs, by modeling phonemes as segmental targets with variable length. In particular, we introduce a new type of acoustic model by combining segmental conditional random fields (SCRFs) with deep neural networks (DNNs). In light of recent successful applications of SCRFs to lattice rescoring, we put forward a novel approach to first-pass word recognition that uses SCRFs directly as acoustic models. With the proposed model, we are able to integrate local discriminative classifiers, segmental long-span dependencies like duration, and subword unit transitions as features in a unified framework during recognition. To facilitate the training and decoding, we introduce “Boundary-Factored SCRFs”, a special case of SCRFs, with an efficient inference algorithm. Furthermore, we introduce a WFST-based decoding framework to enable SCRF acoustic models along with language models in direct word recognition. We empirically verify the superiority of the proposed model to frame-level CRFs and hybrid HMM-DNN systems using the same label space. On the lexical side, morphs, as the smallest linguistically meaningful subword units, provide a better balance between lexical confusability and OOV coverage than phonemes when they are used in recognition to recover OOV words. In this dissertation, we study the use of Morfessor, an unsupervised HMM segmental model specialized for morphological segmentation, to derive morphs suitable for handling OOV words in ASR and keyword spotting. We demonstrate that decoding with the automatically-derived morphs is effective in a morphologically rich language in the low-resource setting. However, grapheme-based morphs do not work well on some of the other languages we evaluate, due to over-segmentation and incorrect pronunciation issues, among others. We then develop several novel types of morphs based on phonetic representations to better account for pronunciations and confusability. Morfessor is shown to be able to learn the phonetic regularity for the proposed morphs, achieving improved performance in the languages in which the traditional grapheme-based morphs tend to fail.
Finally, we show that the phonetically-based morphs are complementary to the grapheme-based morphs across all languages allowing for substantial performance improvement via system combination.
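For context, a segmental CRF of the general kind referenced above scores a joint segmentation and labeling of the acoustics roughly as follows (generic form, not the thesis's Boundary-Factored parameterization):

    \[
      p(\mathbf{y}, \mathbf{e} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
      \exp\!\Big( \sum_{j} \mathbf{w}^{\top} \mathbf{f}\big(y_{j-1},\, y_j,\, e_j,\, \mathbf{x}\big) \Big)
    \]

where each edge e_j spans a variable-length segment of acoustic frames (rather than a single frame, as in a frame-level CRF or HMM) and Z(x) sums over all possible segmentations and labelings.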

Committee:

Eric Fosler-Lussier (Advisor); Micha Elsner (Committee Member); Brian Kulis (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

automatic speech recognition; segmental models

Nagaraj, Naveen K. Explaining Listening Comprehension in Noise Using Auditory Working Memory, Attention, and Speech Tests
Doctor of Philosophy (PhD), Ohio University, 2014, Hearing Science (Health Sciences and Professions)
Listening in noise requires more than just recognizing words and interpreting their meaning. Successful comprehension involves linguistic analysis with the aid of cognitive resources such as attention and working memory (WM). In the current study, the effect of noise on cognitive and listening tasks was examined. Next, the relative contribution of cognitive and speech recognition measures in predicting listening comprehension in quiet and noise was evaluated. Sixty adults with normal hearing (18-37 years) participated. Novel cognitive tasks were designed to measure auditory WM capacity and attention switching (AS) ability. Speech recognition was measured using standardized speech recognition tests. Listening comprehension, measured using the Lecture, Interview, and Spoken Narratives test, was the dependent variable. Participants' speed of information processing was significantly faster in WM and AS tasks in noise. This result was consistent with the view that noise may enhance arousal, leading to faster information processing during cognitive tasks. While the speed of AS was faster in noise, the accuracy of AS updating was significantly poorer. This implied that rapid AS resulted in more errors in updating. In the WM task, recall accuracy was better in noise. In particular, participants who processed information faster in noise, and did so accurately, switched their attention more effectively to refresh/rehearse recall items in WM. More efficient processing deployed in the presence of noise appeared to have improved performance on inference questions in the listening comprehension task. Regression analysis suggested that speech recognition scores were the main predictors of listening comprehension in quiet, whereas WM capacity and sentence recognition were the major predictors in noise. Listening in quiet was a relatively simpler process for young normal-hearing adults, requiring fewer cognitive resources than listening in noise, which drew more heavily on cognitive resources, especially WM. Analysis of the relation between WM and listening comprehension revealed that it was the controlled-attention component of WM, not the storage component, that was critical for listening in noise. The present study highlights the complex nature of listening in noise and the importance of measuring WM capacity along with speech tests to better understand the communication difficulties faced by individuals in real-life situations.

Committee:

Jeffrey DiGiovanni, Dr. (Advisor); James Montgomery, Dr. (Committee Member); Dennis Ries, Dr. (Committee Member); Alex Sergeev, Dr. (Committee Member)

Subjects:

Audiology; Cognitive Psychology

Keywords:

Auditory; Working Memory; Attention; Speech Recognition; Listening Comprehension

Mahajan, Onkar. Multimodal interface integrating eye gaze tracking and speech recognition
Master of Science, University of Toledo, 2015, Engineering (Computer Science)
Currently, the most common method of interacting with a computer is through the use of a mouse and keyboard. HCI research includes the development of interactive interfaces that go beyond the desktop Graphical User Interface (GUI) paradigm. The provision of user-computer interfaces through gesturing, facial expression, and speaking, as well as other forms of human communication, has also been the focus of intense study. Eye Gaze Tracking (EGT) is another type of human-computer interface that has proven useful for several different industries, and the rapid introduction of new models by commercial EGT companies has led to more efficient and user-friendly interfaces. Unfortunately, the cost of these commercial trackers has made it difficult for them to gain popularity. In this research, a low-cost multimodal interface is utilized to overcome this issue and help users adapt to new input modalities. The system developed recognizes input from eyes and speech. The eye gaze detection module is based on Opengazer, an open-source gaze tracking application, and is responsible for determining the estimated gaze point coordinates. The images captured during calibration are grey-scaled and averaged to form a single image; they are mapped relative to the position of the user’s pupil and the corresponding point on the screen. These images are then used to train a Gaussian Process which is used to determine the estimated gaze point. The voice recognition module detects voice commands from the user and converts them into mouse events. This interface can be operated in two distinct modes. One mode uses eye gaze as a cursor-positioning tool and voice commands to perform mouse click events. The second mode uses dwell-based gaze interaction, in which focusing for a predetermined amount of time triggers a click event. Both modules work concurrently when using multimodal input. Several modifications were made to improve the stability and accuracy of gaze, albeit within the constraints of the open-source gaze tracker. The multimodal implementation results were measured in terms of tracking accuracy and stability of the estimated gaze point.
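A minimal Python sketch of the dwell-based click logic described above: if the estimated gaze point stays within a small radius for a set time, a click fires. The class name, radius, and dwell time are illustrative assumptions, not values or code from this thesis.

    # Dwell-click sketch (hypothetical DwellClicker; assumed radius and dwell time).
    import math
    import time

    class DwellClicker:
        def __init__(self, dwell_seconds=1.0, radius_px=40):
            self.dwell_seconds = dwell_seconds
            self.radius_px = radius_px
            self._anchor = None        # (x, y) where the current dwell started
            self._start = None

        def update(self, x, y):
            """Feed each new gaze estimate; returns True when a click should fire."""
            now = time.monotonic()
            if self._anchor is None or math.dist((x, y), self._anchor) > self.radius_px:
                self._anchor, self._start = (x, y), now     # gaze moved: restart the dwell
                return False
            if now - self._start >= self.dwell_seconds:
                self._anchor, self._start = None, None      # reset after firing a click
                return True
            return False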

Committee:

Jackson Carvalho, PhD (Committee Chair); Mansoor Alam, PhD (Committee Member); Henry Ledgard, PhD (Committee Member)

Subjects:

Computer Science; Engineering

Keywords:

Eye Gaze Tracking; Speech Recognition; Multimodal Interface

Abdelhamied, Kadry A. Automatic identification and recognition of deaf speech
Doctor of Philosophy, The Ohio State University, 1986, Graduate School

Committee:

Not Provided (Other)

Subjects:

Engineering

Keywords:

Automatic speech recognition; Deaf

Rytting, Christopher Anton. Preserving subsegmental variation in modeling word segmentation (or, the raising of baby Mondegreen)
Doctor of Philosophy, The Ohio State University, 2007, Linguistics

Many computational models have been developed to show how infants break apart utterances into words prior to building a vocabulary – the word segmentation task. However, these models have been tested in relatively few languages, with little attention paid to how different phonological structures may affect the relative effectiveness of particular statistical heuristics. Moreover, even for English, since these models generally rely on transcriptions rather than on speech for input, they have shown little regard for the subsegmental variation naturally found in the speech signal. A model using transcriptional input makes unrealistic assumptions which may overestimate the model's effectiveness, relative to how it would perform on more variable input such as that found in speech.

This dissertation addresses the first of these two issues by comparing the performance of two classes of distribution-based statistical cues on a corpus of Modern Greek, a language with a phonotactic structure significantly different from that of English, and shows how these differences change the relative effectiveness of two classes of statistical heuristics, compared to their performance in English.

To address the second issue, this dissertation proposes an improved representation of the input that preserves the subsegmental variation inherently present in natural speech while maintaining sufficient similarity with previous models to allow for straightforward, meaningful comparisons of performance. The proposed input representation uses an automatic phone classifier to replace the transcription-based phone labels in a corpus of English child-directed speech with real-valued phone probability vectors. These vectors are then used to provide input for a previously-proposed connectionist model of word segmentation, in place of the invariant, transcription-based binary input vectors used in the original model.

The performance of the connectionist model as reimplemented here suggests that real-valued inputs present a harder learning task than idealized inputs. In other words, the subsegmental variation hinders the model more than it helps. This may help explain why English-learning infants soon gravitate toward other, potentially more salient cues, such as lexical stress. However, the model still performs above chance even with very noisy input, consistent with studies showing that children can learn from distributional segmental cues alone.
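One classic distribution-based segmentation cue of the kind this dissertation compares is transitional probability between adjacent phones, with a word boundary posited at local minima. The Python sketch below is a generic illustration of that cue under idealized transcription input, not the dissertation's specific models.

    # Transitional-probability segmentation sketch (generic cue, idealized input).
    from collections import Counter

    def transitional_probabilities(utterances):
        """utterances: list of phone-label sequences with no word boundaries."""
        bigrams, unigrams = Counter(), Counter()
        for u in utterances:
            unigrams.update(u)
            bigrams.update(zip(u, u[1:]))
        return {(a, b): bigrams[(a, b)] / unigrams[a] for (a, b) in bigrams}

    def segment(utterance, tp):
        """Insert a boundary wherever TP dips below both neighbours (local minimum)."""
        scores = [tp.get((a, b), 0.0) for a, b in zip(utterance, utterance[1:])]
        words, start = [], 0
        for i in range(1, len(scores) - 1):
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]:
                words.append(utterance[start:i + 1])
                start = i + 1
        words.append(utterance[start:])
        return words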

Committee:

Christopher Brew (Advisor)

Keywords:

word segmentation; first language acquisition; automatic speech recognition; connectionist models; Modern Greek

Ore, Brian M. Multilingual Articulatory Features for Speech Recognition
Master of Science in Engineering (MSEgr), Wright State University, 2007, Electrical Engineering
Articulatory features describe the way in which the speech organs are used when producing speech sounds. Research has shown that incorporating this information into speech recognizers can lead to an improvement in system performance. The majority of previous work, however, has been limited to detecting articulatory features in a single language. In this thesis, Gaussian Mixture Models (GMMs) and Multi-Layer Perceptrons (MLPs) were used to detect articulatory features in English, German, Spanish, and Japanese. The outputs of the detectors were used to form the feature set for a Hidden Markov Model (HMM)-based phoneme recognizer. The best overall detection and recognition performance was obtained using MLPs with context. Compared to Mel-Frequency Cepstral Coefficient (MFCC)-based systems, the proposed feature sets yielded an increase of up to 4.39% correct and 5.37% accuracy when using monophone models, and an increase of up to 3.22% correct and 2.60% accuracy with triphone models. On a word recognition task, however, the MFCC systems performed better. Multilingual articulatory feature detectors were also created for all four languages using MLPs. An additional feature set was created using the multilingual detectors and evaluated on the same phoneme recognition task. Compared to the feature sets created with the language-dependent MLP detectors, the maximum decrease in system performance with monophone models was 1.44% correct and 1.72% accuracy on Japanese, and the maximum improvement in system performance with triphone models was 0.75% correct and 0.40% accuracy on Spanish. On a word recognition task, the feature sets created with the multilingual MLP detectors yielded a decrease of up to 3.75% correct and 6.01% accuracy. As a final experiment, two different procedures were investigated for combining the scores from the English GMM and MLP articulatory feature detectors. It was found that the detection performance for each articulatory feature can be improved by combining the scores from all GMM and MLP detectors.
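As a rough illustration of one language-dependent detector of the kind described above, the sketch below trains an MLP to output per-frame posteriors for a single articulatory feature (voicing is used as an example); the feature names, dimensions, and placeholder data are assumptions, not the thesis's setup.

    # Articulatory-feature MLP detector sketch (placeholder data and dimensions).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    acoustic_frames = rng.normal(size=(1000, 39))      # placeholder frame-level features
    voicing_labels = rng.integers(0, 2, size=1000)     # placeholder 0/1 frame labels

    detector = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200)
    detector.fit(acoustic_frames, voicing_labels)

    posteriors = detector.predict_proba(acoustic_frames)
    # In the thesis, the outputs of such detectors form the feature set for the
    # HMM-based phoneme recognizer.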

Committee:

Brian Rigling (Advisor)

Keywords:

Speech recognition; Articulatory features

Findlen, Ursula M. Dichotic Speech Detection, Identification, and Recognition by Children, Young Adults, and Older Adults
Doctor of Philosophy, The Ohio State University, 2009, Speech and Hearing Science
Dichotic speech detection, identification, and recognition were evaluated using speech stimuli that varied in the amount of lexical content, in order to examine lexical effects on performance characteristics measured in dichotic tasks. Data were collected for children and young adults with normal hearing, as well as older adults with mild to moderate sensorineural hearing loss, for dichotic tasks administered under both free recall and directed recall response conditions, in order to also examine cognitive load effects on performance. Results revealed that for dichotic speech recognition, the lexical content of the stimuli and the cognitive load of the task impacted performance measures for children, young adults, and older adults. Results from the dichotic detection and identification tasks revealed that young adults and older adults performed at ceiling levels, while children demonstrated a significant difference between the ability to detect and the ability to identify target stimuli. Overall, the present study suggests that both lexical content and cognitive load impact performance characteristics measured in dichotic tasks for children, young adults, and older adults. Clinical implications for the diagnosis of auditory processing disorder are discussed.

Committee:

Christina Roup, PhD (Advisor); Lawrence Feth, PhD (Committee Member); Gail Whitelaw, PhD (Committee Member); Paula Rabidoux, PhD (Committee Member)

Subjects:

Audiology

Keywords:

dichotic speech recognition

Heintz, Ilana. Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition
Doctor of Philosophy, The Ohio State University, 2010, Linguistics

The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid method for deriving morphemes using a small amount of linguistic knowledge; and the stem patterns should allow for the prediction of short vowel sequences that are missing from the text. The out-of-vocabulary problem is acute in Modern Standard Arabic due to its rich morphology, including a large inventory of inflectional affixes and clitics that combine in many ways to increase the rate of vocabulary growth. The problem of creating tools that work across dialects is challenging due to the many differences between regional dialects and formal Arabic, and because of the lack of text resources on which to train natural language processing (NLP) tools. The short vowels, while missing from standard orthography, provide information that is crucial to both acoustic modeling and grammatical inference, and therefore must be inserted into the text to train the most predictive NLP models. While other morpheme derivation methods exist that address one or two of the above challenges, none addresses all three with a single solution.

The stem pattern derivation method is tested in the task of automatic speech recognition (ASR), and compared to three other morpheme derivation methods as well as word-based language models. We find that the utility of morphemes in increasing word accuracy scores on the ASR task is highly dependent on the ASR system's ability to accommodate the morphemes in the acoustic and pronunciation models. In experiments involving both Modern Standard Arabic and Levantine Conversational Arabic data, we find that knowledge-light methods of morpheme derivation may work as well as knowledge-rich methods. We also find that morpheme derivation methods that result in a single morpheme hypothesis per word result in stronger models than those that spread probability mass across several hypotheses per word; however, the multi-hypothesis model may be strengthened by applying informed weights to the predicted morpheme sequences. Furthermore, we exploit the flexibility of Finite State Machines, with which the stem pattern derivation method is implemented, to predict short vowels. The result is a comprehensive exploration not only of the stem pattern derivation method, but of the use of morphemes in Arabic language modeling for automatic speech recognition.
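As a toy illustration of the stem-pattern idea: interleaving a consonantal root with a vowel template restores short vowels missing from unvocalized text. The tiny pattern inventory and template-filling function below are stand-ins for exposition only, not the dissertation's finite-state implementation.

    # Toy stem-pattern illustration (hypothetical pattern inventory, root k-t-b).
    PATTERNS = {
        "CaCaCa": "perfective verb",     # root k-t-b -> kataba
        "CāCiC":  "active participle",   # root k-t-b -> kātib
        "maCCūC": "passive participle",  # root k-t-b -> maktūb
    }

    def apply_pattern(root, pattern):
        """Fill the slots marked C in a pattern with the root consonants, in order."""
        consonants = iter(root)
        return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

    print(apply_pattern("ktb", "CaCaCa"))   # kataba
    print(apply_pattern("ktb", "maCCūC"))   # maktūb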

Committee:

Chris Brew, PhD (Committee Co-Chair); J. Eric Fosler-Lussier, PhD (Committee Co-Chair); Michael White, PhD (Committee Member)

Subjects:

Computer Science; Linguistics; Technology

Keywords:

Arabic; Language Modeling; Automatic Speech Recognition; Morphology

Abraham, Aby. Continuous Speech Recognition Using Long Term Memory Cells
Master of Science (MS), Ohio University, 2013, Electrical Engineering (Engineering and Technology)
This thesis proposes a continuous speech recognition model using a neural network structure inspired by the long-term memory model of the human cortex. The speech recognition model extracts and selects the best representation of the speech signal using mel-frequency cepstrum coefficients (MFCCs). The extracted features are fed to a neural network with long-term memory (LTM) cells, which learns the sequence. The LTM cells address three main issues of sequence learning - error tolerance, significance of elements, and memory decay - which are used to tune the LTM model with parameters that depend on the environment it learns in. To validate the model, two datasets were used - spoken English digits and spoken Arabic digits - in speaker-dependent mode. The parameters of the LTM model were optimized based on the environment. The results show that the LTM model, with fine tuning of its parameters, is 97% accurate in recognizing the spoken English digits dataset and 99% accurate in recognizing the spoken Arabic digits dataset.
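A minimal sketch of the MFCC feature-extraction front end described above; librosa and its default parameters are used here as an assumed stand-in for whatever toolchain the thesis actually used, and the file name is hypothetical.

    # MFCC front-end sketch (librosa assumed; not the thesis's actual toolchain).
    import librosa

    def mfcc_sequence(wav_path, n_mfcc=13):
        samples, sr = librosa.load(wav_path, sr=None)           # keep the original sample rate
        mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T                                            # shape: (frames, coefficients)

    # features = mfcc_sequence("digit_7.wav")   # hypothetical recording of a spoken digit
    # The frame-by-frame sequence would then be fed to the LTM network for sequence learning.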

Committee:

Janusz Starzyk (Advisor)

Subjects:

Electrical Engineering

Keywords:

Long Term Memory Cells; Speech Recognition; MFCC; Neural Network; Sequence Learning

Kasrani, Imen. Development of a Performance Assessment System for Language Learning
Master of Science (MS), Wright State University, 2017, Computer Science
Recent advances in computer-assisted language-speaking learning/training technology have demonstrated promising potential to improve the outcome of language learning in early education, special education, English as a Second Language (ESL), and foreign language instruction. The growing number of readily available mobile app-based solutions helps encourage interest in learning to speak a foreign language, but their effectiveness is limited by their lack of objective assessment and of performance feedback resembling expert judgment. For example, it has been recognized that, in early education, students learn best with one-on-one instruction. Unfortunately, teachers do not have the time, and it is challenging to extend the learning to the home without the assistance of an independent learning/training tool. In this thesis research, our objective is to develop an effective and practical solution that will help people learn and practice a new language independently at low cost. We have explored the use of real-time speech recognition, language translation, text synthesis, artificial intelligence (AI), and language intelligibility assessment technologies to develop a learning/training system that provides automatic assessment and instantaneous feedback on language-speaking performance in order to achieve an independent-learning workflow. Furthermore, we have designed and implemented a successful prototype system that demonstrates the feasibility and effectiveness of such a computer-assisted independent learning/training solution. This prototype can be easily used on a computer, tablet, smartphone, and other portable devices, and provides a new learning experience that is augmented and enhanced by objective assessment and meaningful feedback in order to improve the language-speaking proficiency of its user. Additionally, it may be used for real-time translation to support conversation across different languages. Our experimental results demonstrate that the proposed system can sufficiently analyze the intelligibility of one's speaking, accurately identify mispronounced words, and provide feedback that localizes and highlights errors for continuous practice toward perfection.
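A minimal sketch of the word-level feedback idea described above: compare what a recognizer heard against the target sentence and flag mismatched words. The recognizer call is left abstract, and the function name and example are illustrative assumptions, not the prototype's actual code.

    # Word-level mispronunciation flagging sketch (hypothetical helper, not the prototype).
    import difflib

    def flag_mispronounced(target_text, recognized_text):
        """Return target words that were not matched in the recognizer output."""
        target = target_text.lower().split()
        heard = recognized_text.lower().split()
        matcher = difflib.SequenceMatcher(None, target, heard)
        flagged = []
        for op, i1, i2, _, _ in matcher.get_opcodes():
            if op != "equal":
                flagged.extend(target[i1:i2])       # target words not spoken as expected
        return flagged

    print(flag_mispronounced("the quick brown fox", "the quick frown fox"))  # ['brown']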

Committee:

Yong Pei, Ph.D. (Advisor); Mateen Rizki, Ph.D. (Committee Member); Paul Bender, Ph.D. (Committee Member); Anna Lyon, Ed.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Performance Assessment; Language Learning; Speech Recognition

Brighton, Andrew P. Phoneme Recognition by hidden Markov modeling
Master of Science (MS), Ohio University, 1989, Electrical Engineering & Computer Science (Engineering and Technology)


Committee:

John Tague (Advisor)

Keywords:

Phoneme Recognition; Hidden Markov Modeling; Speech Recognition