Doctor of Philosophy, The Ohio State University, 2020, Linguistics
This thesis treats a major challenge for current state-of-the-art natural language processing (NLP) pipelines: morphologically rich languages, where many inflected forms or weak form-meaning correspondence lead to data sparsity and noise. For example, if the lexeme TEACHER occurs the same number of times in an English text and an Arabic text, those occurrences will be spread over just four forms in English (teacher, teacher's, teachers' and teachers) versus numerous forms in Arabic, leading to more low-frequency and out-of-vocabulary forms at test time.
Furthermore, while the +s suffix of teachers is highly predictable, there is significant entropy involved in predicting how pluralization will be realized in Arabic, which can cause models to be noisy. That said, the particular means of realizing pluralization (among other properties) can be informative in Arabic, as the +wn in mdrswn, 'teachers', indicates not only plurality but also that the referent is human.
To address data sparsity and noise from morphological richness, I propose practical means of inducing morphological information and/or incorporating it in preprocessing steps or model components, depending on the task at hand. The goals of this intervention are twofold. First, I aim to link variant inflections of the same lexeme to reduce sparsity. Second, I aim to mitigate noise by identifying morphosyntactic properties encoded in complex inflections like mdrswn and leveraging them to help models interpret low-frequency or out-of-vocabulary forms.
To be practical, morphological modeling should be maximally language-agnostic, i.e., portable to new languages or domains with minimal human effort, and maximally cheap in terms of the amount and cost of manual supervision required. Thus, I explore morphological modeling strategies and morphological resource creation, progressing toward more language-agnostic solutions requiring less supervision over the course of this thesis.
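The first goal above, linking variant inflections of the same lexeme to reduce sparsity, can be illustrated with a minimal sketch. It is not the method developed in the thesis: the hand-written lookup table FORM_TO_LEXEME, the function names, and the toy token list are all hypothetical, and a real system would induce the form-to-lexeme mapping rather than list it by hand. The sketch only shows how counts spread over the four English forms of TEACHER collapse into one count once the forms are linked.

```python
from collections import Counter

# Hypothetical hand-written lookup (an assumption for illustration only);
# it covers just the four English forms of TEACHER named in the abstract.
FORM_TO_LEXEME = {
    "teacher": "TEACHER",
    "teacher's": "TEACHER",
    "teachers": "TEACHER",
    "teachers'": "TEACHER",
}


def count_forms(tokens):
    """Count each surface form separately (the sparse view)."""
    return Counter(tokens)


def count_lexemes(tokens, form_to_lexeme):
    """Count tokens after mapping each form to its lexeme.

    Forms missing from the lookup are left unchanged, so they remain
    their own (possibly out-of-vocabulary) entries.
    """
    return Counter(form_to_lexeme.get(token, token) for token in tokens)


# Toy token list, invented for this example.
tokens = ["teacher", "teachers", "teacher's", "teachers'", "teachers", "teacher"]

form_counts = count_forms(tokens)
lexeme_counts = count_lexemes(tokens, FORM_TO_LEXEME)

print(len(form_counts), "distinct forms;", len(lexeme_counts), "distinct lexeme")
```

With more forms per lexeme, as in the Arabic case described above, each form would receive a smaller share of the same total, which is why linking forms matters more as morphological richness grows.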
Committee: Marie-Catherine de Marneffe (Advisor); Micha Elsner (Committee Member); Nizar Habash (Committee Member); Andrea Sims (Committee Member)
Subjects: Computer Science; Linguistics