Doctor of Philosophy, The Ohio State University, 2006, Linguistics
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistical, knowledge-based, and machine learning approaches for Chinese unknown word resolution, including the identification, POS tagging, and sense tagging of Chinese unknown words. What makes Chinese unknown word resolution hard is the limited information available for predicting the properties of unknown words, and for this reason it is crucial to make optimal use of the information that is available. To this end, this research explores two central ideas and aims to achieve two major goals. First, the morphological, syntactic, and semantic information of the component characters or morphemes of an unknown word provides useful insights into its structural and semantic properties. The first goal of this work is to develop novel algorithms that capture such insights. To integrate unknown word identification with word segmentation, the notion of character-based tagging is adopted to model the tendency of individual characters to combine with adjacent characters to form words in different contexts. To predict the POS categories of unknown words, morphological rules that encode knowledge about the relationship between the POS categories of unknown words and those of their component morphemes are developed.
Finally, to classify unknown words into appropriate semantic categories in a Chinese thesaurus, rules that capture the regularities in the relationship between the semantic categories of unknown words and those of their component morphemes are developed; information-theoretical models are used to compute the associations between individual morphemes and semantic categories ...
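As an illustration of the character-based tagging idea mentioned in the abstract, the following minimal sketch shows how a segmented sentence can be recast as one position tag per character, and how a tag sequence maps back to words. The four-tag scheme (B, M, E, S), the function names, and the example sentence are assumptions made for illustration; the abstract does not specify the tag set or the tagger used in the dissertation.

```python
# Illustrative sketch only: the B/M/E/S tag set and these helper names are
# assumptions, not taken from the dissertation.

def words_to_tags(words):
    """Convert a list of words into (character, tag) pairs.

    S = single-character word, B = word-initial character,
    M = word-internal character, E = word-final character.
    """
    pairs = []
    for word in words:
        if not word:
            continue  # skip empty strings
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            for ch in word[1:-1]:
                pairs.append((ch, "M"))
            pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(pairs):
    """Rebuild words from (character, tag) pairs by closing a word at E or S."""
    words = []
    current = ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # An unfinished word at the end of the input is kept as-is.
        words.append(current)
    return words


segmented = ["我", "喜欢", "语言学"]
tagged = words_to_tags(segmented)
restored = tags_to_words(tagged)
```

Under this framing, segmentation becomes a per-character classification problem, so a word absent from the dictionary can still be recovered whenever its characters receive a consistent tag sequence.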
Committee: Walt Meurers (Advisor)
Subjects: Language, Linguistics