Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries

2023, Doctor of Philosophy (PhD), Wright State University, Computer Science and Engineering PhD.
Information Extraction (IE) techniques are essential for gleaning valuable information about entities and their relationships from unstructured text and for creating a structured representation of that text for downstream Natural Language Processing (NLP) tasks, including question answering, text summarization, and knowledge graph construction. Supervised Machine Learning (ML) techniques have been widely used in IE. While the resulting extraction algorithms are very effective, they require a large amount of annotated data, which is expensive and time-consuming to create. Additionally, creating high-quality gold-standard annotations can be challenging, particularly for new domains or languages that lack sufficient resources to facilitate annotation.
This dissertation develops minimally supervised approaches to extracting Named Entities (NEs) from text. It specifically addresses the challenges that arise in distantly supervised NE extraction, in which domain-specific dictionaries are used to automatically match and label text, and the labeled data is subsequently used to train an ML model for the extraction task. A key challenge in learning an effective model under distant supervision is the incompleteness of the dictionaries, which can result in incomplete, partial, or noisy annotations. When annotations are incomplete or missing, training a sequence labeling model for Named Entity Recognition (NER) may result in suboptimal learning. To address these challenges, I propose novel approaches to improving dictionary coverage that use a state-of-the-art phrase extraction technique and a domain-specific dictionary to extract phrases from unlabeled text. Leveraging the lexical, syntactic, and contextual features of the entities present in the initial dictionaries, I propose headword-based and span-based classification approaches to categorize the extracted phrases into their corresponding entity classes.
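To illustrate the dictionary-matching step of distant supervision in general terms, the following is a minimal sketch and not the dissertation's implementation. It assumes a greedy longest-match lookup, BIO tags, case-insensitive matching, and a toy dictionary; the function name `distant_label`, the `max_len` parameter, and the example entries are all hypothetical.

```python
def distant_label(tokens, dictionary, max_len=5):
    """Assign BIO tags by greedy longest-match lookup (illustrative assumption).

    `dictionary` maps tuples of lowercase tokens to an entity type.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            label = dictionary.get(key)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical toy dictionary; "ibuprofen" is deliberately missing.
toy_dictionary = {
    ("aspirin",): "DRUG",
    ("myocardial", "infarction"): "DISEASE",
}
tokens = "Aspirin and ibuprofen after myocardial infarction".split()
tags = distant_label(tokens, toy_dictionary)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

In this toy run, "ibuprofen" receives the tag `O` only because it is absent from the dictionary, which is the kind of missing annotation the abstract describes.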
The span-based approach generates synthetic negative examples for cases of incomplete and partial annotation and uses them to train the model. Rather than considering all possible non-entity spans or randomly sampling negative spans, as existing span-based NER models do, the proposed approach uses stratified negative sampling to improve the model’s effectiveness and its ability to learn entity boundaries accurately. Experimental evaluation on standard benchmark datasets demonstrates that the proposed approach achieves state-of-the-art performance compared to baseline methods.
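The abstract does not specify how the negative spans are stratified, so the following is only a rough sketch of one possible scheme. It assumes two strata: spans that overlap an annotated entity (boundary-level hard negatives) and spans that are disjoint from every annotated entity. All names (`enumerate_spans`, `stratum`, `sample_negatives`, `per_stratum`) are hypothetical, and spans are half-open token ranges `(start, end)`.

```python
import random


def enumerate_spans(n_tokens, max_len):
    """All half-open spans (start, end) of at most max_len tokens."""
    return [
        (s, e)
        for s in range(n_tokens)
        for e in range(s + 1, min(s + max_len, n_tokens) + 1)
    ]


def stratum(span, entities):
    """Assumed strata: 'overlap' if the span intersects any entity, else 'disjoint'."""
    s, e = span
    for es, ee in entities:
        if s < ee and es < e:
            return "overlap"
    return "disjoint"


def sample_negatives(n_tokens, entities, per_stratum, max_len=4, seed=0):
    """Draw a fixed number of non-entity spans from each stratum."""
    rng = random.Random(seed)
    positives = set(entities)
    buckets = {"overlap": [], "disjoint": []}
    for span in enumerate_spans(n_tokens, max_len):
        if span not in positives:
            buckets[stratum(span, entities)].append(span)
    sampled = {}
    for name, spans in buckets.items():
        k = min(per_stratum.get(name, 0), len(spans))
        sampled[name] = rng.sample(spans, k)
    return sampled


# Six tokens with one annotated entity covering tokens 4 and 5.
entities = [(4, 6)]
negatives = sample_negatives(6, entities, {"overlap": 2, "disjoint": 3})
print(negatives)
```

Under incomplete annotation, a span in the "disjoint" stratum may in fact be an unlabeled entity; this sketch does not attempt to detect or down-weight such cases.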
Michael Raymer, Ph.D. (Advisor)
Krishnaprasad Thirunarayan, Ph.D. (Advisor)
Tanvi Banerjee, Ph.D. (Committee Member)
Charese Smiley, Ph.D. (Committee Member)
111 p.

Recommended Citations

  • Sain, J. P. (2023). Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries [Doctoral dissertation, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903

    APA Style (7th edition)

  • Sain, Joy. Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries. 2023. Wright State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903.

    MLA Style (8th edition)

  • Sain, Joy. "Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries." Doctoral dissertation, Wright State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903

    Chicago Manual of Style (17th edition)