Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries

Sain, Joy Prakash

Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries.pdf (2.78 MB)

Author Info

Sain, Joy Prakash

ORCID® Identifier

http://orcid.org/0000-0001-5605-046X

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903

Year and Degree

2023, Doctor of Philosophy (PhD), Wright State University, Computer Science and Engineering PhD.

Abstract

Information Extraction (IE) techniques are essential to gleaning valuable information about entities and their relationships from unstructured text and creating a structured representation of the text for downstream Natural Language Processing (NLP) tasks, including question answering, text summarization, and knowledge graph construction. Supervised Machine Learning (ML) techniques have been widely used in IE. While the resulting extraction algorithms are very effective, they require a large amount of annotated data, which can be expensive to acquire and time-consuming to create. Additionally, creating high-quality gold-standard annotations can be challenging, particularly when dealing with new domains or languages that lack sufficient resources to facilitate annotation.
This dissertation develops minimally supervised approaches to extract Named Entities (NEs) from text. It specifically addresses the challenges of distantly supervised NE extraction, in which domain-specific dictionaries are used to automatically match and label text, and the labeled data are then used to train an ML model for the extraction task. A key challenge in learning an effective ML model under distant supervision is the incompleteness of the dictionaries, which can result in incomplete, partial, or noisy annotations. When annotations are incomplete or missing, training a sequence labeling model for Named Entity Recognition (NER) may result in suboptimal learning.
To address these challenges, in this dissertation, I propose novel approaches to improve dictionary coverage that utilize a state-of-the-art phrase extraction technique and a domain-specific dictionary to extract phrases from unlabeled text data. Leveraging the lexical, syntactic, and contextual features of the entities present in the initial dictionaries, I propose headword and span-based classification approaches to categorize the extracted phrases into corresponding entity classes.
The span-based approach involves generating synthetic negative examples for cases of incomplete and partial annotation and training the model on them. Rather than considering all possible non-entity spans or randomly sampling negative spans, as existing span-based NER models do, the proposed approach uses stratified negative sampling to improve the model's effectiveness and its ability to learn entity boundaries accurately. The experimental evaluation conducted on standard benchmark datasets demonstrates that the proposed approach achieves state-of-the-art performance when compared to baseline methods.
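The dictionary-matching step that distant supervision relies on can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `distant_label`, the toy dictionary, the greedy longest-match strategy, and the BIO tag scheme are all assumptions chosen for illustration. The example shows how an entity missing from the dictionary silently receives the non-entity tag, which is the kind of incomplete annotation the abstract describes.

```python
from typing import Dict, List, Tuple


def distant_label(tokens: List[str],
                  dictionary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup (illustrative only)."""
    tags = ["O"] * len(tokens)
    # Longest dictionary entry, measured in tokens (0 if the dictionary is empty).
    max_len = max((len(key) for key in dictionary), default=0)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in dictionary:
                label = dictionary[key]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical, deliberately incomplete dictionary: "ibuprofen" is missing.
toy_dictionary = {
    ("aspirin",): "DRUG",
    ("lung", "cancer"): "DISEASE",
}

tokens = "Aspirin and ibuprofen are not treatments for lung cancer".split()
tags = distant_label(tokens, toy_dictionary)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this toy run, "ibuprofen" is tagged `O` only because the dictionary does not list it. A sequence labeler trained on such output would treat it as a true negative, which is why the abstract proposes improving dictionary coverage and choosing negative examples carefully.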

Committee

Michael Raymer, Ph.D. (Advisor)
Krishnaprasad Thirunarayan, Ph.D. (Advisor)
Tanvi Banerjee, Ph.D. (Committee Member)
Charese Smiley, Ph.D. (Committee Member)

Pages

111 p.

Subject Headings

Artificial Intelligence; Computer Science

Keywords

Information Extraction; Natural Language Processing; Named Entity Recognition; Domain-Specific Dictionary; Distant Supervision

Recommended Citations

APA Style (7th edition)
Sain, J. P. (2023). Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries [Doctoral dissertation, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903

MLA Style (8th edition)
Sain, Joy. Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries. 2023. Wright State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903.

Chicago Manual of Style (17th edition)
Sain, Joy. "Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries." Doctoral dissertation, Wright State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=wright1693266174263903

Document number:

wright1693266174263903

Download Count:

246

Copyright Info

Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries by Joy Prakash Sain is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by Wright State University and OhioLINK.
