Search Results

(Total results 14)

  • 1. Cui, Licong Ontology-guided Health Information Extraction, Organization, and Exploration

    Doctor of Philosophy, Case Western Reserve University, 2014, EECS - Computer and Information Sciences

    Electronic information in unstructured or semi-structured form in health and healthcare has been steadily generated for decades. An explosive growth has occurred since the recent adoption of electronic health records (EHRs). Textual information includes clinical notes recorded in hospitals and health-related information on the web. Such health-related textual data contains an extraordinary amount of underutilized biomedical knowledge. However, the proliferation of such data presents a myriad of challenges for information retrieval and access. Manual review of protected clinical documents to find patient cohorts of interest is a time-consuming and cumbersome task. Consumers have also been overwhelmed by the ever-growing public health information on the Internet. Traditional keyword-based search engines such as Google can return hundreds of thousands of links, though only a few may be relevant. Hence, effective querying and exploration of both protected and public health data require new approaches for information extraction, organization, and exploration. This dissertation proposes an ontology-guided approach to health information extraction, organization, and exploration. This approach allows the extraction of key information from textual data, its organization in structured formats, and the provision of interfaces for effective search and exploration. The approach is applied to two independent but related domains: (1) extracting complex epilepsy phenotypes from narrative clinical discharge summaries for effectively querying patient cohorts; (2) organizing information based on biomedical concepts extracted from consumer health questions in NetWellness, an online non-profit community service providing high-quality health information, to support effective consumer health information retrieval and exploration. For (1), a prototype Epilepsy Data Extraction and Annotation (EpiDEA) system is developed for effective processing of discharge summaries, where patients' (open full item for complete abstract)

    Committee: Guo-Qiang Zhang (Advisor) Subjects: Computer Science
  • 2. Jiang, Hongbo INFORMATION SEARCH AND EXTRACTION IN WIRELESS AD HOC NETWORKS

    Doctor of Philosophy, Case Western Reserve University, 2008, Computing and Information Science

    Wireless ad hoc networks consist of a set of autonomous nodes which spontaneously create impromptu communication links and then collectively perform a task with little help from centralized servers or established infrastructure. Because of the stringent constraints on system resources, as well as highly dynamic and even lossy wireless communication environments, wireless ad hoc networks face challenges in providing information search and extraction with good scalability and efficiency. Scalability and efficiency are assessed mainly with respect to the metric of network communication cost. This thesis explores these issues through the development of novel approaches in two types of emerging wireless networks: mobile ad hoc networks and sensor networks. We study the information search problem using wireless mobile ad hoc networks as an example. We propose adaptive strategies that combine both proactive advertising by the servers and on-demand discovery by the mobile hosts. These adaptive strategies determine the relative rate of proactive advertising and on-demand discovery according to system characteristics such as host mobility level and offered load. We study the information extraction problem using wireless sensor networks as an example. First, a scalable and robust data aggregation algorithm for information extraction is proposed to obtain the overall data distribution. This data aggregation algorithm exploits a mixture model and the expectation maximization algorithm for parameter estimation (see the illustrative sketch after this results list). Second, we present energy-aware prediction models, analyze the performance tradeoff between reducing communication cost and limiting prediction cost, and design algorithms to exploit the benefit of energy-aware prediction. Throughout our study, our efforts focused on dealing with various challenges introduced by dynamic network environments and stringent resource constraints. This thesis shows that with our efforts the communication cost can be significantly red (open full item for complete abstract)

    Committee: Shudong Jin (Committee Chair); Meral Ozsoyoglu (Committee Member); Michael Robinovich (Committee Member); Wei Lin (Committee Member) Subjects: Computer Science
  • 3. Flaute, Dylan Template-Based Document Information Extraction Using Neural Network Keypoint Filtering

    Master of Science in Electrical Engineering, University of Dayton, 2024, Electrical and Computer Engineering

    Documents like invoices, receipts, and forms are essential to many modern business operations. We develop a system for autonomously processing common United States Air Force contract front forms. The system takes in a form and extracts a key-value pair for each box in the form. This task is called key information extraction. In a structured document, the layout is the same from instance to instance (perhaps allowing for rigid transforms). Our documents are semi-structured because, although their layouts are similar, some of the content may be in slightly different places between instances of the form. This makes information extraction harder because the response regions may be in different places from form to form. We demonstrate that, despite the added difficulty, template matching and registration make for a strong baseline on our semi-structured forms. Additionally, we propose a filtering approach for keypoints based on their position in the layout. Specifically, we use a trained U-Net model to identify intersections and end-points in the form's "wire-frame." Then, the pipeline only uses keypoints that are close to those landmarks (see the illustrative sketch after this results list). We demonstrate that this method improves registration quality over our baseline, results in a more intuitive distribution of keypoints across the image, and potentially speeds up processing since fewer keypoints need matching.

    Committee: Russell Hardie (Advisor); Barath Narayanan (Committee Member); Vijayan Asari (Committee Member) Subjects: Electrical Engineering
  • 4. Sain, Joy Reliable Named Entity Recognition Using Incomplete Domain-Specific Dictionaries

    Doctor of Philosophy (PhD), Wright State University, 2023, Computer Science and Engineering PhD

    Information Extraction (IE) techniques are essential to gleaning valuable information about entities and their relationships from unstructured text and creating a structured representation of the text for downstream Natural Language Processing (NLP) tasks, including question answering, text summarization, and knowledge graph construction. Supervised Machine Learning (ML) techniques have been widely used in IE. While the resulting extraction algorithms are very effective, they require a large amount of annotated data, which can be expensive to acquire and time-consuming to create. Additionally, creating high-quality gold-standard annotations can be challenging, particularly when dealing with new domains or languages that lack sufficient resources to facilitate annotation. This dissertation develops minimally supervised approaches to extract Named Entities (NEs) from text, specifically addressing the challenges of distantly supervised NE extraction, in which domain-specific dictionaries are used to automatically match and assign labels to data that can subsequently be used to train an ML model for the extraction task. A key challenge in learning an effective ML model with distant supervision is the incompleteness of the dictionaries being used, which can result in incomplete, partial, or noisy annotations. In the case of incomplete or missing annotations, training a sequence labeling model for NER may result in suboptimal learning. To address these challenges, in this dissertation I propose novel approaches to improve dictionary coverage that use a state-of-the-art phrase extraction technique and a domain-specific dictionary to extract phrases from unlabeled text data. Leveraging the lexical, syntactic, and contextual features of the entities present in the initial dictionaries, I propose headword- and span-based classification approaches to categorize the extracted phrases into corresponding entity classes. Th (open full item for complete abstract)

    Committee: Michael Raymer Ph.D. (Advisor); Krishnaprasad Thirunarayan Ph.D. (Advisor); Tanvi Banerjee Ph.D. (Committee Member); Charese Smiley Ph.D. (Committee Member) Subjects: Artificial Intelligence; Computer Science
  • 5. Sarkhel, Ritesh Data Preparation from Visually Rich Documents

    Doctor of Philosophy, The Ohio State University, 2022, Computer Science and Engineering

    Modern information sources are heterogeneous in nature. They utilize a number of modalities to disseminate information effectively. Visually rich documents typify such an information source. A visually rich document refers to a physical or digital document that uses visual cues along with linguistic features to augment or highlight its semantics. Traditional data preparation solutions are inefficient at harvesting knowledge from these sources because they do not take their multimodality into account. They are also cumbersome in terms of the amount of human effort required in their end-to-end workflow. We describe algorithmic solutions for two fundamental data preparation tasks, namely information extraction and data integration, for visually rich documents. For both tasks, the core element of our solution is a fundamental machine-learning problem: how to represent heterogeneous documents with diverse layouts and/or formats in a unified way? We develop efficient solutions for both tasks on the bedrock of this representation learning problem. In the first part of this dissertation, we describe Artemis, a machine-learning model to extract structured records from visually rich documents. It identifies named entities by representing each visual span as a multimodal feature vector and subsequently classifying it as one of the target fields to be extracted. It is a generalized information extraction method, i.e., it does not utilize any prior knowledge about the layout or format of the document in its end-to-end workflow. We describe two utility functions that aid this machine-learning model: VS2, a visual segmentation algorithm that encodes the local context, and LadderNet, a convolutional network that encodes document-specific discriminative features in a visual span representation. We establish the efficacy of our machine-learning model on a number of different datasets. We investigate the robustness of our extraction model on an extreme case of our usability spectrum. In th (open full item for complete abstract)

    Committee: Arnab Nandi (Advisor); Srinivasan Parthasarathy (Committee Member); Eric Fosler-Lussier (Committee Member); Jay Gupta (Committee Member) Subjects: Computer Science; Information Science
  • 6. Al-Olimat, Hussein Knowledge-Enabled Entity Extraction

    Doctor of Philosophy (PhD), Wright State University, 2019, Computer Science and Engineering PhD

    Information Extraction (IE) techniques are developed to extract entities, relationships, and other detailed information from unstructured text. The majority of methods in the literature focus on designing supervised machine learning techniques, which are not very practical due to the high cost of obtaining annotations and the difficulty of creating a high-quality (in terms of reliability and coverage) gold standard. Therefore, semi-supervised and distantly supervised techniques have been getting more traction lately to overcome some of these challenges, such as bootstrapping the learning quickly. This dissertation focuses on information extraction, and in particular on entities, i.e., Named Entity Recognition (NER), across multiple domains, including social media and other grammatical texts such as news and medical documents. This work explores ways to lower the cost of building NER pipelines with the help of available knowledge, without compromising the quality of extraction, while taking into consideration feasibility and other concerns such as user experience. I present distantly supervised (dictionary-based), supervised (with reduced cost using entity set expansion and active learning), and minimally supervised NER approaches. In addition, I discuss the various aspects of knowledge-enabled NER approaches and how and why they are a better fit for today's real-world NER pipelines in dealing with, and partially overcoming, the above-mentioned difficulties. I present two dictionary-based NER approaches. The first technique extracts location mentions from text streams and proved very effective for stream processing, with competitive performance in comparison with ten other techniques. The second is a generic NER approach that scales to multiple domains and is minimally supervised with a human-in-the-loop for online feedback. The two techniques augment and filter the dictionaries to compensate for their incompleteness (due to lexical variat (open full item for complete abstract)

    Committee: Krishnaprasad Thirunarayan Ph.D. (Advisor); Keke Chen Ph.D. (Committee Member); Guozhu Dong Ph.D. (Committee Member); Steven Gustafson Ph.D. (Committee Member); Srinivasan Parthasarathy Ph.D. (Committee Member); Valerie L. Shalin Ph.D. (Committee Member) Subjects: Artificial Intelligence; Computer Science
  • 7. Perera, Pathirage Knowledge-driven Implicit Information Extraction

    Doctor of Philosophy (PhD), Wright State University, 2016, Computer Science and Engineering PhD

    Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive usage and flexibility of the language, the creativity of human beings, and the social, cultural, and economic changes that have taken place in daily life have added new constructs, styles, and features to the language. One such feature is its ability to express ideas, opinions, and facts in an implicit manner. This feature is used extensively in day-to-day communication, for example: 1) when expressing sarcasm, 2) when trying to recall forgotten things, 3) when required to convey descriptive information, 4) when emphasizing the features of an entity, and 5) when communicating a common understanding. Consider the tweet “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying” and the text snippet extracted from a clinical narrative “He is suffering from nausea and severe headaches. Dolasteron was prescribed”. The tweet has an implicit mention of the entity “Gravity”, and the clinical text snippet has an implicit mention of the relationship between the medication “Dolasteron” and the clinical condition “nausea”. Such implicit references to entities and relationships are common occurrences in daily communication, and they add value to conversations. However, extracting implicit constructs has not received enough attention in the information extraction literature. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from tweets. When people use implicit constructs in their daily communication, they assume the existence of shared knowledge with the audience about the subject being discussed. This shared knowledge helps to decode implicitly conveyed information. For example, the above Twitter user assumed that his/her audience knows that the actress “Sandra Bullock” starred in the movie “Gravity” and that it is a movie about space exploration. (open full item for complete abstract)

    Committee: Amit Sheth Ph.D. (Advisor); Krishnaprasad Thirunarayan Ph.D. (Committee Member); Michael Raymer Ph.D. (Committee Member); Pablo Mendes Ph.D. (Committee Member) Subjects: Computer Science
  • 8. Jayawardhana, Udaya An ontology-based framework for formulating spatio-temporal influenza (flu) outbreaks from twitter

    Master of Science (MS), Bowling Green State University, 2016, Applied Geospatial Science

    Early detection and locating of influenza outbreaks is one of the key priorities at the national level for preparedness and planning. This study presents the design and implementation of a web-based prototype software framework (Fluwitter) for pseudo real-time detection of influenza outbreaks from Twitter in space and time. Harnessing social media to track real-time influenza outbreaks can provide a different perspective in battling the spread of infectious diseases and lower the cost of existing assessment methods. Specifically, Fluwitter follows a three-tier architecture with a thin web client and a resourceful server environment. The server-side system is composed of a PostGIS spatial database, a GeoServer instance, a web application for visualizing influenza maps, and daemon applications for tweet streaming, pre-processing of data, semantic information extraction based on DBpedia Spotlight and WS4J, and geo-processing. The collected geo-tagged tweets are processed with semantic NLP techniques to detect and extract influenza-related tweets. The synsets from the extracted influenza-related tweets are tagged, and ontology-based semantic similarity scores produced by the WUP and RES algorithms are derived for subsequent information extraction. To ensure better detection, the information extraction was calibrated with different rules produced from the semantic similarity scores. The optimized rule produced a final F-measure value of 0.72 and an accuracy (ACC) value of 94.4%. The Twitter-generated influenza cases were validated against weekly influenza-related hospitalization records issued by ODH. The validation, based on Pearson's correlations, suggested moderate correlations for the Southeast region (r = 0.52), the Northwestern region (r = 0.38), and the Central region (r = 0.33); a small illustrative sketch of this correlation check appears after the results list. Although additional work is needed, the potential strengths and benefits of the prototype are shown through a case study in Ohio that enables spatio-temporal assessment a (open full item for complete abstract)

    Committee: Peter Gorsevski Dr (Advisor); Jeffrey Snyder Dr (Committee Member); Sheila Roberts Dr (Committee Member) Subjects: Computer Science; Geographic Information Science
  • 9. Johnson, Eamon Methods in Text Mining for Diagnostic Radiology

    Doctor of Philosophy, Case Western Reserve University, 2016, EECS - Computer and Information Sciences

    Information extraction from clinical medical text is a challenge in computing: bringing structure to the prose produced for communication in medical practice. In diagnostic radiology, prose reports are the primary means of communicating image interpretations to patients and other physicians, yet secondary use of a report requires either costly review by another radiologist or machine interpretation. In this work, we present mechanisms for improving machine interpretation of domain-specific text with large-scale semantic analysis, using a corpus of 726,000 real-world radiology reports as a basis for experimentation. We examine the abstract conceptual problem of detecting incidental findings (uncertain or unexpected results) in imaging study reports. We demonstrate that classifiers incorporating semantic metrics can outperform the F-measure of prior methods for follow-up classification and also outperform the F-measure of incidental-findings classification by physicians in clinic (0.689 versus 0.648). Further, we propose two semantic metrics, focus and divergence, calculated over the SNOMED-CT ontology graph, for summarization and projection of discrete report concepts into a 2-dimensional space, which enables both machine classification and physician interpretation of classifications. With an understanding of the utility of semantic metrics for classification, we present methods for enhancing the extraction of semantic information from clinical corpora. First, we construct a zero-knowledge method for imputing the semantic class of unlabeled terms through maximization of a confidence factor computed from pairwise co-occurrence statistics and rules limiting recall. Experiments with our method on corpora of reduced Mandelbrot information temperature produce accurate labeling of up to 25% of terms not labeled by prior methods. Second, we propose a method for context-sensitive quantification of relative concept salience and an algorithm capable of increasing both salienc (open full item for complete abstract)

    Committee: Gultekin Ozsoyoglu (Committee Chair); Marc Buchner (Committee Member); Adam Perzynski (Committee Member); Andy Podgurski (Committee Member) Subjects: Computer Science
  • 10. John, Zubin Predicting Day-Zero Review Ratings: A Social Web Mining Approach

    Master of Science, The Ohio State University, 2015, Computer Science and Engineering

    Social Web Mining is a term closely associated with modern-day use of the Internet. With large Internet companies such as Google, Apple, and IBM moving towards integrating intelligence into their product ecosystems, a large number of different applications have appeared in the social sphere. With the aid of machine learning techniques, there is no dearth of learning possible from endless streams of user-generated content. One task in this domain that has seen relatively little research is predicting review scores prospectively, i.e., prior to the release of the entity in question: a movie, electronic product, game, or book. It is easy to locate this chatter on social streams such as Twitter; what is difficult is extracting relevant information and facts about these entities, and even more so predicting the Day-Zero review rating scores that provide insightful information about these products prior to their release. In this thesis, we propose such a framework: a setup capable of extracting facts about reviewable entities. Populating a list of potential objects for a year, we follow an approach similar to bootstrapping in order to learn relevant facts about these prospective entities, all geared towards the task of learning to predict scores in a machine learning setting. Towards the end goal of predicting review scores for potential products, our system supports alternative strategies that perform competitively on the task. The predictions from the learning framework, within a certain allowable error margin, output scores comparable to human judgment. The results bode well for potential large-scale predictive tasks on real-time data streams; in addition, this framework proposes alternative feature spaces which, in aggregation, describe a multi-method approach to achieving higher accuracy on tasks that have previously seen lackluster results.

    Committee: Alan Ritter (Advisor); Eric Fosler-Lussier (Committee Member) Subjects: Artificial Intelligence; Computer Science
  • 11. Althuru, Dharan Kumar Reddy Distributed Local Trust Propagation Model and its Cloud-based Implementation.

    Master of Science (MS), Wright State University, 2014, Computer Science

    The World Wide Web has grown rapidly in the last two decades with user-generated content and interactions. Trust plays an important role in providing personalized content recommendations and in improving our confidence in various online interactions. We review trust propagation models in the context of social networks, the semantic web, and recommender systems. With the objective of making trust propagation models more flexible, we propose several extensions that can be implemented as configurable parameters in the system. We implement the Local Partial Order Trust (LPOT) model, which considers both trust and distrust ratings, and perform an evaluation on the Epinions.com dataset to demonstrate the improvement in recommendations obtained by incorporating trust models. We also evaluate the performance of trust propagation models and motivate the need for a scalable solution. In addition to variety, real-world applications need to deal with the volume and velocity of data; hence, scalability and performance are extremely important. We review techniques for large-scale graph processing and propose distributed trust-aware recommender architectures that can be selected based on application needs. We develop a distributed local partial order trust model compatible with Pregel (a system for large-scale graph processing) and implement it using Apache Giraph on a Hadoop cluster (a minimal vertex-centric sketch of this propagation pattern appears after this results list). This model computes trust inference ratings, in parallel, for all users accessible within a configured depth of every other user in the network. We provide experimental results illustrating the scalability of this model with the number of nodes in the cluster as well as the network size. This enables applications operating at large scale to integrate trust propagation models.

    Committee: Krishnaprasad Thirunarayan Ph.D. (Advisor); Keke Chen Ph.D. (Committee Member); Meilin Liu Ph.D. (Committee Member) Subjects: Computer Science
  • 12. Thomas, Christopher Knowledge Acquisition in a System

    Doctor of Philosophy (PhD), Wright State University, 2012, Computer Science and Engineering PhD

    I present a method for growing the amount of knowledge available on the Web using a hermeneutic method that involves background knowledge, Information Extraction techniques, and validation through discourse and use of the extracted information. I present the metaphor of the "Circle of Knowledge on the Web". In this context, knowledge acquisition on the web is seen as analogous to the way scientific disciplines gradually increase the knowledge available in their field. Here, formal models of interest domains are created automatically or manually and then validated by implicit and explicit validation methods before the statements in the created models can be added to larger knowledge repositories, such as the Linked Open Data cloud. This knowledge is then available for the next iteration of the knowledge acquisition cycle. I give both a theoretical underpinning and practical methods for the acquisition of knowledge in collaborative systems, covering both the Knowledge Engineering angle and the Information Extraction angle of this problem. Unlike traditional approaches, however, this dissertation shows how Information Extraction can be incorporated into a mostly Knowledge Engineering based approach, as well as how an Information Extraction-based approach can make use of engineered concept repositories. Validation is seen as an integral part of this systemic approach to knowledge acquisition. The centerpiece of the dissertation is a domain model extraction framework that implements the idea of the "Circle of Knowledge" to automatically create semantic models for domains of interest. It splits the involved Information Extraction tasks into Domain Definition, in which pertinent concepts are identified and categorized, and Domain Description, in which facts describing the extracted concepts are extracted from free text. I then outline a social computing strategy for information validation in order to create knowledge from the (open full item for complete abstract)

    Committee: Amit Sheth PhD (Advisor); Pankaj Mehra PhD (Committee Member); Shaojun Wang PhD (Committee Member); Pascal Hitzler PhD (Committee Member); Gerhard Weikum PhD (Committee Member) Subjects: Artificial Intelligence; Computer Science; Information Science
  • 13. Ramakrishnan, Cartic Extracting, Representing and Mining Semantic Metadata from Text: Facilitating Knowledge Discovery in Biomedicine

    Doctor of Philosophy (PhD), Wright State University, 2008, Computer Science and Engineering PhD

    The information access paradigm offered by most contemporary text information systems is a search-and-sift paradigm, where users have to manually glean and aggregate relevant information from the large number of documents that are typically returned in response to keyword queries. Expecting users to glean and aggregate information has led to several inadequacies in these information systems. Owing to the size of many text databases, search-and-sift is a very tedious process, often requiring repeated keyword searches that refine or generalize query terms. A more serious limitation arises from the lack of automated mechanisms to aggregate content across different documents to discover new knowledge. This dissertation focuses on processing text to assign semantic interpretations to its content (extracting semantic metadata) and on the design of algorithms and heuristics to utilize the extracted semantic metadata to support knowledge discovery operations over text content. Contributions in extracting semantic metadata in this dissertation cover the extraction of compound entities and complex relationships connecting entities. Extraction results are represented using a standard Semantic Web representation language (RDF) and are manually evaluated for accuracy. The knowledge discovery algorithms presented herein operate on RDF data. To further improve access mechanisms to text content, applications supporting semantic browsing and semantic search of text are presented.

    Committee: Amit Sheth PhD (Advisor); Michael Raymer PhD (Committee Member); Shaojun Wang PhD (Committee Member); Guozhu Dong PhD (Committee Member); Thaddeaus Tarpey PhD (Committee Member); Vasant Honavar PhD (Committee Member) Subjects: Computer Science
  • 14. Nepal, Srijan Linguistic Approach to Information Extraction and Sentiment Analysis on Twitter

    MS, University of Cincinnati, 2012, Engineering and Applied Science: Computer Science

    Social media sites are among the most popular destinations in today's online world. Millions of users visit social networking sites like Facebook, YouTube, and Twitter every day to share the content at their disposal, from simple textual information about what they are doing at any moment, to opinions regarding products, people, events, and movies, to videos and music; these sites have therefore become massive sources of user-generated content. In this work we focus on one such social networking site, Twitter, for the tasks of information extraction and sentiment analysis. This work presents a linguistic framework that first performs syntactic normalization of tweets on top of traditional data cleaning, extracts assertions from each tweet in the form of binary relations, and creates a contextualized knowledge base (KB). We then present a Language Model (LM) based classifier, trained on a small manually tagged corpus, that performs sentence-level sentiment analysis on the collected assertions to eventually create a KB backed by sentiment values. We use this approach to implement a contextualized, sentiment-based yes/no question answering system.

    Committee: Kenneth Berman PhD (Committee Chair); Fred Annexstein PhD (Committee Member); Anca Ralescu PhD (Committee Member) Subjects: Computer Science
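
Code sketch for result 2 (Jiang): the sensor-network data aggregation described in that abstract fits a mixture model with the expectation maximization (EM) algorithm. The following is a minimal, centralized illustration of EM for a two-component 1-D Gaussian mixture, not the dissertation's distributed algorithm; the simulated readings, component count, and variable names are assumptions made only for this example.

```python
# Minimal, centralized sketch of EM for a two-component 1-D Gaussian mixture,
# illustrating the kind of parameter estimation applied to sensor readings.
# NOT the dissertation's distributed algorithm; data and names are invented.
import numpy as np

rng = np.random.default_rng(0)
readings = np.concatenate([rng.normal(20.0, 2.0, 300),   # e.g., cooler-region sensors
                           rng.normal(35.0, 3.0, 200)])  # e.g., warmer-region sensors

# Initial guesses for mixing weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([15.0, 40.0])
var = np.array([5.0, 5.0])

def gaussian(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: posterior responsibility of each component for each reading.
    dens = np.stack([w[k] * gaussian(readings, mu[k], var[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate weights, means, and variances from responsibilities.
    n_k = resp.sum(axis=1)
    w = n_k / len(readings)
    mu = (resp * readings).sum(axis=1) / n_k
    var = (resp * (readings - mu[:, None]) ** 2).sum(axis=1) / n_k

print("weights:", w.round(3), "means:", mu.round(2), "variances:", var.round(2))
```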
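Code sketch for result 3 (Flaute): the thesis filters registration keypoints by keeping only those close to layout landmarks (wire-frame intersections and end-points) predicted by a U-Net. The sketch below shows only the distance-based filtering step under assumed inputs; the function name, distance threshold, and toy coordinates are hypothetical, and the U-Net landmark detector itself is not included.

```python
# Hedged sketch of landmark-based keypoint filtering: keep only keypoints that
# lie near layout landmarks predicted by a segmentation model. The detector and
# keypoint source are stubbed out; threshold and data are illustrative only.
import numpy as np

def filter_keypoints(keypoints_xy: np.ndarray,
                     landmarks_xy: np.ndarray,
                     max_dist: float = 15.0) -> np.ndarray:
    """Return the subset of keypoints within max_dist pixels of any landmark."""
    if len(landmarks_xy) == 0:
        return keypoints_xy[:0]
    # Pairwise distances between every keypoint and every landmark.
    d = np.linalg.norm(keypoints_xy[:, None, :] - landmarks_xy[None, :, :], axis=2)
    return keypoints_xy[d.min(axis=1) <= max_dist]

# Toy example: three detected keypoints, two predicted landmarks.
kps = np.array([[10.0, 12.0], [200.0, 50.0], [400.0, 300.0]])
lms = np.array([[12.0, 10.0], [395.0, 305.0]])
print(filter_keypoints(kps, lms))   # keeps the first and third keypoints
```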
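Code sketch for result 8 (Jayawardhana): the Twitter-derived influenza counts were validated against weekly hospitalization records using Pearson's correlation. Below is a minimal illustration of that validation computation; the weekly counts are invented placeholders, not the study's data.

```python
# Hedged sketch of the validation step: Pearson's correlation between weekly
# Twitter-derived influenza counts and weekly hospitalization counts.
# The counts below are placeholders, not data from the study.
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson's r = cov(x, y) / (std(x) * std(y)), computed on centered data."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

twitter_counts  = np.array([12, 18, 25, 30, 22, 15, 10, 8], dtype=float)
hospital_counts = np.array([ 9, 14, 20, 28, 25, 17, 12, 9], dtype=float)
print(round(pearson_r(twitter_counts, hospital_counts), 2))
```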
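Code sketch for result 11 (Althuru): the distributed trust model is implemented in a Pregel-style, vertex-centric fashion on Apache Giraph. The sketch below simulates the superstep message-passing pattern for depth-limited trust propagation on a single machine; it is not the LPOT or Giraph implementation, and the toy graph, decay factor, and depth limit are assumptions for illustration.

```python
# Hedged sketch of vertex-centric (Pregel-style) trust propagation. Each
# superstep, vertices forward discounted trust messages along outgoing edges,
# up to a configured depth. Single-machine simulation of the pattern only;
# the graph, decay factor, and depth limit are made up for this example.
MAX_DEPTH = 3
DECAY = 0.9

# Directed edges with direct trust ratings in [0, 1]: graph[u] = {v: rating}.
graph = {"a": {"b": 0.8, "c": 0.6}, "b": {"d": 0.9}, "c": {"d": 0.4}, "d": {}}

def propagate(source: str) -> dict:
    trust = {source: 1.0}
    messages = {source: 1.0}                # messages carry inferred trust so far
    for _ in range(MAX_DEPTH):              # one iteration per superstep
        next_messages = {}
        for u, t_u in messages.items():
            for v, rating in graph[u].items():
                inferred = t_u * rating * DECAY
                # Keep the strongest inferred trust per target vertex.
                if inferred > next_messages.get(v, 0.0):
                    next_messages[v] = inferred
        for v, t in next_messages.items():
            if t > trust.get(v, 0.0):
                trust[v] = t
        messages = next_messages
    return trust

print(propagate("a"))   # e.g., {'a': 1.0, 'b': 0.72, 'c': 0.54, 'd': ~0.58}
```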