Search Results

(Total results 2)

  • 1. DePero, Andrew. Schemalysis: Visualization of a Sub-Schemas in Document NoSQL Databases

    Master of Computer Science, Miami University, 2022, Computer Science and Software Engineering

    NoSQL database systems are useful for managing large and diverse data sets associated with Big Data. Highly diverse data sets contain data with different structures, but often there are no readily available schemas describing the structures. The lack of a uniform structure for data may make it difficult to understand and query a database. Recent research and industry software tools extract some aspects of the structures inherent in a NoSQL database; most tools provide a schema that gives the union of attributes across all objects, termed a union schema. Some provide sample values for attributes. We present Schemalysis, a tool for analyzing and displaying the sub-schemas of a document NoSQL database along with example instances. The web application implements an algorithm that reads objects and detects the individual sub-schema of each document in a document database, as well as the database's union schema. We also conduct three different case studies to validate the functionality of Schemalysis with real-world data and compare it with existing tools for extracting schemas.

    Committee: Karen Davis (Advisor); Alan Ferrenberg (Committee Member); James Kiper (Committee Member)
    Subjects: Computer Science
  • 2. Sarkhel, Ritesh. Data Preparation from Visually Rich Documents

    Doctor of Philosophy, The Ohio State University, 2022, Computer Science and Engineering

    Modern information sources are heterogeneous in nature. They utilize a number of modalities to disseminate information effectively. Visually rich documents typify such an information source. A visually rich document refers to a physical or digital document that uses visual cues along with linguistic features to augment or highlight its semantics. Traditional data preparation solutions are inefficient at harvesting knowledge from these sources as they do not take their multimodality into account. They are also cumbersome in terms of the amount of human effort required in their end-to-end workflow. We describe algorithmic solutions for two fundamental data preparation tasks, namely information extraction and data integration, for visually rich documents. For both tasks, the core element of our solution is a fundamental machine-learning problem – how to represent heterogeneous documents with diverse layouts and/or formats in a unified way? We develop efficient solutions for both tasks on the bedrock of this representation learning problem. In the first part of this dissertation, we describe Artemis – a machine-learning model to extract structured records from visually rich documents. It identifies named entities by representing each visual span as a multimodal feature vector and subsequently classifying it as one of the target fields to be extracted. It is a generalized information extraction method, i.e., it does not utilize any prior knowledge about the layout or format of the document in its end-to-end workflow. We describe two utility functions that aid this machine-learning model – VS2, a visual segmentation algorithm that encodes the local context, and LadderNet, a convolutional network that encodes document-specific discriminative features in a visual span representation. We establish the efficacy of our machine-learning model on a number of different datasets. We investigate the robustness of our extraction model on an extreme case of our usability spectrum. In th [...]

    Committee: Arnab Nandi (Advisor); Srinivasan Parthasarathy (Committee Member); Eric Fosler-Lussier (Committee Member); Jay Gupta (Committee Member)
    Subjects: Computer Science; Information Science
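
The distinction drawn in the first abstract (DePero) between per-document sub-schemas and a union schema can be illustrated with a small Python sketch. This is not the Schemalysis algorithm, whose details the abstract does not give. It assumes that a document is a nested dictionary, that a sub-schema is simply the set of dotted attribute paths present in one document, and that the union schema is the set of all paths seen in any document. The function names `sub_schema` and `analyze` are hypothetical.

```python
from collections import Counter


def sub_schema(doc, prefix=""):
    """Return the attribute paths of one document as a frozenset.

    Assumption for illustration: nested dictionaries are flattened into
    dotted paths, and value types are ignored.
    """
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested objects.
            paths |= sub_schema(value, prefix=path + ".")
        else:
            paths.add(path)
    return frozenset(paths)


def analyze(docs):
    """Count how many documents share each sub-schema and build the union schema."""
    counts = Counter(sub_schema(d) for d in docs)
    # The union schema is the union of all attribute paths across sub-schemas.
    union = frozenset().union(*counts)
    return counts, union


if __name__ == "__main__":
    docs = [
        {"name": "a", "address": {"city": "x"}},
        {"name": "b"},
        {"name": "c", "address": {"city": "y"}},
    ]
    counts, union = analyze(docs)
    for schema, n in counts.items():
        print(sorted(schema), n)
    print("union:", sorted(union))
```

In this sketch the three sample documents yield two distinct sub-schemas, while the union schema alone would hide the fact that one document lacks `address.city`.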