Search Results

(Total results 2)

  • 1. Ebenstein, Roee. Supporting Advanced Queries on Scientific Array Data

    Doctor of Philosophy, The Ohio State University, 2018, Computer Science and Engineering

    Distributed scientific array data is becoming more prevalent and increasing in size, and there is a growing need for (performance in) advanced analytics over these data. In this dissertation, we focus on addressing issues to allow data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying, and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations that is closely related to the traditional joins performed on relational tables, including an operation we refer to as Mutual Range Joins (MRJ), which arises on scientific data that is not only numerical but also has measurement noise. While working closely with our colleagues to provide them usable analytics over array data, we uncovered a new type of analytical querying: analytics over windows with an inner window ordering (in contrast to the external window ordering, available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed. Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsetting, as well as complex analytical functions and joins). We focus on distributed data, and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI); this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries. For such complex optimization, we introduce methods and algorit (open full item for complete abstract)

    Committee: Gagan Agrawal (Advisor); Arnab Nandi (Committee Member); P Sadayappan (Committee Member)

    Subjects: Computer Science
  • 2. Xing, Haoyuan. Optimizing array processing on complex I/O stacks using indices and data summarization

    Doctor of Philosophy, The Ohio State University, 2021, Computer Science and Engineering

    Increasingly, the ability of human beings to understand the universe and ourselves depends on our ability to obtain and process data. With an explosion of data being generated every day, efficiently storing and querying such data, which is usually multidimensional and can be represented using an array data model, is increasingly vital. Meanwhile, as more and more powerful CPUs and accelerators are added to systems, most modern computing systems contain an increasingly complex I/O stack, ranging from traditional disk-based file systems to heterogeneous accelerators with individual memory spaces. Efficiently accessing such a complex I/O stack in array processing is essential to utilize the enormous computational power of modern computational platforms. One key to achieving such efficiency is identifying where the data is being generated or stored, and choosing appropriate representation and processing strategies accordingly. This dissertation focuses on optimizing array processing in such complex I/O stacks by studying these two fundamental questions: what data representation should be used, and where the data should be stored and processed. The two basic scenarios of scientific data analytics are considered one by one. The first half of the dissertation tackles the problem of efficiently processing array data post hoc and presents a compact array storage for disk-based data that integrates lossless value-based indexing. Such integrated indices improve the performance of value-based filtering operations by orders of magnitude without sacrificing storage size or accuracy. The dissertation then demonstrates how complex queries such as equal and similarity array joins can also be performed on such novel storage. The second half of the dissertation focuses on data generated in situ by simulations on accelerators, without storing the generated data. The system generates an improved bitmap representation on GPU to reduce the bandwidth bottleneck between host and accelerat (open full item for complete abstract)

    Committee: Rajiv Ramnath (Advisor); Gagan Agrawal (Advisor); Jason Blevins (Other); Yang Wang (Committee Member); Srinivasan Parthasarathy (Committee Member)

    Subjects: Computer Engineering; Computer Science