Doctor of Philosophy, The Ohio State University, 2018, Computer Science and Engineering
Distributed scientific array data is becoming more prevalent and growing in size, and there is an increasing need for performant advanced analytics over these data.
In this dissertation, we address issues that stand in the way of data management, efficient declarative querying, and advanced analytics over array data.
We formalize the semantics of array data querying, and introduce distributed querying abilities over these data.
We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general.
In addition, we introduce a class of operations closely related to the traditional joins performed on relational tables - including an operation we refer to as Mutual Range Joins (MRJ), which arises with scientific data that is not only numerical but also has measurement noise.
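To make the idea of joining noisy numerical values concrete, here is a minimal Python sketch. The abstract does not define MRJ, so the semantics used here are an assumption: two values match when each falls within the other's tolerance range, which reduces to their difference being no larger than the smaller tolerance. The function name `mutual_range_join` and its parameters are hypothetical and are not taken from the dissertation.

```python
from bisect import bisect_left, bisect_right


def mutual_range_join(left, right, left_tol, right_tol):
    """Pair values from two noisy numeric columns (assumed semantics).

    ASSUMPTION: a pair (a, b) matches when b lies in [a - left_tol, a + left_tol]
    AND a lies in [b - right_tol, b + right_tol]. Both conditions together are
    equivalent to |a - b| <= min(left_tol, right_tol).
    """
    tol = min(left_tol, right_tol)
    right_sorted = sorted(right)  # sort once so each probe is a binary search
    pairs = []
    for a in left:
        lo = bisect_left(right_sorted, a - tol)   # first b >= a - tol
        hi = bisect_right(right_sorted, a + tol)  # one past the last b <= a + tol
        pairs.extend((a, b) for b in right_sorted[lo:hi])
    return pairs


matches = mutual_range_join([10.0, 20.0], [10.25, 12.0, 19.5],
                            left_tol=0.5, right_tol=1.0)
print(matches)
```

Sorting one side and probing it with binary search keeps the sketch simple. It says nothing about how the dissertation actually executes such joins over distributed arrays.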
While working closely with our colleagues to provide them usable analytics over array data, we uncovered a new type of analytical querying - analytics over windows with an inner window ordering (in contrast to the external window ordering, available elsewhere).
Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed.
Several major contributions are introduced throughout this dissertation.
First, we formalize querying over scientific array data (basic operators, such as subsetting, as well as complex analytical functions and joins).
We focus on distributed data, and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI) - this framework is used in production environments.
Next, we present an optimization approach for join queries over geo-distributed data.
This approach considers networking properties such as throughput and latency to optimize the execution of join queries.
For such complex optimization, we introduce methods and algorit...
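The network-aware join optimization described above can be illustrated with a small Python sketch. It is not the dissertation's algorithm. It assumes a deliberately simple cost model in which shipping data over a link costs `latency + size / throughput`, all transfers to the chosen site run in parallel, and the join is executed at a single site. The names `Link`, `transfer_time`, and `choose_join_site` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Assumed network properties of one directed link between two sites."""
    latency_s: float
    throughput_bytes_per_s: float


def transfer_time(size_bytes, link):
    # Simple cost model (an assumption): fixed latency plus size over throughput.
    return link.latency_s + size_bytes / link.throughput_bytes_per_s


def choose_join_site(sizes, links):
    """Pick the site that minimizes the slowest incoming transfer.

    sizes: dict mapping site name -> bytes of join input stored at that site
    links: dict mapping (src, dst) -> Link
    Returns (site, estimated_seconds).
    """
    best = None
    for dst in sizes:
        # Every other site ships its input to dst; transfers are assumed parallel,
        # so the estimate is the maximum over all incoming transfers.
        cost = max(
            (transfer_time(sizes[src], links[(src, dst)])
             for src in sizes if src != dst),
            default=0.0,
        )
        if best is None or cost < best[1]:
            best = (dst, cost)
    return best


sizes = {"A": 1000, "B": 100}
links = {("A", "B"): Link(0.1, 1000.0), ("B", "A"): Link(0.1, 1000.0)}
site, seconds = choose_join_site(sizes, links)
print(site, seconds)
```

In this example, the estimate favors executing the join at site A, because shipping the smaller input from B is cheaper than shipping the larger input from A. A realistic optimizer would also have to consider multi-site plans and the cost of the join itself, which this sketch leaves out.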
Committee: Gagan Agrawal (Advisor); Arnab Nandi (Committee Member); P Sadayappan (Committee Member)
Subjects: Computer Science