
Search Results

(Total results 10)


  • 1. Ebenstein, Roee Supporting Advanced Queries on Scientific Array Data

    Doctor of Philosophy, The Ohio State University, 2018, Computer Science and Engineering

    Distributed scientific array data is becoming more prevalent and increasing in size, and there is a growing need for (performance in) advanced analytics over these data. In this dissertation, we focus on addressing issues to allow data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying, and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations that is closely related to the traditional joins performed on relational tables - including an operation we refer to as Mutual Range Joins (MRJ), which arises on scientific data that is not only numerical but also has measurement noise. While working closely with our colleagues to provide them with usable analytics over array data, we uncovered a new type of analytical querying - analytics over windows with an inner window ordering (in contrast to the external window ordering, available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed. Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsetting, as well as complex analytical functions and joins). We focus on distributed data, and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI) - this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries.
For such complex optimization, we introduce methods and algorit (open full item for complete abstract)

    Committee: Gagan Agrawal (Advisor); Arnab Nandi (Committee Member); P Sadayappan (Committee Member) Subjects: Computer Science
  • 2. Qiao, Shi QUERYING GRAPH STRUCTURED RDF DATA

    Doctor of Philosophy, Case Western Reserve University, 2016, EECS - Computer and Information Sciences

    Providing an efficient and expressive querying technique for graph structured RDF data is an emergent problem, as large amounts of RDF data are available from applications in many areas. Current techniques do not fully satisfy this goal due to the nature of the RDF model, which requires highly flexible use of keywords and structure expression in the query language. Viewing RDF as graphs requires additional graph-based functionalities, such as querying a path or a tree connection. We propose a querying framework, called RDF-h, which uses the query template as a basic query unit and supports both partially entered keywords and query conditions based on graph structure. In order to provide efficient query evaluation, a signature-based index is utilized. Though most existing techniques that utilize signature-based indexes claim benefits on all datasets and queries, the effectiveness of signature-based pruning varies greatly among different RDF datasets and is highly related to their dataset characteristics. The performance benefits from signature-based pruning depend not only on the size of the RDF graphs, but also on the underlying graph structure and the complexity of queries. We propose several dataset evaluation metrics, namely coverage, coherence, relationship specialty, and literal diversity, to understand the query performance differences among real and synthetic RDF datasets. Based on these results, we further propose an application-specific framework, called RBench, to generate RDF benchmarks. By evaluating the characteristics of RDF datasets and the complexity of query templates, RDF-h selectively utilizes signature-based pruning when it is considered to be beneficial. Two aspects of the RDF-h framework are evaluated in experiments: 1. extensive query performance evaluation based on randomly generated queries for different datasets; 2. utilization of RDF-h for biomedical applications.
For random query evaluation, the RDF-h algorithm can automatically capture freque (open full item for complete abstract)

    Committee: Meral Özsoyoglu (Advisor); Gultekin Özsoyoglu (Committee Member); Mehmet Koyutürk (Committee Member); Marc Buchner (Committee Member); Soumya Ray (Committee Member); Andy Podgurski (Committee Member); Xiang Zhang (Committee Member) Subjects: Computer Science
  • 3. Jones, Eric Hastening Write Operations on Read-Optimized Out-of-Core Column-Store Databases Utilizing Timestamped Binary Association Tables

    Master of Computing and Information Systems, Youngstown State University, 2015, Department of Computer Science and Information Systems

    The purpose of this thesis is to extend previous research on Out-of-Core column-store databases. Following use of the Asynchronous Out-of-Core update, which kept track of data using timestamps, an appendix is created which holds the newest timestamps and updated data by appending entries to the tables as new tuples. The appendix is unsorted and unindexed by nature, necessitating a linear search that is not only slow but also causes ever-increasing query times as the volume of data within the appendix expands. Although measures exist to merge the appendix with the original body of the data, which is sorted and indexed, searching becomes swifter only once the merging of tuples is complete. For this reason, the use of an offset B-Tree index to allow for more efficient searches on the appendix is proposed.

    Committee: Feng Yu PhD (Advisor); John Sullins PhD (Committee Member); Yong Zhang PhD (Committee Member) Subjects: Computer Science
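The appendix-plus-index idea in the abstract above can be illustrated with a small sketch. This is a hypothetical in-memory illustration, not the thesis's implementation: a sorted list maintained with `bisect` stands in for the proposed on-disk offset B-Tree, and the class and method names are invented for the example.

```python
import bisect
import time

class ColumnStore:
    """Toy read-optimized store: a sorted main body plus an append-only
    appendix of timestamped updates. A sorted (key, offset) list over the
    appendix stands in for an offset B-Tree, replacing the linear scan."""

    def __init__(self, sorted_rows):
        self.body = sorted_rows   # (key, value) pairs, sorted by key
        self.appendix = []        # (timestamp, key, value) in append order
        self.index = []           # sorted (key, appendix_offset) pairs

    def update(self, key, value):
        offset = len(self.appendix)
        self.appendix.append((time.time(), key, value))
        bisect.insort(self.index, (key, offset))  # O(log n) search position

    def lookup(self, key):
        # Search the appendix index first; the largest offset for a key
        # is the most recent (append-only, so offsets follow timestamps).
        lo = bisect.bisect_left(self.index, (key, -1))
        hits = []
        while lo < len(self.index) and self.index[lo][0] == key:
            hits.append(self.index[lo][1])
            lo += 1
        if hits:
            return self.appendix[max(hits)][2]
        # Fall back to binary search on the sorted main body.
        keys = [k for k, _ in self.body]
        i = bisect.bisect_left(keys, key)
        if i < len(self.body) and self.body[i][0] == key:
            return self.body[i][1]
        return None
```

With an index like this, appendix lookups stay logarithmic instead of degrading linearly as updates accumulate, which is the problem the thesis identifies.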
  • 4. Huai, Yin Building High Performance Data Analytics Systems based on Scale-out Models

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    To respond to the data explosion, new system infrastructures have been built based on scale-out models for the purposes of high data availability and reliable large-scale computations. With increasing adoption of data analytics systems, users continuously demand high throughput and high performance on various applications. In this dissertation, we identify three critical issues in achieving high throughput and high performance for data analytics: efficient table placement methods (i.e., methods for placing structured data), generating high-quality distributed query plans without unnecessary data movements, and effective support of out-of-band communications. To address these three issues, we have conducted a comprehensive study on design choices of different table placement methods, designed and implemented two optimizations to remove unnecessary data movements in distributed query plans, and introduced a system facility called SideWalk to facilitate the implementation of out-of-band communications. In our first work, on table placement methods, we comprehensively studied existing table placement methods and generalized their basic structure. Based on this basic structure, we conducted a comprehensive evaluation of different design choices of table placement methods on I/O performance. Based on our evaluation and analysis, we provided a set of guidelines for users and developers to tune their implementations of table placement methods. In our second work, we focused on building our optimizations based on Apache Hive, a widely used open source data warehousing system in the Hadoop ecosystem. We analyze operators that may require data movements in the context of the entire query plan. Our optimization methods remove unnecessary data movements from the distributed query plans. Our evaluation shows that these optimization methods can significantly reduce the query execution time.
In our thir (open full item for complete abstract)

    Committee: Xiaodong Zhang (Advisor); Feng Qin (Committee Member); Spyridon Blanas (Committee Member) Subjects: Computer Science
  • 5. Goyal, Anushree A Framework for XML Index Selection

    MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Science

    Data on the web is increasingly being made available in a semi-structured format, which presents some challenges for effective extraction of information [ABS00]. Semi-structured data is not rigidly formatted, i.e., it is not strictly table-oriented as in a relational model or other conventional database systems [A97]. The structural or hierarchical relationship between elements in semi-structured data typically needs to be preserved. In traditional databases, the records do not bear any inherent ordering or structural relationships. Moreover, traditional database schemas are not contained within the data itself. Owing to these reasons, one challenge is to store semi-structured data efficiently and another is to query this data effectively. A data model for representing such data is the XML data model [XML]. It supports modeling both the structure and the data contained within it. There are several methods of querying XML data. While some methods use XML data in its native format, others transform it into another format before it can be queried. However, there is an absence of a common ground on which an informed decision can be made about the selection of the best query method. This decision needs to be made by comparing significant factors or parameters for each query method, in order to determine whether a given query technique is expected to perform better than another. Based on Richardson's algorithm [R09], our work focuses on implementing this recommendation in a cost-effective manner, resulting in the selection of an appropriate query technique for the given scenario.

    Committee: Karen Davis Ph.D. (Committee Chair); Raj Bhatnagar Ph.D. (Committee Member); Carla Purdy Ph.D. (Committee Member) Subjects: Computer Science
  • 6. Richardson, Bartley A Performance Study of XML Query Optimization Techniques

    PhD, University of Cincinnati, 2009, Engineering : Computer Science and Engineering

    As computers and technology continue to become more commonplace and essential to everyday life, more data is captured, stored, and analyzed by a variety of institutions in government, education, and the private sector. As this amount of data grows, so does the need for efficient methodologies and tools used to store, retrieve, and transform the data. A common method used to store this schemaless, semi-structured data is through the Extensible Markup Language, XML. In this way, an XML document is viewed as a database. With this sizable amount of data stored in a common format, one problem is how to efficiently query XML documents. While relational database management systems contain built-in query optimizers, no such framework exists for XML databases. A multitude of document shapes, query shapes, index structures, and query techniques exist for XML databases, but the implications of these choices and their effects on query processing have not been investigated in a common framework. This dissertation identifies a set of representative query techniques, document structures, and query styles for XML databases and provides a common framework for classifying the various query techniques, structures, and styles. We identify two broad classifications of query techniques, native XML and non-native XML, and develop a cost-based model for each technique that models query performance from an execution standpoint. We also develop our own query technique, RDBQuery, as an extension and major enhancement to a previously existing non-native XML query technique that leverages a relational database management system to efficiently process XML queries. To evaluate relative query performance, we compare the techniques for various parameters that impact their performance, including query shape and document shape/size, and the results are presented through a series of graphs.
These graphs and their underlying cost models are used to present an optimization framework for XML querie (open full item for complete abstract)

    Committee: Karen Davis PhD (Committee Chair); Raj Bhatnagar PhD (Committee Member); John Schlipf PhD (Committee Member); Fred Annexstein PhD (Committee Member); Hsiang-Li Chiang PhD (Committee Member) Subjects: Computer Science
  • 7. BRANT, MICHAEL BINDING HASH TECHNIQUE FOR XML QUERY OPTIMIZATION

    MS, University of Cincinnati, 2006, Engineering : Computer Science

    XML is a format that allows the storage and exchange of information across the World Wide Web. XML is a semi-structured markup language containing recursively nested elements. Typically, the volume of data in an XML file is too large to be human readable; therefore, XML query processing (retrieving and combining subtrees) needs to be automated. An XML query processor has to choose an efficient method for a particular query and XML document. In this thesis we develop a method called Binding Hash (BH) that performs a subset of XML query operations. The BH Method focuses on improving performance of the most time-consuming XML query operation, called the structural join. A performance study indicates that the BH Method outperforms similar techniques for queries that are deeply nested. The BH Method is flexible since it can be integrated into a larger system and selected by an optimizer when it achieves the best performance.

    Committee: Dr. Karen Davis (Advisor) Subjects:
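The structural join operation mentioned above can be illustrated with the classic stack-based merge over region-encoded nodes. This is a generic sketch of the well-known stack-tree join, not the thesis's Binding Hash technique: each node is an interval (start, end) from a document-order numbering, and a is an ancestor of d iff a.start < d.start and d.end < a.end.

```python
def structural_join(ancestors, descendants):
    """Stack-based structural join over region-encoded XML nodes.
    Both inputs are (start, end) intervals sorted by start; returns
    all (ancestor, descendant) pairs where the ancestor encloses
    the descendant, in a single merge pass."""
    results, stack = [], []
    i = 0
    for d in descendants:
        # Push every ancestor that begins before this descendant.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            # Discard stacked ancestors that ended before this one starts.
            while stack and stack[-1][1] < ancestors[i][0]:
                stack.pop()
            stack.append(ancestors[i])
            i += 1
        # Discard stacked ancestors that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Every remaining stacked ancestor may enclose d; verify the end.
        for a in stack:
            if d[1] < a[1]:
                results.append((a, d))
    return results
```

Because both lists are scanned once and the stack only holds the current chain of open ancestors, the join avoids the quadratic nested-loop comparison of every ancestor with every descendant.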
  • 8. Wang, Fan SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

    Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

    A popular trend in data dissemination involves online data sources that are hidden behind query forms, thus forming what is referred to as the deep web. Deep web data is stored in hidden databases. Hidden data can only be accessed after a user submits a query by filling in an online form. Currently, hundreds of large, complex and, in many cases, related and/or overlapping deep web data sources have become available. The number of such data sources is still increasing rapidly every year. The emergence of the deep web is posing many new challenges in data integration and query answering. First, the metadata of the deep web and the data records stored in deep web databases are hidden from the data integration system. Second, multiple deep web data sources may have data redundancy. Furthermore, similar data sources may provide data with different data quality and even conflicting data. Therefore, data source selection is of great importance for a data integration system. Third, deep web data sources in a domain often have inter-dependencies, i.e., the output from one data source may be the input of another data source. Thus, answering a query over a set of deep web data sources often involves accessing a sequence of inter-dependent data sources in an intelligent order. Fourth, the common way of accessing data in deep web data sources is through standardized input interfaces. These interfaces, on one hand, provide a very simple query mechanism. On the other hand, they significantly constrain the types of queries that can be automatically executed. Finally, all deep web data sources are network based. Both the data source servers and the network links are vulnerable to congestion and failures. Therefore, handling fault tolerance is also necessary for a data integration system. In our work, we propose SEEDEEP, an automatic system for exploring and querying deep web data sources.
The SEEDEEP system is able to integrate deep web data sources in a particular (open full item for complete abstract)

    Committee: Gagan Agrawal PhD (Advisor); Feng Qin PhD (Committee Member); P Sadayappan PhD (Committee Member) Subjects: Computer Science
  • 9. Jin, Ruoming New techniques for efficiently discovering frequent patterns

    Doctor of Philosophy, The Ohio State University, 2005, Computer and Information Science

    Because of its theoretical and practical importance, the field of frequent pattern mining has been and remains one of the most active research areas in KDD. In this dissertation, we study three different problems in frequent pattern mining: mining multiple datasets, mining streaming data, and mining large-scale structures from graph datasets. Our study has not only extended the breadth of frequent pattern mining, but also brought new techniques and algorithms into this field. Specifically, our contributions are as follows. 1. Mining Multiple Datasets: We develop a systematic approach to generate efficient query plans for a single mining query across multiple datasets. We also propose methods to simultaneously optimize multiple such queries and utilize past mining results in a query-intensive KDD environment. Our experimental results have shown speedups of up to two orders of magnitude compared with naive methods without these optimizations. 2. Mining Frequent Itemsets over Streaming Data: We propose a new algorithm, StreamMining, to discover frequent itemsets over streaming data. In a single pass, StreamMining is guaranteed to find a superset of the frequent itemsets, but false positives may occur. If a second pass is allowed, StreamMining is able to remove the false positives and find the exact frequent itemsets. Our detailed evaluation using both synthetic and real datasets has shown that our one-pass algorithm is very accurate in practice, and is also very memory efficient. 3. Mining Frequent Large-Scale Structures from Graph Datasets: We develop a new framework to discover frequent large-scale structures from graph datasets. This framework is derived from a mathematical concept, the topological minor. In this framework, we propose a new algorithm, TSMiner, which efficiently enumerates all the frequent large-scale structures in a graph dataset, and a new approach called the relabeling function to perform constraint mining.
We apply our framework to protein str (open full item for complete abstract)

    Committee: Gagan Agrawal (Advisor) Subjects: Computer Science
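The one-pass "superset with possible false positives" guarantee described in item 2 of the abstract above can be illustrated, for the simpler single-item case, with a Misra-Gries style counter. This is a hypothetical sketch of that general counting idea, not StreamMining itself, which generalizes such counting to itemsets.

```python
import math

def frequent_candidates(stream, theta):
    """One pass over a stream of items; returns a superset of all items
    whose frequency exceeds theta * len(stream). False positives are
    possible; a second pass over the data could count exactly and
    filter them out."""
    k = math.ceil(1 / theta)     # at most k - 1 counters are ever kept
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            # Each such step cancels k distinct occurrences, so an item
            # seen more than n/k <= theta*n times cannot be eliminated.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```

The memory bound (k - 1 counters regardless of stream length) is what makes this family of algorithms attractive for streaming settings, at the cost of admitting false positives in the single-pass answer.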
  • 10. Fedyukin, Alexander KEEPING TRACK OF NETWORK FLOWS: AN INEXPENSIVE AND FLEXIBLE SOLUTION

    Master of Science (MS), Ohio University, 2005, Electrical Engineering & Computer Science (Engineering and Technology)

    Every organization should be able to deploy a lightweight, flexible, and easy-to-use network monitoring system. Yet no existing system possesses all these qualities. Among the reasons for this is the considerable volume and high granularity of network traffic data - one border router of a medium-sized network can register a few million network flows per hour. Also, applications such as intrusion detection have limited tolerance for delays - the system has to deliver results quickly while managing to avoid overloads. This thesis discusses how to overcome these challenges and build an efficient and inexpensive network monitoring system by using innovative approaches to data processing such as stream queries and dynamic query optimization. It covers more basic issues as well: how to acquire data about network traffic and what applications can use it, as well as system architecture and performance issues.

    Committee: Shawn Ostermann (Advisor) Subjects: