Search Results

(Total results 34)

Search Report

  • 1. Ebenstein, Roee Supporting Advanced Queries on Scientific Array Data

    Doctor of Philosophy, The Ohio State University, 2018, Computer Science and Engineering

    Distributed scientific array data is becoming more prevalent and increasing in size, and there is a growing need for (performance in) advanced analytics over these data. In this dissertation, we focus on addressing issues to allow data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations closely related to the traditional joins performed on relational tables, including an operation we refer to as Mutual Range Joins (MRJ), which arises on scientific data that is not only numerical but also has measurement noise. While working closely with our colleagues to provide them usable analytics over array data, we uncovered a new type of analytical querying: analytics over windows with an inner window ordering (in contrast to the external window ordering available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed. Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsetting, as well as complex analytical functions and joins). We focus on distributed data and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI); this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries. 
For such complex optimization, we introduce methods and algorit (open full item for complete abstract)
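The network-aware join optimization described above can be illustrated with a toy cost model (a hypothetical sketch, not the dissertation's actual optimizer; all link parameters and operand sizes below are invented):

```python
# Hypothetical sketch of network-aware join placement: for a two-way join,
# pick the execution site that minimizes total transfer time, where moving
# `size` bytes over a link costs latency + size / throughput.

def transfer_cost(size_bytes, latency_s, throughput_bps):
    """Time to move one operand over one link."""
    return latency_s + size_bytes / throughput_bps

def best_join_site(operands, links):
    """operands: {site: size in bytes of the operand stored there}
    links: {(src, dst): (latency_s, throughput_bytes_per_s)}
    Returns (site, cost) minimizing the total time to gather all operands."""
    best = None
    for site in operands:
        cost = 0.0
        for src, size in operands.items():
            if src == site:
                continue  # operand already local, nothing to ship
            lat, thr = links[(src, site)]
            cost += transfer_cost(size, lat, thr)
        if best is None or cost < best[1]:
            best = (site, cost)
    return best

operands = {"A": 8e9, "B": 1e9}  # site A holds 8 GB, site B holds 1 GB
links = {("A", "B"): (0.05, 1e8), ("B", "A"): (0.05, 1e9)}
site, cost = best_join_site(operands, links)
# Shipping the small operand over the faster B->A link wins.
```

Here the small operand is shipped to site A over the faster link; a real optimizer must also weigh join order, parallelism, and (as the dissertation later addresses) resource and data skew.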

    Committee: Gagan Agrawal (Advisor); Arnab Nandi (Committee Member); P Sadayappan (Committee Member) Subjects: Computer Science
  • 2. Moatassem, Nawal A Study of Migrating Biological Data from Relational Databases to NoSQL Databases

    Master of Computing and Information Systems, Youngstown State University, 2015, Department of Computer Science and Information Systems

    The purpose of this research is to conduct a literature survey on various NoSQL (Not-only-SQL) architectures. Included along with this literature survey is an experiment comparing a relational database management system (RDBMS) and a NoSQL DBMS. This work specifically compares MySQL and MongoDB, an RDBMS and a NoSQL DBMS respectively, for the purposes of data migration. The migration is run on data sets for Youngstown State University's plantEST biological database. The idea is to demonstrate the need for shifting to NoSQL for management of large amounts of unstructured and semi-structured data, and to observe and record the insertion speeds of both databases using this custom plantEST schema.
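As a rough illustration of the relational-to-document migration involved (a hypothetical sketch; the table names, fields, and nesting below are invented, not taken from the plantEST schema):

```python
# Illustrative sketch (not the thesis code): denormalizing relational rows
# into nested documents before bulk-loading into a document store such as
# MongoDB. All table and field names are made up for the example.

plants = [  # parent table rows
    {"plant_id": 1, "species": "Zea mays"},
    {"plant_id": 2, "species": "Glycine max"},
]
est_sequences = [  # child table rows keyed by plant_id
    {"est_id": 10, "plant_id": 1, "seq": "ACGT"},
    {"est_id": 11, "plant_id": 1, "seq": "TTGA"},
    {"est_id": 12, "plant_id": 2, "seq": "GGCC"},
]

def to_documents(parents, children, key):
    """Fold child rows into their parent row, yielding one document per parent."""
    by_key = {}
    for child in children:
        c = dict(child)
        c.pop(key)  # drop the foreign key; nesting makes it implicit
        by_key.setdefault(child[key], []).append(c)
    return [dict(p, ests=by_key.get(p[key], [])) for p in parents]

docs = to_documents(plants, est_sequences, "plant_id")
```

A bulk insert of `docs` into a document store versus row-by-row inserts into an RDBMS is roughly the kind of workload an insertion-speed comparison would exercise.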

    Committee: Feng Yu PhD (Advisor); John Sullins PhD (Committee Member); Yong Zhang PhD (Committee Member) Subjects: Computer Science; Information Science
  • 3. Modi, Amit Matching Based Diversity

    Master of Science, The Ohio State University, 2011, Computer Science and Engineering

    With the increasing size of data and its availability to a diverse community of users, the search model has evolved from accuracy search to exploratory search. The usability of the search results in exploratory search is defined not only by accuracy but also by decreasing redundancy, as well as adding novelty to the result set. Serendipitous results are encouraged, as the user does not have complete knowledge of the data. This problem is known as diversification of search results, where the results are similar to the query object but dissimilar among themselves. Diversification of search results is a crucial task in data mining, information retrieval, and recommendation systems. Most of the earlier proposals in the database community model this problem as diversification of the nearest neighbors. These proposals use index-based distance browsing approaches to diversify the nearest neighbors. The drawback of this approach is that it cannot uncover the underlying partial similarities in constrained subspaces, because indexing the exponential number of subspaces is not possible in high-dimensional data. None of the earlier proposals uncovers partial similarities as well as guarantees the diversification of search results. We introduce the diverse k-n match problem as diversification of the search results of partial matches of the query object to decrease redundancy and increase novelty/serendipity. Here n specifies the number of subspace dimensions of partial similarity and is an integer less than or equal to d, the dimensionality of the data; k specifies the size of the result set. We show that the diverse k-n match problem is NP-complete, and we propose a greedy heuristic approach based on the ordered sequence of partial matches to address the problem. For that purpose, we introduce three evaluation metrics to measure the diversity of the result set: 1) Novelty, 2) Local Content Dispersion, and 3) Global Content Dispersion. 
Additionally, we introduced a disk-based solution for very large (open full item for complete abstract)
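A minimal greedy diversification heuristic in this spirit might look like the following sketch (simplified to full-vector distances; the thesis's algorithm works over the ordered sequence of partial n-dimensional matches, which this does not reproduce):

```python
# Simplified greedy diversification sketch: trade closeness to the query
# against distance to the already-picked set, picking k results.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_diverse(query, candidates, k, lam=0.5):
    """Pick k candidates; each step maximizes
    lam * (closeness to query) + (1 - lam) * (distance to the picked set)."""
    picked = []
    pool = list(candidates)
    while pool and len(picked) < k:
        def score(c):
            novelty = min((dist(c, p) for p in picked), default=0.0)
            return -lam * dist(c, query) + (1 - lam) * novelty
        best = max(pool, key=score)
        picked.append(best)
        pool.remove(best)
    return picked

query = (0.0, 0.0)
cands = [(0.1, 0.0), (0.11, 0.0), (0.0, 0.1), (5.0, 5.0)]
result = greedy_diverse(query, cands, k=2)
# The near-duplicate (0.11, 0.0) is skipped in favor of a more novel neighbor.
```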

    Committee: Gagan Agrawal PhD (Advisor); Hakan Ferhatosmanoglu PhD (Committee Member); Engin Demir PhD (Committee Member) Subjects: Computer Science
  • 4. Smith, Harrison The smart reconfigurable coprocessor for fuzzy searching of sage generated datasets

    Master of Science, The Ohio State University, 2006, Graduate School

    Committee: Not Provided (Other) Subjects:
  • 5. DePero, Andrew Schemalysis: Visualization of Sub-Schemas in Document NoSQL Databases

    Master of Computer Science, Miami University, 2022, Computer Science and Software Engineering

    NoSQL database systems are useful for managing large and diverse data sets associated with Big Data. Highly diverse data sets contain data with different structures, but often there are no readily available schemas describing the structures. The lack of a uniform structure for data may make it difficult to understand and query a database. Recent research and industry software tools extract some aspects of the structures inherent in a NoSQL database; most tools provide a schema that gives the union of attributes across all objects, termed a union schema. Some provide sample values for attributes. We present Schemalysis, a tool for analyzing and displaying the sub-schemas of a document NoSQL database along with example instances. The web application implements an algorithm that reads objects and detects the individual sub-schema of each document in a document database, as well as the database's union schema. We also conduct three different case studies to validate the functionality of Schemalysis with real-world data and to compare and contrast it with existing tools for extracting schemas.
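The union-schema versus sub-schema distinction can be sketched on flat documents (illustrative only; Schemalysis itself also handles nested structures and reports example values):

```python
# Toy documents with heterogeneous structure (invented for the example).
docs = [
    {"name": "ada", "email": "ada@example.org"},
    {"name": "bob", "phone": "555-0100"},
    {"name": "eve", "email": "eve@example.org"},
]

# Union schema: the union of attributes across all documents.
union_schema = set().union(*(d.keys() for d in docs))

# Sub-schemas: each distinct key set, with a count of documents sharing it.
sub_schemas = {}
for d in docs:
    sub_schemas[frozenset(d)] = sub_schemas.get(frozenset(d), 0) + 1
```

The union schema alone ({"name", "email", "phone"}) hides the fact that no single document has all three attributes, which is exactly what the per-document sub-schemas recover.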

    Committee: Karen Davis (Advisor); Alan Ferrenberg (Committee Member); James Kiper (Committee Member) Subjects: Computer Science
  • 6. Thapa, Shova Use Case Driven Evaluation of Database Systems for ILDA

    Master of Computer Science, Miami University, 2022, Computer Science and Software Engineering

    Databases are integral parts of many software systems. An increasing number of database systems have unique capabilities and trade-offs relative to other systems; choosing the right database system for a given application is a challenging problem. In such cases, input from the end-users about which use cases should be supported is crucial so that a database switch, which is not a trivial task, provides the most value. The Indigenous Languages Digital Archive (ILDA) is a web-based system to gather digital copies of different indigenous language documents under one virtual repository. The system provides a reliable framework for organizing, storing, searching, and evaluating archived linguistic materials over the web. The primary objective of ILDA is to support the revitalization of indigenous languages and culture education among tribal communities. This thesis conducts an evaluation of database systems for ILDA by analyzing features of different database systems to select the best system to support ILDA use cases. Feedback from end-users and analysis of system features contribute to the development of use cases. Selected database systems are tested in terms of functionality and usability feedback from ILDA users. Even though this methodology is tested only for the ILDA scenario, it can be applied to similar projects.

    Committee: Karen Davis PhD (Advisor); Daniela Inclezan PhD (Committee Member); Douglas Troy PhD (Committee Member) Subjects: Computer Engineering; Computer Science
  • 7. Sarkhel, Ritesh Data Preparation from Visually Rich Documents

    Doctor of Philosophy, The Ohio State University, 2022, Computer Science and Engineering

    Modern information sources are heterogeneous in nature. They utilize a number of modalities to disseminate information effectively. Visually rich documents typify such an information source. A visually rich document refers to a physical or digital document that uses visual cues along with linguistic features to augment or highlight its semantics. Traditional data preparation solutions are inefficient in harvesting knowledge from these sources as they do not take their multimodality into account. They are also cumbersome in terms of the amount of human effort required in their end-to-end workflow. We describe algorithmic solutions for two fundamental data preparation tasks, namely information extraction and data integration, for visually rich documents. For both tasks, the core element of our solution is a fundamental machine-learning problem: how to represent heterogeneous documents with diverse layouts and/or formats in a unified way. We develop efficient solutions for both tasks on the bedrock of this representation learning problem. In the first part of this dissertation, we describe Artemis, a machine-learning model to extract structured records from visually rich documents. It identifies named entities by representing each visual span as a multimodal feature vector and subsequently classifying it as one of the target fields to be extracted. It is a generalized information extraction method, i.e., it does not utilize any prior knowledge about the layout or format of the document in its end-to-end workflow. We describe two utility functions that aid this machine-learning model: VS2, a visual segmentation algorithm that encodes the local context, and LadderNet, a convolutional network that encodes document-specific discriminative features in a visual span representation. We establish the efficacy of our machine-learning model on a number of different datasets. We investigate the robustness of our extraction model on an extreme case of our usability spectrum. 
In th (open full item for complete abstract)

    Committee: Arnab Nandi (Advisor); Srinivasan Parthasarathy (Committee Member); Eric Fosler-Lussier (Committee Member); Jay Gupta (Committee Member) Subjects: Computer Science; Information Science
  • 8. Xing, Haoyuan Optimizing array processing on complex I/O stacks using indices and data summarization

    Doctor of Philosophy, The Ohio State University, 2021, Computer Science and Engineering

    Increasingly, the ability of human beings to understand the universe and ourselves depends on our ability to obtain and process data. With an explosion of data being generated every day, efficiently storing and querying such data, which is usually multidimensional and can be represented using an array data model, is increasingly vital. Meanwhile, as more and more powerful CPUs and accelerators are added to the system, most modern computing systems contain an increasingly complex I/O stack, ranging from traditional disk-based file systems to heterogeneous accelerators with individual memory spaces. Efficiently accessing such a complex I/O stack in array processing is essential to utilize the enormous computational power of modern computational platforms. One key to achieving such efficiency is identifying where the data is being generated or stored, and choosing appropriate representation and processing strategies accordingly. This dissertation focuses on optimizing array processing in such complex I/O stacks by studying these two fundamental questions: what data representation should be used, and where the data should be stored and processed. The two basic scenarios of scientific data analytics are considered one by one. The first half of the dissertation tackles the problem of efficiently processing array data post hoc, presenting a compact array storage format for disk-based data that integrates lossless value-based indexing. Such integrated indices improve the performance of value-based filtering operations by orders of magnitude without sacrificing storage size or accuracy. The dissertation then demonstrates how complex queries such as equal and similarity array joins can also be performed on such novel storage. The second half of the dissertation focuses on data generated by simulations on accelerators in-situ, without storing the generated data. 
The system generates an improved bitmap representation on GPU to reduce the bandwidth bottleneck between host and accelerat (open full item for complete abstract)
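The bitmap-representation idea mentioned above can be sketched roughly as follows (a coarse, binned variant for illustration only; the dissertation's actual storage uses lossless value-based indexing, and the bin edges here are invented):

```python
# Sketch of value-based filtering with bitmap indices: bucket each cell value
# into a bin, keep one bitmap per bin (a Python int used as a bitset), and
# answer a range filter by OR-ing bin bitmaps instead of scanning the array.

def build_bitmaps(values, bin_edges):
    """One bitmap per [edge[b], edge[b+1]) bin; bit j is set if cell j falls in the bin."""
    bitmaps = [0] * (len(bin_edges) - 1)
    for j, v in enumerate(values):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitmaps[b] |= 1 << j
                break
    return bitmaps

def filter_ge(bitmaps, bin_edges, lo):
    """Cells whose bin lies entirely at or above `lo` (coarse, bin-aligned)."""
    mask = 0
    for b in range(len(bitmaps)):
        if bin_edges[b] >= lo:
            mask |= bitmaps[b]
    return mask

cells = [0.2, 3.5, 7.1, 9.9, 4.4]
edges = [0, 2, 5, 8, 10]
mask = filter_ge(build_bitmaps(cells, edges), edges, lo=5)
hits = [j for j in range(len(cells)) if mask >> j & 1]
```

OR-ing a handful of bitmaps touches far less data than a full scan, which is also why a compact bitmap representation helps shrink host-to-accelerator traffic.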

    Committee: Rajiv Ramnath (Advisor); Gagan Agrawal (Advisor); Jason Blevins (Other); Yang Wang (Committee Member); Srinivasan Parthasarathy (Committee Member) Subjects: Computer Engineering; Computer Science
  • 9. Patt, Andrew Integrative and Network-Based Approaches for Functional Interpretation of Metabolomic Data

    Doctor of Philosophy, The Ohio State University, 2021, Biomedical Sciences

    Metabolism is a process that touches all aspects of life, from homeostasis to disease, such that the study of metabolites yields valuable insights into the inner workings of biological systems. Translating the findings of metabolomic and lipidomic experiments into biological insight, biomarkers, or actionable targets associated with disease requires functional interpretation of the data, which is challenging. One common strategy for interpreting metabolomic data is pathway enrichment analysis. Pathway analysis is useful because pathway-level perturbation can be more reproducible across samples than individual metabolite shifts, which are hindered by inconsistent experimental coverage of metabolites and functional redundancy of metabolites. However, pathway analysis of metabolites faces many barriers to success. Issues with metabolite pathway analysis include a lack of metabolite pathway annotations, highly overlapping pathway definitions, and (again) a lack of reproducibility in metabolite detection between experiments. Here, I present two complementary software resources, RaMP and MetaboSPAN, which I helped to develop in order to address these issues. RaMP is a metabolite annotations database that consolidates pathway, reaction, chemical structure, and other information from multiple publicly available data sources. RaMP's associated R package allows users to query information on metabolites of interest as well as perform pathway enrichment analysis using Fisher's exact test. MetaboSPAN is an advanced pathway enrichment analysis strategy that infers activity in undetected portions of the metabolome using the vast extent of knowledge in RaMP to expand pathway-level findings and improve reproducibility between experiments. I demonstrate the utility of these tools on a metabolite data set generated in patient-derived cell lines of dedifferentiated liposarcoma with varying amplification of the MDM2 oncogene.
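The Fisher's exact test used for pathway enrichment reduces, in its one-sided form, to a hypergeometric tail probability; the sketch below uses invented counts (N annotated metabolites, K in the pathway, n significant hits, k of those hits in the pathway), not data from RaMP:

```python
from math import comb

# One-sided Fisher's exact test for over-representation: the p-value is the
# hypergeometric tail P(X >= k) for drawing n hits out of N metabolites when
# K of them belong to the pathway.

def enrichment_pvalue(N, K, n, k):
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# All 4 significant hits land in a 5-metabolite pathway out of 10 total:
p = enrichment_pvalue(N=10, K=5, n=4, k=4)  # C(5,4)*C(5,0)/C(10,4) = 5/210
```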

    Committee: Ewy Mathe PhD (Advisor); Kevin Coombes PhD (Advisor); Lang Li PhD (Committee Member); Rachel Kopec PhD (Committee Member) Subjects: Bioinformatics; Biomedical Research
  • 10. Unnava, Vasundhara Query processing in distributed database systems

    Doctor of Philosophy, The Ohio State University, 1992, Graduate School

    Committee: Not Provided (Other) Subjects: Business Administration
  • 11. Matacic, Tyler A Novel Index Method for Write Optimization on Out-of-Core Column-Store Databases

    Master of Computing and Information Systems, Youngstown State University, 2016, Department of Computer Science and Information Systems

    The purpose of this thesis is to extend previous research on write optimization in out-of-core column storage databases. A new type of storage model titled Timestamped Binary Association Table (TBAT) will be explored, a new update entitled Asynchronous Out-of-Core Update (AOC Update) designed to leverage the TBAT will be explained, and a new type of B-Tree titled Offset B+ Tree (OB-tree) will be examined. The performance of the OB-tree and TBAT when utilized for selection tasks will be demonstrated through experiments comparing TBAT selection with an OB-tree index, TBAT selection without an index, and the traditional method of binary selection on a Binary Association Table (BAT). The selection speed of these three methods will be recorded and conclusions will be drawn.
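A toy version of the timestamped, append-only idea behind the TBAT might look like this (hypothetical sketch; names and structure are simplified, not the thesis's implementation):

```python
import itertools

# TBAT-style storage sketch: updates are appended as (key, timestamp, value)
# rows instead of rewritten in place, so a selection must resolve each key
# to its newest version.

_clock = itertools.count()

tbat = []  # append-only list of (key, timestamp, value) rows

def aoc_update(key, value):
    """AOC-style update: append a newer timestamped version, never rewrite."""
    tbat.append((key, next(_clock), value))

def select(key):
    """Return the value with the highest timestamp for `key` (None if absent)."""
    versions = [(ts, v) for k, ts, v in tbat if k == key]
    return max(versions)[1] if versions else None

aoc_update("oid42", "red")
aoc_update("oid42", "blue")   # supersedes the earlier version
```

Roughly speaking, the OB-tree's job is to replace the linear scan inside `select` with an index lookup, which is the comparison the experiments above measure.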

    Committee: Feng Yu PhD (Committee Chair); Alina Lazar PhD (Committee Member); Yong Zhang PhD (Committee Member) Subjects: Computer Science
  • 12. Wang, Kaibo Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    Massively data-parallel processors, Graphics Processing Units (GPUs) in particular, have recently entered the mainstream of general-purpose computing as powerful hardware accelerators for a large scope of applications including databases, medical informatics, and big data analytics. However, despite their performance benefit and cost effectiveness, the utilization of GPUs in production systems still remains limited. A major reason behind this situation is the slow development of a supportive GPU software ecosystem. More specifically, (1) CPU-optimized algorithms for some critical computation problems have irregular memory access patterns with intensive control flows, which cannot be easily ported to GPUs to take full advantage of their fine-grained, massively data-parallel architecture; (2) commodity computing environments are inherently concurrent and require coordinated resource sharing to maximize throughput, while existing systems are still mainly designed for dedicated usage of GPU resources. In this Ph.D. dissertation, we develop efficient software solutions to support the adoption of massively data-parallel processors in general-purpose commodity computing systems. Our research mainly focuses on the following areas. First, to make a strong case for GPUs as indispensable accelerators, we apply GPUs to significantly improve the performance of spatial data cross-comparison in digital pathology analysis. Instead of trying to port existing CPU-based algorithms to GPUs, we design a new algorithm and fully optimize it to utilize the GPU's hardware architecture for high performance. Second, we propose operating system support for automatic device memory management to improve the usability and performance of GPUs in shared general-purpose computing environments. Several effective optimization techniques are employed to ensure the efficient usage of GPU device memory space and to achieve high throughput. 
Finally, we develop resource management facilities in GPU database system (open full item for complete abstract)

    Committee: Xiaodong Zhang (Advisor); P. Sadayappan (Committee Member); Christopher Stewart (Committee Member); Harald Vaessin (Committee Member) Subjects: Computer Engineering; Computer Science
  • 13. Zheng, Mai Towards Manifesting Reliability Issues In Modern Computer Systems

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    Computer systems are evolving all the time. In particular, the two most fundamental components, i.e., the compute unit and the storage unit, have witnessed dramatic changes in recent years. For example, on the compute side, graphics processing units (GPUs) have emerged as an extremely cost-effective means for achieving high performance computing. Similarly, on the storage side, flash-based solid-state drives (SSDs) are revolutionizing the whole IT industry. While these new technologies have improved the performance of computer systems to a new level, they also bring new challenges to the reliability of those systems. As a new computing platform, GPUs enforce a novel multi-threaded programming model. Like any multi-threaded environment, data races on GPUs can severely affect the correctness of applications and may lead to data loss or corruption. Similarly, as a new storage medium, SSDs also bring potential reliability challenges to the already complicated storage stack. Among other things, the behavior of SSDs during power faults — which happen even in the leading data centers — is an important yet mostly ignored issue in this dependability-critical area. Besides SSDs, another important layer in the modern storage stack is databases. The atomicity, consistency, isolation, and durability (ACID) properties modern databases provide make it easy for application developers to create highly reliable applications. However, the ACID properties are far from trivial to provide, particularly when high performance must be achieved. This leads to complex and error-prone code; even at a low defect rate of one bug per thousand lines, the millions of lines of code in a commercial OLTP database can harbor thousands of bugs. As the first step towards building robust modern computer systems, this dissertation proposes novel approaches to detect and manifest reliability issues in three different layers of computer systems. 
First, in the application layer, this dissertation (open full item for complete abstract)

    Committee: Feng Qin (Advisor); Gagan Agrawal (Committee Member); Xiaodong Zhang (Committee Member) Subjects: Computer Science
  • 14. Sridharan, Srilakshmi Data Mining-based Fragmentation for Query Optimization

    MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Science

    A main purpose of a database is to provide requested data efficiently. Query performance can be improved in many ways. One of the efficient ways to handle multiple queries posted simultaneously to the database is to distribute the database across several sites; instead of querying the entire database, only the site that contains the data related to the query is accessed. Distribution of a database involves fragmentation of the data and allocating the fragmented data across various sites. Several research works address the issue of fragmentation of databases based on workload, since the aim of fragmentation is to optimize query response time [MD08]. In particular, clustering the data according to query predicates or attributes is shown to perform well for fragmentation. Mahboubi and Darmont propose the use of a k-means based fragmentation approach [MD08]. The authors do not consider the similarity of query predicates in the workload before performing the k-means clustering in their approach. We cluster similar selection predicates involved in the workload as a pre-processing step for the fragmentation; we expect this to further improve query performance. We investigate clustering techniques and study the resulting performance for a selected case study. We conclude that, in general, for our workloads and for our experimental parameters, the final clusters obtained using our predicate preprocessing system are tighter and more meaningful. As the number of similar values in the workload decreases, the relative savings of the predicate preprocessing system are reduced. If there are no similar values in the workload, the original fragmentation system is more efficient.
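The predicate preprocessing step can be illustrated with a tiny one-dimensional merge (an invented sketch, including the `tol` threshold; the thesis evaluates real clustering techniques on real workloads):

```python
# Sketch of predicate preprocessing: before clustering the workload, nearby
# constants from selection predicates (e.g. `age < 31`, `age < 33`) are
# merged so near-duplicate predicates land in one cluster.

def merge_predicate_values(values, tol):
    """Single-pass 1-D merge: sorted values closer than `tol` share a group,
    and each value is replaced by its group centroid."""
    ordered = sorted(values)
    groups, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= tol:
            current.append(v)
        else:
            groups.append(current)
            current = [v]
    groups.append(current)
    centroid = {v: sum(g) / len(g) for g in groups for v in g}
    return [centroid[v] for v in values]

# Five predicate constants collapse to two representative values:
merged = merge_predicate_values([31, 33, 90, 32, 88], tol=5)
```

Feeding the merged values (rather than the raw constants) into k-means is the kind of preprocessing that yields tighter clusters when the workload contains many similar values.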

    Committee: Karen Davis Ph.D. (Committee Chair); Raj Bhatnagar Ph.D. (Committee Member); Carla Purdy Ph.D. (Committee Member) Subjects: Computer Science
  • 15. Kucuktunc, Onur Result Diversification on Spatial, Multidimensional, Opinion, and Bibliographic Data

    Doctor of Philosophy, The Ohio State University, 2013, Computer Science and Engineering

    Similarity search methods in the literature produce results based on the ranked degree of similarity to the query. However, the results are typically unsatisfactory, especially if there is ambiguity in the query, or the search space includes redundantly repeating similar documents. Diversity in query results is preferred by a variety of applications, since diverse results may give a complete view of the queried topic. In this study, we investigate the result diversification task in various application areas, such as opinion retrieval and paper recommendation, with different types of data, such as spatial data, high-dimensional data, opinions, citation graphs, and other networks. Although the definitions of diversity differ from field to field, we propose techniques considering the general objective of result diversification, which is to maximize the similarity of search results to the query while minimizing the pairwise similarity between the results, without neglecting efficiency. For diversity on spatial and high-dimensional data, we make an analogy with the concept of natural neighbors and propose geometric methods. We also introduce a diverse browsing method based on the popular distance browsing feature of R-tree index structures. Next, we focus on search and retrieval of opinion data on certain entities, and start our analysis by looking at direct correlations between sentiments of opinions and the demographics (e.g., gender, age, education level, etc.) of the people who generate those opinions. Based on the analysis, we argue that opinion diversity can be achieved by diversifying the sources of opinions. Recommendation tasks on academic networks also suffer from the aforementioned ambiguity and redundancy issues. To observe those effects, we present a paper recommendation framework called theadvisor (http://theadvisor.osu.edu), which recommends new papers to researchers using only the reference-citation relationships between academic papers. 
We introduce (open full item for complete abstract)

    Committee: Umit V. Catalyurek (Advisor); Srinivasan Parthasarathy (Committee Member); Arnab Nandi (Committee Member) Subjects: Computer Science
  • 16. Gilder, Jason Computational methods for the objective review of forensic DNA testing results

    Doctor of Philosophy (PhD), Wright State University, 2007, Computer Science and Engineering PhD

    Since the advent of criminal investigations, investigators have sought a "gold standard" for the evaluation of forensic evidence. Currently, deoxyribonucleic acid (DNA) technology is the most reliable method of identification. Short Tandem Repeat (STR) DNA genotyping has the potential for impressive match statistics, but the methodology is not infallible. The condition of an evidentiary sample and potential issues with the handling and testing of a sample can lead to significant issues with the interpretation of DNA testing results. Forensic DNA interpretation standards are determined by laboratory validation studies that often involve small sample sizes. This dissertation presents novel methodologies to address several open problems in forensic DNA analysis and demonstrates the improvement of the reported statistics over existing methodologies. Establishing a dynamically calculated RFU threshold specific to each analysis run improves the identification of signal from noise in DNA test data. Objectively identifying data consistent with degraded DNA sample input allows for a better understanding of the nature of an evidentiary sample and affects the potential for identifying allelic dropout (missing data). The interpretation of mixtures of two or more individuals has been problematic, and new mathematical frameworks are presented to assist in that interpretation. Assessing the weight of a DNA database match (a cold hit) relies on statistics that assume that all individuals in a database are unrelated; this dissertation explores the statistical consequences of related individuals being present in the database. Finally, this dissertation presents a statistical basis for determining if a DNA database search resulting in a very similar but nonetheless non-matching DNA profile indicates that a close relative of the source of the DNA in the database is likely to be the source of an evidentiary sample.
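The dynamically calculated RFU threshold can be sketched with a conventional noise-floor rule (an assumption for illustration only; the dissertation's actual per-run procedure differs, and all readings below are invented):

```python
from statistics import mean, stdev

# Hedged sketch of a run-specific detection threshold: estimate the noise
# floor from baseline RFU readings of the run and flag peaks above
# mean + z * sd. The z = 3 cutoff is a conventional illustrative choice,
# not the dissertation's method.

def dynamic_threshold(baseline_rfu, z=3.0):
    return mean(baseline_rfu) + z * stdev(baseline_rfu)

baseline = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]   # noise readings for this run
peaks = [3.0, 48.0, 7.0, 120.0]             # candidate allele peaks (RFU)
threshold = dynamic_threshold(baseline)
called = [p for p in peaks if p > threshold]
```

The point of a per-run threshold is that a run with noisier baseline data automatically demands taller peaks before a signal is called.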

    Committee: Travis Doom (Advisor) Subjects:
  • 17. MALLADI, RAJESWARI APPLYING MULTIPLE QUERY OPTIMIZATION IN MOBILE DATABASES

    MS, University of Cincinnati, 2001, Engineering : Computer Science

    Mobile computing is a fast growing research and commercial area. An important application of mobile networks is data dissemination over limited bandwidth channels. There are different modes of data dissemination: push-based, pull-based, or a combination of both. In push-based, the data is broadcast in the form of broadcast disks. In pull-based, a mobile unit sends an uplink query to a central server, the server processes the data and sends the answer on a downlink channel. If the number of uplink queries is large, a lot of channel bandwidth is expended in sending the answers on the downlink channels. In this study, we apply multiquery optimization to batches of pull requests in mobile databases. Materialized views are created that can be used to answer several queries at once. The materialized views are then broadcast on a push-pull channel dedicated for this purpose (answers to multiple pull queries). Each mobile unit receives a short message from the server that contains information about when and for how long to tune to the channel to retrieve the requested information. We compare multiple query processing for pull requests (MQPR) with a basic pull request method (PR) in which each query is handled separately. Appropriate algorithms and formulae are given to calculate the bandwidth usage and the wait time for the mobiles sending the requests. A performance study is conducted by simulating different query loads over a testbed schema. The studies indicate a significant savings in the channel bandwidth usage and also a significant reduction in the wait time in MQPR compared to PR.
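A back-of-envelope comparison shows why MQPR can save downlink bandwidth relative to PR (all sizes below are invented; the thesis gives proper formulae and simulation results):

```python
# PR sends one full answer per query; MQPR broadcasts one materialized view
# shared by the overlapping queries plus a short notification per mobile unit
# telling it when and how long to tune in.

def pr_bandwidth(answer_sizes):
    return sum(answer_sizes)

def mqpr_bandwidth(view_size, num_queries, notify_size):
    return view_size + num_queries * notify_size

answers = [400, 350, 420, 380]  # per-query answer sizes (KB) with overlapping data
pr = pr_bandwidth(answers)
mqpr = mqpr_bandwidth(view_size=500, num_queries=4, notify_size=2)
```

The savings grow with the overlap among the batched pull requests; with disjoint answers, the materialized view approaches the sum of the individual answers and the advantage shrinks.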

    Committee: Dr. Karen Davis (Advisor) Subjects: Computer Science
  • 18. Sinha, Aditya Formal Concept Analysis for Search and Traversal in Multiple Databases with Effective Revision

    MS, University of Cincinnati, 2009, Engineering : Computer Science

    With an explosion in data sizes, new search technologies are required to aid the user in better understanding the data. In this thesis, we present F.A.S.T.E.R (Formal Concept Analysis for Search and Traversal in multiple databases with Effective Revision), an application employing techniques of Formal Concept Analysis to better assimilate information from databases and further correlate it across multiple databases. The application presents the information from the databases as concepts, generating only those parts of the entire lattice which are of interest to the user. We apply F.A.S.T.E.R to two sets of real-life data and demonstrate how it can be helpful in understanding data. We also present a mathematical model for the application to get a fair idea of how F.A.S.T.E.R responds to different types of databases.
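The core FCA machinery, deriving a formal concept from an attribute set via the closure (double-prime) operator, can be sketched on a toy context (invented data; F.A.S.T.E.R additionally generates only the user-relevant parts of the lattice and spans multiple databases):

```python
# A formal context maps objects to their attribute sets; the derivation
# operators extent/intent, composed, close any attribute set into a
# formal concept (extent, intent).

context = {
    "db1": {"indexed", "distributed"},
    "db2": {"indexed"},
    "db3": {"indexed", "distributed", "replicated"},
}

def extent(attrs):
    """Objects having every attribute in `attrs`."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by every object in `objs`."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set()

def concept(attrs):
    """Close `attrs` into a formal concept (extent, intent)."""
    e = extent(attrs)
    return e, intent(e)

e, i = concept({"distributed"})
# extent {"db1", "db3"}; their shared intent is {"indexed", "distributed"}
```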

    Committee: Raj Bhatnagar (Committee Chair); John Schlipf (Committee Member); Yizong Cheng (Committee Member) Subjects: Computer Science
  • 19. SHINDE, KAUSTUBH FUNCTION COMPUTING IN VERTICALLY PARTITIONED DISTRIBUTED DATABASES

    MS, University of Cincinnati, 2006, Engineering : Computer Science

    Advances in database and storage technology and the need to manage a constant flow of information have made databases a necessity for every organization. These databases are independently owned and operated by their respective organizations, and privacy of data prevents complete data sharing between entities. Combined data from multiple sources can potentially contribute to mutually beneficial computations. This thesis perceives independent databases as a single logical database partitioned and distributed over a number of locations, and aims to compute complex mathematical functions over vertically distributed databases while preserving privacy of data. A global function to be computed over the single logical database, as we perceive it, is broken into a set of local functions that operate on individual data sites. We developed an iterative algorithm and its variation that performs global computation using summaries of local computations. The working of the algorithms was shown through software simulations, and the effectiveness of the suggested approach was demonstrated.
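The "global computation from local summaries" idea can be shown in a deliberately simplified, one-shot form (horizontal rather than vertical partitioning, and no iteration, unlike the thesis's algorithms; the data is invented):

```python
# Each site reveals only aggregate summaries (count, sum, sum of squares);
# the coordinator combines them into a global mean and variance without
# ever seeing individual records.

def local_summary(values):
    return len(values), sum(values), sum(v * v for v in values)

def global_mean_var(summaries):
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    sq = sum(s[2] for s in summaries)
    m = total / n
    return m, sq / n - m * m  # population variance

site_a = [2.0, 4.0]
site_b = [6.0, 8.0]
mean, var = global_mean_var([local_summary(site_a), local_summary(site_b)])
```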

    Committee: Dr. Raj Bhatnagar (Advisor) Subjects: Computer Science
  • 20. JHAVER, RISHI DISCOVERY OF LINEAR TRAJECTORIES IN GEOGRAPHICALLY DISTRIBUTED DATASETS

    MS, University of Cincinnati, 2003, Engineering : Computer Science

    We work with temporal data stored in distributed databases that are spread over a region. We consider a sensor network where many sensor nodes are spread in a grid-like manner. These sensor nodes are capable of storing data, and each thus acts as a separate dataset. The entire network of these sensors acts as a set of distributed datasets. An algorithm is introduced that mines global temporal patterns from these datasets and results in the discovery of linear trajectories of moving objects under supervision. Each of these datasets has its local temporal dataset along with spatial data and the geographical coordinates of a given object or target. The main objective here is to perform in-network aggregation between the data contained in the various datasets to discover global spatio-temporal patterns; the main constraint is that there should be minimal communication among the participating nodes. We present the algorithm and analyze it in terms of communication costs. The cost of our algorithm is much smaller than that of the alternative in which the data must be transferred to a single site and then mined. In addition, we vary the requirements of our algorithm slightly and present a variant that enhances its performance in terms of the overall complexity of computations. We go on to show that while the efficiency of the algorithm increases in terms of the number of messages exchanged between nodes, the amount of information available to all the nodes in the system decreases. The advantages and drawbacks of this variant of our algorithm are also presented.
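The flavor of in-network aggregation for linear trajectories can be sketched with distributed least-squares sufficient statistics (an illustrative reconstruction, not the thesis's algorithm; the node readings are fabricated):

```python
# Each node sends only its local sufficient statistics (n, Σx, Σy, Σxy, Σx²);
# summing them anywhere in the network yields the global least-squares line,
# without ever shipping raw readings to a single site.

def local_stats(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return n, sx, sy, sxy, sxx

def fit_line(stats_list):
    n, sx, sy, sxy, sxx = (sum(s[i] for s in stats_list) for i in range(5))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n  # (slope, intercept)

node1 = [(0.0, 1.0), (1.0, 3.0)]   # readings along y = 2x + 1
node2 = [(2.0, 5.0), (3.0, 7.0)]
slope, intercept = fit_line([local_stats(node1), local_stats(node2)])
```

Each node contributes a constant-size message regardless of how many readings it holds, which is the communication-cost advantage the abstract describes.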

    Committee: Dr. Raj Bhatnagar (Advisor) Subjects: Computer Science