Search Results (1 - 3 of 3 Results)

Abu Doleh, Anas
High Performance and Scalable Matching and Assembly of Biological Sequences
Doctor of Philosophy, The Ohio State University, 2016, Electrical and Computer Engineering
Next Generation Sequencing (NGS), a massively parallel and low-cost sequencing technology, generates enormous amounts of sequencing data. This facilitates the discovery of new genomic sequences and advances biological and medical research. However, these advances also bring significant computational challenges. In almost all NGS analysis pipelines, the most crucial and computationally intensive tasks are sequence similarity searching and de novo genome assembly. Thus, in this work, we introduce novel and efficient techniques that exploit advances in High Performance Computing hardware and data computing platforms to accelerate these tasks while producing high-quality results. For sequence similarity search, we studied the use of massively multithreaded architectures, such as the Graphics Processing Unit (GPU), to accelerate two important problems: reads mapping and maximal exact matching. First, we introduce a new mapping tool, Masher, which processes long (and short) reads efficiently and accurately. Masher employs a novel indexing technique that produces an index for a huge genome, such as the human genome, with a memory footprint small enough that it can be stored and efficiently accessed on a restricted-memory device such as a GPU. The results show that Masher is faster than state-of-the-art tools and achieves good accuracy and sensitivity on sequencing data with various characteristics. Second, we studied the maximal exact matching problem because of its importance in detecting and evaluating similarity between sequences. We introduce a novel tool, GPUMEM, which efficiently utilizes the GPU to build a lightweight index and find maximal exact matches between two genome sequences. The index construction is so fast that, even including its time, GPUMEM is faster in practice than state-of-the-art tools that use a pre-built index.
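To illustrate the maximal exact matching problem the abstract refers to (this is a naive seed-and-extend sketch for intuition only, not GPUMEM's GPU algorithm), a MEM is an exact match between two sequences that cannot be extended in either direction:

```python
def maximal_exact_matches(ref, query, min_len):
    """Naive MEM finder: seed with min_len-mers, extend, keep maximal matches."""
    # Index every min_len-mer of the reference sequence.
    index = {}
    for i in range(len(ref) - min_len + 1):
        index.setdefault(ref[i:i + min_len], []).append(i)
    mems = set()
    for j in range(len(query) - min_len + 1):
        for i in index.get(query[j:j + min_len], []):
            # Extend left as far as the sequences agree.
            s, t = i, j
            while s > 0 and t > 0 and ref[s - 1] == query[t - 1]:
                s -= 1
                t -= 1
            # Extend right as far as the sequences agree.
            e_r, e_q = i + min_len, j + min_len
            while e_r < len(ref) and e_q < len(query) and ref[e_r] == query[e_q]:
                e_r += 1
                e_q += 1
            mems.add((s, t, e_r - s))  # (ref position, query position, length)
    return sorted(mems)

print(maximal_exact_matches("ACGTACGGA", "TTACGGAC", 3))  # -> [(0, 2, 3), (3, 1, 6)]
```

Real tools replace this quadratic seeding with suffix- or index-based structures; the point of a GPU implementation such as GPUMEM is that the seed lookups and extensions for all query positions are independent and can run in parallel.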
De novo genome assembly is a crucial step in NGS analysis because of the novelty of the discovered sequences. First, we studied parallelizing de Bruijn graph based de novo genome assembly on distributed-memory systems using the Spark framework and the GraphX API. We propose a new tool, Spaler, which assembles short reads efficiently and accurately. Spaler starts with de Bruijn graph construction. Then, it applies iterative graph reduction and simplification techniques to generate contigs. After that, Spaler uses the reads mapping information to produce scaffolds. Spaler employs a smart parallelism-level tuning technique to improve the performance of each of these steps independently. The experiments show promising results in terms of scalability, execution time, and quality. Second, we addressed the problem of de novo metagenomics assembly. Spaler may not properly assemble sequenced data extracted from environmental samples because of the complexity and diversity of living microbial communities. Thus, we introduce meta-Spaler, an extension of Spaler, to handle metagenomics datasets. meta-Spaler partitions the reads based on their expected coverage and applies an iterative assembly. The results show an improvement in the assembly quality of meta-Spaler compared to that of Spaler.
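The de Bruijn graph construction and contig generation that Spaler performs at cluster scale can be sketched in miniature (a single-machine toy, not Spaler's Spark/GraphX implementation): nodes are (k-1)-mers, each k-mer in a read adds an edge, and contigs come from walking unambiguous paths:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Follow unambiguous edges (out-degree 1, successor in-degree 1) to grow a contig."""
    indegree = defaultdict(int)
    for node in graph:
        for succ in graph[node]:
            indegree[succ] += 1
    contig, node = start, start
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        if indegree[nxt] != 1:  # stop at branch points
            break
        contig += nxt[-1]       # append the one new base
        node = nxt
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, 4)
print(extend_contig(graph, "ATG"))  # -> ATGGCGTGCAAT
```

The distributed version expresses these same graph reduction steps as vertex-parallel operations, which is what makes GraphX a natural fit.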

Committee:

Umit Catalyurek (Advisor); Kun Huang (Committee Member); Fusun Ozguner (Committee Member)

Subjects:

Bioinformatics; Computer Engineering

Keywords:

bioinformatics;sequence similarity;indexing;graphical processing unit;Apache Spark;de Bruijn graph;de novo assembly;metagenomics

Mutharaju, Raghava
Distributed Rule-Based Ontology Reasoning
Doctor of Philosophy (PhD), Wright State University, 2016, Computer Science and Engineering PhD
The vision of the Semantic Web is to provide structure and meaning to the data on the Web. Knowledge representation and reasoning play a crucial role in accomplishing this vision. OWL (Web Ontology Language), a W3C standard, is used for representing knowledge, and reasoning over ontologies is used to derive logical consequences. A fixed set of rules is applied to an ontology iteratively until no new logical consequences can be derived. All existing reasoners run on a single machine, possibly using multiple cores. Ontologies (sometimes loosely referred to as knowledge bases) that are constructed automatically can be very large, and single-machine reasoners cannot handle them because they are constrained by the memory and computing resources available on one machine. In this dissertation, we use distributed computing to find scalable approaches to ontology reasoning. In particular, we explore four approaches that use a cluster of machines: 1) A MapReduce approach named MR-EL, where reasoning happens as a series of map and reduce jobs and termination is achieved by eliminating duplicate consequences. The MapReduce approach is simple, fault tolerant, and less error-prone because the framework handles aspects such as communication and synchronization, but it is very slow and does not scale well to large ontologies. 2) Our second approach, named DQuEL, is a distributed version of the sequential reasoning algorithm used in the CEL reasoner. Each node in the cluster applies all of the rules and generates partial results; the reasoning process terminates when no node in the cluster has any more work to do. DQuEL works well on small and medium-sized ontologies but does not perform well on large ones. 3) The third approach, named DistEL, is a distributed fixpoint iteration approach where each node in the cluster applies only one rule to a subset of the ontology. This happens iteratively until no node can generate any new logical consequences. This is the most scalable of the approaches. 4) Our fourth approach, named SparkEL, is based on the Apache Spark framework, where each reasoning rule is translated into a form suitable for Spark; several algorithmic and framework-related optimizations were applied. SparkEL works very well on small and medium-sized ontologies but does not scale to large ones. All four distributed reasoning systems work on a subset of OWL 2 EL, a tractable profile of OWL with polynomial reasoning time. Along with the description of the algorithms, optimizations, and evaluation results of the four distributed reasoners, we also provide recommendations for the best choice of reasoner for different scenarios.
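The fixpoint iteration at the heart of all four systems can be illustrated on a single machine with one representative rule (a toy sketch, not any of the reasoners' actual rule sets): subclass transitivity, "from A ⊑ B and B ⊑ C, derive A ⊑ C", applied until no new consequences appear:

```python
def transitive_closure(axioms):
    """Apply 'A subClassOf B, B subClassOf C => A subClassOf C' until fixpoint."""
    derived = set(axioms)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, b) in derived:
            for (b2, c) in derived:
                if b == b2 and (a, c) not in derived:
                    new.add((a, c))
        if new:                 # new consequences found: iterate again
            derived |= new
            changed = True
    return derived              # fixpoint: no rule application adds anything

axioms = {("Cat", "Mammal"), ("Mammal", "Animal"), ("Animal", "LivingThing")}
closure = transitive_closure(axioms)
print(("Cat", "LivingThing") in closure)  # -> True
```

The distributed approaches differ mainly in how this loop is split up: MR-EL runs each round as map/reduce jobs, while DistEL assigns each rule to its own set of nodes and iterates until the whole cluster reaches the fixpoint.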

Committee:

Pascal Hitzler, Ph.D. (Advisor); Prabhaker Mateti, Ph.D. (Committee Member); Derek Doran, Ph.D. (Committee Member); Freddy Lecue, Ph.D. (Committee Member); Frederick Maier, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

distributed reasoning;ontology reasoning;distributed OWL 2 EL reasoning;Semantic Web;scalable ontology reasoning;OWL 2 EL reasoning;MapReduce reasoning;Apache Spark reasoning

Madeti, Preetham
Using Apache Spark's MLlib to Predict Closed Questions on Stack Overflow
Master of Computing and Information Systems, Youngstown State University, 2016, Department of Computer Science and Information Systems
Monitoring post quality on the Stack Overflow website is critically important to keeping the experience smooth for its users. The site strongly discourages unproductive discussions and unrelated questions. Questions can be closed for several reasons, ranging from questions unrelated to programming to questions that cannot lead to a productive answer. Manual moderation of the site's content is tedious, as approximately seventeen thousand new questions are posted every day. Therefore, leveraging machine learning algorithms to identify bad questions would be a smart, time-saving method for the community. The goal of this thesis is to build a machine learning classifier that predicts whether a question will be closed, given various textual and post-related features. A training model was created using Apache Spark's machine learning library (MLlib). This model not only predicts closed questions with good accuracy, but also computes results in a very small time frame.
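The kind of text classification the thesis describes can be sketched without a Spark cluster (a stdlib naive Bayes toy with made-up example titles, not the thesis's MLlib model or its features): train on labeled question texts, then score new ones per class:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (text, label). Returns per-class priors and word log-probs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    model = {}
    total = sum(class_counts.values())
    for label in class_counts:
        n = sum(word_counts[label].values())
        model[label] = {
            "prior": math.log(class_counts[label] / total),
            # Laplace smoothing so unseen words don't zero out a class.
            "logp": {w: math.log((word_counts[label][w] + 1) / (n + len(vocab)))
                     for w in vocab},
            "unseen": math.log(1 / (n + len(vocab))),
        }
    return model

def predict(model, text):
    """Pick the class with the highest log-probability for the text."""
    words = text.lower().split()
    best, best_score = None, float("-inf")
    for label, params in model.items():
        score = params["prior"] + sum(params["logp"].get(w, params["unseen"])
                                      for w in words)
        if score > best_score:
            best, best_score = label, score
    return best

samples = [
    ("how to sort a list in python", "open"),
    ("what is the best programming language", "closed"),
    ("segfault when freeing pointer in c", "open"),
    ("recommend me a good laptop for coding", "closed"),
]
model = train_nb(samples)
print(predict(model, "how to free a pointer in c"))  # -> open
```

MLlib provides the same pipeline pieces (tokenization, feature hashing, NaiveBayes/LogisticRegression) as distributed operations, which is what makes training on the full Stack Overflow dump feasible.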

Committee:

Alina Lazar, PhD (Advisor); Bonita Sharif, PhD (Committee Member); Yong Zhang, PhD (Committee Member)

Subjects:

Computer Science; Information Systems

Keywords:

Machine learning; Feature Extraction; Apache Spark; Stack Overflow; Textual Analysis