Search Results (1 - 25 of 28 Results)

Sanghvi, Niraj D.
Parallel Computation of the Meddis MATLAB Auditory Periphery Model
Master of Science, The Ohio State University, 2012, Electrical and Computer Engineering

The Meddis MATLAB Auditory Periphery (MAP) model is a computational model of the mammalian auditory system built in MATLAB. The model simulates the physiological processes taking place at different stages within the auditory system and provides insight into how a sound wave incident on the ear is modified as it passes through the auditory system. The stages of the auditory system included in the model are the middle ear, the inner ear, and parts of the brain stem.

Calculating the response of each stage is a computationally intensive and time-consuming task. For a sound signal of 1 second duration with 1000 best frequencies, it takes more than 7 minutes just to calculate the responses of all auditory nerves, each producing an action potential about 300 times a second. This is a major disadvantage, especially when the model has to be run multiple times in speech and hearing experiments.

The thesis describes how the runtime of the MAP model can be reduced by modifying the code, enabling it to run in parallel on multiple processing cores using tools provided by MATLAB. The thesis also describes how GPUs can be utilized to further reduce runtimes. The thesis concludes with an application of the MAP model in detecting differences between FM signals.
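Since the auditory-nerve channels at each best frequency are independent, the parallelization amounts to mapping the per-channel work across workers. The Python sketch below illustrates that channel-parallel pattern with a hypothetical second-order resonator standing in for the MAP stages; the thesis itself uses MATLAB's Parallel Computing Toolbox and GPUs, so everything here is illustrative only.

    # Channel-parallel auditory filtering sketch: a simple 2nd-order resonator
    # per best frequency stands in for the far more detailed MAP stages. Each
    # channel is independent, so a process pool maps over the best-frequency
    # list, mirroring a parfor over channels.
    import numpy as np
    from multiprocessing import Pool
    from functools import partial

    def resonator_response(cf, signal, fs=44100.0, bw=100.0):
        """Filter `signal` with a 2nd-order resonator centred at `cf` Hz."""
        r = np.exp(-2.0 * np.pi * bw / fs)
        a1, a2 = -2.0 * r * np.cos(2.0 * np.pi * cf / fs), r * r
        y = np.zeros_like(signal)
        for n in range(len(signal)):                  # per-channel work
            y[n] = signal[n] - a1 * (y[n-1] if n >= 1 else 0.0) \
                             - a2 * (y[n-2] if n >= 2 else 0.0)
        return y

    if __name__ == "__main__":
        fs = 44100.0
        t = np.arange(0, 0.1, 1.0 / fs)               # 100 ms test tone
        signal = np.sin(2.0 * np.pi * 1000.0 * t)
        best_freqs = np.logspace(np.log10(250), np.log10(8000), 32)
        with Pool() as pool:                          # one channel per task
            responses = pool.map(partial(resonator_response, signal=signal, fs=fs),
                                 best_freqs)
        print(len(responses), "channel responses of length", len(responses[0]))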

Committee:

Ashok Krishnamurthy, PhD (Advisor); Yuan Zheng, PhD (Committee Member); Lawrence Feth, PhD (Committee Member)

Keywords:

Parallel Computing; Auditory Periphery; High Performance Computing; GPU; CUDA; Meddis Auditory Periphery; MATLAB; Parallel Computing Toolbox;

Wang, Zhiqiang
STUDYING COMPUTATIONAL METHODS FOR BIOMEDICAL GEOMETRY EXTRACTION AND PATIENT SPECIFIC HEMODYNAMICS
PHD, Kent State University, 2017, College of Arts and Sciences / Department of Computer Science
With the development of medical imaging and numerical simulation techniques, image-based Patient Specific Computational Hemodynamics (PSCH) has become a powerful approach to non-invasively quantify vascular fluid mechanics in human arteries. However, most existing PSCH methods are currently impractical for real clinical applications due to time-consuming computation. In clinical applications, an efficient image-based PSCH approach is needed to offer patient-specific diagnostic information in a timely manner and to make large population studies possible. Without such studies, the correlation between patient-specific clinical symptoms and hemodynamic patterns cannot be assessed. In the image-based PSCH processing procedure, three main parts are responsible for the bulk of the computation. The first is obtaining the anatomic geometry. The second is converting the geometry into a suitable mesh or grid for simulation. The third hotspot is the large number of calculations in the hemodynamics simulation. To address these problems, this thesis proposes several efficient algorithms to speed up image-based PSCH computations.

First, a fully parallel numerical method is proposed to quickly extract anatomical geometry from medical images. This method efficiently solves the level set equations of active contours based on the Lattice Boltzmann model (LBM) for image segmentation, and a parallel distance field regularization algorithm is integrated into the LBM computing scheme to keep the computation stable. This approach avoids external regularization, which has been a major impediment to direct parallelization of level set evolution with LBM, and allows the whole computing process to be efficiently executed on a Graphics Processing Unit (GPU). Further, this method can be combined with different image features for various image segmentation tasks.

Second, a new method is proposed that utilizes fluid features, particularly the mean flow intensity, to extract the blood flow field in a target vessel from 4D flow MRI images, which encode blood velocity information in vessels. This approach is computationally efficient and robust to blood flow changes in a cardiac cycle, even in relatively small arteries. The extracted velocity field can be used as inlet and outlet conditions to facilitate the hemodynamics simulation and to evaluate simulation results. Moreover, the target artery wall deformation factors at a few spatial locations of the artery and time steps in one cardiac cycle can be estimated. These factors enable the generation of high quality, deforming artery walls by morphing the wall acquired from accurate but static Time of Flight (TOF) MRI images. The dynamic artery wall benefits blood flow visualization tasks and also has the potential to be used as a moving boundary for hemodynamic simulations.

Third, after image segmentation, the anatomical geometry is implicitly represented by the zero level set of a distance field. Based on the segmentation result, a method is proposed to directly generate simulation grids for Volumetric Lattice Boltzmann Method (VLBM) based hemodynamic simulation. No surface reconstruction or mesh generation is needed, which avoids the extra computational cost and inaccuracy of the two transforms: from volume to mesh, then from mesh to volume grids. Further, the surface normal direction at the artery wall can also be directly estimated to calculate Wall Shear Stress (WSS).
Finally, VLBM has been GPU-accelerated to simulate hemodynamics in human arteries by using a uniform computing scheme for both fluid and boundary grids. In traditional computational fluid simulation methods, boundary conditions have to be handled separately on boundary nodes and fluid nodes, and the resulting large number of branching operations is inefficient for GPU parallel computation due to its Single Instruction Multiple Data (SIMD) architecture. For complicated biomechanical structures, this situation is even worse, as there are many boundary cells in the computational domain. The proposed parallel VLBM implementation does not need to distinguish fluid cells from boundary cells, so branching is minimized and GPU kernel execution is accelerated.
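The branch-free treatment of fluid and boundary cells can be pictured with a toy relaxation step. The sketch below is plain NumPy and a simple diffusion update rather than the actual VLBM collision-streaming scheme, with an assumed per-cell fluid-fraction array P; it only shows how one uniform formula can cover both cell types without per-cell branching.

    # Toy illustration of uniform fluid/boundary treatment: every cell runs the
    # same stencil update, scaled by a per-cell fluid fraction P in [0, 1]
    # (P=1 fluid, P=0 solid wall). No per-cell branching is needed, which is
    # the property that makes the real VLBM scheme SIMD/GPU friendly.
    import numpy as np

    def uniform_update(u, P, omega=0.5):
        # Average of the four axis neighbours (periodic edges for simplicity).
        nbr = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                      + np.roll(u, 1, 1) + np.roll(u, -1, 1))
        # Same arithmetic everywhere; P silently freezes solid cells.
        return u + omega * P * (nbr - u)

    P = np.ones((64, 64))
    P[:, 0] = P[:, -1] = 0.0          # solid walls on two sides
    u = np.zeros((64, 64))
    u[32, 32] = 1.0                   # initial hot spot
    for _ in range(100):
        u = uniform_update(u, P)
    print("field range:", float(u.min()), float(u.max()))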

Committee:

Ye Zhao (Advisor); Jun Li (Committee Member); ChengChang Lu (Committee Co-Chair); Robert Clements (Committee Chair); Arden Rutten (Committee Member)

Subjects:

Biomedical Engineering; Computer Science

Keywords:

Image segmentation, Hemodynamic simulation, Geometry processing, Parallel computing

Huo, Xin
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Parallel computing architectures have become ubiquitous nowadays. In particular, Graphics Processing Units (GPUs), the Intel Xeon Phi co-processor (a many integrated core architecture), and heterogeneous coupled and decoupled CPU-GPU architectures have emerged as significant players in high performance computing due to their performance, cost, and energy efficiency. Given this trend, an important research issue is how to efficiently utilize these varieties of architectures to accelerate different kinds of applications. Our overall goal is to provide parallel strategies and runtime support for modern GPUs, the Intel Xeon Phi, and heterogeneous architectures to accelerate applications with different communication patterns. The challenges arise from two aspects: the application and the architecture. On the application side, not all kinds of applications can be easily ported to GPUs or the Intel Xeon Phi with high performance. On the architecture side, the SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data) execution models employed by GPUs and the Intel Xeon Phi, respectively, do not favor applications with data and control dependencies. Moreover, heterogeneous CPU-GPU architectures, i.e., integrated CPU-GPU systems, bring new models and challenges for data sharing. Our efforts focus on four kinds of application patterns: generalized reduction, irregular reduction, stencil computation, and recursive applications, and explore the mechanisms for parallelizing them on various parallel architectures, including GPUs, the Intel Xeon Phi architecture, and heterogeneous CPU-GPU architectures. We first study parallel strategies for generalized reductions on modern GPUs. The traditional mechanism to parallelize generalized reductions on GPUs is referred to as the full replication method, which assigns each thread an independent data copy to avoid data races and maximize parallelism. However, it introduces significant memory and result-combination overhead, and is inapplicable for a large number of threads. In order to reduce memory overhead, we propose a locking method, which allows all threads per block to share the same data copy, and supports fine-grained and coarse-grained locking access. The locking method succeeds in reducing memory overhead, but introduces thread contention. To achieve a tradeoff between memory overhead and thread contention, we also present a hybrid scheme. Next, we investigate strategies for irregular or unstructured applications. One of the key challenges is how to utilize the limited-size shared memory on GPUs. We present a partitioning-based locking scheme that chooses an appropriate partitioning space to guarantee that all the data can be placed in shared memory. Moreover, we propose an efficient partitioning method and a data reordering mechanism. Further, to better utilize the computing capability of both the host and the accelerator, we extend our irregular parallel scheme to the heterogeneous CPU-GPU environment by introducing a multi-level partitioning scheme and a dynamic task scheduling framework, which supports pipelining between partitioning and computation, and work stealing between CPU and GPU. Then, motivated by the emergence of the integrated CPU-GPU architecture and its shared physical memory, we develop a thread-block-level scheduling framework for three communication patterns: generalized reduction, irregular reduction, and stencil computation.
This novel scheduling framework introduces lock-free access between CPU and GPU, reduces command launching and synchronization overhead by removing extra synchronizations, and improves load balance. Although support for unstructured control flow has been included in GPUs, the performance of recursive applications on modern GPUs is limited due to the current thread reconvergence method in the SIMT execution model. To efficiently port recursive applications to GPUs, we present a novel dynamic thread reconvergence method, which allows threads to reconverge either before or after the static reconvergence point, and improves instructions per cycle (IPC) significantly. The Intel Xeon Phi architecture is an emerging x86-based many-core coprocessor architecture with wide SIMD vectors. However, to fully utilize its potential, applications must be vectorized to leverage the wide SIMD lanes, in addition to exploiting effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions in the past. We consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use implementations of overloaded functions as a mechanism for providing SIMD code, assisted by runtime data reorganization and our methods for effectively managing control flow. In addition, all proposed methods have been extensively evaluated with multiple applications and different data inputs.
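The tradeoff between full replication and locking for a generalized reduction can be made concrete with a small sketch. The Python threads below form a toy histogram reduction, not the thesis's GPU or Xeon Phi implementation: one variant gives each worker a private copy that is combined at the end, the other shares a single copy guarded by a lock.

    # Two generalized-reduction strategies: full replication vs. a shared copy
    # with a lock, using a histogram as the reduction. Toy CPU-thread version.
    import threading
    import numpy as np

    data = np.random.randint(0, 16, size=100_000)
    chunks = np.array_split(data, 4)

    def reduce_replicated(chunks, bins=16):
        copies = [np.zeros(bins, dtype=np.int64) for _ in chunks]
        def work(i):
            np.add.at(copies[i], chunks[i], 1)      # race-free private update
        threads = [threading.Thread(target=work, args=(i,)) for i in range(len(chunks))]
        [t.start() for t in threads]; [t.join() for t in threads]
        return sum(copies)                          # combination step

    def reduce_locked(chunks, bins=16):
        hist = np.zeros(bins, dtype=np.int64)
        lock = threading.Lock()
        def work(chunk):
            local = np.bincount(chunk, minlength=bins)
            with lock:                              # coarse-grained lock
                hist += local
        threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
        [t.start() for t in threads]; [t.join() for t in threads]
        return hist

    assert (reduce_replicated(chunks) == reduce_locked(chunks)).all()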

Committee:

Gagan Agrawal (Advisor); P. Sadayappan (Committee Member); Feng Qin (Committee Member)

Subjects:

Computer Science

Keywords:

Parallel Computing, GPU, SIMD, Irregular Application, Recursive Control Flow

Lawrence, William Eric
PARALLEL SOLUTION OF THE TOPOLOGY OPTIMIZATION PROBLEM FOR ELASTIC CONTINUA
MS, University of Cincinnati, 2003, Engineering : Mechanical Engineering
Topology optimization problems require the repeated solution of finite element problems that are often extremely ill-conditioned. This ill-conditioning is due to the highly heterogeneous material properties that result as the topology optimization problem converges. It can make the use of iterative solvers inefficient if proper preconditioning is not applied to the stiffness matrix or if steps are not taken to reduce the solve time of these systems. This problem is addressed by considering the solution of the topology optimization problem for elastic continua using parallel computing and domain decomposition techniques. A parallel algorithm for the SIMP (Solid Isotropic Material with Penalization) formulation of the topology optimization problem is presented. This includes a study of the efficiency of the parallel linear solvers used to solve the equilibrium problem at each step of the iterative optimization problem. The solvers that are studied include two preconditioned conjugate gradient (PCG) methods: one using a diagonal preconditioner and one using an incomplete LU factorization preconditioner with a drop tolerance. A third condensation solver that employs a hybrid of direct and iterative (PCG) techniques is also studied. This condensation solver is shown to be the most effective of the three solvers studied, both in terms of parallel efficiency and in terms of its ability to mitigate the effects of the ill-conditioned systems. In addition to examining the parallel linear solvers, the heuristic design variable update scheme commonly used in topology optimization problems is also implemented and parallelized. The parallelization of this update scheme is found to be quite simple and straightforward. More difficult is the parallelization of the mesh-independency filter, which is also addressed here. A modified filtering approach is presented that limits the number of "ghost" elements that need to be exchanged across parallel sub-domains.
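For reference, the simplest of the solvers mentioned above, a diagonally (Jacobi) preconditioned conjugate gradient iteration, can be sketched in a few lines. The NumPy version below uses a small dense SPD test matrix with a badly scaled diagonal, standing in for the distributed sparse stiffness matrices actually used in the thesis.

    # Minimal Jacobi (diagonal) preconditioned conjugate gradient. Dense NumPy
    # for brevity; the topology-optimization solves use distributed sparse
    # stiffness matrices.
    import numpy as np

    def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
        x = np.zeros_like(b)
        Minv = 1.0 / np.diag(A)            # diagonal preconditioner
        r = b - A @ x
        z = Minv * r
        p = z.copy()
        rz = r @ z
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = Minv * r
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    # SPD test system with a badly scaled diagonal (mimicking heterogeneous
    # material stiffness); Jacobi scaling handles this well.
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((50, 50))
    A = Q @ Q.T + np.diag(10.0 ** rng.uniform(0, 4, 50))
    b = rng.standard_normal(50)
    x = pcg_jacobi(A, b)
    print("residual:", np.linalg.norm(A @ x - b))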

Committee:

Dr. Kumar Vemaganti (Advisor)

Subjects:

Engineering, Mechanical

Keywords:

topology optimization; parallel computing

Bokhari, Saniyah S.
Parallel Solution of the Subset-sum Problem: An Empirical Study
Master of Science, The Ohio State University, 2011, Computer Science and Engineering

We investigate the parallelization of an algorithm on three very different architectures. These are: a 128-processor Cray XMT massively multithreaded machine, a 16-processor IBM x3755 shared memory machine and a 240-core NVIDIA FX 5800 graphics processor unit (GPU). The problem we use in our investigation is the well-known subset-sum problem. While this is known to be NP-complete, it is solvable in pseudo-polynomial time, i.e., time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. The hypothesis that we wish to test is that the Cray, with its specialized hardware and large uniform shared memory, is suitable for very large problems, the IBM x3755 is suitable for intermediate sized problems and the NVIDIA FX 5800 can give superior performance only for problems that fit within its modest internal memory.

We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines we present an alternating word algorithm that can implement an efficient solution. The timings of our respective codes were carefully measured over a comprehensive range of problem sizes. On the Cray XMT we observe very good scaling for large problems and see sustained performance as the problem size increases. However this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10^12 bits or more. The IBM x3755 performs very well on medium sized problems, but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10^11 bits. The NVIDIA GPU performs well for problems whose tables fit within its 4 GB device memory. This corresponds to tables of size approximately 10^10 bits.

The experimental measurements support our hypothesis and show that each machine gives superior performance for distinct ranges of problem sizes and demonstrate the strengths and weaknesses of the three architectures.
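The pseudo-polynomial dynamic program behind these table sizes can be written compactly with word-packed bit operations, which is essentially what the word-level locking on the Cray XMT and the alternating-word scheme on the other machines parallelize; the single-core Python sketch below represents the whole table as one long bit vector.

    # Single-core sketch of the word-packed subset-sum dynamic program. Python
    # integers act as arbitrarily long bit vectors: bit s of `reachable` means
    # some subset sums to s. The thesis partitions and updates a table like
    # this in parallel across processors.
    def subset_sum(sizes, target):
        reachable = 1                      # only the empty sum 0 is reachable
        mask = (1 << (target + 1)) - 1     # keep bits 0..target
        for s in sizes:
            reachable |= (reachable << s) & mask
        return bool((reachable >> target) & 1)

    print(subset_sum([3, 34, 4, 12, 5, 2], 9))    # True  (4 + 5)
    print(subset_sum([3, 34, 4, 12, 5, 2], 30))   # False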

Committee:

Ten H. Lai, PhD (Advisor); Dong Xuan, PhD (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

Cray XMT; Dynamic Programming; IBM x3755; Multicore; Multithreading; NVIDIA FX 5800; OMP; Opteron; Parallel Algorithms; Parallel Computing; Shared Memory; Subset-sum problem

Bas, Erdeniz Ozgun
Load-Balancing Spatially Located Computations using Rectangular Partitions
Master of Science, The Ohio State University, 2011, Computer Science and Engineering
Distributing spatially located heterogeneous workloads is an important problem in parallel scientific computing. Particle-in-cell simulators, ray tracing, and partial differential equation solvers are some of the applications with spatially located workloads. We investigate the problem of partitioning such workloads (represented as a matrix of non-negative integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since finding the optimal arbitrary rectangle-based partition is an NP-hard problem, we investigate particular classes of solutions: rectilinear, jagged, and hierarchical. We present a new class of solutions called m-way jagged partitions, propose new optimal algorithms for m-way jagged partitions and hierarchical partitions, propose new heuristic algorithms, and provide worst-case performance analyses for some existing and new heuristics. Balancing the load alone does not guarantee that the total runtime of an application is minimized; to achieve that, one must also take the communication cost into account. Rectangle-shaped partitioning inherently keeps communication small, yet one should proactively minimize it. The algorithms we propose are tested in simulation on a wide set of instances and compared to state-of-the-art algorithms. Results show that m-way jagged partitions are low in total communication cost and practical to use.
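To make the structure of a jagged partition concrete, the sketch below builds a hypothetical P x Q jagged partition with a simple greedy splitter: rows are first cut into P stripes of roughly equal load, then each stripe's columns are cut independently. This only illustrates the solution class, not the optimal or analyzed heuristics from the thesis.

    # Sketch of an m-way jagged partition: row stripes first, then independent
    # column cuts per stripe. The greedy splitter is hypothetical and purely
    # illustrative.
    import numpy as np

    def greedy_splits(loads, parts):
        """Split a 1-D load vector into `parts` contiguous chunks of ~equal load."""
        total, bounds, acc = float(loads.sum()), [], 0.0
        for i, w in enumerate(loads):
            acc += w
            if len(bounds) < parts - 1 and acc >= total * (len(bounds) + 1) / parts:
                bounds.append(i + 1)
        return [0] + bounds + [len(loads)]

    def jagged_partition(W, P, Q):
        row_cuts = greedy_splits(W.sum(axis=1), P)
        rects = []
        for rs, re in zip(row_cuts[:-1], row_cuts[1:]):
            col_cuts = greedy_splits(W[rs:re].sum(axis=0), Q)
            rects += [(rs, re, cs, ce) for cs, ce in zip(col_cuts[:-1], col_cuts[1:])]
        return rects

    W = np.random.default_rng(1).integers(0, 10, size=(64, 64))
    rects = jagged_partition(W, P=4, Q=4)
    loads = [W[rs:re, cs:ce].sum() for rs, re, cs, ce in rects]
    print(len(rects), "rectangles; max load", max(loads), "ideal", W.sum() / len(rects))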

Committee:

Umit V. Catalyurek, PhD (Advisor); Radu Teodorescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

matrix partitioning; load balancing; rectangular partitioning; scientific computing; high performance computing; parallel computing

Lu, Kewei
Distribution-based Exploration and Visualization of Large-scale Vector and Multivariate Fields
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Due to the ever-increasing computing power of the last few decades, the size of the scientific data produced by various scientific simulations has been growing rapidly. As a result, effective techniques to visualize and explore those large-scale scientific data are becoming more and more important in understanding the data. However, for data at such a large scale, effective analysis and visualization is a non-trivial task for several reasons. First, it is often time consuming and memory intensive to perform visualization and analysis directly on the original data. Second, as the data become large and complex, visualization usually suffers from visual cluttering and occlusion, which makes it difficult for users to understand the data. In order to address the aforementioned challenges, this dissertation proposes a distribution-based, query-driven framework to visualize and analyze large-scale scientific data. We propose to use statistical distributions to summarize large-scale data sets. The summarized data is then used in place of the original data to support efficient and interactive query-driven visualization, which is often free of occlusion. In this dissertation, the proposed framework is applied to flow fields and multivariate scalar fields. We first demonstrate the application of the proposed framework to flow fields. For a flow field, the statistical data summarization is computed from geometries such as streamlines and stream surfaces computed from the flow field. Stream surfaces and streamlines are two popular methods for visualizing flow fields. When the data size is large, distributed memory parallelism is usually needed. In this dissertation, a new scalable algorithm is proposed to compute stream surfaces from large-scale flow fields efficiently on distributed memory machines. After we obtain a large number of computed streamlines or stream surfaces, a direct visualization of all the densely computed geometries is seldom useful due to visual cluttering and occlusion. To solve the visual cluttering problem, a distribution-based, query-driven framework to explore those densely computed streamlines is presented. Then, the proposed framework is applied to multivariate scalar fields. When dealing with multivariate data, it is often useful to show regions of interest based on user-specified criteria. In the presence of large-scale multivariate data, efficient techniques to summarize the data and answer users' queries are needed. In this dissertation, we first propose to use multivariate histograms to summarize the data and demonstrate how effective query-driven visualization can be achieved based on those multivariate histograms. However, storing multivariate histograms in the form of multi-dimensional arrays is very expensive. To enable efficient visualization and exploration of multivariate data sets, we present a compact structure to store multivariate histograms that reduces their huge space cost while supporting different kinds of histogram query operations efficiently. We also present an interactive system to assist users in effectively designing multivariate transfer functions. Multiple regions of interest can be highlighted through multivariate volume rendering based on the user-specified multivariate transfer function.
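The summarize-once, query-many idea can be sketched briefly: below, a 2-D field is reduced to per-block histograms and a threshold query is answered from the summaries alone. This is a single-variable NumPy toy; the dissertation's structures are multivariate and far more compact.

    # Distribution-based, query-driven sketch: summarize a field as per-block
    # histograms, then answer "fraction above a threshold" per block without
    # touching the original data again.
    import numpy as np

    def summarize(field, block=16, bins=32, lo=0.0, hi=1.0):
        """Return a (blocks_y, blocks_x, bins) histogram summary of a 2-D field."""
        by, bx = field.shape[0] // block, field.shape[1] // block
        hists = np.empty((by, bx, bins), dtype=np.int32)
        for i in range(by):
            for j in range(bx):
                tile = field[i*block:(i+1)*block, j*block:(j+1)*block]
                hists[i, j], _ = np.histogram(tile, bins=bins, range=(lo, hi))
        return hists

    def query_fraction_above(hists, threshold, lo=0.0, hi=1.0):
        """Approximate per-block fraction above `threshold`, from summaries only."""
        bins = hists.shape[-1]
        edges = np.linspace(lo, hi, bins + 1)
        selected = edges[1:] > threshold            # bins at or above the threshold
        return hists[..., selected].sum(axis=-1) / hists.sum(axis=-1)

    field = np.random.default_rng(2).random((256, 256))
    hists = summarize(field)
    print(query_fraction_above(hists, 0.9).round(2))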

Committee:

Han-Wei Shen (Advisor); Yusu Wang (Committee Member); Ponnuswamy Sadayappan (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

Large Data Analysis; Visualization; Query Driven Visualization; Parallel Computing

Shaker, Alfred M.
COMPARISON OF THE PERFORMANCE OF NVIDIA ACCELERATORS WITH SIMD AND ASSOCIATIVE PROCESSORS ON REAL-TIME APPLICATIONS
MS, Kent State University, College of Arts and Sciences / Department of Computer Science
Basic tasks for Air Traffic Control (ATC) will be implemented using NVIDIA's CUDA language on an NVIDIA device and compared to the performance of an associative SIMD processor doing the same tasks. To do this, we create a simulation of an airfield with constantly moving aircraft. The tasks that will be used in the evaluation are: tracking and correlation, collision detection, and collision resolution. These are the most compute intensive of the Air Traffic Control tasks, so they will give us a good measure of the capabilities of the NVIDIA device. The first task is tracking and correlation of the aircraft in a 256 nautical mile by 256 nautical mile bounding area on a 2D plane with varying altitudes. This task is executed once each half second during each 8-second major cycle period and uses radar to correlate the exact location of each aircraft with its flight records. During every 8-second cycle, Batcher's algorithm is used to check whether any aircraft's projected path has a possibility of collision. If a potential collision is possible within the next 20 minutes, we first locate a collision-free path for one of the aircraft and then have it switch to this path. In previous research, the ability of a multicore system to perform basic ATC tasks was investigated. The graph showing its performance increased rapidly as the number of aircraft increased, which is consistent with the general belief that all large real-time systems require exponential time. In contrast, in our earlier research, an associative SIMD system was shown to be able to execute these basic tasks in linear time with a graph that had a very small slope. Additionally, the multicore regularly missed a large number of deadlines while the SIMD system did not miss a single deadline. Our goal here was to determine whether we could get SIMD-like results using a CUDA implementation of the same real-time system involving basic ATC tasks on an NVIDIA accelerator. Our research shows that our NVIDIA accelerators can provide a SIMD-like implementation of this real-time system. Moreover, using curve-fitting with MATLAB, the graph of the NVIDIA accelerators' performance increases only slightly faster than a linear graph.
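The collision-detection step can be pictured with a simple closest-approach check over projected straight-line paths. The vectorized NumPy sketch below uses hypothetical positions and velocities and is only illustrative, since the thesis performs the check with associative, Batcher-style operations on the accelerator every cycle.

    # Pairwise collision check sketch: each aircraft is a position plus constant
    # velocity; for every pair, find the minimum separation over the next 20
    # minutes (1200 s) of projected flight.
    import numpy as np

    def min_pairwise_separation(pos, vel, horizon=1200.0):
        """pos, vel: (N, 3) arrays (nm and nm/s). Returns (N, N) closest approach."""
        dp = pos[:, None, :] - pos[None, :, :]          # relative position
        dv = vel[:, None, :] - vel[None, :, :]          # relative velocity
        denom = (dv * dv).sum(-1)
        t_star = np.where(denom > 0, -(dp * dv).sum(-1) / np.maximum(denom, 1e-12), 0.0)
        t_star = np.clip(t_star, 0.0, horizon)          # only look ahead `horizon` s
        closest = dp + dv * t_star[..., None]
        return np.sqrt((closest * closest).sum(-1))

    rng = np.random.default_rng(3)
    pos = rng.uniform(0, 256, (50, 3))                  # 50 aircraft in a 256 nm box
    vel = rng.uniform(-0.1, 0.1, (50, 3))
    sep = min_pairwise_separation(pos, vel)
    i, j = np.triu_indices(50, k=1)
    print("closest projected pair separation (nm):", sep[i, j].min().round(3))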

Committee:

Johnnie Baker, Dr. (Advisor); Gokarna Sharma, Dr. (Committee Member); Ye Zhao, Dr. (Committee Member)

Subjects:

Computer Science

Keywords:

SIMD, CUDA, NVIDIA, C, ATC, AP, Air Traffic Control, NVIDIA-CUDA, Associative Processor, Parallel Computing, Parallel Programming, Real-Time Systems

Chen, Chong
Acceleration of Computer Based Simulation, Image Processing, and Data Analysis Using Computer Clusters with Heterogeneous Accelerators
Doctor of Philosophy (Ph.D.), University of Dayton, 2016, Electrical Engineering
With the limits to frequency scaling in microprocessors due to power constraints, many-core and multi-core architectures have become the norm over the past decade. The goal of this work is the acceleration of key computer simulation tools, data processing, and data analysis algorithms on multi-core and many-core computer clusters, and the analysis of their accelerated performance. The main contributions of this dissertation are:

1. Acceleration of vector bilateral filtering for hyperspectral imaging with GPGPU: a GPGPU-based acceleration of vector bilateral filtering called vBF_GPU was implemented in this dissertation. vBF_GPU uses multiple threads to process one pixel of a hyperspectral image to improve the efficiency of the cache memory. The memory access operations of vBF_GPU were fully optimized to reduce the data transfer cost of the GPGPU program. The experimental results indicate that vBF_GPU can provide up to a 19× speedup compared with a multi-core CPU implementation and up to a 3× speedup compared with a naive GPGPU implementation of vector bilateral filtering. vBF_GPU can process hyperspectral images with up to 266 spectral bands, and the window size of the bilateral filter is unlimited.

2. Optimization and acceleration of the alternating least squares algorithm using a GPGPU cluster: this study presented an optimized implementation of the Alternating Least Squares (ALS) algorithm to realize a large-scale matrix factorization based recommendation system. A GPGPU-optimized implementation was developed to conduct the batch solver in the ALS algorithm, and an equivalent mathematical form of the equations was used to simplify the computational complexity of the ALS algorithm. A distributed version of this implementation was also developed and tested using a cluster of GPGPUs. The experimental results indicate that our application running on a GPGPU can achieve up to a 3.8× speedup compared with an 8-core CPU, and the distributed implementation showed excellent scalability on a computer cluster with multiple GPGPU accelerators.

3. Accelerating a preconditioned iterative solver for very large sparse linear systems on clusters with heterogeneous accelerators: this study presents a parallelized preconditioned conjugate gradient solver for large sparse linear systems on clusters with heterogeneous accelerators. The primary accelerator examined in this study is the Intel Xeon Phi with the Many Integrated Core (MIC) architecture. We also realized a highly optimized parallel solver on clusters with NVIDIA GPGPU accelerators, and on clusters of Intel Xeon CPUs with the Sandy Bridge architecture. Several approaches are applied to reduce the communication cost between different compute nodes in the cluster, and a lightweight balancer was developed for the Xeon Phi based solver. Our results show that the Xeon Phi based iterative solver is faster than the GPGPU based solver for one to two compute nodes when the balancer is applied, particularly when the number of non-zero elements is unevenly distributed. For a larger number of compute nodes, however, the GPGPU cluster performed better. An analysis of the scalability of our iterative solver on distributed memory systems is presented.

The experimental results and analysis indicate that the acceleration completed in this dissertation leads to impressive speedups, and the optimization methods used in this study will benefit future research in the high-performance computing area.
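The core of one ALS iteration described in contribution 2 is a batch of small, independent regularized least-squares solves. The NumPy sketch below shows a single user-side half-step on assumed synthetic ratings; this is the per-user work the dissertation batches onto GPGPUs, so the names and data here are illustrative only.

    # One Alternating Least Squares half-step: holding the item factors fixed,
    # each user's factor vector solves a small regularized normal-equation
    # system built from that user's observed ratings.
    import numpy as np

    def als_user_step(R, mask, item_factors, reg=0.1):
        """R: (users, items) ratings, mask: observed entries, item_factors: (items, k)."""
        k = item_factors.shape[1]
        users = np.empty((R.shape[0], k))
        for u in range(R.shape[0]):
            obs = mask[u]                              # items this user rated
            Y = item_factors[obs]                      # (n_obs, k)
            A = Y.T @ Y + reg * np.eye(k)              # small k x k system
            b = Y.T @ R[u, obs]
            users[u] = np.linalg.solve(A, b)
        return users

    rng = np.random.default_rng(4)
    true_u, true_i = rng.standard_normal((100, 5)), rng.standard_normal((200, 5))
    R = true_u @ true_i.T
    mask = rng.random(R.shape) < 0.3                   # 30% of ratings observed
    U = als_user_step(R, mask, true_i)
    print("fit error on observed entries:", np.abs(U @ true_i.T - R)[mask].mean().round(4))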

Committee:

Tarek Taha (Committee Chair); Keigo Hirakawa (Committee Member); Eric Balster (Committee Member); Muhammad Usman (Committee Member)

Subjects:

Computer Engineering

Keywords:

parallel computing; distributed computing; GPGPU; Xeon Phi; Preconditioned Iterative Solver; ALS; bilateral filtering;

Sammartino, Michael T.
Parallel GPS Signal Acquisition and Tracking Design and Analysis using an FPGA
Master of Science in Engineering, Youngstown State University, 2015, Department of Electrical and Computer Engineering
The purpose of this research is to design and model a digital system to acquire, lock onto, and track a GPS signal in parallel on an FPGA. This project aims to reduce the receiver's time to first fix from a cold start. Applications for this research include high precision targets requiring worldwide coverage, where a stored almanac is not acceptable and a fast time to first fix is needed. It can also be utilized in aviation and safety-of-life applications, where it is imperative that faulty data be immediately detected and excluded. The uniqueness of this project allows ultra-low energy applications with only volatile memory and sleep currents in the µA range. Computer simulations have been performed for: (1) replicating GPS satellite C/A code, (2) determining which satellites are visible, and (3) coding a hardware description of parallel correlation, locking, and tracking. The system is distinctive for applications where there is no need or possibility to store almanac information, yet a fast time to first fix (TTFF) is required. During these trials, an 'out of the box' cold start scenario is performed each time the code is run. The results show that successful GPS signal acquisition and decoding takes 1.7 seconds using hardware coding and an FPGA.
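In software, the parallel code-phase search at the heart of acquisition is often illustrated as an FFT-based circular correlation against all shifts of the spreading code at once. The sketch below does exactly that with a random ±1 stand-in for a real C/A code; the thesis implements the parallel correlators directly in FPGA hardware, and Doppler search is omitted here.

    # Parallel code-phase search sketch: correlate the received samples against
    # every cyclic shift of the spreading code at once via the FFT.
    import numpy as np

    rng = np.random.default_rng(5)
    code = rng.choice([-1.0, 1.0], size=1023)          # stand-in PRN code (1023 chips)
    true_shift = 417
    received = np.roll(code, true_shift) + 0.5 * rng.standard_normal(1023)

    # Circular correlation of `received` with every shift of `code`, all at once.
    corr = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(code))).real
    estimated_shift = int(np.argmax(np.abs(corr)))
    print("true shift:", true_shift, "estimated shift:", estimated_shift)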

Committee:

Frank Li, PhD (Advisor); Philip Munro, PhD (Committee Member); Faramarz Mossayebi, PhD (Committee Member)

Subjects:

Computer Engineering; Electrical Engineering; Engineering; Systems Design

Keywords:

embedded system;FPGA;GPS;parallel computing

Gao, Zhenning
Parallel and Distributed Implementation of A Multilayer Perceptron Neural Network on A Wireless Sensor Network
Master of Science, University of Toledo, Engineering (Computer Science)
This thesis presents a study on implementing the multilayer perceptron neural network on a wireless sensor network in a parallel and distributed way. We take advantage of the topological resemblance between the multilayer perceptron and the wireless sensor network. A single neuron of the multilayer perceptron neural network is implemented on a wireless sensor node, and the connections between neurons are realized by the wireless links between nodes. While the computation of the multilayer perceptron benefits from the massive parallelism and full distribution when the wireless sensor network serves as the hardware platform, it is still unknown whether the delay and drop phenomena for message packets carrying neuron outputs would prevent the multilayer perceptron from achieving decent performance. A simulation-based empirical study is conducted to assess the performance profile of the multilayer perceptron on a number of different problems. The simulation study is performed using a simulator developed in-house for the unique requirements of the study proposed herein. The simulator only simulates the major effects of wireless sensor network operation that influence the running of the multilayer perceptron. A model of delay and drop in the wireless sensor network is proposed for creating the simulator, and the setting of the simulation is well defined. Back-propagation with momentum is employed as the learning algorithm for the neural network. A formula for the number of neurons in the hidden layer is chosen by empirical study. The simulation is done under different network topologies and different delay and drop conditions for the wireless sensor network. Seven data sets, namely Iris, Wine, Ionosphere, Dermatology, Handwritten Numerical, Isolet, and Gisette, with attribute counts up to 5000 and instance counts up to 7797, are employed to profile the performance. The simulation results are compared with those from the literature and with those of the non-distributed multilayer perceptron. Comparative performance evaluation suggests that the performance of the multilayer perceptron using a wireless sensor network as the hardware platform is comparable with other machine learning algorithms and as good as the non-distributed multilayer perceptron. The time and message complexity have been analyzed, and the analysis shows that the scalability of the proposed method is promising.
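The learning rule referred to above, back-propagation with a momentum term, is just a blend of the current gradient with the previous weight update. The toy NumPy sketch below trains a small one-hidden-layer MLP on XOR with that rule; it is centralized for brevity, whereas the thesis distributes one neuron per sensor node.

    # Back-propagation with momentum on a tiny MLP (XOR). Each update is a mix
    # of the fresh gradient and the previous step (the momentum term).
    import numpy as np

    rng = np.random.default_rng(6)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    y = np.array([[0], [1], [1], [0]], float)
    Xb = np.hstack([X, np.ones((4, 1))])               # bias input appended
    W1 = rng.standard_normal((3, 4))                   # 2 inputs + bias -> 4 hidden
    W2 = rng.standard_normal((5, 1))                   # 4 hidden + bias -> 1 output
    V1, V2 = np.zeros_like(W1), np.zeros_like(W2)      # momentum terms
    lr, mom = 0.3, 0.9
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(20000):
        h = sigmoid(Xb @ W1)                           # forward pass
        hb = np.hstack([h, np.ones((4, 1))])
        out = sigmoid(hb @ W2)
        d_out = (out - y) * out * (1 - out)            # backward pass
        d_h = (d_out @ W2[:4].T) * h * (1 - h)
        V2 = mom * V2 - lr * (hb.T @ d_out)            # momentum-blended updates
        V1 = mom * V1 - lr * (Xb.T @ d_h)
        W2 += V2
        W1 += V1

    print("XOR predictions (should approach 0, 1, 1, 0):", out.ravel().round(2))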

Committee:

Gursel Serpen (Advisor); Mohsin Jamali (Committee Member); Ezzatollah Salari (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

Artificial Intelligence; Artificial Neural Network; Machine Learning; Multilayer Perceptron; Wireless Sensor Network; Parallel Computing; Distributed Computing

Koop, Matthew J.
High-Performance Multi-Transport MPI Design for Ultra-Scale InfiniBand Clusters
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering

Over the past decade, rapid advances have taken place in the field of computer and network design, enabling us to connect thousands of computers together to form high performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model for writing applications for these clusters, and a vast array of scientific applications use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling increasingly higher. The scalability and the performance of the MPI library are very important for end application performance.

InfiniBand is a cluster interconnect which is based on open standards and is gaining rapid acceptance. This dissertation explores the different transports provided by InfiniBand to determine the scalability and performance aspects of each. Further, new MPI designs have been proposed and implemented for transports that have never been used for MPI in the past. These designs have significantly decreased resource consumption, increased performance, and increased the reliability of ultra-scale InfiniBand clusters. A framework to simultaneously use multiple transports of InfiniBand and dynamically change transfer protocols has been designed and evaluated. Evaluations show that memory can be reduced from over 1 GB per MPI process to 40 MB per MPI process. In addition, performance using this design has been improved by up to 30% over earlier designs. Investigations into reliability have shown that the MPI library can be designed to withstand many network faults, and that reliability implemented in software can provide higher message rates than hardware-based reliability. Software developed as a part of this dissertation is available in MVAPICH, which is a popular open-source implementation of MPI over InfiniBand and is used by several hundred top computing sites all around the world.

Committee:

Dhabaleswar K. Panda, PhD (Advisor); P. Sadayappan, PhD (Committee Member); Feng Qin, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; InfiniBand; Message Passing; Cluster; Parallel Computing

Schmidt, Samuel
A Massively Parallel Algorithm for Cell Classification Using CUDA
MS, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science
In bioinformatics, cell classification is the act of separating human cells into different groups based on their RNA-seq expression levels. These data can be quite large, as there are about 20,000 known human genes; even relatively small datasets (<1000 cell samples) can contain millions of values. Computations and classifications on this data force a choice: speed or thoroughness? Many scientists today choose speed, meaning they must reduce the dataset through feature selection or some similar approach. Others may use complex computing systems to achieve thoroughness, but the cost of these systems is high. NVIDIA CUDA allows the average researcher to perform cell classification without making this choice. CUDA, running on specialized GPUs, gives desktop computers the power of a large computing system and eliminates the need to reduce the complexity of the dataset.
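As a point of reference for the kind of classifier involved (the keywords name naive Bayes), the sketch below runs a vectorized Gaussian naive Bayes over synthetic expression data. The per-gene independence that makes it a natural fit for CUDA threads is the same structure NumPy broadcasting exploits here; the data and names are illustrative only.

    # Vectorized Gaussian naive Bayes on synthetic "expression" data: all genes
    # (features) and all classes are scored at once via broadcasting.
    import numpy as np

    def fit(X, y):
        classes = np.unique(y)
        mu = np.array([X[y == c].mean(axis=0) for c in classes])
        var = np.array([X[y == c].var(axis=0) + 1e-6 for c in classes])
        prior = np.array([(y == c).mean() for c in classes])
        return classes, mu, var, prior

    def predict(X, classes, mu, var, prior):
        # log P(x | c) summed over independent features, plus log prior
        ll = -0.5 * (np.log(2 * np.pi * var)[None]
                     + (X[:, None, :] - mu[None]) ** 2 / var[None]).sum(axis=2)
        return classes[np.argmax(ll + np.log(prior)[None], axis=1)]

    rng = np.random.default_rng(7)
    n_cells, n_genes = 600, 2000
    y = rng.integers(0, 3, n_cells)                     # three cell types
    centers = rng.standard_normal((3, n_genes))
    X = centers[y] + rng.standard_normal((n_cells, n_genes))
    model = fit(X[:400], y[:400])
    print("held-out accuracy:", (predict(X[400:], *model) == y[400:]).mean())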

Committee:

Fred Annexstein, Ph.D. (Committee Chair); Kenneth Berman, Ph.D. (Committee Member); Anca Ralescu, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Cell Classification;Parallel Computing;CUDA;Machine Learning;Naive Bayes;CUDA Reduce

Drag, Melvyn I.
Epithelium: The lightweight, customizable epithelial tissue simulator
Master of Mathematical Sciences, The Ohio State University, 2015, Mathematics
Epithelial tissue performs many important functions in animals, such as preventing contamination, transporting gases and nutrients, and fluid secretion. Macroscopically, epithelial tissue can be thought of as the layer of an animal that separates it from the exterior world. The geometrical and topological features of epithelial tissue make it amenable to computational modeling. There are several simulation codes in existence which reproduce certain aspects of epithelial tissue morphogenesis, wound healing, and equilibration, but to the best of our knowledge only one of them is freely available to the public. Unfortunately, installation and use of this software requires expertise in a unix-like operating system and advanced knowledge of several programming languages. With this in mind, I have developed Epithelium, a lightweight epithelial tissue simulator which compiles easily on any unix-like system, and which can also be distributed as precompiled binaries. The code has very few dependencies, and these dependencies are likely already satisfied by the default packages installed on a Linux or Mac computer. For users with access to NVIDIA GPUs, Epithelium comes in stable and beta parallel versions. In Epithelium simulations are fairly easy to design and run via several configuration files, the source code is highly modularized, and the algorithms used therein are extensively documented. As such, this code is useful for reproducing previous results, and for quickly designing new computational biology experiments of epithelial tissues.

Committee:

Ed Overman (Advisor); Marcos Sotomayor (Advisor)

Subjects:

Biology; Computer Science; Mathematics

Keywords:

parallel computing; epithelial tissue

Hordemann, Glen J.
Exploring High Performance SQL Databases with Graphics Processing Units
Master of Science (MS), Bowling Green State University, 2013, Computer Science
This thesis introduces the development of a new GPU-based database to accelerate queries over Digital Humanities data, extracting document texts that are then data-mined to produce visualizations of aspects of the humanities data. The goal is to advance the state of the art in massively parallel database work by investigating methods for utilizing graphics processing units in database systems. This thesis advances prior work done in the field of high-performance, massively parallel databases. Some prior work focused on fixed-length data types such as integers and doubles, often coupled with straightforward single-table queries. Other work focused on using primitives that are not a component of standard SQL databases. This thesis introduces an efficient virtual database engine that executes the majority of database operations directly on the GPU. The GPU database executes a subset of SQLite's SELECT queries, which are typically the most computationally expensive operations in a transactional database. This database engine extends existing research by exploring methods for string operations, multiple table joins, indexing, and varying-length data types and sets. This thesis discusses the design of a new GPU virtual database engine, which is developed using the CUDA extensions to C/C++, from loading data from the file on disk to processing the program on the GPU. This thesis focuses on the development of new improvements to deal with caching data for the GPU, processing and coalescing varying-length data, and performing joins between multiple tables in the GPU database. The GPU database is demonstrated in a real-world application. The application wraps the database in a graphical user interface which facilitates data selection. The application performs data mining of Humanities data for common text mining algorithms. The mined data is processed into visualizations to illustrate the resulting data digests.
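To see why SELECT-style scans parallelize so well, consider a toy column store: each predicate becomes a whole-column boolean mask, and the combined mask gathers the matching rows. The NumPy sketch below, with hypothetical columns and a hypothetical query, is only a CPU analogue of the operators the GPU engine runs as CUDA kernels.

    # Toy column-store SELECT: predicates are evaluated as whole-column masks
    # (one element per row, trivially data-parallel) and combined.
    import numpy as np

    rng = np.random.default_rng(8)
    table = {                                           # columnar layout
        "doc_id": np.arange(100_000),
        "year":   rng.integers(1800, 1900, 100_000),
        "length": rng.integers(100, 50_000, 100_000),
    }

    # SELECT doc_id FROM table WHERE year >= 1860 AND length < 5000
    mask = (table["year"] >= 1860) & (table["length"] < 5000)
    result = table["doc_id"][mask]
    print(len(result), "matching documents")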

Committee:

Jong Kwan Lee (Advisor); Joseph Chao (Committee Member); Andrew Schocket (Committee Member)

Subjects:

Computer Science

Keywords:

GPU; Graphics Processing Unit; Database; SQL; SQLite; Visualization; Data Mining; CUDA; GPGPU; Parallel Computing; High Performance Computing; NVIDIA

Panuganti, Rajkiran
A High Productivity Framework for Parallel Data Intensive Computing in MATLAB
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering
Higher-level languages like Matlab are being increasingly adopted. However, Matlab has significant shortcomings when used for large-scale, computationally intensive applications that require very high performance and/or significant amounts of memory. Developing efficient runtime frameworks that aid high-level languages to make them scalable to larger problem sizes is an effective solution to this problem. Our solutions, mexMPI, GAMMA, and LA, enable parallel computing directly in Matlab for high performance while retaining its productivity aspects. mexMPI provides message passing semantics to enable parallel computing within the Matlab environment using high-performance networks. GAMMA presents a distributed shared memory programming model wherein the programmer develops parallel algorithms using a 'Get-Compute-Put' model, and LA is a runtime framework that enables users to develop large-scale applications directly in Matlab. We demonstrate the effectiveness of our frameworks using the NAS benchmarks. The experimental evaluation of these frameworks indicates the effectiveness of our approach.

Committee:

Ponnuswamy Sadayappan, PhD (Advisor); Atanas Rountev, PhD (Committee Member); Dhabaleswar Panda, PhD (Committee Member); Sriram Krishnamoorthy, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

High Performance; High Productivity; Matlab; Parallel Computing; Frameworks

He, Hao
Numerical simulations of unsteady flows in a pulse detonation engine by the conservation element and solution element method
Doctor of Philosophy, The Ohio State University, 2006, Mechanical Engineering
This dissertation is focused on a numerical framework for time-accurate solutions of high-speed unsteady flows with applications to flows of a Pulse Detonation Engine (PDE). The space-time Conservation Element and Solution Element (CESE) method employed is a novel numerical method for time-accurate solutions of nonlinear hyperbolic equations. As part of the present work, a suite of two- and three-dimensional CESE codes has been developed. The computer codes are fully parallelized based on domain decomposition with the Message Passing Interface (MPI) for data communication. The codes have been applied to analyze various flow fields of the PDE concept. First, numerical results of one-, two-, and three-dimensional detonation waves are reported. The chemical reactions were modeled by a one-step, finite-rate, irreversible global reaction. The classical Zeldovich, von Neumann, and Doering (ZND) analytical solution was used to set up the initial conditions as well as for code validation. In the three-dimensional calculations, detonations in square, round, and annular tubes of different sizes were successfully simulated, and salient features of detonation waves were crisply resolved. Second, as a promising detonation initiation means, implosion with shock focusing was investigated. In two-dimensional calculations, we found a double-implosion mechanism in a successful detonation initiation process. Third, the plume dynamics of a PDE fueled by propane/air mixtures were studied to support the prototype development at NASA Glenn Research Center (GRC). Numerical results show that in each PDE cycle the engine is actively producing thrust forces only during about 6% of the cycle period. The rest of the time is occupied by the blow-down and refueling processes. Since the PDE tube is always open, these processes depend on the flow conditions outside the PDE tube. In the near-field plume, complex shock/shock and shock/vortex interactions were found. In the far field, a spherical expansion wave is the dominant flow feature. This dissertation work is a synergy of a very accurate and efficient CFD method, i.e., the CESE method, and modern parallel computing technology. This approach could point to a new direction for high-fidelity simulations of complex flow fields of advanced propulsion systems.

Committee:

Sheng-Tao Yu (Advisor)

Subjects:

Engineering, Mechanical

Keywords:

Detonation; Pulse Detonation Engine; The CESE Method; Computational Fluid Dynamics; Numerical Simulation; Parallel Computing

Steinfadt, Shannon Irene
Smith-Waterman Sequence Alignment For Massively Parallel High-Performance Computing Architectures
PHD, Kent State University, 2010, College of Arts and Sciences / Department of Computer Science

This research addresses one of the most often used tools in bioinformatics, sequence alignment. The increasing growth and complexity of high-performance computing as well as the stellar data growth in the bioinformatics field are motivating factors. An associative algorithm for performing quality sequence alignments more efficiently and faster is at the center of the dissertation. SWAMP or Smith-Waterman using Associative Massive Parallelism is the parallel algorithm designed and implemented for the ASC associative SIMD computing model. The theoretical parallel speedup for the algorithm is optimal, and reduces the compute time from O(mn) to O(m+n), where m and n are the length of the input sequences. When m = n, the running time becomes O(n) with a very small constant.

Using the capabilities of ASC, innovative new algorithms that extend the SWAMP algorithm above increase the information returned by the alignment algorithms without decreasing the accuracy of those alignments. Known as SWAMP+, these algorithms provide a highly sensitive, parallelized approach that extends traditional pairwise sequence alignment. They are useful for in-depth exploration of sequences, including research in expressed sequence tags, regulatory regions, and evolutionary relationships.

Although the SWAMP suite of algorithms was designed for the associative computing platform, the algorithms were implemented on the ClearSpeed CSX 620 parallel SIMD accelerator to obtain realistic metrics. The performance of the compute-intensive matrix calculation displayed a speedup of roughly 96 using ClearSpeed's 96 processing elements, thus verifying the possibility of achieving the theoretical speedup mentioned above. Additional parallel hardware implementations were explored, including a cluster-based approach that tests the memory-intensive Smith-Waterman computation across multiple nodes within a cluster. This work utilized a tool called JumboMem, which allowed us to store the huge matrix of computations completely in memory for what we believe to be one of the largest instances of Smith-Waterman ever run.
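The source of the O(m+n) parallel time noted above is the wavefront structure of the Smith-Waterman recurrence: every cell on an anti-diagonal depends only on the two previous anti-diagonals, so a whole diagonal can be scored at once. The NumPy sketch below uses a linear gap penalty and simple match/mismatch scores, with vectorization standing in for the ASC/ClearSpeed processing elements acting in lockstep.

    # Wavefront (anti-diagonal) Smith-Waterman scoring: one anti-diagonal per
    # sequential step, with every cell on the diagonal updated at once.
    import numpy as np

    def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
        m, n = len(a), len(b)
        H = np.zeros((m + 1, n + 1), dtype=np.int64)
        a_arr = np.frombuffer(a.encode(), dtype=np.uint8)
        b_arr = np.frombuffer(b.encode(), dtype=np.uint8)
        for d in range(2, m + n + 1):                  # one anti-diagonal per step
            i = np.arange(max(1, d - n), min(m, d - 1) + 1)
            j = d - i
            sub = np.where(a_arr[i - 1] == b_arr[j - 1], match, mismatch)
            H[i, j] = np.maximum.reduce([
                np.zeros_like(i),                      # local alignment floor
                H[i - 1, j - 1] + sub,                 # diagonal (match/mismatch)
                H[i - 1, j] + gap,                     # gap in b
                H[i, j - 1] + gap,                     # gap in a
            ])
        return int(H.max())

    print(smith_waterman_score("ACACACTA", "AGCACACA"))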

Committee:

Johnnie W. Baker, PhD (Advisor); Kenneth Batcher, PhD (Committee Member); Paul Farrell, PhD (Committee Member); James Blank, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Science; Molecular Biology

Keywords:

Sequence Alignment; Smith-Waterman; Parallel Computing; Associative Computing; ASC; SIMD; MIMD; ClearSpeed; JumboMem; Smith-Waterman Using Associative Massive Parallelism; SWAMP; SWAMP+

Bazow, Dennis P.
Fluid dynamics for the anisotropically expanding quark-gluon plasma
Doctor of Philosophy, The Ohio State University, 2017, Physics
Local momentum anisotropies are large in the early stages of the quark-gluon plasma created in relativistic heavy-ion collisions, due to the extreme difference in the initial longitudinal and transverse expansion rates. In such situations, fluid dynamics derived from an expansion around an isotropic local equilibrium state is bound to break down. Instead, we resum the effects of the slowest nonhydrodynamic degree of freedom (associated with the deviation from momentum isotropy) and include it at leading order, defining a local anisotropic quasi-equilibrium state, thereby treating the longitudinal/transverse pressure anisotropy nonperturbatively. Perturbative transport equations are then derived to deal with the remaining residual momentum anisotropies. This procedure yields a complete transient effective theory called viscous anisotropic hydrodynamics. We then show that the anisotropic hydrodynamic approach, especially after perturbative inclusion of all residual viscous terms, dramatically outperforms viscous hydrodynamics in several simplified situations for which exact solutions exist but which share with realistic expansion scenarios the problem of large dissipative currents. Simulations of the full three-dimensional dynamics of the anisotropic quark-gluon plasma are then presented.

Committee:

Ulrich Heinz (Advisor); Michael Lisa (Committee Member); Yuri Kovchegov (Committee Member); Junko Shigemitsu (Committee Member)

Subjects:

Physics

Keywords:

Relativistic fluid dynamics; Quark-gluon plasma; Anisotropic dynamics; Viscous hydrodynamics; Boltzmann equation; GPU; CUDA; Parallel computing

Maddipati, Sai Ratna Kiran
Improving the Parallel Performance of Boltzman-Transport Equation for Heat Transfer
Master of Science, The Ohio State University, 2016, Computer Science and Engineering
In a thermodynamically unstable environment, the Boltzmann Transport Equation (BTE) describes the heat transfer rate at each location in the environment, the direction of heat transfer by the particles of the environment, and the final equilibrium temperature conditions of the environment. Solving the BTE is a very computationally intensive task, and there is a need for efficient parallelization. The parallel implementation of this application is explained, along with brief details about several techniques that have been used in past work. The implementation involves several code iterations of the BTE application with distinct changes, in order to compare and analyze the performance and identify the reasons for performance improvement or deterioration. Experimental results are then presented to show the resulting performance of the implementation.

Committee:

P Sadayappan (Advisor); Nasko Rountev (Committee Member)

Subjects:

Computer Science

Keywords:

Parallel Computing, OpenMP, MPI

Dickman, Thomas J.
Event List Organization and Management on the Nodes of a Many-Core Beowulf Cluster
MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Engineering
Parallel Discrete Event Simulation (PDES) has been widely studied for many years. One of the principal mechanisms for the distributed synchronization of concurrent event execution in a PDES is an optimistic method called the Time Warp mechanism. Researchers at the University of Cincinnati have been studying the Time Warp mechanism for many years, and they have developed an implementation of the Time Warp mechanism for execution on Beowulf clusters called WARPED. In recent years, the WARPED kernel has been modified to add integrated thread support for execution on multi-core and many-core Beowulf clusters. While this threaded WARPED kernel provides effective execution support for Beowulf clusters containing small multi-core processors, contention for the shared input event queue (or pending event set) on a node has a negative impact on the overall performance of the simulation once the number of threads in a node grows above 6-8. This thesis explores the design of the software architecture for the pending event set in the WARPED kernel. In particular, we examine a solution where the pending event set on each node is partitioned into multiple input queues (called LTSF queues) with disjoint subsets of the event processing threads bound to each LTSF queue to help reduce contention. While this reduces contention for the shared pending event set to manageable levels, it can potentially lead to an unbalanced advancement of simulation time among the subsets of event processing threads. Thus, this thesis also examines the implementation of methods to dynamically adjust the partitioning of events distributed to the various LTSF queues on a node. Several algorithms and metrics are used to decide when and where to adjust the workload, with improved run times as a result.
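The partitioned pending-event-set arrangement can be sketched with ordinary priority queues: several independently locked LTSF queues, each served by its own group of worker threads. The Python sketch below uses hypothetical class and variable names purely for illustration; it is not the WARPED kernel's actual design.

    # Partitioned pending event set: events spread over several LTSF queues,
    # each with its own lock, with disjoint thread groups bound to each queue.
    import heapq
    import threading

    class LTSFQueue:
        def __init__(self):
            self.heap, self.lock = [], threading.Lock()
        def insert(self, timestamp, event):
            with self.lock:
                heapq.heappush(self.heap, (timestamp, event))
        def pop_lowest(self):
            with self.lock:                     # contention limited to one group
                return heapq.heappop(self.heap) if self.heap else None

    queues = [LTSFQueue() for _ in range(4)]    # 4 LTSF queues per node
    for ts in range(1000):
        queues[ts % 4].insert(ts, f"event-{ts}")   # static partitioning of events

    def worker(queue, processed):
        while (item := queue.pop_lowest()) is not None:
            processed.append(item)              # event execution would go here

    groups = [[] for _ in queues]
    threads = [threading.Thread(target=worker, args=(q, g))
               for q, g in zip(queues, groups)]
    [t.start() for t in threads]
    [t.join() for t in threads]
    print([len(g) for g in groups], "events processed per LTSF queue")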

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Annexstein, Ph.D. (Committee Member); Fred Beyette, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

Time Warp;pending event lists;multi-core;threads;Beowulf clusters;parallel computing

Wang, Hong
Design and Implementation of an FPGA-Based Scalable Pipelined Associative SIMD Processor Array with Specialized Variations for Sequence Comparison and MSIMD Operation
PHD, Kent State University, 2006, College of Arts and Sciences / Department of Computer Science
Over the years a number of variations on associative computing have been explored. At Kent State University (KSU), associative SIMD computing has its roots at Goodyear Aerospace Corporation, but most recently has focused on exploring the power of the associative computing paradigm as compared to traditional SIMD, and even MIMD, computing. In contrast, Prof. Robert Walker’s research group at KSU has focused on implementing those associative concepts on a single chip by developing a new associative SIMD RISC processor, called the ASC (ASsociative Computing) Processor, using modern FPGA implementation techniques. This dissertation describes the development of a working, scalable, ASC Processor that is pipelined to improve performance, supports a reconfigurable network, can be specialized further for dedicated applications (e.g., genome processing), and that can realize the multiple SIMD (MSIMD) paradigm by supporting multiple control units. As a first step in this processor development, a Control Unit and a 4-PE Array developed previously by Master’s students were integrated into the first working ASC Processor. This processor was then modified to develop the first scalable PE ASC Processor, and demonstrated on database processing and image processing. However, the core of this dissertation research was the design and implementation of a pipelined scalable ASC Processor, which as far as we are aware is the first working single-chip pipelined SIMD associative processor and perhaps the first single-chip pipelined SIMD processor in general. Compared to the first scalable ASC Processor, this pipelined processor not only gains the advantage of pipelining, but has a faster clock frequency due to a more efficient implementation. With that pipelined scalable ASC Processor as a base, two major architectural variations were explored. To support an innovative LCS algorithm for genome sequence comparison, the reconfigurable PE interconnection was modified with some features inspired by the coterie network, plus row and column broadcast buses. To better support control parallelism and increase PE utilization, a multiple-instruction-stream MASC Processor was developed, which dynamically assigns tasks to Processing Elements as the program executes.

Committee:

Robert Walker (Advisor)

Subjects:

Computer Science

Keywords:

Parallel Computing; Parallel Processing; FPGA; ASC; MASC; Associative Computing; Associative Processing; Parallel Processor Design

Carraher, Lee A. A Parallel Algorithm for Query Adaptive, Locality Sensitive Hash Search
MS, University of Cincinnati, 2012, Engineering and Applied Science: Computer Science
Nearest neighbor search is a fundamental requirement of many machine learning algorithms and is essential to fuzzy information retrieval. Efficient database search and construction has broad utility across a variety of computing fields, with applications ranging from coding theory and compression for electronic communication systems to pattern and object recognition in artificial intelligence. In this thesis, a particular variant of nearest neighbor search is considered, referred to as c-approximate k-nearest neighbors search. This variation relaxes the constraints of exact nearest neighbor search by introducing a probability c of finding the correct nearest neighbor, which offers considerable advantages in the computational complexity of the search algorithm and in the database overhead requirements. It also extends the original nearest neighbors algorithm by returning a set of k candidate nearest neighbors, from which expert or exact distance calculations can be considered. This thesis further extends the implementation of c-approximate k-nearest neighbors search so that it can exploit the burgeoning field of GPGPU computing. The specific form of c-approximate k-nearest neighbors search implemented is based on the locality sensitive hash search from the E2LSH package of Indyk and Andoni [1]. In that work, the authors utilize the exceptional properties of the Leech Lattice [2] as a subspace quantizer for the locality sensitive hash families. The Leech Lattice is unique in that it provides the closest lattice packing of equal-sized spheres in 24-dimensional space. In addition, advances from coding theory provide a very favorable decoding algorithm for finding the nearest lattice center to a query point in 24-dimensional Euclidean space [3] [4]. The multilevel construction of the Leech Lattice provides an excellent opportunity for parallelization, as it consists of the minimization over many independent sub-lattice decodings made possible by the lattice's exceptional symmetry. These decodings are also highly floating-point intensive, which suggests a favorable implementation on GPGPU architectures such as NVIDIA's CUDA framework. Furthermore, the overall construction of a locality sensitive hash based nearest neighbors search algorithm can be parallelized fairly efficiently, as the hash decodings are completely independent of one another. The goal of this thesis is to present a CUDA-optimized parallel implementation of a bounded-distance Leech Lattice decoder [4] for use in query-optimized c-approximate k-nearest neighbors search using the locality sensitive hash framework of E2LSH. The system is applied to the approximate image retrieval of SIFT-transformed [5] image vectors. [1] A. Andoni, "Nearest Neighbor Search: the Old, the New, and the Impossible." [2] J. Leech, "Notes on sphere packings." [3] G. D. Forney, Jr., "A bounded-distance decoding algorithm for the Leech lattice, with generalizations." [4] O. Amrani and Y. Beery, "Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code." [5] D. G. Lowe, "Object recognition from local scale-invariant features."
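
A minimal C++ sketch of the generic LSH pipeline this work builds on is given below, assuming random-hyperplane hash functions in place of the Leech Lattice decoder the thesis actually uses: points are hashed into several tables, a query gathers candidates from the matching buckets, and exact distances rank the k best.

// Generic LSH index sketch (random hyperplane hashes as a stand-in for a
// lattice quantizer): hash into several tables, collect candidates from the
// matching buckets, rank candidates by exact distance.
#include <algorithm>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

struct HashTable {
    std::vector<Vec> planes;                        // random hyperplanes
    std::unordered_map<uint64_t, std::vector<int>> buckets;

    uint64_t code(const Vec& v) const {             // one sign bit per plane
        uint64_t c = 0;
        for (size_t i = 0; i < planes.size(); ++i) {
            float dot = 0;
            for (size_t d = 0; d < v.size(); ++d) dot += planes[i][d] * v[d];
            c = (c << 1) | (dot >= 0 ? 1u : 0u);
        }
        return c;
    }
};

struct LSHIndex {
    std::vector<HashTable> tables;
    std::vector<Vec> data;

    LSHIndex(size_t num_tables, size_t bits, size_t dim, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.f, 1.f);
        tables.resize(num_tables);
        for (auto& t : tables)
            for (size_t b = 0; b < bits; ++b) {     // bits <= 64 assumed
                Vec p(dim);
                for (auto& x : p) x = gauss(rng);
                t.planes.push_back(std::move(p));
            }
    }
    void insert(const Vec& v) {
        int id = static_cast<int>(data.size());
        data.push_back(v);
        for (auto& t : tables) t.buckets[t.code(v)].push_back(id);
    }
    std::vector<int> query(const Vec& q, size_t k) const {
        std::vector<int> cand;
        for (const auto& t : tables) {              // union of matching buckets
            auto it = t.buckets.find(t.code(q));
            if (it != t.buckets.end())
                cand.insert(cand.end(), it->second.begin(), it->second.end());
        }
        std::sort(cand.begin(), cand.end());
        cand.erase(std::unique(cand.begin(), cand.end()), cand.end());
        auto dist2 = [&](int id) {                  // exact re-ranking
            float s = 0;
            for (size_t d = 0; d < q.size(); ++d) {
                float diff = data[id][d] - q[d];
                s += diff * diff;
            }
            return s;
        };
        std::sort(cand.begin(), cand.end(),
                  [&](int a, int b) { return dist2(a) < dist2(b); });
        if (cand.size() > k) cand.resize(k);
        return cand;
    }
};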

Committee:

Fred Annexstein, PhD (Committee Chair); Kenneth Berman, PhD (Committee Member); Yizong Cheng, PhD (Committee Member); Anca Ralescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

Locality Sensitive Hashing;Approximate Nearest Neighbors;CUDA GPU;Image Search;Distance Adaptive LSH;Parallel Computing;

Zhou, Jun. Parallel Go on CUDA with Monte Carlo Tree Search
MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Science
Traditional Go AI uses minimax tree search with pruning optimizations and a board evaluation function to dictate moves in a sequential fashion. However, it is widely accepted that professional human players are far better than minimax tree search based Go AI, due to the lack of a decent Go evaluation function and the game's astronomical search space. Recent development of Monte Carlo Tree Search (MCTS) based Go AI has seen a big surge in playing strength. With the emergence of CUDA, Nvidia's massively parallel GPU platform, it appears the MCTS process can be parallelized at the simulation stage by the GPU to enhance its performance. This thesis takes on the challenge of building a Monte Carlo Tree Search algorithm to play Go on the many-core CUDA platform.
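
The parallelism exploited at the simulation stage can be sketched as follows. This is a CPU-thread illustration of the idea (the thesis maps it to CUDA threads), and the coin-flip playout is a hypothetical placeholder for a real random Go playout: from one selected leaf, many independent playouts run concurrently and their outcomes are aggregated into the node statistics.

// Leaf-level parallel playouts for the MCTS simulation stage (sketch).
#include <atomic>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

struct Node {
    std::atomic<int> visits{0};
    std::atomic<int> wins{0};
};

// Placeholder playout: a real implementation would play random legal Go
// moves from the leaf position to the end of the game and score the result.
static int simulate_random_game(unsigned seed) {
    std::mt19937 rng(seed);
    return std::bernoulli_distribution(0.5)(rng) ? 1 : 0;
}

// Run num_playouts simulations from one leaf, split across num_workers threads.
void parallel_simulation_stage(Node& leaf, int num_playouts, int num_workers) {
    std::vector<std::thread> workers;
    std::atomic<int> next{0};
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            int i;
            while ((i = next.fetch_add(1)) < num_playouts) {
                int win = simulate_random_game(static_cast<unsigned>(i) * 2654435761u);
                leaf.visits.fetch_add(1);           // aggregate into node stats
                leaf.wins.fetch_add(win);
            }
        });
    }
    for (auto& t : workers) t.join();
}

int main() {
    Node leaf;
    parallel_simulation_stage(leaf, 10000, 4);
    std::printf("win rate: %.3f\n",
                leaf.wins.load() / static_cast<double>(leaf.visits.load()));
}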

Committee:

Kenneth Berman, Ph.D. (Committee Chair); Fred Annexstein, Ph.D. (Committee Member); Raj Bhatnagar, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Monte Carlo Tree Search;CUDA;Go;Biased Evaluation Function;Artificial Intelligence;Parallel Computing;

Gideon, John. The Integration of LlamaOS for Fine-Grained Parallel Simulation
MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Engineering
LlamaOS is a custom operating system that provides much of the basic functionality needed for low latency applications. It is designed to run in a Xen-based virtual machine on a Beowulf cluster of multi/many-core processors. The software architecture of llamaOS is decomposed into two main components, namely the llamaNET driver and llamaApps. The llamaNET driver contains Ethernet drivers and manages all node-to-node communications between user application programs that are contained within a llamaApp instance. Typically, each node of the Beowulf cluster will run one instance of the llamaNET driver with one or more llamaApps bound to parallel applications. These capabilities provide a solid foundation for the deployment of MPI applications as evidenced by our initial benchmarks and case studies. However, a message passing standard still needed to be either ported or implemented in llamaOS. To minimize latency, llamaMPI was developed as a new implementation of the Message Passing Interface (MPI), which is compliant with the core MPI functionality. This provides a standardized and easy way to develop for this new system. Performance assessment of llamaMPI was achieved using both standard parallel computing benchmarks and a locally (but independently) developed program that executes parallel discrete event-driven simulations. In particular, the NAS Parallel Benchmarks are used to show the performance characteristics of llamaMPI. In the experiments, most of the NAS Parallel Benchmarks ran faster than, or on par with, their native performance. The benefit of llamaMPI was also shown with the fine-grained parallel application WARPED. The order-of-magnitude lower communication latency in llamaMPI greatly reduced the amount of time that the simulation spent in rollbacks. This resulted in an overall faster and more efficient computation, because less time was spent off the critical path due to causality errors.
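
The kind of point-to-point latency comparison described here is typically made with a ping-pong microbenchmark such as the sketch below. It uses only core MPI calls (the subset an implementation like llamaMPI would need to support) and is offered as a generic example, not the benchmark actually used in the thesis.

// Ping-pong latency microbenchmark using core MPI calls.
// Run with two ranks, e.g. `mpirun -np 2 ./pingpong`.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    const int bytes = 64;                    // small messages stress latency
    std::vector<char> buf(bytes, 0);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)                           // average one-way latency estimate
        std::printf("avg one-way latency: %.3f us\n",
                    elapsed / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}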

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

Parallel Computing;Time Warp Simulation;MPI;Operating Systems;Beowulf Cluster;Parallel Discrete Event Simulation
