Search Results

(Total results 4)

  • 1. Hong, Changwan. Code Optimization on GPUs

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Graphics Processing Units (GPUs) have become popular in the last decade due to their high memory bandwidth and powerful computing capacity. Nevertheless, achieving high performance on GPUs is not trivial: it generally requires significant programming expertise and an understanding of the details of low-level execution mechanisms in GPUs. This dissertation introduces approaches for optimizing both regular and irregular applications. For regular applications, it introduces a novel approach to GPU kernel optimization that identifies and alleviates bottleneck resources. Performance modeling for GPUs is carried out by abstract kernel emulation along with latency/gap modeling of resources, and sensitivity analysis with respect to the resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: i) enhancing the OpenTuner auto-tuner with the new bottleneck-driven optimization strategy, with effectiveness demonstrated by experimental results on all kernels from the Rodinia suite and on GPU tensor contraction kernels from the NWChem computational chemistry suite; and ii) manual code optimization, where two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code from state-of-the-art DSL code generators. This approach, however, is not effective for irregular applications such as graph algorithms and sparse matrix primitives, because of their data-dependent branches and memory accesses, so tailored approaches are developed for these two popular domains of irregular applications. Graph algorithms are used in a wide variety of applications, and high-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance.
    This dissertation develops an approach to graph processing on GPUs (open full item for complete abstract)

    Committee: Ponnuswamy Sadayappan (Advisor); Atanas Rountev (Committee Member); Radu Teodorescu (Committee Member) Subjects: Computer Science
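The bottleneck-identification idea this abstract describes (a latency/gap resource model plus sensitivity analysis) can be sketched in miniature as follows. The resource names, work counts, gap values, and perturbation size below are illustrative assumptions, not values from the thesis:

```python
def predicted_time(work, gaps):
    """Toy latency/gap model: each resource r must serve work[r] units,
    each unit costing gaps[r] cycles; the slowest resource dominates."""
    return max(work[r] * gaps[r] for r in work)

def bottleneck(work, gaps, delta=0.10):
    """Sensitivity analysis: scale each resource's gap by (1 + delta) in turn
    and report the resource whose perturbation raises predicted time the most."""
    base = predicted_time(work, gaps)
    sensitivity = {}
    for r in gaps:
        perturbed = dict(gaps)
        perturbed[r] *= 1 + delta
        sensitivity[r] = predicted_time(work, perturbed) - base
    return max(sensitivity, key=sensitivity.get)

# Hypothetical per-kernel resource demands (units) and per-unit gaps (cycles).
work = {"dram": 4000, "shared_mem": 2500, "alu": 6000}
gaps = {"dram": 2.0, "shared_mem": 1.0, "alu": 0.5}
print(bottleneck(work, gaps))  # dram: 4000 * 2.0 dominates every other product
```

Only perturbing the dominant resource's gap moves the predicted time, so the resource with the largest sensitivity is reported as the bottleneck; an auto-tuner can then favor transformations that relieve that resource.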
  • 2. Niu, Qingpeng. Characterization and Enhancement of Data Locality and Load Balancing for Irregular Applications

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    The rate of improvement in data access costs continues to lag behind the improvement in computational rates, so characterization and enhancement of data locality in applications is extremely important. Load balancing also plays a significant role in parallel application performance, and both are particularly challenging for irregular and unstructured applications. In this dissertation, we address the efficient parallel characterization of the data locality characteristics of programs, as well as the development of parallel applications with enhanced data locality and load balancing. First, we speed up reuse distance analysis through parallelization. Reuse distance directly predicts the cache hit ratio for a fully associative cache and is used in program optimization techniques such as loop tiling, code reordering, cache sharing, and cache partitioning to improve locality. Although reuse distance analysis is very useful, it is also costly. We develop the first parallel reuse distance analysis algorithm (Parda). Parda achieves speedups of 13x to 50x on various SPEC CPU2006 benchmarks compared to the state-of-the-art sequential accurate reuse distance analysis algorithm. Second, we use reuse distance analysis to construct a locality-based performance model to analyze and enhance the performance of two production scientific applications, QMCPACK and QWalk. These quantum Monte Carlo (QMC) applications use a very large read-only table to store spline interpolation coefficients, and accesses to the table are generated at random based on the state of the Monte Carlo simulation. QMC applications such as QWalk and QMCPACK currently replicate this table on every process or node, which limits scalability because increasing the number of processors does not enable larger systems to be run.
    We present a partitioned global address space (PGAS) approach to transparently managing this data using Global Arrays in a manner that allows (open full item for complete abstract)

    Committee: P. Sadayappan (Advisor); Srinivasan Parthasarathy (Committee Member); Atanas Rountev (Committee Member) Subjects: Computer Science
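The quantity this abstract is built on, reuse distance, can be illustrated with a minimal sequential sketch: for each access, count the distinct addresses touched since the previous access to the same address. Parda parallelizes an exact (typically tree-based) version of this analysis; the naive quadratic version below only shows what is being computed, not how Parda computes it:

```python
def reuse_distances(trace):
    """For each access in the trace, return the number of distinct addresses
    touched since the previous access to the same address (inf on first use)."""
    last_seen = {}          # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses strictly between the two accesses to addr
            window = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances

trace = ["a", "b", "c", "a", "b"]
print(reuse_distances(trace))  # [inf, inf, inf, 2, 2]
```

The connection to caching is direct: under LRU with a fully associative cache of capacity C, an access hits exactly when its reuse distance is less than C, which is why the histogram of these distances predicts the hit ratio.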
  • 3. Singh, Kunal. High-Performance Sparse Matrix-Multi Vector Multiplication on Multi-Core Architecture

    Master of Science, The Ohio State University, 2018, Computer Science and Engineering

    Sparse matrix-multi-vector multiplication (SpMM) is a widely used primitive in many domains, including fluid dynamics, data analytics, economic modeling, and machine learning. In the machine learning and artificial neural network domains, SpMM is invoked iteratively and is the main bottleneck in many kernels. Due to its importance, many machine learning frameworks, such as TensorFlow and PyTorch, offer SpMM as a primitive. Compared to sparse matrix-vector multiplication (SpMV), SpMM has a higher theoretical operational intensity; however, the fraction of roofline performance achieved by SpMM is lower than that of SpMV, suggesting room for improvement. In this thesis, we systematically explore the design choices for an SpMM primitive and develop a high-performance SpMM algorithm targeted at multi-core and many-core architectures. We also develop an analytical model to guide tile size selection. As shown in our experimental section, we achieve up to 3.4x speedup compared to the Intel MKL library.

    Committee: Ponnuswamy Sadayappan (Advisor); Atanas Rountev (Committee Member) Subjects: Computer Science
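The baseline computation this abstract optimizes can be sketched as a CSR-based SpMM, Y = A @ X with A sparse and X a dense multi-vector. This is only a reference formulation to make the primitive concrete; the thesis's actual tiled, vectorized algorithm and its tile-size model are not reproduced here:

```python
def spmm_csr(n_rows, indptr, indices, data, X):
    """Multiply a CSR matrix A (n_rows x n_cols) by a dense X (n_cols x k).
    Each nonzero A[i, j] contributes data[p] * X[j, :] to row i of Y, so all
    k vectors reuse the same sparse structure; this reuse across the k
    columns is what gives SpMM a higher operational intensity than SpMV."""
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            j, a = indices[p], data[p]
            for c in range(k):        # tiling over c improves cache reuse of X
                Y[i][c] += a * X[j][c]
    return Y

# 2x2 matrix [[1, 2], [0, 3]] in CSR form, times X = [[1, 1], [1, 2]]
Y = spmm_csr(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [[1.0, 1.0], [1.0, 2.0]])
print(Y)  # [[3.0, 5.0], [3.0, 6.0]]
```

Tile size selection matters because the innermost loop streams k-wide slices of X: tiles sized to keep the active slices of X and Y resident in cache are what an analytical model like the one described would choose.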
  • 4. Murugandi, Iyyappa. A New Representation of Structured Grids for Matrix-vector Operation and Optimization of Doitgen Kernel

    Master of Science, The Ohio State University, 2010, Computer Science and Engineering

    In an era of technological advances in the scientific community, the demand for massive amounts of computation is driven by the need for higher levels of accuracy. Coupled with the advent of multi- and many-core architectures and the advances in architecture over the past decade, there is great demand and opportunity for optimizing many scientific applications. Many conventional algorithms need to be rethought and modified to take advantage of these architectural advances. In the first part of the thesis, we focus on developing a new representation of the structured grids that arise from finite difference and finite volume methods. We propose a new data structure and an algorithm for structured grids that take advantage of SIMD architectures and are space efficient. We compare our performance with the existing standard algorithms in general use, and find that our algorithm runs nearly 3.5x faster for single precision and 1.7x faster for double precision on a single core than the existing algorithm in the PETSc system. In the second part of our work, we focus on a kernel from the Multiresolution ADaptive NumErical Scientific Simulation (MADNESS) quantum chemistry application, a framework for scientific simulation in many dimensions using adaptive multi-resolution methods in multi-wavelet bases. We optimize this kernel with the annotation-based tuning tool Orio through various loop transformations and SIMD vectorization. We compare our results with the Intel Math Kernel Library's Basic Linear Algebra Subprograms (BLAS), and observe that our tuned kernel can perform 1.4x faster than BLAS in both sequential and parallel versions for odd-ranked matrix sizes.

    Committee: Dr. Sadayappan Ponnuswamy (Committee Chair); Dr. Atanas Rountev (Committee Co-Chair) Subjects: Computer Science
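The general idea behind SIMD-friendly structured-grid representations can be illustrated with a diagonal-style layout: because a finite-difference matrix on a structured grid has a fixed set of nonzero diagonals, storing each diagonal contiguously turns the matrix-vector product into unit-stride loops that compilers can vectorize. The 1D tridiagonal sketch below illustrates that general principle only; it is an assumption for illustration, not the data structure actually proposed in the thesis:

```python
def stencil_matvec(lower, diag, upper, x):
    """y = A x for a tridiagonal A stored by diagonals (DIA-like layout).
    lower and upper have length n - 1; every loop is unit-stride over a
    contiguous array, which is what makes SIMD vectorization easy."""
    n = len(x)
    y = [diag[i] * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] += upper[i] * x[i + 1]
        y[i + 1] += lower[i] * x[i]
    return y

# 1D Laplacian stencil [-1, 2, -1] on a 4-point grid:
y = stencil_matvec([-1.0] * 3, [2.0] * 4, [-1.0] * 3, [1.0, 2.0, 3.0, 4.0])
print(y)  # [0.0, 0.0, 0.0, 5.0]
```

A general sparse format like CSR would store a column index per nonzero and gather through it; the diagonal layout needs no per-entry indices at all, which is also why it is more space efficient for stencil matrices.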