Search Results (1 - 4 of 4 Results)


Wang, Kaibo. Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Massively data-parallel processors, Graphics Processing Units (GPUs) in particular, have recently entered the mainstream of general-purpose computing as powerful hardware accelerators for a wide range of applications, including databases, medical informatics, and big data analytics. However, despite their performance benefits and cost effectiveness, the utilization of GPUs in production systems remains limited. A major reason is the slow development of a supportive GPU software ecosystem. More specifically, (1) CPU-optimized algorithms for some critical computational problems have irregular memory access patterns with intensive control flow, and cannot easily be ported to GPUs in a way that takes full advantage of their fine-grained, massively data-parallel architecture; and (2) commodity computing environments are inherently concurrent and require coordinated resource sharing to maximize throughput, while existing systems are still mainly designed for dedicated use of GPU resources. In this Ph.D. dissertation, we develop efficient software solutions to support the adoption of massively data-parallel processors in general-purpose commodity computing systems. Our research focuses on the following areas. First, to make a strong case for GPUs as indispensable accelerators, we apply GPUs to significantly improve the performance of spatial data cross-comparison in digital pathology analysis. Instead of trying to port existing CPU-based algorithms to GPUs, we design a new algorithm and fully optimize it for the GPU's hardware architecture. Second, we propose operating system support for automatic device memory management to improve the usability and performance of GPUs in shared general-purpose computing environments. Several optimization techniques are employed to ensure efficient use of GPU device memory and to achieve high throughput.
Finally, we develop resource management facilities in GPU database systems to support concurrent analytical query processing. By allowing multiple queries to execute simultaneously, the resource utilization of GPUs can be greatly improved. This also enables GPU databases to be used in important application areas where multiple user queries need to make continuous progress simultaneously.
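The dissertation's OS-level device memory manager is not reproduced here; as a loose, host-side illustration of the general idea it describes (keeping shared GPU device memory efficiently used by swapping data regions in and out), the sketch below models an LRU eviction policy. The class name, the policy choice, and all sizes are illustrative assumptions, not the system described above.

```python
from collections import OrderedDict

class DeviceMemoryManager:
    """Toy host-side model of automatic GPU device-memory management:
    data regions are evicted to host memory in LRU order when the
    device fills up. Purely illustrative; not the dissertation's design."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()  # region name -> size, in LRU order
        self.evicted = []              # regions swapped back to the host

    def access(self, name, size):
        if name in self.resident:          # hit: refresh LRU position
            self.resident.move_to_end(name)
            return "hit"
        while self.used + size > self.capacity:   # evict LRU victims
            victim, vsize = self.resident.popitem(last=False)
            self.used -= vsize
            self.evicted.append(victim)
        self.resident[name] = size         # miss: stage region on device
        self.used += size
        return "miss"
```

For example, with a 100-byte device, staging two 60-byte regions forces the first (least recently used) region out before the second fits.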

Committee:

Xiaodong Zhang (Advisor); P. Sadayappan (Committee Member); Christopher Stewart (Committee Member); Harald Vaessin (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

GPUs, Memory Management, Operating Systems, GPU Databases, Resource Management, Digital Pathology

Baskaran, Muthu Manikandan. Compile-time and Run-time Optimizations for Enhancing Locality and Parallelism on Multi-core and Many-core Systems
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering

Current trends in computer architecture exemplify the emergence of multiple processor cores on a chip. Modern multiple-core architectures, including general-purpose multi-core processors (from Intel, AMD, IBM, and Sun) and specialized parallel architectures such as the Cell Broadband Engine and Graphics Processing Units (GPUs), offer very high computational power per chip. A significant challenge in these systems is the effective, load-balanced utilization of the processor cores. The memory subsystem has always been a performance bottleneck in computer systems, and it is even more so with the emergence of processor subsystems containing multiple on-chip cores. Effectively managing on-chip and off-chip memories and enhancing data reuse to maximize memory performance is another significant challenge in modern multiple-core architectures.

Our work addresses these challenges in multi-core and many-core systems through various compile-time and run-time optimization techniques. We provide effective automatic compiler support for managing on-chip and off-chip memory accesses, with the compiler making effective decisions on what elements to move into and out of on-chip memory, when and how to move them, and how to efficiently access the elements brought into on-chip memory. We develop an effective tiling approach for mapping computation in regular programs onto many-core systems such as GPUs. We develop an automatic approach for compiler-assisted dynamic scheduling of computation to enhance load balancing for parallel tiled execution on multi-core systems.

Various issues specific to the target architecture need attention to maximize application performance. First, the levels of parallelism available and the appropriate granularity of parallelism for the target architecture have to be considered while mapping the computation. Second, the memory access model may be inherent to the architecture, and optimizations have to be developed for that specific model. We develop compile-time transformation approaches to address GPU-specific performance factors related to parallelism and data locality, and develop an end-to-end compiler framework for GPUs.
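As a generic illustration of the tiling transformation described above (not the dissertation's actual compiler output), the sketch below shows a cache-blocked matrix multiply: the loop nest is restructured into tile-by-tile blocks so each block of the operands is reused while it stays in fast on-chip memory (GPU shared memory or a CPU cache). The function name and tile size are illustrative assumptions.

```python
def tiled_matmul(A, B, tile=32):
    """Cache-blocked (tiled) dense matrix multiply, C = A * B.
    The three outer loops walk over tiles; the three inner loops
    compute within a tile, so each block of A and B is reused
    while resident in fast memory. Pure-Python sketch."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):            # tile row of C
        for jj in range(0, n, tile):        # tile column of C
            for kk in range(0, n, tile):    # one block of A and B
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The result is identical to an untiled triple loop; only the traversal order (and hence the locality) changes, which is why compilers can apply the transformation automatically.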

Committee:

Sadayappan Ponnuswamy, Dr. (Advisor); Dhabaleswar Panda, Dr. (Committee Member); Atanas Rountev, Dr. (Committee Member); Jaganathan Ramanujam, Dr. (Committee Member)

Subjects:

Computer Science

Keywords:

Compilers; Multi-cores; GPUs

Ren, Bin. Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
SIMD accelerators and many-core coprocessors offering both coarse-grained and fine-grained parallelism have become increasingly popular. Streaming SIMD Extensions (SSE), Graphics Processing Units (GPUs), and the Intel Xeon Phi (MIC) can provide orders-of-magnitude better performance and efficiency for parallel workloads than single-core CPUs. However, parallelizing irregular applications involving dynamic data structures and irregular memory access on these platforms is not straightforward, due to their intensive control-flow dependence and lack of memory locality. Our efforts focus on three classes of irregular applications: irregular tree and graph traversals, irregular reductions, and dynamically allocated arrays and lists, and we explore mechanisms for parallelizing them on various architectures from both fine-grained and coarse-grained perspectives. We first focus on the traversal of irregular trees and graphs, more specifically a class of applications involving the traversal of many pointer-intensive data structures, e.g., random forests and regular expressions, on various fine-grained SIMD architectures such as SSE and GPUs. We address this problem by developing an intermediate language for specifying such traversals, followed by a run-time scheduler that maps traversals to SIMD units. A key idea in our run-time scheme is converting branches into arithmetic operations, which then allows us to use SIMD hardware. However, different SIMD architectures have different features, so a significant challenge to our previous work is automatically optimizing applications for various architectures, i.e., achieving performance portability. Moreover, the memory hierarchy is one of the first architectural features programmers look to when optimizing their applications.
Thus, we design a portable optimization engine for accelerating irregular data-traversal applications on various SIMD architectures, emphasizing improved data locality and hidden memory latency. We next explore the possibility of efficiently parallelizing two irregular reduction applications on the Intel Xeon Phi, an emerging many-core coprocessor architecture with long SIMD vectors, via data layout optimization. During this process, we also identify a general data management problem in the CPU-coprocessor programming model: automating and optimizing the transfer of dynamically allocated data structures between the CPU and coprocessors. For dynamic multi-dimensional arrays, we design a set of compile-time solutions involving heap layout transformation, while for other irregular data structures such as linked lists, we improve the existing shared-memory runtime solution to reduce transfer costs. Dynamically allocated data structures such as lists are also commonly used in high-level programming languages such as Python, where they support the dynamic, flexible features that increase programming productivity. To parallelize applications written in such languages on both coarse-grained and fine-grained parallel platforms, we design a compilation system that linearizes dynamic data structures into arrays and invokes low-level multi-core and many-core libraries. A critical issue with our linearization method is that it incurs extra data-structure transformation overhead, especially for irregular data structures that are not reused frequently. To address this challenge, we design a set of transformation optimization algorithms, including an inter-procedural Partial Redundancy Elimination (PRE) algorithm, to minimize the data transformation overhead automatically.
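The branch-to-arithmetic conversion mentioned above can be illustrated in miniature. In a tree traversal, each step chooses a left or right child with a data-dependent branch; rewriting the branch as an arithmetic select lets every SIMD lane execute the identical instruction sequence. The sketch below is a generic illustration of this predication idea, not the dissertation's intermediate language; all names are hypothetical.

```python
def child_branchy(left, right, value, threshold):
    # Scalar control flow: a data-dependent branch per element,
    # which forces divergence across SIMD lanes.
    if value > threshold:
        return right
    return left

def child_predicated(left, right, value, threshold):
    # The same decision as straight-line arithmetic: cond is 0 or 1,
    # so the child index is computed by a select with no branch.
    cond = int(value > threshold)
    return cond * right + (1 - cond) * left
```

Both functions pick the same child for every input; only the predicated form maps directly onto SIMD compare-and-select instructions.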

Committee:

Gagan Agrawal (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Science

Keywords:

Irregular Data Structure; Fine Grained Parallelism; SIMD; MIMD; SSE; GPUs; Xeon Phi; Static Analysis; Runtime Analysis; Offloading; Python; Redundancy Elimination

Hartley, Timothy D. R. Accelerating Component-Based Dataflow Middleware with Adaptivity and Heterogeneity
Doctor of Philosophy, The Ohio State University, 2011, Electrical and Computer Engineering
This dissertation presents research into the development of high-performance dataflow middleware and applications on heterogeneous, distributed-memory supercomputers. We present state-of-the-art, coarse-grained ad-hoc techniques for optimizing the performance of real-world, data-intensive applications in biomedical image analysis and radar signal analysis on clusters of computational nodes equipped with multi-core microprocessors and accelerator processors, such as the Cell Broadband Engine and graphics processing units. Studying the performance of these applications gives valuable insights into the relevant parameters to tune for achieving efficiency, because, as large-scale, data-intensive scientific applications, they are representative of what researchers in these fields will need to conduct innovative science. Our approaches show that multi-core processors and accelerators can be used cooperatively to achieve application performance many orders of magnitude above naive reference implementations. Additionally, we present a fine-grained programming framework and runtime system for developing dataflow applications on accelerator processors such as the Cell, along with an experimental study showing that our framework achieves the peak performance of such architectures at a fraction of the cognitive cost to developers. We then present an adaptive technique for automating the coarse-grained ad-hoc optimizations we developed for tuning the decomposition of application data and tasks for parallel execution on distributed, heterogeneous processors. We show that our technique achieves high performance while significantly reducing the burden placed on the developer to manually tune the relevant parameters of distributed dataflow applications.
We evaluate the performance of our technique on three real-world applications, and show that it performs favorably compared to three state-of-the-art distributed programming frameworks. By bringing our adaptive dataflow middleware to bear on supporting alternative programming paradigms, we show our technique is flexible and has wide applicability.

Committee:

Umit Catalyurek, PhD (Advisor); Fusun Ozguner, PhD (Committee Member); Charles Klein, PhD (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

High Performance Computing; GPUs; Heterogeneous Computing; Run-time Systems; Middleware