Parallel computing architectures have become ubiquitous. In particular, Graphics Processing Units (GPUs), the Intel Xeon Phi co-processor (Many Integrated Core architecture), and heterogeneous coupled and decoupled CPU-GPU architectures have emerged as significant players in high performance computing due to their performance, cost, and energy efficiency. Given this trend, an important research issue is how to efficiently utilize this variety of architectures to accelerate different kinds of applications.
Our overall goal is to provide parallel strategies and runtime support for modern GPUs, the Intel Xeon Phi, and heterogeneous architectures to accelerate applications with different communication patterns. The challenges arise from two aspects: the application and the architecture. On the application side, not all kinds of applications can be easily ported to GPUs or the Intel Xeon Phi with high performance. On the architecture side, the SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data) execution models employed by GPUs and the Intel Xeon Phi, respectively, do not favor applications with data and control dependencies. Moreover, heterogeneous CPU-GPU architectures, e.g., integrated CPU-GPU designs, bring new models and challenges for data sharing. Our efforts focus on four kinds of application patterns, namely generalized reduction, irregular reduction, stencil computation, and recursive applications, and explore mechanisms for parallelizing them on various parallel architectures, including GPUs, the Intel Xeon Phi, and heterogeneous CPU-GPU architectures.
We first study parallel strategies for generalized reductions on modern GPUs. The traditional mechanism for parallelizing generalized reductions on GPUs is the full replication method, which assigns each thread an independent copy of the reduction object to avoid data races and maximize parallelism. However, it introduces significant memory and result-combination overhead, and does not scale to a large number of threads. To reduce memory overhead, we propose a locking method, which allows all threads in a block to share the same copy, and supports both fine-grained and coarse-grained locking. The locking method succeeds in reducing memory overhead, but introduces thread contention. To achieve a tradeoff between memory overhead and thread contention, we also present a hybrid scheme.
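The tradeoff between the two methods can be sketched with a CPU-side Python analogy (the function names and threading model here are illustrative, not the actual GPU implementation): replication trades memory and a combination pass for contention-free updates, while locking trades contention for a single shared copy.

```python
import threading

def reduce_replicated(data, num_threads):
    """Full replication: each thread owns a private copy, combined at the end."""
    copies = [0] * num_threads
    def worker(tid):
        for x in data[tid::num_threads]:
            copies[tid] += x        # no contention, but O(threads) memory
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(copies)              # explicit result-combination step

def reduce_locked(data, num_threads):
    """Locking: all threads share one copy guarded by a coarse-grained lock."""
    result = [0]
    lock = threading.Lock()
    def worker(tid):
        for x in data[tid::num_threads]:
            with lock:              # contention replaces the memory/combination cost
                result[0] += x
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return result[0]
```

A hybrid scheme would sit between these extremes, e.g., sharing one copy among a small group of threads rather than one per thread or one per block.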
Next, we investigate strategies for irregular or unstructured applications. One of the key challenges is how to utilize the limited-size shared memory on GPUs. We present a partitioning-based locking scheme that chooses an appropriate partitioning space to guarantee that all the data accessed by a partition fits into shared memory. Moreover, we propose an efficient partitioning method and a data reordering mechanism.
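The idea of partitioning over the reduction space can be illustrated with a minimal Python sketch (a simplified stand-in, assuming an edge-list input and a contiguous block partitioning; the actual partitioning method and data reordering are more involved): nodes are split into blocks small enough to fit a fixed-size buffer that models the on-chip shared memory, and each edge is replicated into every partition that owns one of its endpoints, with updates applied only to locally-owned nodes.

```python
def partition_irregular(edges, num_nodes, capacity):
    """Partition the reduction space (nodes) into blocks of at most `capacity`
    nodes; assign each edge to every partition owning one of its endpoints."""
    num_parts = (num_nodes + capacity - 1) // capacity
    part_of = lambda n: n // capacity
    parts = [[] for _ in range(num_parts)]
    for e in edges:
        for p in {part_of(n) for n in e}:   # crossing edges are replicated
            parts[p].append(e)
    return parts

def irregular_reduction(edges, num_nodes, capacity):
    """Process each partition against a fixed-size local buffer (a stand-in
    for shared memory), here accumulating node degrees as the example."""
    result = [0] * num_nodes
    for p, part_edges in enumerate(partition_irregular(edges, num_nodes, capacity)):
        lo, hi = p * capacity, min((p + 1) * capacity, num_nodes)
        local = [0] * (hi - lo)             # fits in "shared memory" by construction
        for (i, j) in part_edges:
            for n in (i, j):
                if lo <= n < hi:            # update only locally-owned nodes
                    local[n - lo] += 1
        result[lo:hi] = [r + l for r, l in zip(result[lo:hi], local)]
    return result
```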
Further, to better utilize the computing capability of both the host and the accelerator, we extend our irregular parallel scheme to the heterogeneous CPU-GPU environment by introducing a multi-level partitioning scheme and a dynamic task scheduling framework, which supports pipelining between partitioning and computation, and work stealing between the CPU and the GPU. Then, motivated by the emergence of integrated CPU-GPU architectures and their shared physical memory, we develop a thread-block-level scheduling framework for three communication patterns: generalized reduction, irregular reduction, and stencil computation. This novel scheduling framework introduces lock-free access between the CPU and the GPU, reduces command launch and synchronization overhead by removing extra synchronizations, and improves load balance.
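The work-stealing component can be sketched in Python as two workers, standing in for the CPU and the GPU, each starting with its own share of partitions and stealing from the other when idle (a minimal sketch under simplifying assumptions; the names `cpu`/`gpu` and the even split are illustrative, and the real framework also pipelines partitioning with computation):

```python
import collections, threading

def run_heterogeneous(tasks, process):
    """Two workers each start with half the task partitions; an idle worker
    steals from the tail of the other worker's deque."""
    queues = {'cpu': collections.deque(tasks[::2]),
              'gpu': collections.deque(tasks[1::2])}
    lock = threading.Lock()
    done = {'cpu': [], 'gpu': []}

    def worker(me, other):
        while True:
            with lock:
                if queues[me]:
                    task = queues[me].popleft()     # own work first
                elif queues[other]:
                    task = queues[other].pop()      # steal from the victim's tail
                else:
                    return                          # nothing left anywhere
            done[me].append(process(task))

    ts = [threading.Thread(target=worker, args=('cpu', 'gpu')),
          threading.Thread(target=worker, args=('gpu', 'cpu'))]
    for t in ts: t.start()
    for t in ts: t.join()
    return done
```

Stealing from the opposite end of the victim's queue reduces contention with the victim's own dequeues, a standard work-stealing design choice.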
Although support for unstructured control flow has been added to GPUs, the performance of recursive applications on modern GPUs is limited by the current thread reconvergence method in the SIMT execution model. To efficiently port recursive applications to GPUs, we present a novel dynamic thread reconvergence method, which allows threads to reconverge either before or after the static reconvergence point, and improves instructions per cycle (IPC) significantly.
The Intel Xeon Phi is an emerging x86-based many-core coprocessor architecture with wide SIMD vectors. To fully utilize its potential, however, applications must be vectorized to leverage the wide SIMD lanes, in addition to exploiting large-scale shared-memory parallelism. Compared to the SIMT execution model on GPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions in the past. We consider the problem of accelerating applications with different communication patterns on the Xeon Phi, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared-memory and SIMD parallelization, and demonstrate its implementation. We use overloaded functions as the mechanism for providing SIMD code, assisted by runtime data reorganization and methods to effectively manage control flow.
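The overloaded-function mechanism can be sketched in Python (a hypothetical API shape, not the actual interface; `apply_simd`, `scalar_fn`, and `vector_fn` are illustrative names): the caller supplies a scalar per-element function and, optionally, an overloaded variant that processes a whole SIMD-width chunk, and the runtime packs the data into full-width chunks and dispatches to the vector variant where possible.

```python
def apply_simd(scalar_fn, data, vector_fn=None, width=8):
    """Dispatch to a chunk-wise 'vector' overload when a full SIMD-width
    chunk is available; fall back to the scalar function otherwise."""
    if vector_fn is None:
        return [scalar_fn(x) for x in data]     # no vector overload provided
    out = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        if len(chunk) == width:
            out.extend(vector_fn(chunk))        # full lanes: vectorized path
        else:
            out.extend(scalar_fn(x) for x in chunk)  # remainder: scalar path
    return out
```

In the real setting the chunk-wise variant would map to SIMD intrinsics or compiler-vectorized code, and the runtime data reorganization would lay out the chunks contiguously for aligned vector loads.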
In addition, all the proposed methods have been extensively evaluated with multiple applications and a variety of data inputs.