Search Results

(Total results 40)


  • 1. Sarkauskas, Nicholas Large-Message Nonblocking Allgather and Broadcast Offload via BlueField-2 DPU

    Master of Science, The Ohio State University, 2022, Computer Science and Engineering

    Since the introduction of nonblocking collectives in the MPI-3 standard, several mechanisms have been used to progress communication. One is to modify the application code to periodically call MPI_Test to enter the MPI library. Another launches an extra thread per core to progress communication asynchronously. With the latest hardware, communication progression can also be offloaded to the Host Channel Adapter (HCA). In this thesis, we explore this last option by using the Data Processing Unit (DPU) shipped with the BlueField-2 SmartNIC adapter to offload progression of the nonblocking MPI_Ibcast and MPI_Iallgather collectives. For both collectives, we present several designs that take advantage of the DPU. We demonstrate the efficacy of our proposed designs through microbenchmark and application-kernel evaluations. At the microbenchmark level, the total execution time of the osu_ibcast microbenchmark can be reduced by up to 54% using our DPU-based Ibcast designs, and that of the osu_iallgather microbenchmark by up to 43%. For application-kernel evaluations, we run a parallel radix sort kernel modified to take advantage of nonblocking allgather and show up to a 6.4% reduction in overall execution time using our DPU-based Iallgather. To the best of our knowledge, this is the first work to optimize nonblocking broadcast and allgather collectives on emerging BlueField DPUs.

    Committee: Dhabaleswar Panda (Advisor); Hari Subramoni (Committee Member); Radu Teodorescu (Committee Member) Subjects: Computer Engineering; Computer Science
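
    As a point of reference for the progression mechanisms surveyed in the abstract above, here is a minimal sketch (not the author's DPU offload design) of the first mechanism: manually progressing a nonblocking broadcast by periodically calling MPI_Test. The message size and the placeholder work loop are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

/* Minimal sketch: progress an MPI_Ibcast by entering the MPI library
 * periodically via MPI_Test. The "work" comment stands in for useful
 * computation that overlaps with communication. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int count = 1 << 20;               /* illustrative message size */
    double *buf = malloc(count * sizeof(double));

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 0; i < count; i++) buf[i] = (double)i;

    MPI_Request req;
    MPI_Ibcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    int done = 0;
    while (!done) {
        /* ... overlapped computation goes here ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* progress the collective */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```
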
  • 2. Senthil Kumar, Nithin Designing optimized MPI+NCCL hybrid collective communication routines for dense many-GPU clusters

    Master of Science, The Ohio State University, 2021, Computer Science and Engineering

    CUDA-aware Message Passing Interface (MPI) libraries like MVAPICH2-GDR have rapidly evolved to keep up with the demand for efficient GPU buffer-based communication by incorporating the latest technological advances to drive down communication latency significantly. However, with the advent of Deep Learning (DL), vendors have started to introduce libraries that are DL-focused but not MPI-compliant, like the NVIDIA Collective Communications Library (NCCL). Furthermore, there is no single, standardized benchmarking tool to evaluate the performance of both MPI and NCCL operations. In this work, we introduce a new set of collective benchmarks within the OSU Micro-Benchmarks (OMB) to evaluate the performance of NCCL operations in a manner that is semantically equivalent to MPI benchmarks. We then tackle the challenge of determining whether modern CUDA-aware MPI libraries like MVAPICH2-GDR can take advantage of advances in collective communication libraries like NCCL to provide high-performance MPI-compliant collective communication primitives for High-Performance Computing (HPC) and DL applications. We incorporate the ability to invoke the NCCL API into MVAPICH2-GDR's tuning framework in order to select the best algorithm for any given message size. Finally, we evaluate the performance of our designs by investigating the improvement in latency at different message sizes and scales on the Lassen supercomputing system using OMB. The designs developed as a part of this thesis will be made available in future releases of MVAPICH2-GDR and OMB.

    Committee: Dhabaleswar Panda (Advisor); Feng Qin (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Science
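
    A common pattern for pairing MPI with NCCL, which the entry above builds on, is to bootstrap the NCCL communicator over MPI and then issue NCCL collectives on GPU buffers. The sketch below shows that generic pattern; it assumes one GPU per rank, omits error checking, and is not MVAPICH2-GDR's internal design.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch of the usual MPI+NCCL bootstrap: MPI broadcasts the NCCL
 * unique id, then each rank joins a NCCL communicator and runs an
 * allreduce on a GPU buffer. Error checks omitted for brevity. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    cudaSetDevice(rank);  /* assumes ranks map one-to-one to GPUs */

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t count = 1 << 20;             /* illustrative size */
    float *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, count * sizeof(float));
    cudaMalloc(&recvbuf, count * sizeof(float));
    cudaMemset(sendbuf, 0, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```
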
  • 3. Awan, Ammar Ahmad Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Recent advances in Machine Learning (ML) and Deep Learning (DL) techniques have triggered key success stories in many application domains like Computer Vision, Speech Comprehension and Recognition, and Natural Language Processing. Large-scale Deep Neural Networks (DNNs) are primary drivers of these success stories. However, training complex DNN architectures that consist of millions of trainable parameters is compute-intensive. Training is done using a large number of examples (training data set) and can take from weeks to months to achieve state-of-the-art prediction capabilities (accuracy). To achieve higher accuracy, making the DNN deeper and larger has become a common strategy, but it also leads to a significantly bigger memory footprint. Thus, DNN training is not only compute-intensive but also a memory-hungry process requiring gigabytes of memory. To accelerate the process of large-scale DNN training, this dissertation is focused on designing high-performance systems that can exploit thousands of CPUs and GPUs for faster training. The novel approach presented in this work is called the co-design of high-performance communication middleware and DL frameworks. Co-design is necessary because of the complexity of the overall execution stack for modern DL frameworks. Broadly, this stack consists of many layers, which start from the application layer followed by the DL framework layer (e.g., TensorFlow). The next layer in the stack is the distributed training middleware layer (e.g., Horovod) that connects a DL framework to an underlying communication middleware (e.g., a Message Passing Interface (MPI) library). Finally, the communication middleware layer sits directly on top of the parallel hardware that consists of multiple CPU/GPU nodes connected with a high-performance network. The complexity of this stack, coupled with inefficient existing approaches to utilize it, has led to several problems: First, there is a lack of credible and systematic performance (open full item for complete abstract)

    Committee: Dhabaleswar Kumar Panda (Advisor); Srinivasan Parthasarathy (Committee Member); Radu Teodorescu (Committee Member); Hari Subramoni (Committee Member) Subjects: Artificial Intelligence; Computer Science
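
    The layer interactions described above bottom out in a gradient reduction across ranks at each training step. A minimal sketch of that primitive in plain MPI follows; the buffer and the averaging step are illustrative, not Horovod's or the dissertation's actual code path.

```c
#include <mpi.h>

/* Sketch: data-parallel gradient averaging, the core communication
 * step a distributed-training middleware drives through an MPI
 * library. 'grads' stands in for a layer's gradient buffer. */
void average_gradients(float *grads, int count, MPI_Comm comm) {
    int nranks;
    MPI_Comm_size(comm, &nranks);

    /* Sum gradients from all ranks in place, then scale to the mean. */
    MPI_Allreduce(MPI_IN_PLACE, grads, count, MPI_FLOAT, MPI_SUM, comm);
    for (int i = 0; i < count; i++)
        grads[i] /= (float)nranks;
}
```
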
  • 4. Zhou, Qinghua High Performance Communication Middleware with On-the-fly GPU-based Compression for HPC and Deep Learning Applications

    Doctor of Philosophy, The Ohio State University, 2024, Computer Science and Engineering

    General-purpose accelerators such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and tensor processing units (TPUs) are increasingly utilized to improve the performance of modern High-Performance Computing (HPC) and Cloud systems. GPUs, in particular, have emerged as a popular hardware choice due to their ability to handle massive parallelism and high-bandwidth memory. They have become a driving force behind rapid advancements in HPC and ML applications, particularly Deep Learning. GPUs significantly improve computational efficiency and overall performance and are ideal for handling computationally intensive workloads related to scientific simulations, data analysis, and neural network training. To handle growing data and models, HPC and Deep Learning applications need multiple nodes for faster computation. Interconnects like Ethernet and InfiniBand are key to inter-node communication and data sharing. A slow interconnect between nodes can become a bottleneck in these applications compared to intra-node interconnects such as PCIe and NVLink. Large data sets and the training of large deep-learning models increase the need for data transfer between nodes, causing significant delays and reducing performance. The Message Passing Interface (MPI), considered the de facto parallel programming model, provides a set of communication primitives to support parallel and distributed execution of user applications on HPC systems. With support for passing GPU buffers to MPI primitives directly, state-of-the-art MPI libraries significantly improve performance for GPU-accelerated applications. However, the inter-node communication of state-of-the-art MPI libraries has saturated the bandwidth of the InfiniBand network for large GPU-resident data. In this dissertation, we take advantage of GPU-based compression techniques with GPU computing resources to reduce the data size being transferred through the network with limited bandwidth on modern heterogeneous sy (open full item for complete abstract)

    Committee: Dhabaleswar Kumar Panda (Advisor); Hari Subramoni (Advisor); Radu Teodorescu (Committee Member); Christopher Stewart (Committee Member) Subjects: Computer Engineering; Computer Science
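
    The core idea above, shrinking the bytes that cross the bandwidth-limited link, can be sketched as compress-before-send. In the sketch below, the gpu_compress/gpu_decompress hooks are hypothetical placeholders (the dissertation's actual GPU kernels are not shown), and the size-then-payload two-message protocol is an assumption for illustration.

```c
#include <mpi.h>
#include <stddef.h>

/* Hypothetical hooks standing in for on-the-fly GPU compression; the
 * dissertation's actual GPU kernels are not shown here. */
size_t gpu_compress(const void *src, size_t nbytes, void *dst);
void   gpu_decompress(const void *src, size_t csize, void *dst, size_t nbytes);

/* Sketch: reduce the bytes crossing the inter-node link by compressing
 * before MPI_Send. The receiver is assumed to receive the size, receive
 * the payload, then call gpu_decompress. */
void send_compressed(const void *buf, size_t nbytes, void *scratch,
                     int dest, int tag, MPI_Comm comm) {
    unsigned long csize = (unsigned long)gpu_compress(buf, nbytes, scratch);

    /* Ship the compressed length first so the receiver can post a
     * matching receive, then ship the payload itself. */
    MPI_Send(&csize, 1, MPI_UNSIGNED_LONG, dest, tag, comm);
    MPI_Send(scratch, (int)csize, MPI_BYTE, dest, tag, comm);
}
```
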
  • 5. Han, Mingzhe PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI

    Master of Science, The Ohio State University, 2024, Computer Science and Engineering

    The Message Passing Interface (MPI) is the de facto standard in high-performance computing (HPC) for inter-process communication. MPI libraries employ numerous algorithms for each collective communication pattern, whose behavior is largely affected by the underlying hardware, communication pattern, message size, and number of processes involved. Choosing the “best” algorithm for every possible scenario is a non-trivial task. MPI libraries primarily depend on heuristics for algorithm selection on previously unseen clusters, often resulting in evident slowdowns. Although offline micro-benchmarking tools can exhaustively identify optimal algorithms for all configurations, this is an excessively time-consuming approach. Machine Learning (ML) has emerged as an alternative approach. However, most ML-based approaches employ online methods that introduce additional runtime overhead, which makes them impractical at scale. To address this challenge, we propose a pre-trained ML framework that eliminates runtime overhead. Our model requires only a quick inference for each new cluster without necessitating model retraining. Our model's training utilizes tuning data from a broad range of architectures, promoting its versatility, and our proposed system exhibits up to 6.3% speedup over default heuristics on systems of up to 1024 cores while significantly minimizing model overhead in comparison to existing methodologies.

    Committee: Dhabaleswar K. Panda (Advisor) Subjects: Computer Science
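
    The selection problem the framework addresses can be pictured as a lookup from (message size, process count) to an algorithm choice. The sketch below is purely illustrative: the table is hard-coded and the thresholds hypothetical, whereas in PML-MPI the mapping would come from the pre-trained model's inference.

```c
#include <stddef.h>

/* Purely illustrative algorithm-selection table for a collective; the
 * thresholds are hypothetical, not PML-MPI's learned decision rules. */
typedef enum { ALGO_BINOMIAL, ALGO_RING, ALGO_RECURSIVE_DOUBLING } algo_t;

algo_t select_allreduce_algo(size_t msg_bytes, int nprocs) {
    if (msg_bytes <= 4096) return ALGO_RECURSIVE_DOUBLING; /* small messages */
    if (nprocs >= 256)     return ALGO_RING;               /* large scale    */
    return ALGO_BINOMIAL;                                  /* default        */
}
```
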
  • 6. Alattar, Kinan Optimizing Apache Spark using the MVAPICH2 MPI library for High Performance Computing

    Master of Science, The Ohio State University, 2023, Computer Science and Engineering

    With the growing popularity of Big Data frameworks such as Apache Spark, there is an accompanying demand for running such frameworks on High Performance Computing (HPC) systems to utilize their full potential. Apache Spark has seen deployment on HPC systems such as the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC). Spark, however, does not reap the full performance benefits of HPC, namely high-speed interconnects and the de facto standard for writing scientific applications in HPC, the Message Passing Interface (MPI). In my thesis, I present a new design that combines MPI with Apache Spark. We call this new effort MPI4Spark. The MPI4Spark package can launch the Spark ecosystem using MPI launchers, allowing MPI communication inside it. Semantic differences between the application-driven communication of MPI and the event-driven communication in Spark are bridged. A relatively new MPI feature, Dynamic Process Management (DPM), is adopted in this design to launch the Spark executor processes that execute user applications. MPI4Spark is portable across different HPC systems as it benefits from the underlying portability of the MVAPICH2 library, meaning the MPI4Spark package can run on different popular HPC interconnects including InfiniBand, RoCE, Intel Omni-Path, and Slingshot. The MPI4Spark design was evaluated using the OSU HiBD Benchmarks (OHB) and the Intel HiBench Suite, and the performance results were compared against regular “vanilla” Spark and RDMA-Spark. The evaluation was carried out on three HPC systems: TACC Frontera, TACC Stampede2, and OSU's internal cluster, RI2. MPI4Spark overall outperforms vanilla Spark and RDMA-Spark. For the OHB GroupByTest benchmark, MPI4Spark outperformed vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 workers), with a communication speed-up, compared to vanilla (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Hari Subramoni (Committee Member) Subjects: Computer Engineering
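
    The Dynamic Process Management feature adopted above centers on MPI_Comm_spawn, which launches new processes at runtime and returns an intercommunicator to the parent. A minimal generic sketch follows; the executable name and process count are illustrative, not MPI4Spark's actual launcher.

```c
#include <mpi.h>

/* Sketch of MPI Dynamic Process Management: a parent job spawns
 * worker processes at runtime and receives an intercommunicator.
 * "spark_executor" is an illustrative executable name. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm workers;
    int errcodes[4];
    MPI_Comm_spawn("spark_executor", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, errcodes);

    /* Parent and spawned workers can now communicate over 'workers'. */

    MPI_Finalize();
    return 0;
}
```
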
  • 7. Jain, Arpan Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

    Doctor of Philosophy, The Ohio State University, 2023, Computer Science and Engineering

    Deep Learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). The rise of deep learning can be attributed to the presence of large datasets and computation power. Large-scale Deep Neural Networks (DNNs) can provide state-of-the-art performance by learning complex relationships, enabling them to push the boundaries in artificial intelligence. However, training such large-scale DNNs is a compute-intensive task, as a network can have billions of parameters, which increases both the memory and computational requirements of DNN training. Hence, distributed DNN training has become the default approach to train large-scale DNNs like AmoebaNet, GPT-3, and T5. Broadly, the DNN training pipeline can be divided into multiple phases: 1) Data Loading and Data Augmentation, 2) Forward/Backward Pass, and 3) Model Validation. Traditionally, these phases are executed sequentially on a single CPU or GPU due to a lack of additional resources. Multiple processing elements can be used to parallelize the computation in each phase and reduce the overall training time. In this dissertation, we propose novel parallelization strategies for distributed DNN training that alleviate bottlenecks in different phases and parallelize the computation across multiple processing elements. Such strategies must distribute the work efficiently while keeping communication overhead low, since naive parallelization may yield no performance benefit once high communication overhead is accounted for. There are several challenge (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Raghu Machiraju (Committee Member); Aamir Shafi (Committee Member); Hari Subramoni (Committee Member); Rajiv Ramnath (Committee Member) Subjects: Computer Science
  • 8. Srivastava, Siddhartha MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library

    Master of Science, The Ohio State University, 2021, Computer Science and Engineering

    The Message Passing Interface (MPI) is a popular parallel programming interface for developing scientific applications. These applications rely heavily on MPI for performance. Collective operations like MPI_Allreduce and MPI_Alltoall provide an abstraction for group communication on High-Performance Computing (HPC) systems. MVAPICH2 is a popular open-source high-performance implementation of the MPI standard that provides advanced designs for these collectives through various algorithms. These collectives are highly optimized to provide the best performance on different existing and emerging architectures. To provide the best performance, the right algorithm must be chosen for a collective. Choosing the best algorithm depends on many factors, such as the architecture of the system and the scale at which the application is run. This process of choosing the best algorithm is called tuning the collective. However, tuning a collective takes considerable time, and using static tables may not lead to the best performance. To solve this issue, we have designed an “Autotuning Framework”. The proposed Autotuning Framework selects the best algorithm for a collective during runtime, without relying on prior static tuning of the MVAPICH2 library for the system. Experimental results have shown a performance increase of up to 3X when using the Autotuning Framework version of the MVAPICH2 library versus an untuned MVAPICH2 library for collectives.

    Committee: Dhabaleswar K. Panda (Advisor); Radu Teodorescu (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
  • 9. Chu, Ching-Hsiang Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    In the era of post-Moore's-law computing, the traditional general-purpose CPU is unable to keep pace and provide the computing power demanded by modern compute-intensive and highly parallelizable applications. In this context, various accelerator architectures such as the tensor processing unit (TPU), field-programmable gate array (FPGA), and graphics processing unit (GPU) are being designed to meet the high computational demands. Notably, the GPU has been widely adopted in high-performance computing (HPC) and cloud systems to significantly accelerate numerous scientific and emerging machine/deep learning (ML/DL) applications. To obtain more computing power, researchers and engineers are building large-scale GPU clusters, i.e., scaling out. Moreover, the recent advent of high-speed interconnect technology such as NVIDIA NVLink and AMD Infinity Fabric enables the deployment of dense GPU systems, i.e., scaling up. As a result, six of the top 10 supercomputers, as of July 2020, are powered by thousands of NVIDIA GPUs with NVLink and InfiniBand networks. Driven by these ever-larger GPU systems, GPU-aware Message Passing Interface (MPI) has become the standard programming model for developing GPU-enabled parallel applications. However, the state-of-the-art GPU-aware MPI libraries are predominantly optimized by leveraging advanced technology like Remote Direct Memory Access (RDMA), not by exploiting GPUs' computational power. There is a dearth of research in designing GPU-enabled communication middleware that efficiently handles end-to-end networking and harnesses the computational power provided by the accelerators. In this thesis, we take the GPU as an example to demonstrate how to design accelerator-enabled communication middleware that harnesses hardware computational resources and cutting-edge interconnects for high-performance and scalable communication on modern and next-generation heterogeneous HPC systems. Specifically, this thesis addresse (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Radu Teodorescu (Committee Member); Feng Qin (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
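
    The defining convenience of GPU-aware MPI, as described above, is that a device pointer can be passed straight to MPI calls. A minimal sketch under that assumption (a CUDA-aware MPI build, two ranks, one GPU each) follows; it is illustrative, not the thesis's middleware.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch of GPU-aware MPI: the device pointer from cudaMalloc is
 * handed straight to MPI_Send/MPI_Recv, with no explicit staging
 * through host memory. Requires a CUDA-aware MPI build. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;           /* illustrative message size */
    float *dbuf;
    cudaMalloc(&dbuf, count * sizeof(float));

    if (rank == 0)
        MPI_Send(dbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```
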
  • 10. Hashmi, Jahanzeb Maqbool Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Modern High-Performance Computing (HPC) systems are enabling scientists from different research domains such as astrophysics, climate simulation, computational fluid dynamics, and drug discovery to model and simulate computation-heavy problems at different scales. In recent years, the resurgence of Artificial Intelligence (AI), particularly Deep Learning (DL) algorithms, has been made possible by the evolution of these HPC systems. The diversity of applications, ranging from traditional scientific computing to the training and inference of neural networks, is driving the evolution of processor and interconnect technologies as well as communication middlewares. Today's multi-petaflop HPC systems are powered by dense multi-/many-core architectures, and this trend is expected to grow for next-generation systems. The rapid adoption of these high core-density architectures by current- and next-generation HPC systems, driven by emerging application trends, is putting more emphasis on middleware designers to optimize various communication primitives to meet the diverse needs of applications. While these novelties in processor architectures have led to increased on-chip parallelism, they come at the cost of rendering the traditional designs employed by communication middlewares subject to higher intra-node communication costs. Tackling the computation and communication challenges that accompany these dense multi-/many-cores demands special design considerations. Scientific and AI applications that rely on such large-scale HPC systems to achieve higher performance and scalability often use the Message Passing Interface (MPI), Partitioned Global Address Space (PGAS), or a hybrid of both as the underlying communication substrate. These applications use various communication primitives (e.g., point-to-point, collectives, RMA) and often use custom data layouts (e.g., derived datatypes), spending a fair bit of time in communication an (open full item for complete abstract)

    Committee: Dhabaleswar K. (DK) Panda (Advisor); Radu Teodorescu (Committee Member); Feng Qin (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Science
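
    The derived datatypes mentioned above let an application describe a custom layout once and hand it to any MPI call. A minimal sketch using MPI_Type_vector to send one column of a row-major matrix without manual packing follows; the matrix shape is illustrative.

```c
#include <mpi.h>

/* Sketch of an MPI derived datatype: describe one column of a
 * row-major NROWS x NCOLS matrix as a strided vector, so it can be
 * sent without manual packing. */
#define NROWS 4
#define NCOLS 8

void send_column(double matrix[NROWS][NCOLS], int col,
                 int dest, MPI_Comm comm) {
    MPI_Datatype column;
    /* NROWS blocks of 1 element each, strided NCOLS elements apart. */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&matrix[0][col], 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}
```
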
  • 11. Sankarapandian Dayala Ganesh R, Kamal Raj Profiling MPI Primitives in Real-time Using OSU INAM

    Master of Science, The Ohio State University, 2020, Computer Science and Engineering

    In an MPI library, there are numerous routines available under Point-to-Point (MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv) and Collectives (MPI_Allreduce, MPI_Bcast, MPI_Alltoall, MPI_Allgather, MPI_Allgatherv, among others), each with a multitude of implementations/algorithms. When an MPI developer is looking to optimize an application, it can easily become overwhelming to look at profiling and tracing information from multiple processes to understand communication patterns. Historically, it has been extremely hard to gain insights by profiling MPI runtimes at a deeper level. For instance, no existing IB fabric monitoring tool can, at runtime, elicit MPI-level behavior to classify traffic as belonging to or being generated by a specific algorithm of a certain MPI primitive. The ability to look at the performance of an application at this finer level of granularity will empower MPI developers to gain deep insights into the operation of the MPI library. We take up the challenge of modifying MVAPICH2-X to gather and report information pertaining to its various operations with low overhead at runtime, and of modifying OSU INAM to efficiently store and visualize the information collected from MVAPICH2-X in an intuitive way.

    Committee: Dhabaleswar K. Panda (Advisor); Hari Subramoni (Committee Member); Feng Qin (Committee Member) Subjects: Computer Science
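
    The MPI standard's own hook for this kind of runtime introspection is the MPI Tools Information Interface (MPI_T). The minimal sketch below only counts the control and performance variables an implementation exposes; it is a generic illustration, not the MVAPICH2-X/OSU INAM mechanism described above.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: the MPI_T tools interface exposes control and performance
 * variables that profilers can query; here we only count them. */
int main(int argc, char **argv) {
    int provided, ncvar, npvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    MPI_T_pvar_get_num(&npvar);
    printf("control vars: %d, performance vars: %d\n", ncvar, npvar);

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```
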
  • 12. Chakraborty, Sourav High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Modern high-performance computing (HPC) systems are enabling scientists to tackle various grand challenge problems in diverse domains including cosmology and astrophysics, earthquake and weather analysis, molecular dynamics and physics modeling, biological computations, and computational fluid dynamics, among others. Along with the increasing demand for computing power, these applications are creating fundamental new challenges in terms of communication complexity, scalability, and reliability. At the same time, remote and virtualized clouds are rapidly gaining in popularity compared to on-premise clusters due to lower initial cost and greater flexibility. These requirements are driving the evolution of modern HPC processors, interconnects, storage systems, as well as middleware and runtimes. However, a large number of scientific applications have irregular and/or dynamic computation and communication patterns that require different approaches to extract the best performance. The increasing scale of HPC systems, coupled with the diversity of emerging architectures, including the advent of multi-/many-core processors and Remote Direct Memory Access (RDMA) aware networks, has exacerbated this problem by making a "one-size-fits-all" policy non-viable. Thus, a fundamental shift is required in how HPC middleware interacts with the application and reacts to its computation and communication requirements. Furthermore, current-generation middleware consists of many independent components like the communication runtime, resource manager, and job launcher. However, the lack of cooperation among these components often limits the performance and scalability of the end application. To address these challenges, we propose a high-performance and scalable "Cooperative Communication Middleware" for HPC systems. The middleware supports MPI (Message Passing Interface), PGAS (Partitioned Global Address Space), and hybrid MPI+PGAS programming models and provides improved point-to-p (open full item for complete abstract)

    Committee: Dhabaleswar K Panda (Advisor); Gagan Agrawal (Committee Member); Ponnuswamy Sadayappan (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
  • 13. Levi, Jacob Automated Beam Hardening Correction for Myocardial Perfusion Imaging using Computed Tomography

    Doctor of Philosophy, Case Western Reserve University, 2019, Physics

    Myocardial perfusion imaging using computed tomography (MPI-CT) and coronary computed tomography angiography (CTA) have the potential to make CT an ideal, non-invasive imaging gatekeeper exam for invasive coronary angiography. However, beam hardening (BH) artifacts prevent accurate blood flow assessment and the reduction of false positive identification of coronary disease in MPI-CT. BH occurs when a poly-energetic x-ray beam passes through an attenuating material. Depending on the source spectrum and the material absorption, the low-energy (soft) photons are attenuated at a higher rate than high-energy (hard) photons, thereby hardening the beam. In the image reconstruction process, BH leads to characteristic streaks and non-uniformities (“cupping”), which can lead to incorrect clinical interpretation or diagnosis. Current BH correction methods require either energy-sensitive CT, which is not widely available, or prior knowledge of physical characteristics of the scanner (i.e., the x-ray source spectrum or calibration against attenuating materials). In this dissertation, I propose an image-based, calibration-free, automated BH correction (ABHC) method suitable for MPI-CT, which is one of the most demanding applications for BH correction. At the heart of ABHC, a tailored cost function is used to evaluate streak and cupping artifacts that originate from BH. ABHC minimizes the cost function and finds optimal correction parameters for an image-based BH correction algorithm. Two BH correction algorithms from the literature were incorporated into ABHC and tested: the polynomial BH correction and the newer empirical BH correction (EBHC). With both correction algorithms, ABHC yields optimal correction parameters that dramatically reduce BH artifacts. The ABHC algorithm is evaluated by measuring BH artifact streaks and cupping on simulated and physical phantom images, and on preclinical porcine and clinical MPI-CT data. For example, we observe a reduction of 86% in cupping artifact (open full item for complete abstract)

    Committee: Michael Martens PhD (Committee Chair); David Wilson PhD (Committee Member); Robert Brown PhD (Committee Member); Steven Izen PhD (Committee Member) Subjects: Biomedical Engineering; Medical Imaging; Physics
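
    For reference, the polynomial beam-hardening correction named above is commonly written as a low-order polynomial applied to the measured line integrals; the sketch below shows one common form, with the coefficient search expressed as minimization of the ABHC cost function J. The exact parameterization used in the dissertation may differ.

```latex
% Sketch of polynomial beam-hardening correction: the measured
% (hardened) line integral p is mapped to a corrected value p_c by a
% low-order polynomial whose coefficients are tuned by minimizing a
% streak-and-cupping cost function J over the reconstructed image.
p_c = p + \alpha\, p^{2} + \beta\, p^{3},
\qquad
(\alpha^{*}, \beta^{*}) = \arg\min_{\alpha, \beta} \, J\bigl(\mathrm{recon}(p_c)\bigr)
```
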
  • 14. Scheiman, Kevin A Parallel Spectral Method Approach to Model Plasma Instabilities

    Master of Science (MS), Wright State University, 2018, Physics

    The study of solar-terrestrial plasma is concerned with processes in magnetospheric, ionospheric, and cosmic-ray physics involving different particle species and even particles of different energy within a single species. Instabilities in space plasmas and the earth's atmosphere are driven by a multitude of free energy sources such as velocity shear, gravity, temperature anisotropy, and electron and ion beams and currents. Microinstabilities such as the Rayleigh-Taylor and Kelvin-Helmholtz instabilities are important for understanding plasma dynamics in the presence of magnetic field and velocity shear. Modeling these turbulent processes is computationally demanding, requiring large memory and suffering from excessively long runtimes. Previous works have successfully modeled the linear and nonlinear growth phases of Rayleigh-Taylor and Kelvin-Helmholtz type instabilities in ionospheric plasmas using finite difference methods. The approach here uses a two-fluid theoretical ion-electron model, solving the two-fluid equations with an iterative procedure and keeping only second-order terms. It includes the equation of motion for ions and electrons, the continuity equations for both species, and the assumption that the electric drift and gravitational drift are of the same order. The effort of this work is focused on developing a new pseudo-spectral, highly parallelizable numerical approach to achieve maximal computational speedup and efficiency. Domain decomposition along with Message Passing Interface (MPI) functionality was implemented for use of multiple-processor distributed-memory computing. The global perspective of using Fourier transforms not only adds to the accuracy of the differentiation process but also limits memory calls when performing calculations. An original method for calculating the Laplacian of a periodic function was developed that obtained a maximum speedup of 2.98 when run on 16 processors, with a theoretical max of 3.63. Using this meth (open full item for complete abstract)

    Committee: Amit Sharma Ph.D. (Advisor); Brent Foy Ph.D. (Committee Member); Ivan Medvedev Ph.D. (Committee Member) Subjects: Computer Science; Physics; Plasma Physics
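
    The pseudo-spectral idea at the core of this approach is that, for a periodic function, differentiation becomes multiplication in Fourier space, so the Laplacian needs only a forward transform, a pointwise scaling, and an inverse transform. The notation below is a generic sketch, not the thesis's own formulation.

```latex
% Spectral Laplacian of a periodic field f: transform, multiply by
% -|k|^2, and transform back. \mathcal{F} denotes the Fourier
% transform and \mathbf{k} the wavevector.
\nabla^{2} f = \mathcal{F}^{-1}\!\left[ -\,|\mathbf{k}|^{2}\, \mathcal{F}[f] \right]
```
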
  • 15. Eck, Brendan Myocardial Perfusion Imaging with X-Ray Computed Tomography

    Doctor of Philosophy, Case Western Reserve University, 2018, Biomedical Engineering

    Early detection and treatment of coronary artery disease (CAD) can improve prognosis and overall survival. However, current noninvasive assessment is highly inefficient: of patients referred to invasive angiography, >60% do not have obstructive CAD. Microvascular disease (MVD) accounts for a significant portion of these patients, particularly patients with diabetes, smoking, hypertension, or other cardiomyopathies. Quantitative estimates of myocardial blood flow by myocardial perfusion imaging (MPI) can detect the physiologic impact of MVD and obstructive CAD. The combination of MPI with computed tomography (MPI-CT) and coronary CT angiography would enable rapid physiologic and anatomic evaluation of CAD and MVD in a single exam. Despite a number of promising MPI-CT reports, the lack of consensus in image acquisition and myocardial blood flow quantification methods, as well as concern regarding imaging artifacts and radiation dose, slows clinical adoption. Four projects are described in this dissertation. First, in a porcine model of flow-limiting stenosis scanned on a spectral detector CT, energy-sensitive reconstruction and dynamic imaging were shown to improve detection of myocardial ischemia as compared to conventional reconstruction and static imaging. Second, the role of imaging conditions and quantification methods was evaluated with regard to obtaining accurate and precise myocardial blood flow (MBF) estimates. Several methods from the literature, some implemented in commercial software, gave imprecise, biased MBF estimates; a proposed robust physiologic model was found to quantify MBF precisely and accurately. Third, a method to calculate MBF confidence intervals (MBFCI) was developed and used to select appropriate analysis models. Use of MBFCI and a goodness-of-fit metric, the Akaike Information Criterion (AIC), selected a model with precise MBF estimates, whereas AIC alone selected models with imprecise MBF estimates. Fourth, an advanced iterative reco (open full item for complete abstract)

    Committee: David Wilson PhD (Advisor); Nicole Seiberlich PhD (Committee Chair); Hiram Bezerra MD, PhD (Committee Member); Raymond Muzic Jr., PhD (Committee Member); Steven Izen PhD (Committee Member) Subjects: Biomedical Engineering; Biomedical Research; Medical Imaging; Radiology
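
    For reference, the Akaike Information Criterion used above for model selection has the standard form below, where k is the number of fitted parameters and \hat{L} the maximized likelihood; lower AIC is preferred when comparing candidate perfusion models.

```latex
% Akaike Information Criterion: k fitted model parameters,
% \hat{L} the maximized likelihood of the model.
\mathrm{AIC} = 2k - 2\ln\hat{L}
```
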
  • 16. Li, Mingzhe Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters

    Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering

    Multi-/Many-core architectures and networking technologies like InfiniBand (IB) are fueling the growth of next-generation ultra-scale systems that have high compute density. The communication requirements of scientific applications are steadily increasing. MPI two-sided programming models have been used by most scientific applications on High-Performance Computing (HPC) systems; however, there is an increased focus on MPI one-sided and Partitioned Global Address Space (PGAS) programming models such as OpenSHMEM. As modern computer hardware architectures keep evolving, it is critical that MPI and PGAS runtimes and scientific applications are designed with high scalability and performance for next-generation systems. This thesis focuses on designing high-performance Remote Memory Access (RMA) for MPI and PGAS models with modern networking technologies on heterogeneous clusters. High-performance interconnects have been the key drivers of high-performance computing systems. Many new networking technologies have been offered on interconnects to meet the increasing communication requirements of scientific applications. However, MPI and PGAS runtimes have not been designed with such technologies to further boost the performance of scientific applications on multi-petaflop/exascale HPC systems. We present designs at the MPI and PGAS runtime level that take advantage of hardware atomics, User-Mode Memory Registration (UMR), and On-Demand Paging (ODP) of InfiniBand to benefit scientific applications transparently. With our ODP-aware MPI runtime, the pin-down buffer size of the LAMMPS application has been reduced by 11X. Similarly, we have shown up to 4X performance improvement in point-to-point latency for noncontiguous data movement for 4MB messages. Most scientific applications have been written with MPI two-sided programming models, which have been shown to be a good fit for regular and iterative applications. However, it can be very difficult to use MPI two-sid (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Gagan Agrawal (Committee Member); P. Sadayappan (Committee Member); Xiaoyi Lu (Committee Member) Subjects: Computer Science
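
    The MPI one-sided (RMA) model this thesis builds on lets one rank write directly into another rank's exposed memory window. A minimal generic sketch with fence synchronization follows; it illustrates the standard API, not the thesis's runtime designs.

```c
#include <mpi.h>

/* Sketch of MPI one-sided RMA: rank 0 puts data directly into a
 * window exposed by rank 1, synchronized with fences. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;
    double local[1024] = {0};

    MPI_Win win;
    MPI_Win_create(local, count * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open the access epoch   */
    if (rank == 0)
        MPI_Put(local, count, MPI_DOUBLE, /* target rank */ 1,
                /* target disp */ 0, count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);               /* complete the epoch      */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```
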
  • 17. Augustine, Albert Mathews Designing a Scalable Network Analysis and Monitoring Tool with MPI Support

    Master of Science, The Ohio State University, 2016, Computer Science and Engineering

    State-of-the-art high-performance computing is powered by the tight integration of several hardware and software components. While on the hardware side we have multi-/many-core architectures (including accelerators and co-processors) and high-end interconnects (like InfiniBand and Omni-Path), on the software front we have several high-performance implementations of parallel programming models which help us take advantage of the advanced features offered by the hardware components. This tight coupling between both these layers helps deliver multi-petaflop performance to the end application, allowing scientists and engineers to tackle the grand challenges in their respective areas. Understanding and gaining insights into the performance of the end application on these modern systems is a challenging task. Several tools have been developed to inspect the network-level or MPI-level activities to address this challenge. However, these existing tools inspect the network and MPI layers in a disjoint manner and are not able to provide a holistic picture correlating the data generated for the network layer and MPI. Thus, the user can miss out on critical information that could have helped in understanding the interaction between MPI applications and the network they are running on. In this thesis, we take up this challenge and design OSU INAM. OSU INAM allows users to analyze and visualize the communication happening in the network in conjunction with the data obtained from the MPI library. Our experimental analysis shows that the tool is able to profile and visualize the communication with very low performance overhead at scale.

    Committee: Dhabaleswar Panda Dr. (Advisor); Radu Teodorescu Dr. (Committee Member); Hari Subramoni Dr. (Committee Member) Subjects: Computer Engineering; Computer Science
  • 18. Maddipati, Sai Ratna Kiran Improving the Parallel Performance of Boltzman-Transport Equation for Heat Transfer

    Master of Science, The Ohio State University, 2016, Computer Science and Engineering

    In a thermodynamically unstable environment, the Boltzmann Transport Equation (BTE) defines the heat transfer rate at each location in the environment, the direction of heat transfer by the particles of the environment, and the final equilibrium temperature conditions of the environment. Solving the BTE is very computationally intensive, so there is a need for efficient parallelization. The parallel implementation of this application is explained, along with brief details about several techniques that have been used in past work. The implementation involves several code iterations of the BTE application with distinct changes, in order to compare and analyze the performance and identify the reasons for performance improvement or deterioration. Experimental results are then presented to show the resulting performance of the implementation.

    Committee: P Sadayappan (Advisor); Nasko Rountev (Committee Member) Subjects: Computer Science
  • 19. Raja Chandrasekar, Raghunath Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    In high-performance computing (HPC), tightly coupled parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experience faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these snapshots that were saved periodically. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with a single checkpoint taking on the order of tens of minutes to hours to write. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources. On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from - on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed-RAM, flash-storage (SSDs), HDDs, parallel file systems, and archival sto (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Kathryn Mohror (Committee Member) Subjects: Computer Engineering; Computer Science
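
    The rollback-recovery scheme described above is easy to outline at the application level. The sketch below is a generic illustration, not the dissertation's middleware: each rank writes a per-rank checkpoint file at a fixed interval, and the file naming and state layout are assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Generic sketch of application-level checkpoint-restart: each rank
 * periodically dumps its state to a per-rank file that it can re-read
 * after a failure. File naming and state layout are illustrative. */
void write_checkpoint(int rank, int step, const double *state, int n) {
    char path[64];
    snprintf(path, sizeof(path), "ckpt_rank%d_step%d.bin", rank, step);
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(state, sizeof(double), n, f);
    fclose(f);
}

void simulate(int rank, double *state, int n, int nsteps, int interval) {
    for (int step = 0; step < nsteps; step++) {
        /* ... advance the simulation one step ... */
        if (step % interval == 0) {
            /* Agree on a mutually consistent checkpoint boundary. */
            MPI_Barrier(MPI_COMM_WORLD);
            write_checkpoint(rank, step, state, n);
        }
    }
}
```
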
  • 20. Jose, Jithin Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    The computation and communication requirements of modern High-Performance Computing (HPC) and Big Data applications are steadily increasing. HPC scientific applications typically use the Message Passing Interface (MPI) as the programming model; however, there is an increased focus on hybrid MPI+PGAS (Partitioned Global Address Space) models for emerging exascale systems. Big Data applications rely on middleware such as Hadoop (including MapReduce, HDFS, HBase, etc.) and Memcached. It is critical that these middleware be designed with high scalability and performance for next-generation systems. In order to ensure that HPC and Big Data applications can continue to scale and leverage the capabilities and performance of emerging technologies, a high-performance communication runtime is much needed. This thesis focuses on designing a high-performance and scalable Unified Communication Runtime (UCR) for HPC and Big Data middleware. In the HPC domain, MPI has been the prevailing communication middleware for more than two decades. Even though it has been successful for developing regular and iterative applications, it can be very difficult to use MPI and maintain performance for irregular, data-driven applications. The PGAS programming model presents an attractive alternative for designing such applications and provides higher productivity. It is widely believed that parts of applications can be redesigned using PGAS models, leading to hybrid MPI+PGAS applications with improved performance. In order to fully leverage the performance benefits offered by modern HPC systems, a unified communication runtime that offers the advantages of both MPI and PGAS programming models is critical. We present "MVAPICH2-X", a high-performance and scalable 'Unified Communication Runtime' that supports both MPI and PGAS programming models. This thesis also targets redesigning applications to make use of hybrid programming features for better performance. With our hybrid MPI+PGAS design using Un (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Karen Tomko (Committee Member) Subjects: Computer Science
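
    As a concrete taste of the PGAS style that UCR unifies with MPI, here is a minimal OpenSHMEM sketch, assuming an OpenSHMEM 1.x implementation; it is illustrative and unrelated to MVAPICH2-X's internals.

```c
#include <shmem.h>
#include <stdio.h>

/* Minimal OpenSHMEM sketch of the PGAS model: every PE allocates a
 * symmetric variable, and PE 0 writes directly into PE 1's copy with
 * a one-sided put. Assumes at least two PEs. */
int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    long *dst = shmem_malloc(sizeof(long));  /* symmetric allocation */
    *dst = -1;
    shmem_barrier_all();

    if (me == 0)
        shmem_long_p(dst, 42, 1);            /* one-sided write to PE 1 */

    shmem_barrier_all();
    if (me == 1)
        printf("PE 1 received %ld\n", *dst);

    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```
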