Search Results

(Total results 26)

  • 1. Zhou, Qinghua High Performance Communication Middleware with On-the-fly GPU-based Compression for HPC and Deep Learning Applications

    Doctor of Philosophy, The Ohio State University, 2024, Computer Science and Engineering

    General-purpose accelerators such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and tensor processing units (TPUs) are increasingly utilized to improve the performance of modern High-Performance Computing (HPC) and Cloud systems. GPUs, in particular, have emerged as a popular hardware choice due to their ability to handle massive parallelism and high-bandwidth memory. They have become a driving force behind rapid advancements in HPC and ML applications, particularly Deep Learning. GPUs significantly improve computational efficiency and overall performance and are ideal for handling computationally intensive workloads related to scientific simulations, data analysis, and neural network training. To handle growing data and models, HPC and Deep Learning applications need multiple nodes for faster computation. Interconnects like Ethernet and InfiniBand are key for inter-node communication and data sharing. A slow interconnect between nodes can become a bottleneck in these applications compared to intra-node interconnects such as PCIe and NVLink. Large data sets and training large deep-learning models increase the need for data transfer between nodes, causing significant delays and reducing performance. The Message Passing Interface (MPI), considered the de facto parallel programming model, provides a set of communication primitives to support parallel and distributed execution of user applications on HPC systems. With support for passing GPU buffers directly to MPI primitives, state-of-the-art MPI libraries significantly improve performance for GPU-accelerated applications. However, the inter-node communication bandwidth of state-of-the-art MPI libraries has saturated the bandwidth of the InfiniBand network for large GPU-resident data. In this dissertation, we take advantage of GPU-based compression techniques with GPU computing resources to reduce the data size being transferred through the network with limited bandwidth on modern heterogeneous sy (open full item for complete abstract)

    Committee: Dhabaleswar Kumar Panda (Advisor); Hari Subramoni (Advisor); Radu Teodorescu (Committee Member); Christopher Stewart (Committee Member) Subjects: Computer Engineering; Computer Science
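
    The dissertation itself targets on-the-fly GPU-based compression inside the MPI library; as a rough, minimal sketch of the general idea (compress a large buffer before it crosses the bandwidth-limited network, decompress on arrival), the Python/mpi4py fragment below uses host-side zlib as a stand-in. The library choices and buffer sizes are illustrative assumptions, not code from the dissertation.

        # Minimal sketch: compress before an MPI send, decompress after the receive.
        # Host-side zlib stands in for the GPU-based compression described above.
        import zlib
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        if comm.Get_rank() == 0:
            data = np.zeros(1 << 20, dtype=np.float32)        # large, compressible payload
            comm.send(zlib.compress(data.tobytes()), dest=1)  # fewer bytes cross the interconnect
        elif comm.Get_rank() == 1:
            raw = zlib.decompress(comm.recv(source=0))
            data = np.frombuffer(raw, dtype=np.float32)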
  • 2. Scyphers, Madeline Bayesian Optimization for Anything (BOA): An Open-Source Framework for Accessible, User-Friendly Bayesian Optimization

    Master of Science, The Ohio State University, 2024, Environmental Science

    I introduce Bayesian Optimization for Anything (BOA), a high-level BO framework and model wrapping toolkit, which presents a novel approach to simplifying Bayesian Optimization (BO) with the goal of making it more accessible and user-friendly, particularly for those with limited expertise in the field. BOA addresses common barriers in implementing BO, focusing on ease of use, reducing the need for deep domain knowledge, and cutting down on extensive coding requirements. A notable feature of BOA is its language-agnostic architecture. Using JSON serialization, BOA facilitates communication between different programming languages, enabling a wide range of users to integrate BOA with their existing models, regardless of the programming language used, with a simple and easy-to-use interface. This feature enhances the applicability of BOA, allowing for broader application in various fields and to a wider audience. I highlight BOA's application through several real-world examples. BOA has been successfully employed in a high-dimensional (184-parameter) optimization of the Soil & Water Assessment Tool (SWAT+) model, demonstrating its capability in parallel optimization with SWAT and non-parallel models, such as SWAT+. I employed BOA in a multi-objective optimization of the FETCH3.14 model. These case studies illustrate BOA's effectiveness in addressing complex optimization challenges in diverse scenarios.

    Committee: Gil Bohrer (Advisor); James Stagge (Committee Member); Joel Paulson (Committee Member) Subjects: Artificial Intelligence; Computer Science; Environmental Engineering; Environmental Studies; Statistics
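
    To illustrate the language-agnostic pattern the abstract describes (parameters serialized to JSON, an external model in any language reporting its objective back as JSON), here is a minimal, hypothetical Python sketch; the file names, keys, and run_model command are assumptions for illustration, not BOA's actual API.

        # Hand parameters to an external model via JSON and read back its objective.
        import json
        import subprocess

        def evaluate(params, model_cmd="./run_model"):
            with open("params.json", "w") as f:
                json.dump(params, f)                       # parameters go out as JSON
            subprocess.run([model_cmd, "params.json"], check=True)
            with open("objective.json") as f:
                return json.load(f)["objective"]           # metric comes back as JSON

        # e.g. evaluate({"curve_number": 65.0, "soil_awc": 0.12})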
  • 3. Alattar, Kinan Optimizing Apache Spark using the MVAPICH2 MPI library for High Performance Computing

    Master of Science, The Ohio State University, 2023, Computer Science and Engineering

    With the growing popularity of Big Data frameworks such as Apache Spark, there is an accompanying demand for running such frameworks on High Performance Computing (HPC) systems to utilize their full potential. Apache Spark has seen deployment on HPC systems such as the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC). Spark, however, does not reap the full performance benefits of HPC systems, namely their high-speed interconnects and the de facto standard for writing scientific applications in HPC, the Message Passing Interface (MPI). In my thesis, I present a new design that combines MPI with Apache Spark. We call this new effort MPI4Spark. The MPI4Spark package can launch the Spark ecosystem using MPI launchers, allowing MPI communication inside of it. Semantic differences between the application-driven communication of MPI and the event-driven communication in Spark are bridged. A relatively new feature of MPI is also adopted in this design: Dynamic Process Management (DPM), which is used to launch the Spark executor processes that execute user applications. MPI4Spark is portable across different HPC systems as it benefits from the underlying portability of the MVAPICH2 library, meaning the MPI4Spark package can run on popular HPC interconnects including InfiniBand, RoCE, Intel Omni-Path, and Slingshot. The MPI4Spark design was evaluated using the OSU HiBD Benchmarks (OHB) and the Intel HiBench Suite, and the performance results were compared against regular “vanilla” Spark and RDMA-Spark. The evaluation was carried out on three HPC systems: TACC Frontera, TACC Stampede2, and OSU's internal cluster, RI2. MPI4Spark overall outperforms vanilla Spark and RDMA-Spark. For the OHB GroupByTest benchmark, MPI4Spark outperformed vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 workers), with a communication speed-up, compared to vanilla (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Dhabaleswar K. Panda (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering
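
    The abstract notes that MPI4Spark uses MPI Dynamic Process Management (DPM) to launch Spark executor processes. A minimal mpi4py sketch of DPM itself is shown below; the worker script name and task payload are illustrative assumptions and not part of MPI4Spark.

        # Parent side: spawn worker processes via MPI DPM and hand them a task description.
        import sys
        from mpi4py import MPI

        workers = MPI.COMM_SELF.Spawn(sys.executable,
                                      args=["executor_worker.py"],  # hypothetical worker script
                                      maxprocs=4)
        workers.bcast({"app": "wordcount", "partitions": 8}, root=MPI.ROOT)
        workers.Disconnect()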
  • 4. Jain, Arpan Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

    Doctor of Philosophy, The Ohio State University, 2023, Computer Science and Engineering

    Deep Learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). The rise of deep learning can be attributed to the presence of large datasets and computation power. Large-scale Deep Neural Networks (DNNs) can provide state-of-the-art performance by learning complex relationships, enabling them to push the boundaries in artificial intelligence. However, training such large-scale DNNs is a compute-intensive task, as they can have billions of parameters, which increases both the memory and computational requirements of DNN training. Hence, distributed DNN training has become the default approach to train large-scale DNNs like AmoebaNet, GPT-3, and T5. Broadly, the DNN training pipeline can be divided into multiple phases: 1) Data Loading and Data Augmentation, 2) Forward/Backward Pass, and 3) Model Validation. Traditionally, these phases are executed sequentially on a single CPU or GPU due to a lack of additional resources. Multiple processing elements can be used to parallelize the computation in each phase and reduce the overall training time. In this dissertation, we propose novel parallelization strategies for distributed DNN training to alleviate bottlenecks in different phases of DNN training and parallelize the computation across multiple processing elements. Novel parallelization strategies are required to efficiently distribute the work among multiple processing elements and reduce communication overhead, as naive parallelization strategies may not give performance benefits because of high communication overhead. Therefore, we need novel parallelization strategies designed to distribute the work while keeping the communication overhead low. There are several challenge (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Raghu Machiraju (Committee Member); Aamir Shafi (Committee Member); Hari Subramoni (Committee Member); Rajiv Ramnath (Committee Member) Subjects: Computer Science
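
    As background for the parallelization strategies discussed above, the most common one, data parallelism in the forward/backward phase, reduces to averaging gradients across ranks after each backward pass. A minimal mpi4py sketch (not the dissertation's code) of that step:

        # Average per-rank gradients with Allreduce to keep model replicas in sync.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        local_grad = np.random.rand(1024).astype(np.float32)   # stand-in for one layer's gradient
        avg_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)        # sum across all ranks
        avg_grad /= comm.Get_size()                             # divide to get the average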
  • 5. Chen, Weicong High-performance and Scalable Bayesian Group Testing and Real-time fMRI Data Analysis

    Doctor of Philosophy, Case Western Reserve University, 2023, EECS - Computer and Information Sciences

    The COVID-19 pandemic has necessitated disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using Bayesian Halving Algorithms. Computationally, however, Bayesian group testing poses considerable challenges, as computational complexity grows exponentially with sample size. This can lead to shortcomings in reaching a desirable scale without practical limitations. To overcome these challenges, I propose a high-performance Bayesian group testing framework named HiBGT, which systematically explores the design space of Bayesian group testing and provides comprehensive heuristics on how to achieve high-performance Bayesian group testing. I show that HiBGT can perform large-scale test selections (>2^50 state iterations) and accelerate statistical analyses up to 15.9x (up to 363x with little trade-off) through a varied selection of sophisticated parallel computing techniques, while achieving near-linear scalability using up to 924 CPU cores. I further propose to scale HiBGT using a lightning-fast and highly scalable framework named SBGT. In particular, SBGT is up to 376x, 1733x, and 1523x faster than HiBGT in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9% scaling efficiency on up to 4096 CPU cores. I also propose algorithms and workflows for next-generation real-time analysis of fMRI data and dynamic adjustment of experiment stimuli through early stopping. To overcome significant computational challenges raised in this setting, I design a Scalable, Parallel, and Real-Time Sequential Probability Ratio Test (open full item for complete abstract)

    Committee: Curtis Tatsuoka (Advisor); Vipin Chaudhary (Committee Chair); Xiaoyi Lu (Committee Member); Vincenzo Liberatore (Committee Member) Subjects: Computer Science
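
    The real-time fMRI work above builds on a Sequential Probability Ratio Test (SPRT) for early stopping. Purely as a generic illustration (not the proposed scalable, parallel framework), a textbook SPRT loop looks like the sketch below; the thresholds and likelihood functions are standard Wald-style assumptions.

        # Generic SPRT: accumulate a log-likelihood ratio and stop early at a threshold.
        import math

        def sprt(samples, loglik_h1, loglik_h0, alpha=0.05, beta=0.05):
            upper = math.log((1 - beta) / alpha)      # cross above: accept H1
            lower = math.log(beta / (1 - alpha))      # cross below: accept H0
            llr = 0.0
            for i, x in enumerate(samples, 1):
                llr += loglik_h1(x) - loglik_h0(x)
                if llr >= upper:
                    return "accept H1", i
                if llr <= lower:
                    return "accept H0", i
            return "continue", len(samples)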
  • 6. Tronge, Jacob Orchestration of HPC Workflows: Scalability Testing and Cross-System Execution

    MS, Kent State University, 2022, College of Arts and Sciences / Department of Computer Science

    HPC and scientific workflows change over time and often require more resources, different parameters, and new environments. Workflows may eventually need more resources than are available on a single platform. Also, as applications evolve to fit new requirements and design goals, their performance and how well they scale on existing hardware need to be measured to ensure optimal application development and design. New workflows, as well as existing workflows, will require next-generation workflow engines that are able to handle multiple platforms, scalability testing, and communication and monitoring of applications, all designed to allow for greater portability and reproducibility. In this work I demonstrate extensions to the Build and Execute Environment (BEE) workflow orchestration system, as well as additional code known as BeeSwarm, which are used for testing the scalability and performance of existing HPC applications. This work also encompasses new design choices in BEE that allow for running workflows across multiple underlying systems, thus not limiting workflows to only the resources that are available on a single system.

    Committee: Qiang Guan (Advisor); Mikhail Nesterenko (Committee Member); Xiang Lian (Committee Member) Subjects: Computer Science
  • 7. Srivastava, Siddhartha MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library

    Master of Science, The Ohio State University, 2021, Computer Science and Engineering

    The Message Passing Interface (MPI) is a popular parallel programming interface for developing scientific applications. These applications rely heavily on MPI for performance. Collective operations like MPI_Allreduce, MPI_Alltoall, and others provide an abstraction for group communication on High-Performance Computing (HPC) systems. MVAPICH2 is a popular open-source high-performance implementation of the MPI standard that provides advanced designs for these collectives through various algorithms. These collectives are highly optimized to provide the best performance on different existing and emerging architectures. To provide the best performance, the right algorithm must be chosen for a collective. Choosing the best algorithm depends on many factors, such as the architecture of the system and the scale at which the application is run. This process of choosing the best algorithm is called tuning the collective. However, tuning a collective takes a lot of time, and using static tables may not lead to the best performance. To solve this issue, we have designed an “Autotuning Framework”. The proposed Autotuning Framework selects the best algorithm for a collective at runtime, without having to rely on previous static tuning of the MVAPICH2 library for the system. Experimental results show a performance increase of up to 3X when using the Autotuning Framework version of the MVAPICH2 library versus an untuned MVAPICH2 library for collectives.

    Committee: Dhabaleswar K. Panda (Advisor); Radu Teodorescu (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
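
    As a rough sketch of the runtime-tuning idea described above (pick the fastest collective algorithm by timing candidates on the actual system and message size instead of relying on static tables), the mpi4py fragment below times two hand-rolled allreduce variants; the candidate set and trial counts are illustrative assumptions, not the MVAPICH2 Autotuning Framework.

        # Time candidate allreduce implementations at runtime and keep the fastest.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD

        def allreduce_builtin(buf, out):
            comm.Allreduce(buf, out, op=MPI.SUM)

        def allreduce_reduce_bcast(buf, out):
            comm.Reduce(buf, out, op=MPI.SUM, root=0)   # reduce-then-broadcast variant
            comm.Bcast(out, root=0)

        def pick_best(candidates, size=1 << 16, trials=5):
            buf = np.ones(size, dtype=np.float32)
            out = np.empty(size, dtype=np.float32)
            timings = {}
            for name, fn in candidates.items():
                start = MPI.Wtime()
                for _ in range(trials):
                    fn(buf, out)
                timings[name] = MPI.Wtime() - start
            return min(timings, key=timings.get)        # lowest wall time wins

        best = pick_best({"builtin": allreduce_builtin,
                          "reduce+bcast": allreduce_reduce_bcast})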
  • 8. Kedia, Mansa Profile, Monitor, and Introspect Spark Jobs Using OSU INAM

    Master of Science, The Ohio State University, 2020, Computer Science and Engineering

    With Apache Spark gaining popularity in the Big Data domain, it is becoming crucial to be able to profile Spark applications and get details about each stage to help optimize performance. Spark already exposes a suite of web UIs to help users monitor basic application statistics. However, this information is not sufficient. An area with a lot of potential for performance improvement is the shuffle phase. If users can get insights about this phase, it can help them address many sources of inefficiency by modifying some design decisions. We take up the challenge of introducing a new capability to OSU INAM that allows it to gain insights about Spark-based Big Data applications to help in performance troubleshooting and workload characterization. We present a holistic view by correlating network information and Spark middleware-level information, as well as the data transfer that happens during the shuffle phase. To demonstrate the use of this capability, we run different types of Spark applications/benchmarks that help highlight the different communication patterns.

    Committee: Dhabaleswar K. Panda (Advisor); Radu Teodorescu (Committee Member); Hari Subramoni (Committee Member); Aamir Shafi (Committee Member) Subjects: Computer Science
  • 9. Baheri, Betis MARS: Multi-Scalable Actor-Critic Reinforcement Learning Scheduler

    MS, Kent State University, 2020, College of Arts and Sciences / Department of Computer Science

    In this thesis we introduce a new scheduling algorithm, MARS, based on a cost-aware, multi-scalable reinforcement learning approach, which serves as an intermediate layer between the HPC resource manager and the user application workflow. MARS ensembles the pre-generated models from users' workflows and decides on the most suitable strategy for optimization. A whole workflow application is split into several optimized sub-tasks. Then, based on a pre-defined resource management plan, a reward is generated after executing a scheduled task. Lastly, MARS updates the Deep Neural Network (DNN) model for future use. MARS is designed to be able to optimize existing models through a reinforcement mechanism. MARS can adapt to a shortage of training samples by optimizing performance through combining small tasks together or switching between pre-built scheduling strategies such as Backfilling, SJF, etc., and choosing the most suitable approach. Testing MARS with different real-world workflow traces shows that MARS can achieve 5%-60% better performance than the other approaches.

    Committee: Qiang Guan Dr. (Advisor); Feodor Dragan Dr. (Committee Member); Rouming Jin Dr. (Committee Member) Subjects: Computer Science
  • 10. Chu, Ching-Hsiang Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    In the era of post-Moore's-law computing, the traditional general-purpose CPU is not able to keep pace and provide the computing power demanded by modern compute-intensive and highly parallelizable applications. In this context, various accelerator architectures such as the tensor processing unit (TPU), field-programmable gate array (FPGA), and graphics processing unit (GPU) are being designed to meet the high computational demands. Notably, the GPU has been widely adopted in high-performance computing (HPC) and cloud systems to significantly accelerate numerous scientific and emerging machine/deep learning (ML/DL) applications. To seek more computing power, researchers and engineers are building large-scale GPU clusters, i.e., scale-out. Moreover, the recent advent of high-speed interconnect technology such as NVIDIA NVLink and AMD Infinity Fabric enables the deployment of dense GPU systems, i.e., scale-up. As a result, we are witnessing that six out of the top 10 supercomputers, as of July 2020, are powered by thousands of NVIDIA GPUs with NVLink and InfiniBand networks. Driven by these ever larger GPU systems, GPU-Aware Message Passing Interface (MPI) has become the standard programming model for developing GPU-enabled parallel applications. However, state-of-the-art GPU-Aware MPI libraries are predominantly optimized by leveraging advanced technology like Remote Direct Memory Access (RDMA), not by exploiting GPUs' computational power. There is a dearth of research in designing GPU-enabled communication middleware that efficiently handles end-to-end networking and harnesses the computational power provided by the accelerators. In this thesis, we take the GPU as an example to demonstrate how to design accelerator-enabled communication middleware that harnesses hardware computational resources and cutting-edge interconnects for high-performance and scalable communication on modern and next-generation heterogeneous HPC systems. Specifically, this thesis addresse (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Radu Teodorescu (Committee Member); Feng Qin (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
  • 11. Hashmi, Jahanzeb Maqbool Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems

    Doctor of Philosophy, The Ohio State University, 2020, Computer Science and Engineering

    Modern High-Performance Computing (HPC) systems are enabling scientists from different research domains such as astrophysics, climate simulations, computational fluid dynamics, drug discovery, and others to model and simulate computation-heavy problems at different scales. In recent years, the resurgence of Artificial Intelligence (AI), particularly Deep Learning (DL) algorithms, has been made possible by the evolution of these HPC systems. The diversity of applications, ranging from traditional scientific computing to the training and inference of neural networks, is driving the evolution of processor and interconnect technologies as well as communication middlewares. Today's multi-petaflop HPC systems are powered by dense multi-/many-core architectures, and this trend is expected to grow for next-generation systems. The rapid adoption of these high core-density architectures by current- and next-generation HPC systems, driven by emerging application trends, is putting more emphasis on middleware designers to optimize various communication primitives to meet the diverse needs of applications. While these novelties in processor architectures have led to increased on-chip parallelism, they come at the cost of rendering the traditional designs employed by communication middlewares subject to higher intra-node communication costs. Tackling the computation and communication challenges that accompany these dense multi-/many-cores requires special design considerations. Scientific and AI applications that rely on such large-scale HPC systems to achieve higher performance and scalability often use the Message Passing Interface (MPI), Partitioned Global Address Space (PGAS), or a hybrid of both as the underlying communication substrate. These applications use various communication primitives (e.g., point-to-point, collectives, RMA) and often use custom data layouts (e.g., derived datatypes), spending a fair bit of time in communication an (open full item for complete abstract)

    Committee: Dhabaleswar K. (DK) Panda (Advisor); Radu Teodorescu (Committee Member); Feng Qin (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Science
  • 12. Sankarapandian Dayala Ganesh R, Kamal Raj Profiling MPI Primitives in Real-time Using OSU INAM

    Master of Science, The Ohio State University, 2020, Computer Science and Engineering

    In an MPI library, there are numerous routines available under Point-to-Point (MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv) and Collectives (MPI_Allreduce, MPI_Bcast, MPI_Alltoall, MPI_Allgather, MPI_Allgatherv, and many more), each with a multitude of implementations/algorithms. When an MPI developer is looking to optimize an application, it can easily become overwhelming to look at profiling and tracing information from multiple processes to understand communication patterns. Historically, it has been extremely hard to gain insights by profiling MPI runtimes at a deeper level. For instance, no existing IB fabric monitoring tool can, at runtime, elicit MPI-level behavior to classify traffic as belonging to or being generated by a specific algorithm of a certain MPI primitive. The ability to look at the performance of an application at this finer level of granularity will empower MPI developers to gain deep insights into the operation of the MPI library. We take up the challenge of modifying MVAPICH2-X to gather and report information pertaining to its various operations with low overhead at runtime, and of modifying OSU INAM to efficiently store and visualize the information collected from MVAPICH2-X in an intuitive way.

    Committee: Dhabaleswar K. Panda (Advisor); Hari Subramoni (Committee Member); Feng Qin (Committee Member) Subjects: Computer Science
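
    The per-primitive counting and timing described above happens inside MVAPICH2-X; as a loose illustration of the idea at a much higher level, the sketch below wraps one mpi4py call to record call counts and total time. The wrapper and its bookkeeping are assumptions for illustration, not OSU INAM or MVAPICH2-X internals.

        # Wrap an MPI primitive to count calls and accumulate the time spent in it.
        from collections import defaultdict
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        stats = defaultdict(lambda: [0, 0.0])            # name -> [calls, total seconds]

        def profiled(name, fn):
            def wrapper(*args, **kwargs):
                start = MPI.Wtime()
                result = fn(*args, **kwargs)
                stats[name][0] += 1
                stats[name][1] += MPI.Wtime() - start
                return result
            return wrapper

        allreduce = profiled("MPI_Allreduce", comm.Allreduce)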
  • 13. Chakraborty, Sourav High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures

    Doctor of Philosophy, The Ohio State University, 2019, Computer Science and Engineering

    Modern high-performance computing (HPC) systems are enabling scientists to tackle grand challenge problems in diverse domains including cosmology and astrophysics, earthquake and weather analysis, molecular dynamics and physics modeling, biological computations, and computational fluid dynamics, among others. Along with the increasing demand for computing power, these applications are creating fundamental new challenges in terms of communication complexity, scalability, and reliability. At the same time, remote and virtualized clouds are rapidly gaining popularity compared to on-premise clusters due to lower initial cost and greater flexibility. These requirements are driving the evolution of modern HPC processors, interconnects, storage systems, as well as middleware and runtimes. However, a large number of scientific applications have irregular and/or dynamic computation and communication patterns that require different approaches to extract the best performance. The increasing scale of HPC systems, coupled with the diversity of emerging architectures, including the advent of multi-/many-core processors and Remote Direct Memory Access (RDMA) aware networks, has exacerbated this problem by making a "one-size-fits-all" policy non-viable. Thus, a fundamental shift is required in how HPC middleware interacts with the application and reacts to its computation and communication requirements. Furthermore, current-generation middleware consists of many independent components like the communication runtime, resource manager, and job launcher. However, the lack of cooperation among these components often limits the performance and scalability of the end application. To address these challenges, we propose a high-performance and scalable "Cooperative Communication Middleware" for HPC systems. The middleware supports MPI (Message Passing Interface), PGAS (Partitioned Global Address Space), and hybrid MPI+PGAS programming models and provides improved point-to-p (open full item for complete abstract)

    Committee: Dhabaleswar K Panda (Advisor); Gagan Agrawal (Committee Member); Ponnuswamy Sadayappan (Committee Member); Hari Subramoni (Committee Member) Subjects: Computer Engineering; Computer Science
  • 14. Biswas, Rajarshi Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems

    Master of Science, The Ohio State University, 2018, Computer Science and Engineering

    Google's TensorFlow is one of the most popular Deep Learning (DL) frameworks available in the community. gRPC, a Remote Procedure Call (RPC) framework also developed by Google, is the main communication engine for distributed TensorFlow. TensorFlow primarily uses gRPC for exchanging tensors and communicating administrative tasks among different processes across the nodes. Tensor updates during the training phase are communication intensive, and thus TensorFlow's performance is heavily dependent on the underlying network and the efficacy of the communication engine. Apart from the default gRPC channel, TensorFlow supports various high-performance channels to efficiently transfer tensors, such as gRPC+Verbs and gRPC+MPI. However, at present, the community lacks a thorough characterization of these available distributed TensorFlow communication channels. This is critical to understand because high-performance Deep Learning with TensorFlow on modern HPC systems needs an efficient communication runtime. In this work, we first conduct a meticulous analysis of the communication characteristics of distributed TensorFlow over all available channels. Based on these characteristics, we propose the TF-gRPC-Bench micro-benchmark suite that enables system researchers to quickly understand the impact of the underlying network and communication runtime on DL workloads. We propose three micro-benchmarks that take into account TensorFlow DL workload characteristics over gRPC. Furthermore, our characterization shows that none of the existing channels in TensorFlow can support adaptive and efficient communication for DL workloads with different message sizes. Moreover, the community needs to maintain these different channels, while users are also expected to tune them to get the desired performance. Therefore, this work proposes a unified approach to have a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Christopher Stewart (Committee Member); Xiaoyi Lu (Committee Member) Subjects: Computer Engineering; Computer Science
  • 15. Li, Mingzhe Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters

    Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering

    Multi-/many-core architectures and networking technologies like InfiniBand (IB) are fueling the growth of next-generation ultra-scale systems that have high compute density. The communication requirements of scientific applications are steadily increasing. MPI two-sided programming models have been used by most scientific applications on High-Performance Computing (HPC) systems; however, there is an increased focus on MPI one-sided and Partitioned Global Address Space (PGAS) programming models such as OpenSHMEM. As modern computer hardware architectures keep evolving, it is critical that MPI and PGAS runtimes and scientific applications are designed with high scalability and performance for next-generation systems. This thesis focuses on designing high-performance Remote Memory Access (RMA) for MPI and PGAS models with modern networking technologies on heterogeneous clusters. High-performance interconnects have been the key drivers of high-performance computing systems. Many new networking technologies have been offered on interconnects to meet the increasing communication requirements of scientific applications. However, MPI and PGAS runtimes have not been designed with such technologies to further boost the performance of scientific applications on multi-petaflop/exascale HPC systems. We present designs at the MPI and PGAS runtime level that take advantage of hardware atomics, User-Mode Memory Registration (UMR), and On-Demand Paging (ODP) of InfiniBand to benefit scientific applications transparently. With our ODP-Aware MPI runtime, the pin-down buffer size of the LAMMPS application has been reduced by 11X. Similarly, we have shown up to 4X performance improvement in point-to-point latency for noncontiguous data movement for 4MB messages. Most scientific applications have been written with MPI two-sided programming models, which have been shown to be a good fit for regular and iterative applications. However, it can be very difficult to use MPI two-sid (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Gagan Agrawal (Committee Member); P. Sadayappan (Committee Member); Xiaoyi Lu (Committee Member) Subjects: Computer Science
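
    For readers unfamiliar with the one-sided Remote Memory Access model discussed above, a minimal mpi4py sketch is given below: rank 0 writes directly into rank 1's exposed window with Put, with no matching receive. The window size and fence-based synchronization are illustrative choices, not designs from the dissertation.

        # One-sided RMA: rank 0 puts data into rank 1's window without a recv on rank 1.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        local = np.zeros(4, dtype=np.float64)
        win = MPI.Win.Create(local, comm=comm)            # expose local memory to other ranks

        win.Fence()
        if comm.Get_rank() == 0:
            win.Put(np.arange(4, dtype=np.float64), target_rank=1)
        win.Fence()
        win.Free()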
  • 16. Rahman, Md Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

    Doctor of Philosophy, The Ohio State University, 2016, Computer Science and Engineering

    Big Data processing and High-Performance Computing (HPC) are two disruptive technologies that are converging to meet the challenges exposed by large-scale data analysis. MapReduce, a popular parallel programming model for data-intensive applications, is used extensively through different execution frameworks (e.g. batch processing, Directed Acyclic Graph or DAG) on modern HPC systems because of its ease of programming, fault tolerance, and scalability. However, as these applications begin scaling to terabytes of data, the socket-based communication model, which is the default implementation in the open-source MapReduce execution frameworks, becomes a performance bottleneck. Moreover, because of the synchronized nature of staging the data in various execution phases, the default Hadoop MapReduce framework cannot leverage the full potential of the underlying interconnect. MapReduce frameworks also rely heavily on the availability of local storage media, which introduces space inadequacy for applications that generate a large amount of intermediate data. On the other hand, most leadership-class HPC systems follow the traditional Beowulf architecture with a separate parallel storage system and either no, or very limited, local storage. The storage architectures in these HPC systems are not natively conducive to default MapReduce. Also, modern high-performance interconnects (e.g. InfiniBand) used to access the parallel storage in these systems can provide extremely low latency and high bandwidth. Additionally, advanced storage architectures, such as Non-Volatile Memories (NVM), can provide byte-addressability as well as data persistence. Efficient utilization of all these resources through enhanced designs of execution frameworks with a tuned parameter space is crucial for MapReduce in terms of performance and scalability. This work addresses several of the shortcomings that the current MapReduce execution frameworks hold. It presents an enhanced Big Data e (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member) Subjects: Computer Engineering; Computer Science
  • 17. AlMulhem, Norah Cryopreservation and Hypothermal Storage of Hematopoietic Stem Cells

    MS, University of Cincinnati, 2015, Allied Health Sciences: Transfusion and Transplantation Sciences

    The recent availability of commercially available storage media (CryoStor™ and HypoThermosol™) designed for optimal long-term and short-term hematopoietic stem cell (HSC) storage prompted an evaluation of hematopoietic stem cell and hematopoietic progenitor cell (HSC/P) viability and functionality after storage in these media formulations, compared with the conventional media used at Hoxworth Blood Center. Three human umbilical cord blood units (CBUs) were cryopreserved in CryoStor5 (CS5), CryoStor10 (CS10), and a conventional internally prepared cryopreservation medium, then analyzed post-thaw for viability and recovery of several mature and immature hematopoietic cell types, as well as for clonogenic capacity and proliferation potential. Flow cytometric analysis indicated similar post-thaw viability of most cell subsets cryopreserved in CS5 and CS10 compared to the conventional cryopreservation medium (containing 5 % dimethylsulfoxide (DMSO) and 2.5 % hydroxyethyl starch). This variation in viability was not statistically significant (p-value 0.2-1). Bromodeoxyuridine (BrdU) uptake was used to measure the ability of the frozen/thawed cells to proliferate in culture for 48 h in response to stem cell factor (SCF), Flt-3 ligand (Flt-3), and thrombopoietin (TPO). Proliferation potential and clonogenic capacity were both slightly better after freezing in CS10; however, the differences were not statistically significant. This study shows that the conventional cryopreservation medium used in our laboratory is similarly effective, compared with CS5 or CS10 media, in protecting cryopreserved CBU-derived HSC/P products. The same analytical methods were used to compare HypoThermosol® (HTR-FRS®), which is designed for short-term refrigerated storage of hematopoietic cells, to a locally prepared medium containing Plasma-Lyte A and 0.5 % human serum albumin (HSA). Measurements were performed after 24, 48 and 72 h of cold storage (4°C). Results showed similar (open full item for complete abstract)

    Committee: Thomas Leemhuis Ph.D. (Committee Chair); Jose Cancelas-Perez M.D. (Committee Member); Patricia Morgan Carey M.D. (Committee Member); Carolyn Lutzko Ph.D. (Committee Member) Subjects: Health Sciences
  • 18. Jamaliannasrabadi, Saba High Performance Computing as a Service in the Cloud Using Software-Defined Networking

    Master of Science (MS), Bowling Green State University, 2015, Computer Science

    Benefits of Cloud Computing (CC) such as scalability, reliability, and resource pooling have attracted scientists to deploy their High Performance Computing (HPC) applications on the Cloud. Nevertheless, HPC applications can face serious challenges on the cloud that could undermine the gained benefits if care is not taken. This thesis aims to address the shortcomings of the Cloud for HPC applications through a platform called HPC as a Service (HPCaaS). Further, a novel scheme is introduced to improve the performance of HPC task scheduling on the Cloud using the emerging technology of Software-Defined Networking (SDN). The research introduces “ASETS: A SDN-Empowered Task Scheduling System” as an elastic platform for scheduling HPC tasks on the cloud. In addition, a novel algorithm called SETSA is developed as part of the ASETS architecture to manage the scheduling task of the HPCaaS platform. The platform monitors network bandwidths to take advantage of changes when submitting tasks to the virtual machines. The experiments and benchmarking of HPC applications on the Cloud identified virtualization overhead, cloud networking, and cloud multi-tenancy as the primary shortcomings of the cloud for HPC applications. A private Cloud Test Bed (CTB) was set up to evaluate the capabilities of ASETS and SETSA in addressing such problems. Subsequently, the Amazon AWS public cloud was used to assess the scalability of the proposed systems. The obtained results of ASETS and SETSA on both private and public clouds indicate that significant performance improvement of HPC applications can be achieved. Furthermore, the results suggest that the proposed system is beneficial both to cloud service providers and to users, since ASETS performs better as the degree of multi-tenancy increases. The thesis also proposes SETSAW (SETSA Window) as an improved version of the SETSA algorithm. Unlike other proposed solutions for HPCaaS, which have either optimized the cloud to make it more HPC-fr (open full item for complete abstract)

    Committee: Hassan Rajaei Ph.D (Advisor); Robert Green Ph.D (Committee Member); Jong Kwan Lee Ph.D (Committee Member) Subjects: Computer Engineering; Computer Science; Technology
  • 19. Raja Chandrasekar, Raghunath Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    In high-performance computing (HPC), tightly coupled, parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experiencing faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these snapshots that were saved periodically. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with a single checkpoint taking on the order of tens of minutes to hours to write. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources. On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from - on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed RAM, flash storage (SSDs), HDDs, parallel file systems, and archival sto (open full item for complete abstract)

    Committee: Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Kathryn Mohror (Committee Member) Subjects: Computer Engineering; Computer Science
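
    As plain background for the Checkpoint-Restart scheme discussed above, the minimal mpi4py sketch below has every rank periodically write its state and resume from the last snapshot if one exists; the file naming and checkpoint interval are illustrative assumptions, not the dissertation's I/O middleware.

        # Per-rank checkpoint/restart: save state periodically, resume from the last snapshot.
        import os
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        ckpt = f"ckpt_rank{comm.Get_rank()}.npy"           # hypothetical per-rank checkpoint file

        state = np.load(ckpt) if os.path.exists(ckpt) else np.zeros(1000)
        start = int(state[0])                              # element 0 records the last step

        for step in range(start, 100):
            state[0] = step                                # stand-in for real computation
            if step % 10 == 0:
                comm.Barrier()                             # ranks agree on the checkpointed state
                np.save(ckpt, state)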
  • 20. Potluri, Sreeram Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects

    Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering

    Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that have high compute density and high performance per watt. However, these many-core architectures cause systems to be heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers also use a hierarchy of programming models to extract maximum performance from these heterogeneous systems. Models such as CUDA, OpenCL, LEO, and others are used to express parallelism across accelerator or coprocessor cores, while higher level programming models such as MPI or OpenSHMEM are used to express parallelism across a cluster. The presence of multiple programming models, their runtimes and the varying communication performance at different levels of the system hierarchy has hindered applications from achieving peak performance on these systems. Modern interconnects such as InfiniBand, enable asynchronous communication progress through RDMA, freeing up the cores to do useful computation. MPI and PGAS models offer one-sided communication primitives that extract maximum performance, minimize process synchronization overheads and enable better computation and communication overlap using the high performance networks. However, there is limited literature available to guide scientists in taking advantage of these one-sided communication semantics on high-end applications, more so on heterogeneous clusters. In our work, we present an enhanced model, MVAPICH2-GPU, to use MPI for data movement from both CPU and GPU memories, in a unified manner. We also extend the OpenSHMEM PGAS model to support such unified communication. These models considerably simplify data movement in MPI and OpenSHMEM applications running on GPU clusters. We propose designs in MPI and OpenSHMEM runtimes to optimize data movement on GPU clusters, using state-of-the-art GPU technologies (open full item for complete abstract)

    Committee: Dhabaleswar K. Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Karen Tomko (Committee Member) Subjects: Computer Science
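
    To make the unified CPU/GPU data movement idea above concrete, the sketch below passes a GPU-resident (CuPy) buffer and a host (NumPy) buffer to the same MPI call; this assumes a CUDA-aware MPI build underneath mpi4py and is an illustration of the programming model, not MVAPICH2-GPU's internals.

        # The same Send/Recv call moves either a device (CuPy) or a host (NumPy) buffer.
        import numpy as np
        import cupy as cp
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        n = 1 << 20
        if comm.Get_rank() == 0:
            gpu_buf = cp.arange(n, dtype=cp.float32)       # data resides in GPU memory
            comm.Send(gpu_buf, dest=1, tag=7)              # passed directly; the runtime handles staging
        elif comm.Get_rank() == 1:
            host_buf = np.empty(n, dtype=np.float32)       # destination in host memory
            comm.Recv(host_buf, source=0, tag=7)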