Search Results (1 - 24 of 24 Results)

Potluri, Sreeram. Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that have high compute density and high performance per watt. However, these many-core architectures cause systems to be heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers also use a hierarchy of programming models to extract maximum performance from these heterogeneous systems. Models such as CUDA, OpenCL, and LEO are used to express parallelism across accelerator or coprocessor cores, while higher level programming models such as MPI or OpenSHMEM are used to express parallelism across a cluster. The presence of multiple programming models, their runtimes, and the varying communication performance at different levels of the system hierarchy has hindered applications from achieving peak performance on these systems. Modern interconnects, such as InfiniBand, enable asynchronous communication progress through RDMA, freeing up the cores to do useful computation. MPI and PGAS models offer one-sided communication primitives that extract maximum performance, minimize process synchronization overheads, and enable better computation and communication overlap using the high performance networks. However, there is limited literature available to guide scientists in taking advantage of these one-sided communication semantics in high-end applications, even more so on heterogeneous clusters. In our work, we present an enhanced model, MVAPICH2-GPU, to use MPI for data movement from both CPU and GPU memories in a unified manner. We also extend the OpenSHMEM PGAS model to support such unified communication. These models considerably simplify data movement in MPI and OpenSHMEM applications running on GPU clusters. We propose designs in MPI and OpenSHMEM runtimes to optimize data movement on GPU clusters, using state-of-the-art GPU technologies such as CUDA IPC and GPUDirect RDMA. Further, we introduce PRISM, a proxy-based multi-channel framework that enables an optimized MPI library for communication on clusters with Intel Xeon Phi co-processors. We evaluate our designs using micro-benchmarks, application kernels and end-applications. We present the re-design of a petascale seismic modeling code to demonstrate the use of one-sided semantics in end-applications and their impact on performance. We finally demonstrate the benefits of using one-sided semantics on heterogeneous clusters.
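
As a rough illustration of the unified data movement described above, the sketch below assumes a CUDA-aware MPI library (for example, an MVAPICH2-GPU-style build) that accepts GPU device pointers directly in MPI calls; the buffer size, ranks, and tag are arbitrary choices for the example.

    /* Minimal sketch of GPU-to-GPU MPI communication with a CUDA-aware MPI
     * library.  Requires at least two ranks; sizes are placeholders. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;                           /* device memory, not host memory */
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        if (rank == 0)
            /* A CUDA-aware MPI accepts the device pointer directly; no explicit
             * cudaMemcpy to a host staging buffer is required. */
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }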

Committee:

Dhabaleswar K. Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Karen Tomko (Committee Member)

Subjects:

Computer Science

Keywords:

Heterogeneous Clusters; GPU; MIC; Many-core Architectures; MPI; PGAS; One-sided; Communication Runtimes; InfiniBand; RDMA; Overlap; HPC Applications

Gideon, John. The Integration of LlamaOS for Fine-Grained Parallel Simulation
MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Engineering
LlamaOS is a custom operating system that provides much of the basic functionality needed for low latency applications. It is designed to run in a Xen-based virtual machine on a Beowulf cluster of multi/many-core processors. The software architecture of llamaOS is decomposed into two main components, namely: the llamaNET driver and llamaApps. The llamaNET driver contains Ethernet drivers and manages all node-to-node communications between user application programs that are contained within a llamaApp instance. Typically, each node of the Beowulf cluster will run one instance of the llamaNET driver with one or more llamaApps bound to parallel applications. These capabilities provide a solid foundation for the deployment of MPI applications, as evidenced by our initial benchmarks and case studies. However, a message passing standard still needed to be either ported or implemented in llamaOS. To minimize latency, llamaMPI was developed as a new implementation of the Message Passing Interface (MPI), which is compliant with the core MPI functionality. This provides a standardized and easy way to develop for this new system. Performance assessment of llamaMPI was achieved using both standard parallel computing benchmarks and a locally (but independently) developed program that executes parallel discrete event-driven simulations. In particular, the NAS Parallel Benchmarks are used to show the performance characteristics of llamaMPI. In the experiments, most of the NAS Parallel Benchmarks ran faster than, or equal to, their native performance. The benefit of llamaMPI was also shown with the fine-grained parallel application WARPED. The order of magnitude lower communication latency in llamaMPI greatly reduced the amount of time that the simulation spent in rollbacks. This resulted in an overall faster and more efficient computation, because less time was spent off the critical path due to causality errors.
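
The kind of point-to-point latency measurement used to assess an MPI implementation such as llamaMPI can be illustrated with a simple ping-pong microbenchmark; this is a generic sketch, not code from the thesis, and the message size and iteration count are arbitrary.

    /* Generic MPI ping-pong latency microbenchmark (run with two ranks). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char msg[8];
        const int iters = 10000;
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* half the round-trip time is the one-way latency */
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));
        MPI_Finalize();
        return 0;
    }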

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

Parallel Computing; Time Warp Simulation; MPI; Operating Systems; Beowulf Cluster; Parallel Discrete Event Simulation

Singh, Ashish Kumar. Optimizing All-to-All and Allgather Communications on GPGPU Clusters
Master of Science, The Ohio State University, 2012, Computer Science and Engineering

High Performance Computing (HPC) is rapidly becoming an integral part of science, engineering, and business. Scientists and engineers are leveraging HPC solutions to run applications that require high bandwidth, low latency, and very high compute capabilities. General Purpose Graphics Processing Units (GPGPUs) are becoming more popular within the HPC community because of their highly parallel structure, which makes it possible for applications to achieve multi-fold performance gains. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly, many scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. In the traditional sense, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from GPU memory [33]. Using this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvements due to the tight integration of GPU data movement and MPI internal protocols.

Collective communication is commonly used in HPC applications. These applications spend a significant portion of their time doing such collective communications. Therefore, optimizing the performance of collectives has a significant impact on the applications’ performance. The all-to-all and allgather communication operations in message-passing systems are heavily used collectives that have O(N^2) communication for N processes. In this thesis, we outline the major design alternatives for the two collective communication operations on GPGPU clusters. We propose efficient and scalable designs and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPGPU clusters can be improved by 59% over a naive-approach-based implementation and 44% over a Send-Recv-based implementation for 32 KByte messages on 32 processes. Our proposed design, Fine Grained Pipeline, can improve the performance of MPI_Allgather on GPGPU clusters by 46% over the naive design and 81% over the Send-Recv-based design for a message size of 16 KBytes on 64 processes. The proposed designs have been incorporated into the open source MPI stack, MVAPICH2.
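
For context, the sketch below shows the kind of naive staging approach that GPU-aware collective designs are typically compared against: copy the whole device buffer to host memory, run MPI_Alltoall on the host copy, then copy the result back. The proposed designs in the thesis pipeline and overlap these steps instead; buffer names and sizes here are illustrative.

    /* Naive staged all-to-all from GPU memory: whole-buffer copies bracket
     * a host-side MPI_Alltoall.  This is the baseline, not the thesis design. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void alltoall_from_gpu(double *d_send, double *d_recv, int count_per_proc,
                           MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);
        size_t bytes = (size_t)count_per_proc * nprocs * sizeof(double);

        double *h_send = malloc(bytes);      /* host staging buffers */
        double *h_recv = malloc(bytes);

        cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
        MPI_Alltoall(h_send, count_per_proc, MPI_DOUBLE,
                     h_recv, count_per_proc, MPI_DOUBLE, comm);
        cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);

        free(h_send);
        free(h_recv);
    }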

Committee:

Dhabaleswar K. Panda, Professor (Advisor); P. Sadayappan, Professor (Committee Member)

Subjects:

Computer Science

Keywords:

GPGPU; HPC; InfiniBand; Collective Communications; MPI

Varia, Siddharth. REGULARIZED MARKOV CLUSTERING IN MPI AND MAP REDUCE
Master of Science, The Ohio State University, 2013, Computer Science and Engineering
The major objective of this thesis is to exploit parallelism in the clustering of graphs based on the simulation of stochastic flows and to propose a scalable algorithm for this task. Over the past decade there has been a surge in the use of MapReduce, MPI, and CUDA for large scale graph mining. In this thesis, MapReduce and MPI are used to implement a parallel version of Regularized Markov Clustering (RMCL), a variant of Markov clustering.
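
As background, the inflation step shared by Markov Clustering and its regularized variant raises every entry of a column-stochastic matrix to a power r and re-normalizes each column. A dense, sequential sketch of just this step is shown below; the thesis parallelizes the full flow-simulation loop with MPI and MapReduce, which is not reproduced here.

    /* Inflation step of (R)MCL on a dense column-stochastic n-by-n matrix M,
     * stored row-major.  The exponent r is the inflation parameter. */
    #include <math.h>

    void inflate(double *M, int n, double r)
    {
        for (int j = 0; j < n; j++) {          /* one column at a time */
            double colsum = 0.0;
            for (int i = 0; i < n; i++) {
                M[i * n + j] = pow(M[i * n + j], r);
                colsum += M[i * n + j];
            }
            for (int i = 0; i < n; i++)
                M[i * n + j] /= colsum;        /* keep the column stochastic */
        }
    }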

Committee:

Srinivasan Parthasarathy, Prof. (Advisor); P Sadayappan, Prof. (Committee Member)

Subjects:

Computer Science

Keywords:

MPI, MapReduce, Graphs

Vishnu, Abhinav. High performance and network fault tolerant MPI with multi-pathing over InfiniBand
Doctor of Philosophy, The Ohio State University, 2007, Computer and Information Science
In the last decade or so, the high performance computing community has been observing a paradigm shift in the interconnection methodology for processing elements. Combining commercial off-the-shelf components to build supercomputers has provided users with an excellent price-to-performance ratio. At the same time, scientific applications ranging from molecular dynamics to ocean modeling are being designed with the Message Passing Interface (MPI) as the de facto programming model. The insatiable computational requirements of these scientific applications have been continuously pushing the scale of these clusters. The increasing scale of these clusters has aggravated the occurrence of hot-spots in the network and reduced the mean time between failures of different network components. In order to provide the best performance to scientific applications, it is imperative that MPI libraries are capable of avoiding network hot-spots and of remaining resilient to faults in the network. At the same time, InfiniBand has emerged as a popular interconnect, providing a plethora of modern features with an open standard and high performance. In this dissertation, we focus on designing a communications and network fault tolerance layer for InfiniBand, which leverages the presence of multiple paths in the network for hot-spot avoidance and network fault tolerance. Much of this work has been integrated into an open source effort, MVAPICH, which is a popular implementation of MPI over InfiniBand and is used by a large number of supercomputers in the world.

Committee:

Dhabaleswar Panda (Advisor)

Subjects:

Computer Science

Keywords:

InfiniBand; MPI; Network Fault Tolerance; Hot-Spot

Raveendran, Aarthi. A Framework For Elastic Execution of Existing MPI Programs
Master of Science, The Ohio State University, 2011, Computer Science and Engineering
There is a clear trend towards using cloud resources in the scientific and HPC communities, with a key attraction of the cloud being the elasticity it offers. In executing HPC applications in a cloud environment, it is clearly desirable to exploit this elasticity and increase or decrease the number of instances an application is executed on during its execution, to meet time and/or cost constraints. Unfortunately, HPC applications have almost always been designed to use a fixed number of resources. This work focuses on the goal of making existing MPI applications elastic for a cloud framework. Considering the limitations of the MPI implementations currently available, we support adaptation by terminating one execution and restarting a new program on a different number of instances. The components of the system include a decision layer which considers time and cost constraints, a framework for modifying MPI programs, and cloud-based runtime support that enables redistribution of saved data and supports automated resource allocation and application restart on a different number of nodes. Using two MPI applications, the feasibility of our approach is demonstrated; it is shown that outputting, redistributing, and reading back data can be a reasonable approach for making existing MPI applications elastic. The decision layer, with a feedback model, is designed to monitor the application by interacting with it at regular intervals and to perform scaling with the assistance of the resource allocator when necessary. This approach is tested using the same two applications and is used to meet the user's demands of a maximum specified time or budget.
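
A minimal sketch of the terminate-and-restart style of elasticity described above is given below: each rank writes its block of a global array before termination, and on restart, possibly with a different number of processes, recomputes its block boundaries and reads that range back. The file name, data layout, and use of MPI-IO are assumptions made for illustration.

    /* Save/restore a block-distributed global array so the job can restart
     * on a different process count.  Illustrative only. */
    #include <mpi.h>

    void save_state(const double *local, long global_n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        long lo = rank * global_n / size, hi = (rank + 1) * global_n / size;

        MPI_File fh;
        MPI_File_open(comm, "state.chk", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, lo * sizeof(double), local, (int)(hi - lo),
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    void load_state(double *local, long global_n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);          /* may differ from the old run */
        long lo = rank * global_n / size, hi = (rank + 1) * global_n / size;

        MPI_File fh;
        MPI_File_open(comm, "state.chk", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, lo * sizeof(double), local, (int)(hi - lo),
                         MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }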

Committee:

Gagan Agrawal, Dr. (Advisor); Christopher Stewart (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; elastic; HPC; cloud; Amazon EC2

Raja Chandrasekar, Raghunath. Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
In high-performance computing (HPC), tightly-coupled, parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experience faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally-shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these snapshots that were saved periodically. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with the time taken to write a single checkpoint on the order of tens of minutes to hours. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources. On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from: on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed RAM, flash storage (SSDs), HDDs, parallel file systems, and archival storage. Current-generation checkpointing middleware and frameworks are oblivious to this hierarchy in storage, where each medium has unique performance and data-persistence characteristics. This thesis proposes a cross-layer framework that leverages this hierarchy in storage media to design scalable and low-overhead fault-tolerance mechanisms that are inherently I/O bound. The key components of the framework include: CRUISE, a highly-scalable in-memory checkpointing system that leverages both volatile and Non-Volatile Memory technologies; Stage-FS, a light-weight data-staging system that leverages burst-buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file system agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; Power-Check, an energy-efficient checkpointing framework for transparent and application-aware HPC checkpointing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that pro-actively monitors for failures. The components of this framework have been evaluated up to a scale of three million compute processes, have reduced the checkpointing overhead on scientific applications by a factor of 30, and have reduced the amount of energy consumed by checkpointing systems by up to 48%.
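
The multi-level idea behind in-memory checkpointing and data staging can be sketched as follows: each process writes its checkpoint to fast node-local storage and hands the file to a background staging layer that drains it to the parallel file system. The path, naming scheme, and stage_to_pfs() stub below are hypothetical placeholders, not interfaces from the dissertation.

    /* Write a per-rank checkpoint to node-local storage, then ask a
     * (placeholder) staging layer to move it off-node asynchronously. */
    #include <mpi.h>
    #include <stdio.h>

    static void stage_to_pfs(const char *local_path)
    {
        /* Placeholder: a real staging layer would drain the file to the
         * parallel file system in the background. */
        printf("would stage %s to the parallel file system\n", local_path);
    }

    void write_checkpoint(const void *state, size_t bytes, int step)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char path[256];
        /* node-local (RAM disk / SSD) target, not the shared file system */
        snprintf(path, sizeof path, "/tmp/ckpt.%d.%d", step, rank);

        FILE *f = fopen(path, "wb");
        fwrite(state, 1, bytes, f);
        fclose(f);

        stage_to_pfs(path);
    }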

Committee:

Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Kathryn Mohror (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

fault-tolerance; resilience; checkpointing; process-migration; Input-Output; HPC; supercomputing; MPI; MVAPICH; accelerators; energy-efficiency;

Kouril, Michal. A Backtracking Framework for Beowulf Clusters with an Extension to Multi-cluster Computation and SAT Benchmark Problem Implementation
PhD, University of Cincinnati, 2006, Engineering: Computer Science and Engineering

The main topic of this dissertation involves cluster-based computing, specifically relating to computations performed on Beowulf clusters. I have developed a light-weight library for dynamic interoperable message passing, called the InterCluster Interface (ICI). This library not only supports computations performed over multiple clusters that are running different Message Passing Interface (MPI) implementations, but also can be used independently of MPI. In addition I developed the Backtracking Framework (BkFr) that simplifies implementations of the parallel backtracking paradigm in the single cluster environment, and supports the extension of computations over multiple clusters. BkFr uses MPI for the intra-cluster communication and ICI for the inter-cluster communication. I have also developed a template-based library of programming modules that facilitate the introduction of the rapidly emerging message passing parallel computing paradigm in upper-division undergraduate courses.

An important application of ICI and the backtracking framework discussed in this dissertation is the computation of Van der Waerden numbers, which involve the existence of arithmetic progressions in arbitrary partitions of {1, 2, …, n} for sufficiently large n. The computations of these numbers utilized a special SAT solver that I developed. For example, I almost doubled the previously known lower bound for the Van der Waerden number W(2, 6), and I am running a calculation to determine whether the lower bound is actually the exact value. Among other reductions, I was able to drastically reduce the search for potential solutions with the aid of a tunneling technique based on an aggressive addition of uninferred constraints. The tunneling technique has the likely potential to be used in a number of other satisfiability settings. I also developed an FPGA version of my SAT solver for Van der Waerden numbers that improved the search speed by 230 times or more compared to its sequential equivalent. Utilizing this special SAT solver has resulted in significant progress in the search to determine bounds or exact values for Van der Waerden numbers and, more generally, for Van der Waerden subnumbers. I have found the exact values for several of these subnumbers, specifically W(2; 5,6)=206, W(2; 4,8)=146, W(2; 3,16)=238 and W(3; 2,4,7)=119.
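
A small helper of the kind such a search needs is shown below: it tests whether a given coloring of {1, ..., n} contains a monochromatic arithmetic progression of length k. This is an illustrative sketch, not the author's solver, which additionally relies on SAT encodings, tunneling constraints, and FPGA acceleration.

    /* Return true if color[1..n] contains a monochromatic k-term arithmetic
     * progression; a backtracking search would prune partial colorings with
     * a check of this form. */
    #include <stdbool.h>

    bool has_mono_ap(const int *color /* color[1..n] */, int n, int k)
    {
        for (int start = 1; start <= n; start++)
            for (int d = 1; start + (long)(k - 1) * d <= n; d++) {
                int c = color[start], len = 1;
                for (int i = 1; i < k; i++) {
                    if (color[start + i * d] != c) break;
                    len++;
                }
                if (len == k) return true;   /* found start, start+d, ..., same color */
            }
        return false;
    }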

Committee:

Dr. Jerome Paul (Advisor)

Subjects:

Computer Science

Keywords:

MPI; backtracking framework; message passing; dynamic interoperability; Van der Waerden numbers

Dosopoulos, Stylianos. Interior Penalty Discontinuous Galerkin Finite Element Method for the Time-Domain Maxwell's Equations
Doctor of Philosophy, The Ohio State University, 2012, Electrical and Computer Engineering
This dissertation investigates a discontinuous Galerkin (DG) methodology to solve Maxwell's equations in the time-domain. More specifically, we focus on an Interior Penalty (IP) approach to derive a DG formulation. In general, discontinuous Galerkin methods decompose the computational domain into a number of disjoint polyhedra (elements). For each polyhedron, we define local basis functions and approximate the fields as a linear combination of these basis functions. To ensure equivalence to the original problem, the tangential continuity of the electric and magnetic fields needs to be enforced across polyhedra interfaces. This condition is applied in the weak sense through proper penalty terms in the variational formulation, also known as numerical fluxes. Due to this way of coupling adjacent polyhedra, DG methods offer great flexibility and a nice set of properties, such as explicit time-marching, support for non-conformal meshes, freedom in the choice of basis functions, and high efficiency in parallelization. Here, we first introduce an Interior Penalty (IP) approach to derive a DG formulation and a physical interpretation of such an approach. This physical interpretation provides physical insight into the IP method and links important concepts like the duality pairing principle to a physical meaning. Furthermore, we discuss the time discretization and stability condition aspects of our scheme. Moreover, to address the issue of very small time steps in multi-scale applications, we employ a local time-stepping (LTS) strategy which can greatly reduce the solution time. Secondly, we present an approach to incorporate a conformal Perfectly Matched Layer (PML) in our interior penalty discontinuous Galerkin time-domain (IPDGTD) framework. From a practical point of view, a conformal PML is easier to model compared to a Cartesian PML and can reduce the buffer space between the structure and the truncation boundary, thus potentially reducing the number of unknowns. Next, we discuss our approach to combining EM and circuit simulation into a single framework. We show how we incorporate passive lumped elements such as resistors, capacitors, and inductors in the IPDGTD framework. Practically, such a capability is useful since EM applications often include lumped elements. Subsequently, we present our design of a scalable parallel implementation of IPDGTD in order to exploit the inherent DG parallelism and significantly speed up computations. Our parallelization is aimed at multi-core CPUs and/or graphics processing units (GPUs), for shared and/or distributed memory systems. In this way, all of the MPI/CPU, MPI/GPU, and MPI/OpenMP configurations can be used. Finally, we extend our IPDGTD to further include the case of non-conformal meshes. Since, in DG methods, the tangential continuity of the fields is enforced in a weak sense, DG methods naturally support non-conformal meshes. In cases of complicated geometries where a conformal mesh is nearly impossible to obtain, the ability to handle non-conformal meshes is important. The original geometry is divided into smaller pieces and each piece is meshed independently. Thus, meshing requirements can be greatly relaxed and a final mesh can be obtained for computation.
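
The interior penalty terms and numerical fluxes derived in the dissertation are not reproduced here; for reference only, the first-order time-domain Maxwell curl equations being discretized can be written in LaTeX notation as

    \mu\,\partial_t \mathbf{H} = -\nabla \times \mathbf{E}, \qquad
    \varepsilon\,\partial_t \mathbf{E} = \nabla \times \mathbf{H} - \mathbf{J},

with the tangential continuity of the electric and magnetic fields across element interfaces enforced weakly through the penalty terms, as described above.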

Committee:

Jin-Fa Lee (Advisor); Teixeira Fernando (Committee Member); Krishnamurthy Ashok (Committee Member)

Subjects:

Electrical Engineering; Electromagnetics

Keywords:

Discontinuous Galerkin; Time Domain; Non-conformal; MPI/GPU parallelization

Zhang, Wenbin. Libra: Detecting Unbalance MPI Collective Calls
Master of Science, The Ohio State University, 2011, Computer Science and Engineering

Collective calls in MPI applications allow all processes within the same communicator to collaborate with each other, while a missing call in any process may lead to unexpected behavior such as deadlock. This thesis presents a method to detect unbalanced MPI collective calls among all processes. The main idea is to track the possible execution paths in the control flow graph. If two paths on different processes have different calling histories for the MPI collective calls, the corresponding function is reported as incorrect.

We have built a tool called Libra on Linux to implement the detection. Libra has been evaluated using three real world applications, with a real world bug and three injected bugs. The experimental results show that Libra can correctly detect the bug cases and pinpoint the root causes without reporting any false positives.
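
A hypothetical instance of the bug class Libra targets is shown below: a collective that is reached on some control-flow paths but not others, so processes whose local condition differs never enter the call and the rest block forever. The function name and guard condition are illustrative.

    /* BUG EXAMPLE (do not imitate): the guard can evaluate differently on
     * different processes, so only a subset of the communicator reaches the
     * collective and the others block forever. */
    #include <mpi.h>

    void exchange_stats(double *stats, int n, int local_error)
    {
        if (!local_error)    /* local_error may differ across processes */
            MPI_Allreduce(MPI_IN_PLACE, stats, n, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
    }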

Committee:

Feng Qin, PhD (Advisor); Dong Xuan, PhD (Committee Member)

Subjects:

Computer Engineering

Keywords:

MPI; Collective call; Unbalance; Bug detection

Jose, Jithin. Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
The computation and communication requirements of modern High Performance Computing (HPC) and Big Data applications are steadily increasing. HPC scientific applications typically use the Message Passing Interface (MPI) as the programming model; however, there is an increased focus on hybrid MPI+PGAS (Partitioned Global Address Space) models for emerging exascale systems. Big Data applications rely on middleware such as Hadoop (including MapReduce, HDFS, HBase, etc.) and Memcached. It is critical that these middleware be designed with high scalability and performance for next generation systems. In order to ensure that HPC and Big Data applications can continue to scale and leverage the capabilities and performance of emerging technologies, a high performance communication runtime is much needed. This thesis focuses on designing a high performance and scalable Unified Communication Runtime (UCR) for HPC and Big Data middleware. In the HPC domain, MPI has been the prevailing communication middleware for more than two decades. Even though it has been successful for developing regular and iterative applications, it can be very difficult to use MPI and maintain performance for irregular, data-driven applications. The PGAS programming model presents an attractive alternative for designing such applications and provides higher productivity. It is widely believed that parts of applications can be redesigned using PGAS models, leading to hybrid MPI+PGAS applications and improved performance. In order to fully leverage the performance benefits offered by modern HPC systems, a unified communication runtime that offers the advantages of both MPI and PGAS programming models is critical. We present "MVAPICH2-X", a high performance and scalable Unified Communication Runtime that supports both MPI and PGAS programming models. This thesis also targets redesigning applications to make use of hybrid programming features for better performance. With our hybrid MPI+PGAS design using the Unified Communication Runtime, the execution time of Graph500 was reduced by 13X compared to the existing MPI based design at 16,384 processes. Similarly, the sort rate of the data intensive Out-of-Core Sort application was improved by a factor of two using our hybrid designs. The requirements for Big Data communication middleware are similar to those used in HPC. Both rely on low-level communication techniques. However, the performance of current generation Big Data computing platforms, such as HBase and Memcached, remains low. One of the fundamental bottlenecks in delivering high performance on these platforms is the use of the traditional BSD sockets interface and two-sided (send/recv) communication. Such communication semantics prevent the use of Remote Direct Memory Access (RDMA) and associated features of modern networking and I/O technologies. This thesis focuses on the design of a lightweight, scalable, and high performance communication runtime for Big Data middleware, such as HBase and Memcached. With our UCR design, Memcached operation latencies were reduced by 12X, and HBase throughput was improved by 3X.
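
The hybrid MPI+PGAS style that a unified runtime such as MVAPICH2-X is meant to support can be sketched as below: OpenSHMEM one-sided puts for irregular, data-driven updates and MPI collectives for the regular, bulk-synchronous parts. Whether the two models may be mixed in one executable, and whether OpenSHMEM PE i maps to MPI rank i, depends on the runtime; both are assumed here for illustration.

    /* Hybrid MPI + OpenSHMEM sketch: a one-sided update followed by an MPI
     * collective.  Assumes a runtime that supports both models together. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        shmem_init();                          /* one unified runtime underneath */

        int me = shmem_my_pe(), npes = shmem_n_pes();

        long *counter = shmem_malloc(sizeof(long));   /* symmetric variable */
        *counter = 0;
        shmem_barrier_all();

        /* One-sided, data-driven update: set the right neighbour's counter
         * without any matching receive on its side. */
        shmem_long_p(counter, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* Regular, bulk-synchronous part expressed in MPI (PE i assumed to
         * be MPI rank i). */
        long total = 0;
        MPI_Reduce(counter, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (me == 0) printf("sum of counters = %ld\n", total);

        shmem_free(counter);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }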

Committee:

Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Karen Tomko (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; PGAS; Unified Runtime; OpenSHMEM; Unified Parallel C; Memcached; HBase; InfiniBand; Clusters; RDMA; Runtime Design; Hybrid Programming

Dhanraj, Vijay. Enhancement of LIMIC-Based Collectives for Multi-core Clusters
Master of Science, The Ohio State University, 2012, Computer Science and Engineering
High Performance Computing (HPC) has made it possible for scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. To satisfy this ever increasing need for high compute capabilities, more and more clusters are deploying multi-core processors. A majority of these parallel applications are written in MPI and employ collective operations in their communication kernels. Optimization of these collective operations on multi-core platforms is one of the key factors in obtaining good performance speed-ups. However, the challenge for these applications lies in utilizing the Non-Uniform Memory Access (NUMA) architecture and shared cache hierarchies provided by modern multi-core processors. Also, care must be taken to reduce traffic to remote NUMA memory, avoid memory bottlenecks for rooted collectives, and reduce the multiple copies as in the case of shared memory. The existing optimizations for MPI collectives deploy a single-leader approach, using kernel assisted techniques (LiMIC) in a point-to-point manner to mitigate the above mentioned issues, but this still does not scale well as the number of cores per node increases. In this thesis, we present Direct LiMIC Primitives for MPI Collectives, such as MPI_Bcast and MPI_Gather, that use existing LiMIC APIs to perform the collective operation. We deploy these new techniques in single-leader and multi-leader approaches, which are created from a hierarchical framework based on the underlying node topology. This helps us to have stable performance even in the case of irregular process placement. Based on this hierarchical framework and the Direct LiMIC primitives, we propose multiple hybrid schemes which utilize the existing optimized MPI point-to-point schemes and kernel assisted techniques, and we evaluate the performance of three MPI collectives (Broadcast, Gather, and Allgather) on a popular open source MPI stack, MVAPICH2. Based on our experimental evaluation on the SDSC Trestles cluster, a representative multi-core cluster, using the OMB benchmarks, we observed a performance improvement of around 10-28% for MPI_Bcast within a node and around 10-35% for the MPI_Gather collective for system sizes ranging from 64 to 1,024 processes. With good improvements in MPI_Bcast and MPI_Gather, we also evaluated MPI_Allgather, which in turn uses these enhanced collectives. We observed a performance improvement of around 8-60% for small messages (1-64 bytes) for system sizes greater than or equal to 256 processes.
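
The hierarchical, leader-based idea behind such multi-leader designs can be sketched with standard MPI-3 calls (rather than the LiMIC kernel module used in the thesis): split the job into per-node communicators, broadcast across the node leaders, then broadcast within each node. The sketch assumes the data originates at rank 0 of the input communicator.

    /* Two-level broadcast: inter-node stage among node leaders, then an
     * intra-node stage.  Assumes the root data starts at rank 0 of comm. */
    #include <mpi.h>

    void hierarchical_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        MPI_Comm node, leaders;
        int rank, node_rank;
        MPI_Comm_rank(comm, &rank);

        /* processes that share a node (and hence shared memory) */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &node_rank);

        /* one leader per node; non-leaders get MPI_COMM_NULL */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

        if (leaders != MPI_COMM_NULL)             /* inter-node stage */
            MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Bcast(buf, count, type, 0, node);     /* intra-node stage */

        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
        MPI_Comm_free(&node);
    }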

Committee:

Dhabaleswar Kumar Panda, Dr (Advisor); Radu Teodorescu, Dr (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

LiMIC; MPI Collectives; Multi-Leader; Topology; Hierarchical Framework

Luo, Miao. Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors
Doctor of Philosophy, The Ohio State University, 2013, Computer Science and Engineering
High End Computing (HEC) has been growing dramatically over the past decades. The emerging multi-core systems, heterogeneous architectures, and interconnects introduce various challenges and opportunities to improve the performance of communication middlewares and applications. The increasing number of processor cores and co-processors results in not only heavy contention on communication resources, but also much more complicated levels of communication patterns. The Message Passing Interface (MPI) has been the dominant parallel programming model in the HPC application area for the past two decades. MPI has been very successful for implementing regular, iterative parallel algorithms with well defined communication patterns. In contrast, the Partitioned Global Address Space (PGAS) programming model provides a flexible way for these applications to express parallelism. Different variations and combinations of these programming models present new challenges in designing optimized programming model runtimes, in terms of efficient sharing of networking resources and efficient work-stealing techniques for computation load balancing across threads/processes, etc. Middlewares play a key role in delivering the benefits of new hardware techniques to support the new requirements from applications and programming models. This dissertation aims to study several critical contention problems of existing runtimes, which support popular parallel programming models (MPI and UPC) on emerging multi-core/many-core systems. We start with the shared memory contention problem within the existing MPI runtime. Then we explore the network throughput congestion issue at the node level, based on the Unified Parallel C (UPC) runtime. We propose and implement lock-free multi-threaded runtimes for MPI/OpenMP and UPC with multi-endpoint support, respectively. Based on the multi-endpoint design, we further explore how to enhance MPI/OpenMP applications with transparent support for collective operations and minimal modifications for point-to-point operations. Finally, we extend our multi-endpoint research to include GPU and MIC architectures for UPC and explore the performance features. Software developed as a part of this dissertation is available in MVAPICH2 and MVAPICH2-X. MVAPICH2 is a popular open-source implementation of MPI over InfiniBand and is used by hundreds of top computing sites all around the world. MVAPICH2-X supports both MPI and UPC hybrid programming models on InfiniBand clusters and is based on the MVAPICH2 stack.
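
The hybrid MPI/OpenMP pattern whose contention such multi-endpoint runtimes address can be sketched as follows: several threads per process issue MPI calls concurrently, which requires MPI_THREAD_MULTIPLE and, in a conventional single-endpoint runtime, serializes on internal locks. The ring exchange, message size, and tag scheme are illustrative choices.

    /* Multi-threaded MPI: each OpenMP thread does its own ring exchange,
     * using the thread id as the message tag to keep streams separate. */
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            double buf[1024] = {0};

            /* send to the right neighbour, receive from the left one */
            MPI_Sendrecv_replace(buf, 1024, MPI_DOUBLE,
                                 (rank + 1) % size, tid,
                                 (rank - 1 + size) % size, tid,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }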

Committee:

Dhabaleswar K. Panda (Advisor); P. Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

High Performance Computing; MPI; UPC; GPGPU; MIC; InfiniBand; Middleware

Zhang, Dean. Data Parallel Application Development and Performance with Azure
Master of Science, The Ohio State University, 2011, Computer Science and Engineering
The Microsoft Windows Azure technology platform provides on-demand, cloud-based computing, where the cloud is a set of interconnected computing resources located in one or more data centers. Recently, the Windows Azure platform has become available in data centers near most of the world's large cities. MPI is a message passing library standard developed with the participation of many organizations, including vendors, researchers, software library developers, and users. The goal of the Message Passing Interface is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing parallel programs. The advantages of developing message passing software using MPI, namely portability, efficiency, and flexibility, are well known. This thesis is about how to develop MPI-like applications on the Windows Azure platform and simulate parallel computing in the cloud. The specific goal is to simulate MPI_Reduce and MPI_Allreduce on Windows Azure, and to use this simulation to support and build data parallel applications on Windows Azure. We also compare the performance of three data parallel applications on three platforms: traditional clusters, Azure with queues, and Azure with WCF.
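
For reference, the two MPI operations the thesis emulates behave as follows: every process contributes a local value, and either the root alone (MPI_Reduce) or all processes (MPI_Allreduce) receive the combined result. The values below are illustrative.

    /* Reference behaviour of MPI_Reduce and MPI_Allreduce. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = rank + 1.0, root_sum = 0.0, global_sum = 0.0;

        MPI_Reduce(&local, &root_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)   /* root_sum is only valid at the root */
            printf("sum at root = %g, sum everywhere = %g\n", root_sum, global_sum);
        MPI_Finalize();
        return 0;
    }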

Committee:

Gagan Agrawal (Advisor); Feng Qin (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; Cloud computing

Gebre, Meseret Redae. MUSE: A parallel Agent-based Simulation Environment
Master of Science, Miami University, 2009, Computer Science and Systems Analysis
Realizing the advantages of simulation-based methodologies requires the use of a software environment that is conducive to modeling, simulation, and analysis. Furthermore, parallel simulation methods must be employed to reduce the time for simulation, particularly for large problems, to enable analysis in reasonable timeframes. Accordingly, this thesis covers the development of a general purpose, agent-based, parallel simulation environment called MUSE (Miami University Simulation Environment). MUSE provides an Application Program Interface (API) for agent-based modeling and a framework for parallel simulation. The API was developed in C++ using its object oriented features. The core parallel simulation capabilities of MUSE were realized using the Time Warp synchronization methodology and the Message Passing Interface (MPI). Experiments show MUSE to be a scalable and efficient simulation environment.

Committee:

Dhanajai Rao, PhD (Advisor); Mufit Ozden, PhD (Committee Member); Lukasz Opyrchal, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

Parallel Simulations; Time Warp; MPI; Agent-based Modeling; MUSE; Parallel Simulation Environment; C++ API

Liu, Jiuxing. Designing high performance and scalable MPI over InfiniBand
Doctor of Philosophy, The Ohio State University, 2004, Computer and Information Science

Rapid technological advances in recent years have made powerful yet inexpensive commodity PCs a reality. New interconnecting technologies that deliver very low latency and very high bandwidth are also becoming available. These developments lead to the trend of cluster computing, which combines the computational power of commodity PCs and the communication performance of high speed interconnects to provide cost-effective solutions for computationally intensive applications, especially for grand challenge applications such as weather forecasting, air flow analysis, protein searching, and ocean simulation.

InfiniBand was proposed recently as the next generation interconnect for I/O and inter-process communication. Due to its open standard and high performance, InfiniBand is becoming increasingly popular as an interconnect for building clusters. However, since it is not designed specifically for high performance computing, there exists a semantic gap between its functionalities and those required by high performance computing software such as Message Passing Interface (MPI). In this dissertation, we take on this challenge and address research issues in designing efficient and scalable communication subsystems to bridge this gap. We focus on how to take advantage of the novel features offered by InfiniBand to design different components in the communication subsystems such as protocol design, flow control, buffer management, communication progress, connection management, collective communication, and multirail network support.

Our research has already made notable contributions in the areas of cluster computing and InfiniBand. A large part of our research has been integrated into our MVAPICH software, which is a high performance and scalable MPI implementation over InfiniBand. Our software is currently used by more than 120 organizations world-wide to build InfiniBand clusters, including both research testbeds and production systems. Some of the fastest supercomputers in the world, including the 3rd ranked Virginia Tech Apple G5 cluster, are currently powered by MVAPICH. Research in this dissertation will also have an impact on designing communication subsystems for systems other than high performance computing and for other high speed interconnects.

Committee:

Dhabaleswar Panda (Advisor)

Subjects:

Computer Science

Keywords:

Cluster Computing; InfiniBand; MPI; High Performance Computing; High Speed Interconnect; Performance; Scalability

Koop, Matthew J. High-Performance Multi-Transport MPI Design for Ultra-Scale InfiniBand Clusters
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering

Over the past decade, rapid advances have taken place in the field of computer and network design, enabling us to connect thousands of computers together to form high performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There is a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling increasingly higher. The scalability and the performance of the MPI library are very important for the end application performance.

InfiniBand is a cluster interconnect which is based on open standards and is gaining rapid acceptance. This dissertation explores the different transports provided by InfiniBand to determine the scalability and performance aspects of each. Further, new MPI designs have been proposed and implemented for transports that have never been used for MPI in the past. These designs have significantly decreased the resource consumption, increased the performance, and increased the reliability of ultra-scale InfiniBand clusters. A framework to simultaneously use multiple transports of InfiniBand and dynamically change transfer protocols has been designed and evaluated. Evaluations show that memory usage can be reduced from over 1 GB per MPI process to 40 MB per MPI process. In addition, performance using this design has been improved by up to 30% over earlier designs. Investigations into providing reliability have shown that the MPI library can be designed to withstand many network faults, and that reliability designed in software can provide higher message rates than hardware-based reliability. Software developed as a part of this dissertation is available in MVAPICH, which is a popular open-source implementation of MPI over InfiniBand and is used by several hundred top computing sites all around the world.

Committee:

Dhabaleswar K. Panda, PhD (Advisor); P. Sadayappan, PhD (Committee Member); Feng Qin, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; InfiniBand; Message Passing; Cluster; Parallel Computing

Marsh, Gregory J. Evaluation of High Performance Financial Messaging on Modern Multi-core Systems
Master of Science, The Ohio State University, 2010, Computer Science and Engineering

Multi-cores are coming to have a significant impact on all forms of computing, and the financial sector is no exception. This sector relies heavily on message passing over networks for market data dissemination and transaction processing. However, its reliance on the traditional Ethernet standard has the potential to limit its ability to meet the ever increasing demand for more data at higher speeds. Furthermore, the message oriented middleware in use throughout much of the financial sector uses a centralized "broker" architecture in a hub-spoke configuration. Our previous studies with this architecture have shown the centralized "broker" to be a performance bottleneck.

This thesis demonstrates how the High Performance Computing (HPC) technology called MPI (Message Passing Interface) interacts with financial messaging. Features of our group's MVAPICH2, a middleware linking MPI with network and shared memory communication, are used to configure a simulated financial market across a multi-core cluster. This configuration avoids the centralized "broker" bottleneck while still delivering high performance. Our results show that replication of the market simulator, one instance per cluster node, outperforms a single instance of the simulator's order generation process servicing many instances of the simulator's trade engine over inter-node, networked communication. This high performance is obtained at the limit of one trade engine per node core. However, at the low order generation rates typical of many NASDAQ stocks, up to 12 instances of the simulator's trade engine may be multiplexed per CPU core, thereby further increasing the number of trades a cluster node can simulate.

Committee:

D.K. Panda, PhD (Advisor); P. Sadayappan, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

financial; multi-core; MPI; InfiniBand; messaging

Augustine, Albert Mathews. Designing a Scalable Network Analysis and Monitoring Tool with MPI Support
Master of Science, The Ohio State University, 2016, Computer Science and Engineering
State-of-the-art high-performance computing is powered by the tight integration of several hardware and software components. While on the hardware side we have multi-/many-core architectures (including accelerators and co-processors) and high end interconnects (like InfiniBand and Omni-Path), on the software front we have several high performance implementations of parallel programming models which help us to take advantage of the advanced features offered by the hardware components. This tight coupling between both these layers helps in delivering multi-petaflop level performance to the end application, allowing scientists and engineers to tackle the grand challenges in their respective areas. Understanding and gaining insights into the performance of the end application on these modern systems is a challenging task. Several tools have been developed to inspect network level or MPI level activities to address this challenge. However, these existing tools inspect the network and MPI layers in a disjoint manner and are not able to provide a holistic picture correlating the data generated at the network layer and the MPI layer. Thus, the user can miss out on critical information which could have helped in understanding the interaction between MPI applications and the network they are running on. In this thesis, we take up this challenge and design OSU INAM. OSU INAM allows users to analyze and visualize the communication happening in the network in conjunction with the data obtained from the MPI library. Our experimental analysis shows that the tool is able to profile and visualize the communication with very low performance overhead at scale.

Committee:

Dhabaleswar Panda, Dr. (Advisor); Radu Teodorescu, Dr. (Committee Member); Hari Subramoni, Dr. (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

OSU INAM; InfiniBand; MPI; MVAPICH2; Omni-Path; Network Monitoring; Exascale

Santhanaraman, Gopalakrishnan. Designing Scalable and High Performance One Sided Communication Middleware for Modern Interconnects
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering
High-end computing (HEC) systems are enabling scientists and engineers to tackle grand challenge problems in their respective domains and make significant contributions to their fields. As these systems continue to grow in scale, the performance that applications can achieve depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system, as well as the capability to overlap computation and communication. One-sided programming models are gaining importance and popularity as they provide many programming constructs that enable communication using one-sided communication operations. At the same time, modern interconnects like InfiniBand provide a rich set of network primitives such as Remote Direct Memory Access (RDMA), remote atomics, and scatter/gather capabilities. My work explores the design of a high performance and scalable one-sided communication subsystem by effectively harnessing the capabilities of modern interconnects like InfiniBand.
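
A minimal MPI-2 one-sided example of the kind of communication this work targets is shown below: a window is exposed once, and a process deposits data into a peer's memory with MPI_Put, synchronizing with fences instead of a matching receive. The payload and displacement are arbitrary.

    /* One-sided MPI_Put with fence synchronization (run with two ranks). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 3.14, remote = 0.0;
        MPI_Win win;
        MPI_Win_create(&remote, sizeof(double), sizeof(double), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)                 /* write into rank 1's window, one-sided */
            MPI_Put(&local, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);         /* after this fence, rank 1 sees 3.14 */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }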

Committee:

Dhabaleswar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Feng Qin (Committee Member); Pavan Balaji (Committee Member)

Subjects:

Computer Science

Keywords:

one-sided; Middleware; InfiniBand; MPI-2; Communication; Programming-models; overlap; networks;

Chai, Lei. High Performance and Scalable MPI Intra-node Communication Middleware for Multi-core Clusters
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering

The cluster of workstations is one of the most popular architectures in high performance computing, thanks to its cost-to-performance effectiveness. As multi-core technologies are becoming mainstream, more and more clusters are deploying multi-core processors as the building unit. In the latest Top500 supercomputer list published in November 2008, about 85% of the sites use multi-core processors from Intel and AMD. Message Passing Interface (MPI) is one of the most popular programming models for cluster computing. With increased deployment of multi-core systems in clusters, it is expected that considerable communication will take place within a node. This suggests that MPI intra-node communication is going to play a key role in the overall application performance.

This dissertation presents novel MPI intra-node communication designs, including a user level shared memory based approach, a kernel assisted direct copy approach, and an efficient multi-core aware hybrid approach. The user level shared memory based approach is portable across operating systems and platforms. The processes copy messages into and from a shared memory area for communication. The shared buffers are organized in a way that is efficient in cache utilization and memory usage. The kernel assisted direct copy approach takes help from the operating system kernel and directly copies messages from one process to another, so that it needs only one copy and improves performance over the shared memory based approach. In this approach, the memory copy can be either CPU based or DMA based. This dissertation explores both directions, and for DMA based memory copy, we take advantage of novel mechanisms such as I/OAT to achieve better performance and computation and communication overlap. To optimize performance on multi-core systems, we efficiently combine the shared memory approach and the kernel assisted direct copy approach and propose a topology-aware and skew-aware hybrid approach. The dissertation also presents comprehensive performance evaluation and analysis of the approaches on contemporary multi-core systems such as an Intel Clovertown cluster and an AMD Barcelona cluster, both of which are quad-core processor based systems.

Software developed as a part of this dissertation is available in MVAPICH and MVAPICH2, which are popular open-source implementations of MPI-1 and MPI-2 libraries over InfiniBand and other RDMA-enabled networks and are used by several hundred top computing sites all around the world.

Committee:

Dhabaleswar Panda (Advisor); P. Sadayappan (Committee Member); Feng Qin (Committee Member)

Subjects:

Computer Science

Keywords:

MPI; Cluster Computing; Multi-core Processors

Maddipati, Sai Ratna Kiran. Improving the Parallel Performance of Boltzman-Transport Equation for Heat Transfer
Master of Science, The Ohio State University, 2016, Computer Science and Engineering
In a thermodynamically unstable environment, the Boltzmann Transport Equation (BTE) defines the heat transfer rate at each location in the environment, the direction of heat transfer by the particles of the environment, and the final equilibrium temperature conditions of the environment. Solving the BTE is very computationally intensive, so there is a need for efficient parallelization. The parallel implementation of this application is explained, along with brief details of several techniques that have been used in past work. The implementation involves several code iterations of the BTE application, with distinct changes, in order to compare and analyze the performance and identify the reasons for performance improvement or deterioration. Experimental results are then presented to show the resulting performance of the implementation.

Committee:

P Sadayappan (Advisor); Nasko Rountev (Committee Member)

Subjects:

Computer Science

Keywords:

Parallel Computing, OpenMP, MPI

Yu, Weikuan. Enhancing MPI with modern networking mechanisms in cluster interconnects
Doctor of Philosophy, The Ohio State University, 2006, Computer and Information Science
Advances in CPU and networking technologies make it appealing to aggregate commodity compute nodes into ultra-scale clusters. But the performance achievable is highly dependent on how tightly the components are integrated together. The ever-increasing size of clusters and of the applications running over them leads to dramatic changes in the requirements. These include at least scalable resource management, fault-tolerant process control, scalable collective communication, as well as high performance and scalable parallel IO. Message Passing Interface (MPI) is the de facto standard for the development of parallel applications. There are many research efforts actively studying how to leverage the best performance of the underlying systems and present it to the end applications. In this dissertation, we exploit various modern networking mechanisms from contemporary interconnects and integrate them into MPI implementations to enhance their performance and scalability. In particular, we have leveraged the novel features available from InfiniBand, Quadrics, and Myrinet to provide scalable startup, adaptive connection management, scalable collective operations, as well as high performance parallel IO. We have also designed a parallel Checkpoint/Restart framework to provide transparent fault tolerance to parallel applications. Through this dissertation, we have demonstrated that modern networking mechanisms can be integrated into communication and IO subsystems to enhance the scalability, performance, and reliability of MPI implementations. Some of the research results have been incorporated into production MPI software releases such as MVAPICH/MVAPICH2 and LA-MPI. This dissertation has showcased and shed light on where and how to enhance the design of parallel communication subsystems to meet the current and upcoming requirements of large-scale clusters, as well as high end computing environments in general.

Committee:

Dhabaleswar Panda (Advisor)

Subjects:

Computer Science

Keywords:

InfiniBand; Myrinet; Quadrics; MPI; Parallel IO; RDMA

Gangadharappa, Tejus A. Designing Support For MPI-2 Programming Interfaces On Modern Interconnects
Master of Science, The Ohio State University, 2009, Computer Science and Engineering

Scientific computing has seen unprecedented growth in recent years. The growth of high performance interconnects and the emergence of multi-core processors have fueled this growth. Complementing the growing cluster sizes, researchers have developed varied parallel programming models to harness the power of larger clusters. Popular parallel programming models in use range from traditional message passing and shared memory models to newer partitioned global address space models. MPI, a de-facto programming model for distributed memory machines, was extended in MPI-2 to support two new programming paradigms: the MPI-2 dynamic process management interface and the MPI-2 remote memory access interface.

The MPI-2 dynamic process management interface provides MPI applications the flexibility to dynamically alter the scale of a job by allowing applications to spawn new processes, making way for a master/slave paradigm in MPI. The MPI-2 remote memory access interface gives applications the illusion of globally accessible memory. In this thesis, we study the two MPI-2 programming interfaces and propose optimized designs for them. We design a low overhead, connection-less transport based dynamic process interface and demonstrate the effectiveness of our design using benchmarks. We address the design of the remote memory interface on onloaded InfiniBand using a DMA copy offload. Our design of the remote memory interface provides for computation-copy overlap and minimal cache pollution. The proposed designs are implemented and evaluated on InfiniBand, a modern interconnect which provides a rich set of features. The designs developed as a part of this thesis are available in MVAPICH2, a popular open-source implementation of MPI over InfiniBand used by over 900 organizations.
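
The dynamic process management interface studied here can be illustrated with a short parent-side sketch: a running job spawns additional worker processes and talks to them over an intercommunicator. The worker executable name and worker count are hypothetical placeholders; the spawned workers would obtain the intercommunicator with MPI_Comm_get_parent and call the matching MPI_Bcast.

    /* Parent side of an MPI-2 dynamic process management example. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Comm workers;
        int errcodes[4];
        /* "./worker" is a hypothetical MPI program started at run time */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &workers, errcodes);

        /* Broadcast a parameter to the spawned workers over the
         * intercommunicator: rank 0 is the root, other parents pass
         * MPI_PROC_NULL. */
        int work_size = 1024;
        MPI_Bcast(&work_size, 1, MPI_INT,
                  rank == 0 ? MPI_ROOT : MPI_PROC_NULL, workers);

        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }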

Committee:

Dhabaleswar Panda, PhD (Advisor); Ponnuswamy Sadayappan, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

MPI-2; InfiniBand; Dynamic Process Management; Remote Memory Access; High Performance Computing