Search Results (1 - 25 of 37 Results)

Gutierrez Soto, Mariantonieta. MULTI-AGENT REPLICATOR CONTROL METHODOLOGIES FOR SUSTAINABLE VIBRATION CONTROL OF SMART BUILDING AND BRIDGE STRUCTURES
Doctor of Philosophy, The Ohio State University, 2017, Civil Engineering
The protection of large infrastructure is a critical and complex issue facing civil engineers. Earthquakes are especially unpredictable and pose a great threat to critical infrastructure that directly affects people’s lives. To tackle this problem, the latest innovation is the development of intelligent structures. Intelligent structures have technology installed to dampen the movement caused by forces of nature. The goal is to develop a new generation of smart structures equipped with sensors and control devices that can react in real time during an earthquake. Structural control methods have been the subject of significant research in the past 20 years but still face limitations. This investigation consists of four parts. In Part I, four ideas are introduced for vibration control of smart structures: decentralized control, agent-based modeling, replicator dynamics from evolutionary game theory, and energy minimization. Two new control algorithms are presented: 1) a single-agent Centralized Replicator Controller (CRC) and 2) a decentralized Multi-Agent Replicator Controller (MARC) for real-time vibration control of smart structures. The use of agents and a decentralized approach enhances the robustness of the entire vibration control system. The proposed control methodologies are applied to vibration control of a 3-story steel frame and a 20-story steel benchmark structure subjected to two sets of seismic loadings, historical earthquake records and artificial accelerograms, and compared with the corresponding centralized and decentralized conventional control algorithms. In Part II, the aforementioned control algorithms are integrated with a multi-objective optimization algorithm to find Pareto optimal values for replicator dynamics parameters, with the goal of achieving maximum structural performance with minimum energy consumption. The patented neural dynamic model of Adeli and Park is used to solve the multi-objective optimization problem. Vibration control of irregular structures subjected to earthquake excitations is a complex civil engineering problem with associated torsional vibrations. In Part III, the replicator dynamics concepts are adapted for active/semi-active control of multi-story irregular base-isolated structures. The control algorithm is evaluated using a 3D base-isolated benchmark structure subjected to major historical earthquakes. In Part IV, the idea of combining conventional base isolation with an active or semi-active control system to create a smart bridge structure is investigated. A control algorithm based on game theory and replicator dynamics is employed for hybrid vibration control of highway bridge structures equipped with both a passive isolation system and semi-active control devices subjected to earthquake loadings. The efficacy of the model is demonstrated by application to a benchmark example based on the I-5/91 interstate overcrossing highway bridge in southern California subjected to near-field historical earthquake excitations. Substantial reduction in both mid-span displacement and deck acceleration is achieved compared with the conventional base-isolated bridge.
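
The replicator-dynamics allocation rule at the heart of CRC/MARC can be illustrated with a minimal sketch (illustrative fitness signal and parameters, not the dissertation's actual controller): each control device is a "strategy" whose share of a fixed control resource grows when its fitness exceeds the population average.

```python
import numpy as np

def replicator_step(shares, fitness, dt=0.01):
    """One Euler step of the replicator equation:
    dx_i/dt = x_i * (f_i - sum_j x_j f_j)."""
    avg = shares @ fitness                 # population-average fitness
    return shares + dt * shares * (fitness - avg)

# Toy example: 3 dampers; fitness favors devices at stories with
# larger measured drift (hypothetical sensor readings).
shares = np.array([1/3, 1/3, 1/3])        # resource fractions, sum to 1
drift = np.array([0.8, 0.3, 0.1])         # story drifts from sensors
for _ in range(500):
    shares = replicator_step(shares, fitness=drift)
print(shares.round(3))  # most of the control resource flows to damper 1
```

Note that the update conserves the total share exactly, since the weighted deviations from the average fitness sum to zero.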

Committee:

Hojjat Adeli (Advisor); Ethan Kubatko (Committee Member); Kevin Passino (Committee Member); Daniel Pradel (Committee Member)

Subjects:

Civil Engineering

Keywords:

smart structures, vibration control, earthquake, resilience, decentralized, agent, multi-agent, replicator dynamics, game theory, bridges, base isolation, benchmark, MR damper, semi-active, structural dynamics, modeling, high performance computing

Subramoni, Hari. Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems
Doctor of Philosophy, The Ohio State University, 2013, Computer Science and Engineering
Most of the traditional High End Computing (HEC) applications and current petascale applications are written using the Message Passing Interface (MPI) programming model. Consequently, MPI communication primitives (both point-to-point and collective) are extensively used across various scientific and HEC applications. The large-scale HEC systems on which these applications run, by necessity, are designed with multiple layers of switches with different topologies like fat-trees (with different kinds of over-subscription), meshes, tori, etc. Hence, the performance of an MPI library, and in turn the applications, is heavily dependent upon how the MPI library has been designed and optimized to take the system architecture (processor, memory, network interface, and network topology) into account. In addition, parallel jobs are typically submitted to such systems through schedulers (such as PBS and SLURM). Currently, most schedulers do not have the intelligence to allocate compute nodes to MPI tasks based on the underlying topology of the system and the communication requirements of the applications. Thus, the performance and scalability of a parallel application can suffer (even using the best MPI library) if topology-aware scheduling is not employed. Moreover, the placement of logical MPI ranks on a supercomputing system can significantly affect overall application performance. A naive task assignment can result in poor locality of communication. Thus, it is important to design optimal mapping schemes with topology information to improve overall application performance and scalability. It is also critical for users of High Performance Computing (HPC) installations to clearly understand the impact InfiniBand (IB) network topology can have on the performance of HPC applications. However, no currently existing tool allows users of such large-scale clusters to analyze and visualize the communication pattern of their MPI-based HPC applications in a network-topology-aware manner. This work addresses several of these critical issues in the MVAPICH2 communication library. It exposes the network topology information to MPI applications, job schedulers, and users through an efficient and flexible topology management API called the Topology Information Interface. It proposes the design of network-topology-aware communication schemes for multiple collective (Scatter, Gather, Broadcast, Alltoall) operations. It proposes a communication model to analyze the communication costs involved in collective operations on large-scale supercomputing systems and uses it to model the costs involved in the Gather and Scatter collective operations. It studies the various alternatives available to a middleware designer for designing network-topology- and speed-aware broadcast algorithms. The thesis also studies the challenges involved in designing network-topology-aware algorithms for the network-intensive Alltoall collective operation and proposes multiple schemes to design a network-topology-aware Alltoall primitive. It carefully analyzes the performance benefits of these schemes and evaluates their trade-offs across varying application characteristics. The thesis also proposes the design of a network-topology-aware MPI communication library capable of leveraging the topology information to improve communication performance of point-to-point operations through network-topology-aware placement of processes.
It describes the design of a network-topology-aware plugin for the SLURM job scheduler to make scheduling decisions and allocate compute resources, as well as place processes, in a network-topology-aware manner. It also describes the design of a network-topology-aware, scalable, low-overhead profiler to analyze the impact of our schemes on the communication patterns of microbenchmarks and end applications. The designs proposed in this thesis have been successfully tested at up to 4,096 processes on the Stampede supercomputing system at TACC. We observe up to 14% improvement in the latency of the broadcast operation using our proposed topology-aware scheme over the default scheme at the micro-benchmark level for 1,024 processes. The topology-aware point-to-point communication and process placement scheme is able to improve the performance of the MILC application by up to 6% and 15% in total execution time on 1,024 cores of Hyperion and 2,048 cores of Ranger, respectively. We also observe that our network-topology-aware communication schedules for Alltoall are able to significantly reduce the amount of network contention observed during the Alltoall/FFT operations. They also deliver up to 12% improvement in the communication time of P3DFFT at 4,096 processes on Stampede. The proposed network-topology-aware plugin for SLURM is able to improve the throughput of a 512-core cluster by up to 8%.
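
The payoff of topology-aware placement can be sketched with a toy hop-count model (hypothetical fat-tree and traffic profile, not MVAPICH2's actual mapping algorithm): ranks that exchange the most bytes should land under the same leaf switch.

```python
import itertools

# Toy fat-tree: nodes grouped by leaf switch; intra-switch = 1 hop,
# inter-switch = 3 hops (up to the spine and back down).
switch_of = {0: 0, 1: 0, 2: 1, 3: 1}

def hops(a, b):
    return 1 if switch_of[a] == switch_of[b] else 3

# Bytes exchanged between MPI rank pairs (hypothetical profile).
traffic = {(0, 1): 10, (0, 2): 1000, (1, 3): 5, (2, 3): 8}

def cost(placement):  # placement: rank -> node
    return sum(v * hops(placement[i], placement[j])
               for (i, j), v in traffic.items())

naive = {0: 0, 1: 1, 2: 2, 3: 3}          # rank i on node i
best = min((dict(zip(range(4), p))
            for p in itertools.permutations(range(4))), key=cost)
print(cost(naive), cost(best))  # hop-bytes drop once ranks 0 and 2 share a switch
```

Exhaustive search is only viable at toy scale; real schedulers and libraries use heuristics over the same kind of cost model.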

Committee:

Dhabaleswar Kumar Panda (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member); Karen Tomko (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

InfiniBand, High Performance Computing, Topology-Aware Communication, Network Topology, Topology-Aware process placement, Topology-Aware point-to-point communication, Topology-Aware collective communication

Kumar, Vijay Shiv. Specification, Configuration and Execution of Data-intensive Scientific Applications
Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

Recent advances in digital sensor technology and numerical simulations of real-world phenomena are resulting in the acquisition of unprecedented amounts of raw digital data. Terms like ‘data explosion’ and ‘data tsunami’ have come to describe the uncontrolled rate at which scientific datasets are generated by automated sources ranging from digital microscopes and telescopes to in silico models simulating the complex dynamics of physical and biological processes. Scientists in various domains now have secure, affordable access to petabyte-scale observational data gathered over time, the analysis of which is crucial to scientific discovery. The availability of commodity components has fostered the development of large distributed systems with high-performance computing resources to support the execution requirements of scientific data analysis applications. Increased levels of middleware support over the years have aimed to provide high scalability of application execution on these systems. However, the high-resolution, multi-dimensional nature of scientific datasets and the complexity of analysis requirements present challenges to efficient application execution on such systems. Traditional brute-force analysis techniques to extract useful information from scientific datasets may no longer meet desired performance levels at extreme data scales.

This thesis builds on a comprehensive study involving multi-dimensional data analysis applications at large data scales, and identifies a set of advanced factors, or parameters, of this class of applications that can be customized in domain-specific ways to obtain substantial improvements in performance. A useful property of these applications is their ability to operate at multiple performance levels based on a set of trade-off parameters, while providing different levels of quality-of-service (QoS) specific to the application instance. To realize the performance benefits brought about by such factors, applications must be configured for execution in specific ways for specific systems. Middleware support for such domain-specific configuration is limited, and there is typically no integration across middleware layers to this end. Low-level manual configuration of applications within a large space of solutions is error-prone and tedious.

This thesis proposes an approach for the development and execution of large scientific multi-dimensional data analysis applications that takes multiple performance parameters into account and supports the notion of domain-specific configuration-as-a-service. My research identifies various aspects that go into the creation of a framework for user-guided, system-directed performance optimizations for such applications. The framework seeks to achieve this goal by integrating software modules that (i) provide a unified, homogeneous model for the high-level specification of any conceptual knowledge that may be used to configure applications within a domain, (ii) perform application configuration in response to user directives, i.e., use the specifications to translate high-level requirements into low-level execution plans optimized for a given system, and (iii) carry out the execution plans on the underlying system in an efficient and scalable manner. A prototype implementation of the framework that integrates several middleware layers is used for evaluating our approach. Experimental results gathered for real-world application scenarios from the domains of astronomy and biomedical imaging demonstrate the utility of our framework towards meeting the scientific performance requirements at very large data scales.

Committee:

P. Sadayappan, PhD (Advisor); Joel Saltz, MD, PhD (Committee Member); Gagan Agrawal, PhD (Committee Member); Umit Catalyurek, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

data-intensive computing; high performance computing; scientific workflow; out-of-core; multi-dimensional data; semantic modeling

Su, Yu. Big Data Management Framework based on Virtualization and Bitmap Data Summarization
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
In recent years, science has become increasingly data driven. Data collected from instruments and simulations is extremely valuable for a variety of scientific endeavors. The key challenge being faced by these efforts is that dataset sizes continue to grow rapidly. With the growing computational capabilities of parallel machines, the temporal and spatial scales of simulations are becoming increasingly fine-grained. However, data transfer bandwidths and disk IO speeds are growing at a much slower pace, making it extremely hard for scientists to transport these rapidly growing datasets. Our overall goal is to provide a virtualization- and bitmap-based data management framework for “big data” applications. The challenges arise from four aspects. First, the “big data” problem leads to a strong requirement for efficient but lightweight server-side data subsetting and aggregation to decrease the data loading and transfer volume and help scientists find subsets of the data that are of interest to them. Second, data sampling, which focuses on selecting a small set of samples to represent the entire dataset, can greatly decrease the data processing volume and improve efficiency. However, finding a sample accurate enough to preserve scientific data features is difficult, and estimating sampling accuracy is also time-consuming. Third, correlation analysis over multiple variables plays a very important role in scientific discovery. However, scanning through multiple variables for correlation calculation is extremely time-consuming. Finally, because of the huge gap between computing and storage, a large amount of data analysis time is spent on IO. In an in-situ environment, before the data is written to disk, generating a smaller profile of the data that represents the original dataset and still supports different analyses is very difficult. In our work, we propose a data management framework to support more efficient scientific data analysis, which contains two modules: SQL-based Data Virtualization and Bitmap-based Data Summarization. The SQL-based Data Virtualization module supports high-level SQL-like queries over different kinds of low-level data formats such as NetCDF and HDF5. From the scientists’ perspective, all they need to know is how to use SQL queries to specify their data subsetting, aggregation, sampling, or even correlation analysis requirements. Our module automatically translates the high-level SQL queries into low-level data access operations, fetches the data subsets, performs the different calculations, and returns the final results to the scientists. The Bitmap-based Data Summarization module treats the bitmap index as a data summary and supports different kinds of analysis using only bitmaps. Indexing technologies, especially bitmap indexing, have been widely used in the database area to improve data query efficiency. The major contribution of our work is the finding that the bitmap index preserves both the value distribution and the spatial locality of a scientific dataset. Hence, it can be treated as a summary of the data with a much smaller size. We demonstrate that many different kinds of analyses can be supported using only bitmaps.
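
The bitmap-as-summary idea can be sketched in a few lines (illustrative binning, not the dissertation's indexing code): one bitmap per value bin preserves both the value distribution (bit counts) and spatial locality (bit positions), so subsetting and aggregation reduce to bitwise operations.

```python
import numpy as np

data = np.array([0.1, 0.7, 0.9, 0.3, 0.8, 0.2])   # one variable, 6 cells
edges = [0.0, 0.5, 1.0]                            # two value bins

# One bitmap per bin: bit i is set iff cell i falls in the bin.
bitmaps = [(data >= lo) & (data < hi) for lo, hi in zip(edges, edges[1:])]

# Subsetting query "value >= 0.5" answered from bitmaps alone:
hits = bitmaps[1]
print(np.flatnonzero(hits))   # spatial locations preserved: [1 2 4]
print(hits.sum())             # aggregation (count) without raw data: 3
```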

Committee:

Gagan Agrawal (Advisor)

Subjects:

Computer Science

Keywords:

Big Data; High-Performance Computing; Bitmap Index; Data Virtualization; Sampling; Correlation Analysis; Time Steps Selection; In-Situ Analysis; Distributed Computing; Scientific Data Management; Wide-area Data Transfer

Eisenlohr, John Merrick. Parallel ILU Preconditioning for Structured Grid Matrices
Master of Science, The Ohio State University, 2015, Computer Science and Engineering
Many scientific simulation codes represent physical systems by storing data at a set of discrete points. For example, a weather modeling software application may store data representing temperature, wind velocity, air pressure, humidity, etc. at a set of points in the atmosphere over a portion of the Earth's surface. Similarly, Computational Fluid Dynamics (CFD) software stores data representing fluid velocity, pressure, etc. at discrete points in the domain of the particular problem being solved. In many simulations, this set of discrete points is a structured grid, i.e., a finite n-dimensional regular lattice. One of the most important steps in many scientific simulations is the solution of a system of linear equations, where the unknowns of the system correspond to data elements at each of the discrete points used to model the system. If the simulation is based on a structured grid, this linear system will often have a special structure itself, and this structure may lead to more efficient techniques for solving the system than can be used for general sparse linear systems. Scientific simulations most often use an iterative technique such as the Conjugate Gradient method or GMRES for solving linear systems. It is well known that these iterative techniques converge much more quickly if a preconditioner is used. The ILU, or Incomplete LU Factorization, preconditioner is a good choice, but it is not parallelizable in a straightforward way. This thesis examines techniques for parallelizing the application of the ILU preconditioner to linear systems arising from scientific simulations on structured grids. Various techniques are tested and timing results are recorded for different types and sizes of linear systems on structured grids.
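
One standard way to expose parallelism in the ILU triangular solves on a structured grid is wavefront (level) scheduling; the sketch below illustrates the idea (toy couplings and a unit diagonal assumed, not the thesis's tested codes). On a 2D grid with the 5-point stencil, all unknowns on an anti-diagonal i + j = const depend only on earlier anti-diagonals, so each level can be processed in parallel.

```python
import numpy as np

n = 4                           # n x n structured grid
L_w, L_s = 0.5, 0.25            # lower-triangular couplings to west/south
b = np.ones((n, n))
x = np.zeros((n, n))

# Forward solve (L x = b) by wavefronts: level k holds all (i, j)
# with i + j == k; they are mutually independent, so the inner loop
# could be a parallel-for or a SIMD loop.
for k in range(2 * n - 1):
    for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
        j = k - i
        s = b[i, j]
        if i > 0: s -= L_s * x[i - 1, j]
        if j > 0: s -= L_w * x[i, j - 1]
        x[i, j] = s             # unit diagonal assumed for ILU's L factor
print(x)
```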

Committee:

P. Sadayappan (Advisor); Atanas Rountev (Committee Member)

Subjects:

Computer Science

Keywords:

high-performance computing; numerical linear algebra; preconditioning; parallel programming; SIMD; vectorization

Dinan, James S. Scalable Task Parallel Programming in the Partitioned Global Address Space
Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

Applications that exhibit irregular, dynamic, and unbalanced parallelism are growing in number and importance in the computational science and engineering communities. These applications span many domains including computational chemistry, physics, biology, and data mining. In such applications, the units of computation are often irregular in size and the availability of work may depend on the dynamic, often recursive, behavior of the program. Because of these properties, it is challenging for these programs to achieve high levels of performance and scalability on modern high performance clusters.

A new family of programming models, called the Partitioned Global Address Space (PGAS) family, provides the programmer with a global view of shared data and allows for asynchronous, one-sided access to data regardless of where it is physically stored. In this model, the global address space is distributed across the memories of multiple nodes and, for any given node, is partitioned into local patches that have high affinity and low access cost and remote patches that have a high access cost due to communication. The PGAS data model relaxes conventional two-sided communication semantics and allows the programmer to access remote data without the cooperation of the remote processor. Thus, this model is attractive for supporting irregular and dynamic applications on distributed memory clusters. However, in spite of the flexible data model, PGAS execution models require the programmer to explicitly partition the computation in a process-centric manner.

In this work, we build a complete environment to support irregular and dynamic parallel computations on large scale clusters by extending the PGAS data model with a task parallel execution model. Task parallelism allows the programmer to express their computation as a dynamic collection of tasks. The execution of these tasks is managed by a scalable and efficient runtime system that performs dynamic load balancing, enhances locality, and provides opportunities for efficient recovery from faults. Experimental results indicate that this system is scalable to over 8192 processor cores, can achieve extremely high efficiency even in the presence of highly unbalanced and dynamic computations, and can also be leveraged to enable rapid recovery from failures.
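
The dynamic load balancing at the core of such a runtime is typically work stealing; a small sketch follows (threads standing in for PGAS processes, with a deliberately simplistic termination check; the real runtime steals across nodes with one-sided operations).

```python
import collections, random, threading

NWORKERS = 4
deques = [collections.deque() for _ in range(NWORKERS)]
lock = threading.Lock()
done = [0] * NWORKERS

def worker(me):
    while True:
        with lock:
            if deques[me]:
                task = deques[me].pop()              # LIFO from own deque
            else:
                victim = random.randrange(NWORKERS)
                if deques[victim]:
                    task = deques[victim].popleft()  # FIFO steal from victim
                elif not any(deques):
                    return       # naive quiescence check (sketch only)
                else:
                    continue
        done[me] += 1                                # "execute" the task
        if task > 0:                                 # tasks spawn tasks
            with lock:
                deques[me].append(task - 1)
                deques[me].append(task - 1)

deques[0].append(3)                                  # unbalanced start
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NWORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(done)                          # the 15 tasks spread across workers
```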

Committee:

Ponnuswamy Sadayappan, PhD (Advisor); Paolo Sivilotti, PhD (Committee Member); Atanas Rountev, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

High performance computing; Parallel programming models; Scalable runtime systems

Jiang, Wei. A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

Parallel computing environments are ubiquitous nowadays, including traditional CPU clusters and, more recently, GPU clusters and CPU-GPU clusters, which are attractive for their performance, cost, and energy efficiency. With this trend, an important research issue is how to effectively utilize the massive computing power in these architectures to accelerate data-intensive applications arising from commercial and scientific domains. Map-reduce and its Hadoop implementation have become popular for their high programming productivity, but they exhibit non-trivial performance losses for many classes of data-intensive applications. Also, to date there is no general map-reduce-like support for programming heterogeneous systems like a CPU-GPU cluster. Moreover, it is widely accepted that the existing fault tolerance techniques for high-end systems will not be feasible in the exascale era, and novel solutions are clearly needed.

Our overall goal is to solve these programmability and performance issues by providing a map-reduce-like API with better performance efficiency as well as efficient fault tolerance support, targeting data-intensive applications and various new emerging parallel architectures. We believe that a map-reduce-like API can ease the programming difficulty in these parallel architectures, and more importantly improve the performance efficiency of parallelizing these data-intensive applications. Also, the use of a high-level programming model can greatly simplify fault-tolerance support, resulting in low overhead checkpointing and recovery.

We performed a comparative study showing that the map-reduce processing style can cause significant overheads for a set of data mining applications. Based on this observation, we developed a map-reduce system with an alternate API (MATE) that uses a user-declared reduction object to further improve the performance of map-reduce programs in multi-core environments. To address the limitation in MATE that the reduction object must fit in memory, we extended the MATE system to support reduction objects of arbitrary size in distributed environments and applied it to a set of graph mining applications, obtaining better performance than the original graph mining library based on map-reduce.

We then supported the generalized reduction API in a CPU-GPU cluster with automatic data distribution among CPUs and GPUs, to achieve the best possible heterogeneous execution of iterative applications. Finally, in our recent work, we extended the generalized reduction model to support low-overhead fault tolerance for MPI programs in our FT-MATE system. In particular, we are able to handle CPU/GPU failures in a cluster with low-overhead checkpointing, and to restart the computations on a different number of nodes. Through this work, we would like to provide useful insights for designing and implementing efficient fault tolerance solutions for future exascale systems.
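
The flavor of a reduction-object API can be sketched as follows (hypothetical function names; MATE itself is not a Python library): instead of emitting (key, value) pairs that must be grouped and shuffled, each worker folds its inputs directly into a user-declared reduction object, and the objects are merged at the end.

```python
from functools import reduce
from multiprocessing import Pool

def local_reduce(chunk):
    """Accumulate directly into a reduction object (a histogram here),
    avoiding intermediate (key, value) pairs entirely."""
    robj = {}                          # reduction object: bucket -> count
    for x in chunk:
        b = int(x * 10)
        robj[b] = robj.get(b, 0) + 1
    return robj

def combine(a, b):
    """Merge two reduction objects (the user-supplied combine step)."""
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

if __name__ == "__main__":
    data = [i / 100 for i in range(1000)]
    chunks = [data[i::4] for i in range(4)]    # 4 independent partitions
    with Pool(4) as p:
        parts = p.map(local_reduce, chunks)
    print(reduce(combine, parts))
```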

Committee:

Gagan Agrawal (Advisor); Ponnuswamy Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Science

Keywords:

High-Performance Computing; Data-Intensive Computing; Map-Reduce

Sanghvi, Niraj D. Parallel Computation of the Meddis MATLAB Auditory Periphery Model
Master of Science, The Ohio State University, 2012, Electrical and Computer Engineering

The Meddis MATLAB Auditory Periphery (MAP) model is a computational model of the mammalian auditory system implemented in MATLAB. The model simulates the physiological processes taking place at different stages within the auditory system and provides insight into how a sound wave incident on the ear is modified as it passes through the auditory system. The stages of the auditory system included in the model are the middle ear, the inner ear, and parts of the brain stem.

Calculating the response of each stage is a computationally intensive and time-consuming task. It takes more than 7 minutes just to calculate the response of all auditory nerves, each producing action potentials about 300 times a second, for a sound signal of 1 second duration when 1000 best frequencies are considered. This is a major disadvantage, especially when the model has to be run multiple times in speech and hearing experiments.

The thesis describes how the runtime of the MAP model can be reduced by modifying the code, enabling it to run in parallel on multiple processing cores using tools provided by MATLAB. The thesis also describes how GPUs can be utilized to further reduce runtimes. The thesis concludes with an application of the MAP model in detecting differences between FM signals.
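
The parallelization strategy exploits the independence of the best-frequency channels; the Python sketch below stands in for the MATLAB Parallel Computing Toolbox version (the crude FFT-domain bandpass is a hypothetical stand-in for the MAP model's far more detailed stages).

```python
import numpy as np
from multiprocessing import Pool

FS = 44_100
T = np.arange(FS) / FS                       # 1 second of signal
SIGNAL = np.sin(2 * np.pi * 1000 * T)

def channel_response(best_freq):
    """One auditory channel: toy bandpass around best_freq, standing in
    for the basilar-membrane stage of the MAP model."""
    spec = np.fft.rfft(SIGNAL)
    freqs = np.fft.rfftfreq(len(SIGNAL), 1 / FS)
    band = np.abs(freqs - best_freq) < best_freq * 0.1
    return np.fft.irfft(spec * band).std()   # channel output energy

if __name__ == "__main__":
    best_freqs = np.logspace(np.log10(250), np.log10(8000), 100)
    with Pool() as p:                        # channels are independent,
        out = p.map(channel_response, best_freqs)   # so they parallelize
    print(max(out))
```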

Committee:

Ashok Krishnamurthy, PhD (Advisor); Yuan Zheng, PhD (Committee Member); Lawrence Feth, PhD (Committee Member)

Keywords:

Parallel Computing; Auditory Periphery; High Performance Computing; GPU; CUDA; Meddis Auditory Periphery; MATLAB; Parallel Computing Toolbox

Nikam, Akshay Machhindra. Redistribution of Tensors for Distributed Contractions
Master of Science, The Ohio State University, 2014, Computer Science and Engineering
In computational quantum chemistry and nuclear physics, tensor contractions are frequently required and computationally expensive operations. Methods involving tensor contractions often require them to run in series. Efficient contraction algorithms require the tensors to be distributed in certain ways. Hence, a redistribution operation on tensors is essential between two contractions in series. In this thesis, an efficient method to redistribute tensors on a multi-dimensional torus grid of processors is proposed. Multiple ways of redistribution exist, and the most efficient one for a given pair of distributions can be picked. The choice is mainly driven by the replication of tensor data in the processor grid. Redistribution approaches are developed based on the replication scheme; they involve division of the processor grid into smaller hyperplanar sub-grids that can handle the communication in parallel. Some approaches involve data broadcast along grid dimensions to replicate data as required by the new distribution. Wherever such efficient schemes are not possible, a point-to-point communication approach is proposed.
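
The sub-grid decomposition can be sketched with a toy enumeration (assuming ranks laid out in row-major order on the torus grid; the thesis's actual communication schedules are more involved): fixing one coordinate of a 3D grid yields hyperplanar sub-grids whose redistributions can proceed in parallel.

```python
import numpy as np

dims = (2, 3, 4)                                # 3-D processor grid, 24 ranks
grid = np.arange(np.prod(dims)).reshape(dims)   # rank = row-major id

def hyperplanes(axis):
    """Split the grid into sub-grids by fixing one coordinate; each
    sub-grid can redistribute (or broadcast) independently."""
    return [np.take(grid, k, axis=axis).ravel().tolist()
            for k in range(dims[axis])]

for axis in range(3):
    print(f"axis {axis}:", hyperplanes(axis))
# e.g. axis 0 gives two 3x4 planes; a broadcast along axis 0 replicates
# a tensor block from one plane to the other, as some distributions require.
```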

Committee:

Ponnuswamy Sadayappan (Advisor); Atanas Rountev (Committee Member)

Subjects:

Computer Science

Keywords:

high performance computing; torus networks; tensors

Stock, Kevin Alan. Vectorization and Register Reuse in High Performance Computing
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Automatic vectorization is crucial for attaining maximum performance on modern processors in a performance-portable fashion. There is significant room for improvement over the techniques in existing compilers, as demonstrated in this work, through careful synthesis of vectorized code and application of aggressive performance optimizations, especially in the domains of tensor contractions and stencils. Tensor contractions are a class of computation used extensively in computational chemistry; they are essentially higher-dimensional generalizations of matrix multiplication. Vector code synthesis for tensor contractions has been explored, utilizing a novel in-register transpose for efficiently accessing tensors with SIMD vectors along any dimension. Aggressive use of unroll-and-jam and loop permutation optimizations is able to increase register reuse and amortize the cost of expensive vector operations. Stencils are a critical part of many scientific codes, including PDE solvers and image processing. This work presents a framework for reordering operations in regular loop computations, such as stencils, by utilizing the associativity and commutativity of operations. The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and, to a more limited extent, in optimizing compilers. The reordering of operations enables substantially improved register reuse through reduced register pressure, resulting in up to 4x performance improvement for existing benchmarks. The multidimensional retiming formalism characterizes the space of valid implementations in conjunction with other transformations from the vector code synthesis used for tensor contractions. Given the transformations used to optimize high performance software, there are many parameters which create a space of possible variants from which the best, or nearly best, candidate must be found. In addition to auto-tuning over the space of synthesized vector codes, two models have been developed for selecting a high performance variant at compile time. The first model is based on estimating the total number of SIMD instructions executed and, for a set of benchmark kernels, achieved a mean performance of 73% of the best observed performance across multiple architectures and compilers. The second model used features extracted from the assembly code generated by the compiler, which were then processed by machine learning algorithms. By considering a linear combination of the most successful models, a mean performance of 89% of the best observed was achieved in the same tests as the first model. This research has shown that integration of tile code generation techniques with existing automatic tiling and parallelizing algorithms based on the polyhedral model can produce significant performance improvements on modern architectures such as Intel's Many Integrated Core. Additionally, these techniques have been applied in production software to automatically produce high performance code on multiple architectures for the dominant computational kernel in MADNESS.
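
The reordering idea can be checked with a tiny numpy demo (an illustration of associativity-based reordering, not the dissertation's code generator): a 3-point stencil sum can be evaluated term by term or via factored pairwise sums, with identical results but different register-pressure profiles in the scalar-code analogue.

```python
import numpy as np

x = np.arange(10, dtype=np.int64)     # integers, so reassociation is exact

# Direct order: y[i] = x[i] + x[i+1] + x[i+2]
y1 = x[:-2] + x[1:-1] + x[2:]

# Reordered: s[i] = x[i] + x[i+1], then y[i] = x[i] + s[i+1].
# In scalar code each pairwise sum s is computed once and reused by the
# next output too (y[i+1] = s[i+1] + x[i+3]), lowering register pressure.
s = x[:-1] + x[1:]
y2 = x[:-2] + s[1:]

assert (y1 == y2).all()               # associativity guarantees equality
print(y1)
```

With floating-point data the two orders differ in rounding, which is exactly why compilers reorder associative operations only conservatively.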

Committee:

P Sadayappan (Advisor); Atanas Rountev (Committee Member); Srinivasan Parthasarathy (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

SIMD; High Performance Computing; Autovectorization; Tensor contractions; Stencils

Carver, Eric R. Reducing Network Latency for Low-cost Beowulf Clusters
MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Engineering
Parallel Discrete Event Simulation (PDES) is a fine-grained parallel application that can be difficult to optimize on distributed Beowulf clusters. A significant challenge on these compute platforms is the relatively high network latency compared to the high CPU performance on each node. The frequent communications and high network latency mean that event information communicated between nodes can arrive after a significant delay, during which the processing node is either waiting for the event to arrive (conservatively synchronized solutions) or prematurely processing events while the transmitted event is in transit (optimistically synchronized solutions). Thus, solutions to reduce network latency are crucial to the deployment of PDES. Conventional attacks on network latency in cluster environments are to use high-priced hardware such as InfiniBand and/or lightweight messaging layers other than TCP/IP. However, clusters are generally high-cost systems (tens to hundreds of thousands of dollars) that, by necessity, must be shared. The use of lower-latency hardware such as InfiniBand can nearly double the hardware cost, and the replacement of the TCP/IP network stack on a shared platform is generally infeasible, as other users of the shared platform (with coarse-grained parallel computations) are well served by the TCP/IP stack and unwilling to rewrite their applications to use the APIs of alternate network stacks. Furthermore, configuring the hardware with multiple messaging transport layers is also quite difficult to set up and not generally supported. Low-cost, small-form-factor compute nodes with multi-core processing chips are becoming widely available. These solutions have lower-performing compute nodes and yet often still support 100Mb/1Gb Ethernet hardware (reducing the network latency/processor performance disparity). The much lower per-node costs (on the order of $200 per node) can enable the deployment of non-shared, dedicated clusters and thus may be an attractive alternative for network customization and use to support PDES applications. This thesis explores this option of using ODROID compute nodes for the cluster. The conventional TCP/IP networking stack is replaced with the (publicly available) RDMA over Converged Ethernet (RoCE) networking layer, which has significantly lower latency costs. We find that the RoCE solution is capable of reducing end-to-end small message latency by more than 30%. This translates to a performance improvement of greater than 10% (compared to the TCP/IP solution) for PDES applications using Rensselaer's Optimistic Simulation System (ROSS). However, when comparing the ODROID-based cluster's performance for cost, both in terms of operations per second and Parallel Discrete Event Simulation performance, we find that its performance does not justify its price for either application.

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Wen Ben Jone, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

High Performance Computing; Cluster; Parallel Discrete Event Simulation; InfiniBand; RoCE; Time Warp

GANDHI, SACHIN. AN FPGA IMPLEMENTATION OF FDTD CODES FOR RECONFIGURABLE HIGH PERFORMANCE COMPUTING
MS, University of Cincinnati, 2004, Engineering : Computer Engineering
Finite-difference time-domain (FDTD) codes are used in modeling RF signatures and electronic coupling. Despite improvements in large-scale modeling and simulation codes and the acquisition of powerful high-performance computing (HPC) platforms, simulations of such scientific problems still require more powerful computers. The advances in Field-Programmable Gate Array (FPGA) chips and FPGA-based co-processor (FPGA-CP) boards offer the potential for accelerating current and future HPC platforms. Higher-capacity FPGAs offer an opportunity for implementing floating-point applications that were previously infeasible. In this thesis, I have investigated the feasibility of using FPGA-CP acceleration for FDTD simulation. The FDTD method has been chosen for this case study due to its simple computation kernel and abundant parallelism. A floating-point solver for electromagnetic simulation has been implemented. Results achieved in this thesis demonstrate the feasibility of high throughput and good circuit density through the use of architectural features like the 18-bit block multipliers provided by modern-day FPGAs.
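
The simple kernel and abundant parallelism are visible in a minimal 1D FDTD update (standard Yee leapfrog in normalized units; the thesis maps the same access pattern onto FPGA logic rather than software).

```python
import numpy as np

nx, nt = 200, 500
ez = np.zeros(nx)                 # electric field
hy = np.zeros(nx - 1)             # magnetic field (staggered grid)

for t in range(nt):
    hy += 0.5 * (ez[1:] - ez[:-1])                 # update H from curl E
    ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])           # update E from curl H
    ez[nx // 2] += np.exp(-((t - 30) / 10) ** 2)   # soft Gaussian source

print(ez.max())
# Every point update reads only fixed nearest neighbors, so the loop
# bodies map naturally onto deep pipelines or wide parallel hardware.
```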

Committee:

Dr. Karen Tomko (Advisor)

Keywords:

FPGA; FDTD; HPC; hardware implementation; Finite Difference Time-Domain; Reconfigurable High Performance Computing; floating-point implementation

Lilly, Bradford R. A Novel System for Wireless Robotic Surgery Through the Use of Ultrasonic Tracking Coupled with Advanced Modeling Techniques
Master of Science in Engineering, University of Toledo, 2012, Engineering (Computer Science)
Through the use of the University of Toledo sensor network infrastructure, along with a High Performance Computing (HPC) infrastructure and the new robotics and virtual reality labs, experiments are conducted to determine the viability of robotics applied to minimally invasive spinal fusions. We explore the implications of radiation reduction as well as accuracy (both required and actualized). To support these experiments, infrastructural components in both hardware and software are developed. Results show that this method is able to achieve the goals set forth. Steps are being taken to continue funded research in collaboration with industry partners, which has the real potential of leading to a commercial product in a high-revenue market.

Committee:

Krishna Shenai (Committee Chair); Vijay Goel (Committee Member); Ashok Biyani (Committee Member); Richard Molyet (Committee Member)

Subjects:

Computer Engineering

Keywords:

Robotic Surgery; Wireless Sensor Networks; High Performance Computing

Samsi, Siddharth Sadanand. Computer Aided Analysis of IHC and H&E Stained Histopathological Images in Lymphoma and Lupus
Doctor of Philosophy, The Ohio State University, 2012, Electrical and Computer Engineering

The use of computers in medical image analysis has seen tremendous growth following the development of imaging technologies that can capture image data in-vivo as well as ex-vivo. While the field of radiology has adopted computer-aided image analysis in research as well as clinical settings, the use of similar techniques in histopathology is still in a nascent stage. The current gold standard in diagnosis involves labor-intensive tasks such as cell counting and quantification for disease diagnosis and characterization. This process can be subjective and affected by human factors such as reader bias and fatigue. Computer-based tools such as digital image analysis have the potential to help alleviate some of these problems while also offering insights that may not be readily apparent when viewing glass slides under an optical microscope. Commercially available high-resolution slide scanners now make it possible to obtain images of whole slides scanned at 40x microscope resolution. Additionally, advanced tools for scanning tissue images at 100x resolution are also available. Such scanning tools have led to a large amount of research focused on the development of image analysis techniques for histopathological images. While the availability of high-resolution image data presents innumerable research opportunities, it also leads to several challenges that must be addressed.

This dissertation explores some of the challenges associated with computer-aided analysis of histopathological images. Specifically, we develop a number of tools for Follicular Lymphoma and Lupus. We aim to develop algorithms for detection of salient features in tissue biopsies of follicular lymphoma tumors. We analyze the algorithms from a computational point of view and develop techniques for processing whole slide images efficiently using high performance computing resources. In the application of image analysis for Lupus, we analyze mouse renal biopsies for characterizing the distribution of infiltrates in tissue as well as develop algorithms for identification of tissue components such as the glomeruli, which play a significant role in the diagnosis of the disease. Finally, we explore the development of a web-based system for dissemination of high-resolution images of tissues with the goal of advancing collaboration, research and teaching. Through the use of web technologies and developments in the field of geospatial imaging, we demonstrate the efficacy of an online tissue repository that can enable pathologists, medical students and all researchers to explore these images as well as use high performance computing systems to leverage computer-aided diagnosis tools in their field.
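
The whole-slide processing pattern can be sketched as a generic tile-and-reduce skeleton (a stand-in image and a stub per-tile analysis, not the dissertation's pipeline): the gigapixel image is cut into tiles that are analyzed independently, which is what makes HPC resources effective here.

```python
import numpy as np
from multiprocessing import Pool

TILE = 512

def analyze(tile):
    """Per-tile analysis stub: e.g. count candidate nuclei pixels."""
    return int((tile > 0.8).sum())

def tiles(img):
    for r in range(0, img.shape[0], TILE):
        for c in range(0, img.shape[1], TILE):
            yield img[r:r + TILE, c:c + TILE]

if __name__ == "__main__":
    slide = np.random.default_rng(0).random((2048, 2048))  # stand-in image
    with Pool() as p:                    # tiles are independent work units
        counts = p.map(analyze, tiles(slide))
    print(sum(counts))
```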

Committee:

Ashok Krishnamurthy, PhD (Advisor); Bradley Clymer, PhD (Committee Member); Kimerly Powell, PhD (Committee Member)

Subjects:

Electrical Engineering; Information Systems; Medical Imaging

Keywords:

Histopathology; Renal image analysis; Lymphoma; IHC; H&E; virtual microscopy; high performance computing; HPC

Chen, Yung-Yu. A Multi-Physics Software Framework on Hybrid Parallel Computing for High-Fidelity Solutions of Conservation Laws
Doctor of Philosophy, The Ohio State University, 2011, Mechanical Engineering

This dissertation reports a novel approach to achieve high-fidelity solutions of conservation laws by designing a software framework that focuses on hybrid parallel computing. The software framework has been developed and open-sourced as a part of this research work. The reference implementation is named SOLVCON, which stands for SOLVer CONstructor. SOLVCON aims to provide high-fidelity numerical solutions of linear and nonlinear, first-order, fully-coupled, hyperbolic partial differential equations (PDEs) associated with conservation laws. This dissertation discusses the numerical method, the software design, high-performance computing (HPC), and applications of SOLVCON.

The numerical method employed in SOLVCON is the space-time Conservation Element and Solution Element (CESE) method. The tenet of the CESE method is a unified treatment of space and time in enforcing flux conservation. The method delivers time-accurate solutions of hyperbolic PDEs by using unstructured meshes. In contrast to modern upwind methods, no Riemann solvers (which are physics-specific) or reconstruction steps are used in the CESE method. As such, one need not hard-wire the physics into a CESE code, and physical models have been made pluggable in SOLVCON. This dissertation summarizes various schemes of the CESE method, and describes the development of a unified formulation for unstructured meshes using mixed elements in multi-dimensional space.

The multi-physics nature of the CESE method and the newly developed formulation lay the foundation of a well-organized software structure for high-resolution calculations. To materialize the software structure, a software framework has been developed. A software framework is a software library that exercises inversion of control, and here it regulates PDE solvers for various physical models based on the CESE method. In particular, supportive functionalities have been extracted out of the PDE solvers and collected into the software framework for reuse.

Complex parallel computing is simplified by the framework approach. For high-fidelity solutions, the conventional approach uses the Message-Passing Interface (MPI) for distributed-memory parallel computing. Emerging heterogeneous architectures require hybrid parallelism, which simultaneously utilizes shared- and distributed-memory parallel computing. Moreover, high-fidelity solutions are associated with large data, which require significant storage space and processing time. This issue is also addressed by SOLVCON.

SOLVCON is developed using Python, a nontraditional programming language for developing PDE solvers. Python enables efficient programming by combining imperative, object-oriented, and functional programming. Python also provides various approaches to interface with other languages, including C, Fortran, and CUDA, for computational efficiency. These features are critically important to the development of SOLVCON.

In this dissertation, SOLVCON is used to calculate two different physical processes: (i) stress waves in anisotropic elastic solids and (ii) compressible inviscid flows. Computational tasks for these two different physical processes use the same binary code that implements the CESE method. The solver kernels of the two physical models are modularized so that they are pluggable into the framework. The software framework discussed in this dissertation is a systematic approach to large-scale calculations based on hybrid parallelism.
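
The inversion-of-control structure can be sketched as follows (hypothetical class and hook names, not SOLVCON's real API): the framework owns the time-marching loop and supporting plumbing, while a physics plug-in supplies only the update kernel.

```python
class SolverFramework:
    """Framework owns the march loop (inversion of control)."""
    def __init__(self, kernel, state, nsteps):
        self.kernel, self.state, self.nsteps = kernel, state, nsteps

    def march(self):
        for _ in range(self.nsteps):
            self.state = self.kernel.update(self.state)   # plug-in hook
        return self.state

class LinearWaveKernel:
    """Pluggable physics: one conservation law; swap in another kernel
    class (e.g. an Euler-flow kernel) without touching the framework."""
    def update(self, u):
        # periodic first-order upwind advection, CFL = 0.5
        return [u[i] - 0.5 * (u[i] - u[i - 1]) for i in range(len(u))]

state = [0.0] * 50
state[10] = 1.0
print(max(SolverFramework(LinearWaveKernel(), state, 40).march()))
```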

Committee:

Sheng-Tao Yu (Advisor); Stephen Bechtel (Committee Member); Sin-Chung Chang (Committee Member); Datta Gaitonde (Committee Member); Chia-Hsiang Menq (Committee Member)

Subjects:

Aerospace Engineering; Mechanical Engineering; Mechanics

Keywords:

Conservation laws; hyperbolic PDEs; wave propagation; the CESE method; high-performance computing; software framework; computational science; numerical simulations

Zheng, Mai. Towards Manifesting Reliability Issues In Modern Computer Systems
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Computer systems are evolving all the time. In particular, the two most fundamental components, i.e., the compute unit and the storage unit, have witnessed dramatic changes in recent years. For example, on the compute side, graphics processing units (GPUs) have emerged as an extremely cost-effective means for achieving high performance computing. Similarly, on the storage side, flash-based solid-state drives (SSDs) are revolutionizing the whole IT industry. While these new technologies have improved the performance of computer systems to a new level, they also bring new challenges to the reliability of the systems. As a new computing platform, GPUs enforce a novel multi-threaded programming model. Like any multi-threaded environment, data races on GPUs can severely affect the correctness of the applications and may lead to data loss or corruption. Similarly, as a new storage medium, SSDs also bring potential reliability challenges to the already complicated storage stack. Among other things, the behavior of SSDs during power faults, which happen even in leading data centers, is an important yet mostly ignored issue in this dependability-critical area. Besides SSDs, another important layer in the modern storage stack is the database. The atomicity, consistency, isolation, and durability (ACID) properties modern databases provide make it easy for application developers to create highly reliable applications. However, the ACID properties are far from trivial to provide, particularly when high performance must be achieved. This leads to complex and error-prone code; even at a low defect rate of one bug per thousand lines, the millions of lines of code in a commercial OLTP database can harbor thousands of bugs. As the first step towards building robust modern computer systems, this dissertation proposes novel approaches to detect and manifest reliability issues in three different layers of computer systems. First, in the application layer, this dissertation proposes a low-overhead method for detecting races in GPU applications. The method combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. The design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, we leverage static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPUs, our approach can accurately detect data races with no false positives reported. Our experimental results show that, compared with previous approaches, our method is more effective in detecting races in the evaluated cases and incurs much less runtime and space overhead. Second, in the device layer, this dissertation proposes an effective framework to expose reliability issues in SSDs under power faults. The framework includes specially designed hardware to inject power faults directly into devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test fifteen commodity SSDs from five different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that thirteen out of the fifteen tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
Third, in the systems software layer, this dissertation proposes a novel record-and-replay framework to expose and diagnose violations of the ACID properties in modern databases. The framework includes workloads to exercise the ACID guarantees, a record-and-replay subsystem to allow the controlled injection of simulated power faults, a ranking algorithm to prioritize where to inject faults based on our experience, and a multi-layer tracer to diagnose root causes. Using the framework, we study 8 widely used databases, ranging from open-source key-value stores to high-end commercial OLTP servers. Surprisingly, all 8 databases exhibit erroneous behavior. For the open-source databases, we are able to diagnose the root causes using our tracer, and for the proprietary commercial databases we can reproducibly induce data loss.
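
The core condition the GPU race checker looks for can be illustrated with a tiny sketch (a simplified check over a logged access trace; the dissertation's tool works on real GPU memory hierarchies with static pruning): two accesses to the same address by different threads, at least one a write, with no barrier between them.

```python
# Logged accesses: (thread, address, is_write, barrier_epoch).
# barrier_epoch increments at each __syncthreads()-style barrier.
trace = [
    (0, 0x10, True,  0),
    (1, 0x10, False, 0),   # same epoch as the write above -> race
    (0, 0x20, True,  0),
    (1, 0x20, False, 1),   # separated by a barrier -> no race
]

def find_races(trace):
    races = []
    for i, (t1, a1, w1, e1) in enumerate(trace):
        for t2, a2, w2, e2 in trace[i + 1:]:
            if a1 == a2 and t1 != t2 and (w1 or w2) and e1 == e2:
                races.append((hex(a1), t1, t2))
    return races

print(find_races(trace))   # [('0x10', 0, 1)]
```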

Committee:

Feng Qin (Advisor); Gagan Agrawal (Committee Member); Xiaodong Zhang (Committee Member)

Subjects:

Computer Science

Keywords:

computer systems, reliability, storage systems, solid-state drives, SSD, databases, graphics processing units, GPU, high-performance computing, operating systems

Larkins, Darrell Brian. Efficient Run-time Support For Global View Programming of Linked Data Structures on Distributed Memory Parallel Systems
Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

Developing high-performance parallel applications that use linked data structures on distributed-memory clusters is challenging. Many scientific applications use algorithms based on linked data structures like trees and graphs. These structures are especially useful in representing relationships between data which may not be known until runtime or may otherwise evolve during the course of a computation. Methods such as n-body simulation, Fast Multipole Methods (FMM), and multiresolution analysis all use trees to represent a fixed space populated by a dynamic distribution of elements. Other problem domains, such as data mining, use both trees and graphs to summarize large input datasets into a set of relationships that capture the information in a form that lends itself to efficient mining.

This dissertation first describes a runtime system that provides a programming interface to a global address space representation of generalized distributed linked data structures, while providing scalable performance on distributed memory computing systems. This system, the Global Chunk Layer (GCL), provides data access primitives at the element level, but takes advantage of coarse-grained data movement to enhance locality and improve communication efficiency. The key benefits of using the GCL system include efficient shared-memory style programming of distributed dynamic, linked data structures, the abstraction and optimization of structural elements common to linked data, and the ability to customize many aspects of the runtime to tune application performance.
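
The element-level access over chunk-level movement idea can be sketched as follows (a toy single-process model with hypothetical names; the real GCL fetches chunks across nodes with one-sided communication): a global pointer is a (chunk, offset) pair, and dereferencing one element pulls in, and caches, its whole chunk.

```python
CHUNK_SIZE = 4

class ChunkStore:
    """Stands in for chunk owners spread across cluster nodes."""
    def __init__(self):
        self.chunks = {}               # chunk_id -> list of node records
        self.fetches = 0

    def fetch(self, cid):              # coarse-grained data movement
        self.fetches += 1
        return list(self.chunks[cid])

class GlobalTree:
    def __init__(self, store):
        self.store, self.cache = store, {}

    def deref(self, gptr):             # gptr = (chunk_id, offset)
        cid, off = gptr
        if cid not in self.cache:                   # one fetch per chunk,
            self.cache[cid] = self.store.fetch(cid) # not per element
        return self.cache[cid][off]

store = ChunkStore()
store.chunks[0] = [(f"node{i}", None) for i in range(CHUNK_SIZE)]
tree = GlobalTree(store)
for off in range(CHUNK_SIZE):          # fine-grained element-level API...
    tree.deref((0, off))
print(store.fetches)                   # ...but only 1 coarse chunk fetch
```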

Additionally, this dissertation presents the design and implementation of a tree-specific system for efficient parallel global address space computing. The Global Trees (GT) library provides a global view of distributed linked tree structures and a set of routines that operate on these structures. GT is built on top of the generalized data structure support provided by the GCL runtime and can inter-operate with other parallel programming models such as MPI, or along with existing global view approaches such as Global Arrays. This approach is based on two key insights: First, tree-based algorithms are easily expressed in a fine-grained manner, but data movement must be done at a much coarser level of granularity for good performance. Second, since GT is focused on a single data abstraction, attributes unique to tree structures can be exploited to provide optimized routines for common operations.

This dissertation also describes techniques for improving program performance and program understanding using these frameworks. Data locality has a significant impact on the communication properties of parallel algorithms. Techniques are presented that use profile-driven data reference traces to perform two types of data layout in Global Trees programs. Lastly, approaches for understanding and analyzing program performance data are presented along with tools for visualizing GT structures. These tools enable GT developers to better understand and optimize program performance.

Committee:

P. Sadayappan, PhD (Advisor); Atanas Rountev, PhD (Committee Member); Paul A.G. Sivilotti, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

pgas; trees; graphs; distributed memory; high performance computing

Bas, Erdeniz Ozgun. Load-Balancing Spatially Located Computations using Rectangular Partitions
Master of Science, The Ohio State University, 2011, Computer Science and Engineering
Distributing spatially located heterogeneous workloads is an important problem in parallel scientific computing. Particle-in-cell simulators, ray tracing, and partial differential equation solvers are some of the applications with spatially located workloads. We investigate the problem of partitioning such workloads (represented as a matrix of non-negative integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since finding the optimal arbitrary rectangle-based partition is an NP-hard problem, we investigate particular classes of solutions: rectilinear, jagged, and hierarchical. We present a new class of solutions called m-way jagged partitions, propose new optimal algorithms for m-way jagged partitions and hierarchical partitions, propose new heuristic algorithms, and provide worst-case performance analyses for some existing and new heuristics. Balancing the load does not guarantee that the total runtime of an application is minimized. To achieve that, one must also take into account the communication cost. Rectangle-shaped partitioning inherently keeps communications small, yet one should proactively minimize them. The algorithms we propose are tested in simulation on a wide set of instances and compared to a state-of-the-art algorithm. Results show that m-way jagged partitions are low in total communication cost and practical to use.
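
The jagged idea can be sketched with prefix sums (a simple greedy splitter, not the optimal algorithms the thesis proposes): split the rows into stripes, then split each stripe's columns independently, so column cuts need not align across stripes.

```python
import numpy as np

def greedy_cuts(weights, parts):
    """Cut a 1-D weight vector into `parts` consecutive pieces whose
    loads approximate total/parts, using the prefix sum."""
    csum, total = np.cumsum(weights), weights.sum()
    targets = [total * (k + 1) / parts for k in range(parts - 1)]
    return [int(np.searchsorted(csum, t)) + 1 for t in targets]

def jagged_partition(load, p1, p2):
    row_cuts = greedy_cuts(load.sum(axis=1), p1)
    stripes = np.split(load, row_cuts, axis=0)
    max_load = 0.0
    for s in stripes:                       # columns cut per stripe
        col_cuts = greedy_cuts(s.sum(axis=0), p2)
        for block in np.split(s, col_cuts, axis=1):
            max_load = max(max_load, block.sum())
    return max_load

rng = np.random.default_rng(0)
load = rng.integers(1, 100, size=(32, 32)).astype(float)
print(jagged_partition(load, 4, 4), load.sum() / 16)  # bottleneck vs ideal
```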

Committee:

Umit V. Catalyurek, PhD (Advisor); Radu Teodorescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

matrix partitioning; load balancing; rectangular partitioning; scientific computing; high performance computing; parallel computing

Cooper, Lee Alex Donald. High Performance Image Analysis for Large Histological Datasets
Doctor of Philosophy, The Ohio State University, 2009, Electrical and Computer Engineering

The convergence of emerging challenges in biological research and developments in imaging and computing technologies suggests that image analysis will play an important role in providing a better understanding of biological phenomena. The ability of imaging to localize molecular information is a key capability in the post-genomic era and will be critical in discovering the roles of genes and the relationships that connect them. The scale of the data in these emerging challenges is daunting; high-throughput microscopy can generate hundreds of gigabytes to terabytes of high-resolution imagery even for studies limited in scope to a single gene or interaction. In addition to the scale of the data, the analysis of microscopic image content presents significant problems for the state-of-the-art in image analysis.

This dissertation addresses two significant problems in the analysis of large histological images: reconstruction and tissue segmentation. The proposed methods form a framework that is intended to provide researchers with tools to explore and quantitatively analyze large image datasets.

The work on reconstruction addresses several problems in the reconstruction of tissue from sequences of serial sections using image registration. A scalable algorithm for nonrigid registration is presented that features a novel method for matching small, nondescript anatomical features using geometric reasoning. Methods for the nonrigid registration of images with different stains are presented for two application scenarios. Correlation sharpness is proposed as a new measure of image similarity and is used to map tumor suppressor gene expression to structure in mouse mammary tissues. An extended process of geometric reasoning based on the matching of cliques of anatomical features is presented and demonstrated for the nonrigid registration of immunohistochemical stain to hematoxylin and eosin stain for human cancer images. Finally, a method for incorporating structural constraints into the reconstruction process is proposed and demonstrated on the reconstruction of ducts in mammary tissues.

The work on tissue segmentation focuses on the use of statistical geometrical methods to describe the spatial distributions of biologically meaningful elements such as nuclei in tissue. The two-point correlation function is demonstrated to be an effective feature for the segmentation of tissues, and is shown to possess a peculiar low-dimensional distribution in feature space that permits unsupervised segmentation by robust methods. The relationship between two-point functions for proximal image regions is derived and used to accelerate computation, resulting in a 7-68x improvement over a naive FFT-based implementation.
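
For readers unfamiliar with the feature, the two-point correlation function of a binary mask can be computed with a plain FFT autocorrelation, which is essentially the naive baseline mentioned above rather than the accelerated method; this sketch assumes periodic boundaries and synthetic data.

    # Illustrative two-point correlation of a binary mask (e.g., nuclei vs.
    # background) via FFT autocorrelation (Wiener-Khinchin); naive baseline.
    import numpy as np

    def two_point_correlation(mask):
        """S2(r) for a binary image: the probability that two points separated
        by displacement r both lie in the foreground (periodic boundaries)."""
        f = np.fft.fft2(mask.astype(float))
        auto = np.fft.ifft2(f * np.conj(f)).real   # circular autocorrelation
        return auto / mask.size                    # S2(0,0) = volume fraction

    mask = np.random.rand(64, 64) > 0.7            # synthetic ~30% foreground
    S2 = two_point_correlation(mask)
    print(S2[0, 0])                                # ~0.3, the volume fraction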

In addition to the methods proposed for reconstruction and segmentation, a significant portion of this dissertation is devoted to applying high performance computing to enable the analysis of large datasets. In particular, multi-node parallelization, multi-core processing, and general-purpose computing on graphics processing units are used to form a heterogeneous multiprocessor platform on which the segmentation and reconstruction methods are demonstrated for images up to 62K × 23K in size.

Committee:

Bradley Clymer (Advisor); Kun Huang (Advisor); Ashok Krishnamurthy (Committee Member)

Subjects:

Electrical Engineering

Keywords:

computer vision; image processing; bioinformatics; bioimaging; GPU; microscopy; high performance computing

Islam, Mohammad Kamrul. QoS in Parallel Job Scheduling
Doctor of Philosophy, The Ohio State University, 2008, Computer Science and Engineering

Considerable research has focused on the problem of scheduling dynamically arriving independent parallel jobs on a given set of resources to improve performance with respect to various system and user metrics. However, there has been little work on the provision of Quality of Service (QoS) in space-shared parallel job scheduling, in the form of hard deadline guarantees and service differentiation. Both of these functionalities are very desirable to the user; revenue maximization along with optimal management of resources, on the other hand, is appealing to the service provider.

This dissertation addresses these seemingly orthogonal aspects of parallel job scheduling in stages. First, a new scheme called QoPS is developed to provide QoS in the form of response-time guarantees. Essentially, QoPS implements an admission control mechanism for jobs and provides deadline guarantees for all accepted jobs. Second, a pioneering model is proposed to enable proportional service differentiation (PSD) in job scheduling. A PSD framework allows proportional allocation of resources across users based on their relative priorities.
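
The flavor of such an admission test can be seen in the following drastically simplified sketch (sequential jobs on a single processor under earliest-deadline-first, not the actual QoPS algorithm, which handles parallel jobs): a new job is admitted only if every previously accepted job's deadline still holds.

    # Simplified deadline-based admission control, in the spirit of QoPS.
    def admit(accepted, new_job, now=0.0):
        """Each job is a (runtime, deadline) pair. Return True iff EDF meets
        all deadlines after adding new_job; EDF is optimal in this setting,
        so failure means no feasible schedule exists."""
        jobs = sorted(accepted + [new_job], key=lambda j: j[1])  # by deadline
        t = now
        for runtime, deadline in jobs:
            t += runtime
            if t > deadline:
                return False   # reject: an accepted job's guarantee breaks
        return True

    queue = [(2.0, 4.0), (1.0, 6.0)]
    print(admit(queue, (1.5, 7.0)))   # True: 2, 3, 4.5 all meet deadlines
    print(admit(queue, (3.0, 4.5)))   # False: the new job cannot fit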

To address the revenue issue, two different charging models are investigated, determined by the resource provider and the user respectively. Since no QoS-enabled charging model is currently deployed at any supercomputer center, a new provider-determined charging model is proposed. In this context, the impact of user tolerance towards missed deadlines is studied, as well as various techniques to further improve overall revenue. Alternatively, a user-centric, market-based revenue approach originally proposed for non-QoS scheduling is adapted for QoS-aware scheduling. Using this charging model, an extension to QoPS called DVQoPS is developed that accounts for opportunity cost using a history-based predictive technique, maximizing overall revenue while maintaining deadline guarantees in an integrated way.

Committee:

P. Sadayappan (Advisor); Dhabaleswar Panda (Committee Member); Atanas Rountev (Committee Member)

Subjects:

Computer Science

Keywords:

Job Scheduling; QoS; High performance computing; Revenue

Luo, Miao. Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors
Doctor of Philosophy, The Ohio State University, 2013, Computer Science and Engineering
High-End Computing (HEC) has been growing dramatically over the past decades. Emerging multi-core systems, heterogeneous architectures, and interconnects introduce various challenges and opportunities to improve the performance of communication middleware and applications. The increasing number of processor cores and co-processors results not only in heavy contention on communication resources, but also in much more complicated communication patterns. The Message Passing Interface (MPI) has been the dominant parallel programming model in HPC for the past two decades and has been very successful for regular, iterative parallel algorithms with well-defined communication patterns. In contrast, the Partitioned Global Address Space (PGAS) programming model provides a more flexible way for applications with irregular communication to express parallelism. Different variations and combinations of these programming models present new challenges in designing optimized runtimes, in terms of efficient sharing of networking resources and efficient work-stealing techniques for balancing computation across threads and processes. Middleware plays a key role in delivering the benefits of new hardware to applications and programming models.

This dissertation studies several critical contention problems in existing runtimes that support popular parallel programming models (MPI and UPC) on emerging multi-core/many-core systems. We start with the shared-memory contention problem within the existing MPI runtime. We then explore the node-level network throughput congestion issue in the Unified Parallel C (UPC) runtime. We propose and implement lock-free multi-threaded runtimes with multi-endpoint support for MPI/OpenMP and UPC, respectively. Based on the multi-endpoint design, we further explore how to enhance MPI/OpenMP applications with transparent support for collective operations and minimal modifications for point-to-point operations. Finally, we extend our multi-endpoint research to GPU and MIC architectures for UPC and explore their performance characteristics.

Software developed as a part of this dissertation is available in MVAPICH2 and MVAPICH2-X. MVAPICH2 is a popular open-source implementation of MPI over InfiniBand and is used by hundreds of top computing sites around the world. MVAPICH2-X supports both MPI and UPC hybrid programming models on InfiniBand clusters and is based on the MVAPICH2 stack.
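
The contention problem the multi-endpoint designs target can be illustrated with a toy model (not MVAPICH2-X code): with a single shared endpoint, every sending thread serializes on one lock, whereas per-thread endpoints let sends proceed independently.

    # Toy contrast between a shared, locked endpoint and per-thread endpoints.
    import threading, queue

    class SharedEndpoint:
        """All threads funnel through one lock-protected send queue."""
        def __init__(self):
            self.lock = threading.Lock()
            self.outbox = []
        def send(self, msg):
            with self.lock:              # serializes every sender
                self.outbox.append(msg)

    class MultiEndpoint:
        """One endpoint (queue) per thread: sends avoid the shared lock."""
        def __init__(self, num_threads):
            self.outboxes = [queue.SimpleQueue() for _ in range(num_threads)]
        def send(self, tid, msg):
            self.outboxes[tid].put(msg)  # no contention on the fast path

    ep = MultiEndpoint(num_threads=4)
    threads = [threading.Thread(target=ep.send, args=(i, f"msg-{i}"))
               for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print([ep.outboxes[i].get() for i in range(4)])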

Committee:

Dhabaleswar K. Panda (Advisor); P. Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

High Performance Computing; MPI; UPC; GPGPU; MIC; InfiniBand; Middleware

Liu, Jiuxing. Designing high performance and scalable MPI over InfiniBand
Doctor of Philosophy, The Ohio State University, 2004, Computer and Information Science

Rapid technological advances in recent years have made powerful yet inexpensive commodity PCs a reality. New interconnect technologies that deliver very low latency and very high bandwidth are also becoming available. These developments have led to the trend of cluster computing, which combines the computational power of commodity PCs and the communication performance of high-speed interconnects to provide cost-effective solutions for computationally intensive applications, especially grand-challenge applications such as weather forecasting, air flow analysis, protein searching, and ocean simulation.

InfiniBand was recently proposed as the next-generation interconnect for I/O and inter-process communication. Due to its open standard and high performance, InfiniBand is becoming increasingly popular as an interconnect for building clusters. However, since it is not designed specifically for high performance computing, there exists a semantic gap between its functionalities and those required by high performance computing software such as the Message Passing Interface (MPI). In this dissertation, we take on this challenge and address research issues in designing efficient and scalable communication subsystems to bridge this gap. We focus on how to take advantage of the novel features offered by InfiniBand in designing different components of the communication subsystem, such as protocol design, flow control, buffer management, communication progress, connection management, collective communication, and multirail network support.
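
One classic example of such a protocol-design decision, shown here as a hedged illustration rather than MVAPICH's actual logic, is choosing between eager and rendezvous transfers by message size, with credit-based flow control guarding the receiver's pre-posted buffers.

    # Illustrative eager/rendezvous selection with credit-based flow control.
    EAGER_THRESHOLD = 8 * 1024   # illustrative cutoff in bytes

    def choose_protocol(msg_size, send_credits):
        """Pick a send protocol; credits model receiver-side buffer space."""
        if msg_size <= EAGER_THRESHOLD and send_credits > 0:
            return "eager"        # copy into a pre-posted buffer, one message
        return "rendezvous"       # RTS/CTS handshake, then zero-copy transfer

    print(choose_protocol(1024, send_credits=4))     # eager
    print(choose_protocol(1 << 20, send_credits=4))  # rendezvous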

Our research has already made notable contributions in the areas of cluster computing and InfiniBand. A large part of it has been integrated into our MVAPICH software, a high performance and scalable MPI implementation over InfiniBand. Our software is currently used by more than 120 organizations worldwide to build InfiniBand clusters, including both research testbeds and production systems. Some of the fastest supercomputers in the world, including the third-ranked Virginia Tech Apple G5 cluster, are currently powered by MVAPICH. Research in this dissertation will also have an impact on the design of communication subsystems for systems beyond high performance computing and for other high-speed interconnects.

Committee:

Dhabaleswar Panda (Advisor)

Subjects:

Computer Science

Keywords:

Cluster Computing; InfiniBand; MPI; High Performance Computing; High Speed Interconnect; Performance; Scalability

Gao, Xiaoyang. Integrated compiler optimizations for tensor contractions
Doctor of Philosophy, The Ohio State University, 2008, Computer and Information Science
This dissertation addresses several performance optimization issues in the context of the Tensor Contraction Engine (TCE), a domain-specific compiler that synthesizes parallel, out-of-core programs for a class of scientific computations encountered in computational chemistry and physics. The domain of our focus is electronic structure calculations, where many computationally intensive components are expressible as a set of tensor contractions. These scientific applications are extremely compute-intensive and consume significant computer resources at national supercomputer centers. Manual development of high-performance parallel programs for them is usually very tedious and time-consuming. The TCE system is targeted at reducing the burden on application scientists by having them specify computations in a high-level form, from which efficient parallel programs are automatically synthesized. The goal of this research is to develop an optimization framework to derive high-performance implementations for a set of given tensor contractions. In particular, the issues investigated include: 1) development of an efficient in-memory parallel algorithm for a tensor contraction; 2) design of a performance-model-driven framework for a parallel out-of-core tensor contraction; 3) design of an integrated optimization framework that incorporates multiple high-level program transformation techniques to improve locality and parallelism for a set of tensor contractions, covering three topics: 3.1) modeling interactions between different transformations and assessing their impact on overall performance, 3.2) using a search-based strategy to explore the combined space of transformations, with efficient pruning methods to reduce the search space, and 3.3) using performance-driven models to identify the best combination and generate high-performance code; and 4) incorporation of domain-specific optimizations such as sparsity and symmetry. We have implemented and evaluated most of the proposed algorithms and optimization strategies. Experimental results show that they work effectively in practice.
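
To fix ideas, a single tensor contraction from this class can be written as nested loops or as one optimized library call; the sketch below (illustrative shapes and index names, not TCE output) shows both forms computing C[i,j] = sum over k,l of A[i,k,l] * B[l,k,j].

    # One tensor contraction, written naively and as an einsum call.
    import numpy as np

    def contract_naive(A, B):
        """C[i,j] = sum_{k,l} A[i,k,l] * B[l,k,j] via explicit loops."""
        I, K, L = A.shape
        _, _, J = B.shape
        C = np.zeros((I, J))
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    for l in range(L):
                        C[i, j] += A[i, k, l] * B[l, k, j]
        return C

    A = np.random.rand(4, 3, 5)
    B = np.random.rand(5, 3, 6)
    C = np.einsum('ikl,lkj->ij', A, B)           # same contraction, optimized
    print(np.allclose(C, contract_naive(A, B)))  # True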

Committee:

Ponnuswamy Sadayappan (Advisor)

Subjects:

Computer Science

Keywords:

Compiler optimization; Loop transformations; High-performance computing; Parallel algorithms; Out-of-core algorithms

Elango, Venmugil. Techniques for Characterizing the Data Movement Complexity of Computations
Doctor of Philosophy, The Ohio State University, 2016, Computer Science and Engineering
The execution cost of a program, both in time and in energy, comprises computational cost and data movement cost (e.g., the cost of transferring data between CPU and memory devices, or between parallel processors). Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity alone will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data movement complexity will be increasingly important. In their seminal work, Hong & Kung proposed the red-blue pebble game to model the data movement complexity of algorithms. Using the pebble game abstraction, Hong & Kung proved tight asymptotic lower bounds for the data movement complexity of several algorithms by reformulating the problem as a graph partitioning problem. In this dissertation, we develop a novel alternative lower-bounding technique based on graph min-cuts. Using our technique, we derive tight lower bounds for different algorithms, with upper bounds matching within a constant factor. Further, we develop a dynamic-analysis-based automated heuristic for our technique, which enables automatic analysis of arbitrary computations, and we provide several use cases for this automated approach. This dissertation also presents a technique, built upon the ideas of Christ et al., to derive asymptotic parametric lower bounds for a sub-class of computations called affine computations, along with a static-analysis-based heuristic to automatically derive parametric lower bounds for the affine parts of computations. Motivated by the emerging interest in large-scale parallel systems with interconnection networks and hierarchical caches with varying bandwidths at different levels, we extend the pebble game model to parallel system architectures to characterize the data movement requirements of large-scale parallel computers, providing insights into the architectural bottlenecks that limit the performance of algorithms on these machines. Finally, using data movement complexity analysis in conjunction with the roofline model for performance bounds, we perform an algorithm-architecture codesign exploration across an architectural design space, modeling the maximal achievable performance and energy efficiency of different algorithms for a given VLSI technology under different architectural parameters.
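
The roofline side of the codesign analysis reduces to a one-line bound; the sketch below (with hypothetical machine parameters, not the dissertation's) shows how attainable performance is capped by the smaller of peak compute and bandwidth times arithmetic intensity, which is where the data movement lower bounds enter: blocking raises intensity toward what those bounds permit.

    # Worked roofline bound with hypothetical machine parameters.
    def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
        """Attainable GFLOP/s = min(peak, bandwidth * arithmetic intensity)."""
        return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

    peak, bw = 500.0, 50.0      # hypothetical: 500 GFLOP/s peak, 50 GB/s
    # Illustrative intensities: untiled kernels are bandwidth-bound, while
    # cache blocking raises intensity (to O(sqrt(M)) for matrix multiply).
    for name, ai in [("stream triad", 0.08), ("naive matmul", 0.25),
                     ("blocked matmul", 16.0)]:
        print(f"{name:>14}: {roofline(peak, bw, ai):7.1f} GFLOP/s bound")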

Committee:

P Sadayappan (Advisor); Fabrice Rastello (Committee Member); Atanas Rountev (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Science

Keywords:

High-Performance Computing; IO complexity; Data movement complexity; Data access complexity; Red-blue pebble game; Lower bounds

Hartley, Timothy D. R. Accelerating Component-Based Dataflow Middleware with Adaptivity and Heterogeneity
Doctor of Philosophy, The Ohio State University, 2011, Electrical and Computer Engineering
This dissertation presents research into the development of high performance dataflow middleware and applications on heterogeneous, distributed-memory supercomputers. We present state-of-the-art, coarse-grained ad hoc techniques for optimizing the performance of real-world, data-intensive applications in biomedical image analysis and radar signal analysis on clusters of computational nodes equipped with multi-core microprocessors and accelerator processors, such as the Cell Broadband Engine and graphics processing units. Studying the performance of these applications yields valuable insights into the parameters that must be tuned for efficiency, because, as large-scale, data-intensive scientific applications, they are representative of what researchers in these fields need to conduct innovative science. Our approach shows that multi-core processors and accelerators can be used cooperatively to achieve application performance many orders of magnitude above that of naive reference implementations. Additionally, a fine-grained programming framework and runtime system for the development of dataflow applications on accelerator processors such as the Cell is presented, along with an experimental study showing that our framework attains the peak performance of such architectures at a fraction of the cognitive cost to developers. We then present an adaptive technique for automating the coarse-grained ad hoc optimizations we developed for tuning the decomposition of application data and tasks for parallel execution on distributed, heterogeneous processors. We show that our technique achieves high performance while significantly reducing the burden placed on the developer to manually tune the relevant parameters of distributed dataflow applications. We evaluate our technique on three real-world applications and show that it performs favorably compared to three state-of-the-art distributed programming frameworks. Finally, by bringing our adaptive dataflow middleware to bear on supporting alternative programming paradigms, we show that our technique is flexible and widely applicable.
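
The coarse-grained dataflow style at issue can be sketched as stages connected by queues, each stage assignable to a different processor; the toy pipeline below (illustrative, not the middleware's API) streams work items through two filter stages.

    # Toy coarse-grained dataflow pipeline: stages connected by queues.
    import threading, queue

    def stage(fn, inbox, outbox):
        """Pull work items, apply this stage's filter, push results on."""
        while True:
            item = inbox.get()
            if item is None:            # poison pill: propagate and stop
                outbox.put(None)
                return
            outbox.put(fn(item))

    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    decode = threading.Thread(target=stage, args=(lambda x: x * 2, q0, q1))
    analyze = threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2))
    decode.start(); analyze.start()

    for chunk in range(5):              # e.g., image tiles streamed through
        q0.put(chunk)
    q0.put(None)
    decode.join(); analyze.join()
    results = []
    while (r := q2.get()) is not None:
        results.append(r)
    print(results)                      # [1, 3, 5, 7, 9]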

Committee:

Umit Catalyurek, PhD (Advisor); Fusun Ozguner, PhD (Committee Member); Charles Klein, PhD (Committee Member)

Subjects:

Computer Engineering; Computer Science

Keywords:

High Performance Computing; GPUs; Heterogeneous Computing; Run-time Systems; Middleware
