Search Results (1 - 6 of 6)

Gideon, John. The Integration of LlamaOS for Fine-Grained Parallel Simulation
MS, University of Cincinnati, 2013, Engineering and Applied Science: Computer Engineering
LlamaOS is a custom operating system that provides much of the basic functionality needed for low-latency applications. It is designed to run in a Xen-based virtual machine on a Beowulf cluster of multi/many-core processors. The software architecture of llamaOS is decomposed into two main components, namely the llamaNET driver and llamaApps. The llamaNET driver contains Ethernet drivers and manages all node-to-node communications between user application programs that are contained within a llamaApp instance. Typically, each node of the Beowulf cluster will run one instance of the llamaNET driver with one or more llamaApps bound to parallel applications. These capabilities provide a solid foundation for the deployment of MPI applications, as evidenced by our initial benchmarks and case studies. However, a message passing standard still needed to be either ported or implemented in llamaOS. To minimize latency, llamaMPI was developed as a new implementation of the Message Passing Interface (MPI) that is compliant with the core MPI functionality. This provides a standardized and easy way to develop for the new system. Performance assessment of llamaMPI was achieved using both standard parallel computing benchmarks and a locally (but independently) developed program that executes parallel discrete event-driven simulations. In particular, the NAS Parallel Benchmarks are used to show the performance characteristics of llamaMPI. In the experiments, most of the NAS Parallel Benchmarks ran faster than, or equal to, their native performance. The benefit of llamaMPI was also shown with the fine-grained parallel application WARPED. The order-of-magnitude lower communication latency of llamaMPI greatly reduced the amount of time the simulation spent in rollbacks. This resulted in an overall faster and more efficient computation, because less time was spent off the critical path due to causality errors.
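
The latency claims above rest on standard MPI point-to-point timing. As a minimal sketch (generic MPI, not llamaMPI source; since llamaMPI is described as compliant with core MPI, a portable benchmark like this should build against it unchanged), a ping-pong microbenchmark that estimates one-way small-message latency:

```cpp
// Minimal MPI ping-pong latency sketch. Run with two ranks, e.g.:
//   mpirun -np 2 ./pingpong
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int kIters = 10000;
    char buf[8] = {};                 // small message: latency-dominated
    MPI_Barrier(MPI_COMM_WORLD);      // synchronize before timing
    double t0 = MPI_Wtime();
    for (int i = 0; i < kIters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)                    // half the round trip = one-way latency
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * kIters) * 1e6);
    MPI_Finalize();
    return 0;
}
```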

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

Parallel Computing; Time Warp Simulation; MPI; Operating Systems; Beowulf Cluster; Parallel Discrete Event Simulation

Muthalagu, Karthikeyan. Threaded WARPED: An Optimistic Parallel Discrete Event Simulator for Cluster of Multi-Core Machines
MS, University of Cincinnati, 2012, Engineering and Applied Science: Computer Engineering

Parallel Discrete Event Simulation (PDES) has been an active area of research for many years. Studies with PDES have focused primarily on either shared memory or distributed memory platforms. However, the emergence of low-cost multi-core and many-core processors suitable for use in Beowulf clusters provides an opportunity for PDES execution on a platform containing both shared memory and distributed memory parallelism. This thesis explores the migration of an existing PDES simulation kernel called WARPED to a Beowulf cluster of many-core processors. More precisely, WARPED is an optimistically synchronized PDES simulation kernel that implements the Time Warp paradigm. It was originally designed for efficient execution on single-core Beowulf clusters. The work of this thesis extends the WARPED kernel to support parallel threaded execution on each node as well as parallelism between the nodes of the cluster. The new version of WARPED is called threaded WARPED.

In this thesis, WARPED is redesigned with thread-safe data structures protected by various constructs. In particular, atomic instructions are used to deploy lock-free data structures and synchronization. With the addition of threads to WARPED, the work also required adjustments and extensions to several of the subalgorithms of Time Warp; in particular, adjustments to the algorithm for computing Global Virtual Time (GVT) and to termination detection were required. This thesis explains the modifications made to implement threaded WARPED and evaluates the performance capabilities of the two solutions for managing the shared data structures.
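
As an illustration of the lock-free approach the abstract describes, here is a minimal Treiber-stack sketch built on C++11 atomic compare-and-swap. All names are invented for this sketch; threaded WARPED's actual structures, and its handling of memory reclamation and the ABA problem, are more involved.

```cpp
#include <atomic>

struct Event {
    long timestamp = 0;
    Event* next = nullptr;
};

// Treiber stack: push/pop via compare-and-swap instead of a lock.
// NOTE: a production pop() must also solve ABA and safe reclamation
// (hazard pointers, epochs, etc.); omitted here for brevity.
class LockFreeStack {
    std::atomic<Event*> head{nullptr};
public:
    void push(Event* e) {
        e->next = head.load(std::memory_order_relaxed);
        // Retry until no other thread changed head in between.
        while (!head.compare_exchange_weak(e->next, e,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }
    Event* pop() {
        Event* top = head.load(std::memory_order_acquire);
        while (top &&
               !head.compare_exchange_weak(top, top->next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {}
        return top;  // nullptr if the stack was empty
    }
};
```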

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Wen Ben Jone, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

PDES; Parallel Discrete Event Simulation; Parallel Simulation; Parallel Programming

Carver, Eric R. Reducing Network Latency for Low-cost Beowulf Clusters
MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Engineering
Parallel Discrete Event Simulation (PDES) is a fine-grained parallel application that can be difficult to optimize on distributed Beowulf clusters. A significant challenge on these compute platforms is the relatively high network latency compared to the high CPU performance on each node. The frequent communication and high network latency mean that event information communicated between nodes can arrive after a significant delay, during which the processing node is either waiting for the event to arrive (conservatively synchronized solutions) or prematurely processing events while the transmitted event is in transit (optimistically synchronized solutions). Thus, solutions that reduce network latency are crucial to the deployment of PDES. Conventional attacks on network latency in cluster environments use high-priced hardware such as InfiniBand and/or lightweight messaging layers other than TCP/IP. However, clusters are generally high-cost systems (tens to hundreds of thousands of dollars) that, by necessity, must be shared. The use of lower-latency hardware such as InfiniBand can nearly double the hardware cost, and replacing the TCP/IP network stack on a shared platform is generally infeasible: other users of the shared platform (with coarse-grained parallel computations) are well served by the TCP/IP stack and unwilling to rewrite their applications to use the APIs of alternate network stacks. Furthermore, configuring the hardware with multiple messaging transport layers is quite difficult and not generally supported. Low-cost, small-form-factor compute nodes with multi-core processing chips are becoming widely available. These solutions have lower-performing compute nodes yet often still support 100Mb/1Gb Ethernet hardware (reducing the network latency/processor performance disparity). The much lower per-node cost (on the order of $200 per node) can enable the deployment of non-shared, dedicated clusters and thus may be an attractive platform for network customization in support of PDES applications. This thesis explores this option using ODROID compute nodes for the cluster. The conventional TCP/IP networking stack is replaced with the (publicly available) RDMA over Converged Ethernet (RoCE) networking layer, which has significantly lower latency costs. We find that the RoCE solution is capable of reducing end-to-end small-message latency by more than 30%. This translates to a performance improvement of greater than 10% (compared to the TCP/IP solution) for PDES applications using Rensselaer's Optimistic Simulation System (ROSS). However, when comparing the ODROID-based cluster's performance against its cost, both in terms of operations per second and PDES performance, we find that its performance does not justify its price for either application.
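
For context on the baseline being attacked, a minimal sketch of an end-to-end small-message latency probe over TCP/IP (the stack RoCE replaces). This is not the thesis's benchmark code; the port, payload size, and iteration count are arbitrary choices for illustration.

```cpp
// Usage: pingpong server          (echo side)
//        pingpong client <ip>     (measuring side)
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    const uint16_t kPort = 5000;          // arbitrary test port
    const int kIters = 10000;
    char buf[8] = {};                     // 8-byte payload: latency-dominated
    if (argc < 2) { fprintf(stderr, "usage: %s server|client <ip>\n", argv[0]); return 1; }
    bool server = strcmp(argv[1], "server") == 0;

    int s = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(kPort);
    int one = 1;

    if (server) {
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(s, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
        listen(s, 1);
        int c = accept(s, nullptr, nullptr);
        setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);  // no Nagle
        for (int i = 0; i < kIters; ++i) {                          // echo loop
            recv(c, buf, sizeof buf, MSG_WAITALL);
            send(c, buf, sizeof buf, 0);
        }
        close(c);
    } else {
        inet_pton(AF_INET, argv[2], &addr.sin_addr);
        connect(s, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
        setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i) {
            send(s, buf, sizeof buf, 0);
            recv(s, buf, sizeof buf, MSG_WAITALL);
        }
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("average round-trip: %.2f us\n", us / kIters);
    }
    close(s);
    return 0;
}
```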

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Wen Ben Jone, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

High Performance Computing; Cluster; Parallel Discrete Event Simulation; InfiniBand; RoCE; Time Warp

King, Randall. WARPED Redesigned: An API and Implementation for Discrete Event Simulation Analysis and Application Development
MS, University of Cincinnati, 2011, Engineering and Applied Science: Computer Engineering

In 1995, researchers at the University of Cincinnati released WARPED as a publicly available discrete event simulation kernel. The goal of the project was to provide a system for research and analysis of the Time Warp distributed simulation synchronization protocol. WARPED was to be efficient, maintainable, flexible, configurable, and portable. It was written in C++ and used the Message Passing Interface (MPI) standard to accommodate as many parallel platforms as possible. As the software implementation was expanded with additional capabilities and optimizations, several problems with the original design became apparent. The primary problem was that the configuration of the various Time Warp optimizations could only be made at compile time. As simulations increased in size and complexity, this compile time became a significant burden. Another problem, related to the first, was that WARPED could not be used and distributed as a shared library due to the compile-time configuration requirement.

This thesis discusses the design and implementation of the Time Warp mechanism in a new version of WARPED, now called the WARPED v2.x series (the initial series is now called the WARPED v1.x series). The primary goal of WARPED v2.x is to provide run-time configuration of the system; the goals of the previous version carry over to the new version. Several simulation models are also included in the initial release of the WARPED v2.0 distribution for use in analyzing the system. In this initial version of WARPED v2.x, the system includes sequential and parallel simulation kernels that can be configured at run time for use with any simulation model compliant with the WARPED API. The parallel simulation kernel uses the Time Warp distributed synchronization mechanism and includes several Time Warp optimizations, including various cancellation strategies, fossil collection algorithms, GVT estimation algorithms, state saving algorithms, event list structures, scheduling algorithms, and support for multiple communication protocols (all organized to support run-time configuration/selection). This thesis presents the issues and difficulties of implementing the optimizations along with the solutions used. The optimizations are analyzed using performance data and system profiling. With the available simulations and extensible design, WARPED v2.0 can be used to explore new optimizations for the Time Warp mechanism.
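
A minimal sketch of the run-time configuration idea described above: a Time Warp subalgorithm is selected from a configuration value at startup rather than fixed at compile time. The class and option names here (GVTManager, makeGVTManager, "gvt.algorithm") are invented for illustration and are not the actual WARPED v2.x API.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct GVTManager {                       // one Time Warp subalgorithm family
    virtual ~GVTManager() = default;
    virtual void estimateGVT() = 0;
};
struct MatternGVT : GVTManager { void estimateGVT() override { /* ... */ } };
struct SamadiGVT  : GVTManager { void estimateGVT() override { /* ... */ } };

// Factory keyed by a string taken from the run-time configuration.
std::unique_ptr<GVTManager> makeGVTManager(const std::string& name) {
    if (name == "mattern") return std::make_unique<MatternGVT>();
    if (name == "samadi")  return std::make_unique<SamadiGVT>();
    throw std::runtime_error("unknown GVT algorithm: " + name);
}

int main() {
    // In a real kernel these pairs would be parsed from a config file,
    // allowing the same shared library to serve every configuration.
    std::map<std::string, std::string> config{{"gvt.algorithm", "mattern"}};
    auto gvt = makeGVTManager(config.at("gvt.algorithm"));
    gvt->estimateGVT();                   // kernel uses the selected algorithm
}
```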

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

Parallel Discrete Event Simulation; Time Warp; Distributed Simulation

Hay, Joshua A. Experiments with Hardware-based Transactional Memory in Parallel Simulation
MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Engineering
Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It does so by tracking memory accesses performed within a transactional region, or critical section, and detecting when memory operations conflict with other threads. Transactional memory provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. Parallel Discrete Event Simulation (PDES) is a problem space that has been studied for many years, but it still suffers from significant lock contention on SMP platforms. The pending event set is a crucial element of PDES, and its management is critical to simulation performance. This is especially true for optimistically synchronized PDES, such as simulators implementing the Time Warp protocol: rather than preventing causality errors, they aggressively schedule and execute events until a causality error is detected. This thesis explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, this thesis examines the use of Intel's hardware transactional memory, TSX, to manage shared access to the pending event set by the simulation threads. In conjunction with transactional memory, other solutions to contention are explored, such as the use of multiple queues to hold the pending event set and the dynamic binding of threads to these queues. For each configuration, conventional locking mechanisms and transactional memory access are compared within the WARPED parallel simulation kernel. This testing evaluates both forms of transactional memory (HLE and RTM) implemented in the Haswell architecture. The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, by as much as 27%.
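
A minimal sketch of the standard RTM usage pattern the abstract describes: the critical section on the pending event set runs as a hardware transaction, with a conventional lock as the mandatory fallback when the transaction aborts. This is illustrative only (the thesis's queue structures and tuning are not reproduced); it requires a TSX-capable CPU and compilation with -mrtm.

```cpp
#include <immintrin.h>
#include <atomic>
#include <queue>

// Test-and-set spinlock used as the fallback when a transaction aborts.
std::atomic<bool> locked{false};
std::priority_queue<long> pending;        // stand-in for the pending event set

static void lock_fallback() {
    while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
}
static void unlock_fallback() {
    locked.store(false, std::memory_order_release);
}

void insert_event(long timestamp) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Subscribe to the lock: if another thread holds it, abort so the
        // transaction never runs concurrently with a locked critical section.
        if (locked.load(std::memory_order_relaxed)) _xabort(0xff);
        pending.push(timestamp);          // transactional critical section
        _xend();                          // commit
    } else {
        // Transaction aborted (conflict, capacity, allocation, ...):
        // fall back to the conventional lock, which must always exist.
        lock_fallback();
        pending.push(timestamp);
        unlock_fallback();
    }
}
```

HLE, by contrast, is applied by prefixing existing lock instructions (XACQUIRE/XRELEASE), so legacy locking code can elide locks without being restructured this way.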

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

transactional memory; TSX; parallel simulation; parallel discrete event simulation; PDES; lock contention

Alt, Aaron J. Profile Driven Partitioning of Parallel Simulation Models
MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Engineering
A considerable amount of research into effective parallelization of discrete event driven simulation has been conducted over the past few decades. However, most of this research has targeted the parallel simulation infrastructure, focusing on data structures, algorithms, and synchronization methods for the parallel and distributed simulation kernels. While this focus has successfully improved and refined the performance of parallel discrete event simulation kernels, little effort has been directed toward analyzing and preparing the simulation model itself for parallel execution. Model-specific optimizations could have significant performance implications, but they have been largely ignored. This problem is compounded by the lack of a widely used simulation and modeling language for many domains. The lack of a common language is, however, not insurmountable. For example, the partitioning and assignment of objects from the simulation model onto the hardware platform is generally performed by the simulation infrastructure. While partitioning can have dramatic impacts on the communication frequencies between the concurrently executing objects, most existing parallel simulation infrastructures do little to address this opportunity. This thesis addresses the partitioning and assignment of objects within a simulation model for parallel execution. The specific target of this effort is to develop a partitioning and assignment strategy for use in the WARPED parallel simulation kernel that has been developed and maintained at the University of Cincinnati. The focus of the work is to develop a general-purpose solution that can function for any simulation model that has been prepared for execution on the WARPED kernel. The specific solution exploits a sequential kernel from the WARPED project to pre-simulate the simulation model and obtain profile data on the frequency of events communicated between objects. This event frequency data is then used to build partitions that minimize the number of events exchanged between objects in different partitions. The partition information is then used during the initialization sequences of the WARPED kernel to assign each partition to a unique processing node in the parallel cluster. This method is independent of the simulation model and compute platform. Experimental results with existing simulation models from the WARPED project show that this method can achieve up to a six-fold improvement in run time over the naive partitioning algorithm previously used by the WARPED kernel.
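
As a sketch of the profile-driven idea (not the thesis's actual algorithm, which is not reproduced here), event counts from a sequential pre-simulation can be treated as edge weights in an object graph, with heavily communicating objects greedily packed into the same partition. All names are invented for illustration.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

struct Edge { int a, b; long events; };   // events exchanged by objects a, b

static int find(std::vector<int>& p, int x) {
    while (p[x] != x) x = p[x] = p[p[x]]; // union-find with path halving
    return x;
}

// Greedily merge the chattiest object pairs until partitions reach `cap`,
// so the heaviest communication becomes intra-partition.
std::vector<int> partition(int nObjects, std::vector<Edge> edges, int cap) {
    std::vector<int> parent(nObjects), size(nObjects, 1);
    std::iota(parent.begin(), parent.end(), 0);
    std::sort(edges.begin(), edges.end(),
              [](const Edge& x, const Edge& y) { return x.events > y.events; });
    for (const Edge& e : edges) {
        int ra = find(parent, e.a), rb = find(parent, e.b);
        if (ra != rb && size[ra] + size[rb] <= cap) {
            parent[rb] = ra;
            size[ra] += size[rb];
        }
    }
    for (int i = 0; i < nObjects; ++i) parent[i] = find(parent, i);
    return parent;                        // partition id per object
}

int main() {
    // Toy profile: objects 0-1 and 2-3 communicate heavily.
    std::vector<Edge> profile{{0, 1, 900}, {1, 2, 50}, {2, 3, 800}, {0, 3, 10}};
    auto part = partition(4, profile, 2);
    for (int i = 0; i < 4; ++i) printf("object %d -> partition %d\n", i, part[i]);
}
```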

Committee:

Philip Wilsey, Ph.D. (Committee Chair); Fred Beyette, Ph.D. (Committee Member); Karen Davis, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Engineering

Keywords:

DES; profile guided partitioning; PDES; discrete event simulation; parallel discrete event simulation; profiling