Fault-tolerance on parallel systems has always been a major challenge for High Performance Computing (HPC), and hence it has drawn considerable attention from the community. The pursuit of fault-tolerant systems is now more important than ever due to recent advances in hardware. As the emergence of first multi-core and more recently many-core machines evinces, computing power is constantly being increased by adding more processing cores, resulting in more parallelism. To satisfy this demand and to increase the power of individual components, chips are manufactured with decreasing feature sizes. Another trend is power optimization, since it may not be feasible to run all system resources at their peak levels all the time due to factors such as heat dissipation and maintaining a total power budget. These trends in hardware also change the way scientific applications are implemented. The community designs new and diverse parallel programming models to harness the available computing power of new hardware architectures. These models provide additional support to programmers so that they can achieve scalable performance by tuning applications via additional APIs, specifications, or annotations.
Unfortunately, these changes in hardware and software also bring new challenges. For instance, the increasing number of components in HPC systems results in an increasing probability of failure. Trends such as decreasing feature sizes and low-voltage computing cause more frequent bit flips. Lastly, when incorporated incorrectly or inaccurately, programmer specifications for performance tuning may introduce errors during execution. Considering these new problems, the community foresees that Mean Time Between Failures (MTBF) rates will decrease so significantly in the future that current fault-tolerance solutions will become inapplicable.
In this dissertation, we introduce fault-tolerance solutions in the context of existing and new parallel programming models and query and data-analysis frameworks. Our solutions target the three types of failures commonly seen: fail-stop failures, soft errors, and programmer-induced errors. With the proposed solutions, we address the following key challenges. (1) Replication is a standard technique employed in big data analysis platforms to ensure the availability of the underlying data in the presence of fail-stop failures. How should we create and organize data replicas so as to guarantee efficient recovery while preserving load balance among the remaining processing units when failures occur? (2) Programming models are expected to play a key role in overcoming the challenges of future HPC systems, including resilience. Can we design a programming model that exposes the core execution state and the most critical computations in an application through a set of programming abstractions? (3) With the help of these abstractions, can such a programming model automate application-level checkpointing and reduce the amount of checkpointed state? Can we use the same knowledge to detect silent data corruptions with low overhead by executing a subset of an application's computations redundantly? (4) For checkpoint/restart solutions, can we design recovery techniques that make no assumptions about the number of processing units with which the execution is restarted? (5) Fault-tolerance has mostly been addressed in the context of the SPMD paradigm. Is it possible to design fault-tolerance solutions against soft errors in different parallel programming paradigms, such as the task graph execution model? (6) In addition to fail-stop failures and soft errors due to manufacturing issues and machine defects, can we also deal with the potential failures induced by programmer specifications while tuning an application to improve performance?
First, we presented the design and implementation of a fault-tolerant environment for processing queries and data-analysis tasks on large scientific datasets. For two common query and data-analysis tasks, we first provided a framework that employs standard data-indexing techniques and achieves high execution efficiency when there are no failures. Then, we showed how the framework recovers efficiently from failures of up to a certain number of nodes and still maintains load balance among the remaining nodes after recovery completes. We achieved these goals by developing a data replication scheme, which we refer to as subchunk or subpartition replication. Our extensive evaluation demonstrated that our replication scheme outperforms traditional solutions.
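The intuition behind subchunk replication can be sketched as follows: instead of mirroring each node's chunks onto a single backup node, every chunk is split into subchunks whose replicas are scattered across the remaining nodes, so that when a node fails its recovery load is absorbed evenly by its peers. A minimal Python sketch, where the function name, the modular chunk ownership, and the round-robin placement policy are all illustrative assumptions rather than the dissertation's exact scheme:

```python
def place_subchunk_replicas(num_nodes, num_chunks, subchunks_per_chunk):
    """Scatter the subchunk replicas of each chunk across all nodes
    except the chunk's owner, so a failed owner's recovery work is
    spread over many nodes instead of falling on one backup peer."""
    placement = {}
    for chunk in range(num_chunks):
        owner = chunk % num_nodes  # illustrative ownership rule
        for sub in range(subchunks_per_chunk):
            # Round-robin over the (num_nodes - 1) non-owner nodes;
            # the offset 1 + sub % (num_nodes - 1) never lands on owner.
            target = (owner + 1 + sub % (num_nodes - 1)) % num_nodes
            placement[(chunk, sub)] = (owner, target)
    return placement
```

Under this policy, losing any single node leaves each of its chunks reconstructible from subchunk replicas spread across every surviving node, which is what preserves load balance after recovery.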
Second, we focused on designing a parallel programming paradigm that models computation and communication in iterative scientific applications through an underlying domain and the interactions among domain elements. With proper abstractions, the proposed model hides the details of inter-process communication and work partitioning (including re-partitioning in the presence of heterogeneous processing cores) from users. More importantly, it captures the most critical execution state and instructions in an application through the concepts of the compute-function and the computation-space object. The model supports automated, yet efficient, application-level checkpointing and, at the same time, detects soft errors that occur in processing cores and corrupt the main application state by means of a low-overhead redundant execution strategy. We analyzed the performance of our programming model under various scenarios on both homogeneous and heterogeneous configurations.
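To illustrate the idea of exposing state through abstractions, the following Python sketch (hypothetical names; not the actual model's API) shows a computation-space object whose data is all a runtime needs to checkpoint, and an iteration driver that redundantly re-executes a sampled subset of compute-function invocations to flag silent data corruption:

```python
import copy
import random

class ComputationSpace:
    """Core execution state exposed to the runtime; an application-level
    checkpoint saves only this object, not the whole process image."""
    def __init__(self, data):
        self.data = list(data)

def run_iteration(space, compute_fn, redundancy_fraction=0.1):
    """Apply the registered compute-function to every element, then
    redundantly re-execute a sampled subset and compare results to
    detect a soft error that corrupted the main application state."""
    out = [compute_fn(x) for x in space.data]
    k = max(1, int(len(out) * redundancy_fraction))
    for i in random.sample(range(len(out)), k):
        if compute_fn(space.data[i]) != out[i]:
            raise RuntimeError(f"soft error detected at element {i}")
    space.data = out

def checkpoint(space):
    # Only the computation-space data is saved.
    return copy.deepcopy(space.data)

def restore(space, snapshot):
    space.data = copy.deepcopy(snapshot)
```

Because the runtime knows exactly which object holds the critical state and which function mutates it, both the checkpoint size and the amount of redundant work stay small, which is the source of the low overheads claimed above.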
Next, we directed our attention to the task graph execution model, a parallel programming paradigm different from Single Program Multiple Data (SPMD), for which most of the existing fault-tolerance solutions in the literature have been proposed. We designed a fault-tolerant dynamic task graph scheduling algorithm that recovers corrupted data blocks and metadata in a task graph from an arbitrary number of soft errors with low time and space overheads. We provided a task re-execution algorithm that is selective in the sense that only the corrupted portion of the task graph is recovered. Furthermore, as opposed to traditional checkpoint/restart solutions, recovery is performed in a non-collective fashion: only the threads observing the failure participate in the recovery process, whereas the remaining threads continue normal execution. We evaluated our fault-tolerance solution extensively under different failure scenarios and showed that recovery overheads are negligible for the common case of a small number of failures.
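The selective recovery idea can be sketched as follows: the thread that observes a corrupted block walks backwards through the producers of that block, re-executing only the tasks along the corrupted portion of the graph while everything else stays untouched. A minimal Python sketch, with illustrative names and structures rather than the actual scheduler's:

```python
from collections import namedtuple

# A task reads some input blocks and writes one output block.
Task = namedtuple("Task", ["name", "inputs", "output"])

def recover_block(block, producers, valid, log):
    """Selectively rebuild a corrupted block: recurse into the producer's
    inputs first (so dependencies are rebuilt in order), then re-execute
    only that producer. Valid blocks are never touched."""
    if valid.get(block, False):
        return
    if block not in producers:
        # An original input block with no producer: assumed durable here.
        valid[block] = True
        return
    task = producers[block]
    for inp in task.inputs:
        recover_block(inp, producers, valid, log)
    log.append(task.name)  # stand-in for actually re-running the task
    valid[block] = True
```

Only the observing thread executes this routine, which is what makes the recovery non-collective: threads working on valid parts of the graph never synchronize with it.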
As our last work, we focused on another type of failure, one caused by the tuning efforts of runtime software and programmers to improve parallel execution performance. First, we proposed a memory management scheme to reduce the total memory consumption of applications expressed as task graphs. The presented optimization technique is based on recycling data blocks among tasks and, in contrast to common use-count-based memory allocators, it is able to handle task graphs with dynamic dependence relations efficiently. Recycling operations are dictated by functions that are either explored automatically by the runtime or specified explicitly through user annotations. Either way, an incorrect recycling operation might lead to data races and erroneous program output. Therefore, to detect such cases while still benefiting from data-block recycling, we next proposed two algorithms that prune the space of candidate recycling functions and efficiently recover from the effects of any invalid choice of recycling operation during execution. We demonstrated that the proposed schemes reduce total memory consumption significantly while avoiding any potential hazards.
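One way a runtime can detect that an invalid recycling choice reused a block too early is to version each block's logical lifetime and let consumers check the version they captured before reading. The following Python sketch illustrates this guard under that assumption; the class and its policy are hypothetical, not the dissertation's algorithms:

```python
class RecyclingAllocator:
    """Recycle data blocks among tasks instead of allocating fresh ones.
    Each allocation starts a new logical lifetime with a bumped version;
    a consumer that captured an older version detects the premature
    reuse via check() and the runtime can then undo or redo its work."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.free = []       # blocks returned by recycling operations
        self.versions = {}   # id(block) -> current lifetime version

    def alloc(self):
        if self.free:
            blk = self.free.pop()      # recycle: no new memory consumed
        else:
            blk = bytearray(self.block_size)
            self.versions[id(blk)] = 0
        self.versions[id(blk)] += 1    # new logical lifetime begins
        return blk, self.versions[id(blk)]

    def recycle(self, blk):
        self.free.append(blk)

    def check(self, blk, expected_version):
        """True iff the block still belongs to the lifetime the caller saw."""
        return self.versions[id(blk)] == expected_version
```

A stale consumer holding version 1 of a block that has since been recycled and re-allocated (now version 2) fails the check instead of silently reading overwritten data, turning a potential data race into a detectable, recoverable event.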
Keywords: fault-tolerance, fail-stop failures, soft errors, programmer errors, application-level checkpointing, replication, task re-execution, recovery, soft error detection, big data processing, SPMD, task graph scheduling, memory management