Computer systems are evolving all the time. In particular, the two most fundamental components, the compute unit and the storage unit, have undergone dramatic changes in recent years. On the compute side, graphics processing units (GPUs) have emerged as an extremely cost-effective means of achieving high-performance computing. On the storage side, flash-based solid-state drives (SSDs) are revolutionizing the IT industry. While these new technologies have raised the performance of computer systems to a new level, they also bring new challenges to the reliability of these systems.
As a new computing platform, GPUs introduce a novel multi-threaded programming model. As in any multi-threaded environment, data races on GPUs can severely affect the correctness of applications and may lead to data loss or corruption. Similarly, as a new storage medium, SSDs bring potential reliability challenges to the already complicated storage stack. Among other things, the behavior of SSDs under power faults, which occur even in leading data centers, is an important yet largely ignored issue in this dependability-critical area. Besides SSDs, another important layer in the modern storage stack is the database. The atomicity, consistency, isolation, and durability (ACID) properties that modern databases provide make it easy for application developers to build highly reliable applications. However, the ACID properties are far from trivial to provide, particularly when high performance must also be achieved. This leads to complex and error-prone code: even at a low defect rate of one bug per thousand lines, the millions of lines of code in a commercial OLTP database can harbor thousands of bugs.
As a first step towards building robust modern computer systems, this dissertation proposes novel approaches to detecting and exposing reliability issues in three different layers of computer systems.
First, in the application layer, this dissertation proposes a low-overhead method for detecting races in GPU applications. The method combines static analysis with a carefully designed dynamic checker that logs and analyzes information at runtime. The design exploits the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, we leverage static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPU, our approach accurately detects data races without reporting false positives. Our experimental results show that, compared with previous approaches, our method is more effective at detecting races in the evaluated cases and incurs much lower runtime and space overhead.
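To make the idea concrete, the following toy sketch (a simplified illustration, not the dissertation's actual GPU instrumentation; the names and log format are invented here) shows the core of dynamic race detection over a per-thread access log: two accesses race if they come from different threads, touch the same address, include at least one write, and are not separated by a barrier.

```python
# Toy dynamic race detector over a per-thread access log (illustrative only).
from collections import namedtuple

Access = namedtuple("Access", ["tid", "addr", "is_write", "epoch"])
# 'epoch' counts barriers seen so far; accesses in the same epoch are unordered.

def find_races(log):
    races = []
    for i, a in enumerate(log):
        for b in log[i + 1:]:
            if (a.tid != b.tid and a.addr == b.addr
                    and (a.is_write or b.is_write)
                    and a.epoch == b.epoch):
                races.append((a, b))
    return races

# Threads 0 and 1 both write address 0x10 in the same epoch: a race.
log = [
    Access(tid=0, addr=0x10, is_write=True, epoch=0),
    Access(tid=1, addr=0x10, is_write=True, epoch=0),
    Access(tid=0, addr=0x10, is_write=True, epoch=1),  # after a barrier: ordered
]
print(len(find_races(log)))  # 1 racing pair
```

In the sketch, the epoch counter stands in for barrier synchronization; the actual design additionally exploits the GPU's scheduling and execution model, which this toy version does not capture.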
Second, in the device layer, this dissertation proposes an effective framework for exposing reliability issues in SSDs under power faults. The framework includes specially designed hardware to inject power faults directly into devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test fifteen commodity SSDs from five different vendors using more than three thousand fault-injection cycles in total. Our experimental results reveal that thirteen of the fifteen tested SSDs exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
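Failure types like these can be detected by stamping every written block with self-identifying metadata. The sketch below (an illustrative reconstruction under an assumed record layout, not the framework's actual format) shows one such check based on sequence numbers and checksums:

```python
# Illustrative block format: an 8-byte sequence number and a CRC32 of the
# payload, followed by the payload itself. On recovery, the check classifies
# what actually survived a power fault.
import struct
import zlib

BLOCK = 512
HDR = struct.Struct("<QI")  # sequence number, CRC32 of payload

def make_block(seq, payload):
    assert len(payload) == BLOCK - HDR.size
    return HDR.pack(seq, zlib.crc32(payload)) + payload

def classify(raw, expected_seq):
    seq, crc = HDR.unpack(raw[:HDR.size])
    payload = raw[HDR.size:]
    if zlib.crc32(payload) != crc:
        return "corrupt"   # bit corruption or a shorn (partially written) block
    if seq < expected_seq:
        return "stale"     # an acknowledged write never reached the media
    return "ok"

# Simulate a shorn write: half of the new block spliced onto half of the old.
old = make_block(1, b"a" * (BLOCK - HDR.size))
new = make_block(2, b"b" * (BLOCK - HDR.size))
shorn = new[:BLOCK // 2] + old[BLOCK // 2:]
print(classify(new, 2), classify(shorn, 2))  # ok corrupt
```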
Third, in the systems-software layer, this dissertation proposes a novel record-and-replay framework to expose and diagnose violations of the ACID properties in modern databases. The framework includes workloads that exercise the ACID guarantees, a record-and-replay subsystem that allows the controlled injection of simulated power faults, a ranking algorithm that prioritizes injection points based on our experience, and a multi-layer tracer to diagnose root causes. Using the framework, we study eight widely used databases, ranging from open-source key-value stores to high-end commercial OLTP servers. Surprisingly, all eight databases exhibit erroneous behavior. For the open-source databases, we are able to diagnose the root causes using our tracer; for the proprietary commercial databases, we can reproducibly induce data loss.
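The replay idea can be illustrated with a minimal sketch (the trace format and names are hypothetical): record the ordered stream of writes, replay each prefix to simulate a power cut at that point, and check a durability invariant on the recovered state.

```python
# Toy prefix-replay fault injector: every prefix of the write trace is one
# possible post-power-cut disk state.
def replay_prefix(trace, cut):
    disk = {}
    for addr, data in trace[:cut]:
        disk[addr] = data
    return disk

def durable(disk, committed):
    # Every committed key must be present with its committed value.
    return all(disk.get(a) == v for a, v in committed.items())

# Correct ordering: data write first, then the commit record.
trace = [("k1", "v1"), ("commit", "k1=v1")]
committed_after = {0: {}, 1: {}, 2: {"k1": "v1"}}
ok_violations = [cut for cut in range(len(trace) + 1)
                 if not durable(replay_prefix(trace, cut), committed_after[cut])]
print(ok_violations)  # [] -> no power-cut point loses committed data

# Buggy ordering: the commit record reaches disk before the data it covers.
bad = [("commit", "k1=v1"), ("k1", "v1")]
committed_bad = {0: {}, 1: {"k1": "v1"}, 2: {"k1": "v1"}}
bad_violations = [cut for cut in range(len(bad) + 1)
                  if not durable(replay_prefix(bad, cut), committed_bad[cut])]
print(bad_violations)  # [1] -> a cut after the first write loses committed data
```

A real framework must of course record at the device level and model reordering inside the storage stack; the sketch only conveys why replaying prefixes systematically exposes ACID violations.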