Computer Science Department
State University of New York at Stony Brook
Stony Brook, NY 11794
Checkpointing is an essential mechanism for building fault-tolerant parallel and distributed systems. Most previous checkpointing methods were based on software mechanisms. Without hardware assistance, the overhead of these methods becomes increasingly expensive as the performance gap between CPU and I/O widens. Algorithms have been developed to determine the optimal checkpoint interval that minimizes a program's expected overall checkpoint overhead. However, there are cases in which checkpoints are mandatory to preserve a program's execution semantics. In these cases, the checkpoint interval is not an adjustable parameter, and system designers are forced to choose between performance and program integrity. This paper describes a storage architecture called Polar¹ that can take a persistent snapshot of a process's address space image with significantly lower delay than conventional methods. This architecture can be used in a node of a message-passing distributed system or for checkpointing supercomputing applications. Based on a disk-memory mirroring scheme, Polar achieves this performance by exploiting the parallelism between program execution and checkpointing. Moreover, it guarantees the atomicity of a checkpoint transaction across failures, which is unusual among asynchronous checkpointing schemes. Through a performance study based on trace-driven simulation, we show that Polar achieves a significant reduction in checkpoint latency, in some cases outperforming conventional checkpointing methods by an order of magnitude.
Keywords: software fault tolerance, checkpointing, disk mirroring, storage architecture, atomicity, distributed transactions, asynchronous commit, parallel programming
Received December 12, 1992; revised April 15, 1993.
Communicated by Chuan-lin Wu.
¹ Named after Polaroid, the trademark of an instant camera.