Journal of Information Science and Engineering, Vol. 20 No. 5, pp. 885-901 (September 2004)

Adaptive Communication-Induced Checkpointing
Protocols with Domino-Effect Freedom

Jichiang Tsai, Chi-Yi Lin* and Sy-Yen Kuo*
Department of Electrical Engineering
National Chung Hsing University
Taichung, 402 Taiwan
*Department of Electrical Engineering
National Taiwan University
Taipei, 106 Taiwan

The domino effect is an important problem in checkpointing and rollback recovery for distributed systems. Communication-induced checkpointing is one way of preventing the domino effect. Most existing protocols of this kind focus on guaranteeing that every checkpoint is part of a consistent global checkpoint. This may induce high run-time overhead because of a possibly excessive number of extra forced checkpoints. In this paper, we propose several adaptive communication-induced checkpointing protocols with domino-effect freedom. These protocols allow a flexible tradeoff between the cost of checkpoint coordination and the rollback distance: only a specific set of checkpoints needs to be part of a consistent global checkpoint. The overhead analysis shows that our generalization can significantly reduce the number of extra forced checkpoints.

Keywords: distributed systems, domino effect, communication-induced checkpointing, fault tolerance, rollback recovery

Received December 27, 2001; revised March 25, 2003; accepted April 10, 2003.
Communicated by Chu-Sing Yang.