MEHDI LOTFI AND SEYED AHMAD MOTAMEDI
Department of Electrical Engineering
Amirkabir University of Technology
Tehran, Iran
E-mail: {m_lotfi@cic.; motamedi@}aut.ac.ir
Blocking coordinated checkpointing is a well-known method for achieving fault
tolerance in cluster computing systems. In this work, we introduce a new approach for
blocking coordinated checkpointing using two-level checkpointing. The first level of
checkpointing is local checkpointing, in which computing nodes save their checkpoints to
local disk. If a transient failure occurs in a computing node, the process can recover from
local disk. The second level is global checkpointing, in which computing nodes
send their checkpoints to highly reliable global stable storage. If a permanent failure occurs
in a computing node, that node cannot be used again, and the process recovers from global
storage on a new computing node. Local checkpoints are taken more frequently than
global checkpoints. In addition, at the end of each local checkpointing interval, the system
estimates the expected recovery time in the case of a permanent failure and adaptively
either takes a global checkpoint or skips it. Experimental results show that the average
execution time of the NAS-BT application is significantly reduced by the proposed method;
the maximum reduction in execution time for this application is 38%.
Received October 17, 2008; revised April 9, 2009; accepted April 16, 2009.
Communicated by Makoto Takizawa.
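
The two-level scheme summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact decision rule, so everything below is an assumption made for illustration: the class and function names, the recovery-time model (work since the last global checkpoint plus a fixed restart cost), and the threshold comparison are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical policy; field names and the decision rule are assumptions."""
    local_interval: float         # seconds of work between local checkpoints
    global_restart_cost: float    # assumed fixed cost to restart on a new node
    max_expected_recovery: float  # assumed threshold that triggers a global checkpoint

    def expected_recovery_time(self, work_since_global: float) -> float:
        # Assumed model: after a permanent failure, all work done since the
        # last global checkpoint is redone, plus a fixed restart cost.
        return work_since_global + self.global_restart_cost

    def should_take_global(self, work_since_global: float) -> bool:
        # Assumed rule: go global only when the estimate exceeds the threshold.
        return self.expected_recovery_time(work_since_global) > self.max_expected_recovery


def recovery_source(failure: str) -> str:
    """Map a failure type to the checkpoint level used for recovery."""
    if failure == "transient":
        return "local disk"
    if failure == "permanent":
        return "global stable storage"
    raise ValueError(f"unknown failure type: {failure!r}")


def simulate(policy: CheckpointPolicy, n_intervals: int) -> list:
    """Return the action chosen at the end of each local checkpointing interval."""
    actions = []
    work_since_global = 0.0
    for _ in range(n_intervals):
        work_since_global += policy.local_interval
        # A local checkpoint is always taken; the global one is adaptive.
        if policy.should_take_global(work_since_global):
            actions.append("local+global")
            work_since_global = 0.0
        else:
            actions.append("local")
    return actions


if __name__ == "__main__":
    policy = CheckpointPolicy(local_interval=60.0,
                              global_restart_cost=30.0,
                              max_expected_recovery=200.0)
    print(simulate(policy, 6))
    print(recovery_source("permanent"))
```

With these made-up parameters, a global checkpoint is taken at every third local interval. A real system would presumably estimate the recovery time from measured checkpoint sizes and transfer rates rather than from fixed constants.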