Journal of Information Science and Engineering, Vol. 26 No. 3, pp. 951-966 (May 2010)

Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems

MEHDI LOTFI AND SEYED AHMAD MOTAMEDI
Department of Electrical Engineering
Amirkabir University of Technology
Tehran, Iran
E-mail: {m_lotfi@cic.; motamedi@}aut.ac.ir

Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing that uses two levels of checkpoints. The first level is local checkpointing: computing nodes save their checkpoints to local disk. If a transient failure occurs on a computing node, the process can recover from local disk. The second level is global checkpointing: computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs on a computing node, that node can no longer be used, and the process recovers from global storage on a new computing node. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each local checkpointing interval, the system estimates the expected recovery time in the case of a permanent failure and adaptively decides whether to take a global checkpoint or skip it. Experimental results show that the proposed method significantly reduces the average execution time of the NAS-BT application, with a maximum reduction of 38%.
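The adaptive step can be illustrated with a small sketch. The abstract does not state the decision rule, so the rule below is an assumption made only for illustration: permanent failures are taken to be exponentially distributed, and a global checkpoint is taken when the expected rework it would save exceeds its cost. All names (`schedule`, `take_global_checkpoint`, `permanent_rate`, `global_cost`) are hypothetical and do not come from the paper.

```python
import math


def take_global_checkpoint(work_since_global, interval, permanent_rate, global_cost):
    """Decide whether to take a global checkpoint at the end of a local interval.

    Assumed rule (not from the paper): take the checkpoint when the expected
    rework it saves during the next interval exceeds the cost of taking it.
    """
    # Probability of a permanent failure during the next local interval,
    # assuming exponentially distributed permanent failures.
    p_fail = 1.0 - math.exp(-permanent_rate * interval)
    # A global checkpoint now saves redoing work_since_global on such a failure.
    return p_fail * work_since_global > global_cost


def schedule(n_intervals, interval, permanent_rate, global_cost):
    """Return the 1-based local intervals that end with a global checkpoint."""
    taken = []
    work_since_global = 0.0
    for i in range(1, n_intervals + 1):
        # A local checkpoint is taken at the end of every interval.
        work_since_global += interval
        if take_global_checkpoint(work_since_global, interval,
                                  permanent_rate, global_cost):
            taken.append(i)
            work_since_global = 0.0  # global storage now holds this state
    return taken


if __name__ == "__main__":
    # Rare permanent failures: most global checkpoints are skipped.
    print(schedule(12, 10.0, 0.001, 0.5))
    # Frequent permanent failures: every local interval ends with a global one.
    print(schedule(3, 10.0, 0.01, 0.5))
```

Under this assumed rule, a lower permanent failure rate or a more expensive global checkpoint causes more local intervals to pass between global checkpoints, which matches the qualitative behaviour described in the abstract.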

Keywords: blocking coordinated checkpointing, transient failure, permanent failure, local checkpoint, global checkpoint, optimal interval

Received October 17, 2008; revised April 9, 2009; accepted April 16, 2009.
Communicated by Makoto Takizawa.