
Journal of Information Science and Engineering, Vol.16, No.1, pp.65-80 (January 2000)

A Cost-Effective Forward Recovery Checkpointing
Scheme in Multiprocessor Systems*

Kuochen Wang and Chien-Chun Wang+
Department of Computer and Information Science
National Chiao Tung University
Hsinchu, Taiwan 300, R.O.C.
E-mail: kwang@cis.nctu.edu.tw
+Central Telecommunications Administration Station
Directorate General of Telecommunications
Taichung, Taiwan 408, R.O.C.

This paper proposes a novel, cost-effective forward recovery checkpointing scheme for multiprocessor systems with duplex modular redundancy. If the comparison between the two processing modules mismatches at a checkpoint, one processing module is selected to retry the questionable checkpoint while the other executes toward the next checkpoint. Schemes that use a spare module for the retry need much time to initiate that module, and their extra cost is high. The traditional rollback scheme retries the questionable checkpoint without a spare module, but its average job completion time is longer than that of our scheme under any fault distribution. Besides transient faults, our scheme can also locate permanent faults. Experimental results based on our mathematical models demonstrate that, under burst errors, our scheme reduces the average completion time by 50% compared with the traditional rollback scheme and is comparable to the scheme that uses a spare module for the retry. In addition, our scheme has the lowest total execution time among the three schemes under any fault distribution and is therefore the most cost-effective.
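
The decision that follows a mismatched comparison can be illustrated with a minimal sketch. It is not taken from the paper: the function name `resolve_mismatch`, the `Outcome` values, and the two-out-of-three agreement rule are all assumptions made for illustration, and the paper's actual procedure may differ.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical outcome labels, not terminology from the paper.
    KEEP_FORWARD = "keep_forward"        # forward module started from a correct state
    DISCARD_FORWARD = "discard_forward"  # forward module started from a faulty state
    ROLLBACK = "rollback"                # no two results agree


def resolve_mismatch(retry_original, forward_original, retry_result):
    """Classify a mismatch once the retry has finished (illustrative only).

    retry_original   -- result first produced by the module chosen to retry
    forward_original -- result first produced by the module that kept running
    retry_result     -- result of re-executing the questionable checkpoint
    """
    if retry_result == forward_original:
        # Assumed rule: the retry agrees with the forward module, so the
        # work it has already done toward the next checkpoint is kept.
        return Outcome.KEEP_FORWARD
    if retry_result == retry_original:
        # Assumed rule: the forward module's state was the faulty one, so
        # its work past the questionable checkpoint is discarded.
        return Outcome.DISCARD_FORWARD
    # All three results differ; fall back to an ordinary rollback.
    return Outcome.ROLLBACK
```

Under these assumptions, the saving over plain rollback comes from the first case: when the forward module turns out to have been correct, no time is lost on the retry.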

Keywords: forward recovery, multiprocessor system, cost-effective, checkpointing scheme, transient fault, permanent fault


Received October 16, 1997; revised June 3, 1998; accepted September 9, 1998.
Communicated by Lionel M. Ni.
*This work was supported in part by the National Science Council, ROC under Grant NSC87-2213-E-009-030.