Checkpointing and Failure Recovery in Distributed Systems
Dr. Dakshnamoorthy Manivannan
Department of Computer and Information Science
Ohio State University
Monday, April 14, 1997
11 a.m. - 12 noon
Fuller Labs 311
A consistent global checkpoint of a distributed computation represents a feasible system state. Finding consistent global checkpoints of a distributed computation has applications in many areas, such as transparent failure recovery, debugging distributed programs, and others. We present a theoretical foundation for addressing the issues involved in finding consistent global checkpoints of a distributed computation efficiently. Using the insight gained from the theory, we also present an algorithm for finding all the consistent global checkpoints of a distributed computation.
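As a rough illustration of what "consistent" means here, the sketch below checks a candidate global checkpoint for orphan messages, that is, messages recorded as received but not as sent. It is not the algorithm presented in the talk. The event-position model and the names `Message` and `is_consistent` are assumptions made only for this example.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message (illustrative model only)."""
    src: str        # sending process
    send_pos: int   # index of the send event in the sender's local history
    dst: str        # receiving process
    recv_pos: int   # index of the receive event in the receiver's local history


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if the global checkpoint `cut` has no orphan message.

    cut[p] is the number of local events of process p that its checkpoint
    records, so event i of p is inside the checkpoint iff i < cut[p].
    """
    for m in messages:
        received_in_cut = m.recv_pos < cut[m.dst]
        sent_in_cut = m.send_pos < cut[m.src]
        if received_in_cut and not sent_in_cut:
            # The receive is recorded but the send is not: orphan message.
            return False
    return True


# Two processes exchanging two messages (made-up example data).
messages = [
    Message("P1", 0, "P2", 1),
    Message("P2", 2, "P1", 3),
]

print(is_consistent({"P1": 1, "P2": 2}, messages))
print(is_consistent({"P1": 0, "P2": 2}, messages))
```

In this toy model the first cut records both the send and the receive of the first message, while the second cut records only its receive, so it cannot correspond to a feasible system state.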
When processes take checkpoints asynchronously, some of the checkpoints may not be part of any consistent global checkpoint and hence are useless. Quasi-synchronous checkpointing algorithms proposed in the literature minimize the number of useless checkpoints using several strategies. We present a theoretical framework for characterizing and classifying the quasi-synchronous checkpointing algorithms. The framework helps in analyzing the properties and limitations of such algorithms and also provides guidelines for designing and evaluating new algorithms. This classification also sheds light on some open problems that remain to be solved.
Based on the insight gained from the classification of quasi-synchronous checkpointing algorithms, we present a low-overhead quasi-synchronous checkpointing algorithm and a low-overhead asynchronous recovery algorithm based on the checkpointing algorithm.
