University of Iowa homepage
 

 

Tolerating Software Faults in Distributed Systems

Neeraj Mittal
Department of Computer Science
University of Texas at Austin

Thursday, March 14
4:00-4:50pm, 205 MLH

Abstract

Recent advances in communication technology have led to a rapid proliferation of distributed systems. For example, a cluster of servers provided Web coverage of the Sydney Summer Olympics. As distributed systems evolve from the special case to commonplace, ensuring their reliable operation has emerged as an important and challenging problem. In spite of extensive testing and debugging, software faults persist even in commercial grade software. Many distributed systems, especially those employed in safety-critical environments, should be able to operate properly even in the presence of software faults. Monitoring the execution of a distributed system, and, on detecting a fault, initiating the appropriate corrective action is an important way to tolerate such faults.
Detecting a software fault in a distributed computation involves finding a (consistent) global state that violates some safety property. In case the failure occurred due to inadequate synchronization (e.g., violation of mutual exclusion), recovery can be accomplished by rolling back the program to a previous state and then re-executing it--with additional synchronization--to avoid a recurrence of the failure.
The above-mentioned problems of finding a faulty global state and the synchronization needed to avoid such states are known to be computationally hard in general. To address the first problem, we have introduced the notion of computation slice. A slice is a concise representation of global states satisfying certain global property. We have developed polynomial-time algorithms for computing the slice for several useful classes of global properties. A slice, in general, contains exponentially fewer number of global states than the computation. Our experimental results show that slicing can lead to an exponential improvement over existing techniques both in terms of time and space for locating a faulty global state, if any. With regard to the controlled re-execution problem, we have identified several conditions under which it is possible to compute the required synchronization efficiently.

 

Thursday, October 07, 2004, 10:21:31.
University of Iowa Logo College of Liberal Arts and Sciences Logo Computing Research Association Logo Association for Computing Machinery Logo
Translate this page automatically.
 
©2005 The University of Iowa, All Rights Reserved.