Loading...
Please wait, while we are loading the content...
Similar Documents
Software fault tolerance in distributed systems using controlled re-execution
| Content Provider | Semantic Scholar |
|---|---|
| Author | Tarafdar, Ashis Garg, Vijay K. |
| Copyright Year | 2000 |
| Abstract | Distributed applications are particularly vulnerable to synchronization faults. An important approach to tolerating synchronization faults is rollback recovery, which involves restoring a previous state and re-executing. Existing rollback recovery methods depend on chance and cannot guarantee that synchronization faults do not recur during re-execution. We propose a new rollback recovery method, controlled re-execution, based on selectively adding synchronizations during re-execution to ensure that synchronization faults do not recur. The controlled re-execution method gives rise to three interesting questions: How do we determine the synchronizations that ensure a safe re-execution? How do we monitor an application to detect faulty global conditions? How well does controlled re-execution perform in practice? The first part of the dissertation addresses the predicate control problem which takes a computation and a global property and adds synchronizations to the computation to maintain the property. We design efficient algorithms to solve the problem for many useful predicates, including disjunctive predicates and various types of mutual exclusion predicates. These predicates correspond to commonly encountered synchronization faults such as races. The second part of the dissertation investigates the predicate detection problem which involves determining whether a global property occurs in a computation. We address the problem for the useful class of conjunctive predicates and in the context of an extended model of computation that allows improved predicate detection over the conventional model. We show that, in general, the problem is NP-Complete. However, an efficient solution is demonstrated for the useful cases of receive-ordered and send-ordered computations. Further, this solution can be used to achieve an improved, though exponential, solution for the general problem. The third part of the dissertation involves an experimental study of the controlled re-execution method. We evaluate the controlled re-execution method in tolerating race faults and find that the extra tracing costs imposed are within tolerable limits and that it greatly enhanced the likelihood of recovery. We conclude that controlled re-execution is an effective and desirable method for tolerating races in long-running non-interactive distributed applications. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.cs.utexas.edu/ftp/techreports/tr00-38.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |