Scalable Self-Stabilization via Composition

William Leal and Anish Arora
ICDCS 2004

Systems tend to be subject to transient faults, including hardware faults, errors in program code, inappropriate composition of
components, wrong input, operator error, and so forth. In large or distributed systems it is often impractical or impossible to mask
the effect of these faults. Self-stabilization is an approach that addresses the problem by allowing a system to recover from
arbitrary faults.

Objections to the practical use of stabilization have centered on problems of scale. Because of potential interference
between actions, global reasoning over the entire system is in general necessary. The complexity of this task increases dramatically as systems grow in size. Alternatives for dealing with this complexity focus on reset and composition. With reset, the problem is that any fault, no matter how minor, causes a complete system reset, with a potentially significant loss of availability. Existing compositional alternatives, including compositional reset, impose severe restrictions on candidate systems.

To address these issues, we give a framework for composition in which global reasoning and detailed system knowledge are not
necessary, and which applies to a significantly wider range of systems than has hitherto been possible. In this framework, we explicitly identify for each component which other components it can corrupt. Additionally, the correction of one component often depends on the prior correction of one or more other components, constraining the order in which correction can take place. Given appropriate component stabilizers such as detectors and correctors, we offer several ways to coordinate system correction, depending on what is actually known about the corruption and correction relations.
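As a rough illustration of the two relations, the sketch below encodes them as plain dictionaries and derives one admissible correction order with a topological sort. The component names, the dictionary encoding, and the assumption that the correction dependencies are acyclic are ours, not taken from the paper.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical components; the names are illustrative only.
# corrupts[c] = components that a faulty c may corrupt.
corrupts = {
    "clock": {"routing", "logger"},
    "routing": {"app"},
    "logger": set(),
    "app": set(),
}

# depends_on[c] = components that must be correct before c can be corrected.
depends_on = {
    "clock": set(),
    "routing": {"clock"},
    "logger": {"clock"},
    "app": {"routing"},
}

def correction_order(depends_on):
    """Return one order in which components can be corrected.

    Assumes the dependency relation is acyclic; TopologicalSorter
    raises graphlib.CycleError otherwise.
    """
    return list(TopologicalSorter(depends_on).static_order())

order = correction_order(depends_on)
print(order)
```

Any order produced this way corrects a component only after the components it depends on. How to coordinate correction when the relations are only partially known is not captured by this sketch.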

By reducing the design of and reasoning about stabilization to local activities involving each component and the neighbors with which it interacts, the framework is scalable. Reset is generally avoided by using the correction relation to check and correct only where necessary. By including both correction and corruption relations, the framework subsumes and extends other compositional approaches.
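The idea of checking and correcting only where necessary can be sketched as follows. This is a minimal illustration under our own assumptions: the function names, the `is_ok` detector and `correct` corrector callbacks, and the example components are hypothetical, and the correction dependencies are assumed to be acyclic.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def possibly_corrupted(corrupts, faulty):
    """Components reachable from `faulty` through the corruption relation."""
    seen, stack = {faulty}, [faulty]
    while stack:
        for nxt in corrupts.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def stabilize(faulty, corrupts, depends_on, is_ok, correct):
    """Check only suspect components, in dependency order; correct those that fail."""
    suspects = possibly_corrupted(corrupts, faulty)
    corrected = []
    for comp in TopologicalSorter(depends_on).static_order():
        if comp in suspects and not is_ok(comp):
            correct(comp)
            corrected.append(comp)
    return corrected

# Hypothetical system: a fault at "routing" may also have corrupted "app".
corrupts = {"clock": {"routing"}, "routing": {"app"}, "app": set(), "logger": set()}
depends_on = {"clock": set(), "routing": {"clock"}, "app": {"routing"}, "logger": set()}
state = {"clock": "ok", "routing": "bad", "app": "bad", "logger": "ok"}

corrected = stabilize(
    "routing",
    corrupts,
    depends_on,
    is_ok=lambda c: state[c] == "ok",           # stand-in detector
    correct=lambda c: state.update({c: "ok"}),  # stand-in corrector
)
print(corrected)
```

In this example only `routing` and `app` are examined; `clock` and `logger` are never touched, which is the contrast with a full system reset.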

Though not directly a part of this work, we mention tools and techniques that can be used to help calculate the dependency and corruption relations and to help create the necessary stabilizers.

To illustrate the theory, we show how this framework has been applied in our work in sensor networks.

[pdf] [bibtex entry]
[pdf] [bibtex entry] Extended technical report version