A Foundation for Fault Tolerant Components

Bill Leal

Department of Computer and Information Science, The Ohio State University, Columbus, Ohio 43210, USA leal@cse.ohio-state.edu

PhD Dissertation, 2001
The Ohio State University

Abstract. Components for data abstraction have been useful building blocks in the construction of large systems. Rather than design and reason about systems monolithically, components can be composed into a hierarchy so that at each level the properties of the lower components are used to develop new components and to reason about their behavior. This simplifies construction and reasoning as a whole. To date, however, this has generally been conducted with the implicit assumption that the system being constructed is free from faults. Modular approaches to fault tolerance, on the other hand, have been generally limited to narrow disciplines. Either the lower levels hide the effect of faults, or layers are composed such that lower-levels are oblivious to higher ones, thus permitting tolerance at a lower level without interference from a higher one. Both of these approaches are too restricted. The requirement that fault affect be hidden is too demanding in some situations, particularly distributed ones. Likewise the requirement that a subcomponent be oblivious to its environment does not suffice for components that may achieve tolerance only if higher-level components invoke it correctly.

This study provides a foundation for the study of data abstraction components that are combined with faults in a compositional framework. We can reason in isolation about components with respect to their specification, both in absence and presence of faults, and we can reason about composition in terms of the specifications. Hence we achieve modular reasoning in a compositional setting. 

This study adopts a unified view: specifications, rather than being different kinds of objects, are also components. We define the behaviors of a component and in terms of these behaviors define what it means for one component to be "good enough" for another one; that is, for one component to refine another, usually by exhibiting less nondeterminism. Key features of the component definition are that they can model distributed components, and that they can be affected by a fault environment. Additionally, component methods can be spontaneous, reflection background processes. 

Refinement is hard to work with directly since it involves reasoning over infinite computations; simulations provide a more local way to reason. We give two kinds of simulations, state-level (reasoning over the states of clients of the components) and value-level (reasoning over the values that make up the states). These reasoning approaches are shown to be sound, and state-level simulations to be relatively complete. Value-level simulations are less complex, generally preferred, and have been thought to be complete; but we show that they are not unless the components involved are severely restricted. We investigate the behavior of components in the presence of faults, regarding a component as fault tolerant if, in the absence of faults, it refines some specification, and in the presence of faults, it refines some tolerance specification. This general approach permits the definition of the three standard tolerance specifications of masking (never showing the effect of faults), failsafe (never doing anything wrong) and nonmasking (temporarily showing the effect of faults; stabilization is an extremal case). We show that refining a specification does not mean refinement in the presence of faults of the tolerant specification. The converse is also false even in the case of masking. Hence when a component is designed for tolerance, it is necessary to verify separately that it refines its specification and that, in the presence of faults, it refines the tolerance specification.

We investigate stepwise composition by considering clients that use tolerant components. A client of a masking or failsafe component will also be masking or failsafe, respectively, but this is false for nonmasking. Hence in general we must check the behavior of the client not only with respect to the original specification but also with respect to the tolerance specification. This gives important implications for synthesis approaches. For example, tolerance compilers have been generally regarded as producing refinements of the source; but in the case of nonmasking, the compiler must also produce a concrete component that is a refinement of the nonmasking specification in the presence of faults.

The setup is illustrated by a case study of mutual exclusion using a fault-tolerant vector clock subcomponent. It shows, among other things, that access patterns by the client are critical not only for tolerance of the subcomponent but also for normal, fault-free behavior.

The major contributions of this study are as follows.

  • We have a rich component model that accommodates distributed fault tolerance.

  • Refinement can be used to define component fault tolerance.

  • We have composition and locality of reasoning through refinement and substitution.

  • We demonstrate that ideal and tolerance specifications must be separately satisfied.