Project Description

Continuous Self-Maintenance of .NET Services

.NET services are distributed, large-scale, and long-running. Continuous maintenance is therefore necessary in .NET platforms, and it is desirable that this maintenance be as autonomous as possible, involving human operators only when unavoidable. In an ongoing joint project with MSR's Marvin Theimer, we are building a framework for continuous self-monitoring and evolution of .NET service components. We propose to complement that project by building frameworks for (i) dynamic composition of fault-tolerance components and (ii) continuous self-testing of .NET services. Further, we propose to integrate and demonstrate these three frameworks in the Herald .NET service currently being developed at MSR.

Why Continuous Self-Maintenance?

The following considerations imply that only machine-enforceable aspects of a large-scale, long-running .NET service can be assumed to hold continuously.

Lack of global developer knowledge. Distributed large-scale .NET services contain many components that change constantly. Few people, if any, understand all the evolving state guarantees and behaviors that an entire service is supposed to maintain. Service complexity is traditionally constrained by requiring that components interact through well-defined, narrow interfaces that precisely specify the invariants to be maintained before and after their invocation; however, developers are not always aware of all the invariants and do not always implement the interface specifications correctly. Since it is difficult, if not impossible, to evolve a large-scale distributed service in a globally consistent manner, such services will generally tend to be incorrect.
In-place updates. Long-running .NET services get updated in-place and a second complete "test" service is typically not available for a priori testing of all the possible interactions that an updated component will have with the running service. It is inevitable that incorrectly implemented components will be introduced during service updates, which will propagate errors into the state of a running service.
Inevitable unanticipated faults. Any large-scale, long-running service, even if not updated, will inevitably be subject to unanticipated faults, such as faults that are due to defects in implementation, that occur only when the service is under load, or that do not occur when the service is instrumented for debugging ("Heisenbugs").
Incomplete specifications. Incomplete or partial specifications are unavoidable in service maintenance, especially now that design time scales can be as low as 12 weeks, and much less for critical services.

We thus claim that properties that are not machine-enforceable will eventually become invalid in the .NET context. Continuous maintenance by the system itself is therefore necessary.

Requirements for Continuous Self-Maintenance.

We focus our attention on the following desiderata.

Reliability. Service faults should be detected, diagnosed and tolerated. Since service aspects that deal with faults are rarely tested during normal operation, reliability should be ensured proactively if possible. But most importantly, service function should not degrade as a result of continuous self-maintenance, even across administrative boundaries.
Availability. Services that are not functioning as desired should quickly resume desired function.
Evolution. If the environment changes so that the maintenance of services becomes inadequate, the maintenance procedures should be evolved (automatically if possible and manually otherwise).
Low cost. New resources should not be necessary for continuous maintenance. Instead, all existing residual system resources should be aggressively exploited, with provision for their quick release whenever the demand for services grows.

Our Approach to Continuous Self-Maintenance.

Our approach to meeting the above requirements consists of three orthogonal elements. Various combinations of the three can be used depending on the particular .NET scenario being considered.

Dynamic Composition of Tolerance Components. For reliability against faults, our approach is to wrap .NET service components with tolerance components, namely detectors and correctors [1]. Tolerance components detect the health of service components, e.g. in terms of data invariants or timeliness of execution. They may also perform replication, correction, and consistency enforcement; if these fail, they may notify service managers so as to enable fault diagnosis.
Continuous Self-Monitoring and Evolution. For availability, our approach is to continuously monitor .NET services (and their tolerance components). The frequency of monitoring is dynamic and depends on the service state, the evolving service composition, and the resources available in the system. Monitoring enables detection of dependencies both within and across services; e.g. a service consisting of multiple server instances for reliability and performance will be able to analyze its degree of replication by monitoring dynamic, global conditions. It also enables service evolution.
Continuous Self-Testing. For proactive reliability, our approach is to continuously test .NET services via injected faults. Such testing is carefully designed not to interfere with normal service operation. At the same time, it exploits available resources in the system.
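
The detector/corrector wrapping described above can be illustrated with a minimal sketch. It is written in Python purely for brevity (the proposed frameworks target C#), and every name in it (`Tolerant`, `check`, the replica-count invariant) is a hypothetical choice made for this illustration, not part of any existing framework.

```python
from typing import Callable, Dict, List

State = Dict[str, int]

class Tolerant:
    """Hypothetical wrapper pairing component state with a detector and a corrector."""

    def __init__(self, state: State,
                 detector: Callable[[State], bool],
                 corrector: Callable[[State], State],
                 notify: Callable[[str], None]) -> None:
        self.state = state
        self.detector = detector    # returns True when the invariant holds
        self.corrector = corrector  # attempts to restore the invariant
        self.notify = notify        # escalates to a service manager

    def check(self) -> bool:
        """Detect a violation, try to correct it, and notify only if correction fails."""
        if self.detector(self.state):
            return True
        self.state = self.corrector(self.state)
        if self.detector(self.state):
            return True
        self.notify("invariant still violated after correction")
        return False

# Assumed example invariant: the service keeps at least `min_replicas` replicas.
def enough_replicas(s: State) -> bool:
    return s["replicas"] >= s["min_replicas"]

def add_replicas(s: State) -> State:
    return {**s, "replicas": s["min_replicas"]}

notifications: List[str] = []
service = Tolerant({"replicas": 1, "min_replicas": 3},
                   enough_replicas, add_replicas, notifications.append)
ok = service.check()
```

In this sketch the operator is involved only on the last path of `check`, which mirrors the goal of involving humans only when correction is not possible.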

Previous and Current Related Work.

Aladdin [2-3]. Jointly with MSR's Yi-Min Wang and Wilf Russell, we built this system for local and remote home networking using only undependable commodity devices, networks (phoneline and powerline), and operating systems (Windows 98). Typical scenarios of Aladdin use include: remotely sending an email home to shut the garage door and receiving a video confirmation; receiving a cellphone message in case of a water leak in the basement; and using a natural language interface at home to switch off the lights in any part of the home. Its architecture, software services (such as SoftStateStore), and protocols met the goals of a dependable and extensible system with low complexity and good performance. By dependability, we mean that its tolerance components enable the system to mask common faults, self-stabilize from rare faults, and notify the homeowner in case of catastrophic faults. By extensibility, we mean that new scenarios of use can be added to it with little effort. Outside of the lab setting, Aladdin has been in daily use in a large single-family home in Seattle for well over a year now. Aladdin was showcased in a PBS serial aired in California in Summer 2000, as well as in a Seattle P-I news article; it has led to four patents. Its architecture is compatible with .NET, and has motivated Wang and Russell's recent work on .NET Alert Services.

Continuous Monitoring Framework for .NET Services [4-5]. This project accommodates both self-monitoring and human-in-the-loop monitoring of services. Specifically, mechanisms are provided to continuously monitor the liveness of all services as well as their detectors and correctors, which are tasked with satisfaction of service-specific data invariants. This includes a means for dynamically controlling which services and detectors/correctors get checked and how often; a means for dynamically detecting conflicting data invariants that have been posited for the service; a means of analyzing the "error" behavior of a service; a means of evolving the service and its associated data invariants; and a means for allowing all residual system resources to be made available for continual monitoring.
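
As a rough illustration of dynamically controlling how often a service is checked, the sketch below computes the next monitoring interval from the service's health and the fraction of spare system resources. The policy itself (check four times as often when unhealthy, back off in proportion to scarce resources) and all names and constants are assumptions made for this example; the source does not specify a policy. Python is used only for brevity.

```python
def next_interval(base_s: float, healthy: bool, spare_fraction: float,
                  min_s: float = 1.0, max_s: float = 300.0) -> float:
    """Return seconds until the next check (hypothetical policy).

    base_s         -- nominal interval for a healthy service with all resources spare
    healthy        -- result of the most recent detector run
    spare_fraction -- share of system resources currently idle, in [0, 1]
    """
    spare = min(max(spare_fraction, 0.0), 1.0)
    if spare == 0.0:
        # No residual resources: monitor as rarely as allowed.
        return max_s
    interval = base_s if healthy else base_s / 4.0  # check unhealthy services more often
    interval /= spare                               # back off as spare resources shrink
    return min(max(interval, min_s), max_s)
```

A monitor loop would call `next_interval` after each check, so that monitoring speeds up when a service looks unhealthy and yields resources when demand for services grows.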

Proposed Projects.

Framework for Dynamic Composition of Tolerance Components

Dynamic Composition. This framework will include mechanisms for tolerance components to be added, removed, and updated while the .NET service is running. Distributed implementation of tolerance components, i.e. between clients and servers, will be enabled. Calls to a component will be dynamically redirected to its tolerance component, which will adjust component input and output as necessary to invoke new distributed coordination and fault-tolerance protocols. Layered addition of tolerance components will be allowed.
Consistency Semantics. The framework will ensure that (i) tolerance component updates are atomic with respect to the ongoing service interactions; (ii) a tolerance component is removed only if all of its instances are removed; (iii) a tolerance component update implies update of all of its instances; and (iv) orphans (unused tolerance components) and widows (tolerance components that call nonexistent components) are handled correctly.
Multiple Abstraction Levels. The framework will allow composition of tolerance components per instance, per component, or per group of instances or components.
Minimal Modification. The framework will require only minimal modification of the underlying service component source code, i.e. extending service components to inherit from a special platform class.
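
One way to picture dynamic redirection with layered tolerance components is the sketch below. It is an illustrative Python stand-in, not the proposed C# design: `ComposableProxy`, its `add` and `remove` methods, and the `clamp_input` layer are all hypothetical. A lock and a per-call snapshot are used here as one simple way to make layer updates atomic with respect to in-flight calls.

```python
import threading
from typing import Any, Callable, List, Tuple

Handler = Callable[..., Any]
Layer = Callable[[Handler], Handler]  # takes the next handler, returns a wrapped one

class ComposableProxy:
    """Hypothetical proxy that redirects calls through runtime-changeable layers."""

    def __init__(self, target: Handler) -> None:
        self._target = target
        self._layers: List[Tuple[str, Layer]] = []
        self._lock = threading.Lock()

    def add(self, name: str, layer: Layer) -> None:
        with self._lock:
            self._layers = self._layers + [(name, layer)]

    def remove(self, name: str) -> None:
        with self._lock:
            self._layers = [(n, l) for n, l in self._layers if n != name]

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        with self._lock:
            layers = list(self._layers)  # snapshot: this call sees one consistent set
        handler = self._target
        for _, layer in layers:          # later-added layers end up outermost
            handler = layer(handler)
        return handler(*args, **kwargs)

def reciprocal(x: float) -> float:
    return 1 / x

def clamp_input(next_handler: Handler) -> Handler:
    """Example tolerance layer: adjusts the input so the component's precondition holds."""
    return lambda x: next_handler(max(x, 1))

proxy = ComposableProxy(reciprocal)
proxy.add("clamp", clamp_input)
guarded = proxy(0)   # the layer rewrites the input before the component sees it
```

Callers keep invoking `proxy` while layers are added or removed; removing `"clamp"` restores the unwrapped behavior on the next call.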

Framework for Continuous Self-Testing of Services

Dynamic Injection. This framework will include mechanisms for simulating faults by execution of fault methods ("faultors") by an environment. It will enable dynamic scripting of faultors. Going further than the dynamic composition framework, it will also enable dynamic addition of methods to a component.
Test Specifications. The framework will provide mechanisms for testing of temporal event predicates, in addition to global state predicates, for each faultor group.
Non-Interference. The framework will include mechanisms to quickly reconfigure services into functioning and test versions, and to prevent interference with service interactions during reconfiguration. Dually, quick release of testing resources and scaling up of functioning services will be supported to deal with increases in service demand. Design patterns will be included to hide the effects of simulated faults from clients, e.g. via replication and checkpointing-rollback.
Discovery Support. The framework will include mechanisms to support discovery of new faults, invariants, and dependencies.

Both frameworks will be implemented in C#. Browser-based interfaces will be provided to access the various facilities interactively. These interfaces will offer tracing and reporting facilities so that the current composition status of the system can be monitored.
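
The interplay of faultors, test specifications, and non-interference can be sketched as follows. This is a toy Python illustration under stated assumptions: the `Counter` component, the `corrupt_value` faultor, and the `eventually_recovers` predicate are invented for the example, and testing on a deep copy stands in for the proposed reconfiguration into separate functioning and test versions.

```python
import copy
from typing import Callable, List

class Counter:
    """Toy service component whose state invariant is value >= 0."""

    def __init__(self) -> None:
        self.value = 0
        self.events: List[str] = []

    def repair(self) -> None:
        """Corrector: restore the invariant and record the recovery event."""
        if self.value < 0:
            self.value = 0
            self.events.append("recovered")

def corrupt_value(c: Counter) -> None:
    """Hypothetical faultor: simulates state corruption."""
    c.value = -1
    c.events.append("fault")

def eventually_recovers(events: List[str]) -> bool:
    """Temporal event predicate: every 'fault' is later followed by a 'recovered'."""
    pending = False
    for e in events:
        if e == "fault":
            pending = True
        elif e == "recovered":
            pending = False
    return not pending

def self_test(live: Counter, faultor: Callable[[Counter], None]) -> bool:
    test_copy = copy.deepcopy(live)  # inject into a copy so the live instance is untouched
    faultor(test_copy)
    test_copy.repair()
    state_ok = test_copy.value >= 0  # global state predicate
    return state_ok and eventually_recovers(test_copy.events)

live = Counter()
passed = self_test(live, corrupt_value)
```

The test checks both a state predicate and a temporal predicate over the event trace, while the live component never observes the injected fault.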

Integration with ongoing MSR .NET projects.

.NET services built in the Systems and Networking group at MSR provide a compelling avenue for demonstrating and validating the proposed frameworks. Thus, after having consulted with Marvin Theimer, we propose to integrate and demonstrate the frameworks described above in the context of the Herald global eventing service that Marvin and his colleagues are building. By maintaining data invariants for Herald service components at all levels (simple components, single machines, groups of machines, and system-wide), we will be able to measure the effect on availability of satisfying these invariants. By monitoring intra-component as well as inter-component dependencies in Herald, we will be able to implement and compare various replica management policies, thereby obtaining a better understanding of scenarios where global information is necessary and of how to distribute the cost of repair over the global system. By injecting faults in Herald, we will be able to collect fault data about long-running .NET services much faster than we collected data on faults that actually occurred in the Aladdin home network installations.

References

1. Arora, A. and S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. In International Conference on Distributed Computing Systems. 1998.

2. Wang, Y.-M., W. Russell, A. Arora, J. Xu, and R. Jagannathan. Towards dependable home networking: An experience report. In International Conference on Dependable Systems and Networks (ICDSN'2000). 2000. Seattle.

3. Wang, Y.-M., W. Russell, and A. Arora. A toolkit for building dependable and extensible home networking applications. In Fourth USENIX Windows Systems Symposium (USENIX-WIN'2000). 2000. Seattle.

4. Arora, A. and M. Theimer. Agenda for fault-tolerant distributed computing in .NET. 2001. The Ohio State University and Microsoft Research.

5. Arora, A. and M. Theimer. Service evolution framework. 2001. The Ohio State University and Microsoft Research.