Continuous Self-Maintenance of .NET Services

Project description.

.NET services are distributed, large-scale, and long-running, so they require continuous maintenance. It is desirable that this maintenance be as autonomous as possible, involving human operators only when unavoidable. In an ongoing joint project with MSR's Marvin Theimer, we are building a framework for continuous self-monitoring and evolution of .NET service components. We propose to complement that project by building frameworks for (i) dynamic composition of fault-tolerance components and (ii) continuous self-testing of .NET services. Further, we propose to integrate and demonstrate these three frameworks in the Herald .NET service currently being developed at MSR.

Why Continuous Self-Maintenance?

The following considerations imply that only machine-enforceable aspects of a large-scale, long-running .NET service can be assumed to hold continuously.
 

We thus claim that properties that are not machine-enforceable will eventually become invalid in the .NET context. Continuous maintenance by the system itself is therefore necessary.

Requirements for Continuous Self-Maintenance.

We focus our attention on the following desiderata.

Our Approach to Continuous Self-Maintenance.

Our approach to meeting the above requirements consists of three orthogonal elements. Various combinations of the three can be used depending on the particular .NET scenario being considered.

Previous and Current Related Work.

Aladdin [2-3]. Jointly with MSR's Yi-Min Wang and Wilf Russell, we built this system for local/remote home networking using only undependable commodity devices, networks (phoneline and powerline), and operating systems (Windows 98). Typical scenarios of Aladdin use include: remotely sending an email home to shut the garage door and receiving a video confirmation; receiving a cellphone message in case of a water leak in the basement; and using a natural language interface at home to switch off the lights in any part of the home. Its architecture, software services (such as SoftStateStore), and protocols met the goals of a dependable and extensible system with low complexity and good performance. By dependability, we mean that its tolerance components enable the system to mask common faults, self-stabilize from rare faults, and notify the homeowner in case of catastrophic faults. By extensibility, we mean that new scenarios of use can be added to it with little effort. Outside of the lab setting, Aladdin has been in daily use in a large single-family home in Seattle for well over a year now. Aladdin was showcased in a PBS serial aired in California in Summer 2000, as well as in a Seattle P-I news article; it has led to four patents. Its architecture is compatible with .NET, and has motivated Wang and Russell's recent work on .NET Alert Services.

Continuous Monitoring Framework for .NET Services [4-5]. This project accommodates both self-monitoring and human-in-the-loop monitoring of services. Specifically, mechanisms are provided to continuously monitor the liveness of all services as well as their detectors and correctors, which are tasked with satisfying service-specific data invariants. This includes a means for dynamically controlling which services and detectors/correctors get checked and how often; a means for dynamically detecting conflicting data invariants that have been posited for the service; a means of analyzing the "error" behavior of a service; a means of evolving the service and its associated data invariants; and a means for allowing all residual system resources to be made available for continual monitoring.
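To make the detector/corrector idea concrete, the following is a minimal illustrative sketch in Python, not the framework's actual API. It assumes a detector is a predicate over service state that reports whether a data invariant holds, and a corrector is a procedure that restores it. The names InvariantCheck, Monitor, configure, and the "queue_bound" invariant are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]  # hypothetical service state: named integer fields


@dataclass
class InvariantCheck:
    """Pairs a detector with a corrector for one data invariant."""
    name: str
    detector: Callable[[State], bool]   # True when the data invariant holds
    corrector: Callable[[State], None]  # mutates the state to restore the invariant
    period: int = 1                     # run the detector every `period` ticks
    enabled: bool = True


class Monitor:
    """Runs registered checks; which checks run, and how often, can change at runtime."""

    def __init__(self) -> None:
        self.checks: Dict[str, InvariantCheck] = {}
        self.repairs: List[Tuple[int, str]] = []  # (tick, check name) per correction

    def register(self, check: InvariantCheck) -> None:
        self.checks[check.name] = check

    def configure(self, name: str, period: Optional[int] = None,
                  enabled: Optional[bool] = None) -> None:
        # Dynamic control: adjust the frequency or switch a check on or off.
        check = self.checks[name]
        if period is not None:
            check.period = period
        if enabled is not None:
            check.enabled = enabled

    def tick(self, now: int, state: State) -> None:
        for check in self.checks.values():
            if not check.enabled or now % check.period != 0:
                continue
            if not check.detector(state):
                check.corrector(state)
                self.repairs.append((now, check.name))


# Hypothetical invariant: the queue never holds more items than its capacity.
state = {"queued": 7, "capacity": 5}
monitor = Monitor()
monitor.register(InvariantCheck(
    name="queue_bound",
    detector=lambda s: s["queued"] <= s["capacity"],
    corrector=lambda s: s.update(queued=s["capacity"]),
    period=2,
))
monitor.tick(1, state)  # not due yet: the period is 2
monitor.tick(2, state)  # due: the violation is detected and corrected
print(state["queued"], monitor.repairs)
```

Keeping the check period and the enabled flag as mutable fields is one simple way to model control over which detectors/correctors get checked and how often; a real framework would also have to monitor the liveness of the checks themselves.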

Proposed Projects.

Framework for Dynamic Composition of Tolerance Components

Framework for Continuous Self-Testing of Services

Integration with ongoing MSR .NET projects.

.NET services built in the Systems and Networking group at MSR provide a compelling avenue for demonstrating and validating the proposed frameworks. Thus, after having consulted with Marvin Theimer, we propose to integrate and demonstrate the frameworks described above in the context of the Herald global eventing service that Marvin and his colleagues are building. By maintaining data invariants for Herald service components at all levels (simple components, single machines, groups of machines, and system-wide), we will be able to measure the effect on availability of satisfying these invariants. By monitoring intra-component as well as inter-component dependencies in Herald, we will be able to implement and compare various replica management policies, thereby obtaining a better understanding of scenarios where global information is necessary and how to distribute the cost of repair over the global system. By injecting faults in Herald, we will be able to collect fault data about long-running .NET services much faster than we have collected data on faults that actually occurred in the Aladdin home network installations.
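As a rough illustration of the kind of measurement described above, the sketch below injects faults into a simulated component at chosen ticks and reports the resulting availability for different repair delays. It is a toy model under stated assumptions, not Herald code: time is discrete, a fault makes the component unavailable for exactly repair_delay ticks, and the function name availability and its parameters are hypothetical.

```python
from typing import Iterable


def availability(ticks: int, fault_ticks: Iterable[int], repair_delay: int) -> float:
    """Fraction of ticks in [0, ticks) during which the simulated component is up.

    A fault injected at tick t keeps the component down for ticks
    t .. t + repair_delay - 1; a delay of 0 models a fault that is fully masked.
    """
    faults = set(fault_ticks)
    down_until = 0  # first tick at which the component is up again
    up_ticks = 0
    for t in range(ticks):
        if t in faults:
            down_until = max(down_until, t + repair_delay)
        if t >= down_until:
            up_ticks += 1
    return up_ticks / ticks


# Same injected fault schedule, three hypothetical repair policies.
schedule = [2, 6]
for delay in (0, 1, 3):
    print(delay, availability(10, schedule, delay))
```

Holding the injected fault schedule fixed while varying only the repair delay isolates the contribution of the corrective mechanism to availability, which is the comparison the paragraph above proposes to make on a real service.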

References

1. Arora, A. and S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. in International Conference on Distributed Computing Systems. 1998.

2. Wang, Y.-M., W. Russell, A. Arora, J. Xu, and R. Jagannathan. Towards dependable home networking: An experience report. in International Conference on Dependable Systems and Networks (ICDSN'2000). 2000. Seattle.

3. Wang, Y.-M., W. Russell, and A. Arora. A toolkit for building dependable and extensible home networking applications. in Fourth USENIX Windows Systems Symposium USENIX-WIN'2000. 2000. Seattle.

4. Arora, A. and M. Theimer, Agenda for fault-tolerant distributed computing in .NET. 2001, The Ohio State University and Microsoft Research.

5. Arora, A. and M. Theimer, Service evolution framework. 2001, The Ohio State University and Microsoft Research.