Continuous Self-Maintenance of .NET Services
Project description.
.NET services are distributed, large-scale, and long-running. As such,
continuous maintenance is necessary in .NET platforms, and it is desirable that
the maintenance be as autonomous as possible, involving human operators only
when it is unavoidable. In an ongoing joint project with MSR's Marvin Theimer,
we are building a framework for continuous self-monitoring and evolution of
.NET service components. We propose to complement that project by building
frameworks for (i) dynamic composition of fault-tolerance components and (ii)
continuous self-testing of .NET services. Further, we propose to integrate and
demonstrate these three frameworks in the Herald .NET service currently being
developed at MSR.
Why Continuous Self-Maintenance?
The following considerations imply that only machine-enforceable aspects of a
large-scale, long-running .NET service can be assumed to hold continuously.
- Lack of global developer knowledge. Distributed large-scale .NET
services contain many components that change constantly. Few people, if
anyone, understand all the evolving state guarantees and behaviors that an
entire service is supposed to maintain. Service complexity is traditionally
constrained by requiring that its components interact through well-defined,
narrow interfaces that precisely specify the invariants that should be
maintained before and after their invocation; however, developers are not
always aware of all the invariants and do not always implement the interface
specifications correctly. Since it is difficult, if not impossible, to evolve
a large-scale distributed service in a globally consistent manner, such
services will generally tend to be incorrect.
- In-place updates. Long-running .NET services get updated in-place
and a second complete "test" service is typically not available for a priori
testing of all the possible interactions that an updated component will have
with the running service. It is inevitable that incorrectly implemented
components will be introduced during service updates, which will propagate
errors into the state of a running service.
- Inevitable unanticipated faults. Any large-scale, long-running
service, even if not updated, will inevitably be subject to unanticipated
faults, such as faults that occur due to defects in implementation, that occur
only when the service is under load, or that do not occur when the service is
instrumented for debugging ("Heisenbugs").
- Incomplete specifications. Incomplete or partial specifications are
unavoidable in service maintenance, especially now that design time scales can
be as short as 12 weeks, and much shorter for critical services.
We thus claim that properties that are not machine enforceable will
eventually become invalid in the .NET context. Continuous maintenance by the
system itself is therefore necessary.
Requirements for Continuous Self-Maintenance.
We focus our attention on the following desiderata.
- Reliability. Service faults should be detected, diagnosed and
tolerated. Since service aspects that deal with faults are rarely tested
during normal operation, reliability should be ensured proactively if
possible. But most importantly, service function should not degrade as a
result of continuous self-maintenance, even across administrative boundaries.
- Availability. Services that are not functioning as desired should
quickly resume desired function.
- Evolution. If the environment changes so that the maintenance of
services becomes inadequate, the maintenance procedures should be evolved
(automatically if possible and manually otherwise).
- Low-cost. New resources should not be necessary for continuous
maintenance. Instead, all existing residual system resources should be
used aggressively, with provision for their quick release whenever the demand
for services grows.
Our Approach to Continuous Self-Maintenance.
Our approach to meeting the above requirements consists of three orthogonal
elements. Various combinations of the three can be used depending on the
particular .NET scenario being considered.
- Dynamic Composition of Tolerance Components. For reliability
against faults, our approach is to wrap .NET service components with tolerance
components, namely detectors and correctors [1]. Tolerance
components check the health of service components, e.g. in terms of data
invariants or timeliness of execution. They may also perform replication,
correction, and consistency enforcement; if these fail, they may notify service
managers so as to enable fault diagnosis.
- Continuous Self-Monitoring and Evolution. For availability, our
approach is to continuously monitor .NET services (and their tolerance
components). The frequency of monitoring is dynamic, and depends on the
service state, the evolving service composition, and the resources available
in the system. It enables detection of dependencies both within and across
services; e.g. a service consisting of multiple server instances for
reliability and performance will be able to analyze its degree of replication
by monitoring dynamic, global conditions. It also enables service evolution.
- Continuous Self-Testing. For proactive reliability, our approach is
to continuously test .NET services via injected faults. Such testing is
carefully designed to not interfere with normal service operation. At the same
time, it exploits available resources in the system.
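The wrapping of a service component with a detector and a corrector can be sketched as follows. This is only an illustration in Python, not the C# design of the proposed framework; the names `Counter`, `Tolerant`, `non_negative`, and `reset_to_zero` are hypothetical, and the invariant (a counter that must stay non-negative) is an assumed example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Counter:
    """Hypothetical service component with the data invariant value >= 0."""
    value: int = 0

    def add(self, delta: int) -> None:
        self.value += delta  # a faulty caller may drive this negative


@dataclass
class Tolerant:
    """Hypothetical wrapper pairing a component with a detector and a corrector."""
    component: Counter
    detector: Callable[[Counter], bool]    # returns True when the invariant holds
    corrector: Callable[[Counter], None]   # attempts to restore the invariant
    notifications: List[str] = field(default_factory=list)

    def call(self, method: str, *args) -> None:
        getattr(self.component, method)(*args)
        if self.detector(self.component):
            return
        self.corrector(self.component)
        if not self.detector(self.component):
            # Correction failed: record a notification for the service manager.
            self.notifications.append(f"invariant violated after {method}{args}")


def non_negative(c: Counter) -> bool:
    return c.value >= 0


def reset_to_zero(c: Counter) -> None:
    c.value = 0


wrapped = Tolerant(Counter(), non_negative, reset_to_zero)
wrapped.call("add", 5)
wrapped.call("add", -9)   # violates the invariant; the corrector resets the value
```

In this sketch the manager is notified only when the corrector cannot restore the invariant, which mirrors the escalation order described above.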
Previous and Current Related Work.
Aladdin [2-3]. Jointly with MSR's Yi-Min Wang and Wilf Russell, we built this
system for local/remote home networking using only undependable commodity
devices, networks (phoneline and powerline), and operating systems (Windows 98).
Typical scenarios of Aladdin use
include: remotely sending an email home to shut the garage door and receiving a
video confirmation; receiving a cellphone message in case of a water leak in the
basement; and using a natural language interface at home to switch off the
lights in any part of the home. Its architecture, software services (such as
SoftStateStore), and protocols met the goals of a dependable and extensible
system with low-complexity and good performance. By dependability, we mean that
its tolerance components enable the system to mask common faults, self-stabilize
from rare faults, and notify the homeowner in case of catastrophic faults. By
extensibility, we mean that new scenarios of use can be added to it with little
effort. Outside of the lab setting, Aladdin has been in daily use in a large
single-family home in Seattle for well over a year now. Aladdin was showcased in
a PBS serial aired in California in Summer 2000, as well as in a Seattle P-I
news article; it has led to four patents. Its architecture is compatible with
.NET, and has motivated Wang and Russell's recent work on .NET Alert Services.
Continuous Monitoring Framework for .NET Services [4-5].
This project accommodates both self-monitoring and human-in-the-loop monitoring
of services. Specifically, mechanisms are provided to continuously monitor the
liveness of all services as well as their detectors and correctors, which are
tasked with satisfaction of service-specific data invariants. This includes a
means for dynamically controlling which services and detectors/correctors get
checked and how often; a means for dynamically detecting conflicting data
invariants that have been posited for the service; a means of analyzing the
"error" behavior of a service; a means of evolving the service and its
associated data invariants; and a means for allowing all residual system
resources to be made available for continual monitoring.
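One way to picture "how often" a service gets checked is a policy that shortens the monitoring interval when a service looks unhealthy and lengthens it when residual resources are scarce. The sketch below is in Python and purely illustrative; the policy, the function `next_interval`, the caps, and the service names are assumptions, not part of the monitoring framework described in [4-5].

```python
from dataclasses import dataclass


@dataclass
class MonitoredService:
    name: str
    base_interval_s: float      # interval when healthy and resources are plentiful
    recent_failures: int = 0    # number of recently failed liveness checks


def next_interval(svc: MonitoredService, spare_capacity: float) -> float:
    """Return seconds until the next liveness check (hypothetical policy).

    spare_capacity is the fraction of residual system resources, in [0, 1].
    Recent failures shorten the interval; scarce resources lengthen it.
    """
    if not 0.0 <= spare_capacity <= 1.0:
        raise ValueError("spare_capacity must be in [0, 1]")
    urgency = 2 ** min(svc.recent_failures, 4)    # at most 16x more frequent
    throttle = 1.0 / max(spare_capacity, 0.1)     # at most 10x less frequent
    return svc.base_interval_s * throttle / urgency


healthy = MonitoredService("service-a", base_interval_s=60.0)
suspect = MonitoredService("service-b", base_interval_s=60.0, recent_failures=2)
```

Under these assumptions, `suspect` is checked four times as often as `healthy` at the same resource level, and both back off as spare capacity shrinks.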
Proposed Projects.
Framework for Dynamic Composition of Tolerance Components
- Dynamic Composition. This framework will include mechanisms for
tolerance components to be added, removed, and updated while the .NET service
is running. Distributed implementation of tolerance components, i.e. between
clients and servers, will be enabled. Calls to a component will be dynamically
redirected to its tolerance component, which will adjust component input and
output as necessary to invoke new distributed coordination/fault-tolerance
protocols. Layered addition of tolerance components will be allowed.
- Consistency Semantics. The framework will ensure that (i) tolerance
component updates are atomic with respect to the ongoing service
interactions; (ii) a tolerance component is removed only if all of its
instances are removed; (iii) a tolerance component update implies update of
all of its instances; and (iv) orphans (unused tolerance components) and
widows (tolerance components that call nonexistent components) are handled
correctly.
- Multiple Abstraction Levels. The framework will allow composition
of tolerance components per instance, per component, or per group of
instances/components.
- Minimal Modification. The framework will require only minimal
modification of the underlying service component source code, i.e. extending
service components to inherit from a special platform class.
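The call redirection and layered composition described above can be sketched as follows. This Python illustration is not the proposed C# framework: the base class `ComposableComponent`, its methods, and the snapshot-based approach to keeping updates atomic with respect to in-flight calls are all assumptions made for the example.

```python
import threading
from typing import Any, Callable, Dict, Tuple

# A tolerance layer takes the next callable in the chain and returns a wrapped one.
Layer = Callable[[Callable[..., Any]], Callable[..., Any]]


class ComposableComponent:
    """Hypothetical platform base class that service components inherit from."""

    def __init__(self) -> None:
        self._layers: Tuple[Tuple[str, Layer], ...] = ()
        self._lock = threading.Lock()

    def add_layer(self, name: str, layer: Layer) -> None:
        with self._lock:
            # Replace the whole tuple so in-flight calls keep their own snapshot.
            self._layers = self._layers + ((name, layer),)

    def remove_layer(self, name: str) -> None:
        with self._lock:
            self._layers = tuple(l for l in self._layers if l[0] != name)

    def invoke(self, method: str, *args: Any) -> Any:
        layers = self._layers          # snapshot: later updates do not affect this call
        target = getattr(self, method)
        for _, layer in layers:        # layers added later wrap earlier ones
            target = layer(target)
        return target(*args)


class Lookup(ComposableComponent):
    """Hypothetical service component."""

    def __init__(self, table: Dict[str, str]) -> None:
        super().__init__()
        self.table = table

    def get(self, key: str) -> str:
        return self.table[key]         # raises KeyError for unknown keys


def default_on_missing(inner):
    """Tolerance layer: masks KeyError by returning a default value."""
    def wrapped(*args):
        try:
            return inner(*args)
        except KeyError:
            return "<unknown>"
    return wrapped


svc = Lookup({"a": "alpha"})
svc.add_layer("default", default_on_missing)   # added while the service is "running"
```

Here the only change to the service component is inheriting from the base class, and a layer can be removed again with `svc.remove_layer("default")` without touching `Lookup` itself.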
Framework for Continuous Self-Testing of Services
- Dynamic Injection. This framework will include mechanisms for
simulating faults by execution of fault methods ("faultors") by an
environment. It will enable dynamic scripting of faultors. Going further than
the dynamic composition framework, it will also enable dynamic addition of
methods to a component.
- Test Specifications. The framework will provide mechanisms for
testing of temporal event predicates, in addition to global state predicates,
for each faultor group.
- Non-Interference. The framework will include mechanisms to quickly
reconfigure services into functioning and test versions, and to prevent
interference of service interactions during reconfiguration. Dually, quick
release of testing resources and scaling up of functioning services will be
supported to deal with increases in service demand. Design patterns will be
included to hide the effects of simulated faults from clients, e.g. via
replication and checkpointing-rollback.
- Discovery Support. The framework will include mechanisms to support
discovery of new faults, invariants, and dependencies.
Both frameworks will be implemented in C#. Browser-based interfaces will be
provided to access the various facilities interactively. The interfaces will
include tracing and reporting facilities so that the current composition status
of the system can be monitored.
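The fault-injection and test-specification mechanisms of the self-testing framework can be pictured with a small sketch. It is written in Python for brevity rather than in C#, and everything in it is an assumption made for illustration: the `Replica` component, the `crash` faultor, the `inject` helper, and the particular temporal predicate ("every fault is eventually followed by a recovery").

```python
import types
from typing import List


class Replica:
    """Hypothetical service component that records events for test predicates."""

    def __init__(self) -> None:
        self.alive = True
        self.events: List[str] = []

    def heartbeat(self) -> None:
        if not self.alive:
            self.alive = True          # simple recovery action
            self.events.append("recovered")
        self.events.append("heartbeat")


def crash(self: Replica) -> None:
    """Faultor: simulates a crash fault when executed by the environment."""
    self.alive = False
    self.events.append("fault:crash")


def inject(component: object, name: str, faultor) -> None:
    """Dynamically add a faultor as a bound method of one component instance."""
    setattr(component, name, types.MethodType(faultor, component))


def every_fault_recovers(events: List[str]) -> bool:
    """Temporal event predicate: each fault is later followed by 'recovered'."""
    pending = False
    for e in events:
        if e.startswith("fault:"):
            pending = True
        elif e == "recovered":
            pending = False
    return not pending


r = Replica()
inject(r, "crash", crash)      # the method did not exist on Replica before
r.crash()                      # the environment executes the faultor
violated_before = not every_fault_recovers(r.events)
r.heartbeat()                  # normal operation triggers recovery
```

The predicate is evaluated over the recorded event sequence rather than a single state, which is the distinction the framework draws between temporal event predicates and global state predicates.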
Integration with ongoing MSR .NET projects.
.NET Services built in the Systems and Networking group at MSR provide a
compelling avenue for demonstrating and validating the proposed frameworks.
Thus, after having consulted with Marvin Theimer, we propose to integrate and
demonstrate the frameworks described above in the context of the Herald global
eventing service that Marvin and his colleagues are building. By maintaining
data invariants for Herald service components at all levels (simple components,
single machines, groups of machines, and system wide), we will be able to
measure the effect on availability of satisfying these invariants. By monitoring
intra-component as well as inter-component dependencies in Herald, we will be
able to implement and compare various replica management policies, thereby
obtaining a better understanding of scenarios where global information is
necessary and how to distribute the cost of repair over the global system. By
injecting faults in Herald, we will be able to collect fault data about
long-running .NET services much faster than we have collected data on faults
that actually occurred in the Aladdin home network installations.
References
1. Arora, A. and S. Kulkarni. Detectors and correctors:
A theory of fault-tolerance components. In International Conference on
Distributed Computing Systems. 1998.
2. Wang, Y.-M., W. Russell, A. Arora, J. Xu, and R.
Jagannathan. Towards dependable home networking: An experience report. In
International Conference on Dependable Systems and Networks (ICDSN'2000).
2000. Seattle.
3. Wang, Y.-M., W. Russell, and A. Arora. A toolkit for
building dependable and extensible home networking applications. In
Fourth USENIX Windows Systems Symposium (USENIX-WIN'2000). 2000. Seattle.
4. Arora, A. and M. Theimer. Agenda for fault-tolerant
distributed computing in .NET. 2001, The Ohio State University and Microsoft
Research.
5. Arora, A. and M. Theimer. Service evolution
framework. 2001, The Ohio State University and Microsoft Research.