Autonomic distributed systems

Think about this

[Figure: human population and computer population (×10^9), plotted from 1980 to 2010]

Think about this

Machines will fail from time to time, regardless of how carefully they are designed. But who will manage these systems? Even if everyone joins IT, it is not enough! Isn’t this a crisis?

Systems have to take care of themselves.

Self-help is the best help.

What does it mean?

There are many such desirable self-* properties that can be added to the wish list. Collectively called self-* properties, they characterize an autonomic system.

Self-help

Self-healing

Self-organizing

Self-optimizing

Self-protecting

Self-managing

Self-stabilizing

Self-healing

The Spirit Mars rover has a radiation-hardened RAD6000 CPU from Lockheed-Martin Federal Systems. One day, while performing a crucial task, Spirit fell silent, alone in the emptiness of Mars.

What next?

Courtesy: Jet Propulsion Lab

Self-healing

The problem was eventually remotely detected by ground control.

The operating system tried to allocate more files than the RAM-based directory structure could accommodate. It caused an exception that suspended the task that attempted the allocation. NASA ground control deleted some files, and reformatted the entire flash memory system. On February 6, 2004 the rover was restored to its original working condition, and science activities resumed.

It would have been nice if the detection and repair could be done by the rover itself …

Self-stabilization

• Technique for spontaneous restoration of a system predicate.

• Forward error recovery (memoryless): does not bother about the impact of the failure as long as the recovery is guaranteed.

• Guarantees eventual safety following failures.

Feasibility demonstrated by Dijkstra (CACM 1974)

Self-stabilizing systems

Starting from any initial configuration, the system is guaranteed to recover to a legitimate configuration (L is true) in a bounded number of steps, as long as the program code is not corrupted.


Transient failures perturb the global state. The ability to spontaneously recover from any initial state implies that no initialization is ever required.

Self-stabilizing systems exhibit non-masking fault tolerance. They satisfy the following two criteria:

1. Convergence

2. Closure

[Figure: state space containing the legal region L and its complement (not L), with arrows labeled fault, convergence, and closure]

Adaptive Distributed Systems

System behavior spontaneously changes when the environment changes

A traffic control system

AM / PM

AM ⇒ L_AM holds

PM ⇒ L_PM holds

L = (AM ⇒ L_AM) ∧ (PM ⇒ L_PM)

defines the system invariant
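
The invariant for the traffic control example can be sketched as a small predicate. This is only an illustration: it assumes that PM means "not AM", and the two boolean inputs `l_am` and `l_pm` are hypothetical stand-ins for whatever predicates L_AM and L_PM express in the real system.

```python
def implies(a: bool, b: bool) -> bool:
    """Logical implication a => b."""
    return (not a) or b


def invariant(is_am: bool, l_am: bool, l_pm: bool) -> bool:
    """L = (AM => L_AM) and (PM => L_PM), assuming PM is simply 'not AM'."""
    return implies(is_am, l_am) and implies(not is_am, l_pm)
```

Only the predicate that matches the current environment has to hold: in the morning the value of `l_pm` is irrelevant, and in the afternoon the value of `l_am` is.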

Example 1: Stabilizing mutual exclusion

[Figure: unidirectional ring of processes 0, 1, 2, ..., N-1]

Consider a unidirectional ring of processes. In the legal configuration, exactly one token will circulate in the network.

A solution

{Process 0}      repeat x[0] = x[N-1] → x[0] := (x[0] + 1) mod K forever

{Process j > 0}  repeat x[j] ≠ x[j-1] → x[j] := x[j-1] forever

The state of process j is x[j] ∈ {0, 1, 2, ..., K-1}, and K > N.

TOKEN = ENABLED GUARD

In each statement, the part before the arrow is the guard (condition) and the part after it is the action.

Does it work?

First, be convinced that it works.

Then think about why it will work.
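
One way to get convinced is to simulate the ring. The sketch below is a minimal simulation, not a proof. It assumes a central scheduler that fires one enabled process at a time, chosen at random, and it assumes K > N. The function names, the seed, and the values N = 6 and K = 7 are illustrative choices, not part of the slides.

```python
import random


def has_token(x, j):
    """Process j holds a token exactly when its guard is enabled."""
    if j == 0:
        return x[0] == x[-1]
    return x[j] != x[j - 1]


def move(x, j, K):
    """Execute the action of process j (the caller ensures its guard is enabled)."""
    if j == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[j] = x[j - 1]


def token_count(x):
    """Number of processes whose guard is currently enabled."""
    return sum(has_token(x, j) for j in range(len(x)))


def stabilize(x, K, rng, max_steps=10_000):
    """Fire one randomly chosen enabled process at a time until one token remains.

    Returns the number of moves taken.
    """
    for steps in range(max_steps + 1):
        holders = [j for j in range(len(x)) if has_token(x, j)]
        if len(holders) == 1:
            return steps
        move(x, rng.choice(holders), K)
    raise RuntimeError("did not stabilize within max_steps")


if __name__ == "__main__":
    rng = random.Random(1974)  # arbitrary seed, for repeatability only
    N, K = 6, 7                # assumed sizes with K > N
    x = [rng.randrange(K) for _ in range(N)]  # arbitrary (possibly illegal) start
    print("start:", x, "tokens:", token_count(x))
    steps = stabilize(x, K, rng)
    print("after", steps, "moves:", x, "tokens:", token_count(x))
```

At least one guard is always enabled: if no process j > 0 is enabled, all values are equal, so the guard of process 0 holds. Running further moves after stabilization should keep the token count at one, which is the closure property.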

Example 2: Stabilizing spanning tree

• Given a connected graph G = (V,E) and a root r, design an algorithm for maintaining a spanning tree in the presence of transient failures that may corrupt the local states of processes.

• Let n = |V|

A solution

Each process i has two variables L(i) and P(i):

L(i) = distance from the root via tree edges

P(i) = parent of process i

By definition L(r) = 0, and P(r) is undefined. In a legal state

∀ i ∈ V | i ≠ r : L(i) ≠ n ∧ L(i) = L(P(i)) + 1.

Sample case

[Figure: two drawings of a sample graph with nodes 0 to 5; in the second, P(2) is corrupted]

The algorithm

The algorithm has three rules R0, R1, R2:

(R0) (L(i) ≠ n) ∧ (L(i) ≠ L(P(i)) + 1) ∧ (L(P(i)) ≠ n) → L(i) := L(P(i)) + 1

(R1) (L(i) ≠ n) ∧ (L(P(i)) = n) → L(i) := n

(R2) (L(i) = n) ∧ (∃ k ∈ Neighbors(i) : L(k) < n-1) → L(i) := L(k) + 1; P(i) := k
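
The three rules can be tried out with a small simulation. This is a sketch under stated assumptions: a central scheduler fires one enabled rule at a time, chosen at random; rule R2 picks the first eligible neighbor (the rule allows any); the root is process 0; and the graph `ADJ` is a hypothetical 6-node example, not the sample graph from the slides.

```python
import random


def enabled_rule(i, L, P, adj):
    """Return which of R0, R1, R2 is enabled at non-root process i, or None."""
    n = len(adj)
    p = P[i]
    if L[i] != n and L[i] != L[p] + 1 and L[p] != n:
        return "R0"
    if L[i] != n and L[p] == n:
        return "R1"
    if L[i] == n and any(L[k] < n - 1 for k in adj[i]):
        return "R2"
    return None


def fire(i, rule, L, P, adj):
    """Execute the action of the given rule at process i."""
    n = len(adj)
    if rule == "R0":
        L[i] = L[P[i]] + 1
    elif rule == "R1":
        L[i] = n
    else:  # R2: adopt the first eligible neighbor (any eligible k is allowed)
        k = next(k for k in adj[i] if L[k] < n - 1)
        L[i] = L[k] + 1
        P[i] = k


def is_legal(L, P, root):
    """Legal state: every non-root i has L(i) != n and L(i) = L(P(i)) + 1."""
    n = len(L)
    return all(L[i] != n and L[i] == L[P[i]] + 1
               for i in range(n) if i != root)


def stabilize(L, P, adj, root, rng, max_steps=100_000):
    """Fire randomly chosen enabled rules until none is enabled; return move count."""
    for steps in range(max_steps):
        moves = [(i, rule) for i in range(len(adj)) if i != root
                 for rule in [enabled_rule(i, L, P, adj)] if rule]
        if not moves:
            return steps
        i, rule = rng.choice(moves)
        fire(i, rule, L, P, adj)
    raise RuntimeError("did not stabilize within max_steps")


# Hypothetical connected graph: adjacency lists for nodes 0..5, root is node 0.
ADJ = [[1, 5], [0, 2], [1, 3, 4, 5], [2, 4], [2, 3], [0, 2]]


def random_state(adj, root, rng):
    """Arbitrary (possibly corrupted) state: L in 0..n, P any neighbor."""
    n = len(adj)
    L = [0 if i == root else rng.randrange(n + 1) for i in range(n)]
    P = [None if i == root else rng.choice(adj[i]) for i in range(n)]
    return L, P
```

A typical use is `L, P = random_state(ADJ, 0, random.Random(1))`, then `stabilize(L, P, ADJ, 0, random.Random(2))`, followed by a check that `is_legal(L, P, 0)` holds.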

Proof of stabilization

Define an edge from i to P(i) to be well-formed when L(i) ≠ n, L(P(i)) ≠ n, and L(i) = L(P(i)) + 1. In any configuration, the well-formed edges form a spanning forest. Delete all edges that are not well-formed. Designate each tree T(k) in the forest by the lowest value of L in it.

Example

In the sample graph shown earlier, T(0) = {0, 1} and T(2) = {2, 3, 4, 5}.

Let F(k) denote the number of T(k)’s in the forest. Define a tuple F = (F(0), F(1), F(2), …, F(n)).

For the sample graph, F = (1, 0, 1, 0, 0, 0) after node 2 had the transient failure that changed P(2) from 2 to 4.
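
The tuple F can be computed directly from the L and P values. The sketch below relies on one observation: along well-formed edges L grows by one per hop away from the top of a tree, so the lowest L in a tree is the L of the node that has no well-formed edge to its parent. The configuration used here is hypothetical (it is not read off the slide's figure); it is chosen so that T(0) = {0, 1} and T(2) = {2, 3, 4, 5}.

```python
def well_formed(i, L, P, n):
    """Edge i -> P(i) is well-formed when L(i) != n, L(P(i)) != n, L(i) = L(P(i)) + 1."""
    return L[i] != n and L[P[i]] != n and L[i] == L[P[i]] + 1


def forest_tuple(L, P, root):
    """Return [F(0), ..., F(n)], where F(k) counts trees whose lowest L value is k."""
    n = len(L)
    F = [0] * (n + 1)
    for i in range(n):
        # A node tops a tree if it is the root or its parent edge is not well-formed.
        if i == root or not well_formed(i, L, P, n):
            F[L[i]] += 1
    return F


# Hypothetical configuration with n = 6 and root 0; P(2) points to node 4.
L = [0, 1, 2, 3, 4, 3]
P = [None, 0, 4, 2, 3, 2]
print(forest_tuple(L, P, root=0))
```

For this configuration the result is [1, 0, 1, 0, 0, 0, 0]: one tree T(0) and one tree T(2). The list has n + 1 entries because it follows the definition F(0) through F(n).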

Skeleton of the proof

Minimum F = (1,0,0,0,0,0) {legal configuration}

Maximum F = (1, n-1, 0, 0, 0, 0).

With each action, F decreases lexicographically.

Verify the claim!

This proves that eventually F becomes (1,0,0,0,0,0) and the spanning tree stabilizes.
