
Robustness of Computer Networks is a Matter of the Heart
or How to Make Distributed Computer Networks Extremely Robust

Ariel Daliot (1), Danny Dolev (1), Hanna Parnas (2)

(1) School of Engineering and Computer Science, (2) Department of Neurobiology and the Otto Loewi Center for Cellular and Molecular Neurobiology, The Hebrew University of Jerusalem, Israel


Lecture Outline

Definition of Robustness
Robustness of Biological Systems vs. Engineered Systems
Carrying Over a Mechanism for Robustness (pulse synchronization) from Biology to Distributed Networks
“Riding Two Tigers” – Superimposing two orthogonal fault models to attain extreme robustness of almost any distributed algorithm

General Definition of Robustness

“Robustness is what enables a system to maintain its functionalities in spite of external and internal perturbations”

Robustness as modeled in non-linear dynamics

The system state is a point in phase space
There is an attractor (a point or a limit cycle) in phase space which represents the desired functionality of the system
Perturbations forcefully move the point representing the system’s state
Robustness is the property of attraction
The degree of robustness is characterized by the basin of attraction
Upon a perturbation, robustness can manifest itself in one of two ways:
– The system returns to its current attractor (e.g. heart-beat when resting)
– The system moves to a new attractor that maintains the system’s functionality (e.g. heart-beat when walking at a constant speed)
Otherwise, unstable regions of phase space can be reached (e.g. a heart-beat rate that starts to damage the organism)
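As a concrete illustration of the attractor picture above, here is a minimal numerical sketch (my own illustration, not from the lecture): a one-dimensional system with two attractors. A perturbation that stays inside the basin of attraction decays back to the original attractor, while a larger perturbation moves the state into the basin of a different attractor.

```python
# Minimal sketch (illustrative only): dx/dt = x - x^3 has two attractors at
# x = +1 and x = -1; the basin of attraction of +1 is x > 0.

def simulate(x0, dt=0.01, steps=2000):
    """Integrate dx/dt = x - x^3 with simple Euler steps."""
    x = x0
    for _ in range(steps):
        x += dt * (x - x ** 3)
    return x

resting = 1.0                         # system sitting on the attractor x = +1
small_kick = simulate(resting - 0.8)  # pushed to 0.2: still in the basin of +1
large_kick = simulate(resting - 1.5)  # pushed to -0.5: now in the basin of -1

print(f"after a small perturbation the state returns to {small_kick:+.3f}")
print(f"after a large perturbation the state settles at  {large_kick:+.3f}")
```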

Some Intuitive Conjectures on Robustness

It is advantageous for a system to have at least some degree of robustness
Demand for more robustness implies higher complexity
There is a tradeoff between robustness and performance
There is a tradeoff between robustness and cost
There is a tradeoff between robustness and resource demands
The degree of robustness of a system is a function of the nature and probability of the faults
The world is dynamic, thus robustness facilitates evolvability (the capacity for non-lethal heritable variation), and evolution selects robust traits

Robustness in Biology
(biological systems rarely “blue-screen”)

Robustness in Biology

Biological systems have extraordinary evolvability
Biological systems are complex and fine-tuned over the long course of evolution
Power laws are ubiquitous in nature (Pareto distribution), i.e. “20% of the causes account for 80% of the cases”
Thus biological systems have evolved to be robust to most of the perturbation occurrences
I.e. most perturbations fall on trajectories in the basin of attraction of some attractor that maintains system functionality
On the other hand they are very fragile, as certain perturbations can cause catastrophic, cascading failures (e.g. dinosaurs vs. mammals)
Thus they are very robust to random failures but very vulnerable to targeted attacks; they are, or behave like, scale-free networks (Barabasi, Nature, 2000)
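To make the random-failure vs. targeted-attack contrast concrete, here is a small simulation sketch (my illustration, not part of the lecture; assumes networkx is installed): it removes the same number of nodes from a Barabási–Albert scale-free graph either at random or by highest degree and compares the largest remaining connected component.

```python
# Illustrative sketch: degradation of a scale-free network under random
# failures vs. a targeted attack on its hubs.
import random
import networkx as nx

n, k = 1000, 100                                  # network size, nodes removed
g = nx.barabasi_albert_graph(n, 2, seed=42)       # scale-free network

def largest_component_fraction(h):
    """Fraction of the original nodes still in the largest connected component."""
    if h.number_of_nodes() == 0:
        return 0.0
    return max(len(c) for c in nx.connected_components(h)) / n

random_failures = g.copy()
random_failures.remove_nodes_from(random.sample(list(g.nodes), k))

targeted_attack = g.copy()
hubs = sorted(g.degree, key=lambda pair: pair[1], reverse=True)[:k]
targeted_attack.remove_nodes_from(node for node, _ in hubs)

print("random failures :", largest_component_fraction(random_failures))
print("targeted attack :", largest_component_fraction(targeted_attack))
# The targeted attack usually shrinks the largest component considerably more
# than the random failures do -- the fragility described above.
```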

Examples of Robustness in Biology

Protein interaction networks (Giot et al., Science, 2003)
Chemotaxis in bacteria (Alon et al., Nature, 1999)
Circadian clock
Robustness against mutations
Homeostasis (e.g. mammalian temperature regulation)
Adaptability of organisms to changing environments
Mammals vs. dinosaurs
Cardiac pacemaker (Sivan, Dolev & Parnas, 2000)

Mechanisms that Facilitate Biological Robustness

(Kitano, Nature 2004)

System control (feedback mechanisms)
Fail-safe mechanisms (redundancy and diversity)
Modularity
Decoupling (containment of faults inside the modules)

Similar principles are used to attain robustness in engineered systems!

Thus it makes much sense to search for and understand biological mechanisms for robustness that can be carried over to computer systems

Robustness in Engineered Systems

Low evolvability, sometimes constrained by historical and non-technical considerations
Starting to be complex, but much less so than biological systems
Systems are typically designed according to the very extremes of the power laws, i.e. robust only to the most frequent perturbations, if at all
Design effort is usually invested in performance and cost, less in robustness, as this is costly and “not needed in the average case”
Engineered systems are typically not robust to many perturbations
Furthermore they are very fragile, as certain perturbations can cause catastrophic or cascading failures (e.g. the NYC blackout); they are typically vulnerable to targeted attacks, though they shouldn’t be!

Importance of Robustness in Distributed Computer Systems

Distributed systems are becoming an integral part of everyday systems

Distributed systems are becoming increasingly complex

This leads to an increased need for robustness

Example of Robustness of an Engineered System – AFCS

(Automatic Flight Control System: maintaining direction, altitude and velocity)

Fault Models in Distributed Computer Systems

Link/omission faults
Crash/stop/napping faults
Byzantine failures (ongoing “malicious” faults)
Transient faults (the system is temporarily forced into an arbitrary state or total chaos)

Tolerating ongoing perturbations while converging from any point in state space is a “wishful”, highly desirable property for a robust distributed system

Byzantine Faults

Maliciousness; two-faced behavior; code bugs that express themselves over time; hardware corruption; unpredictable behavior; unpredictable local faults; code corruption
Usually requires n > 3f to tolerate f faults (without authentication)
Byzantine algorithms typically focus on confining the influence of ongoing faults, assuming an initial consistent state of the correct nodes
Can be modeled by arbitrary perturbations in a fraction of the n dimensions of state space within a certain time window
Within that time window, no perturbations whatsoever are allowed in the rest of the dimensions
A transient violation of the above will typically throw the system state forever into an unstable region of state space
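A tiny numerical sketch of the n > 3f resilience bound mentioned above (my illustration): given the number of nodes, how many Byzantine faults can be tolerated without authentication, and the minimum system size for a desired number of faults.

```python
# Illustrative helpers for the classical resilience bound n > 3f
# (Byzantine agreement without authentication).

def max_tolerable_faults(n: int) -> int:
    """Largest f such that n > 3f."""
    return (n - 1) // 3

def min_nodes_for(f: int) -> int:
    """Smallest n that satisfies n > 3f."""
    return 3 * f + 1

assert max_tolerable_faults(4) == 1   # 4 nodes tolerate a single Byzantine node
assert max_tolerable_faults(6) == 1   # 6 nodes still tolerate only one
assert min_nodes_for(2) == 7          # two Byzantine faults require at least 7 nodes
```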

Self-Stabilization

Addresses the situation when ALL nodes can concurrently be faulty for a limited period of time
Self-stabilizing algorithms focus on realizing the task following a “catastrophic” state, once the system is back within the assumption boundaries
Typically modeled by an arbitrary perturbation in state space, followed by no perturbation whatsoever in any of the dimensions until the state returns to the attractor
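As a standard, well-known illustration of convergence from an arbitrary state (not the lecture's algorithm), the following sketch simulates Dijkstra's classical self-stabilizing token ring: from ANY initial assignment of counters, the ring converges to a legal state in which exactly one process holds a privilege (the "token") at any time.

```python
# Illustrative sketch: Dijkstra's K-state self-stabilizing token ring.
import random

N = 7          # number of processes in the ring
K = N + 1      # counter range; K >= N guarantees stabilization

def privileged(x):
    """Indices of processes currently holding a privilege (a 'token')."""
    tokens = [0] if x[0] == x[N - 1] else []
    tokens += [i for i in range(1, N) if x[i] != x[i - 1]]
    return tokens           # never empty: at least one process is always privileged

def step(x, i):
    """Let process i make its move."""
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

x = [random.randrange(K) for _ in range(N)]   # arbitrary (possibly corrupted) state
for _ in range(10 * N * N):                   # let a scheduler run long enough
    step(x, random.choice(privileged(x)))

print("privileges held:", privileged(x))      # eventually exactly one at any time
```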


Byzantine Faults and Self-Stabilization

Self-stabilization is “orthogonal” to Byzantine failures, i.e. these are uncorrelated fault models

They are “complementary” fault models: superimposed, they make an algorithm overcome any type of fault from any state

Very few protocols (~3) possess both properties, despite decades of research. Two of these protocols have super-exponential time complexity

Cardiac ganglion of the lobster (Sivan, Dolev & Parnas, 2000)

Four interneurons tightly synchronize their pulses in order to give the heart its optimal pulse, fault tolerantly

Able to adjust the synchronized firing pace, up to a certain bound (e.g. while escaping a predator)

[Diagram: the interneurons’ pulse trains converging onto the motor neurons of the heart]

The target is to synchronize the pulses from any state and despite any faults

[Diagram: the nodes’ pulse trains over time t, converging from an arbitrary state to a synchronized state (σ) within a cycle; the faulty nodes’ pulse trains remain arbitrary]

“Pulse Synchronization” in distributed computer systems

The computers are required to:

Invoke regular pulses in tight synchrony
Synchronize from ANY state (self-stabilization)
Have a bounded pulse frequency
Tolerate up to a third of the nodes being permanently Byzantine faulty
(a toy simulation of pulse coupling follows below)

Examples of other synchronization problems: “Firing Squad”, “Clock Synchronization”, etc.
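The lecture's Byzantine self-stabilizing pulse synchronization algorithm itself is not reproduced here, but the following toy sketch (a standard pulse-coupled oscillator model in the spirit of Mirollo–Strogatz, my own illustration) conveys the biological intuition: each node advances a phase, fires a pulse when the phase reaches 1, and every pulse it hears nudges its own state forward, so that from an arbitrary initial state the firing times converge.

```python
# Toy pulse-coupled oscillator model (Mirollo-Strogatz style), for intuition only;
# this is NOT the lecture's fault-tolerant pulse synchronization algorithm.
import math
import random

N, EPS, B = 10, 0.05, 3.0           # oscillators, coupling strength, curvature of f

def f(phi):                         # concave phase-to-state curve (needed for synchrony)
    return math.log(1.0 + (math.e ** B - 1.0) * phi) / B

def f_inv(x):                       # inverse of f
    return (math.e ** (B * x) - 1.0) / (math.e ** B - 1.0)

def fire_round(phases):
    """Advance time until the leading oscillator fires, then apply its pulse."""
    dt = 1.0 - max(phases)                      # time until the leader reaches phase 1
    phases = [p + dt for p in phases]
    out = []
    for p in phases:
        if p >= 1.0:
            out.append(0.0)                     # fires together with the leader
        else:
            x = f(p) + EPS                      # hearing the pulse bumps the state by EPS
            out.append(0.0 if x >= 1.0 else f_inv(x))  # absorbed into the group, or nudged
    return out

phases = [random.random() for _ in range(N)]    # arbitrary initial state
for _ in range(1000):
    phases = fire_round(phases)

print("phase spread:", max(phases) - min(phases))  # typically ~0: the pulses have synchronized
```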

Making Any* Byzantine Algorithm Stabilize

I.e. robust to any arbitrary transient perturbation of the system state together with arbitrary permanent perturbations in up to a third of the dimensions
Almost as robust as a distributed computer system can get
In a sense more robust than typical biological systems, as it covers much of the state space and so defends against targeted attacks, but it lacks real evolvability and adaptability
Adding evolvability and adaptability could make it as robust as algorithms can be

*Restrictions on the Basic Algorithm

Can be initialized σ (pulse skew) time units apart
Has sampling points where the application state is safe to read
Sampling points can be identified by reading the PC
During legal executions of the basic algorithm, all the sampling points are within Δ time of each other
A snapshot of application states that are read within Δ real time of each other is “consistent”, i.e. meaningful

Outline of the Scheme

At the “pulse” event:
– Send the local state to all nodes and Byzantine-agree on it;
– All correct nodes now see the same global snapshot;
– Check whether the global snapshot represents a legal state;
– If yes, but your own state is corrupt, then repair your state;
– If not, then reset the basic algorithm;

Scheme Pitfalls

The general scheme may seem very simple, but…
When the basic algorithm is not synchronized, how close do the sampling points need to be in order to get a “consistent” snapshot?
And if they are not close, how do you detect that?
And if the basic algorithm is synchronized, what happens if the sampling points are around the pulse, s.t. some correct nodes send their states subsequent to the pulse and some don’t?
Assuming the global snapshot seems “consistent”, can the predicate detection module always detect whether the application is in an illegal state, considering the uncertainties in the consistencies?

ByzStabilizer: Stabilizes any* Byzantine Algorithm

ByzStabilizer
At “pulse” event
Begin
1. Abort any other ByzStabilizer;
2. If (must-reset) then reset the basic algorithm;
3. When reaching an identified state, exchange the state values and the elapsed time since the pulse;
4. “Byzantine Agree” on the (state, elapsed-time) sent by each node;
5. Sift through the agreed values for a set of values with elapsed times within some Δ of each other, comprising a consistent global snapshot;
6. If there is no such set then do must-reset := true and propose a pulse;
7. Do predicate evaluation on the consistent global snapshot;
8. If the predicate is satisfied but you are not part of the set, then repair your state;
9. If the predicate is satisfied then the basic algorithm is in a legal state; do nothing;
10. Else do must-reset := true and propose a pulse;
End
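Steps 1–10 above are mostly straight control flow; the one step with some algorithmic content is the sifting in step 5. The following is a minimal, hedged sketch of how that sift could look (my illustration; the n − f threshold and the data layout are assumptions, not taken from the lecture):

```python
# Hedged sketch of step 5: sift the Byzantine-agreed (state, elapsed-time) pairs
# for a consistent global snapshot, i.e. a large enough set of values whose
# elapsed times lie within DELTA of each other.

def sift_consistent_snapshot(agreed, delta, f):
    """agreed: {node_id: (state, elapsed_time)} as output by Byzantine agreement.
    Returns {node_id: state} for some set of >= n - f values whose elapsed
    times are within delta of each other, or None if no such set exists."""
    n = len(agreed)
    items = sorted(agreed.items(), key=lambda kv: kv[1][1])   # sort by elapsed time
    for _, (_, t0) in items:
        window = {nid: state for nid, (state, t) in items if 0 <= t - t0 <= delta}
        if len(window) >= n - f:
            return window
    return None

# Tiny usage example with made-up numbers: 4 nodes, node 3 reporting a wildly
# different elapsed time, tolerated as the single faulty node (f = 1).
agreed = {0: ("s0", 0.012), 1: ("s1", 0.013), 2: ("s2", 0.015), 3: ("sX", 0.300)}
print(sift_consistent_snapshot(agreed, delta=0.010, f=1))
# -> {0: 's0', 1: 's1', 2: 's2'}   (node 3 falls outside every delta-window)
```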

[Diagram: sifting the agreed set of values. Different nodes invoke their pulse at different times (pulse uncertainty), the agreement completion time adds further uncertainty, and the first “Δ” uncertainty bounds the sampling points; together these define a safe region. The f+1st value in the safe region is identified, the end of the region is defined with respect to its elapsed time, and the agreed set must lie within this region.]

Time Complexity and Convergence Time of ByzStabilizer

Ω[σ + Δ + Σ + (2f+1)·RTT] ≈ Ω[Σ + (2t+3)·RTT]
– σ is the pulse skew
– Δ is the sampling-point skew
– Σ is the time complexity of the basic algorithm
– RTT is the round-trip time
– t is the actual number of permanent Byzantine faults

This is roughly the complexity of the Byzantine Agreement and the basic algorithm combined
The time complexity equals the convergence time, i.e. even when everything is OK you pay the price of the convergence time
If solving the basic problem can be reduced to consensus on one value, then we can give a scheme that has a time complexity of 2 RTTs!! I.e. when everything is OK you pay almost nothing
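As a sanity check on the bound above, a tiny worked computation with made-up numbers (all parameter values below are purely illustrative, not from the lecture):

```python
# Illustrative plug-in of assumed values into the bound
# sigma + delta + SIGMA + (2f + 1) * RTT from the slide above.

sigma = 0.002   # pulse skew (s), assumed value
delta = 0.010   # sampling-point skew (s), assumed value
SIGMA = 0.500   # time complexity of the basic algorithm (s), assumed value
rtt   = 0.050   # round-trip time (s), assumed value
f     = 2       # tolerated permanent Byzantine faults

bound = sigma + delta + SIGMA + (2 * f + 1) * rtt
print(f"convergence-time bound ~ {bound:.3f} s")   # 0.002 + 0.010 + 0.500 + 5*0.050 = 0.762 s
```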

Summary

Biological systems are more robust than engineered systems
Both use supposedly the same principles for robustness
We defined “pulse synchronization” in distributed systems, carried over from the cardiac pacemaker
Using this mechanism we presented the first algorithm that stabilizes any Byzantine algorithm that conforms to certain very reasonable restrictions
The cost of the algorithm is relatively low, and thus we show that self-stabilization in distributed computer systems facing Byzantine faults does not carry a significant additional cost beyond the cost of tolerating Byzantine faults
It also implies that robustness does not necessarily mean high cost

Biological synchronization…

“The importance of being synchronized…”

Sometimes you miss…

“But everything will be fine again…”

Questions?

[Image: Drosophila melanogaster protein interaction network]