analysis of the spider fault-tolerance protocols

Langley Research Center

Analysis of the SPIDER Fault-Tolerance Protocols

Paul S. [email protected]

5th NASA Langley Formal Methods WorkshopWilliamsburg, VA

June 14, 2000

June 14, 2000 Lfm2000 2


What is SPIDER?• A general purpose fault-tolerant architecture

– Scalable Processor-Independent Design for Electromagnetic Resilience

• Intended to serve as a platform to explore recovery strategies for HIRF/EMI induced faults

• Developed as part of an FAA funded case-study to exercise RTCA DO-254: Design Assurance Guidance for Airborne Electronic Hardware

June 14, 2000 Lfm2000 3


RTCA DO-254/EUROCAE ED-76

• Developed by RTCA Special Committee 180 and EUROCAE Working Group 46

• Approved by RTCA Program Management Committee in April 2000

• Approved by EUROCAE(?)• FAA Advisory Circular (?)

– Earliest would be sometime this fall

June 14, 2000 Lfm2000 4


Formal Methods in DO-254

• Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions

• DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)

June 14, 2000 Lfm2000 5


DO-254 Case-Study Participants

• NASA LaRC (Design Team)• Paul Miner, Project Lead• Mahyar Malekpour, Design Engineer• Wilfredo Torres-Pomales, Design Engineer• Victor Carreño, Process Assurance

• FAA (Sponsor and Certification Liaison)• Leanna Rierson, Pete Saraceni, Dennis Wallace, Connie

Beane, and Will Struck

June 14, 2000 Lfm2000 6


Strategy for DO-254 Case-Study

• Fault-tolerance protocols and reliability models use the same fault classifications

• Reliability analysis using SURE (Butler)– Calculates P(enough good hardware)

• Formal proof of fault-tolerance protocols using PVS (SRI) enough good hardware => correct operation

June 14, 2000 Lfm2000 7


SPIDER Design Concept

• Inspired by several earlier designs– Main concept inspired by Palumbo’s Fault-tolerant

processing system (U.S. Patent 5,533,188)• Developed as part of Fly-By-Light/Power-By-Wire project

– Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; . . .

June 14, 2000 Lfm2000 8


SPIDER Architecture

• N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS)

• A ROBUS is an ultra-reliable unit providing basic fault-tolerant services

• A ROBUS is implemented as a special purpose fault-tolerant device

June 14, 2000 Lfm2000 9


SPIDER Architecture

34

01

2

7

6

5

ROBUS

June 14, 2000 Lfm2000 10


Logical View of ROBUS

• ROBUS operates as a time-division multiplexed access broadcast bus

• ROBUS strictly enforces write access – no babbling idiots

• Processing nodes may be grouped to provide differing degrees of fault-tolerance– some voting available within ROBUS

June 14, 2000 Lfm2000 11


Logical view of ROBUS(Sample Configuration)

ROBUS

0 4 21 3 56 7

June 14, 2000 Lfm2000 12


ROBUS Characteristics• Bus access schedule statically determined

– similar to SAFEbus, TTA• Some fault-tolerance functions provided by

processing nodes– ROBUS will not have general purpose processing

capabilities• Processing Elements need not be uniform

– support for dissimilar architectures

June 14, 2000 Lfm2000 13


ROBUS Requirements

• All fault-free nodes observe the exact same sequence of messages

• ROBUS provides a reliable time source (RTS)– The nodes are synchronized relative to this RTS

• ROBUS provides correct and consistent system diagnostic information to all fault-free nodes

• For 10 hour mission, P(ROBUS Failure) < 10-10

June 14, 2000 Lfm2000 14


Interactive Consistency(Byzantine Agreement)

Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message

Validity: If the originator of the message is non-faulty, good receivers will receive the message sent

June 14, 2000 Lfm2000 15


Clock Synchronization

Precision: There is a small positive constant dmax such that for any two clocks that are good at t,

|C1(t) - C2(t)| dmax

Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)

June 14, 2000 Lfm2000 16


Diagnosis

Correctness: Every node diagnosed as faulty by a good node is faulty– A good node can never conclude that another good

node is faulty

Completeness: Every faulty node is (eventually) diagnosed as being faulty– This is not always possible (pathological case

involves asymmetric fault)

June 14, 2000 Lfm2000 17


Physical Segregation• ROBUS decomposed into physically isolated

Fault Containment Regions (FCR)– Two main design elements

• Bus Interface Unit (BIU)• Redundancy Management Unit (RMU)

– Processing elements may form separate FCRs• FCRs fail independently

– This is necessary to achieve reliability goals

June 14, 2000 Lfm2000 18


ROBUS Topology

PE 1

PE 2

PE 3

PE N BIU N

BIU 3

BIU 2

BIU 1

ROBUSN,M

RMU M

RMU 2

RMU 1

June 14, 2000 Lfm2000 19


Hybrid Fault Assumptions• The failure status of an FCR is subdivided into

four mutually exclusive cases– Good (or fault-free)– Benign Faulty (Known bad by all good)– Symmetric Faulty (Same to all good)– Asymmetric Faulty (Byzantine, Malicious)

• This is a global classification, individual FCRs do not know the failure status of other FCRs

June 14, 2000 Lfm2000 20


Fault Classification

• Partition the RMUs into disjoint subsets based upon fault classification– GR, BR, SR, and AR for good, benign, symmetric,

and asymmetric RMUs respectively• Similarly partition the BIUs

– GB, BB, SB, and AB

June 14, 2000 Lfm2000 21


Tolerating Asymmetric Faults• Requires 3f + 1 participants in protocol to

withstand f simultaneous asymmetric faults• Requires 2f + 1 disjoint communication paths

between any two participants• Requires f + 1 levels of communication• ROBUS Topology satisfies these conditions for

N 3, M 3, f = 1• For target reliability, we must tolerate at least 1

asymmetric fault

June 14, 2000 Lfm2000 22


Interactive Consistency• SPIDER IC protocol is simple adaptation of IC

algorithm for Draper FTP Architecture– Existing PVS proof due to Lincoln and Rushby,

COMPASS’94, pages 107-120 • Protocol generalizes one suggested in

Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978

June 14, 2000 Lfm2000 23


SPIDER IC ProtocolAlgorithm OMS (ignoring hybrid fault model):

1. Processing element j sends value v to BIU j2. BIU j broadcasts v to all RMUs3. All RMUs broadcast value received from BIU j to all

BIUs4. Each BIU votes on the values received from the

RMUs to determine value from j5. Each BIU forwards the voted value to its PE

June 14, 2000 Lfm2000 24


Adapting for hybrid faults• Simple modification to steps 3 and 4 to enable

special handling of manifestly bad messages• from benign faulty or asymmetrically faulty sources

• OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER

• Verified in PVS, using simple modifications to Lincoln and Rushby’s proof of the Draper FTP Interactive Consistency Protocol

June 14, 2000 Lfm2000 25


Interactive Consistency ResultsAgreement: For BIU g, if (|AR| = 0) or (g AB

and |GR| > |SR| + |AR|), then for p,q GB: OMHS(g,v,p) = OMHS(g,v,q)

Validity: If |GR| > |SR| + |AR|, then for p GB :– If g GB, then OMHS(g,v,p) = v

– If g BB, then OMHS(g,v,p) = Error

– If g SB, then OMHS(g,v,p) = sent(g,v)

June 14, 2000 Lfm2000 26


Alternative Verification Options

• For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation

• Amount of effort needed to verify using theorem prover is negligible

• PVS proof is mostly symbolic evaluation– there is a small amount of deductive reasoning to

evaluate the abstract specification of hybrid majority

June 14, 2000 Lfm2000 27


Clock Synchronization Goal• To match the degree of Fault-Tolerance of the IC

protocol, the synchronization protocol should ensure synchronization using the following fault assumptions:• |GR| > |SR| + |AR|• Not (|AR| > 0 & |AB| > 0)

June 14, 2000 Lfm2000 28


Clock Synchronization• Need added assumption:

|GB| > |SB| + |AB|– With this assumption, a modified form of the Davies

and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs

– Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)

June 14, 2000 Lfm2000 29


Clock Synchronization Basics• Clocks (counters) driven by oscillator with a

bounded drift () from its stated frequency• Periodically (every P ticks) clocks will adjust the

count based on exchange with other clocks• The periods are indexed by round number (k)• Protocol seeks to ensure that at beginning of

every round, all good clocks are within dmin

June 14, 2000 Lfm2000 30


Synchronization Basics• If all good clocks start round within dmin of each

other, by the end of the round they can be at most dmin + 2 P apart

• If good clocks then make a small adjustment so that they start the next round within dmin, then both precision and accuracy are satisfied

• Several machine checked proofs exist

June 14, 2000 Lfm2000 31


Network Imprecision• There is imprecision in communication

– If a node transmits a message at time t, observing nodes will receive it within time interval

[t + d, t + d + e]– d is the minimum communication delay– e is the inherent imprecision (e > (1 + ) ticks)

• e is a lower bound on synchronization precision

• For many Byzantine Resilient protocols, dmin 2e

June 14, 2000 Lfm2000 32


Simple Protocol (for round k)

• RMU: (Perform the following concurrently)– If Ready?(k) then broadcast (round k) to all BIUs– If Accept?(k) then reset counter for round k

• BIU:– If Accept?(k) then broadcast (round k) to all RMUs

June 14, 2000 Lfm2000 33


Informal Description

• Each good RMU broadcasts when its clock reaches a specific value

• Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs

• Each RMU waits until it knows it has a message from at least one good BIU before resetting

June 14, 2000 Lfm2000 34


Ready and Accept• Ready?(k) is an event triggered by a pre-

determined local counter value, kP - a, a is a constant offset for communication delays

P is the nominal duration of a round

k is the round index

• Degree of fault-tolerance is determined by Accept?(k)

June 14, 2000 Lfm2000 35


Accept?(k)• Wait until there is a hybrid majority of observed

(round k) events to trigger Accept?(k)• A selection function under the hybrid fault model

ignores manifestly bad values • This protocol ensures that all good RMUs accept

(round k) within a short time interval, provided the maximum fault assumptions are not violated– Can also synchronize BIUs by echoing RMU accept

June 14, 2000 Lfm2000 36


Verification in PVS• Built generic clock synchronization theory in

PVS– PVS theory from Ulm introduced too much potential

error in formulation of clock drift assumptions– New theory allows proofs of clock skew as tight as

best theoretical results– Support for some protocols absent– All pieces in place to complete SPIDER verification

• Estimate 1-2 weeks effort to tie up loose ends

June 14, 2000 Lfm2000 37


Alternative Verification Options• Protocol is (almost) finite state

– Should be possible to use a model checker to confirm that all good nodes start each round within dmin of each other

– Plausible tools for this are HyTech, UPPAAL– Still need theorem prover to get Precision and

Accuracy results– Theorem prover can verify for arbitrary number of

participants

June 14, 2000 Lfm2000 38


Diagnosis• Plan to adapt MAFT on-line diagnosis algorithms

to SPIDER architecture– MAFT algorithms previously verified using PVS

• Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. On Software Engineering, Nov. 1997

• For diagnosis of Processing Elements, there exist verified Group Membership protocols

• Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG’97), pages 155-169, LNCS 1320

June 14, 2000 Lfm2000 39


Summary• New conceptual design for a family of fault-

tolerant systems• Design being developed using DO-254 guidance• Critical fault-tolerance verified using PVS

– Able to reuse or adapt many existing proofs of fault-tolerance protocols

– Unable to reuse existing Clock synchronization proofs

June 14, 2000 Lfm2000 40


Future Plans

• Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer

• First laboratory prototype implementation of SPIDER by December

• Second generation design in 2001

analysis of the spider fault-tolerance protocols

Documents