analysis of the spider fault-tolerance protocols
DESCRIPTION
Analysis of the SPIDER Fault-Tolerance Protocols. Paul S. Miner [email protected] 5th NASA Langley Formal Methods Workshop Williamsburg, VA June 14, 2000. What is SPIDER?. A general purpose fault-tolerant architecture - PowerPoint PPT PresentationTRANSCRIPT
Langley Research Center
Analysis of the SPIDER Fault-Tolerance Protocols
Paul S. [email protected]
5th NASA Langley Formal Methods WorkshopWilliamsburg, VA
June 14, 2000
June 14, 2000 Lfm2000 2
Langley Research Center
What is SPIDER?• A general purpose fault-tolerant architecture
– Scalable Processor-Independent Design for Electromagnetic Resilience
• Intended to serve as a platform to explore recovery strategies for HIRF/EMI induced faults
• Developed as part of an FAA funded case-study to exercise RTCA DO-254: Design Assurance Guidance for Airborne Electronic Hardware
June 14, 2000 Lfm2000 3
Langley Research Center
RTCA DO-254/EUROCAE ED-76
• Developed by RTCA Special Committee 180 and EUROCAE Working Group 46
• Approved by RTCA Program Management Committee in April 2000
• Approved by EUROCAE(?)• FAA Advisory Circular (?)
– Earliest would be sometime this fall
June 14, 2000 Lfm2000 4
Langley Research Center
Formal Methods in DO-254
• Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions
• DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)
June 14, 2000 Lfm2000 5
Langley Research Center
DO-254 Case-Study Participants
• NASA LaRC (Design Team)• Paul Miner, Project Lead• Mahyar Malekpour, Design Engineer• Wilfredo Torres-Pomales, Design Engineer• Victor Carreño, Process Assurance
• FAA (Sponsor and Certification Liaison)• Leanna Rierson, Pete Saraceni, Dennis Wallace, Connie
Beane, and Will Struck
June 14, 2000 Lfm2000 6
Langley Research Center
Strategy for DO-254 Case-Study
• Fault-tolerance protocols and reliability models use the same fault classifications
• Reliability analysis using SURE (Butler)– Calculates P(enough good hardware)
• Formal proof of fault-tolerance protocols using PVS (SRI) enough good hardware => correct operation
June 14, 2000 Lfm2000 7
Langley Research Center
SPIDER Design Concept
• Inspired by several earlier designs– Main concept inspired by Palumbo’s Fault-tolerant
processing system (U.S. Patent 5,533,188)• Developed as part of Fly-By-Light/Power-By-Wire project
– Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; . . .
June 14, 2000 Lfm2000 8
Langley Research Center
SPIDER Architecture
• N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS)
• A ROBUS is an ultra-reliable unit providing basic fault-tolerant services
• A ROBUS is implemented as a special purpose fault-tolerant device
June 14, 2000 Lfm2000 9
Langley Research Center
SPIDER Architecture
34
01
2
7
6
5
ROBUS
June 14, 2000 Lfm2000 10
Langley Research Center
Logical View of ROBUS
• ROBUS operates as a time-division multiplexed access broadcast bus
• ROBUS strictly enforces write access – no babbling idiots
• Processing nodes may be grouped to provide differing degrees of fault-tolerance– some voting available within ROBUS
June 14, 2000 Lfm2000 11
Langley Research Center
Logical view of ROBUS(Sample Configuration)
ROBUS
0 4 21 3 56 7
June 14, 2000 Lfm2000 12
Langley Research Center
ROBUS Characteristics• Bus access schedule statically determined
– similar to SAFEbus, TTA• Some fault-tolerance functions provided by
processing nodes– ROBUS will not have general purpose processing
capabilities• Processing Elements need not be uniform
– support for dissimilar architectures
June 14, 2000 Lfm2000 13
Langley Research Center
ROBUS Requirements
• All fault-free nodes observe the exact same sequence of messages
• ROBUS provides a reliable time source (RTS)– The nodes are synchronized relative to this RTS
• ROBUS provides correct and consistent system diagnostic information to all fault-free nodes
• For 10 hour mission, P(ROBUS Failure) < 10-10
June 14, 2000 Lfm2000 14
Langley Research Center
Interactive Consistency(Byzantine Agreement)
Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message
Validity: If the originator of the message is non-faulty, good receivers will receive the message sent
June 14, 2000 Lfm2000 15
Langley Research Center
Clock Synchronization
Precision: There is a small positive constant dmax such that for any two clocks that are good at t,
|C1(t) - C2(t)| dmax
Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)
June 14, 2000 Lfm2000 16
Langley Research Center
Diagnosis
Correctness: Every node diagnosed as faulty by a good node is faulty– A good node can never conclude that another good
node is faulty
Completeness: Every faulty node is (eventually) diagnosed as being faulty– This is not always possible (pathological case
involves asymmetric fault)
June 14, 2000 Lfm2000 17
Langley Research Center
Physical Segregation• ROBUS decomposed into physically isolated
Fault Containment Regions (FCR)– Two main design elements
• Bus Interface Unit (BIU)• Redundancy Management Unit (RMU)
– Processing elements may form separate FCRs• FCRs fail independently
– This is necessary to achieve reliability goals
June 14, 2000 Lfm2000 18
Langley Research Center
ROBUS Topology
PE 1
PE 2
PE 3
PE N BIU N
BIU 3
BIU 2
BIU 1
ROBUSN,M
RMU M
RMU 2
RMU 1
June 14, 2000 Lfm2000 19
Langley Research Center
Hybrid Fault Assumptions• The failure status of an FCR is subdivided into
four mutually exclusive cases– Good (or fault-free)– Benign Faulty (Known bad by all good)– Symmetric Faulty (Same to all good)– Asymmetric Faulty (Byzantine, Malicious)
• This is a global classification, individual FCRs do not know the failure status of other FCRs
June 14, 2000 Lfm2000 20
Langley Research Center
Fault Classification
• Partition the RMUs into disjoint subsets based upon fault classification– GR, BR, SR, and AR for good, benign, symmetric,
and asymmetric RMUs respectively• Similarly partition the BIUs
– GB, BB, SB, and AB
June 14, 2000 Lfm2000 21
Langley Research Center
Tolerating Asymmetric Faults• Requires 3f + 1 participants in protocol to
withstand f simultaneous asymmetric faults• Requires 2f + 1 disjoint communication paths
between any two participants• Requires f + 1 levels of communication• ROBUS Topology satisfies these conditions for
N 3, M 3, f = 1• For target reliability, we must tolerate at least 1
asymmetric fault
June 14, 2000 Lfm2000 22
Langley Research Center
Interactive Consistency• SPIDER IC protocol is simple adaptation of IC
algorithm for Draper FTP Architecture– Existing PVS proof due to Lincoln and Rushby,
COMPASS’94, pages 107-120 • Protocol generalizes one suggested in
Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978
June 14, 2000 Lfm2000 23
Langley Research Center
SPIDER IC ProtocolAlgorithm OMS (ignoring hybrid fault model):
1. Processing element j sends value v to BIU j2. BIU j broadcasts v to all RMUs3. All RMUs broadcast value received from BIU j to all
BIUs4. Each BIU votes on the values received from the
RMUs to determine value from j5. Each BIU forwards the voted value to its PE
June 14, 2000 Lfm2000 24
Langley Research Center
Adapting for hybrid faults• Simple modification to steps 3 and 4 to enable
special handling of manifestly bad messages• from benign faulty or asymmetrically faulty sources
• OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER
• Verified in PVS, using simple modifications to Lincoln and Rushby’s proof of the Draper FTP Interactive Consistency Protocol
June 14, 2000 Lfm2000 25
Langley Research Center
Interactive Consistency ResultsAgreement: For BIU g, if (|AR| = 0) or (g AB
and |GR| > |SR| + |AR|), then for p,q GB: OMHS(g,v,p) = OMHS(g,v,q)
Validity: If |GR| > |SR| + |AR|, then for p GB :– If g GB, then OMHS(g,v,p) = v
– If g BB, then OMHS(g,v,p) = Error
– If g SB, then OMHS(g,v,p) = sent(g,v)
June 14, 2000 Lfm2000 26
Langley Research Center
Alternative Verification Options
• For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation
• Amount of effort needed to verify using theorem prover is negligible
• PVS proof is mostly symbolic evaluation– there is a small amount of deductive reasoning to
evaluate the abstract specification of hybrid majority
June 14, 2000 Lfm2000 27
Langley Research Center
Clock Synchronization Goal• To match the degree of Fault-Tolerance of the IC
protocol, the synchronization protocol should ensure synchronization using the following fault assumptions:• |GR| > |SR| + |AR|• Not (|AR| > 0 & |AB| > 0)
June 14, 2000 Lfm2000 28
Langley Research Center
Clock Synchronization• Need added assumption:
|GB| > |SB| + |AB|– With this assumption, a modified form of the Davies
and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs
– Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)
June 14, 2000 Lfm2000 29
Langley Research Center
Clock Synchronization Basics• Clocks (counters) driven by oscillator with a
bounded drift () from its stated frequency• Periodically (every P ticks) clocks will adjust the
count based on exchange with other clocks• The periods are indexed by round number (k)• Protocol seeks to ensure that at beginning of
every round, all good clocks are within dmin
June 14, 2000 Lfm2000 30
Langley Research Center
Synchronization Basics• If all good clocks start round within dmin of each
other, by the end of the round they can be at most dmin + 2 P apart
• If good clocks then make a small adjustment so that they start the next round within dmin, then both precision and accuracy are satisfied
• Several machine checked proofs exist
June 14, 2000 Lfm2000 31
Langley Research Center
Network Imprecision• There is imprecision in communication
– If a node transmits a message at time t, observing nodes will receive it within time interval
[t + d, t + d + e]– d is the minimum communication delay– e is the inherent imprecision (e > (1 + ) ticks)
• e is a lower bound on synchronization precision
• For many Byzantine Resilient protocols, dmin 2e
June 14, 2000 Lfm2000 32
Langley Research Center
Simple Protocol (for round k)
• RMU: (Perform the following concurrently)– If Ready?(k) then broadcast (round k) to all BIUs– If Accept?(k) then reset counter for round k
• BIU:– If Accept?(k) then broadcast (round k) to all RMUs
June 14, 2000 Lfm2000 33
Langley Research Center
Informal Description
• Each good RMU broadcasts when its clock reaches a specific value
• Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs
• Each RMU waits until it knows it has a message from at least one good BIU before resetting
June 14, 2000 Lfm2000 34
Langley Research Center
Ready and Accept• Ready?(k) is an event triggered by a pre-
determined local counter value, kP - a, a is a constant offset for communication delays
P is the nominal duration of a round
k is the round index
• Degree of fault-tolerance is determined by Accept?(k)
June 14, 2000 Lfm2000 35
Langley Research Center
Accept?(k)• Wait until there is a hybrid majority of observed
(round k) events to trigger Accept?(k)• A selection function under the hybrid fault model
ignores manifestly bad values • This protocol ensures that all good RMUs accept
(round k) within a short time interval, provided the maximum fault assumptions are not violated– Can also synchronize BIUs by echoing RMU accept
June 14, 2000 Lfm2000 36
Langley Research Center
Verification in PVS• Built generic clock synchronization theory in
PVS– PVS theory from Ulm introduced too much potential
error in formulation of clock drift assumptions– New theory allows proofs of clock skew as tight as
best theoretical results– Support for some protocols absent– All pieces in place to complete SPIDER verification
• Estimate 1-2 weeks effort to tie up loose ends
June 14, 2000 Lfm2000 37
Langley Research Center
Alternative Verification Options• Protocol is (almost) finite state
– Should be possible to use a model checker to confirm that all good nodes start each round within dmin of each other
– Plausible tools for this are HyTech, UPPAAL– Still need theorem prover to get Precision and
Accuracy results– Theorem prover can verify for arbitrary number of
participants
June 14, 2000 Lfm2000 38
Langley Research Center
Diagnosis• Plan to adapt MAFT on-line diagnosis algorithms
to SPIDER architecture– MAFT algorithms previously verified using PVS
• Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. On Software Engineering, Nov. 1997
• For diagnosis of Processing Elements, there exist verified Group Membership protocols
• Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG’97), pages 155-169, LNCS 1320
June 14, 2000 Lfm2000 39
Langley Research Center
Summary• New conceptual design for a family of fault-
tolerant systems• Design being developed using DO-254 guidance• Critical fault-tolerance verified using PVS
– Able to reuse or adapt many existing proofs of fault-tolerance protocols
– Unable to reuse existing Clock synchronization proofs
June 14, 2000 Lfm2000 40
Langley Research Center
Future Plans
• Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer
• First laboratory prototype implementation of SPIDER by December
• Second generation design in 2001