Analysis of the SPIDER Fault-Tolerance Protocols

Page 1: Analysis of the SPIDER  Fault-Tolerance Protocols

Langley Research Center

Analysis of the SPIDER Fault-Tolerance Protocols

Paul S. Miner
[email protected]

5th NASA Langley Formal Methods Workshop
Williamsburg, VA

June 14, 2000

Page 2: Analysis of the SPIDER  Fault-Tolerance Protocols

What is SPIDER?

• A general-purpose fault-tolerant architecture
  – Scalable Processor-Independent Design for Electromagnetic Resilience

• Intended to serve as a platform to explore recovery strategies for HIRF/EMI-induced faults

• Developed as part of an FAA-funded case study to exercise RTCA DO-254: Design Assurance Guidance for Airborne Electronic Hardware

Page 3: Analysis of the SPIDER  Fault-Tolerance Protocols

RTCA DO-254/EUROCAE ED-80

• Developed by RTCA Special Committee 180 and EUROCAE Working Group 46

• Approved by RTCA Program Management Committee in April 2000

• Approved by EUROCAE (?)

• FAA Advisory Circular (?)
  – Earliest would be sometime this fall

Page 4: Analysis of the SPIDER  Fault-Tolerance Protocols

Formal Methods in DO-254

• Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions

• DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)

Page 5: Analysis of the SPIDER  Fault-Tolerance Protocols

DO-254 Case-Study Participants

• NASA LaRC (Design Team)
  – Paul Miner, Project Lead
  – Mahyar Malekpour, Design Engineer
  – Wilfredo Torres-Pomales, Design Engineer
  – Victor Carreño, Process Assurance

• FAA (Sponsor and Certification Liaison)
  – Leanna Rierson, Pete Saraceni, Dennis Wallace, Connie Beane, and Will Struck

Page 6: Analysis of the SPIDER  Fault-Tolerance Protocols

Strategy for DO-254 Case-Study

• Fault-tolerance protocols and reliability models use the same fault classifications

• Reliability analysis using SURE (Butler)
  – Calculates P(enough good hardware)

• Formal proof of fault-tolerance protocols using PVS (SRI)
  – enough good hardware => correct operation

Page 7: Analysis of the SPIDER  Fault-Tolerance Protocols

SPIDER Design Concept

• Inspired by several earlier designs
  – Main concept inspired by Palumbo’s Fault-tolerant processing system (U.S. Patent 5,533,188)
    • Developed as part of Fly-By-Light/Power-By-Wire project
  – Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; . . .

Page 8: Analysis of the SPIDER  Fault-Tolerance Protocols

SPIDER Architecture

• N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS)

• A ROBUS is an ultra-reliable unit providing basic fault-tolerant services

• A ROBUS is implemented as a special purpose fault-tolerant device

Page 9: Analysis of the SPIDER  Fault-Tolerance Protocols

SPIDER Architecture

[Figure: processing nodes 0 to 7 connected to the ROBUS]

Page 10: Analysis of the SPIDER  Fault-Tolerance Protocols

Logical View of ROBUS

• ROBUS operates as a time-division multiplexed access broadcast bus

• ROBUS strictly enforces write access – no babbling idiots

• Processing nodes may be grouped to provide differing degrees of fault-tolerance
  – some voting available within ROBUS

Page 11: Analysis of the SPIDER  Fault-Tolerance Protocols

Logical view of ROBUS (Sample Configuration)

[Figure: sample configuration of processing nodes 0 to 7 attached to the ROBUS]

Page 12: Analysis of the SPIDER  Fault-Tolerance Protocols

ROBUS Characteristics

• Bus access schedule statically determined
  – similar to SAFEbus, TTA

• Some fault-tolerance functions provided by processing nodes
  – ROBUS will not have general purpose processing capabilities

• Processing Elements need not be uniform
  – support for dissimilar architectures

Page 13: Analysis of the SPIDER  Fault-Tolerance Protocols

ROBUS Requirements

• All fault-free nodes observe the exact same sequence of messages

• ROBUS provides a reliable time source (RTS)
  – The nodes are synchronized relative to this RTS

• ROBUS provides correct and consistent system diagnostic information to all fault-free nodes

• For 10 hour mission, P(ROBUS Failure) < $10^{-10}$

Page 14: Analysis of the SPIDER  Fault-Tolerance Protocols

Interactive Consistency (Byzantine Agreement)

Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message

Validity: If the originator of the message is non-faulty, good receivers will receive the message sent

Page 15: Analysis of the SPIDER  Fault-Tolerance Protocols

Clock Synchronization

Precision: There is a small positive constant $d_{max}$ such that for any two clocks that are good at t,

$|C_1(t) - C_2(t)| \le d_{max}$

Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)

Page 16: Analysis of the SPIDER  Fault-Tolerance Protocols

Diagnosis

Correctness: Every node diagnosed as faulty by a good node is faulty
  – A good node can never conclude that another good node is faulty

Completeness: Every faulty node is (eventually) diagnosed as being faulty
  – This is not always possible (pathological case involves asymmetric fault)

Page 17: Analysis of the SPIDER  Fault-Tolerance Protocols

Physical Segregation

• ROBUS decomposed into physically isolated Fault Containment Regions (FCR)
  – Two main design elements
    • Bus Interface Unit (BIU)
    • Redundancy Management Unit (RMU)
  – Processing elements may form separate FCRs

• FCRs fail independently
  – This is necessary to achieve reliability goals

Page 18: Analysis of the SPIDER  Fault-Tolerance Protocols

ROBUS Topology

[Figure: ROBUS(N,M) topology, with processing elements PE 1 to PE N each attached to its own BIU (BIU 1 to BIU N), and RMUs RMU 1 to RMU M]

Page 19: Analysis of the SPIDER  Fault-Tolerance Protocols

Hybrid Fault Assumptions

• The failure status of an FCR is subdivided into four mutually exclusive cases
  – Good (or fault-free)
  – Benign Faulty (Known bad by all good)
  – Symmetric Faulty (Same to all good)
  – Asymmetric Faulty (Byzantine, Malicious)

• This is a global classification; individual FCRs do not know the failure status of other FCRs

Page 20: Analysis of the SPIDER  Fault-Tolerance Protocols

Fault Classification

• Partition the RMUs into disjoint subsets based upon fault classification
  – GR, BR, SR, and AR for good, benign, symmetric, and asymmetric RMUs respectively

• Similarly partition the BIUs
  – GB, BB, SB, and AB
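
The partition can be sketched in a few lines of Python. This is only an illustration: the `Fault` enum, the `partition` helper and the sample unit names are hypothetical and do not come from the SPIDER design. As the hybrid fault slide notes, the classification is global, so no real FCR could compute it; the sketch takes the view of an outside observer.

```python
from enum import Enum

class Fault(Enum):
    # The four mutually exclusive cases of the hybrid fault model.
    GOOD = "good"
    BENIGN = "benign"
    SYMMETRIC = "symmetric"
    ASYMMETRIC = "asymmetric"

def partition(status):
    """Split a {unit_name: Fault} map into four disjoint sets, one per class."""
    sets = {fault: set() for fault in Fault}
    for unit, fault in status.items():
        sets[fault].add(unit)
    return sets

# Hypothetical status of three RMUs, as seen by an omniscient observer.
rmu_status = {"RMU1": Fault.GOOD, "RMU2": Fault.GOOD, "RMU3": Fault.ASYMMETRIC}
rmus = partition(rmu_status)
# Enum members iterate in definition order: good, benign, symmetric, asymmetric.
GR, BR, SR, AR = (rmus[fault] for fault in Fault)
```

The same helper applied to a BIU status map would yield GB, BB, SB, and AB.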

Page 21: Analysis of the SPIDER  Fault-Tolerance Protocols

Tolerating Asymmetric Faults

• Requires 3f + 1 participants in protocol to withstand f simultaneous asymmetric faults

• Requires 2f + 1 disjoint communication paths between any two participants

• Requires f + 1 levels of communication

• ROBUS Topology satisfies these conditions for $N \ge 3$, $M \ge 3$, $f = 1$

• For target reliability, we must tolerate at least 1 asymmetric fault
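
The three resource bounds above are simple arithmetic in f. The sketch below tabulates them; the function name and dictionary keys are illustrative choices, not terminology from the slides.

```python
def byzantine_requirements(f):
    """Minimum resources needed to withstand f simultaneous asymmetric faults."""
    return {
        "participants": 3 * f + 1,
        "disjoint_paths": 2 * f + 1,
        "communication_levels": f + 1,
    }

single_fault = byzantine_requirements(1)
print(single_fault)
```

For f = 1 this gives 4 participants, 3 disjoint paths and 2 levels of communication.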

Page 22: Analysis of the SPIDER  Fault-Tolerance Protocols

Interactive Consistency

• SPIDER IC protocol is simple adaptation of IC algorithm for Draper FTP Architecture
  – Existing PVS proof due to Lincoln and Rushby, COMPASS’94, pages 107-120

• Protocol generalizes one suggested in Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978

Page 23: Analysis of the SPIDER  Fault-Tolerance Protocols

SPIDER IC Protocol

Algorithm OMS (ignoring hybrid fault model):

1. Processing element j sends value v to BIU j
2. BIU j broadcasts v to all RMUs
3. All RMUs broadcast value received from BIU j to all BIUs
4. Each BIU votes on the values received from the RMUs to determine value from j
5. Each BIU forwards the voted value to its PE
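
The five steps can be exercised with a small simulation. This is a minimal sketch, not the verified protocol: it assumes a fault-free source BIU, a strict-majority vote, and a hypothetical fault model in which a faulty RMU is a function from (value, destination BIU) to the value it relays. The names `oms`, `majority` and `liar` are illustrative.

```python
from collections import Counter

def majority(values):
    """Return the strict-majority value, or None if no value has one."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None

def oms(v, n_bius, n_rmus, faulty_rmus=None):
    """One run of the five steps for a fault-free source BIU.

    faulty_rmus maps an RMU index to a function (value, biu) -> value that
    models what that faulty RMU relays to each BIU (an assumed fault model).
    """
    faulty_rmus = faulty_rmus or {}
    # Steps 1-2: PE j hands v to BIU j, which broadcasts v to every RMU.
    received_by_rmu = [v] * n_rmus
    # Step 3: every RMU relays what it received to every BIU.
    relayed = [
        [faulty_rmus[r](received_by_rmu[r], b) if r in faulty_rmus
         else received_by_rmu[r]
         for r in range(n_rmus)]
        for b in range(n_bius)
    ]
    # Steps 4-5: each BIU votes and forwards the result to its PE.
    return [majority(row) for row in relayed]

# RMU 2 is asymmetric: it relays a different wrong value to every BIU.
liar = {2: lambda value, biu: value + 100 + biu}
print(oms(42, n_bius=3, n_rmus=3, faulty_rmus=liar))
```

With two good RMUs outvoting the single asymmetric one, every BIU forwards 42 to its PE, so both agreement and validity hold in this run.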

Page 24: Analysis of the SPIDER  Fault-Tolerance Protocols

Adapting for hybrid faults

• Simple modification to steps 3 and 4 to enable special handling of manifestly bad messages
  – from benign faulty or asymmetrically faulty sources

• OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER

• Verified in PVS, using simple modifications to Lincoln and Rushby’s proof of the Draper FTP Interactive Consistency Protocol

Page 25: Analysis of the SPIDER  Fault-Tolerance Protocols

Interactive Consistency Results

Agreement: For BIU g, if (|AR| = 0) or ($g \notin AB$ and |GR| > |SR| + |AR|), then for $p, q \in GB$: OMHS(g,v,p) = OMHS(g,v,q)

Validity: If |GR| > |SR| + |AR|, then for $p \in GB$:
  – If $g \in GB$, then OMHS(g,v,p) = v
  – If $g \in BB$, then OMHS(g,v,p) = Error
  – If $g \in SB$, then OMHS(g,v,p) = sent(g,v)
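
The slides do not define the hybrid majority vote itself, so the following is only a sketch under stated assumptions: a relay that is itself manifestly bad (for example from a benign faulty RMU) is modelled as `BAD` and dropped before voting, while a well-formed relay reporting that the source's message was bad is modelled as the ordinary value `ERROR`. Returning `ERROR` when no strict majority exists is also an assumption.

```python
from collections import Counter

BAD = None        # a relay that is itself manifestly bad (assumed encoding)
ERROR = "Error"   # a well-formed relay reporting a manifestly bad source message

def hybrid_majority(relays):
    """Strict majority over the relays that are not manifestly bad."""
    usable = [x for x in relays if x is not BAD]
    if not usable:
        return ERROR
    value, count = Counter(usable).most_common(1)[0]
    return value if 2 * count > len(usable) else ERROR

# Good source sent 5: two good RMUs, one symmetric faulty RMU relaying 9,
# one benign faulty RMU. Here |GR| = 2 > |SR| + |AR| = 1.
good_source = hybrid_majority([5, 5, 9, BAD])

# Benign faulty source: the good RMUs report Error, the symmetric faulty
# RMU still relays 9, and the benign faulty RMU is ignored.
benign_source = hybrid_majority([ERROR, ERROR, 9, BAD])
```

Under these assumptions the first vote yields 5 and the second yields Error, matching the first two validity cases above.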

Page 26: Analysis of the SPIDER  Fault-Tolerance Protocols

Alternative Verification Options

• For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation

• Amount of effort needed to verify using theorem prover is negligible

• PVS proof is mostly symbolic evaluation
  – there is a small amount of deductive reasoning to evaluate the abstract specification of hybrid majority

Page 27: Analysis of the SPIDER  Fault-Tolerance Protocols

Clock Synchronization Goal

• To match the degree of Fault-Tolerance of the IC protocol, the synchronization protocol should ensure synchronization using the following fault assumptions:
  – |GR| > |SR| + |AR|
  – Not (|AR| > 0 & |AB| > 0)

Page 28: Analysis of the SPIDER  Fault-Tolerance Protocols

Clock Synchronization

• Need added assumption: |GB| > |SB| + |AB|
  – With this assumption, a modified form of the Davies and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs
  – Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)
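
The fault assumptions for interactive consistency and the extra assumption for synchronization can be written as two predicates over the fault-class sets. The function names are illustrative, and the sets are assumed to be supplied by an outside observer, since no FCR knows the global classification.

```python
def ic_assumptions_hold(GR, SR, AR, AB):
    """|GR| > |SR| + |AR| and not (|AR| > 0 and |AB| > 0)."""
    return len(GR) > len(SR) + len(AR) and not (len(AR) > 0 and len(AB) > 0)

def sync_assumptions_hold(GR, SR, AR, GB, SB, AB):
    """The IC assumptions plus the added BIU condition |GB| > |SB| + |AB|."""
    return ic_assumptions_hold(GR, SR, AR, AB) and len(GB) > len(SB) + len(AB)

# Hypothetical scenario: one asymmetric RMU, all BIUs good.
ok = sync_assumptions_hold(
    GR={"r1", "r2"}, SR=set(), AR={"r3"},
    GB={"b1", "b2", "b3"}, SB=set(), AB=set(),
)

# Same RMUs, but one BIU is also asymmetric: the second condition fails.
not_ok = sync_assumptions_hold(
    GR={"r1", "r2"}, SR=set(), AR={"r3"},
    GB={"b1", "b2"}, SB=set(), AB={"b3"},
)
```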

Page 29: Analysis of the SPIDER  Fault-Tolerance Protocols

Clock Synchronization Basics

• Clocks (counters) driven by oscillator with a bounded drift ($\rho$) from its stated frequency

• Periodically (every P ticks) clocks will adjust the count based on exchange with other clocks

• The periods are indexed by round number (k)

• Protocol seeks to ensure that at beginning of every round, all good clocks are within $d_{min}$

Page 30: Analysis of the SPIDER  Fault-Tolerance Protocols

Synchronization Basics

• If all good clocks start round within $d_{min}$ of each other, by the end of the round they can be at most $d_{min} + 2\rho P$ apart

• If good clocks then make a small adjustment so that they start the next round within $d_{min}$, then both precision and accuracy are satisfied

• Several machine checked proofs exist
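
The end-of-round bound follows because each of two good clocks can drift by at most the drift rate times P over a round, in opposite directions. The sketch below evaluates it numerically. The symbol for the drift bound was lost in the transcript, so the parameter name `rho` is an assumption, and the sample numbers are made up for illustration.

```python
def worst_case_skew(d_min, rho, P):
    """Upper bound on the skew between two good clocks after a round of P ticks.

    d_min: skew bound at the start of the round (ticks)
    rho:   bound on oscillator drift rate (assumed symbol name)
    P:     nominal round duration (ticks)
    """
    return d_min + 2 * rho * P

# Hypothetical numbers: clocks start within 2 ticks, drift bound 1e-5,
# and a round lasts 100,000 ticks.
skew = worst_case_skew(d_min=2, rho=1e-5, P=100_000)
```

With these numbers the clocks can be about 4 ticks apart by the end of the round, so the adjustment must remove roughly 2 ticks of accumulated skew.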

Page 31: Analysis of the SPIDER  Fault-Tolerance Protocols

Network Imprecision

• There is imprecision in communication
  – If a node transmits a message at time t, observing nodes will receive it within time interval [t + d, t + d + e]
  – d is the minimum communication delay
  – e is the inherent imprecision ($e > (1 + \rho)$ ticks)

• e is a lower bound on synchronization precision

• For many Byzantine Resilient protocols, $d_{min} \ge 2e$

Page 32: Analysis of the SPIDER  Fault-Tolerance Protocols

Simple Protocol (for round k)

• RMU: (Perform the following concurrently)
  – If Ready?(k) then broadcast (round k) to all BIUs
  – If Accept?(k) then reset counter for round k

• BIU:
  – If Accept?(k) then broadcast (round k) to all RMUs

Page 33: Analysis of the SPIDER  Fault-Tolerance Protocols

Informal Description

• Each good RMU broadcasts when its clock reaches a specific value

• Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs

• Each RMU waits until it knows it has a message from at least one good BIU before resetting

Page 34: Analysis of the SPIDER  Fault-Tolerance Protocols

Ready and Accept

• Ready?(k) is an event triggered by a pre-determined local counter value, kP - a
  – a is a constant offset for communication delays
  – P is the nominal duration of a round
  – k is the round index

• Degree of fault-tolerance is determined by Accept?(k)

Page 35: Analysis of the SPIDER  Fault-Tolerance Protocols

Accept?(k)

• Wait until there is a hybrid majority of observed (round k) events to trigger Accept?(k)

• A selection function under the hybrid fault model ignores manifestly bad values

• This protocol ensures that all good RMUs accept (round k) within a short time interval, provided the maximum fault assumptions are not violated
  – Can also synchronize BIUs by echoing RMU accept
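
One way to picture the Accept?(k) trigger is as a threshold on the (round k) messages seen so far. The sketch below is an assumption-laden illustration, not the verified definition: it treats "hybrid majority" as a strict majority of the sources not currently known to be manifestly bad, and the names `accept`, `accept_time`, `arrivals` and `eligible` are hypothetical.

```python
def accept(round_msgs, eligible):
    """True once a strict majority of the eligible sources have sent (round k).

    round_msgs: set of source ids whose (round k) message has been observed
    eligible:   set of source ids not currently known to be manifestly bad
    """
    return 2 * len(round_msgs & eligible) > len(eligible)

def accept_time(arrivals, eligible):
    """Earliest arrival time at which accept() becomes true, or None."""
    seen = set()
    for t, src in sorted((t, s) for s, t in arrivals.items()):
        seen.add(src)
        if accept(seen, eligible):
            return t
    return None

# Hypothetical arrival times (local ticks) at one BIU. RMU3 is faulty and
# sends early; the BIU still waits for a second message before accepting.
eligible = {"RMU1", "RMU2", "RMU3"}
arrivals = {"RMU1": 100, "RMU2": 101, "RMU3": 97}
t_accept = accept_time(arrivals, eligible)
```

In this run the BIU accepts at tick 100, after the message from RMU1 arrives. The early message from the faulty RMU3 cannot trigger acceptance on its own, which is consistent with waiting for a message from at least one good RMU.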

Page 36: Analysis of the SPIDER  Fault-Tolerance Protocols

Verification in PVS

• Built generic clock synchronization theory in PVS
  – PVS theory from Ulm introduced too much potential error in formulation of clock drift assumptions
  – New theory allows proofs of clock skew as tight as best theoretical results
  – Support for some protocols absent
  – All pieces in place to complete SPIDER verification

• Estimate 1-2 weeks effort to tie up loose ends

Page 37: Analysis of the SPIDER  Fault-Tolerance Protocols

Alternative Verification Options

• Protocol is (almost) finite state
  – Should be possible to use a model checker to confirm that all good nodes start each round within $d_{min}$ of each other
  – Plausible tools for this are HyTech, UPPAAL
  – Still need theorem prover to get Precision and Accuracy results
  – Theorem prover can verify for arbitrary number of participants

Page 38: Analysis of the SPIDER  Fault-Tolerance Protocols

Diagnosis

• Plan to adapt MAFT on-line diagnosis algorithms to SPIDER architecture
  – MAFT algorithms previously verified using PVS
    • Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. on Software Engineering, Nov. 1997

• For diagnosis of Processing Elements, there exist verified Group Membership protocols
    • Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG’97), pages 155-169, LNCS 1320

Page 39: Analysis of the SPIDER  Fault-Tolerance Protocols

Summary

• New conceptual design for a family of fault-tolerant systems

• Design being developed using DO-254 guidance

• Critical fault-tolerance verified using PVS
  – Able to reuse or adapt many existing proofs of fault-tolerance protocols
  – Unable to reuse existing Clock synchronization proofs

Page 40: Analysis of the SPIDER  Fault-Tolerance Protocols

Future Plans

• Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer

• First laboratory prototype implementation of SPIDER by December

• Second generation design in 2001