dugan comp sys fta tutor
TRANSCRIPT
-
8/14/2019 Dugan Comp Sys Fta Tutor
1/83
slide(1)
RELIABILITY and MAINTAINABILITY Symposium
Fault Tree Analysisof Computer-Based Systems
Joanne Bec hta DuganProfessor of Elec tric a l & Computer Eng ineering
University of Virginia([email protected])
-
8/14/2019 Dugan Comp Sys Fta Tutor
2/83
slide(2)
RELIABILITY and MAINTAINABILITY Symposium
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
-
8/14/2019 Dugan Comp Sys Fta Tutor
3/83
slide(3)
RELIABILITY and MAINTAINABILITY Symposium
Introduction to fault tree analysis
Fault trees provide a good framework for both qualitative andquantitative analysis because they have both a logical (boolean
algebra) and probabilistic basis.
What is a fault tree?
not a tree (in the graph-theoretic sense)
a graphical representation of a logical function shows logical relationship between an event (failure) and its causes provides a logical framework for expressing combinations of compo-
nent failures that can lead to system failure
-
8/14/2019 Dugan Comp Sys Fta Tutor
4/83
slide(4)
RELIABILITY and MAINTAINABILITY Symposium
Why use fault tree analysis?
A fault tree model provides a logical framework for analyzing the failurebehavior of a system.
A fault tree model precisely documents which failure scenarios havebeen considered and which have not.
Fault tree analysis can be used to support engineering and
management decisions, trade-off analysis and risk assessment.
The fault tree model has a well-defined boolean algebraic andprobabilistic basis which relates probability calculations to boolean
logic functions.
-
8/14/2019 Dugan Comp Sys Fta Tutor
5/83
slide(5)
RELIABILITY and MAINTAINABILITY Symposium
Basic Static Fault Tree Constructs
Basic Events
Basic Event: corresponds to a basic failure
Characterized by failure rate or failure probability
event (usually a component failure) in the system.
k * name
Undeveloped Basic Event: A basic event that is notcompletely developed, usually because of unavailable
information.
Characterized by failure rate or failure probability
name
name
Replicated Basic Event; represents k statisticallyidentical copies of a component
Characterized by failure rate or failure probability
-
8/14/2019 Dugan Comp Sys Fta Tutor
6/83
slide(6)
RELIABILITY and MAINTAINABILITY Symposium
Static fault tree gates
m/n
AND gate - output event occurs
only if ALL input events occur
OR gate - output event occurs
if one or more input events occur
m/n gate - output event occurs ifm or more of the n inputs occur
-
8/14/2019 Dugan Comp Sys Fta Tutor
7/83
slide(7)
RELIABILITY and MAINTAINABILITY Symposium
Example Fault Tree
Washing Machine Overflows
fill mode too long
valve
stuck open
timeout
control
failed
full
sensorfailed
Structure Function:
Fail = valve_failed OR
F = A + BC
(timer_failed AND sensor_failed)
-
8/14/2019 Dugan Comp Sys Fta Tutor
8/83
slide(8)
RELIABILITY and MAINTAINABILITY Symposium
Probabilistic Fault Tree Analysis of Example
F = A + BC
Pr[F] = Pr[ A + BC]
= Pr[A] + Pr[BC] - Pr[ABC]
Suppose
Pr[A] = 0.01
Pr[B] = 0.05
Pr[C] = 0.075= 0.01 + 0.00375 - 0.0000375
= 0.0137125(all failures are independent)
-
8/14/2019 Dugan Comp Sys Fta Tutor
9/83
slide(9)
RELIABILITY and MAINTAINABILITY Symposium
Fault Tree Analysis - Cutsets
Most fault tree analysis techniques start with the generation of cutsets
A cutset is a set of basic events; if all the basic events in a cutset occur,then the top event (system failure) occurs.
A mincut (minimum cutset) is one that contains no redundant elements.If an element is removed from a mincut, it ceases to be a mincut.
Washing Machine Overflows
fill mode too long
valve
stuck open
timeout
control
failed
full
sensor
failed
Cutsets:
{valve} (single point of failure)
{timer, sensor}
-
8/14/2019 Dugan Comp Sys Fta Tutor
10/83
slide(10)
RELIABILITY and MAINTAINABILITY Symposium
Cutset generation by example
G2
G5
G3
G1
G4
A1 A2 A3 A4
A5
G3 {A4,A5}
{A1,A3}
{A2,A3}
{A2,A4}
{A1,A4}
G2 {G4,G5}
{A1,G5}
{A2,G5}
G1
-
8/14/2019 Dugan Comp Sys Fta Tutor
11/83
slide(11)
RELIABILITY and MAINTAINABILITY Symposium
Probabilistic Analysis using Cutsets
The probability of system failure is simply the probability that one ormore of the cutsets occur.
But the cutsets are not disjoint so we cannot sum their individual
probabilities. We must account for the overlap of the events.
Prob(failure) = Prob(valve failure) + Prob(both timer and sensor fail)- Prob(all three fail)
Washing Machine Overflows
fill mode too long
valve
stuck open
timeout
control
failed
full
sensor
failed
Cutsets:
{valve} (single point of failure)
{timer, sensor}
-
8/14/2019 Dugan Comp Sys Fta Tutor
12/83
slide(12)
RELIABILITY and MAINTAINABILITY Symposium
Probabilistic Analysis using Inclusion-Exclusion
Pr(A) + Pr(B) + Pr(C)- Pr(A and B) - Pr(B and C) - Pr(A and C)+ Pr(A and B and C)
ABCAB
BCAC
C
BA
-
8/14/2019 Dugan Comp Sys Fta Tutor
13/83
slide(13)
RELIABILITY and MAINTAINABILITY Symposium
Probabilistic Analysis using Sum-of-Disjoint-Products
Pr(A) + Pr(not A and B) + Pr (not A and not B and C)
A AB
_
ABC__
-
8/14/2019 Dugan Comp Sys Fta Tutor
14/83
slide(14)
RELIABILITY and MAINTAINABILITY Symposium
Probabilistic Ana lysis
using Binary Decision Diagrams
A1
B1
A2
0
A2
B2
0
0 1
0
AB
B1
1
A1 B1
A2
AB
B2
BDD representationFault tree model
-
8/14/2019 Dugan Comp Sys Fta Tutor
15/83
slide(15)
RELIABILITY and MAINTAINABILITY Symposium
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
-
8/14/2019 Dugan Comp Sys Fta Tutor
16/83
slide(16)
RELIABILITY and MAINTAINABILITY Symposium
An example control system including software
Consider a simple tank level and flow control system.
The key features of this system are:
a water tank, fed by a water pump on the inflow and regulated bycontrol and stop valves on the inflow and outflow pipes.
a tank and level control system with three sensors (level, inflow andoutflow) implemented in software
a tank bypass to prevent overflow, controlled by the three stop
valves
valve actuation and control implemented in software
-
8/14/2019 Dugan Comp Sys Fta Tutor
17/83
slide(17)
RELIABILITY and MAINTAINABILITY Symposium
Function of example system
The function is to maintain the water level and downstream flow rateat particular values by opening and closing control valves cv1 and cv2.
The controller receives inputs from the three sensors, implements thecontrol logic and then gives commands to the two control valves andthe two stop valves.
(This example is adapted from an example in: S. Guarro, M. Yau and M. Motamed,Development of tools for safety analysis of control software in advanced reactors,
NUREG/CR-6465, April 1996.)
-
8/14/2019 Dugan Comp Sys Fta Tutor
18/83
slide(18)
RELIABILITY and MAINTAINABILITY Symposium
Diagram of example system
ControlValvecv2
FlowSensor
FlowSensor
Normally
Closed
valve v2
Normally
Open
valve v1
Check
valveCheckvalve
Digital
Controller
PumpControlValvecv1
LevelSensor
Normally
Open
valve v3
-
8/14/2019 Dugan Comp Sys Fta Tutor
19/83
slide(19)
RELIABILITY and MAINTAINABILITY Symposium
Control Flow
close v1
open v2
open v3
cv1 = min
cv2 = max
open v1
close v2
open v3
calculate cv1
calculate cv2
open v1
close v2
close v3cv1 - max
cv2 = min
measurementflow measurementflowdownstreamupstream
level
level too high
level too low
levelwithin
bounds
compare level
with level
set points
Drain water
from tank
control valves
positions for
calculate
in tank
replenish water
flow set point
-
8/14/2019 Dugan Comp Sys Fta Tutor
20/83
slide(20)
RELIABILITY and MAINTAINABILITY Symposium
Software Operational modes
very
lowveryhigh
UnderflowOverflow
correct operation
non-critical failure behavior
critical failure behavior
low normal high
-
8/14/2019 Dugan Comp Sys Fta Tutor
21/83
slide(21)
RELIABILITY and MAINTAINABILITY Symposium
System Failure modes
The control system is not designed to be fault tolerant, so that mosthardware failures present either an overflow or underflow hazard.These are the two system-level failure modes being considered. Ifeither of the control valves or any of the three sensors fail, the systemfails, as the software will be unable to control the system. A tank leak,pipe leak or pump failure are also considered single points of failure.
Further, an underflow can occur if valve v1 or v2 fails, thus preventing
proper inflow, unless valve v3 can be closed. Therefore, if v3 and eitherv1 or v2 fail, the system fails. Overflow can occur if valve v3 fails, thuspreventing outflow, unless v1 can be closed. Thus the failure of both v1and v3 leads to system failure.
-
8/14/2019 Dugan Comp Sys Fta Tutor
22/83
slide(22)
RELIABILITY and MAINTAINABILITY Symposium
Software Failure Modes
Software failures have been characterized as an improper change ofstate. Software failures are further classified as critical and non-critical.
Non-critical software failures lead to less than optimal performance butdo not lead to system failure.
A software failure when the tank water level is very low or very highcan lead to system failure.
-
8/14/2019 Dugan Comp Sys Fta Tutor
23/83
slide(23)
RELIABILITY and MAINTAINABILITY Symposium
Fault Tree Model for Example System
SW-VH-Failure Control Valve 1
Control Valve 2
Stop Valve 3Stop Valve 1 Level sensor
inflow sensor
outflow sensor
pump failure
pipe leak
tank leak
Mechanical systemsControl valvesOverflow Underflow
SW-VL-Failure
Stop Valve 3Stop Valve 2 Stop Valve 3Stop Valve 1
Processor
-
8/14/2019 Dugan Comp Sys Fta Tutor
24/83
slide(24)
RELIABILITY and MAINTAINABILITY Symposium
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
-
8/14/2019 Dugan Comp Sys Fta Tutor
25/83
slide(25)
RELIABILITY and MAINTAINABILITY Symposium
Fault trees as a design aid for software systems
Fault tree analysis can help to insure that the software system does notdowhat it is notsupposed to do. (As contrasted with a formal designreview which helps insure that the software doeswhat it issupposed todo.)
For robust software systems, fault trees can help identify high-riskareas (either quantitatively or qualitatively).
Can manage risk by preventive or protective measures applied toidentified high-risk areas.
exhaustive testing formal methods
exception handling acceptance tests interlocks redesign
-
8/14/2019 Dugan Comp Sys Fta Tutor
26/83
slide(26)
RELIABILITY and MAINTAINABILITY Symposium
Using fault trees to manage risk
AND gates can be protected by disallowing one of the inputs[1]
exhaustive testing or formal proof to show module cannot fail test for failure condition and provide recovery routine
OR gate can be protected by disallowing allinputs or by providingdetection and recovery point. (The detection and recovery routinesmust be simple enough to be certifiably correct.)
[1] Herbert Hecht and Myron Hecht, Fault Tolerant Software. In D.K. Pradhan, editor, Fault-Tolerant Computing: Theory and Techniques, volume 2, pages 658-696. Prentice-Hall, 1986.
-
8/14/2019 Dugan Comp Sys Fta Tutor
27/83
slide(27)
RELIABILITY and MAINTAINABILITY Symposium
Example Risk Mitigation
G2
G5
G3
G1
G4
A1 A2 A3 A4
A5
Suppose basic events represent software modules.
Can protect G3 by preventing failure of module A4
- exhaustive testing
- proof of correctness
Then can protect G2 by
-preventing failure of A3
- or by providing detection and recovery handler for G4
- preventing failure of both A1 and A2
-
8/14/2019 Dugan Comp Sys Fta Tutor
28/83
slide(28)
RELIABILITY and MAINTAINABILITY Symposium
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
-
8/14/2019 Dugan Comp Sys Fta Tutor
29/83
slide(29)
RELIABILITY and MAINTAINABILITY Symposium
Modeling Fault-Tolerant Computer Systems
Fault Tolerant Computer (FTC) systems can actively handle manyfaults and errors that may occur.
Because FTC are adaptive and flexible, faulty components can beswitched out automatically, and spares switched in.
However, adaptability and flexibility often result in increasedcomplexity. Increased complexity can mean decreased reliability.
If the fault tolerance mechanisms (error detection, recovery,reconfiguration) fail, this failure could lead to overall system failure,even if adequate functioning resources remain.
A coverage model is used to analyze the behavior of the computersystem in the presence of a fault. The results of the coverage modelare then incorporated into the overall system model.
-
8/14/2019 Dugan Comp Sys Fta Tutor
30/83
slide(30)
RELIABILITY and MAINTAINABILITY Symposium
Covered vs. Uncovered Faults
A coveredfault is one from which the system can automatically
recover. Recovery from transient does not change system state. Recovery from a permanent fault discards faulty component.
An uncoveredfault is one which leads to immediate system failure,
regardless of the state of the system.
-
8/14/2019 Dugan Comp Sys Fta Tutor
31/83
slide(31)
RELIABILITY and MAINTAINABILITY Symposium
How to estimate coverage?
If a working or prototype version of the system exists, or if enoughinformation is available about a system being designed, then coverage
probabilities can be estimated.
A model of the recovery process can be developed. The parameters forthe model can be measured from a fault injection on the workingprototype and or estimated from data collected in the field.
A detailed simulation model of the system recovery process can bedeveloped.
If the details of the recovery process are not known, reasonable
parameters can be deduced from other, similar systems.
-
8/14/2019 Dugan Comp Sys Fta Tutor
32/83
slide(32)
RELIABILITY and MAINTAINABILITY Symposium
General structure of a coverage model
The entry point to the model is the occurrence of the fault, andthe three exits (R,C, and S) are the three possible outcomes.
CoverageModel
permanent coverage
R exit
C exit
S exit
single-point failure
transient restoration(covered fault)
(uncovered failure)
(covered failure}
fault occurs
-
8/14/2019 Dugan Comp Sys Fta Tutor
33/83
slide(33)
RELIABILITY and MAINTAINABILITY Symposium
R exit: Transient Restoration
Correct recognition of and recovery from a transient fault. A transientis usually caused by external or environmental factors, such asexcessive heat or a glitch in the power line.
The vast majority of faults are transient.
Successful recovery from a transient fault restores the system to anoperational state without discarding any components - for exampleby masking the error, retrying an instruction, or rolling back to aprevious checkpoint.
Reaching this exit successfully requires:
timely detection of an error produced by the fault; performance of an effective recovery procedure; and swift disappearance of the fault (the cause of the error).
-
8/14/2019 Dugan Comp Sys Fta Tutor
34/83
slide(34)
RELIABILITY and MAINTAINABILITY Symposium
C exit: Permanent coverage
Determination of the permanent nature of the fault, and the successfulisolation and removal of the faulty component.
S exit: Single Point failure
A single fault causes the system to fail, generally whenan undetected error propagates through the system, or if the faultyunit cannot be isolated and the system cannot be reconfigured.
-
8/14/2019 Dugan Comp Sys Fta Tutor
35/83
slide(35)
RELIABILITY and MAINTAINABILITY Symposium
Typical fault recovery for a processorA processor contains built-in test circuitry so that error checking occursconcurrently with instruction execution. If an error is detected, theinstruction is retried immediately. Partial results are stored in case theretry is unsuccessful, so that the computation can be continued fromsome intermediate point (called a checkpoint).
The process of continuing a computation from a previously saved
checkpoint is called a rollback. In some cases the fault is such that therollback is not successful, so the computation must start over after asystem-level recovery procedure is invoked.
-
8/14/2019 Dugan Comp Sys Fta Tutor
36/83
slide(36)
RELIABILITY and MAINTAINABILITY Symposium
Example coverage model for processors
Wait
Retry
Rollback
Recovery
Permanent
Permanent
Coverage
Exit C
Failure
Exit S
Exit R
RestorationTransient
-
8/14/2019 Dugan Comp Sys Fta Tutor
37/83
slide(37)
RELIABILITY and MAINTAINABILITY Symposium
Example of transient restoration
Transient Restoration attempt: Assume that the fault is transient, andbegin a multi-step recovery procedure that continues as long as anerror is detected. If an error persists after all three steps have been
performed, then a permanent recovery procedure must be invoked.
Step 1: Wait for 0.1 second and do nothing. If the fault is transient itmay disappear during this time, allowing rollback to succeed.
Step 2: Retry the current instruction several times, for as long as ahalf-second. The probability that the retry will be successful (i.e., noerror is detected) is 0.5.
Step 3: If an error persists, perform a rollback to a previous check-point, followed by recomputation, taking 2 sec. total. The rollback suc-ceeds in removing the error 80% of the time.
-
8/14/2019 Dugan Comp Sys Fta Tutor
38/83
slide(38)
RELIABILITY and MAINTAINABILITY Symposium
Example of permanent coverage and
single-point failure
If an error still persists after the rollback, it is assumed to be caused bya permanent fault, and a system level permanent fault recovery
process is begun, to remove the offending processor from the set ofactive units and to reconfigure the system to continue without it.The permanent fault recovery process succeeds with probability 0.875.
The permanent coverage procedure is invoked against a a persistent
transient fault as well as against a permanent fault.
If the permanent fault recovery process fails, then a single-point failureis said to occur.
-
8/14/2019 Dugan Comp Sys Fta Tutor
39/83
slide(39)
RELIABILITY and MAINTAINABILITY Symposium
Coverage model for memories
Single bit
Memory error
Error masked
in zero time
Multiple bit
Memory error
Attempt
recovery
Error Occurs
successful unsuccessful
not detecteddetected
0.980.02
0.05
0.850.15
0.95
Transient
Restoration
Exit R
Permanent
Coverage
Exit C
Failure
Exit S
Failure
Exit S
-
8/14/2019 Dugan Comp Sys Fta Tutor
40/83
slide(40)
RELIABILITY and MAINTAINABILITY Symposium
Example of recovery process for memory faults
The memory uses an error correcting code, so a single-bit error isalways detectable and correctable, and no reconfiguration is required.If 98% of all memory faults affect only a single bit, thenthe probability of reaching the R exit is 0.98.
The 2% of faults that affect more than one memory bit are 95%detectable. When a multiple memory error is detected, the affectedportion of memory is discarded, the memory mapping function is
updated, and the needed information is reloaded from a previouscheckpoint and updated to represent the current state of the system.
Experimentation on a prototype system revealed that this recoveryfrom the detected multiple memory errors works 85% of the time.
Thus, the probability of reaching the C exit is the probability that amultiple fault occurs, is detected, and is recovered from is:
c 0.02 0.95 0.85( ) 0.01615= =
-
8/14/2019 Dugan Comp Sys Fta Tutor
41/83
slide(41)
RELIABILITY and MAINTAINABILITY Symposium
Single point failure for memory faults
There are two paths to the single point failure exit.
The memory fault causes a single-point failure if a multiple-bit error isnot detected (with probability 0.02 x 0.05)
A multiple-bit memory error is detected, but the attempted recovery isnot successful, with probability 0.02 x 0.95 x 0.15
Thus the probability of single point failure is the sum of these twocases, or 0.00385.
-
8/14/2019 Dugan Comp Sys Fta Tutor
42/83
slide(42)
RELIABILITY and MAINTAINABILITY Symposium
Example - 3P2M system
3 Processors and 2 memories connected via a bus. 2 Processors, onememory and the bus are needed for correct operation.
Bus
Processors
Memories
-
8/14/2019 Dugan Comp Sys Fta Tutor
43/83
slide(43)
RELIABILITY and MAINTAINABILITY Symposium
3P2M fault tree
2/3
System Failure
P3P1 P2
B
M2M1
Processors Memories
Bus
-
8/14/2019 Dugan Comp Sys Fta Tutor
44/83
slide(44)
RELIABILITY and MAINTAINABILITY Symposium
Adding coverage to fault tree
P3P2P1 M2 M1
B
System Failure
MemoriesProcessors
2/3
Bus
R R R R R
S S S S SC C C C C
-
8/14/2019 Dugan Comp Sys Fta Tutor
45/83
slide(45)
RELIABILITY and MAINTAINABILITY Symposium
Probabilities for covered and uncovered basic
events
Note that the events fail covered and fail uncovered are mutuallyexclusive (i.e. not independent).
No fault
or transient
restoration
covered fault
uncovered
fault
Pr[component fault] = p
Pr[component operational] = q
Pr[covered failure] = cp
Pr[Uncovered failure] = sp
Pr[No fault or transient restoration] = q + rp
-
8/14/2019 Dugan Comp Sys Fta Tutor
46/83
slide(46)
RELIABILITY and MAINTAINABILITY Symposium
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
-
8/14/2019 Dugan Comp Sys Fta Tutor
47/83
slide(47)
RELIABILITY and MAINTAINABILITY Symposium
Sequence dependencies
Traditional fault trees cannot model sequence dependent failures, in
which the orderthat events occur is important.
We define special purpose gates for modeling sequencedependencies, and solve the resulting fault tree as a Markov chain.
The development of a correct Markov model for a complex system canbe difficult. Our approach is to use the fault tree for model developmentand automatically convert the fault tree to the equivalent Markov chain.The dynamic fault tree model is considerably simpler than theequivalent Markov chain.
Coverage models are automatically added to the resulting Markovchain which is solved via a numerical differential equation solver.
-
8/14/2019 Dugan Comp Sys Fta Tutor
48/83
slide(48)
RELIABILITY and MAINTAINABILITY Symposium
Example of sequence dependency
If switch fails after primary fails (and after spare is activated) then the
system is still operational.
If the switch fails before the primary fails, then the spare cannot beactivated and the system fails, even though the spare is operational.
Failure criteria depends on orderin which failures occur. This systemcan be solved correctly via a Markov model.
Switch
Primary
Spare
-
8/14/2019 Dugan Comp Sys Fta Tutor
49/83
slide(49)
RELIABILITY and MAINTAINABILITY Symposium
Sequence dependency gates
Several special purpose gates have been added to the traditional faulttree gates. These special dynamicgates capture sequencedependencies which frequently arise when modeling fault tolerantcomputer systems. If a dynamic gate is part of a fault tree then it issolved via a Markov chain, rather than by using traditional methods.
The special dynamic gates include:
Functional dependencygate for modeling situations where one com-ponents correct operation is dependent upon the correct operation ofsome other component
Sparegate for modeling cold, warm and hot pooled spares
Priority-ANDgate for modeling ordered ANDing of events. Note thatmany traditional fault trees include the Priority AND gate; most simplyapproximate with an AND gate
-
8/14/2019 Dugan Comp Sys Fta Tutor
50/83
slide(50)
RELIABILITY and MAINTAINABILITY Symposium
dependent basic events
are forced to occur when trigger
event occurs
FDEP
(may be subtree)
Trigger event
FDEP produces no logical output.
Its only effect is on propogating failures.
Functional Dependency gate
S t
-
8/14/2019 Dugan Comp Sys Fta Tutor
51/83
slide(51)
RELIABILITY and MAINTAINABILITY Symposium
fail at (possibly) reduced ratebefore being switched into active use
Primary
SPARE
Spare units which are assumed tocomponent
Spare gate
Priority AND gate
-
8/14/2019 Dugan Comp Sys Fta Tutor
52/83
slide(52)
RELIABILITY and MAINTAINABILITY Symposium
Priority-AND gate
which may occur in any order
A B
occur, and A occurs before B
Output occurs if A and B
Inputs (may be subtrees)
Cascading Priority AND gates
-
8/14/2019 Dugan Comp Sys Fta Tutor
53/83
slide(53)
RELIABILITY and MAINTAINABILITY Symposium
Cascading Priority-AND gates
A B
A before B before C
C
HECS: Hypothetical Example Computer System
-
8/14/2019 Dugan Comp Sys Fta Tutor
54/83
slide(54)
RELIABILITY and MAINTAINABILITY Symposium
HECS: Hypothetical Example Computer System
Operator console
Operator
& Software
A2
A1Memory
InterfaceUnit 1
M2M1 M3 M4 M5
Cold Spare A
Bus
Redundant
Memory
InterfaceUnit 2
HECS system description
-
8/14/2019 Dugan Comp Sys Fta Tutor
55/83
slide(55)
RELIABILITY and MAINTAINABILITY Symposium
HECS system description
HECS consists of dual-redundant processors A1 and A2 and a coldspare which can replace either upon failure. A cold spare is one whichis assumed not to fail before being used.
HECS has 5 memory units; three are required. These memory unitsare connected to the bus via two memory interface units. If the memoryinterface unit fails, the memory units connected to it are unusable.Memory unit 3 (M3) is connected to both interfaces for redundancy;thus M3 is accessible as long as either interface unit is operational.
There is also a human operator who interfaces with the system via aconsole, and runs some software application.
HECS requires at least one of the three A processors, at least 3 of thememory units, at least one of the redundant busses, and the operator,console and software to be operating correctly.
Modeling the cold spares
-
8/14/2019 Dugan Comp Sys Fta Tutor
56/83
slide(56)
RELIABILITY and MAINTAINABILITY Symposium
Modeling the cold spares
Notice that the cold spare is shared between the two processors. Firstto fail is replaced with the spare; the spare is then unavailable if theother fails.
Cold Spare Cold Spare
A1 A2
and spare
Cold Spare
A
A processors
Modeling the memory units
-
8/14/2019 Dugan Comp Sys Fta Tutor
57/83
slide(57)
RELIABILITY and MAINTAINABILITY Symposium
Modeling the memory units
M5M3
Functional
Dependency
M4M2M1
Functional
Dependency
Functional
Dependency
MIU 1MIU 2
3/5
memory
units
HECS system-level fault tree model
-
8/14/2019 Dugan Comp Sys Fta Tutor
58/83
slide(58)
RELIABILITY and MAINTAINABILITY Symposium
HECS system level fault tree model
hypothetical system failure
operator
console software
console
Operator
Cold Spare Cold Spare
A1 A2
M5M3
Functional
Dependency
M4M2M1
Functional
Dependency
Functional
Dependency
MIU 1MIU 2
2*Bus
operator,
console & SW
A processors
and spare
Cold Spare
3/5
memory
units
A
Example: Fault Tolerant Parallel Processor
-
8/14/2019 Dugan Comp Sys Fta Tutor
59/83
slide(59)
RELIABILITY and MAINTAINABILITY Symposium
Example: Fault Tolerant Parallel Processor
Processingelements
Network
elements
NE2 NE4
NE3
NE1
FTPP configuration #1
-
8/14/2019 Dugan Comp Sys Fta Tutor
60/83
slide(60)
RELIABILITY and MAINTAINABILITY Symposium
FTPP configuration #1
One spare per triad
D1 C1 B1 A1
A2
B2
C2
D2
A3 B3 C3 D3
AS
BS
CS
DS
NE2 NE4
NE3
NE1
Fault tree for FTPP configuration #1
-
8/14/2019 Dugan Comp Sys Fta Tutor
61/83
slide(61)RELIABILITY and MAINTAINABILITY Symposium
Fault tree for FTPP configuration #1
3/4 3/4 3/4 3/4
A1 A2 A3 AS B2 B3B1 C2 C3 D1 D2 D3 DSC1BS CS
NE4NE3NE2NE1
FDEPFDEP FDEP FDEP
B1A1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 AS BS CS DS
FTPP configuration #2
-
8/14/2019 Dugan Comp Sys Fta Tutor
62/83
slide(62)RELIABILITY and MAINTAINABILITY Symposium
g
One spare per NE
B3
D1
S2
A3 C1 D2 S3
C2
D3
S4
A1B2C3S1
B1
A2
NE2 NE4
NE3
NE1
Failure conditions for FTPP configuration #2
-
8/14/2019 Dugan Comp Sys Fta Tutor
63/83
slide(63)RELIABILITY and MAINTAINABILITY Symposium
g
Consider as an example, the first member of the A triad, specificallycomponent A1. Now A1 will fail if
both A1 and its spare (S1) fails OR if either of the other processors on the same NE fail before A1does,
thus using the spare first. In this case there will be no spare availablewhen A1 fails.
A1
B2 C3
A1S1
Fault tree for FTPP configuration #2
-
8/14/2019 Dugan Comp Sys Fta Tutor
64/83
slide(64)RELIABILITY and MAINTAINABILITY Symposium
g
A2 A2A1
B2 C3
A1S1 S2
B3 D1
A3 S3
C1 D2
A3 B1 B1 B2 B3 B3S4 B2 S1 S2
C2 D3 C3 A1 D1 A2
C1 C1 C2 C2 C3 C3S3 S4 S1
D2 A3 D3 B1 A1 B2
D1 D1 D2 D2 D3 DS2 S3 S4
A2 B3 A3 C1 B1 C2
2/3 2/32/32/3
FDEP
NE1
A1 B2 C3 S1
FDEP
NE2
A2 B3 D1 S2
FDEP
B1
NE4
C2 D3 S4
FDEP
A3 C1 D2 S3
NE3
Alternative Fault Tree for configuration #2
-
8/14/2019 Dugan Comp Sys Fta Tutor
65/83
slide(65)RELIABILITY and MAINTAINABILITY Symposium
NE4
AS BS CS DS
FDEP
A3 B3 C3 D3
NE3
FDEP
A2 B2 C2 D2
NE2
FDEP
NE1
B1A1 C1 D1
FDEP
A1 A2 A3 B1 B2 B3 C1 C2 C3
Spare Spare Spare SpareSpare Spare Spare Spare Spare Spare Spare
D1 D2 D3
AS
BS
CS
DS
Spare
2/3 2/32/3 2/3
Mission Avionics System Example
-
8/14/2019 Dugan Comp Sys Fta Tutor
66/83
slide(66)RELIABILITY and MAINTAINABILITY Symposium
The suc c ess/ fa ilure of the system is d riven by the need
to p rovide c erta in software func tiona lity
c rew sta tion managementsc ene & obsta c le p roc essingloc a l pa th genera tionsystem management func tionsvehic le ma nag ement
Fault to leranc e is ac hieved via redundant p roc essors(hot spares), poolsof c old sparesand redundant buses.
MAS t hit t
-
8/14/2019 Dugan Comp Sys Fta Tutor
67/83
slide(67)RELIABILITY and MAINTAINABILITY Symposium
MAS system architecture
Background Data Bus
Mission Management Bus
vehiclemgmt 1a
vehiclemgmt 1b
vehiclemgmt 2b
VMSPARE 2
VMSPARE 1
vehiclemgmt 2a
Vehicle Management Bus
Memory 1 Memory 2
scene&obstacle a
crewstation b
crewstation a
scene &obstacle b
local pathgen. a
local pathgen. b
systemmgmt a
systemmgmt b
SPARE 1
SPARE 2
-
8/14/2019 Dugan Comp Sys Fta Tutor
68/83
Redundant Software Architecture
-
8/14/2019 Dugan Comp Sys Fta Tutor
69/83
slide(69)RELIABILITY and MAINTAINABILITY Symposium
Next Consider the p a th genera tion (Pa thGen) andsc ene&obstac le (S&O) func tions:
Eac h func tion needs a sing le p roc essor to p rovide fullfunctionality.
There is a lso a reduc ed version of ea c h func tion tha t c anprovide minimum func tiona lity (Pa thGenMin and S&OMin).
In the event o f a detec ted softwa re fault in Pa thGen thesystem c an switc h to Pa thGenMin (simila rly for S&O)
Further, if there a re no longer 2 full p roc essors ava ilab le, thesystem will switc h to Pa thGenMin a nd S&OMin running on a
sing le p roc essor.
Redundant Software MAS model
-
8/14/2019 Dugan Comp Sys Fta Tutor
70/83
slide(70)RELIABILITY and MAINTAINABILITY Symposium
Loss
S1
S2
Software
Sys Mgt1a
CSP
Sys Mgt
Sys Mgt1b
CSP
Sys1a
Sys1b
Path Gen1a
CSP
Path Gen
Path Gen1b
CSP
Path1a
Path1b
S&O
S&O 1b
CSP
S&O1b
S&O 1a
CSP
S&O1a
S&O SW
CSP
S&OSWfull
S&OSWMin
PathSWfull
PathSWMin
Path SW
CSP
One Proc
Minimize
FDEP
Crew
Both Proc
Crew 1a
CSP
Crew 1b
CSP
Crew1a
Crew1b
-
8/14/2019 Dugan Comp Sys Fta Tutor
71/83
Presentation Outline
-
8/14/2019 Dugan Comp Sys Fta Tutor
72/83
slide(72)RELIABILITY and MAINTAINABILITY Symposium
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
Modular Approach to fault tree analysis
-
8/14/2019 Dugan Comp Sys Fta Tutor
73/83
slide(73)RELIABILITY and MAINTAINABILITY Symposium
Given a fault
tree model as input
Find Independent subtrees
Solve each subtree separately
Combine the results
The best of both worlds
-
8/14/2019 Dugan Comp Sys Fta Tutor
74/83
slide(74)RELIABILITY and MAINTAINABILITY Symposium
Divide-and-c onquer helps p roduc e models tha t ma ynot be too la rge to solve.
For sta tic modules (c onta ining AND, OR, K-of-M) ga tes,
use the fast and effic ient BDD (Binary Dec ision Diagram)approach.
For dynamic modules, c onvert to equiva lent Ma rkov
model for solution.
Different solution methods c an a id in va lida tion a ndtesting.
Modula riza tion a llows c onsidera tion of d ifferent solutionmethods (i.e. simula tion).
Modular Solution of HECS
-
8/14/2019 Dugan Comp Sys Fta Tutor
75/83
slide(75)RELIABILITY and MAINTAINABILITY Symposium
operator
consolesoftware
console
Operator
Cold Spare Cold Spare
A1 A2
M5
M3
Functional
Dependency
M4M2M1
Functional
Dependency
Functional
Dependency
MIU 1MIU 2
type: static
type: dynamic
independent subtree 2
Independent subtree 1
operator,
console & SW
A processors
and spare
Cold Spare
3/5
memory
units
A
2*Bus
Independent subtree 4 (buses)type: static
hypothetical system failure
type:dynamic
independent subtree 3 (memories)
Presentation Outline
-
8/14/2019 Dugan Comp Sys Fta Tutor
76/83
slide(76)RELIABILITY and MAINTAINABILITY Symposium
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions
Sensitivity analysis
-
8/14/2019 Dugan Comp Sys Fta Tutor
77/83
slide(77)RELIABILITY and MAINTAINABILITY Symposium
Reliability Analysis tells only part of the story.
What are the weak points in the system?
How do my results change with changing input parameters?
What is the most cost-effective way to improve reliability?
These questions require sensitivity analysis of reliability analysisresults.
Modular approach to sensitivity analysis
-
8/14/2019 Dugan Comp Sys Fta Tutor
78/83
slide(78)RELIABILITY and MAINTAINABILITY Symposium
pp y y
Sensitivity analysis (also called importance analysis) can use partialderivative.
Sensitivity results from different submodels can be easily combinedusing chain-rule.
Sensitivity analysis for BDD is almost free while calculating reliability.
Sensitivity analysis for Markov chain is more troublesome but we havedeveloped an interesting, efficient approximation.
Example: Cardiac Assist System
-
8/14/2019 Dugan Comp Sys Fta Tutor
79/83
slide(79)RELIABILITY and MAINTAINABILITY Symposium
TEDTSController
Battery
PowerSupply
PaceLeads
PrimaryCPU
BackupCPU
TEDTSCoil
MotorAmplifier
MotorCableMotorPump
CrossbarSwitch
WSP
Cardiac Assist System Fault TreeSystem
-
8/14/2019 Dugan Comp Sys Fta Tutor
80/83
slide(80)RELIABILITY and MAINTAINABILITY Symposium
FDEP WSP
TEDTS
Coil
TEDTS
Contr.Battery
Motor
Cable.
MotorMotor
Amp.Backup
CPU
Primary
CPU
System
Superv.
Crossbar
Switch.
Pace
LeadsPump
Power
Supply
System
Failure
Pump,Motor
Leads
Power
TEDTS
CPU
Motor
Section
TEDTS
Trigger
M1
M2
M3
M4
M5
-
8/14/2019 Dugan Comp Sys Fta Tutor
81/83
Summary
-
8/14/2019 Dugan Comp Sys Fta Tutor
82/83
slide(82)
RELIABILITY and MAINTAINABILITY Symposium
The DFT(dynamic fault tree) methodology is ideally suited for theanalysis of computer-based systems.
DFTuses a modular approach to FTA, detecting modules using a fastand efficient algorithm.
Modules are classified as static or dynamic, depending on the types ofgates included.
Static modules are solved using the BDD approach; dynamic modulesare solved using Markov chain methods.
Coverage models can assess the effect of complex recoverymechanisms.
Dynamic gates can allow modeling of sequence dependencies thatarise from complex redundancy management.
-
8/14/2019 Dugan Comp Sys Fta Tutor
83/83
slide(83)
RELIABILITY and MAINTAINABILITY Symposium
Software for Dynamic Fault Tree Analysis
Galileo/ ASSAPis a software package for fault tree
analysis which embodies the DFT approach.
(Being developed for NASA Langley Research
Center, expected completion Nov. 2001. Beta ver-sion available for evaluation.)