university of massachusetts dept. of …krishna/655/fall06/part2...during dt, or m(t)=n and no fault...
TRANSCRIPT
Page 1
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .1
C. M. KrishnaFall 2006
UNIVERSITY OF MASSACHUSETTSDept. of Electrical & Computer Engineering
Fault Tolerant ComputingECE 655
Part 2Canonical Structures
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .2
Canonical Structures
♦Larger structures can be constructed out of the individual components
♦Complex structures can be constructed out of some basic structures
♦We will assume statistical independencebetween failures in the individual components
♦The basic structures are ∗A Series System∗A Parallel System∗An M out of N System
Page 2
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .3
A Series System
♦A Series System - a set of components connected so that the failure of any one component causes the entire system to fail
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .4
Reliability of a Series System
♦Reliability of a series system - Rs(t) -the product of the reliabilities of its N modules
♦Ri(t) is the reliability of component i
N Rs(t) = Π Ri (t)
i=1
Page 3
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .5
Series System - Constant Failure Rates
♦If every module i has a constant failure rate λ i
♦ - λ i t Ri(t) = e
♦ - λs t - Σλ i tRs(t) = e = e
♦ λs =Σλi is the constant failure rate of the series system
♦ Mean Time To Failure of a series system -
MTTFs = 1/λs = 1/ Σλi
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .6
A Parallel System
♦A Parallel System - a set of modules connected so that all the modules must fail before the system fails
Page 4
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .7
Reliability of a Parallel System
♦Rp(t) - reliability of a parallel system
♦ N 1 - Rp(t) = Π (1-Ri(t))
i=1
♦ NRp(t) = 1 - Π (1-Ri(t))
i=1
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .8
Parallel System - Constant Failure Rates
♦Module i has a constant failure rate, λ i
♦ -λ i t N - λ itRi(t)=e ; Rp(t) = 1 - Π (1-e )
i=1♦Example - a parallel system with two modules
- λ1 t - λ2 t -(λ1+ λ2)tRp(t) = e +e - e
♦MTTF of a parallel system with the same λN
MTTFp=Σ 1/(i λ ) i=1
Page 5
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .9
M out of N Systems
♦An M-of-N system is one which consists of Nidentical components, with failure occurring if fewer than M components are still functional
♦Best-known example - The Triplex (TMR)♦Three identical components whose outputs are
voted on. This is a 2-of-3 system: as long as a majority of the processors produce correct results, the system will be functional
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .10
Reliability of M out of N Systems
♦N identical components♦R(t) - reliability of an individual component♦The reliability of the system is the probability
that N-M or fewer components have failed♦
N-M i N-iRm_of_n(t) = Σ C(N,i) (1-R(t) ) R(t)
i=0
♦ where C(N,i) = N ! / [i ! (N-i) !]
Page 6
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .11
Correlated Failures in M of N Systems
♦Statistical independence of failures in components - key to the high reliability
♦Correlated failure can greatly diminish reliability♦Example: Pcor - probability that the entire
system suffers a global failure
♦ N-M i N-iRm_of_n_cor (t) = (1-Pcor) Σ C(N,i) (1-R(t) ) R(t)
i=0
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .12
M out of N Systems - Modes of Correlation
♦If system is not designed carefully, the correlated failure factor can dominate the overall failure probability
♦Different modes of correlation among components exist - not necessarily a global failure
♦Correlated failure rates are extremely difficult to estimate
♦From now on we will deal with statisticallyindependent failures in components
Page 7
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .13
TMR - Triple Modular Redundant Cluster
♦TMR - perhaps the most important M-of-N system ♦M=2, N=3 - system good if at least two
components are operational ♦A voter picks the majority output
♦Voter can fail - reliability Rvot(t)♦ 1 i 3-i
Rtmr(t) = Rvot(t) Σ C(3,i) (1-R(t) ) R(t) i=0
= Rvot(t) ( 3R² (t) - 2R³ (t) )
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .14
TMR - Constant Failure Rates
♦ - λ tR(t)=e
♦No voter failures - Rvot(t)=1
♦ -2λt -3λt Rtmr(t)=3e - 2e
♦MTTFtmr = Rtmr(t) dt=5/(6 λ) <1/ λ = MTTFsimplex0
∞
Page 8
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .15
NMR - N-Modular Redundant Cluster
♦M-of-N cluster with N odd and M = (N+1)/2♦Voter failure rate negligible - Rvot(t)=1
♦Below R=0.5 - redundancy becomes a disadvantage♦Usually R >> 0.5 - triplex offers significant
reliability gains
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .16
TMR - Compensating Faults♦Conservative assumption - every failure of the
voter will lead to an erroneous output and any failure of two modules is fatal
♦Counter Example - one module produces a permanent logical 1 and a second module has a permanent logical 0 - the TMR will function properly regarding this bit
♦A similar situation may arise regarding certain faults within the voter circuit
♦Another example - non-overlapping faults - one module has a faulty adder and another module has a faulty multiplier
♦If the circuits are disjoint, they are unlikely to generate wrong outputs simultaneously
Page 9
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .17
Voters
♦A voter receives inputs X1, X2,...,XN from an M-of-N cluster and generates a representative output
♦Simplest voter - bit-by-bit comparison of the outputs producing the majority vote
♦This only works when all functional processors generate outputs that match bit by bit∗Processors must be identical and use the same software
♦Otherwise - two correct outputs can diverge slightly, in the lower significant bits
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .18
Plurality Voting
♦We declare two outputs X and Y as practically identical if |x-y| < δ for some specified δ
♦A k-plurality voter looks for a set of at least k practically identical outputs, and picks any of them (or their median) as the representative
♦Example - δ = 0.1, five outputs ♦1.10, 1.11, 1.32, 1.49, 3.00♦The subset {1.10, 1.11} would be selected by
a 2-plurality voter
Page 10
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .19
Duplex Systems
♦Both processors execute the same task∗ If outputs are in agreement - result is assumed to be correct
∗If results are different - we can not identify the failed processor
∗a higher-level software has to decide how failure is to be handled
♦This can be done using one of several methods
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .20
First Method - Acceptance Tests
♦Acceptance Test - a range check of each processor's output
♦Example - the pressure in a boiler must be in some known range
♦We use semantic information of the task to predict which values of output indicate an error
♦How should the acceptance range be picked?
Page 11
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .21
Acceptance Test - Sensitivity Vs. Specificity
♦Narrow acceptance range: high probability of identifying an incorrect output, but also a high probability that a correct output will be misidentified as erroneous (false positive)
♦Wide acceptance range: low probability of both♦Sensitivity - the probability that the test will
recognize an erroneous output as such♦Specificity - the probability that the test will
identify a correct output as such♦Narrow range - high sensitivity but low specificity♦Wide range - low sensitivity but high specificity
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .22
Second Method - Testing
♦Both processors are subjected to some test ♦The processor which fails the test is identified
as faulty ♦Real-life tests are never perfect♦Test Coverage - same as test sensitivity - the
probability that the test can identify a faulty processor as such
♦Test Transparency - the complement of the test coverage - the probability that the test passes a faulty processor as good
Page 12
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .23
Third Method - Forward Recovery
♦Use a third processor to repeat the computation carried out by the duplex
♦If only one of the three processors is faulty, then the one that disagrees is the faulty one
♦It is possible to use a combination of these methods
♦Acceptance test - quickest to run but often the least sensitive
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .24
Duplex Reliability
♦Two active identical processors with reliability R(t)each
♦Lifetime of duplex - the time until both processors fail
♦C - Coverage Factor - the probability that a faulty processor will be correctly diagnosed, identified and disconnected
♦Rduplex(t) - the reliability of the duplex system: ♦
Rduplex(t) = Rcomp(t) [ R² (t)+2C R(t)(1-R(t) ]Rcomp(t) - reliability of comparator
Page 13
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .25
Duplex - Constant Failure Rates
♦Each processor has a constant failure rate λ♦Ideal comparator - Rcomp(t)=1
♦Duplex reliability -♦ -2λt - λt - λt
Rduplex(t) = e + 2Ce (1-e )
♦MTTFduplex = 1/(2λ) + C/λ
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .26
Duplex with Redundancy
♦Duplex with two active identical processors and an unlimited number of spares
♦When a processor fails, failure is detected with probability Pd and a new processor is replacing the one that failed
♦Probability that this process will result in failure of the entire duplex - 1-Ps
♦Induction process is instantaneous, spares are always functional
Page 14
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .27
Duplex with Redundancy - Model
♦Each processor has a constant failure rate λ♦Lifetime of a processor has an Exponential
distribution with parameter λ♦Time between two consecutive failures of the
same logical processor is Exponentially distributed with a parameter λ
♦M(t) - number of failures in one logical processor during the time interval [0,t]
♦N(t) - number of failures in the duplex system during the time interval [0,t]
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .28
Duplex with Redundancy -Distribution of M(t)
♦∆t - a small interval of time so that the probability of more than one failure occurring in ∆t is negligible
♦M(t+ ∆t )=n if either M(t)=n-1 and a fault occurred during ∆t, or M(t)=n and no fault occurred during ∆t
♦ Prob[M(t+ ∆t)=n]=Prob[M(t)=n-1] λ ∆t+Prob[M(t)=n](1- λ ∆t)♦This results in the differential equation♦ d Prob[M(t)=n]/dt=- λ Prob[M(t)=n] + λ Prob[M(t)=n-1]♦ Prob[M(0)=n]=0 for n≥1 and Prob[M(0)=0]=1♦The solution is - -λt n
Prob[M(t)=n]=e (λt ) /n! for n=0,1,2,...♦M(t) has a Poisson distribution with the parameter λt
Page 15
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .29
Duplex with redundancy - Reliability Calculation
♦Duplex has two processors - failure rate of system is 2λ
♦Comparator failure rate - negligible♦Probability of n failures in duplex in [0,t] -
-2λt nProb[N(t)=n]=e (2λt ) /n! for n=0,1,2,...
♦For the duplex not to fail, each of these failures must be detected and successfully replaced -
nprobability C=Pd Ps ; for n failures - C
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .30
Duplex - Reliability
n Rduplex(t) = Σ Prob[n failures] C
= Σ exp(-2 λ t) (2 λ t ) C / n!
= exp(-2 λ t) Σ (2 λ t C) / n!
= exp(-2 λ t) exp(2 λ t C)
Rduplex(t) = exp (-2 λ (1-C) t )
n=0
∞
n
n∞
nn=0
n=0
∞
Page 16
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .31
Duplex Reliability - Alternative Derivation
♦Individual processors fail at rate λ♦Rate of failures in the duplex is 2λ♦Probability C of each failure to be successfully
dealt with, and 1-C to cause duplex failure ♦Failures that crash the duplex occur with rate
2λ(1-C)
- 2λ(1-C)tThe reliability of the system is e
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .32
More Complex Systems♦NMR systems in which failing processors are
identified and replaced from an infinite pool of spares - similar calculation
♦Finite set of spares - the summation in the reliability derivation is capped at that number of spares, rather than going to infinity
♦Other variations of duplex systems -∗ One processor is active while the second is a standby spare ∗ Processors can be repaired when they become faulty
♦Combinatorial arguments may be insufficient for reliability calculation in more complex systems
♦If failure rates are constant, we can use MarkovModels for reliability calculations
Page 17
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .33
Markov Chains - Introduction♦Markov Models provide a structured approach for the
derivation of the reliability of complex systems ♦A Markov Chain is a stochastic process X(t) - an
infinite sequence of random variables indexed by time t , with a special probabilistic structure
♦For a stochastic process to be a Markov Chain, its future behavior must depend only on its present state, and not on any past state
♦X(t+s) depends on X(t), but given X(t), X(t+s) does not depend on any X(τ) for τ < t
♦If X(t)=i - the chain is in state i at time t♦We deal only with Markov Chains with continuous time
(0≤t≤ ∞ ) and discrete state (X(t)=0,1,2,…)
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .34
Markov Chain - Probabilistic Interpretation
♦P(X(t+s)=j/X(t)=i,X(τ)=k)=P(X(t+s)=j/X(t)=i) (τ<t)♦Once the chain moves into state i, it stays there
for a length of time which is Exponentially distributed with parameter λ i - it has a constant rate λ i of leaving state i
♦The probability that when leaving state i the chain will move to state j (with j≠i) - Pij
♦Transition rate from state i to state j - λij=Pijλ i
Σ Pij=1 Σ λij= λ ij≠ij≠i
Page 18
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .35
State Probabilities ♦Pi (t) - probability that the process is in state i at
time t, given it started at state i0 at time 0♦Differential equations for Pi(t), (i=0,1,2,...) -♦For given time instant t, state i and a very small
interval of time ∆t, the chain can be in state i at time t+∆t in one of the following cases: ∗It was in state i at time t and has not moved during the interval ∆t - probability Pi(t)(1- λi∆t)
∗It was at some state j at time t (j≠i) and moved from j to i during the interval ∆t - probability Pj(t) λji ∆t
∗No more than one transition if ∆t is small enough
♦Pi(t+∆t)=Pi(t)(1-λi ∆t)+ Σ Pj(t) λ ji ∆tj≠i
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .36
Differential Equations for Pi(t)
♦dPi(t)/dt = - λi Pi(t)+ Σ λji Pj(t)
♦Since λ i = Σ λ ij
♦dPi(t)/dt = - Σ λijPi (t) + Σ λji Pj (t)
♦This (for i=0,1,2,...) can now be solved, using the initial conditions
♦Pi0(0)=1 and Pi(0)=0 for i ≠ i0
j≠i
j≠i
j≠i
j≠i
Page 19
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .37
Duplex with a Standby
♦Example: One active processor and a one standby spare -connected when the active unit fails
♦Constant failure rate λ of an active processor ♦C- coverage factor - probability that a failure of
the active processor is correctly detected and the spare processor is successfully connected
♦The Markov chain -
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .38
Differential Equations for Duplex with Standby
♦dP2(t)/dt = - λ P2(t) ♦dP1(t)/dt = λ C P2(t) - λ P1(t)♦dP0(t)/dt = λ (1-C) P2(t) + λ P1(t)♦Initial conditions:♦P2(0)=1, P1(0)=P0(0)=0
dPi(t)/dt = - Pi (t) Σ λij + Σ λji Pj (t)j≠i j≠i
Page 20
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .39
Reliability of Duplex with Standby
♦Solution of differential equations:♦P2(t)=exp(- λ t)♦P1(t)=C λ t exp(- λ t)♦P0(t)=1-P2(t)-P1(t)
♦Rsystem=1-P0(t)= P2(t)+P1(t) = exp(- λt)+Cλt exp(- λt)
♦Exercise - derive this expression using combinatorial arguments
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .40
Duplex with Repair♦Two active processors: each with failure rate λ
and repair rate µ♦The Markov model
♦The differential equations -♦dP2(t)/dt=-2λ P2(t)+µ P1(t)♦dP1(t)/dt= 2λ P2(t)+2µ P0(t)-(λ+µ)P1(t) ♦dP0(t)/dt= λ P1(t) -2µ P0(t)♦Initial conditions -♦P2(0)=1, P1(0)=P0(0)=0
dPi(t)/dt = - Pi (t) Σ λij + Σ λji Pj (t)j≠i j≠i
Page 21
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .41
Duplex with Repair - State Probabilities
♦The solution to the differential equations -
♦P2(t)=µ /(λ+µ) +2λµ/(λ+µ) exp[-(λ+µ)t]
+λ /(λ+µ) exp[-2(λ+µ)t]
♦ P1(t)=2λµ/(λ+µ) +2λ(λ-µ)/(λ+µ) exp[-(λ+µ)t]
-2λ /(λ+µ) exp[-2(λ+µ)t]
♦P0(t)=1-P2(t)-P1(t)
22
22 2
22
22
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .42
Different Measures♦In systems without repair, mainly the reliability
measure is of significance♦With repair all three - reliability, availability and
point availability - are meaningful♦ Point Availability - Ap(t)
= Prob{The system is operational at time t}=1-P0(t)♦Reliability - R(t)=Prob {The system is operational
during [0,t] } - can be calculated by removing the transition from state 0 to state 1, solving the resulting new differential equations - R(t)=1-P0(t)
♦Availability - A(t) - average proportion of time during the time interval [0,t] that the system is operational -most relevant in systems with repair
Page 22
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .43
Steady State Availability
♦We calculate the steady state availability - A(∞) (or A) - the proportion of time in the long run that the system is operational
♦We first calculate the steady-state probabilities -P2(∞), P1(∞), and P0(∞) (or P2,P1,P0)
♦These steady-state probabilities can be calculated in one of the two methods:∗ letting t approach ∞ in Pi(t)∗ setting dPi(t)/dt=0 (i=0,1,2) and solving the linear equations for Pi, using the relationship P2+P1+P0=1
♦ A=1-P0
Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .44
Duplex with Repair - Steady State♦Steady state probabilities -♦P2= µ /(λ+µ)
♦P1= 2λµ/(λ+µ)
♦P0= λ /(λ+µ)
♦Steady state availability -♦A=A(∞) = P2 + P1 = 1 - P0
= (µ +2λµ )/(λ+µ) = 1 - λ /(λ+µ)
2
2 2
22
2
2 2
2