university of massachusetts dept. of …krishna/655/fall06/part2...during dt, or m(t)=n and no fault...

Copyright 2004 Koren & Krishna ECE655/Krishna Part.2 .1

C. M. KrishnaFall 2006

UNIVERSITY OF MASSACHUSETTSDept. of Electrical & Computer Engineering

Fault Tolerant ComputingECE 655

Part 2Canonical Structures


Canonical Structures

♦Larger structures can be constructed out of the individual components

♦Complex structures can be constructed out of some basic structures

♦We will assume statistical independencebetween failures in the individual components

♦The basic structures are ∗A Series System∗A Parallel System∗An M out of N System


A Series System

♦A Series System - a set of components connected so that the failure of any one component causes the entire system to fail


Reliability of a Series System

♦Reliability of a series system - Rs(t) -the product of the reliabilities of its N modules

♦Ri(t) is the reliability of component i

N Rs(t) = Π Ri (t)

i=1


Series System - Constant Failure Rates

♦If every module i has a constant failure rate λ i

♦ - λ i t Ri(t) = e

♦ - λs t - Σλ i tRs(t) = e = e

♦ λs =Σλi is the constant failure rate of the series system

♦ Mean Time To Failure of a series system -

MTTFs = 1/λs = 1/ Σλi


A Parallel System

♦A Parallel System - a set of modules connected so that all the modules must fail before the system fails


Reliability of a Parallel System

♦Rp(t) - reliability of a parallel system

♦ N 1 - Rp(t) = Π (1-Ri(t))

i=1

♦ NRp(t) = 1 - Π (1-Ri(t))

i=1


Parallel System - Constant Failure Rates

♦Module i has a constant failure rate, λ i

♦ -λ i t N - λ itRi(t)=e ; Rp(t) = 1 - Π (1-e )

i=1♦Example - a parallel system with two modules

- λ1 t - λ2 t -(λ1+ λ2)tRp(t) = e +e - e

♦MTTF of a parallel system with the same λN

MTTFp=Σ 1/(i λ ) i=1


M out of N Systems

♦An M-of-N system is one which consists of Nidentical components, with failure occurring if fewer than M components are still functional

♦Best-known example - The Triplex (TMR)♦Three identical components whose outputs are

voted on. This is a 2-of-3 system: as long as a majority of the processors produce correct results, the system will be functional


Reliability of M out of N Systems

♦N identical components♦R(t) - reliability of an individual component♦The reliability of the system is the probability

that N-M or fewer components have failed♦

N-M i N-iRm_of_n(t) = Σ C(N,i) (1-R(t) ) R(t)

i=0

♦ where C(N,i) = N ! / [i ! (N-i) !]


Correlated Failures in M of N Systems

♦Statistical independence of failures in components - key to the high reliability

♦Correlated failure can greatly diminish reliability♦Example: Pcor - probability that the entire

system suffers a global failure

♦ N-M i N-iRm_of_n_cor (t) = (1-Pcor) Σ C(N,i) (1-R(t) ) R(t)

i=0


M out of N Systems - Modes of Correlation

♦If system is not designed carefully, the correlated failure factor can dominate the overall failure probability

♦Different modes of correlation among components exist - not necessarily a global failure

♦Correlated failure rates are extremely difficult to estimate

♦From now on we will deal with statisticallyindependent failures in components


TMR - Triple Modular Redundant Cluster

♦TMR - perhaps the most important M-of-N system ♦M=2, N=3 - system good if at least two

components are operational ♦A voter picks the majority output

♦Voter can fail - reliability Rvot(t)♦ 1 i 3-i

Rtmr(t) = Rvot(t) Σ C(3,i) (1-R(t) ) R(t) i=0

= Rvot(t) ( 3R² (t) - 2R³ (t) )


TMR - Constant Failure Rates

♦ - λ tR(t)=e

♦No voter failures - Rvot(t)=1

♦ -2λt -3λt Rtmr(t)=3e - 2e

♦MTTFtmr = Rtmr(t) dt=5/(6 λ) <1/ λ = MTTFsimplex0

∞


NMR - N-Modular Redundant Cluster

♦M-of-N cluster with N odd and M = (N+1)/2♦Voter failure rate negligible - Rvot(t)=1

♦Below R=0.5 - redundancy becomes a disadvantage♦Usually R >> 0.5 - triplex offers significant

reliability gains


TMR - Compensating Faults♦Conservative assumption - every failure of the

voter will lead to an erroneous output and any failure of two modules is fatal

♦Counter Example - one module produces a permanent logical 1 and a second module has a permanent logical 0 - the TMR will function properly regarding this bit

♦A similar situation may arise regarding certain faults within the voter circuit

♦Another example - non-overlapping faults - one module has a faulty adder and another module has a faulty multiplier

♦If the circuits are disjoint, they are unlikely to generate wrong outputs simultaneously


Voters

♦A voter receives inputs X1, X2,...,XN from an M-of-N cluster and generates a representative output

♦Simplest voter - bit-by-bit comparison of the outputs producing the majority vote

♦This only works when all functional processors generate outputs that match bit by bit∗Processors must be identical and use the same software

♦Otherwise - two correct outputs can diverge slightly, in the lower significant bits


Plurality Voting

♦We declare two outputs X and Y as practically identical if |x-y| < δ for some specified δ

♦A k-plurality voter looks for a set of at least k practically identical outputs, and picks any of them (or their median) as the representative

♦Example - δ = 0.1, five outputs ♦1.10, 1.11, 1.32, 1.49, 3.00♦The subset {1.10, 1.11} would be selected by

a 2-plurality voter


Duplex Systems

♦Both processors execute the same task∗ If outputs are in agreement - result is assumed to be correct

∗If results are different - we can not identify the failed processor

∗a higher-level software has to decide how failure is to be handled

♦This can be done using one of several methods


First Method - Acceptance Tests

♦Acceptance Test - a range check of each processor's output

♦Example - the pressure in a boiler must be in some known range

♦We use semantic information of the task to predict which values of output indicate an error

♦How should the acceptance range be picked?


Acceptance Test - Sensitivity Vs. Specificity

♦Narrow acceptance range: high probability of identifying an incorrect output, but also a high probability that a correct output will be misidentified as erroneous (false positive)

♦Wide acceptance range: low probability of both♦Sensitivity - the probability that the test will

recognize an erroneous output as such♦Specificity - the probability that the test will

identify a correct output as such♦Narrow range - high sensitivity but low specificity♦Wide range - low sensitivity but high specificity


Second Method - Testing

♦Both processors are subjected to some test ♦The processor which fails the test is identified

as faulty ♦Real-life tests are never perfect♦Test Coverage - same as test sensitivity - the

probability that the test can identify a faulty processor as such

♦Test Transparency - the complement of the test coverage - the probability that the test passes a faulty processor as good


Third Method - Forward Recovery

♦Use a third processor to repeat the computation carried out by the duplex

♦If only one of the three processors is faulty, then the one that disagrees is the faulty one

♦It is possible to use a combination of these methods

♦Acceptance test - quickest to run but often the least sensitive


Duplex Reliability

♦Two active identical processors with reliability R(t)each

♦Lifetime of duplex - the time until both processors fail

♦C - Coverage Factor - the probability that a faulty processor will be correctly diagnosed, identified and disconnected

♦Rduplex(t) - the reliability of the duplex system: ♦

Rduplex(t) = Rcomp(t) [ R² (t)+2C R(t)(1-R(t) ]Rcomp(t) - reliability of comparator


Duplex - Constant Failure Rates

♦Each processor has a constant failure rate λ♦Ideal comparator - Rcomp(t)=1

♦Duplex reliability -♦ -2λt - λt - λt

Rduplex(t) = e + 2Ce (1-e )

♦MTTFduplex = 1/(2λ) + C/λ


Duplex with Redundancy

♦Duplex with two active identical processors and an unlimited number of spares

♦When a processor fails, failure is detected with probability Pd and a new processor is replacing the one that failed

♦Probability that this process will result in failure of the entire duplex - 1-Ps

♦Induction process is instantaneous, spares are always functional


Duplex with Redundancy - Model

♦Each processor has a constant failure rate λ♦Lifetime of a processor has an Exponential

distribution with parameter λ♦Time between two consecutive failures of the

same logical processor is Exponentially distributed with a parameter λ

♦M(t) - number of failures in one logical processor during the time interval [0,t]

♦N(t) - number of failures in the duplex system during the time interval [0,t]


Duplex with Redundancy -Distribution of M(t)

♦∆t - a small interval of time so that the probability of more than one failure occurring in ∆t is negligible

♦M(t+ ∆t )=n if either M(t)=n-1 and a fault occurred during ∆t, or M(t)=n and no fault occurred during ∆t

♦ Prob[M(t+ ∆t)=n]=Prob[M(t)=n-1] λ ∆t+Prob[M(t)=n](1- λ ∆t)♦This results in the differential equation♦ d Prob[M(t)=n]/dt=- λ Prob[M(t)=n] + λ Prob[M(t)=n-1]♦ Prob[M(0)=n]=0 for n≥1 and Prob[M(0)=0]=1♦The solution is - -λt n

Prob[M(t)=n]=e (λt ) /n! for n=0,1,2,...♦M(t) has a Poisson distribution with the parameter λt


Duplex with redundancy - Reliability Calculation

♦Duplex has two processors - failure rate of system is 2λ

♦Comparator failure rate - negligible♦Probability of n failures in duplex in [0,t] -

-2λt nProb[N(t)=n]=e (2λt ) /n! for n=0,1,2,...

♦For the duplex not to fail, each of these failures must be detected and successfully replaced -

nprobability C=Pd Ps ; for n failures - C


Duplex - Reliability

n Rduplex(t) = Σ Prob[n failures] C

= Σ exp(-2 λ t) (2 λ t ) C / n!

= exp(-2 λ t) Σ (2 λ t C) / n!

= exp(-2 λ t) exp(2 λ t C)

Rduplex(t) = exp (-2 λ (1-C) t )

n=0

∞

n

n∞

nn=0

n=0

∞


Duplex Reliability - Alternative Derivation

♦Individual processors fail at rate λ♦Rate of failures in the duplex is 2λ♦Probability C of each failure to be successfully

dealt with, and 1-C to cause duplex failure ♦Failures that crash the duplex occur with rate

2λ(1-C)

- 2λ(1-C)tThe reliability of the system is e


More Complex Systems♦NMR systems in which failing processors are

identified and replaced from an infinite pool of spares - similar calculation

♦Finite set of spares - the summation in the reliability derivation is capped at that number of spares, rather than going to infinity

♦Other variations of duplex systems -∗ One processor is active while the second is a standby spare ∗ Processors can be repaired when they become faulty

♦Combinatorial arguments may be insufficient for reliability calculation in more complex systems

♦If failure rates are constant, we can use MarkovModels for reliability calculations


Markov Chains - Introduction♦Markov Models provide a structured approach for the

derivation of the reliability of complex systems ♦A Markov Chain is a stochastic process X(t) - an

infinite sequence of random variables indexed by time t , with a special probabilistic structure

♦For a stochastic process to be a Markov Chain, its future behavior must depend only on its present state, and not on any past state

♦X(t+s) depends on X(t), but given X(t), X(t+s) does not depend on any X(τ) for τ < t

♦If X(t)=i - the chain is in state i at time t♦We deal only with Markov Chains with continuous time

(0≤t≤ ∞ ) and discrete state (X(t)=0,1,2,…)


Markov Chain - Probabilistic Interpretation

♦P(X(t+s)=j/X(t)=i,X(τ)=k)=P(X(t+s)=j/X(t)=i) (τ<t)♦Once the chain moves into state i, it stays there

for a length of time which is Exponentially distributed with parameter λ i - it has a constant rate λ i of leaving state i

♦The probability that when leaving state i the chain will move to state j (with j≠i) - Pij

♦Transition rate from state i to state j - λij=Pijλ i

Σ Pij=1 Σ λij= λ ij≠ij≠i


State Probabilities ♦Pi (t) - probability that the process is in state i at

time t, given it started at state i0 at time 0♦Differential equations for Pi(t), (i=0,1,2,...) -♦For given time instant t, state i and a very small

interval of time ∆t, the chain can be in state i at time t+∆t in one of the following cases: ∗It was in state i at time t and has not moved during the interval ∆t - probability Pi(t)(1- λi∆t)

∗It was at some state j at time t (j≠i) and moved from j to i during the interval ∆t - probability Pj(t) λji ∆t

∗No more than one transition if ∆t is small enough

♦Pi(t+∆t)=Pi(t)(1-λi ∆t)+ Σ Pj(t) λ ji ∆tj≠i


Differential Equations for Pi(t)

♦dPi(t)/dt = - λi Pi(t)+ Σ λji Pj(t)

♦Since λ i = Σ λ ij

♦dPi(t)/dt = - Σ λijPi (t) + Σ λji Pj (t)

♦This (for i=0,1,2,...) can now be solved, using the initial conditions

♦Pi0(0)=1 and Pi(0)=0 for i ≠ i0

j≠i

j≠i

j≠i

j≠i


Duplex with a Standby

♦Example: One active processor and a one standby spare -connected when the active unit fails

♦Constant failure rate λ of an active processor ♦C- coverage factor - probability that a failure of

the active processor is correctly detected and the spare processor is successfully connected

♦The Markov chain -


Differential Equations for Duplex with Standby

♦dP2(t)/dt = - λ P2(t) ♦dP1(t)/dt = λ C P2(t) - λ P1(t)♦dP0(t)/dt = λ (1-C) P2(t) + λ P1(t)♦Initial conditions:♦P2(0)=1, P1(0)=P0(0)=0

dPi(t)/dt = - Pi (t) Σ λij + Σ λji Pj (t)j≠i j≠i


Reliability of Duplex with Standby

♦Solution of differential equations:♦P2(t)=exp(- λ t)♦P1(t)=C λ t exp(- λ t)♦P0(t)=1-P2(t)-P1(t)

♦Rsystem=1-P0(t)= P2(t)+P1(t) = exp(- λt)+Cλt exp(- λt)

♦Exercise - derive this expression using combinatorial arguments


Duplex with Repair♦Two active processors: each with failure rate λ

and repair rate µ♦The Markov model

♦The differential equations -♦dP2(t)/dt=-2λ P2(t)+µ P1(t)♦dP1(t)/dt= 2λ P2(t)+2µ P0(t)-(λ+µ)P1(t) ♦dP0(t)/dt= λ P1(t) -2µ P0(t)♦Initial conditions -♦P2(0)=1, P1(0)=P0(0)=0

dPi(t)/dt = - Pi (t) Σ λij + Σ λji Pj (t)j≠i j≠i


Duplex with Repair - State Probabilities

♦The solution to the differential equations -

♦P2(t)=µ /(λ+µ) +2λµ/(λ+µ) exp[-(λ+µ)t]

+λ /(λ+µ) exp[-2(λ+µ)t]

♦ P1(t)=2λµ/(λ+µ) +2λ(λ-µ)/(λ+µ) exp[-(λ+µ)t]

-2λ /(λ+µ) exp[-2(λ+µ)t]

♦P0(t)=1-P2(t)-P1(t)

22

22 2

22

22


Different Measures♦In systems without repair, mainly the reliability

measure is of significance♦With repair all three - reliability, availability and

point availability - are meaningful♦ Point Availability - Ap(t)

= Prob{The system is operational at time t}=1-P0(t)♦Reliability - R(t)=Prob {The system is operational

during [0,t] } - can be calculated by removing the transition from state 0 to state 1, solving the resulting new differential equations - R(t)=1-P0(t)

♦Availability - A(t) - average proportion of time during the time interval [0,t] that the system is operational -most relevant in systems with repair


Steady State Availability

♦We calculate the steady state availability - A(∞) (or A) - the proportion of time in the long run that the system is operational

♦We first calculate the steady-state probabilities -P2(∞), P1(∞), and P0(∞) (or P2,P1,P0)

♦These steady-state probabilities can be calculated in one of the two methods:∗ letting t approach ∞ in Pi(t)∗ setting dPi(t)/dt=0 (i=0,1,2) and solving the linear equations for Pi, using the relationship P2+P1+P0=1

♦ A=1-P0


Duplex with Repair - Steady State♦Steady state probabilities -♦P2= µ /(λ+µ)

♦P1= 2λµ/(λ+µ)

♦P0= λ /(λ+µ)

♦Steady state availability -♦A=A(∞) = P2 + P1 = 1 - P0

= (µ +2λµ )/(λ+µ) = 1 - λ /(λ+µ)

2

2 2

22

2

2 2

2

university of massachusetts dept. of …krishna/655/fall06/part2...during dt, or m(t)=n and no fault...

Documents