

IEEE TRANSACTIONS ON RELIABILITY, VOL. R-26, NO. 5, DECEMBER 1977 335

Dynamic Decision Elements for 3-Unit Systems

Oguz Tosun, Member IEEE

Key Words: 3-Unit system, Stuck-at-fault, Dynamic decision element.

Reader Aids:
Purpose: Widen state of the art
Special math needed for explanations and results: Probability
Results useful to: Reliability theoreticians, designers of fault-tolerant computers

Abstract: Simple triplication is widely analyzed either as a static redundancy by itself or as an active core for dynamic redundancy. Its practical value was tested when it was used to protect the hardcore test and repair processor (TARP) of JPL STAR computer and the CPU of Guidance Computer for SATURN V. Two current techniques for using triplicated modules are TMR and TMR/Simplex. In this paper a set of dynamic decision techniques is proposed for better use of triplication, assuming Pr{system is in a specified failure mode | system has failed} is known for all failure modes of concern.

INTRODUCTION

A purely fault-intolerant computer is known as a nonredundant computer [1]. To improve the reliability of a nonredundant computer, redundancy is used in several forms (hardware, program, or time redundancy). Redundancy is the set of extras (hardware, programs, or time) which would not be necessary for a perfect computer to execute the same algorithms. A perfect computer is failure free and has infinite speed. Static (masking) redundancy is a type of hardware redundancy in which replicas of nonredundant modules are permanently connected and powered to mask the effect of hardware failures within the modules.

In this paper 3-replica static hardware redundancy (triplication) is considered. The most commonly used decision rule for triplication is the Majority Decision Rule. A system which uses triplication with a majority decision element is known as a TMR system. The system output is the vote of majority of the replicas. It is capable of masking single failures, but the system is degraded with the first failure and fails with the second failure, despite the remaining one good unit. A detailed analysis of TMR systems, both with perfect and imperfect decision elements, can be found in [2-4]. Another technique used with triplication is the TMR/Simplex method. In TMR/Simplex, system begins operation as a regular TMR, but with the detection of first failure, it automatically switches off the failed unit and a good unit so that remaining good unit operates as a simplex [2]. A decision element which changes its character in accordance with the occurrence of failures in the modules is called a dynamic decision element.

This paper proposes a variety of dynamic decision techniques to be used with triplication. Their estimated reliabilities are proven (under certain conditions) to be better than that of TMR and TMR/Simplex. Pr{unit will survive in time period T} and Pr{unit is in failure mode $m_i$ | unit has failed} are used as the criteria to design the decision elements. The second criterion was used in the past to calculate the estimated reliability of TMR [5], but not as a criterion to design decision elements.

There are two different types of hardware failures [1]:

a) Permanent failures
b) Temporary or intermittent failures.

Decision elements proposed in this paper are designed considering permanent failures. They are also capable of tolerating temporary failures; however, once the decision element adapts itself following a failure, it does not return to its original configuration even though the temporary failure is removed. A retry mechanism can be added to the system to identify temporary failures [7]. Permanent failure modes are further divided into three categories:

a) Stuck-at-0 (s-a-0): Failed unit produces a logical 0 output for all input combinations, i.e., the output is zero regardless of input.
b) Stuck-at-1 (s-a-1): Failed unit produces logical 1 output for all input combinations.
c) Stuck-at-x (s-a-x): Failed unit is in a mode other than a) or b).

Example: Failure modes for a logical AND gate are shown in the following table.

    inputs    16 possible outputs
    0 0       0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
    0 1       0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
    1 0       0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
    1 1       0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

Reading the output columns from left to right: column 1 is s-a-0, column 2 is the correct output, columns 3 to 15 are s-a-x, and column 16 is s-a-1.

NOTATION

$\lambda$      failure rate (constant) of a nonredundant unit
$T$      mission time, ($T > 0$)
$R_u$      $\exp(-\lambda T)$
$a_\alpha$      Pr{stuck-at-$\alpha$ failure | unit has failed}
$P_i$      Pr{nonredundant unit output = logical $i$ | unit is good}, $i = 0,1$
$x_{ij}$      Pr{nonredundant unit output is logical $j$ where its specified value is $i$ | unit has failed into s-a-x mode}, $i, j \in (0,1)$
s-a-$i$      hardware failure into permanent mode $i$, $i = 0,1,x$.
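The AND-gate example can be checked mechanically. The following is a minimal sketch, not part of the original paper: it enumerates all 16 possible output columns of a 2-input unit and labels each one using the s-a-0, s-a-1, s-a-x definitions given above. The names `classify` and `failure_mode_counts` are illustrative assumptions.

```python
from itertools import product

# Input order used in the table: (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product((0, 1), repeat=2))
# Output column of a fault-free AND gate: (0, 0, 0, 1).
CORRECT = tuple(a & b for a, b in INPUTS)


def classify(outputs):
    """Label one 4-entry output column of a 2-input AND unit."""
    if outputs == CORRECT:
        return "correct"
    if all(v == 0 for v in outputs):
        return "s-a-0"  # output is 0 regardless of input
    if all(v == 1 for v in outputs):
        return "s-a-1"  # output is 1 regardless of input
    return "s-a-x"      # any other failed behaviour


def failure_mode_counts():
    """Count how many of the 16 possible columns fall into each class."""
    counts = {}
    for column in product((0, 1), repeat=4):
        label = classify(column)
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    print(failure_mode_counts())
```

Under these definitions one column is s-a-0, one is correct, one is s-a-1, and the remaining thirteen are s-a-x, which matches the grouping in the table.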


2. PROPOSED DYNAMIC DECISION TECHNIQUES AND DERIVATION OF ASSOCIATED RELIABILITY FORMULAS

Assumptions:

a) Failures of redundant copies are s-independent. The use of triplication and other static redundancy schemes is based on this assumption [1]. Therefore the models described in this paper can be used to design systems with IC chips or discrete components. However, their use is difficult to justify within IC packages in which a failure most likely affects the other components; i.e., failures might be s-dependent.

b) The Voter-Detector-Switch (V-D-S) trio is perfect. Because V-D-S is independent of nonredundant unit complexity, this assumption is justified for nonredundant units which are considerably more complex circuits than V-D-S. The evolution of component technology into LSI makes the triplication of large and complex circuits economically feasible.

The majority voting system is shown in Fig. 1. This system masks single failures but does not announce the failure. A detector can be added to the system to locate the failed unit. Then immediately after detection two actions can be taken, as shown in sections 2.1 and 2.2.

2.1 Action 1:

After detecting a failure, discard the failed unit and change the majority element into an appropriate decision element for the remaining 2-unit system. Among $(2^2)^2 = 16$ decision rules for 2-unit systems, the following three rules ($d_0$, $d_1$, $d_2$) dominate others or are equivalent to them in performance; i.e., they provide equal or better reliabilities. Details are given in the Supplement [7].

    X1  X2    d0  d1  d2
    0   0     0   0   0
    0   1     1   0   0
    1   0     1   0   1
    1   1     1   1   1

The columns $d_0$, $d_1$, $d_2$ specify the decision rules and $X_1$, $X_2$ are the outputs of duplicate units.

If $d_i$ ($i = 0,1$) is used, then the 2-unit system can tolerate a s-a-$i$ type failure on either unit.

The reliability for the original 3-unit system is the sum of the probabilities for three cases: no failure, single failure, and a single failure followed by a s-a-$i$ failure.

$$R_{d_i} = R_u^3 + 3R_u^2(1-R_u) + \int_{t_1=0}^{T} 3\lambda e^{-3\lambda t_1} \int_{t_2=t_1}^{T} 2\lambda a_i\, e^{-2\lambda(t_2-t_1)}\, e^{-\lambda(T-t_2)}\, dt_2\, dt_1$$

$$= R_u^3 + 3R_u^2(1-R_u) + 3R_u(1-R_u)^2 a_i, \quad i = 0,1. \tag{1}$$

Following the detection of the first failure in the 3-unit system, modification of the decision element to simulate $d_2$ for the remaining 2-unit system is equivalent to the TMR/Simplex technique, i.e., when the first failure is detected, a good unit together with the failed unit is discarded and the system is reduced to Simplex. The reliability is the sum of the probabilities for no failure, single failure, and a single failure followed by another failure on the discarded good unit. The integral equation for $R_{d_2}$ is the same as (1), except that $2a_i \to 1$.

$$R_{d_2} = R_u^3 + 3R_u^2(1-R_u) + \tfrac{3}{2} R_u(1-R_u)^2 = R_{\mathrm{TMR/Simplex}} \tag{2}$$

The best decision rule is

$$\begin{cases} d_0, & \text{if } a_0 > 1/2 > a_1 \\ d_1, & \text{if } a_1 > 1/2 > a_0 \\ d_2, & \text{if } a_0, a_1 < 1/2. \end{cases}$$

Furthermore the results can be summarized in one reliability expression:

$$R_{d,\mathrm{opt}} = R_u^3 + 3R_u^2(1-R_u) + 3\mu R_u(1-R_u)^2, \quad \text{where } \mu = \max\{a_0, a_1, 1/2\}. \tag{3}$$

Circuits of proposed decision elements are given in the Supplement [7]. They begin operation as a majority element. With the detection of the first failure they switch automatically to simulate $d_i$, $i = 0,1$. ROMs can also be used to implement decision elements. See [6] for the use of ROMs in the design of decision elements.

2.2 Action 2:

After detecting a failure leave the faulty unit in the system and use it as a reference to detect a second failure. Let unit-j be the first unit detected as failed in the 3-unit system. From this moment on, the output of unit-j sometimes agrees with the system output and sometimes not. (In general a unit is assumed failed if it gives incorrect outputs for a subset of input combinations.) But before the second failure occurs, any disagreeing unit will certainly be unit-j. Let unit-k be the second failed unit. After the second failure, unit-k gives incorrect output for some input and disagrees with the good unit. At this moment unit-j agrees with either unit-k or the good unit. In the first case, unit-j and unit-k determine the system output and the good unit disagrees with it. In the second case, unit-j and the good unit determine the system output and unit-k disagrees with it. After the detection of a disagreeing unit other than unit-j, two actions can be taken as described in sections 2.2.1 and 2.2.2.
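The Section 2.1 results, that is the closed forms (1) to (3) and the rule for choosing among $d_0$, $d_1$, $d_2$, can be sketched numerically. This is an illustrative sketch under stated assumptions, not the circuits of the Supplement: the function names are hypothetical, $d_0$, $d_1$, $d_2$ are read off the decision table as OR, AND, and "follow $X_1$", and ties at exactly 1/2 (which the paper's strict inequalities leave open) are resolved in favour of $d_2$.

```python
import math


def unit_reliability(lam, mission_time):
    """R_u = exp(-lambda * T) for a constant failure rate."""
    return math.exp(-lam * mission_time)


def r_d(r_u, coeff):
    """R_u^3 + 3 R_u^2 (1 - R_u) + 3 * coeff * R_u (1 - R_u)^2.

    coeff = a_i gives (1), coeff = 1/2 gives (2), coeff = mu gives (3).
    """
    return r_u**3 + 3 * r_u**2 * (1 - r_u) + 3 * coeff * r_u * (1 - r_u) ** 2


def best_two_unit_rule(a0, a1):
    """Pick d0, d1 or d2 and return (rule, mu) with mu = max{a0, a1, 1/2}."""
    mu = max(a0, a1, 0.5)
    if a0 > 0.5:
        return "d0", mu
    if a1 > 0.5:
        return "d1", mu
    return "d2", mu  # assumption: ties at 1/2 fall back to TMR/Simplex


# Decision rules as read from the 2-unit table (an interpretation).
DECISION_RULES = {
    "d0": lambda x1, x2: x1 | x2,  # OR
    "d1": lambda x1, x2: x1 & x2,  # AND
    "d2": lambda x1, x2: x1,       # follow unit 1 (simplex)
}


def tolerates_stuck_at(rule, stuck_value):
    """True if the rule still outputs the good value when either unit is stuck."""
    f = DECISION_RULES[rule]
    return all(
        f(good, stuck_value) == good and f(stuck_value, good) == good
        for good in (0, 1)
    )


if __name__ == "__main__":
    r_u = unit_reliability(lam=1e-4, mission_time=1000.0)
    rule, mu = best_two_unit_rule(a0=0.7, a1=0.2)
    print(rule, r_d(r_u, mu), r_d(r_u, 0.5))
```

Because (1), (2), and (3) differ only in the coefficient of $3R_u(1-R_u)^2$, comparing $a_0$, $a_1$, and 1/2 is enough to rank the three rules for any $0 < R_u < 1$.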


2.2.1 Action 2.1: Discard the second disagreeing unit together with unit-j, leaving a simplex system. Unit-j is considered to agree with the good unit and unit-k is in the minority to disagree with the system output.

This action is taken if the following inequality holds:

Pr{First failed unit agrees with good unit | second failed unit gives incorrect output first time} > Pr{First failed unit agrees with second failed unit | second failed unit gives incorrect output first time}.

The conditional system reliability is:

$$\Pr\{\text{System survives} \mid \text{Two failures have occurred}\} = \left[a_0 a_1 + a_1 a_0 + a_0 a_x (x_{11} + y_{01}) + a_1 a_x (x_{00} + y_{10}) + a_x^2 (x_{11} y_{10} + x_{00} y_{01})\right] \int_{t_1=0}^{T} \left[ 3\lambda e^{-3\lambda t_1} \int_{t_2=t_1}^{T} 2\lambda e^{-2\lambda(t_2-t_1)}\, e^{-\lambda(T-t_2)}\, dt_2 \right] dt_1 \tag{4}$$

where $y_{ij} = P_i x_{ij} / (P_0 x_{01} + P_1 x_{10})$, $i = 0,1$; $j = 0,1$.

Equation (4) is derived in the Supplement [7]. The system reliability is:

$$R_1 = R_u^3 + 3R_u^2(1-R_u) + 3R_u(1-R_u)^2 \left[2a_0 a_1 + a_0 a_x (x_{11} + y_{01}) + a_1 a_x (x_{00} + y_{10}) + a_x^2 (x_{11} y_{10} + x_{00} y_{01})\right]. \tag{5}$$

A special case of this procedure is now explained. Let no units be discarded and let the decision element preserve its status as a majority element despite the failures. If the first and second failures are s-a-0 and s-a-1 respectively or in reverse order, then their outputs compensate each other and the system output is the good unit output, i.e., the system survives. Two failed units compensating each other are equivalent to two failed units discarded leaving a simplex. This case is analyzed in [2,5] and the following reliability formula is obtained:

$$R_{\mathrm{TMR}} = R_u^3 + 3R_u^2(1-R_u) + 3R_u(1-R_u)^2 (2a_0 a_1). \tag{6}$$

Eq (6) is a special case of (5) and the corresponding model is known as "TMR with compensating failures considered."

2.2.2 Action 2.2: Keep the second disagreeing unit and discard the other two units. In doing so, the first failed unit (unit-j) is considered to agree with the second failed unit (unit-k) and the good unit is in the minority to disagree with the system output. This action is preferred over action 2.1 if the following inequality holds:

Pr{The first failed unit agrees with the second failed unit | The second failed unit gives incorrect output first time} > Pr{The first failed unit agrees with the good unit | The second failed unit gives incorrect output first time}.

The corresponding system reliability is obtained similarly to (5):

$$R_2 = R_u^3 + 3R_u^2(1-R_u) + 3R_u(1-R_u)^2 \left[a_0^2 + a_1^2 + a_0 a_x (x_{10} + y_{10}) + a_1 a_x (x_{01} + y_{01}) + a_x^2 (x_{10} y_{10} + x_{01} y_{01})\right] \tag{7}$$

3. EVALUATION OF RESULTS

Four new dynamic decision techniques are proposed besides TMR and TMR/Simplex. The first two are explained in section 2.1 and the corresponding reliability expressions ($R_{d_i}$, $i = 0,1$) are derived. The other two methods are explained in sections 2.2.1 and 2.2.2 and the reliability expressions $R_1$, $R_2$ are given.

In the following, these four techniques are compared with TMR and TMR/Simplex in performance.

TMR/Simplex is always better than TMR, as is well known. (Also shown in the Supplement [7].) Therefore, TMR is eliminated from discussion.

In section 2.1, the conditional failure probabilities $a_1$ and $a_0$ are used to choose the best decision technique from the two new proposed techniques and TMR/Simplex.

The proposed techniques are better than TMR/Simplex for those nonredundant units which have conditional failure probabilities $a_0 > 1/2$ or $a_1 > 1/2$ (i.e., $R_{d_0} > R_{\mathrm{TMR/Simplex}}$ or $R_{d_1} > R_{\mathrm{TMR/Simplex}}$ respectively).

Equations (5) and (7) are now compared with (3) to find the conditions under which the two techniques explained in sections 2.2.1 and 2.2.2 are better than the techniques represented by $R_{d_i}$, $i = 0,1,2$ ($R_{d_2} = R_{\mathrm{TMR/Simplex}}$).

$$R_1 > R_{d,\mathrm{opt}} \iff 2a_0 a_1 + a_0 a_x (x_{11} + y_{01}) + a_1 a_x (x_{00} + y_{10}) + a_x^2 (x_{11} y_{10} + x_{00} y_{01}) > \mu. \tag{8}$$

$$R_2 > R_{d,\mathrm{opt}} \iff a_0^2 + a_1^2 + a_0 a_x (x_{10} + y_{10}) + a_1 a_x (x_{01} + y_{01}) + a_x^2 (x_{10} y_{10} + x_{01} y_{01}) > \mu. \tag{9}$$

If the conditional probabilities of the nonredundant unit satisfy either (8) or (9), the corresponding proposed technique is preferred to TMR/Simplex and the other two techniques.

Conclusions: Given a nonredundant unit and the associated conditional probabilities $a_i$, $P_i$, $x_{ij}$ ($i, j \in (0,1)$), the best decision technique can be chosen among TMR/Simplex and the four others described in this paper. First, $a_0$, $a_1$, 1/2 are compared to choose among those represented by $R_{d_i}$. Then, (8) and (9) are used to compare the chosen one with those described in Sections 2.2.1 and 2.2.2.
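The selection procedure in the Conclusions can be sketched as a comparison of three coefficients of $3R_u(1-R_u)^2$: $\mu$ from (3), the bracket of (5), and the bracket of (7). The sketch below is an illustration under assumptions, not the author's implementation: function names are hypothetical, `p[i]` stands for $P_i$, `x[i][j]` stands for $x_{ij}$, and the s-a-x unit is assumed to give an incorrect output with nonzero probability so that $y_{ij}$ is defined.

```python
def y(i, j, p, x):
    """y_ij = P_i x_ij / (P_0 x_01 + P_1 x_10); denominator assumed nonzero."""
    return p[i] * x[i][j] / (p[0] * x[0][1] + p[1] * x[1][0])


def bracket_action_2_1(a0, a1, ax, p, x):
    """Bracketed coefficient of eq. (5), the left side of condition (8)."""
    y01, y10 = y(0, 1, p, x), y(1, 0, p, x)
    return (2 * a0 * a1
            + a0 * ax * (x[1][1] + y01)
            + a1 * ax * (x[0][0] + y10)
            + ax**2 * (x[1][1] * y10 + x[0][0] * y01))


def bracket_action_2_2(a0, a1, ax, p, x):
    """Bracketed coefficient of eq. (7), the left side of condition (9)."""
    y01, y10 = y(0, 1, p, x), y(1, 0, p, x)
    return (a0**2 + a1**2
            + a0 * ax * (x[1][0] + y10)
            + a1 * ax * (x[0][1] + y01)
            + ax**2 * (x[1][0] * y10 + x[0][1] * y01))


def choose_technique(a0, a1, ax, p, x):
    """Return the technique with the largest coefficient, plus all coefficients."""
    coefficients = {
        "Action 1 (d0/d1/d2)": max(a0, a1, 0.5),  # mu of eq. (3)
        "Action 2.1": bracket_action_2_1(a0, a1, ax, p, x),
        "Action 2.2": bracket_action_2_2(a0, a1, ax, p, x),
    }
    return max(coefficients, key=coefficients.get), coefficients


if __name__ == "__main__":
    # Illustrative (made-up) unit: s-a-x failures are frequent but mostly benign.
    p = (0.5, 0.5)
    x = ((0.9, 0.1), (0.1, 0.9))
    print(choose_technique(a0=0.2, a1=0.2, ax=0.6, p=p, x=x))
```

With $a_0 + a_1 + a_x = 1$ the two brackets sum to 1, so at most one of (8) and (9) can hold, since $\mu \ge 1/2$. The input values in the usage example are invented for illustration only.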


ACKNOWLEDGMENT

Helpful comments of the Editor are gratefully appreciated.

REFERENCES

[1] A. Avizienis, "Fault-tolerant systems", IEEE Trans. Computers, vol C-25, 1976 Dec, pp 1304-1312.

[2] F.P. Mathur, P.T. DeSousa, "Reliability models of NMR systems", IEEE Trans. Reliability, vol R-24, 1975 Jun, pp 108-113.

[3] R.E. Lyons, W. Vanderkulk, "The use of triple-modular redundancy to improve computer reliability", IBM Journal, 1962 Apr, pp 200-209.

[4] D.K. Rubin, "The approximate reliability of triply redundant majority voted systems", Proc. 1st Ann. IEEE Comp. Conf, 1967 Sep, pp 46-49. Available from IEEE.

[5] D.P. Siewiorek, "Reliability modelling of compensating module failures in majority voted redundancy", Proc. 4th Ann. Int. Symp. Fault-Tolerant Computing, 1974, pp 2.14-2.19. Available from IEEE.

[6] N.G. Dennis, "Ultra-reliable voter switches, with a bibliography of mechanization", Microelectronics and Reliability, vol 13, 1974, pp 299-308.

[7] Supplement: NAPS document No. 03087-C, 7 pages in this Supplement. For current ordering information, see inside rear cover of a current issue. Order NAPS document No. 03087, 28 pages. ASIS-NAPS; Microfiche Publications; POBox 3513, Grand Central Station, New York, NY 10017 USA.

Dr. Oguz Tosun; Computer Science Dept; State University of New York; College at Oswego; Oswego, NY 13126 USA.

Oguz Tosun (S'73, M'75) is Assistant Professor in the Department of Computer Science at State University of New York (College at Oswego). He holds an EE degree from Istanbul Technical University and a PhD degree in computer and information sciences from Syracuse University. His research interests include reliability modelling and analysis of computational systems, fault-tolerant computer design, artificial intelligence and information theory. He is a member of IEEE Technical Committee on Fault-Tolerant Computing.

Manuscript received 1976 Apr 28; revised 1977 Mar 23, 1977 May 20.
