dependable computing concepts and fault tolerance: terminologybai/ege534/fault_tolerant... ·...

DEPENDABLE COMPUTING

CONCEPTS AND

AND FAULT TOLERANCE :

TERMINOLOGY

Jean-ClaudeLaprie*

IAAwxRs

7, Avenue du Colonel Roehe, 31400 Toulouse, France

ABSTRACT

This paper provides a conceptual framework forexpressing the attributes of what constitutes dependableand reliable computing

- the impairments to dependability faults, errors,and failures,

- the means for dependability fault-avoidanec,fault-tolerance, error-removal, and error-

This paper isan editedversion of [Lap $4]. &ing backward in time, a milestone was the special meet-ing devoted to fundamental concepts held in conjunctionwith FIWS1l. A special session was organized atIWCS12, devoted to the presentation of viewpoints ela-borated in several places and institutions [And 82, Avi82, Kop 82, hp 82, Lee 82, Rob 82]. lb previousversions of this paper were then discussed during the1983 and 1984 Winter and Summer meetings of the IFIPWG 10.4.

forecasting,

- the measures of dependability reliability, availa- The paper proceeds by refinements: dependabili-

bility, maintainability, and safety. ty is &at introduced as a global concept. Fauk-

Emphasis is being put on the dependability impairments toleranee is then detailed.

and on fault-tolerance.The guidelines which have governed this presen-

FOREWORD tation can be summed up as follows:

-seareh fortheminimum munberof conceptsThis paper is aimed at giving informal but precise

definitions eharaetenzmenabling the dependability attributed to be ex-

“ - g the various attributes of com- pressed,ping systemsdependability. It is a contribution to thework undertaken within the “Reliable and Fault-

. use of terms which are identical to (whenever

Tolerant Computing” seientitic and teehnical community possible) or as close as possible to those general-

[Avi 78, Ran 78, Car 79, Lap 79, And 81, Se 82, Crilyused; asarule, atermwhich hasnotbeende-

84] in order to propose clear and widely acceptable de-fined shall retain its ordinary sense (as given by

finitions for some basic ooncepts.any dictionary),

- emphasis on integration (as opposed to speciali-The work presented here has been oondueted zillion) [Gel 82].

within the framework of the subcommittee “Fundamen-tal Concepts and Terminology”, which is common to the In each section, concise definitions are givenIFW WG 10.4 “Reliable Computing and Fault- first, then they are heavily eummented in order to (at-Tolerance” and to the IEEE Computer Society ‘fC tempt to) show the wide applicability of the adopted“Fauk-Tolerant Computing”. presentation.

Boldfaee characters areusedwhen aterm is&fined, italic characters being an invitation to focus the

l’l%awork prcaentedhercbasbeenpsrt.ly perfonnedwbiletha reader’s attention.

SutborSssocistionWithma UCLA(lmlplltcrS&Ice DqMmcnt,lm An@.?&,CA 9cUM.

0731 -3071/85/0000/0002$01 .00 @ 1985 IEEE2

TEE DEPENDABILITY CONCEPT

BASIC DEFINITIONS AND ASSOCIATED TERMINOLO-GY

Computer system dependability is the guality 4the &livered service s ch that reliance can @@ably be

Yplaced on this service .

The serviee delivered by a system is the system behavioras it is perceived by another special system(s) interactingwith the considered system its user(s).

A system failure occurs when the delivered ser-vice deviates from the specifkd seMce, where the ser-vice spedfication is an agreed description of the expect-ed service. The failure occurred bexause the system waserroneous: an error is that part of the system statewhich is liable to lead to failure, i.e. to the delivery of aservice not complying with the specified service. Thecause -- in its phenomenological sense -- of an error is afault.Upon occurrence, a fault creates a latent error, whichbecomes effective when it is activated; when the erroraffects the delivered seMce, a failure occurs. Stated inother terms, an error is the manifestation in the systemof a fault, and a failure is the manifestation on the ser-

vice of an error.

Achieving a dependable computing system callsfor the combined utilization of a set of methods whichcan be classed inta

- fault-avoidance: how to prevent, by construdon,fault occurrence,

- fault-tolerance: how to provide, by redundmcy,seMce complying with the specification in spiteof faults having occurred or ownrring,

error-removab how to minimize, by ver#ication,the presence of latent errors,

. error-forecasting how to estimate, by evahuz-tion, the presence, the creation and the conse-quences of errors.

Fault-avoidance and fault-tolerance may be seen as con-stituting dependability procurem enti how to providir thesystem with the ability to &liver the spec@ed service;error-removal and error-forecasting may be seen as con-stituting dependability validation: how to reach co@dence in the system ability to deliver the specified ser-vice.

The life of a system is perceived by its users asan alternation between two states of the delivered ser-vice with respect to the specified service:

. service aeeomplfshment, where the semice isdelivered as specified,

service interruption where the delivered serviceis different from the specified service.

The events which constitute the transitions betweenthese two states are the failure and the restoration.Quantifying this accomplishment-interruption altern-ationleads to the two main measures of dependability

. reliability a measure of the continuous serviceaccomplishment (or, equivalently, of the time tofailure) from a reference initial instant,

availability a measure of the service accom-plishment with respect to the alternation of wcomplishment and interruption.

COMMENTS

1. On the introduction of &pendability as a generic con-cept

Why should another term be added to an alreadylong list ? reliability, availabtity, safety, etc. The rea-sons are basically two-fold

. to remedy the existing confusion between relia-bility in its general meaning (reliable system)and reliability as a mathematical quality (systemreliability),

. to show that reliability, maintainability, availa-bility, safety, etc. are quantitative measurescorresponding to distinct perceptions of the sameattribute of a system: its dependability.

In regard to the term “dependability”, it isnoteworthy that from an etymological point of view, theterm”reliability” would be more appropriate ability torely upm. Although dependability is synonymous withreliability, it brings in the notion of dependence at asecond level. ‘l’’hismay be felt as a negative eonnotationat first sight, when compared to the positive notion oftrust as expressed by reliability, but it does highlight oursociety’s ever increasing dependence upon sophisticatedsystems in general and especially upon computing sys-tems. Moreover, “to rely” comes from the French “r&Lier”,itself from the Latin “religare”, to bind back: r-,back and ligare, to fasten, to tie. The necessary soli-darity for reaehing reliability is present! The Frenchword for reliability, Yiabilite” traces back to the 12thcentury, to the word ‘fiablete” whose meaning was

lmsdefillitionisadsptedfrom[cartlz],Whcse”trustwmlhcsand continuity” hss been replacedby“quality”.

“characterof being trustworthy”; the Latin origin isYidare”, a popular verb meaning “to trust”.

Finally, it is interesting that viewing dependabili-ty as a more general concept than reliability, availabili-ty, etc., and embodying the latter terms, has alreadybeen attempted in the past (see e.g. [Hos 60]), althoughwith less generality than here, since the goal was thento define a measure.

2. On the notion of service and its specification

From its very deftition, the service delivered bya system is clearly an abstraction of its behavwr. It isnoteworthy that this abstraction is highly dependent onthe application where the computer system is employed.An example of this dependence is the important roleplayed in the abstraction by time: the time granularitiesof the system and of its user(s) are generally different.

Concerning specifbtion, what is essential withinthe present context is that it is a description of the ex-pected service which is agreed upon by two persons orcorporate bodies: the system supplier (in a broad senseof the term: designer, builder, vendor, etc.) and its(human) user(s).Whatprecedesdoesnotmeanthataservice specifkation will not change omx established.This would be simple ignorance of the facts of life,which imply change. The changes may be motivated bymodifying the service expectation: modificaticm of fun~tionality, correction of some undesired features such asdeficiencies in the agreed specification. Once more,what is important is that the specification is (re-) agreedupon.It is noteworthy that such matters as performanCe, Obscrvability, readiness, etc. can be captured by an appropriately stated specifkation.

3. On the notions offd, error and fdlure

Fret, some illustrative examples:

- a programmer’s mistake is a fd: the conse-quence is a (Merit} error in the written software(erroneous instruction or piece of data); upon~“vation (activation of the module where theerror resides and an appropriate input patternactivating the erroneous instruction, instructionsequence or piece of data) the error bccumes #fective; when this effective error produces er-roneous data (in value or in the timing of theirdelivery) which affect the delivered service, afailure occurs,

. a short-circuit omurring in an integrated circuitis a fault; the consequence (connection stuck at aBoolean value, modifkation of the circuit funetion, etc.) is an error which will remain latent aslong as it is not activated, the continuation of

.

.

.

tie process being identical to the previous exam-ple,

an electromagnetic perturbationof suftkient en=ergy is a fink; when (for instance) acting on amemory’s inputs, it will create an error if activewhen the memory is in the write position; the er-ror will remain latent until the erroneousmemory location(s) is (are) read, etc.

an inappropriate man-machine interaction per-formed (iadvcrtcntly or deliberately) by anoperator during the operation of the system is afault; the resulting altered data is an error, etc.

a maintenance or ooerati.mzmanual Writs’s mis-take is a fault; the &nseq~ence is an error in thecorresponding manual (erroneous directives)which will remain latent as kmg as the directivesare not acted upon in order to face a given situa-tion, etc.

From the above examples, it is easily understood thatthe error latency duration may vary considerably,depending upon the fault, the considered system utiliza-tion, etc. These examples also explain why an errorwas defined as being liabk to lead to a failure. Wheth-er or not an error will e%xtively lead to a failuredepends on several factors:

- the activation conditions according to which a la-tent error will become effective,

- the system composition, and especially theamount of available redundancy

explicit redundancy (in order to ensurefault-tolerance) which is directly intendedto prevent an error fkom leading to afailure,

indicit redundance (it is in fact diffkultto” build a system” &thout any form ofredundancy) which may have the same(unexpected) result as explicit redundancy,

- the very deftition of a failure from the user’sviewpoint, e.g. in the notion of ‘“acceptableerrorrate” (implicitly: before considering that afailure has occurred) in data transmission.

These examples finally enable the introduction of thenotion of fault classes, which are classically [Avi 78]physical faults and human-made faults. Proposed deil.n-itions are as follows:

- physical fanltw adverse physical phenomena,either internal (physicxxhemical disorders:threshold changes, short-circuits, open-circuits...) or external (environmentrd perturba-tions: ektro-maguetic perturbations, tempera-ture, vibration, . ..).

- human-made faults: imperfections which maybe:

l desfgn fmdta, committed either a) duringthe system initial design (broadly speak-ing, from requirement specification to im-plementation) or during subsequent modif-ications, or b) during the establishment ofoperating or maintenance procedures,

l fnteraetfon faults: inadvertent or deli-berate violations of operating or mainte-nance procedures.

Faults, errors and failures are all unakired cir-cunwtances. At fit sight, the three notions are neces-sary: a) the omurenee of an undersired circumstanceaffwting the serviee -- failure -- is felt by the user(s)and assessed, b) an (internal) system undesired &eumstanee -- error -- is deteeted, c) the undesired eir-eumstanee able to give rise to a system error -- fault --and later on to a serviee failure is either avoided ortolerated. Assignment of the terms fault, error andfailure to the phenomenologieal cause, the syxtern andthe serviee undesired eireumstanees simply takea intoamount current usage: fault-avoidance or -toleranee, er-ror deteetion, failure rate.Moreover, it seems important to differentiate betweenthe cause with respeet to activation and propagation --error causing failure --, and the cause with respeet tothe (suspeeted) originating phenomenon(a) -- fault caus-ing error -- .

It could be argued that with such a reasoning(fault viewed as the phenomenologieal cause of error)one ean go “a long way W“. For instance, if we 100IKback at two of the previously given examples:

- why did the programmer make a mistake?

- why did the short oeeur in the integrated eireuit?In fact, recurswn stops at the cause which is intended tobe avoided or tolerated. f.f fault - avoialmce is meaning-ful within this eontext (when a cause is avoided, its ef-feets are of little interest), it may not be so for fault -tolerance; in fact, it is the cause which is tolerated,through processing its effects: a fauh is thus the ad-judged cause of an error [Mor 83].Furthermore, such a view is consistent with the dis~tion between human and physical faults in that a com-puting system is a human creation and as such any faultis ultimately human-made since it represents human ina-bility to master the complexity of the phenomena whichgovern the behavior of a system. In an absolute way,distinguishing between physical and human-made faults(especially design faults affeeting the system) may beconsidered as unnecessary; however it is of importancewhen considering (current) methods and techniques forprocuring and validating dependability. If the above-memioned recursion is not stopped, a fd isnothingelse than a failure of a system huving interacted or fn-tiracting with the consiakred sywm; emunpks follow:

. a design fault is identilable as a designer

failure,

an interaai physical fault h due to a latent error(the “physics refinability” community rarelyehameterizes failures as “sudden, nonprexfktableand irreversible”) originating from the hardwareproduction,

physical external faults and (human-made) in-teraction faults are identilable” as failures die toanother design fault: the inability to foresee allthe situations the system will be faced with dur-ing its operational life.

Up to now, a system has been considered as awhole, emphasizing its externally pereeived behavior; adefinition of a system complying with this “black box”view is: an entity having interacted, interacting, or li-able to interaet with other entities, thus other systems.The behavior is them simply what the system abes [Zie76]. What enables it to do what it does is the WI’Uetureofthe system or its organization. Adopting the spirit of[And 81], a system, from a structural (“white box”)viewpoint, is a set of eonpnents bound together in ord-er to interaet; a component is another system, ete. Therecursion stops when a system is considered as beiigatomic any further internal structure cannot be dis-eemed, or is not of interest and ean be ignored. Theterm “component” has to be understood in a broadsense: layers of a computing system as well as intra-

layer components; in addition, a component beiig itselfa system, it embodies the interrelation(s) of the com-ponents of which it is composed.From these deftitions, the discussion of whether“failure” appk9 to a system or to a component is simplyirrelevant, since a component is itself a system. Whenatomic systems are dealt with, the classical notion of“elementary” failure comes naturally. It is alsonoteworthy that from the preceding view of systemstructure, the notions of serviee and specification applyequally naturally to the components. This k especiallyinteresting in the design process, when using off-the-shelf components, either hardvwe of software [Her 84]:what is of actual interest is the semice they are able toprovide, not their detailed (internal) behavior.

This structured view enables fd pathofo~ to bemade more precise; the creation and action mechanismsof faults, errors and failures may be summarized as fol-lows:

1)

2)

5

A fdt creates one or several latent errors in thecomponent where it oeeurs; physied faults eandireetly affeet the physical layer componentsonly, whereas human-made faults may affeetany component.

The properties governing errors may be stated asfollows:

a) a latent error beeomes effeetive once it is aci

b)

c)

tivated,

an error may cycle between its latent and ef-fective states,

an effective error may, and in general does,prqwak S- one component to another;b pwwtmg, anerror createsother (new)errors.

From these properties it maybe dedueed that aneffective error within a component may originatefrom:

. activation of a latent error within the sameCQmponent,

. an effeetive error propagating within thesame component or from another com-ponent.

3) A component ftikre oeeurs Wht!Il an error af-

fects the service delivered (as a response torequest(s)) by the component.

4) These properties apply to any component of asystem.

In the preeeding, the intransitive form of “propagate”was intentionally used an error does not propagate it-self, it just propagates. Although “propagate” was r~tained due to its wide use, better words would probablybe “spread”, or “breed”.

Three final comments:

i) A given error in a given component may be subsequent to different faults. For instance an er-ror in a physical component (e.g. stuck atground voltage) may result from:

- a physical fault (e.g. threshold change)acting at the physical layer comprMmg thecomponent,

an informationerror (e.g. erroneous mi-croinstruction), caused by a design fault(e.g. programmer mistake), propagatingtop-down through the kyCTS and leadingto a short between two eireuit outputs fora duration long enough to provoke ashort-circuit having the same effect as thethreshold change.

ii) It is noteworthy that the notion of failure mnnotbe separated from time granularity an errorWhieb “p9sscS through” * interface between thesystem and its user(s) may or may not be viewedas a failure by the system user(s) depending onthe time granularity of the lattq this remark isof practical importance when considering thesolutions currently adopted for fault-tolerance asa function of the application.

iii) lhe adjeetive “deliberate” in the definition ofhurmm-made interaction faults is clearly intend-

ed to include “undesired accesses” in the senseof computer seeurity and privacy; however, theeormsponding methods and techniques will notbe addressed in the sequel.

4. On fdt-avoihwe and Werance, error-removal andforecasting

All the “how to’s” which appear in the basic d-finitions are in fact goals which cannot be fully reaehed,as all the corresponding activities are human activities,and thus imperfect. These imperfections bring independencies which explain why it is only the combinedutilization of the above methods (preferably at eaeh stepof the design and implementation proses) which eanlead to a dependable computin~ system. These depen-dencies can be sketched as follows: in spite of construc-tion rules (imperfeet in order to be workable), faults@cur; hence the need for error-removal; error-removal isitself imperfect, as are the off-the-shelf components ofthe system, henee the need for error-forecasting; our in-creasing dependence on computing systems brings infault-tolerance, which in turn necessitates constructionrules, and thus error-remmml, error-forecasting, ete. Ithas to be noted that the process is even more recursivethan it appears from the above current computer sys-tems are so complex that their design and implementa-tion need computerized tools in order to be cost-effective (i a broad sense, iuchding the capability of

succeding within acceptable delays). These tools havethemselves to be dependable, and so on.

The preceding reasonin g ~~ why k thegiven definitions error-removal and error-foreeastingare gathered into validation. Classically speaking (seee.g. [And 82, Avi 78, Lap 79]), fault-avoidance anderror-removal are ameqturdly gathered into fault-prevention, error-forecasting being left with no definite(eoneeptual) room. Validation is then limited to whathas been terlnd as W1’i&ltioll; in that case these tWOterms are often associated, e.g. “V and V“ ~ 79], thedistinction being related to the difference t.Mwecn“building the system right” (related to verification) and“building the right system” (related to validation).What is propased is simply an extension of this eoneept:the answer to the question “am I building the right sys-tem?” being complemented by “for how long will it beright?”. Besides highlighting the need for validating theprocedures and mechanisms of fault-toleranee, consider-ing error-removal and error-forecasting as two consti-tuents of the same activity -- validation -- is moreoverof great interest as it enables a better understanding ofthe notion of coverage, and thus of an important probkm introduced by the above recursion(s): the vufiuirtion~thevalialrtio n,orhow toreachconfidence in themethods and tools used in building eonfidtxm in thesystem. Coverage refers here to a measure of therepresentativity of the situations to which the system is

6

submitted during its vrdidation with respect to the actualsituations it will be confronted with during i@ operation-al life. Fdy, “validation” stems from “validity”,which encapsulates two notions:

. validity at a given moment, which relates toerror-removal,

validity for a given &uation, which relates toerror-forecasting.

5. On the dependirbility measures

The term “probability” has intentionally not beenemployed in the given definitions, so as to keep the dis-cussion informal, and to reinforce the physical signM-eance of the defined measures. However, as the con-sidered circumstances are non-det erministic, randomvariables are associated with them, and the measureswhich are dealt with are probabilities; this is strictlyspeaking Wrre& a probability can be definedmathematically as a measure.

Only two basic measures have been considered,reliability and availability, whereas a third one, maintai-nability is usually considered, which may be defied asa measure of the continuous service interruption, orequivalently, of the time to restoration. This measure isno less important than those previously defined; it wasnot introduced earlier because it may, at least eoneeptu-

ally, be deduced from the other two. E is noteworthythat availability embodies the failure frequency and tieaccomplishment time duration at each alternationaccomplishment-interruption.

A system may not, and generally does not, al-ways fail in the same way. This immediately brings inthe notion of the consequences of a failure upon the oth-er systems with which the considered system in interact-ing, i.e. its environment; several failure modes can gen-erally be distinguished, ordered according to the in-creasing severity of their amscquences. A special caseof great interest is that of systems which exhibit twofailure modes whose severities differ considerably

- benign failures, where the consequences are ofthe same order of magnitude (generally in termsof cost) as those of the service delivered in theabsence of failure,

. malign or castrophic failures, where the ~qucnecs are not commensurable with those efthe scMce delivered in the absence of failure.

Through grouping the states of service accomplishmentand seMce interruption subsequent to benign failuresinto a safe state (in the sense of beiig free from dam-age, not ffom danger), the generalization of reliabilityleads to an additional measure: a measure of continu-ous safeness, or equivalently, a measure of the time tocatastrophic failure, i.e. safety. It is worth noting that a

direct generalization of the availability, thus providiug ameasure of safeness with respeet to the alternation ofsafeness and interruption after catastrophic failure,would not provide a significant measure. When a catas-trophic failure has occurred, the consequences are gen-erally so important that system restoration is Em ofprime importance for at least the two following reasons:

. it comes second to repairing (in the broad senseof the term, including legal aspects) the conse-quences of the catastrophe,

- the lengthy period prior to being allowed tooperate the system again (investigation commis-sions) would lead to meaningless numericalvalues.

However, a “hybrid” reliability-availability measure canbe defined: a measure of the service accomplishmentwith respect to the alternation of accomplishment andinterruption after benign failure. This measure is of in-terest in that it provides a quantifkation of the systemavailability before occurrence of a catastrophic failure,and as such enables quantification of the wxalled“reliability- (or availability-) safety trade-off”.

SUMMARY

What has been presented actually constitutes anattempt to build a taxonomy of dependable computing.The concepts introduced may be gathered into threemain clase9 of attributes:

the dependability impairments, which are un-desired (not unexpected) circumstances causingor resulting from un-dependability, whose defin-ition is very simply derived from the definitionof dependability: reliance cannot, or will not, beany more justifiably placed on the service,

the dependability means, which are themethods, tools and solutions enabling a) to protide with the abtity to deliver a seMce onwhich reliance can be placed, and b) to reachconfidence in this ability,

the dependability measures, which enable theservice-quality resulting ffom the impairmentsand the means opposing them to be appraised.

This (beginning of a) taxonomy is representedbelow in the form of a tree.

DEPENDABILITY

r FAULTS

r IMPAIRMENTSP

RR(2RS

~FAILURESFAULT

{

AVOIDANCE

{

PROCUREMEIWFAULT

TOLERANCE

MEANSERROR

{

REMOVALVALIDATION

ERROR

‘_FORECASTINQ

L{ RELIABILITY

MEASURES

AVAILABILITY

FAULT-TOLERANCE

DEFINITIONS

Fault-tolerance is carried out by error pracess.ing, which may be automatic or operator-assisted; twoconstituent phases can be identified:

a)

b)

effective error processing, aimed at bringing theeffeetive error back to a latent state, if possiblebefore oeeurmnce of a failure,

latent error processing, aimed at ensuring thatthe latent error does not become effeetive again.

Effeetive error processing may take on twoform%

- error recovery, where an error-free state is sub-stituted for the. erroneous state; this substitutionmay in turn make on two forms [And 81]:

backward error reeovery, where the er-roneous state transformation eonaists ofbringing the system back to an already weupied state prior to the error becomingeffective,

forward error recovery, where the errone-ous state transformation consists of find-ing a new state (whieb never oeeurred be-fore, or which has not occurred since thesame erroneous state occurred),

. error eompenaation, where the erroneous statecontains enough redundancy to enable the

delivery of an error-free serviee from the er-roneous (iitemal) state.

When error reeovery is employed, the erroneous stateneeds to be (urgently) identifkd as being erroneuus pri-or to beiig transformed, which is the purpose of errordeteetion. On the other hand, eornpensation may beapplied systematically, even in the absence of effectiveerror(s), providing error masking (to the user(s)).

Latent error processing is carried out by makingthe error passive and recont@uring the system if it is nolonger capable of delivering the same seMce as before.

If it is estimated that effective error processingaxdd directly remove the error, or more generally thatits likelihod of reeurring is low enough, then latent er-ror processing is not undertaken. As long as latent er-ror processing is not undertaken, the error is regardedas being volatile; undertaking it implies that the error isconsidered as solid.

Error processing is generally completed bymaintenance (except of course for non-maintained sys-tems), aimed at removing latent errors. Maintenanceactions can be put into two ekmses:

- corrective maintenance, aimed at removingthose latent errors which have become effectiveand have been promssed,

- preventive maintenance, aimed at removing la-tent errors before they kcome effective in theconsidered system; these errors can result from

l physical faults having occnned since thelast preventive maintenance actions,

. design faults having led to effective errorsin other similar sysiexna.

COMMENTS

1. On tht tolerated-fault classes

‘l%epreceding detlnitions apply to physical faultsas well as to design faultw the class(es) of faults whichcan actually be tolerated depend(s) on the fault hy-pothesis which is being considered in the design process,and thus depend on the independency of rcxhmdancieawith respect to the process of fault creation and activa-tion.

It is noteworthy that fault-tolerance ia (also) a re-cursive concept; it is essential that the mechanismsaimed at implementing fault-tolerance be protectedagainst the faults which ean affect them voter replica-tion, self-checking checkers [Car 68], “stable” memoryfor recovery programs and data [Lam 81].

8

2. On error processing

The goal of error processing is the preservationof data integrity, which in turn requires the data am-tained in the components involved in this preservationto be consistent [Wen 78, Gra 78].

In effective error processing,

a) backward and forward error recovery are notexelusive: backward recovery may be attemptedfret; if the effective error persists, forwardrecovery may then be attempted,

b) in forward error reeovery, it is necessary to us-sess the damage caused by the deteeted error, orby errors propagated before deteetion; damageassessment ean (ii principle) be ignored in thecase of backward recovery, provided that themeclumisms enabling the transformation of theerroneous state into an error-free state have notbeen affeeted [And 81].

Some previous definitions of compensation consider itwithin the context of interaction faults in distributedsystems only (see e.g. [Ran 78]); the given defitionclearly embodies more classical situations such as error-eorreeting codes and majority voting. It is hoped thatthe essential idea has been captured, its applieabfity be-ing a matter of deftition of the system boundaries.

In error masking, the systematic appkmion of eompen-satkm ensures in itself that any effeetive error (providedof course it corresponds to the fault hypothesis of thedesign pmeess) has been brought back to a latent state.However, this ean at the same time correspond to aredundancy deerease which is not known. So, praetiealimplementations of masking generally involve emrdeteetion, which enables compensation to be appliedagain (in ease an effeetive error has been deteeted) inorder to cheek whether latent error processing has to beundertaken or not. It is noteworthy that due to themasking effeet, this seeond application of compensationean be performed within an acceptable delay, off-linewith respect to the current computation which was inprogress when the error became effective.

Of importance is the signaling of a componentfailure to its users. This is often aeeounted for withinthe framework of exceptins [And 81, Cri 82]. From astrict terminology viewpoint, this naming may be regret-ted in the sense that it may induce a contradiction withthe wish of seeing fault-toleranee as a natural featureof computing systems, and as such introduced from thevery beginning of the design process (not as an “exeeptional” attribute).

The definition given for the fact that an errormay be volatile or solid may lead to the feeling that thisis a subjective notion. Generally, latent Gmr process-ing does not immediately follow effective error proceM-ing, the delay resulting bakally from the fact that theestimation of whether an error is volatile or solid is acomplex task involving identification of the fault classwhich gave rise to the error. As different fault ckssesean lead to very similar errors, this identification gen-erally involves either waiting for a given lapse of timeor logging a given number of error oecurremms. ArI ad-ditional, correlated, reason is that latent error process-ing is in fact a voluntary system change, which, as such,will lead to lowering the system redundancy. Thus, theabove-mentioned subjectivity in fact relates to the difti-eulty in accurately identifying the class of fault havingcaused an error, as well as diagnosing its effects.Moreover, the given definition is consistent with

a)

b)

the difficulty in distinguishing the effects of atemporary (transient or intermittent) physieidfault from those of a design fault (see e.g. [Avi82]),

the fact that the techniques for tolerating designfaults [Elm 72, Ran 75, Che 78] maybe awn asattempts to make design-induced errors volatile.

The knowledge of some system properties maylimit the necessary amount of redundancy. Examples ofthese properties are regularities of structural nature: er-ror deteeting and correcting codes, robust data S-tures [Tay 80], mukiproeessors and loud area networks[Hay 76]. The faults which are tolerated are thendependent upon the properties which are aeeounted for,since they intervene directly in the fault hypotheses.

The operational time overhead neeessary for ef-feetive error processing is radically different aeeordingto the two distinguished hrrns:

in error deteetion and reeovery, the time over-head is longer when an error becomes effeetivethan when it was latent (it is then related to theproviskm of reeovery points, thus in fact topreparing for effective error processing),

. in error masking, the time overhead is alwaysthe same, and in practice the duration of errorcompensation is much shorter than errorrecovery, due to the larger amount of available(structural) redundancy.

This remark

a) is of high praetkxd importance in that it oftenconditions the ehoiee of the adopted fault-toleranee method with respect to the user timegranularity,

b) has introduced a relation between operationaltime overhead and structural redundancy; moregenerally, a redundant system always providesredundant behavior, incurring at least someoperational time overhead the time overheadmay be small enough not to be perceived by theuser, which means only that the service is notredundant; an extreme opposite form iS ‘timeredundancy” (redundant behavior obtained byrepetition) which needs to be at least initializedby a structural redundancy, limited but still ex-isting; roughly speaking, the more the structuralredundancy, the less the time overhead incurred.

3. On maintenance

The frontier between latent error processing andcorrective maintenance is relatively arbitrary; especially,corrective maintenance may be considered as an (ulti-mate) means of achieving fault-tolerance. However, thegiven definitions were adopted for the ability to embody(a) on-line or off-line maintainable fault-tolerant sys-tems, as well as non-fault-tolerant systems, and (b)preventive as well as corrective maintenance.

It is noteworthy within the present context thatthe current discussions about the irrelevance of the useof the term “maintenance” when applied to softwaresimply forget the etymology of the word: the asso&-

tion of maintenance with repairing hardware is actuallya (recent) deviation; associating “to maintain” with thenotion of service would enable this etymological mean-ing to be revived, while at the same time removing thevery sourm of diseuwdon.

CONCLUSION

The contents of this paper are devoid of any “Ta-blets of Stone” pretension the efforts undertaken areonly worthwhile in so far as they manage to embody aswide as possible a range of concepts and therefore thoseefforts have to keep abreast of technology. Naturally,the associated terminology effort is not an end in itse~words are only of interest in so far as they transmitideas, subject them to eritieism, and enable viewpointsto be shared.

‘I%eindependence of the basic definitions withrespeet to any fault class should facilitate the bringiqtogether of activities which are often considered asseparate, such as:

- VLS! testing and software testing,

- hardware reliability (with respeet to physicalfaults) and software reliability (with respect todesign faults), to say nothing of hardware relia-

bility with respect to design faults,

- tolerance to physical faults and tolerance todesign faults,

- conqmter system security, safety and reliability.

ACKNOWLEDGEMENTS

What has been presented would not exist withoutthe many discussions held with many colleagues. I wantto thank d of them, and c%-y:

- the members of the IFIP W.G. 10.4 “ReliableComputing and Fauh-Tolerance”; speeial thanksgo ta

l Tom Anderson, Al AviZienis, Bill Carter,and Alain Costes for their extremely ef-fective contributions,

l Flaviu Cristian, Al Hopkins, Dave Mor-gan, Brian Randell, and Art Robinson fortheir very constructive eritieism of previ-ous versions of this report,

- the members of the “Design and Validation ofDependable Computing Systems” research groupat LAA$ special thanks go to Jean Arlat, Chris-tian Beounes, Jean-Paul Blanquart, and DavidPowell for their extremely valuable suggestions.

REFERENCES

And S1 T. hdersoq P. A. Lee,Fo@ tolerance: @?@.&$ ad

-G @#* Qiffs, New Jersey RuItice Han,1981.

AndS2 T. ~ P. A. Lee,“Fault.tolcramtemdeology~

Avi 7S

Ad 82

-, p- lz~ Iw. $wP. on FaMofemu Computing,LosAn@s, he 1982,pp.29-33.

A Avizienis,“@k-tok!rance,tk NU’ViVd attdi of di-

gital systems”, Proceedings # the IEEE, Vol. 66, No. 10,Oct.1978, pp. 1109-1125.

A. Avizicnis,‘Thefmr-universe information systau modelfor the study of fault-tolerance”, pm. 12* it Sy?np. on

FouhTokmnt cOtl?plUiJZ& Los An@cs, Jum1982, pp. 6.13.

Boe 79 B. W. Boehm,“Guidelinesfor VCSifjI@ and d&~

~ ~ts ~ deaip WKifMbs”, PW.EURO IFIP 79, London, Sq. 1979, ~. 711.719.

~68W. C. Carter,P. RSCI&&, ~ ofdynamk@

car79

cartn

dwked exoputm”, pmt. IFZF’68 Cong., Amstcrdal&1968, pp. 87&883.

w. c. carte?,“FMdtdcteetionandrecoveryalgorithmsforfault-tolerant systems”, proc. EUROZPLP 7P, U Sq)t.

1979, pp. 72s-734.

W. C. Carter, “Atimefor reflection”, pmt. 12th Int. Symp.

10

On FUUkTOhSf fhIfXUhg, h AII@k5, hUE 1982,p.41.

(I3M78 L. C&m, A Avizkmis, “N-version progr- ssfatit-tolcmmce approach to rcliahility of softwxe operation”,proc. 8th ht. Symp. on FaukTo&rant Computing, Toulouse,France, June 1982, p, 3-9.

Cd S2 F. titian, ‘Rmeption handling and software fauft-tolcrance”, IEEE Trans. on Coqxtters, Vol. C31, No. 6,June 1982, pp. 531-540.

Crl 84 F. Csistian, ‘A rigorous appmacb to fault-tolerant pmw-, PWC. 4th Jerusalem Conference on InformationTechnology, Jerusalem, May 1984, pp. 192-201,

llhn 72 W. R Ebndorf, “Fauk-tokrant programming”, pmt. zthIn!. Sysnp. on Fauk-Toleram Contpsding, Ncwtoq Mas-

sdmsetts, June 1972, p. 79-83.

Gol S2 J, Goldberg, “A time for integration”, proc. 12fishr. Synsp.on Fauk-To&raru Computing, b An@@ JOIM19S2, pp.62.

Gra 78 J. N. Gray, “Notes on data base operating systems”, in

oper~ce SYSWW-C No- b -* *u= lQ%Bcrlisq Sprillger.verlag,1978, pp. 394-479.

Hay 76 J. P. Hayea, “A graph model for fault-tolerant computingsystems”, IEEE Trans. on Computers, voL G2.5, No. 9,Sept.1976, pp. 875-884.

Hor 84 E. Horowitz, J. B. Munson, “An tXfWISiVC vimv of reus-

able sofhwm”, IEEE Tram on &@vare Engineering, Vd

SE.-lo, No. 5, Sept. 19&, pp. 477-487.

Ha 60 J.E. Hosford, “Mcasuru of dqsdaMity”, f)pemiosssReseorch, Vol.8, No. 1,196, pp. 204-2LM.

Kq 82 H. Kopetz, “The failure-fault (IT) model”, pmx. 12th hr.

Synsp. on Fou.kTokrant Computing, Los An&b, June19s2, pp. 14-17.

Ian tll B, W. Lampson, “Atomic Transactions”, m Dkributd

Systems-Architecture and Inspknemadm, Ixcture NO&S ~

CQmputer science 10s, Berlim Springer-vdag,1981,chap. 11.

Lap 79 J. C. Laprie, A. titcs, R Troy, ~+mcnta and solutions”, proc. SEE Cong. on Ekfricd d

Ektronicol System Dependability, Toulouse, France, CM.1979; in Fred.

Lap 82 J. C. Laprie, A. Cct+tu, “DqmdMityauni@gcvm~ for reliable Computing”, proc. 12th Int. Symp. onFault-Tolerant Computing, Las Angeles, June1982, pp. 18-21.

LqM84 J. C. Lapric, “Dqmdahle Coopting and fault-tcd~

Le82

UJ=@ ~ -108Y”, m WG 104. Summa 1984*! ~, Fbrickx U/48 IUaswcb Report No.S4.035, June 1984.

P. A. be, D. E. Morgan, E&., %m&mmmI -offasdt-tolcmmt computing, progrcas rcpmt”, proc. 12th Ilk

Symp. on FaATolemnt Compufing, Los Arsgcks, June

1982, pp. 34-38.

Mor 83 D. Morgan, W. C. Carter, A. Hopkins, “Report to IFIPWG 10.4- Conce@s and terminology, Draft 1“, IFIP W(310.4 Summer 83 meeting, Como, Italy, Jum 1983,

Ran 75 B. Randell, “system structure for softwaro fauk.tolcrancs”,IEEE Trans. on Sqflware Engineering, Vol. S. El,No, 2,June 1975, pp. 220-232.

Ran 7S B. Randell, P. A. be, P. C. Trekavesl, ‘Reliability issues

in computing system deaigo”, Corqdng surveys, vol. 10,No. 2, June 1978, pp. 123-16S.

Rob 82 A. S. RObirlMML “A uses oriented oerspcctive of fault

Sk 82

. .tolerant systems models and terminologies”, proc. 12th hr.Symp. on Fault-Tolerant Computing, Los Angeles, June1982, pp. 22-28.

D. P, Siwiorek, R S. SwaIZ, The theoIYand practice of re-

liable system &sign, Digital Rew, 1982.-

Tssy SO D. J. Taylor, D. E. Morgau, J. P. Black, “Rdmdarq indata structures: @roving software fault-tokmmce”, rEEETrans. on Sq%vare Engineering, Vol. =6, No. 6, Nw.1980, pp. 383-594.

Wen 7$J. IW Wenslcy, L. Import, J. G&berg, M. W. GrcerLK. N. Levitt, P. M.. Melliar-smith, R E. Wostak, C.B.Weimtcck, “SET the design and analysis of a fauk-tolerantcomputerfor aircraftcontrol”,Proceedqp qf fhe

IEEE, vol. 66, No. 10, ti. 78, pp. 12S5-126S,

Z& 76 B.P. Zegler, Theory qf Modeling and Sisnsddon, NewYork John Wdey,1976.

II

dependable computing concepts and fault tolerance: terminologybai/ege534/fault_tolerant... ·...

Documents