
Page 1: From Experimental Assessment of Fault-Tolerant Systems to Dependability Benchmarking

Workshop on Fault-Tolerant Parallel and Distributed Systems — April 19, 2002
Int. Parallel & Distributed Processing Symp. (IPDPS-2002) — Fort Lauderdale, FL, USA

Jean Arlat

DBench - Dependability Benchmarking
Work partially supported by project IST 2000-25425

Page 2: Dependability Assessment

Objectives
- Evaluation of dependability measures (reliability, availability, etc.)
- Verification of properties: nominal service, service in the presence of faults
- Characterization of behavior in the presence of faults: failure modes, efficiency of fault tolerance

Methods and techniques
- Axiomatic: stochastic processes, model checking
- Empirical: field measurement, robustness testing, fault injection

Page 3: Fault Tolerance Validation

Validation of fault tolerance with respect to the specific inputs it is designed to deal with: the faults

Why experimental approaches are needed:
- FT mechanisms are human artefacts (not perfect)
- Threats are rare events
- Formal approaches have limits
- Models need calibration
- The efficiency of fault tolerance must be estimated, along with its impact on dependability measures

Controlled experiments linking fault tolerance (FT) and dependability: Fault Injection

Page 4: Fault Injection

Test and evaluation of fault-tolerant systems & FT mechanisms

Explicit characterization of faulty behaviors

[Diagram: the input domain of the target system combines its activity with injected faults (producing errors); its processing is observed through the output domain, where outputs are classified as valid or invalid.]
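To make this experiment structure concrete, here is a minimal sketch of a fault injection campaign loop in Python; the target, faultload, and oracle interfaces are hypothetical placeholders, not any particular tool's API.

```python
import random

def run_campaign(target, workload, faultload, oracle, n_experiments=1000):
    """Inject one fault per experiment, activate the system with the
    workload, and classify the observed output as valid or invalid."""
    outcomes = {"valid": 0, "invalid": 0}
    for _ in range(n_experiments):
        target.reset()                       # restore a clean state
        fault = random.choice(faultload)     # pick a fault to inject
        target.inject(fault)                 # corrupt state, code, or pins
        output = target.run(workload)        # activity = the other input
        outcomes["valid" if oracle(output) else "invalid"] += 1
    return outcomes
```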

Page 5: Analytical & Experimental Evaluations

[Diagram: closed loop between analytical and experimental evaluation:
1. Target system
2. Construction of axiomatic models
3. Processing by stochastic processes
4. Evaluation of dependability measures
5. Fault injection experiments
6. Construction of empirical & physical models
7. Estimation of FTM coverage parameters
8. Calibration & validation of the models against the experiments]
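The coverage parameters estimated in steps 5-7 are, in the simplest setting, binomial point estimates over the injection campaign; as a standard textbook illustration (the talk does not spell out the estimator), with n injected faults of which n_c are correctly handled:

```latex
\hat{C} = \frac{n_c}{n},
\qquad
\operatorname{Var}(\hat{C}) \approx \frac{\hat{C}\,(1-\hat{C})}{n}
```

Stratified variants weight the per-class estimates by the occurrence probability of each fault class when the fault space is partitioned.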

Page 6: Fault Injection as a Design Aid

[Markov model of the target architecture: states 1-4, with transition rates built from the node failure rate λ, the repair rate μ, and the coverage factors CT, CT,1, CT,2, CT,3 and their complements.]

[Plot: MTFF of the network vs. MTFF of a node, on log scales (10^+1 to 10^+3 vs. 10^-4 to 10^-2), for NAC Std - AMp V1, NAC Std - AMp V2, NAC Std - AMp V2.5, and NAC Duplex - AMp V2.5.]

[Target architecture: hosts attached to a token ring through NAC/AMp network attachment controllers, plus a spare node.]

Coverage factors:

Configuration           CT      CT,1    CT,2    CT,3
NAC Std - AMp V1        79.08%  2.32%   11.77%  6.83%
NAC Std - AMp V2        85.02%  8.73%   2.80%   3.45%
NAC Std - AMp V2.5      90.32%  7.79%   1.05%   0.84%
NAC Duplex - AMp V2.5   99.55%  0.32%   0.00%   0.12%

Fault injection on the target system [ESPRIT Project Delta-4]
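To see why MTFF is so sensitive to the coverage factors, a standard back-of-the-envelope bound (an illustration, not a formula from the talk): if a redundant configuration with per-node failure rate λ fails early only through non-covered faults, then with coverage C and two active nodes,

```latex
\mathrm{MTFF} \;\approx\; \frac{1}{2\lambda\,(1 - C)}
\qquad \text{(repairs and coincident failures neglected, } 1 - C \ll 1)
```

Under this rough model, raising CT from 79.08% (AMp V1) to 99.55% (Duplex V2.5) stretches the MTFF by a factor of about (1-0.7908)/(1-0.9955) ≈ 46, consistent with the orders-of-magnitude spread in the plot.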

Page 7: (Software) Component-based Development

Integration of previously developed components (commercial or not); OTS = either COTS or OSS

Main advantages:
- productivity and time to market
- incorporation of technology advances
- compatibility with industry standards
- quality (widely deployed components)

But for applications requiring a high level of dependability:
- lack of observability and controllability: what dependability can be claimed?
- what global cost (development, validation, usage, etc.)?

Page 8: Fault Injection-based Dependability Characterization of COTS SW

[Chart: normalized failure rates (%), on a 0-50% scale, for 15 "COTS" operating systems (AIX, FreeBSD, HP-UX B.9.05, HP-UX B.10.20, Irix 5.3, Irix 6.2, Linux, LynxOS, NetBSD, OSF-1 3.2, OSF-1 4.0, QNX 4.22, QNX 4.24, SunOS 4.13, SunOS 5.5), broken down into Abort, Silent, Restart, and Catastrophic outcomes.]

15 « COTS » OSs [Koopman & DeVale 99 (FTCS-29)]
Invalid parameters in system calls at the POSIX interface

[Chart: observation percentages (0-90%) for the Chorus microkernel under external faults vs. internal faults, across the outcomes Erroneous Result, Application Hang, System Hang, Kernel Debugger, Exception, Error Status, No Observation.]

Chorus microkernel [Arlat et al. 2002 (IEEE ToC, Feb.)]
Bit flips: on parameters of kernel calls (ext.), in kernel memory space (int.)

MAFALDA
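As a hedged sketch of the kind of corruption such tools apply (the function names and harness are hypothetical, not MAFALDA's or Ballista's actual interfaces): flip one bit of a system call parameter before issuing the call, then record how the kernel reacts.

```python
import random

def flip_bit(value: int, width: int = 32) -> int:
    """Flip one uniformly chosen bit of an integer parameter."""
    return value ^ (1 << random.randrange(width))

def corrupted_call(syscall, args, target_index=0):
    """Issue 'syscall' with one parameter bit-flipped; the caller then
    classifies the readout (error status, exception, hang, ...)."""
    args = list(args)
    args[target_index] = flip_bit(args[target_index])
    return syscall(*args)
```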

Page 9: Fault Injection Well Accepted by Industry as a Whole

Providers: IBM, Intel, Sun Microsystems, …

Integrators: Ansaldo Segnalamento Ferroviario, Astrium, DaimlerChrysler, Saab Ericsson Space, Siemens, Technicatome, THALES, Volvo, …

Stakeholders: Électricité de France, ESA, NASA (JPL), …

Consultants: Critical Software, Cigital, …

Page 10: Dependability Benchmarking

Dependability Assessment + Performance Benchmarking = Dependability Benchmarking

Naive view … :-)

Page 11: More Realistic View

Dependability Benchmarking sits at the intersection of Dependability Assessment and Performance Benchmarking, and must satisfy desired properties:
- agreement/acceptance
- usefulness
- fairness
- usability
- portability
- etc.

Page 12: Dependability Benchmarking Framework

[Framework diagram, two paths: (A) Experimentation: a workload and a faultload are applied to the target system, yielding experimental features & measures; (B) Analysis: modeling turns these into comprehensive measures.]

DBench - Dependability Benchmarking
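A small sketch of how path A can feed path B (all names and the downtime model are hypothetical, for illustration only): per-experiment outcomes become frequencies, which a simple model folds into one comprehensive figure.

```python
from collections import Counter

def experimental_measures(outcomes):
    """Path A: turn raw per-experiment outcomes into frequencies."""
    counts = Counter(outcomes)            # e.g. 'ok', 'abort', 'hang'
    n = len(outcomes)
    return {o: c / n for o, c in counts.items()}

def comprehensive_measure(freqs, restart_cost=30.0, hang_cost=300.0):
    """Path B: toy model computing expected downtime (s) per injected fault."""
    return freqs.get("abort", 0) * restart_cost + \
           freqs.get("hang", 0) * hang_cost

freqs = experimental_measures(["ok", "abort", "ok", "hang", "ok"])
print(freqs, comprehensive_measure(freqs))
```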

Page 13: Challenges

How can the many classes of real faults be mapped, through fault injection techniques, onto faultload(s) suitable for dependability benchmarking?

Properties:
• Agreement/acceptance
• Representativeness
• Usefulness
• Portability
• …

Dimensions:
• Faultload
• Workload
• Measurements
• Measures

Key intersection: representativeness of the faultload.

Page 14: The Fault Injection Techniques

Injection means target either a simulation model or a prototype/real system, and act at the logical & information level or at the physical level:

Simulation-based (on a model):
- System level: DEPEND, REACT, ...
- RT level: ASPHALT, ...
- Logical gate: Zycad, Technost, ...
- Switch: FOCUS, ...
- Wide range: MEFISTO, VERIFY, ...

Software-implemented (on a prototype/real system):
- Communication: FIMD
- Debugger: FIESTA
- Task: FIAT
- Executive: Ballista, (DE)FINE, MAFALDA
- Memory: DEF.I, SOFIT, ...
- Instruction set: FERRARI
- Processor: Xception, …
- Compile-time software mutation: SESAME, ...

Physical (on the target system):
- Heavy ions: FIST, …
- EM perturbations: TU Vienna
- Pin-level: MESSALINE, Scorpion, DEFOR, RIFLE, AFIT, ...
- Built-in test devices (SCIFI): FIMBUL

Page 15: Target System Levels & Fault Pathology

[Diagram: system levels (Operation, Application, Middleware, Kernel-device, Algorithmic, Logic RTL) annotated with fault locations and observation locations.]

X: Fault locations O: Observation locations

Page 16: Target System Levels & Fault Pathology (cont.)

[Diagram (continued): the same levels, with an additional fault location and the corresponding observation locations.]

X: Fault locations O: Observation locations

Page 17: Target System Levels & Fault Pathology (cont.)

[Diagram (continued): the same levels, with further fault and observation locations, illustrating how errors propagate across levels.]

X: Fault locations O: Observation locations

Page 18: Target System Levels — Ref. & Obs. Distances

[Diagram: the same system levels with fault (X) and observation (O) locations; dr1, dr2 denote distances to the fault reference, do1, do2 distances to the observations.]
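One plausible way to formalize these distances (an assumption; the slide only names them): index the levels ℓ = 1, …, L from Logic RTL up to Operation; for an injection at level ℓ_i, a reference fault location ℓ_r, and an observation at level ℓ_o,

```latex
d_r = |\ell_i - \ell_r|, \qquad d_o = |\ell_o - \ell_i|
```

Small d_r argues for the representativeness of the injected faults; larger d_o captures how far their effects propagate.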

Page 19: Mutation vs. Real Software Faults

Critical software from the civil nuclear field, containing 12 real programming faults.

Sets of errors provoked: 395 distinct errors overall.
- Mutation data base: 2272 records => 348 distinct errors (140 not found in the error set created by the real faults, 208 common).
- Real fault data base: 1450 records => 255 distinct errors (208 common, 47 specific).

[Pie charts: breakdown of the provoked errors into immediate, propagated, other, and common errors; in the impact of the mutation experiments wrt real faults, common errors dominate at 85%.]

[Daran & Thévenod-Fosse 1996 — ISSTA’96]
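For readers unfamiliar with mutation, a tiny illustration of what a mutation experiment injects (an invented example, not an operator from the study): a single small syntactic change, whose provoked errors can then be compared with those of real faults.

```python
import re

def mutate_source(source: str) -> str:
    """Illustrative mutation operator: turn the first '+' into '-'."""
    return re.sub(r"\+", "-", source, count=1)

original = "def checksum(xs):\n    return sum(xs) + len(xs)\n"
mutant = mutate_source(original)

env = {}
exec(original, env)
good = env["checksum"]([1, 2, 3])     # 9

env = {}
exec(mutant, env)
bad = env["checksum"]([1, 2, 3])      # 3: a provoked, observable error
```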

Page 20: SWIFI vs. Software Faults

SW fault classification (ODC): Assignment, Checking, Interface, Timing, Algorithm, Function

Can be (easily) emulated by SWIFI

-> Main open issues are related to fault-triggering conditions?

Page 21: SWIFI Bit-flips vs. SEUs

Computerized system (80C51 controller); activity: 6x6 matrix multiplication

Outcome            SWIFI bit-flips   SEUs (radiation)
Sequence loss      3%                1.5%
Erroneous result   46%               44%
Tolerated          51%               54.5%

12245 experiments

[Velazco et al. 2000 — IEEE ToNS Dec. 2000]

Page 22: Several FI Techniques

MARS fault-tolerant distributed system (prior version of the TTP architecture)

- Heavy-ion radiation
- Electromagnetic interference
- Pin-level (forcing)

[Karlsson et al. 1998 — DCCA-5]

Page 23: Scan Chain-Implemented Fault Injection vs. Simulation

32-bit processor (Saab Ericsson Space); activity: control program

Outcome        SCIFI (33405 exp.)   Simulation, VHDL (7990 exp.)
Exceptions     16%                  23%
Control flow   3%                   1%
Other          1%                   0%
Failure        4%                   6%
Overwritten    61%                  42%
Latent         15%                  28%

[Folkesson et al. 1998 — FTCS-28]

Page 24: Comprehensive & Coordinated Study

Physical faults (VHDL simulation and SWIFI)

Software faults:
- application (mutation, controlled SWIFI)
- OS (bit-flips on system call parameters, invalid system call parameters, bit-flips on internal function calls, real faults)

Operator & maintenance administrator faults (emulation scripts, real faults from field data and interviews)

Page 25: Software Faults in OSs

Target: Linux OS, scheduling component

Technique                                  Experiments
Invalid API parameter                      507
Bit-flipped API parameter                  1890
Bit-flipped internal function parameter    552

Page 26: More Information

DBench - Dependability Benchmarking [IST Project 2000-25425]

-> http://www.laas.fr/dbench/

IFIP WG 10.4 - SIGDeB: Special Interest Group on Dependability Benchmarking

-> http://www.dependability.org/wg10.4/SIGDeB