
Page 1: From Experimental Assessment of Fault-Tolerant Systems to Dependability Benchmarking

Workshop on Fault-Tolerant Parallel and Distributed Systems — April 19, 2002
Int. Parallel & Distributed Processing Symp. (IPDPS-2002) — Fort Lauderdale, FL, USA

Jean Arlat

DBench - Dependability Benchmarking
Work partially supported by project IST 2000-25425

Page 2: Dependability Assessment

Objectives
- Evaluation of dependability measures (reliability, availability, etc.)
- Verification of properties: nominal service, service in the presence of faults
- Characterization of behavior in the presence of faults: failure modes, efficiency of fault tolerance

Methods and techniques
- Axiomatic: stochastic processes, model checking
- Empirical: field measurement, robustness testing, fault injection

Page 3: Fault Tolerance Validation

Validation of fault tolerance with respect to the specific inputs it is designed to deal with: the faults

Why experimental approaches are needed:
- FT mechanisms are human artefacts (not perfect)
- Threats are rare events
- Formal approaches have limits
- Models need calibration
- The efficiency of fault tolerance must be estimated, along with its impact on dependability measures

Controlled experiments linking fault tolerance (FT) and dependability: Fault Injection

Page 4: Fault Injection

Test and evaluation of fault-tolerant systems & FT mechanisms

Explicit characterization of faulty behaviors

[Diagram: the input domain of the target system combines its activity with injected faults (producing errors); its processing is observed through the output domain, where outputs are classified as valid or invalid.]
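To make this experiment structure concrete, here is a minimal sketch of a fault injection campaign loop in Python; the target, faultload, and oracle interfaces are hypothetical placeholders, not any particular tool's API.

```python
import random

def run_campaign(target, workload, faultload, oracle, n_experiments=1000):
    """Inject one fault per experiment, activate the system with the
    workload, and classify the observed output as valid or invalid."""
    outcomes = {"valid": 0, "invalid": 0}
    for _ in range(n_experiments):
        target.reset()                       # restore a clean state
        fault = random.choice(faultload)     # pick a fault to inject
        target.inject(fault)                 # corrupt state, code, or pins
        output = target.run(workload)        # activity = the other input
        outcomes["valid" if oracle(output) else "invalid"] += 1
    return outcomes
```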

Page 5: Analytical & Experimental Evaluations

[Diagram: closed loop between analytical and experimental evaluation:
1. Target system
2. Construction of axiomatic models
3. Processing by stochastic processes
4. Evaluation of dependability measures
5. Fault injection experiments
6. Construction of empirical & physical models
7. Estimation of FTM coverage parameters
8. Calibration & validation of the models against the experiments]
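The coverage parameters estimated in steps 5-7 are, in the simplest setting, binomial point estimates over the injection campaign; as a standard textbook illustration (the talk does not spell out the estimator), with n injected faults of which n_c are correctly handled:

```latex
\hat{C} = \frac{n_c}{n},
\qquad
\operatorname{Var}(\hat{C}) \approx \frac{\hat{C}\,(1-\hat{C})}{n}
```

Stratified variants weight the per-class estimates by the occurrence probability of each fault class when the fault space is partitioned.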

Page 6: Fault Injection as a Design Aid

[Markov model of the target architecture: states 1-4, with transition rates built from the node failure rate λ, the repair rate μ, and the coverage factors CT, CT,1, CT,2, CT,3 and their complements.]

[Plot: MTFF of the network vs. MTFF of a node, on log scales (10^+1 to 10^+3 vs. 10^-4 to 10^-2), for NAC Std - AMp V1, NAC Std - AMp V2, NAC Std - AMp V2.5, and NAC Duplex - AMp V2.5.]

[Target architecture: hosts attached to a token ring through NAC/AMp network attachment controllers, plus a spare node.]

Coverage factors:

Configuration           CT      CT,1    CT,2    CT,3
NAC Std - AMp V1        79.08%  2.32%   11.77%  6.83%
NAC Std - AMp V2        85.02%  8.73%   2.80%   3.45%
NAC Std - AMp V2.5      90.32%  7.79%   1.05%   0.84%
NAC Duplex - AMp V2.5   99.55%  0.32%   0.00%   0.12%

Fault injection on the target system [ESPRIT Project Delta-4]
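To see why MTFF is so sensitive to the coverage factors, a standard back-of-the-envelope bound (an illustration, not a formula from the talk): if a redundant configuration with per-node failure rate λ fails early only through non-covered faults, then with coverage C and two active nodes,

```latex
\mathrm{MTFF} \;\approx\; \frac{1}{2\lambda\,(1 - C)}
\qquad \text{(repairs and coincident failures neglected, } 1 - C \ll 1)
```

Under this rough model, raising CT from 79.08% (AMp V1) to 99.55% (Duplex V2.5) stretches the MTFF by a factor of about (1-0.7908)/(1-0.9955) ≈ 46, consistent with the orders-of-magnitude spread in the plot.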

Page 7: (Software) Component-based Development

Integration of previously developed components (commercial or not); OTS = either COTS or OSS

Main advantages:
- productivity and time to market
- incorporation of technology advances
- compatibility with industry standards
- quality (widely deployed components)

But for applications requiring a high level of dependability:
- lack of observability and controllability: what dependability can be claimed?
- what global cost (development, validation, usage, etc.)?

Page 8: Fault Injection-based Dependability Characterization of COTS SW

[Chart: normalized failure rates (%), on a 0-50% scale, for 15 "COTS" operating systems (AIX, FreeBSD, HP-UX B.9.05, HP-UX B.10.20, Irix 5.3, Irix 6.2, Linux, LynxOS, NetBSD, OSF-1 3.2, OSF-1 4.0, QNX 4.22, QNX 4.24, SunOS 4.13, SunOS 5.5), broken down into Abort, Silent, Restart, and Catastrophic outcomes.]

15 « COTS » OSs [Koopman & DeVale 99 (FTCS-29)]
Invalid parameters in system calls at the POSIX interface

[Chart: observation percentages (0-90%) for the Chorus microkernel under external faults vs. internal faults, across the outcomes Erroneous Result, Application Hang, System Hang, Kernel Debugger, Exception, Error Status, No Observation.]

Chorus microkernel [Arlat et al. 2002 (IEEE ToC, Feb.)]
Bit flips: on parameters of kernel calls (ext.), in kernel memory space (int.)

MAFALDA
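As a hedged sketch of the kind of corruption such tools apply (the function names and harness are hypothetical, not MAFALDA's or Ballista's actual interfaces): flip one bit of a system call parameter before issuing the call, then record how the kernel reacts.

```python
import random

def flip_bit(value: int, width: int = 32) -> int:
    """Flip one uniformly chosen bit of an integer parameter."""
    return value ^ (1 << random.randrange(width))

def corrupted_call(syscall, args, target_index=0):
    """Issue 'syscall' with one parameter bit-flipped; the caller then
    classifies the readout (error status, exception, hang, ...)."""
    args = list(args)
    args[target_index] = flip_bit(args[target_index])
    return syscall(*args)
```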

Page 9: Fault Injection Well Accepted by Industry as a Whole

Providers: IBM, Intel, Sun Microsystems, …

Integrators: Ansaldo Segnalamento Ferroviario, Astrium, DaimlerChrysler, Saab Ericsson Space, Siemens, Technicatome, THALES, Volvo, …

Stakeholders: Électricité de France, ESA, NASA (JPL), …

Consultants: Critical Software, Cigital, …

Page 10: Dependability Benchmarking

Dependability Assessment + Performance Benchmarking = Dependability Benchmarking

Naive view … :-)

Page 11: More Realistic View

Dependability Benchmarking sits at the intersection of Dependability Assessment and Performance Benchmarking, and must satisfy desired properties:
- agreement/acceptance
- usefulness
- fairness
- usability
- portability
- etc.

Page 12: Dependability Benchmarking Framework

[Framework diagram, two paths: (A) Experimentation: a workload and a faultload are applied to the target system, yielding experimental features & measures; (B) Analysis: modeling turns these into comprehensive measures.]

DBench - Dependability Benchmarking
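A small sketch of how path A can feed path B (all names and the downtime model are hypothetical, for illustration only): per-experiment outcomes become frequencies, which a simple model folds into one comprehensive figure.

```python
from collections import Counter

def experimental_measures(outcomes):
    """Path A: turn raw per-experiment outcomes into frequencies."""
    counts = Counter(outcomes)            # e.g. 'ok', 'abort', 'hang'
    n = len(outcomes)
    return {o: c / n for o, c in counts.items()}

def comprehensive_measure(freqs, restart_cost=30.0, hang_cost=300.0):
    """Path B: toy model computing expected downtime (s) per injected fault."""
    return freqs.get("abort", 0) * restart_cost + \
           freqs.get("hang", 0) * hang_cost

freqs = experimental_measures(["ok", "abort", "ok", "hang", "ok"])
print(freqs, comprehensive_measure(freqs))
```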

Page 13: Challenges

How can the many classes of real faults be mapped, through fault injection techniques, onto faultload(s) suitable for dependability benchmarking?

Properties:
• Agreement/acceptance
• Representativeness
• Usefulness
• Portability
• …

Dimensions:
• Faultload
• Workload
• Measurements
• Measures

Key intersection: representativeness of the faultload.

Page 14: The Fault Injection Techniques

Injection means target either a simulation model or a prototype/real system, and act at the logical & information level or at the physical level:

Simulation-based (on a model):
- System level: DEPEND, REACT, ...
- RT level: ASPHALT, ...
- Logical gate: Zycad, Technost, ...
- Switch: FOCUS, ...
- Wide range: MEFISTO, VERIFY, ...

Software-implemented (on a prototype/real system):
- Communication: FIMD
- Debugger: FIESTA
- Task: FIAT
- Executive: Ballista, (DE)FINE, MAFALDA
- Memory: DEF.I, SOFIT, ...
- Instruction set: FERRARI
- Processor: Xception, …
- Compile-time software mutation: SESAME, ...

Physical (on the target system):
- Heavy ions: FIST, …
- EM perturbations: TU Vienna
- Pin-level: MESSALINE, Scorpion, DEFOR, RIFLE, AFIT, ...
- Built-in test devices (SCIFI): FIMBUL

Page 15: Target System Levels & Fault Pathology

[Diagram: system levels (Operation, Application, Middleware, Kernel-device, Algorithmic, Logic RTL) annotated with fault locations and observation locations.]

X: Fault locations O: Observation locations

Page 16: Target System Levels & Fault Pathology (cont.)

[Diagram (continued): the same levels, with an additional fault location and the corresponding observation locations.]

X: Fault locations O: Observation locations

Page 17: Target System Levels & Fault Pathology (cont.)

[Diagram (continued): the same levels, with further fault and observation locations, illustrating how errors propagate across levels.]

X: Fault locations O: Observation locations

Page 18: Target System Levels — Ref. & Obs. Distances

[Diagram: the same system levels with fault (X) and observation (O) locations; dr1, dr2 denote distances to the fault reference, do1, do2 distances to the observations.]
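One plausible way to formalize these distances (an assumption; the slide only names them): index the levels ℓ = 1, …, L from Logic RTL up to Operation; for an injection at level ℓ_i, a reference fault location ℓ_r, and an observation at level ℓ_o,

```latex
d_r = |\ell_i - \ell_r|, \qquad d_o = |\ell_o - \ell_i|
```

Small d_r argues for the representativeness of the injected faults; larger d_o captures how far their effects propagate.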

Page 19: Mutation vs. Real Software Faults

Critical software from the civil nuclear field, containing 12 real programming faults.

Sets of errors provoked: 395 distinct errors overall.
- Mutation data base: 2272 records => 348 distinct errors (140 not found in the error set created by the real faults, 208 common).
- Real fault data base: 1450 records => 255 distinct errors (208 common, 47 specific).

[Pie charts: breakdown of the provoked errors into immediate, propagated, other, and common errors; in the impact of the mutation experiments wrt real faults, common errors dominate at 85%.]

[Daran & Thévenod-Fosse 1996 — ISSTA’96]
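For readers unfamiliar with mutation, a tiny illustration of what a mutation experiment injects (an invented example, not an operator from the study): a single small syntactic change, whose provoked errors can then be compared with those of real faults.

```python
import re

def mutate_source(source: str) -> str:
    """Illustrative mutation operator: turn the first '+' into '-'."""
    return re.sub(r"\+", "-", source, count=1)

original = "def checksum(xs):\n    return sum(xs) + len(xs)\n"
mutant = mutate_source(original)

env = {}
exec(original, env)
good = env["checksum"]([1, 2, 3])     # 9

env = {}
exec(mutant, env)
bad = env["checksum"]([1, 2, 3])      # 3: a provoked, observable error
```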

Page 20: SWIFI vs. Software Faults

SW fault classification (ODC): Assignment, Checking, Interface, Timing, Algorithm, Function

Can be (easily) emulated by SWIFI

-> Main open issues are related to fault-triggering conditions?

Page 21: SWIFI Bit-flips vs. SEUs

Computerized system (80C51 controller); activity: 6x6 matrix multiplication

Outcome            SWIFI bit-flips   SEUs (radiation)
Sequence loss      3%                1.5%
Erroneous result   46%               44%
Tolerated          51%               54.5%

12245 experiments

[Velazco et al. 2000 — IEEE ToNS Dec. 2000]

Page 22: Several FI Techniques

MARS fault-tolerant distributed system (prior version of the TTP architecture)

- Heavy-ion radiation
- Electromagnetic interference
- Pin-level (forcing)

[Karlsson et al. 1998 — DCCA-5]

Page 23: Scan Chain-Implemented Fault Injection vs. Simulation

32-bit processor (Saab Ericsson Space); activity: control program

Outcome        SCIFI (33405 exp.)   Simulation, VHDL (7990 exp.)
Exceptions     16%                  23%
Control flow   3%                   1%
Other          1%                   0%
Failure        4%                   6%
Overwritten    61%                  42%
Latent         15%                  28%

[Folkesson et al. 1998 — FTCS-28]

Page 24: Comprehensive & Coordinated Study

Physical faults (VHDL simulation and SWIFI)

Software faults:
- application (mutation, controlled SWIFI)
- OS (bit-flips on system call parameters, invalid system call parameters, bit-flips on internal function calls, real faults)

Operator & maintenance administrator faults (emulation scripts, real faults from field data and interviews)

Page 25: Software Faults in OSs

Target: Linux OS, scheduling component

Technique                                  Experiments
Invalid API parameter                      507
Bit-flipped API parameter                  1890
Bit-flipped internal function parameter    552

Page 26: More Information

DBench - Dependability Benchmarking [IST Project 2000-25425]

-> http://www.laas.fr/dbench/

IFIP WG 10.4 - SIGDeB: Special Interest Group on Dependability Benchmarking

-> http://www.dependability.org/wg10.4/SIGDeB