
Fundamentals Of Dependable Computing

ICSE 2014

John C. Knight
Department of Computer Science
University of Virginia

Tutorial Goals

• Answer the following questions:
  - What is dependability and what affects it?
  - How do we achieve dependability?
  - How do we know we have achieved it?
• Learn a few things, have some fun, etc.
• Motivation level: cannot teach several semesters' worth of material in a day
• Assume that fundamental material is unfamiliar
• Emphasis on software

Dependability: A Rigorous, Principled, Engineering Subject

© John C. Knight 2014, All Rights Reserved 2


Tutorial Roadmap

• Basic Concepts (you are here)
• Terminology Of Dependability
• Dependability Requirements
• Faults And Fault Treatment
• Software Technology Review
• Software Fault Avoidance
• Software Fault Elimination
• Fault Tolerance
• Special Topics

Tutorial Schedule

• Session 1:
  - Basic concepts
  - Terminology
  - Dependability requirements
• Session 2:
  - Faults
  - Fault treatment
  - Software technology review
• Session 3:
  - Formal specification
  - Correctness by construction with SPARK Ada
• Session 4:
  - Hardware and software fault tolerance
  - Byzantine agreement
  - Fail-stop machines

Tutorial Protocol

• Please call me John
• Please ask questions anytime
• Please relate your experience
• Please ask me to slow down if necessary
• Please indicate material that is:
  - Too high or too low level
  - In need of clarification
  - Wrong
• Make sure that this is fun and useful to you
• Copies omit some slides (copyright)

Tutorial Assumed Background

• High-level language programming
• Basic computer organization
• Elementary probability theory
• Predicate calculus
• Set theory
• Basic principles of operating systems

Session 1

Terminology And Requirements


Starting Point

Entrance to the Meissen pottery factory near Dresden in Germany

What Is Our Goal?

A chain is only as strong as its weakest link...

Many things could go wrong:
• The attachment could break
• The box could break
• The chain could break
• Etc.

Dependability Outline

1. Determine what happens if the chain/box breaks
   - Determine the consequences of failure
2. Decide how strong the chain/box has to be
   - Define dependability requirements
3. Determine how the chain/box might fail
   - Determine all possible faults that might occur
4. Buy a high-quality chain/box so as to avoid defects
   - Practice fault avoidance during construction
5. Fix any broken links that remain
   - Practice fault elimination during construction
6. Cope with broken links you didn't find or that break
   - Practice fault tolerance during operation

Make absolutely sure you find all the weak links.

Dependability Outline

1.  Determine the consequences of failure

2.  Define dependability requirements

3.  Determine all possible faults that might occur

4.  Practice fault avoidance during construction

5.  Practice fault elimination during construction

6.  Practice fault tolerance during operation


Dependable Computing Systems


What Happened?

• Titan IV:
  - Power bus short circuit, guidance reboot
• Mars Climate Orbiter:
  - Unit systems confusion
• Mars Polar Lander:
  - Touchdown sensor triggered by landing leg deployment

What Happened?

• Mars rovers:
  - Pathfinder: priority inversion
  - Spirit and Opportunity: file system filled up
• Korean Air Flight 801:
  - Safety case violated
• Ariane 5:
  - Unhandled exception

Software And Its Role

[Diagram: the software sits inside the system, which operates within its environment and interacts with users and operators]

Software caused all of these consequences.

How To Think About Failures

• Accidents are very complex:
  - Do not jump to conclusions
  - There is never a "root" or single cause
  - There are usually many causal factors
• Lessons learned must be comprehensive:
  - Prevent future occurrences of all causal factors
• Always investigate failures

Role of the Software Engineer

• Why should a software engineer be concerned about dependability?
• Four reasons:
  - Software has to perform the intended function
  - Software has to perform the intended function correctly
  - Software has to operate on a target system whose design might have been influenced by the system's dependability goals
  - Software often needs to take action when a hardware component fails

So?

• Acquire sufficient information about the systems side of dependability so that the software engineer can:
  - Understand why software is being asked to do what it is being asked to do
  - Understand why software is being made to operate on the particular platform specified by the system designers
• Acquire a conceptual framework within which to think about software and its dependability

Current Software Practice

• Unstructured, natural-language documents
• Some use of visual specifications and formalisms from other disciplines (PDEs)
• UML, C, C++, Java
• Tools for test setup and management
• Combative adoption of standards

GREAT CARE by many GREAT PEOPLE

Dependable Systems

• Software is the least well understood technology involved
• Software volume is increasing at an exponential rate
• Software is quickly becoming the dominant cause of critical system failures
• Many aspects of hardware dependability are essentially solved problems

Step 1: Determine Consequences of Failure

What happens if the chain breaks?

Consequences of Failure

• Injury or loss of life
• Environmental damage
• Damage to or loss of equipment
• Financial loss:
  - Theft
  - Useless or defective mass-produced equipment
  - Loss of production capacity or service
  - Loss of business reputation, customer base

Consequences of Failure

• Direct: the device fails and causes injury
• Indirect: a defective support tool leads to a device that fails
• Combinations of losses:
  - Immediate damage to equipment
  - Subsequent loss of service
  - Subsequent loss of business reputation
  - Subsequent lawsuits

Hidden Costs of Failure

• Some costs of failure are obvious, others are not
• Ariane 5 example:
  - Obvious:
    - Loss of the launch vehicle
    - Loss of the payload
  - Not so obvious:
    - Environmental cleanup
    - Loss of service from the payload
    - Vehicle redesign
    - Increased insurance rates for launches

Consequences of Failure

• Loss of revenue:
  - Software defects: $200 billion/year (SCC)
    (Note: recent worms and viruses)
  - One hour of downtime costs (InternetWeek, 2000):

      Brokerage operations        $6,450,000/hr
      Credit card authorization   $2,600,000/hr
      eBay (one 22-hour outage)   $225,000/hr
      Home Shopping Channel       $113,000/hr
      Airline reservation center  $89,000/hr
      ATM service fees            $14,000/hr

  - Note: the eBay outage (22 hours, 1999) also caused a $4 billion market-cap loss

Step 2: Define Dependability Requirements

How strong does the chain have to be?

Why Dependability Requirements?

• Need an engineering target:
  - What technology do we have to employ?
• Informal statements are inadequate:
  - "System should not fail."
  - "System must be very reliable."
• Need a target so that, if we meet it, the system will perform "acceptably"
• Need terminology to define requirements

Terminology of Dependability


Why Study Terminology?

• We need to be able to communicate in a precise manner:
  - Researchers
  - Developers
  - Customers
• There are everyday notions of these terms
• The public has an interest
• But public terminology is imprecise

Webster

• Dependable: capable of being depended on: RELIABLE
• Reliable: suitable or fit to be relied on: DEPENDABLE
• Rely:
  1) to be dependent <the system on which we depend for water>
  2) to have confidence based on experience <someone you can rely on>

Historical Evolution of Concerns

• 1940s: ENIAC
  - 18K vacuum tubes → failed roughly every 7 minutes
  - 18K multiplies/minute → 7 minutes ≈ one program execution
  - Need RELIABILITY
• 1960s: interactive systems
  - + AVAILABILITY
• 1970s: F-8 Crusader, MD-11, Patriot missile defense
  - + SAFETY
• 1990s-today: Internet, e-commerce, Grid/Web services
  - + SECURITY

The Definitive Reference

"Basic Concepts and Taxonomy of Dependable and Secure Computing"
Avizienis, Laprie, Randell, and Landwehr
IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1

• We will use this material frequently
• For now, we will look at dependability and failure

Dependability

• Old: dependability is the ability to deliver service that can justifiably be trusted.
• New: the dependability of a system is the ability to avoid service failures that are more frequent and more severe than is acceptable.
• Why the change? To allow for the possibility of failure while the system remains acceptable.

Note the subjectivity.

Social Aspects

• Wholesale death:
  - Aircraft and railway accidents
  - Improper detonation of weapons
• Retail death:
  - Car accidents, medical devices
  - Operational procedures
• Societal assessment determines acceptable levels of loss

Resolution of the subjectivity.

Failure

From the taxonomy:

"Correct service is delivered when the service implements the system function. A service failure, often abbreviated here to failure, is an event that occurs when the delivered service deviates from correct service. A service fails either because it does not comply with the functional specification, or because this specification did not adequately describe the system function. A service failure is a transition from correct service to incorrect service, i.e., to not implementing the system function. The period of delivery of incorrect service is a service outage. The transition from incorrect service to correct service is a service restoration. The deviation from correct service may assume different forms that are called service failure modes and are ranked according to failure severities."

Verification and Validation

[Diagram: the customer's needs feed requirements analysis and elicitation, producing requirements; system definition produces a specification; system development produces an implementation. Verification checks the implementation against the specification; validation checks back against the requirements and the customer.]

Failure can arise from anywhere. The customer does not care.

Failure Viewpoints

• Failure domain
• Detectability of failures
• Consistency of failures
• Consequences of failure

Failure Domain

• Content failure:
  - The supplied service has incorrect information
• Timing failure:
  - The timing or duration of the information is incorrect
• Halt failure:
  - The system halts and remains in a fixed state
• Erratic failure:
  - Service delivery is erratic

Detectability and Consistency

• Detectability:
  - Signaled, unsignaled, false alarm
  - False positives, false negatives
• Consistency:
  - Consistent: incorrect service is seen identically by all users
  - Inconsistent: incorrect service is seen differently by different users (Byzantine failures)

Consequences of Failure

• Different failures, different consequences
• Examples (go with the intuitive notions for now):
  - Availability: long vs. short outage
  - Safety: human life endangered vs. equipment
  - Confidentiality: type of information disclosed
• Range of severity of failure:
  - Minor failure
  - Catastrophic failure

Attributes of Dependability

• Reliability
• Availability
• Safety
• Confidentiality
• Integrity
• Maintainability

Requirements: these are properties that might be required of a given system.

Customers must identify the dependability requirements of their system, and developers must design so as to achieve them.

Attributes of Dependability

• Note this very carefully...
• Fault tolerance is not a system requirement or a system property
• Fault tolerance is one of the mechanisms that can be used to provide dependability
• This is important

Fault tolerance is not a system property.

Reliability

R(t) = the probability that the system will operate correctly in a specified operating environment up until time t

• The system's ability to provide continuous service
• Note that t is important
• If a system only needs to operate for ten hours at a time, then that is the reliability target
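As an illustration (not from the slides), if failures are assumed to arrive at a constant rate λ, R(t) takes the familiar exponential form R(t) = e^(-λt); the failure rate and mission time below are invented example values.

```python
import math

def reliability(failure_rate_per_hour: float, t_hours: float) -> float:
    """R(t) under the common constant-failure-rate assumption: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate_per_hour * t_hours)

# Hypothetical system: one expected failure per 1,000 hours, ten-hour operating period
r = reliability(1e-3, 10.0)
print(f"R(10h) = {r:.4f}")  # ~0.9900: about a 99% chance of completing the ten hours
```

Note how the ten-hour operating period, not some open-ended lifetime, is the t that matters.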


Failure Per Demand

• Sometimes the notion of time, t, is not what we need
• Some systems are only called upon to act on demand:
  - E.g., protection systems
• For such systems, what we need is the probability of failure per demand

Availability

A(t) = the probability that the system will be operational at time t

• Literally, readiness for service
• Admits the possibility of brief outages
• A fundamentally different concept

Reliability vs. Availability

• They are not the same...
• Example: a system that fails, on average, once per hour but restarts automatically in ten milliseconds would not be considered very reliable, but it is highly available

Availability = 0.9999972


Nines of Availability

  # 9's   %          Downtime/year    Systems
  2       99%        ~5000 minutes    General web site
  3       99.9%      ~500 minutes     Amazon.com
  4       99.99%     ~50 minutes      Enterprise server
  5       99.999%    ~5 minutes       Telephone system
  6       99.9999%   ~30 seconds      Phone switches

Caveats: How is availability measured? What does it mean to be operational?
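The downtime column follows directly from the definition; a quick sanity check (example code, not part of the tutorial):

```python
def downtime_per_year_minutes(nines: int) -> float:
    """Yearly downtime implied by an availability of n nines (e.g., 3 -> 99.9%)."""
    unavailability = 10.0 ** (-nines)
    minutes_per_year = 365 * 24 * 60  # 525,600
    return unavailability * minutes_per_year

for n in range(2, 7):
    print(f"{n} nines: {downtime_per_year_minutes(n):8.2f} minutes/year")
```

Five nines works out to about 5.26 minutes per year, and six nines to about 32 seconds, matching the table's rounded figures.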


Design Tradeoffs

• How do we make availability approach 100%?
• Need to define the maximum repair time
• Need to define how availability is measured

Availability = MTTF / (MTTF + MTTR)

MTTF → infinity (high reliability)
MTTR → zero (fast recovery)

MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
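Plugging the fail-once-per-hour, restart-in-ten-milliseconds example from the earlier slide into this formula (a sketch; the unit choice is ours):

```python
def availability(mttf_seconds: float, mttr_seconds: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_seconds / (mttf_seconds + mttr_seconds)

# Fails on average once per hour (MTTF = 3600 s), restarts in 10 ms (MTTR = 0.01 s)
a = availability(3600.0, 0.01)
print(f"A = {a:.7f}")  # ~0.9999972, matching the earlier slide's figure
```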


Subtleties

• Which has higher availability?
  - Two 4.5-hour outages per year
  - A 1-minute outage every day
• Both are roughly 99.9%
• For an Internet-based company such as eBay or Amazon, which would be more desirable? Why?
• For an autonomous rover?
• Need to specify the details of acceptable outages
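A quick computation (example code) shows why both outage patterns land near 99.9%, even though they differ enormously in character:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability_from_outage(total_outage_hours_per_year: float) -> float:
    """Availability implied by a given total yearly outage."""
    return 1.0 - total_outage_hours_per_year / HOURS_PER_YEAR

two_long_outages = availability_from_outage(2 * 4.5)       # two 4.5-hour outages
one_minute_daily = availability_from_outage(365 * 1 / 60)  # one minute each day

print(f"Two 4.5-hour outages/year: {two_long_outages:.4%}")  # ~99.8973%
print(f"One 1-minute outage/day:   {one_minute_daily:.4%}")  # ~99.9306%
```

The daily one-minute outage actually gives slightly higher availability, yet a rover mid-maneuver and a web retailer would weigh the two patterns very differently.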


Specifying Availability

• Minimum probability of readiness for service
• Period over which the probability is measured
• Fixed or moving window
• Maximum acceptable outage:
  - Time to repair
• Minimum sustained operating time

Safety

Absence of catastrophic consequences on the users or the environment

• Are commercial aircraft "safe"? They crash occasionally
• How many crashes are too many?
• Are cars "safe"? They crash quite a lot:
  - 40K deaths/year; 800/week, the equivalent of two fully loaded Boeing 747s per week

Risk

• Risk is defined to be the expected loss per unit time:

    Risk = Σᵢ pr(accidentᵢ) × cost(accidentᵢ)

• Safety is defined to be an acceptable level of risk:

    Risk < predefined δ
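A sketch of the risk sum with invented accident classes (all probabilities, costs, and the threshold are illustrative only):

```python
# Hypothetical accident classes: (probability per year, cost in dollars if it occurs)
accidents = [
    (1e-4, 5_000_000),  # rare but catastrophic
    (1e-2, 20_000),     # common but minor
]

risk = sum(p * cost for p, cost in accidents)  # expected loss per year
print(f"Risk = ${risk:,.0f}/year")  # $700/year

acceptable_delta = 1_000  # hypothetical pre-defined threshold
print("Acceptable (safe) by this definition:", risk < acceptable_delta)  # True
```

Note that the rare catastrophic accident dominates the sum here; small probabilities times large costs are exactly what the definition is designed to capture.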


Reliability vs. Availability vs. Safety

• They are not the same...
• Example: a system that is turned off is not very reliable and is not very available, but it is probably very safe
• In practice, safety often involves specific intervention
• Never believe an airline that says: "Your safety is our biggest concern."

Confidentiality

Absence of unauthorized disclosure of information

• Microsoft source code vs. Linux source code
• Web browsing
• Operating system security models:
  - Files, memory
• Medical records
• Credit card transaction records
• School grades

Integrity

Absence of improper system state alterations

• Operating systems:
  - Files, memory, network packets
• The Linux kernel backdoor attempt
• Database records
• Your bank account
• File transfer
• Did I really get the right version of software XYZ?
• ...

Security

• Security is a combination of attributes:
  - Integrity
  - Confidentiality
  - Availability
• Under different circumstances, these attributes are more or less important:
  - Denial of service is an availability issue
  - Exposure of information is a confidentiality issue

Maintainability

Ability to undergo repairs and modifications


General Dependability Requirements

• Telecommunications: availability, maintainability
• Transportation: reliability, availability, safety
• Weapons: safety
• Nuclear systems: safety

Pacemaker Dependability

Pacing now extended to left ventricle


General Characteristics

• Eight-bit processors, moving to 32-bit
• Software:
  - Approximately 30K lines, mostly C
  - Vastly more software in the external programmer
• Patient data storage example:
  - 200 samples/sec, 3 channels, 15 minutes
• Long battery life is necessary; the device "sleeps" between heartbeats
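The storage example works out as follows; the 2-bytes-per-sample resolution is our assumption, as the slide does not state one:

```python
samples_per_second = 200
channels = 3
duration_seconds = 15 * 60
bytes_per_sample = 2  # assumed 16-bit samples; resolution is not stated on the slide

total_samples = samples_per_second * channels * duration_seconds
total_bytes = total_samples * bytes_per_sample

print(f"{total_samples:,} samples")                  # 540,000 samples
print(f"{total_bytes / 1_000_000:.2f} MB to store")  # 1.08 MB
```

Around a megabyte of persistent storage for one fifteen-minute capture is a nontrivial demand for a battery-constrained implanted device.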


Pacemaker Inputs & Outputs

• Three leads, one from each of three chambers (left and right ventricles, right atrium)
• Each lead can sense and pace
• Two shocking electrodes in the right ventricle
• Case-to-lead impedance for breathing rate and lung volume
• Accelerometer to measure activity
• Body temperature

Real-Time Control

[Flowchart: a timer interrupt wakes the device at each expected heartbeat; it checks for the expected cardiac event. If the event occurred as expected, it simply sets the timer for the next beat; if not, it generates a pacing signal first, then sets the timer and sleeps.]
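The flowchart's loop can be sketched as follows; the function name and action labels are invented for illustration, and a real pacemaker runs this as interrupt-driven firmware rather than ordinary Python:

```python
def control_step(event_as_expected: bool) -> list:
    """One wake-up of the pacing loop: pace only if the natural beat did not occur."""
    actions = ["wake_up", "check_expected_event"]
    if not event_as_expected:
        actions.append("generate_signal")  # deliver a pacing pulse
    actions.append("set_timer")            # arm the timer interrupt for the next beat
    return actions                         # then the device sleeps until the interrupt

print(control_step(True))   # natural beat seen: no pacing pulse
print(control_step(False))  # missed beat: pacing pulse generated
```

The key design point the flowchart encodes is that the device always prefers the natural beat and intervenes only when one fails to appear.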


Multi-Chamber Functionality

• Pacing:
  - Always allow a natural pace if possible
  - Atrium-ventricle sequencing to allow for fill
  - On an atrial natural pulse, pace the ventricle
  - Biventricular pacing for congestive heart failure
• Defibrillation:
  - Ventricular fibrillation is relatively easy to detect
  - Distinguish benign, moderate, and serious events
  - Atrial fibrillation is also being treated

Dependability of the Pacemaker

• Is reliability the goal?
• Typical battery life is five years
• Does failure matter?
  - The device is "off" between heartbeats
  - A restart rate of roughly 70 per minute (once per heartbeat) is acceptable
  - So t = 100 milliseconds is OK
• But persistent storage is needed

Dependability of the Pacemaker

• Is availability the goal?
• How about an availability of 0.99999?
• This corresponds to an average of five minutes per year of downtime
• Death would result if this downtime occurred all at once

Dependability of the Pacemaker

• Is safety the goal?
• It's safe when it's off... or is it?
• Leaving the system off might result in death very quickly...
• Safety concerns:
  - Under/over pacing
  - Unnecessary/missing defibrillation

Attributes of Dependability

• Reliability
• Availability
• Safety
• Confidentiality
• Integrity
• Maintainability

Requirements: these are properties that might be required of a given system.

Customers must identify the dependability requirements of their system, and developers must design so as to achieve them.

How Much Is Enough?

• A dilemma: could we achieve higher dependability?
  - If more resources were expended on developing a given system, then the rate of failure might be reduced. That being the case, should the developers expend all possible resources on developing the system?
• For example, formal verification:
  - Known to detect defects
  - Considered difficult and expensive to apply
  - Should its use be required of all systems with large consequences of failure?

ALARP

• As Low As Reasonably Practicable
• Three degrees of risk:
  - Tolerable (broadly acceptable)
  - Unacceptable: has to be reduced
  - The ALARP region in between

The "carrot diagram."