Acceptability-Oriented Computing

Martin Rinard Laboratory for Computer Science

Massachusetts Institute of Technology

Traditional View of Correctness

[Figure: the execution space containing a single correct execution.]

Acceptability View

[Figure: the execution space with the correct execution enclosed by an acceptability envelope. Executions inside the envelope are acceptable executions; an execution that leaves the envelope is an unacceptable execution.]

Fail Stop Execution

[Figure: execution space, correct execution, and acceptability envelope; the execution is halted (STOP).]

Safe Exit Execution

[Figure: the same diagram; the execution is halted (STOP) at a safe exit point.]

Resilient Computing Execution

[Figure: the same diagram, showing a repaired execution.]

Questions

• How to identify acceptability envelope?
  • Set of acceptability properties
  • Basic properties that any execution must satisfy to be acceptable
• How to ensure program stays within envelope?
  • Acceptability monitoring
  • Acceptability enforcement

Acceptability Enforcement

[Figure: execution space, correct execution, and acceptability envelope, with acceptability monitoring and a repaired execution (resilient computing execution).]

Proposed Structure

[Figure: Inputs pass through an Input Filter into the Core System, and Outputs leave through an Output Filter. Further components surround the core: Data Structure Repair, Probe Repair, Control Transfer, Exception Recovery, Output Rectification, and Response Enforcement.]

Monitoring and Enforcement Mechanisms

• Black Box
  • Do not affect core
  • Input/output filters and correlators
• White Box – New code and data into core
• Gray Box
  • No change to core program
  • Can change data structures and control flow
  • Mechanisms
    • Procedure call and system call interception
    • Ptrace interface, mmap to access address space

Reason for Acceptability-Oriented Computing: Difficulty of Delivering Perfect Software

• Difficulty in all areas of development effort
  • Understanding domain, obtaining requirements
  • Producing specification, developing software
• Change Aspiration of Development Process
  • Accept inevitability of imperfection
  • Goal is to deliver acceptable program
• Augment Development Activities
  • Identify crucial acceptability properties
  • Ensure that program does not violate them

Aspiring to Perfection Recognized as Harmful

• Defocuses development effort
  • All parts seen as equally important
  • No formal way to direct development effort to most important parts of code
• Produces brittle structure
  • Each piece of functionality implemented
    • Once (no redundancy)
    • Completely (hard and easy parts together)
  • No recovery or protection mechanisms
  • Program completely vulnerable to any error

Advantages of Acceptability-Oriented Computing

• Focused, prioritized development effort
  • Appropriately direct engineering activities
  • Ensure satisfaction of acceptability properties
• Resilient software structure
  • Redundant acceptability property enforcement
  • Mechanisms enforce partial properties
  • Simpler (easier to obtain acceptability) than complete modules in core software
  • Resulting software structure tolerates errors

Ideal Result

• Can build systems with less development effort
  • Can reduce testing effort for core
  • Can leave (infrequent) errors in system
• Can build systems with more functionality
  • Can invest saved development effort on increasing functionality of system
  • Can make larger system stable
  • Can use more aggressive, riskier algorithms

Map Example

[Figure: inputs flow into the Map Core, which produces the outputs.]

Inputs: put x 10, put y 12, get y, rem z, put z 11

Outputs: 10, 12, 12, 10, 11

Acceptability Property: Output must be within min and max inputs

Unacceptable Output

[Figure: inputs flow into the Map Core, which produces the outputs.]

Inputs: put x 10, put y 11, put x 12, rem x, rem y, get x

Outputs: 10, 11, 12, 12, 11, 2

The final output, 2, is unacceptable.

Input/Output Correlation

[Figure: an Input Monitor in front of the Map Core and an Output Filter behind it, both connected to an Input/Output Correlator.]

Inputs: put x 10, put y 11, put x 12, rem x, rem y, get x

Input/Output Correlator: Min: 10, Max: 12

Map Core outputs: 10, 11, 12, 12, 11, 2

Outputs passed by the Output Filter: 10, 11, 12, 12, 11

First Option: Shut Down System

[Figure: the same input/output correlation setup (Min: 10, Max: 12). The Map Core produces 2 for get x; the outputs 10, 11, 12, 12, 11 have been passed, and the system is shut down rather than emitting the 2.]

Second Option: Return Error Code

[Figure: the same input/output correlation setup (Min: 10, Max: 12). The Map Core produces 2 for get x; the Output Filter passes 10, 11, 12, 12, 11 and returns the error code 0 in place of the 2.]

Third Option: Return Min or Max Value

[Figure: the same input/output correlation setup (Min: 10, Max: 12). The Map Core produces 2 for get x; the Output Filter passes 10, 11, 12, 12, 11 and returns the min value, 10, in place of the 2.]

When to Use Each Option

• Shut down system (Safe Exit) when
  • It is safe and acceptable
  • External intervention is available
• Return error code (Delegation) when
  • Client is able to deal with error code
• Return min or max (Resilient Computing) when
  • Not safe to shut down system
  • No external intervention available
  • Client not prepared to deal with error code

All options use black box mechanism
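The input/output correlation and the three responses can be sketched in C roughly as follows. This is a minimal illustration, not the talk's implementation: the names `correlator_t`, `response_t`, `correlator_observe_input`, and `correlator_filter_output` are hypothetical, the error code 0 is taken from the slide's example, and the shut-down response only sets a flag so that the caller decides how to stop.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical response policies, one per option discussed above. */
typedef enum { RESP_SHUT_DOWN, RESP_ERROR_CODE, RESP_CLAMP } response_t;

#define ERROR_CODE 0   /* assumed error code, as in the slide's example */

typedef struct {
    int  min, max;   /* smallest and largest input values seen so far */
    bool seen;       /* false until the first input value is observed */
    bool halted;     /* set when the shut-down response has fired */
} correlator_t;

void correlator_init(correlator_t *c) {
    c->min = INT_MAX;
    c->max = INT_MIN;
    c->seen = false;
    c->halted = false;
}

/* Input monitor: call with the value of every put before it reaches the core. */
void correlator_observe_input(correlator_t *c, int v) {
    if (v < c->min) c->min = v;
    if (v > c->max) c->max = v;
    c->seen = true;
}

/* Output filter: pass acceptable outputs through, otherwise apply the policy. */
int correlator_filter_output(correlator_t *c, int out, response_t policy) {
    if (!c->seen || (out >= c->min && out <= c->max))
        return out;                       /* within [min, max]: acceptable */
    switch (policy) {
    case RESP_SHUT_DOWN:
        c->halted = true;                 /* caller is expected to stop the system */
        return out;
    case RESP_ERROR_CODE:
        return ERROR_CODE;
    case RESP_CLAMP:
        return out < c->min ? c->min : c->max;
    }
    return out;
}
```

With inputs 10, 11, and 12 observed, an output of 2 becomes 10 under `RESP_CLAMP` and 0 under `RESP_ERROR_CODE`, matching the slides. The core is never touched, which is what makes this a black box mechanism.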

Implementation Approach

[Figure: a Hash Table and a Free List of entries; the entries shown are a 1, e 7, b 3, d 4, h 10, and i 11.]

Acceptability Property: Each entry has exactly one incoming reference

• From table, table entry, or free list

• Implies no cycles in table or free list

• Implies disjointness of table and free list

Checking for Acceptability Violations

• Auxiliary reference count for each entry
• Traverse data structures to compute counts
• Check that no count greater than one
• Complications
  • Invalid pointers (addressing violations)
  • Out of bounds array indices (more addressing violations)
  • Cycles (infinite traversal loops)
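A reference-count check along these lines might look like the sketch below. It assumes, as the `int table[M]; int freelist;` declarations in the talk's example suggest, that entries are identified by integer indices into a fixed pool; the pool size `N`, the struct `map_t`, and the function names are hypothetical. Stopping at the first count greater than one also bounds the traversal, so a cycle cannot cause an infinite loop.

```c
#include <stdbool.h>
#include <string.h>

#define M 4            /* number of hash bins (assumed size) */
#define N 8            /* number of entries in the pool (assumed size) */
#define NOENTRY (-1)

typedef struct {
    int table[M];      /* index of the first entry in each bin */
    int freelist;      /* index of the first free entry */
    int next[N];       /* next link of each entry */
} map_t;

static bool valid(int e) { return e >= 0 && e < N; }

/* Walk one chain, counting incoming references. Returns false on an
 * invalid reference or on a second reference to the same entry. */
static bool count_chain(const map_t *m, int head, int count[N]) {
    for (int e = head; e != NOENTRY; e = m->next[e]) {
        if (!valid(e)) return false;        /* addressing violation */
        if (++count[e] > 1) return false;   /* sharing or cycle */
    }
    return true;
}

/* True if every entry has exactly one incoming reference. */
bool map_is_acceptable(const map_t *m) {
    int count[N];
    memset(count, 0, sizeof count);
    for (int b = 0; b < M; b++)
        if (!count_chain(m, m->table[b], count)) return false;
    if (!count_chain(m, m->freelist, count)) return false;
    for (int e = 0; e < N; e++)
        if (count[e] != 1) return false;    /* unreferenced entry */
    return true;
}
```

The validity test runs before an entry's `next` link is read, so an out-of-bounds index is rejected rather than dereferenced.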

Mechanisms for Accessing Data Structures

• White Box
  • Link monitor and checking code into core
  • Possibility of core corrupting checker (and vice-versa!)
• Gray Box
  • Checker uses ptrace interface (or mmap)
  • More cumbersome to access data structures
  • But checker isolated from core

Inconsistency Responses

• Fail stop – halt program, await intervention
  • Feasible when halting acceptable
  • And intervention practical
  • May actually decrease reliability
• Delegation – return error code to client
  • Feasible when client can deal with error
• Resilient computing – fix inconsistency, continue
  • Enables continued (acceptable) execution
  • Hides effect of inconsistency from clients

Code for Put Procedure in Map Example

    /* Hash table and free list */
    int table[M];
    int freelist;

    put(n, v)
      /* Allocate and initialize new hash table entry */
      e = alloc();                 /* Does not check for empty free list */
      value(e) = v;
      strcpy(name(e), n);
      /* Free old entry with same name */
      p = find(n);
      if (p != NOENTRY) free(p);   /* Leaves entry in table */
      /* Insert new entry into hash table */
      b = bin(n);
      next(e) = table[b];          /* Creates cycle if entry already in table */
      table[b] = e;
      return(v);

    /* Insert entry into free list */
    free(e)
      value(e) = freelist;
      freelist = e;

Problem

Program crashes if the free list is empty when put is called

New Acceptability Property

Free list is not empty

Acceptability Enforcement

Repair algorithm ensures free list not empty

Data Structure Repair Goal

[Figure: repair transforms a Map Core with invalid references, a cycle, and an empty free list into a Map Core in which all references are valid, there are no cycles, and the free list contains entries.]

Enforcing Consistency

• Hand-coded consistency algorithm
• Coding is difficult because must assume data structures can be arbitrarily corrupted
  • Invalid references, out of bounds indices
  • Cycles (can cause infinite loops in repair code)
• Two data structure traversals
  • First eliminates invalid references and indices
  • Second removes all but first reference to each entry (requires auxiliary marking data structure)
• Reconstruct free list
  • Any unreferenced entry put into list
  • If free list still empty, steal entry from table
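The repair steps above could be hand-coded along the following lines. This is a sketch under the same assumptions as the deck's example (entries are integer indices, `int table[M]; int freelist;`); the pool size `N`, the struct `map_t`, and the function names are hypothetical. One simplification is made and should be read as such: the first pass scans every reference slot directly instead of walking chains, which sidesteps cycles during that pass.

```c
#include <stdbool.h>

#define M 4            /* number of hash bins (assumed size) */
#define N 8            /* number of entries in the pool (assumed size) */
#define NOENTRY (-1)

typedef struct {
    int table[M];      /* index of the first entry in each bin */
    int freelist;      /* index of the first free entry */
    int next[N];       /* next link of each entry */
} map_t;

static bool valid(int e) { return e >= 0 && e < N; }

/* Follow a chain from *slot, keeping only the first reference to each
 * entry. A reference to an already-marked entry is cut, which also
 * breaks cycles and guarantees termination. */
static void keep_first_reference(map_t *m, int *slot, bool marked[N]) {
    while (*slot != NOENTRY) {
        int e = *slot;
        if (marked[e]) { *slot = NOENTRY; return; }
        marked[e] = true;
        slot = &m->next[e];
    }
}

void map_repair(map_t *m) {
    bool marked[N] = { false };   /* auxiliary marking data structure */

    /* Pass 1: eliminate invalid references and indices. */
    for (int b = 0; b < M; b++)
        if (!valid(m->table[b])) m->table[b] = NOENTRY;
    if (!valid(m->freelist)) m->freelist = NOENTRY;
    for (int e = 0; e < N; e++)
        if (!valid(m->next[e])) m->next[e] = NOENTRY;

    /* Pass 2: remove all but the first reference to each entry. */
    for (int b = 0; b < M; b++)
        keep_first_reference(m, &m->table[b], marked);
    keep_first_reference(m, &m->freelist, marked);

    /* Reconstruct free list: push every unreferenced entry. */
    for (int e = 0; e < N; e++)
        if (!marked[e]) { m->next[e] = m->freelist; m->freelist = e; }

    /* If the free list is still empty, steal an entry from the table. */
    for (int b = 0; b < M && m->freelist == NOENTRY; b++) {
        int e = m->table[b];
        if (e == NOENTRY) continue;
        m->table[b] = m->next[e];
        m->next[e] = NOENTRY;
        m->freelist = e;
    }
}
```

Stealing an entry discards one stored mapping, which is the kind of potentially suboptimal but still acceptable outcome the approach accepts in exchange for continued execution.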

Issues

• Replace failure with potentially suboptimal (but still acceptable) execution
• Checking overhead
  • Depends on properties and application
  • Subject to optimization
• Obscured errors
  • Record violations and updates in logs
  • Use logs to reconstruct actions
• Potential errors in checking and repair code
  • Acceptability enforcement code deals with simpler properties than core
  • Should be simpler and easier to get correct

Generalizations

• Process structure consistency
  • System structured as collection of processes
  • Monitor and regenerate processes to preserve consistency properties
• System configuration consistency
  • Difficult to get configuration settings correct
  • Monitor and update to satisfy properties
  • Properties may depend on running applications, attached devices, etc.
• Both involve structural properties

Next Problem

    int table[M];
    int freelist;

    put(n, v)
      e = alloc();
      value(e) = v;
      strcpy(name(e), n);          /* Buffer Overrun */
      p = find(n);
      if (p != NOENTRY) free(p);
      b = bin(n);
      next(e) = table[b];
      table[b] = e;
      return(v);

    free(e)
      value(e) = freelist;
      freelist = e;

Long Inputs Crash Core

[Figure: inputs flow into the Map Core; the long name overruns the buffer.]

Inputs: put x 10, put y 11, put xxxxxxxxxxx 12, rem x, rem y, get xxxxxxxxxxx

Truncating Input Filter

Filtered inputs: put x 10, put y 11, put xxx 12, rem x, rem y, get xxx

Outputs: 10, 11, 12, 10, 11, 12
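A truncating input filter of this kind can be very small. The sketch below assumes names are NUL-terminated C strings; the function name `truncate_name` and the buffer size used in the usage note are illustrative, not taken from the talk.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst, keeping at most dst_size - 1 characters and always
 * NUL-terminating. Returns the number of characters kept. */
size_t truncate_name(char *dst, size_t dst_size, const char *src) {
    if (dst_size == 0) return 0;          /* no room even for the terminator */
    size_t n = strlen(src);
    if (n > dst_size - 1) n = dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With a 4-byte buffer, the name `xxxxxxxxxxx` is shortened to `xxx` before it reaches the core, so the unchecked `strcpy` inside `put` never sees an over-long name. Because the same filter is applied to both `put` and `get`, the truncated names still match each other.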

Classification of Techniques

• Acceptability properties can involve
  • Inputs, outputs, state, behavior, timing
  • In any combination
• Examples
  • Use data structures to filter outputs
  • Use inputs to repair data structures
  • Process structure and configuration consistency
  • Timing constraints
    • Input arrivals and triggered program actions
    • Frequency of output events

Examples from Real Systems

5ESS Switch

IBM MVS OS

Both systems use hand-coded data structure repair

Maintenance Commands                                        fsck(1M)

NAME
     fsck - check and repair file systems

SYNOPSIS
     fsck [ -F FSType ] [ -m ] [ -V ] [ special ... ]

     fsck [ -F FSType ] [ -n | N | y | Y ] [ -V ] [ -o FSType-specific-options ] [ special ... ]

DESCRIPTION
     fsck audits and interactively repairs inconsistent file system conditions. If the file system is inconsistent the default action for each correction is to wait for the user to respond yes or no. If the user does not have write permission fsck defaults to a no action. Some corrective actions will result in loss of data. The amount and severity of data loss may be determined from the diagnostic output.

GPS Wide Area Augmentation System

Validates Results

Ray Tracing Graphics Computations

[Figure: a scene of triangles with normal vectors and rays shot into it; one triangle is degenerate.]

• Scene composed of triangles
• Shoot rays into scene
• Compute how they interact with triangles (normal vectors)
• Degenerate triangle (colinear vertices): normal vector computation fails

Acceptability-Oriented Approach

• Do not code up all degenerate cases
• Failed computation generates a signal
  • Catch signal
  • Generate some likely value
  • Continue with that value
• Result
  • Several pixels are incorrect
  • But picture as a whole looks fine
  • Program simpler and works faster

Sample Images

T. Kay and J. Kajiya, Ray Tracing Complex Scenes, SIGGRAPH 1986

Limp-Home Modes in Engine Controllers

Hardware Interlocks

Interlocks Prevent
• Unsafe entry into enclosure while the bank is energized or not grounded
• Unsafe operation of air-disconnect while vacuum switches are closed
• Unsafe operation of ground switch(s) while air-disconnect is closed

No Hardware Interlocks

Therac-25

Common Theme

Presence of acceptability-oriented features reduces need for perfection

• Safety-critical systems
• Persistent data
• Can have more ambitious core
  • More functionality
  • More aggressive, riskier algorithms
  • Can tolerate algorithms with known errors
• Less development effort
  • Less testing and certification
  • Can leave infrequent errors in system

Two Kinds of Acceptability-Oriented Computing

• Opportunistic acceptability-oriented computing
  • Observe acceptability problem
  • Develop acceptability enforcement mechanism specifically for that problem
• Systematic acceptability-oriented computing
  • Identify acceptability properties during requirements analysis and design
  • Integrate acceptability features into design
  • Implement acceptability enforcement mechanisms as normal development activity

Changes to Development Activities

• Requirements

• Specification

• Design

• Implementation

• Testing

• Deployment

• Maintenance

Problem With Standard Approaches: Aspiration of Perfection

• Flat set of requirements
• Specification expected to perfectly capture requirements
• Implementation goal: produce flawless implementation
• Testing goal: eliminate all implementation errors
• No attempt to
  • Focus on important properties
  • Build resilient system

With Acceptability-Oriented Computing

• Requirements
  • Prioritize requirements
  • Separate what really matters for system from what would be nice to have
  • Foundation of acceptability-oriented computing
  • Provides basis for acceptability properties
• Specification
  • Translate prioritized requirements into acceptability properties
  • External properties: inputs and outputs
• Design
  • Identify internal acceptability properties: data structures, implementation
  • How to integrate acceptability property enforcement
    • How to monitor execution
    • How to intervene
    • Black/gray/white box
• Implementation
  • Implement and integrate acceptability enforcement mechanisms
• Testing
  • Acceptability enforcement code helps discover and localize errors
  • Develop, deploy new acceptability properties as necessary
• Deployment
  • Turn on appropriate resilient computing mechanisms
  • Helps system execute acceptably with minimal external intervention
• Maintenance
  • Develop, deploy new acceptability properties as necessary

Potential Adoption Paths

• Adopt incrementally
  • Start with specific activity
  • Or selected part of system
  • Can stop short of complete adoption if it makes sense
• Adopt in parallel
  • Small acceptability team
  • Most developers oblivious
• Can orient entire development process around acceptability

Consequences of Systematic Acceptability-Oriented Computing

• Better (more acceptable) software
  • Improved understanding of requirements
  • Inevitable errors placed to minimize harm
  • Resilient systems that recover from errors
• Better documentation
  • Acceptability properties document what is important about the system
  • Acceptability enforcement mechanisms ensure that they accurately reflect implementation

More Consequences

• Reduced development and maintenance costs
  • Prioritized engineering effort
  • More aggressive software reuse
  • Reduced testing costs
    • Acceptability properties help tester
    • Simpler testing for acceptability enforcers
  • Can leave infrequent errors in system

Continued Execution as an Acceptability Property

Failure-Oblivious Computing

Why Don’t PCs Have Memory With Parity?

Manufacturer’s Perspective
• With Parity
  • Memory error occurs
  • PC flags it and stops
  • Consumer blames manufacturer
• No Parity
  • Memory error occurs
  • PC oblivious, keeps going
  • If system crashes, consumer blames Microsoft
• No Incentive to Increase Parts Cost to Get “Benefit” of Parity

Consumer’s Perspective
• With Parity
  • Memory error occurs
  • PC flags it and stops
  • Have to reboot
• No Parity
  • Memory error occurs
  • PC oblivious, keeps going
  • System may never crash (at least, not because of parity error)
  • If it does crash, no big surprise
• Lack of Parity Increases Reliability! Because It Makes PC Oblivious to Failure

Why Will Java Program Fail?

• Out of bounds array access: a[i] = x; x = a[i];
• Null pointer dereference: o.f = x; x = o.f;

Standard Response: Throw an exception and terminate the program

Resilient Computing Response: Ignore error and keep executing
• Writes (a[i] = x; o.f = x;): Discard Value
• Reads (x = a[i]; x = o.f;): Use Manufactured Value

Can Extend the Approach to C

• When program attempts to access illegal address
  • Discard value (writes)
  • Use manufactured value (reads)
  • Program keeps executing
• Improved version uses a Safe C compiler
  • Catches pointer and array bounds errors
  • Replace exception handler to
    • Discard value (writes)
    • Use manufactured value (reads)
    • Program keeps executing
• Improvement reduces data structure damage
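The discard-on-write and manufacture-on-read behavior can be illustrated with explicit bounds-checked accessors. This is only a hand-written sketch of the idea: a Safe C compiler would insert the checks automatically, and the names `checked_array_t`, `fo_write`, `fo_read`, and the manufactured value 0 are assumptions made for the example.

```c
#include <stddef.h>

#define MANUFACTURED_VALUE 0   /* assumed value handed back for invalid reads */

typedef struct {
    int   *data;   /* underlying storage */
    size_t len;    /* number of valid elements */
} checked_array_t;

/* Write: an out-of-bounds write is discarded, so it cannot corrupt
 * neighboring data structures. */
void fo_write(checked_array_t *a, size_t i, int v) {
    if (i < a->len)
        a->data[i] = v;
}

/* Read: an out-of-bounds read returns a manufactured value instead of
 * terminating the program. */
int fo_read(const checked_array_t *a, size_t i) {
    return i < a->len ? a->data[i] : MANUFACTURED_VALUE;
}
```

In both cases the program keeps executing after the invalid access.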

Why Continued Execution is so Valuable

• Systems often consist of
  • Multiple components
  • Each provides important functionality
• Artificial coupling between components
  • Each component needs flow of control to deliver its functionality
  • Any error in any component can deny flow of control to all others
• Continued execution enables control to continue to flow to each component

Furthermore
• Even within a component, error may not cause unacceptable execution
• Or cause of error may eventually be flushed
• Moral of the story
  • 90% of life is just showing up
  • Keep program showing up

More Ways to Ensure Continued Execution

• Eliminate special-case code
  • Poorly tested, likely to contain errors
  • Not as important as common-case code
• Locate code that causes errors and remove it (garbage collection instance of this idea)

Complication: Infinite Loops

• Failure-oblivious techniques can make a computation immortal

• Need a way to identify, then kill useless or misguided computations
  • Bound loop iterations
  • Randomize branch and jump targets
  • Speculatively parallelize computation
• Lack of good mortality units
  • Can attempt to leverage existing structure: threads, transactions, components, …
  • New construct to express mortality units
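Of the options listed, bounding loop iterations is the simplest to show. The sketch below walks a linked list that may have been corrupted into a cycle and gives up after a fixed number of steps; the type `node_t`, the function `bounded_sum`, and the choice of bound are hypothetical.

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    int value;
} node_t;

/* Sum the values in a list, visiting at most max_steps nodes.
 * Sets *truncated to 1 if the bound stopped the traversal. */
long bounded_sum(const node_t *head, size_t max_steps, int *truncated) {
    long sum = 0;
    size_t steps = 0;
    *truncated = 0;
    for (const node_t *p = head; p != NULL; p = p->next) {
        if (steps++ == max_steps) {   /* bound reached: stop the loop */
            *truncated = 1;
            break;
        }
        sum += p->value;
    }
    return sum;
}
```

The caller still has to decide what to do with a truncated result, which is where the question of suitable mortality units comes in.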

Acceptability-Oriented Computing is a Perspective

[Figure: an axis from Conservative to Aggressive, with the following techniques: Data Structure Repair, Data Structure Consistency Checks, Failure-Oblivious Computing, Application-Specific Error Recovery, Input and Output Filtering, Hardware Interlocks, Code Excision, Limp Home Modes, Redundant Computing, Development Process Changes.]

Key Ideas

• Reject aspiration of perfection
• Focus on acceptability
  • Acceptability properties identify acceptability envelope
  • Acceptability enforcement mechanisms keep system within acceptability envelope
• Opportunistic vs. systematic approaches
• Ideal result
  • More resilient systems
  • Less development and testing effort

Binac Avionics System

Example Techniques

• Filter out unacceptable inputs
  • Truncate strings to eliminate buffer overruns
  • Clamp numeric values within range
• Use data structures to filter inputs and outputs
• Use inputs to repair data structures
• Process structure and configuration consistency
• Continued execution as acceptability property
  • Failure-oblivious computing
  • Code and input variation

Aesthetics

• AOC about how to get along in world without perfection
  • One thing to accept perfection as unattainable
  • Another thing to view aspiration for perfection as counterproductive
• Examples from art that informed thinking
  • Bach (Little Fugue in G minor) vs. Mahler 2
    • Scale important differentiator
  • Picasso (19 year old perfect picture, cubism)
  • Michelangelo (David, unfinished)