-
A New Approach to
System Safety Engineering
Nancy G. Leveson, MIT
System Safety Engineering: Back to the Future
http://sunnyday.mit.edu/book2.html
Copyright Nancy Leveson, Aug. 2006
-
Outline of Day 1
Why a new approach is needed
STAMP: A new accident model based on system theory
(control)
Uses
Accident and incident investigation
Hazard Analysis (STPA) and Design for Safety
Non-Advocate Safety Assessment
Cultural and organizational risk analysis
-
Why a new approach is needed
Traditional approaches developed for relatively
simple electro-mechanical systems
Accidents in complex, software-intensive systems
are changing their nature
We need more effective techniques in these new
systems
-
Chain-of-Events Model
Explains accidents in terms of multiple events, sequenced
as a forward chain over time.
Simple, direct relationship between events in chain
Events almost always involve component failure, human
error, or energy-related event
Forms the basis for most safety-engineering and reliability
engineering analysis:
e.g., FTA, PRA, FMECA, Event Trees, etc.
and design:
e.g., redundancy, overdesign, safety margins, etc.
-
Chain-of-events example
-
Accident with No Component Failures
-
Types of Accidents
Component Failure Accidents
Single or multiple component failures
Usually assume random failure
System Accidents
Arise in interactions among components
Related to interactive complexity and tight coupling
Exacerbated by introduction of computers and
software
New technology introduces unknowns and unk-unks
-
Interactive Complexity
Critical factor is intellectual manageability
A simple system has a small number of unknowns in
its interactions (within system and with environment)
Interactively complex (intellectually unmanageable)
when the level of interactions reaches the point where
they can no longer be thoroughly:
Planned
Understood
Anticipated
Guarded against
-
Tight Coupling
Tightly coupled system is one that is highly
interdependent
Each part linked to many other parts
Failure or unplanned behavior in one can rapidly affect status
of others
Processes are time-dependent and cannot wait
Little slack in system
Sequences are invariant, only one way to reach a goal
System accidents are caused by unplanned and
dysfunctional interactions
Coupling increases number of interfaces and potential
interactions
-
Other Types of Complexity
Non-linear complexity
Cause and effect not related in an obvious way
Dynamic Complexity
Related to changes over time
Decompositional
Structural decomposition not consistent with
functional decomposition
-
Limitations of Chain-of-Events Model
Social and organizational factors in accidents
System accidents
Software
Adaptation
Systems are continually changing
Systems and organizations migrate toward accidents
(states of high risk) under cost and productivity
pressures in an aggressive, competitive environment
-
Limitations (2)
Human error
Defined as deviation from normative procedures, but
operators always deviate from standard procedures
Normative vs. effective procedures
Sometimes violation of rules has prevented accidents
Cannot effectively model human behavior by
decomposing it into individual decisions and acts and
studying it in isolation from
Physical and social context
Value system in which it takes place
Dynamic work process
Less successful actions are natural part of search by
operator for optimal performance
-
Mental Models
-
Exxon Valdez
Shortly after midnight, March 24, 1989, tanker Exxon Valdez ran aground on Bligh Reef (Alaska)
11 million gallons of crude oil released
Over 1500 miles of shoreline polluted
Exxon and government put responsibility on tanker Captain Hazelwood, who was disciplined and fired
Was he to blame?
State-of-the-art iceberg monitoring equipment promised by oil industry, but never installed. Exxon Valdez traveling outside normal sea lane in order to avoid icebergs thought to be in area
Radar station in city of Valdez, which was responsible for monitoring the location of tanker traffic in Prince William Sound, had replaced its radar with much less powerful equipment. Location of tankers near Bligh Reef could not be monitored with this equipment.
-
Congressional approval of Alaska oil pipeline and tanker
transport network included an agreement by oil corporations to
build and use double-hulled tankers. Exxon Valdez did not
have a double hull.
Crew fatigue was typical on tankers
In 1977, average oil tanker operating out of Valdez had a crew of 40
people. By 1989, crew size had been cut in half.
Crews routinely worked 12-14 hour shifts, plus extensive overtime
Exxon Valdez had arrived in port at 11 pm the night before. The crew
rushed to get the tanker loaded for departure the next evening
Coast Guard at Valdez assigned to conduct safety inspections
of tankers. It did not perform these inspections. Its staff had
been cut by one-third.
-
Tanker crews relied on the Coast Guard to plot their position continually.
Coast Guard operating manual required this.
Practice of tracking ships all the way out to Bligh Reef had been discontinued.
Tanker crews were never informed of the change.
Spill response teams and equipment were not readily available. Seriously impaired attempts to contain and recover the spilled oil.
Summary:
Safeguards designed to avoid and mitigate effects of an oil spill were not in place or were not operational
By focusing exclusively on blame, the opportunity to learn from mistakes is lost.
Postscript:
Captain Hazelwood was tried for being drunk the night the Exxon Valdez went aground. He was found not guilty
-
Hierarchical models
-
Hierarchical analysis example
-
The Role of Software in Accidents
-
The Computer Revolution
Software is simply the design of a machine abstracted from its physical realization
Machines that were physically impossible or impractical to build become feasible
Design can be changed without retooling or manufacturing
Can concentrate on steps to be achieved without worrying about how steps will be realized physically
General-Purpose Machine + Software = Special-Purpose Machine
-
Advantages = Disadvantages
Computer is so powerful and useful because it has eliminated
many of the physical constraints of previous technology
Both its blessing and its curse
No longer have to worry about physical realization of
our designs
But no longer have physical laws that limit the
complexity of our designs.
-
The Curse of Flexibility
Software is the resting place of afterthoughts
No physical constraints
To enforce discipline in design, construction, and
modification
To control complexity
So flexible that we start working with it before fully
understanding what we need to do
And they looked upon the software and saw that it was
good, but they just had to add one other feature
-
Abstraction from Physical Design
Software engineers are doing physical design
Most operational software errors related to requirements (particularly incompleteness)
Software failure modes are different
Usually does exactly what you tell it to do
Problems occur from operation, not lack of operation
Usually doing exactly what software engineers wanted
Autopilot Expert -> Requirements -> Software Engineer -> Design of Autopilot
-
Safety vs. Reliability
Safety and reliability are NOT the same
Sometimes increasing one can even decrease the other.
Making all the components highly reliable will have no impact on system accidents.
For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.
But accidents in high-tech systems are changing their nature, and we must change our approaches to safety accordingly.
-
It's only a random
failure, sir! It will
never happen again.
-
Reliability Engineering Approach to Safety
Reliability: The probability an item will perform its required function in the specified manner over a given time period and under specified or assumed conditions.
(Note: Most accidents result from errors in specified requirements or functions and deviations from assumed conditions)
Concerned primarily with failures and failure rate reduction:
Redundancy
Safety factors and margins
Derating
Screening
Timed replacements
-
Reliability Engineering Approach to Safety
Assumes accidents are caused by component failure
Positive:
Techniques exist to increase component reliability
Failure rates in hardware are quantifiable
Negative:
Omits important factors in accidents
May decrease safety
Many accidents occur without any component failure
Caused by equipment operation outside parameters and time limits upon which reliability analyses are based.
Caused by interactions of components all operating according to specification.
Highly reliable components are not necessarily safe
-
Software-Related Accidents
Are usually caused by flawed requirements
Incomplete or wrong assumptions about operation of
controlled system or required operation of computer
Unhandled controlled-system states and
environmental conditions
Merely trying to get the software correct or to make
it reliable will not make it safer under these
conditions.
-
Software-Related Accidents (2)
Software may be highly reliable and correct and
still be unsafe:
Correctly implements requirements but specified
behavior unsafe from a system perspective.
Requirements do not specify some particular behavior
required for system safety (incomplete)
Software has unintended (and unsafe) behavior
beyond what is specified in requirements.
-
MPL Requirements Tracing Flaw
SYSTEM REQUIREMENTS
1. The touchdown sensors shall
be sampled at a 100-Hz rate.
2. The sampling process shall be
initiated prior to lander entry to
keep processor demand
constant.
3. However, the use of the
touchdown sensor data shall
not begin until 12 m above the
surface.
SOFTWARE REQUIREMENTS
1. The lander flight software shall
cyclically check the state of each of
the three touchdown sensors (one
per leg) at 100 Hz during EDL.
2. The lander flight software shall be
able to cyclically check the
touchdown event state with or
without touchdown event
generation enabled.
????
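The traceability gap flagged by the "????" above can be illustrated with a hedged sketch (illustrative Python only, not the actual MPL flight software; all names and numbers are invented): if touchdown events are sampled and latched throughout EDL, a spurious transient latched at leg deployment is trusted the moment the 12 m gate opens.

```python
# Illustrative sketch only -- not the MPL flight software. It shows how
# "sample at 100 Hz during EDL" plus "use data below 12 m" can go wrong when
# a spurious touchdown signal latched at leg deployment is never cleared.

def flawed_touchdown_monitor(sensor_events, altitudes):
    """Return the step index at which engines are cut, or None."""
    touchdown_latched = False
    for i, (event, alt) in enumerate(zip(sensor_events, altitudes)):
        if event:
            touchdown_latched = True      # latches even far above 12 m (the flaw)
        if alt <= 12 and touchdown_latched:
            return i                      # believes it has landed; cuts engines
    return None

def safe_touchdown_monitor(sensor_events, altitudes):
    """Same structure, but touchdown data is not trusted above 12 m."""
    touchdown_latched = False
    for i, (event, alt) in enumerate(zip(sensor_events, altitudes)):
        if alt > 12:
            touchdown_latched = False     # discard transients before the gate
        elif event:
            touchdown_latched = True
        if alt <= 12 and touchdown_latched:
            return i
    return None

# A transient at leg deployment (~1500 m) dooms the flawed version:
events    = [False, True, False, False, False]
altitudes = [2000, 1500, 100, 12, 5]
```

Both functions are "correct" against the software requirements as written; only the system requirement ("use shall not begin until 12 m") distinguishes them.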
-
Reliability Approach to Software Safety
Using standard engineering techniques of
Preventing failures through redundancy
Increasing component reliability
Reuse of designs and learning from experience
will not work for software and system accidents
-
Preventing Failures Through
Redundancy
Redundancy simply makes complexity worse
NASA experimental aircraft example
Any solutions that involve adding complexity will not solve
problems that stem from intellectual unmanageability and
interactive complexity
Majority of software-related accidents caused by
requirements errors
Does not work for software even if accident is
caused by a software implementation error
Software errors not caused by random wear-out failures
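A hedged sketch of the point about redundancy (the clamping rule below is an invented stand-in for a requirements error): when all channels implement the same flawed requirement, majority voting is confidently wrong.

```python
# Three 'redundant' channels, written independently but from the SAME flawed
# requirement: readings above 100 are assumed to be sensor noise and clamped.
# (The clamping rule is invented for illustration.)
def channel(reading):
    return min(reading, 100)

def voted_output(reading):
    votes = [channel(reading) for _ in range(3)]
    # Majority voting masks random single-channel failures...
    return max(set(votes), key=votes.count)

# ...but a genuine reading of 150 is unanimously 'corrected' to 100,
# because the error is common-mode, not random wear-out.
```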
-
Increasing Software Reliability (Integrity)
Appearing in many new international standards for software safety (e.g., IEC 61508)
Safety integrity level (SIL)
Sometimes given a reliability number (e.g., 10^-9)
Can software reliability be measured? What does it mean?
What does it have to do with safety?
Safety involves more than simply getting the software correct:
Example: altitude switch
1. Signal safety-increasing:
Require any of the three altimeters to report below threshold
2. Signal safety-decreasing:
Require all three altimeters to report below threshold
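The altitude-switch asymmetry can be sketched in a few lines (threshold and readings are made-up values): the same three altimeters call for different voting logic depending on whether the issued signal increases or decreases safety, so "correct" voting logic is meaningless without the system-level safety context.

```python
# Illustrative sketch of the altitude-switch example. Threshold and readings
# are invented values, not from the lecture.
THRESHOLD = 2000  # feet

def safety_increasing_signal(altimeters):
    # Fail toward safety: ANY altimeter below threshold triggers the signal.
    return any(a < THRESHOLD for a in altimeters)

def safety_decreasing_signal(altimeters):
    # Demand consensus: ALL altimeters must agree before the risky action.
    return all(a < THRESHOLD for a in altimeters)

readings = [1800, 2500, 2400]  # one altimeter reports below threshold
```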
-
Software Component Reuse
One of most common factors in software-related
accidents
Software contains assumptions about its environment
Accidents occur when these assumptions are incorrect
Therac-25
Ariane 5
U.K. ATC software
Mars Climate Orbiter
Most likely to change the features embedded in or
controlled by the software
COTS makes safety analysis more difficult
-
Safety and (component or system) reliability
are different qualities in complex systems!
Increasing one will not necessarily increase
the other.
So what do we do?
-
A Possible Solution
Enforce discipline and control complexity
Limits have changed from structural integrity and
physical constraints of materials to intellectual limits
Improve communication among engineers
Build safety in by enforcing constraints on behavior
Controller contributes to accidents not by failing but by:
1. Not enforcing safety-related constraints on behavior
2. Commanding behavior that violates safety constraints
-
Example
(Chemical Reactor)
System Safety Constraint:
Water must be flowing into reflux condenser
whenever catalyst is added to reactor
Software (Controller) Safety Constraint:
Software must always open water valve
before catalyst valve
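One way to read the software safety constraint is as an ordering invariant that the controller itself enforces. A minimal sketch (class and method names are hypothetical, not from the lecture):

```python
# Minimal sketch: the controller enforces the ordering constraint directly,
# rather than relying on every command sequence happening to respect it.
# Class and method names are hypothetical.
class ReactorController:
    def __init__(self):
        self.water_valve_open = False
        self.catalyst_valve_open = False

    def open_water_valve(self):
        self.water_valve_open = True

    def open_catalyst_valve(self):
        # Software safety constraint: water must be flowing into the reflux
        # condenser whenever catalyst is added to the reactor.
        if not self.water_valve_open:
            raise RuntimeError("safety constraint violated: open water valve first")
        self.catalyst_valve_open = True
```

The point is not the particular check but where it lives: the constraint is part of the controller's specification, not an implicit property of one command sequence.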
-
Conclusion
The primary safety problem in complex, software-
intensive systems is the lack of appropriate
constraints on design
The job of the system safety engineer is to
Identify the constraints necessary to maintain safety
Ensure the system (including software) design
enforces them
-
Introduction to System
Safety Engineering
-
A Non-System Safety Example:
Nuclear Power (Defense in Depth)
Multiple independent barriers to propagation of
malfunction
Emphasis on component reliability and use of lots of
redundancy
Handling single failures (no single failure of any
components will disable any barrier)
Protection (safety) systems: automatic system shutdown
Emphasis on reliability and availability of shutdown system
and physical system barriers (using redundancy)
-
Why is this effective?
Relatively slow pace of basic design changes
Use of well-understood and debugged designs
Ability to learn from experience
Conservatism in design
Slow introduction of new technology
Limited interactive complexity and coupling
-
System Safety
Grew out of ballistic missile systems of the 1960s
Emphasizes building in safety rather than adding it on to a
completed design
Looks at systems as a whole, not just components
A top-down systems approach to accident prevention
Takes a larger view of accident causes than just component
failures (includes interactions among components)
Emphasizes hazard analysis and design to eliminate or
control hazards
Emphasizes qualitative rather than quantitative approaches
-
System Safety Overview
A planned, disciplined, and systematic approach to
preventing or reducing accidents throughout the life cycle of
a system.
Organized common sense (Mueller, 1968)
Primary concern is the management of hazards
Hazard identification, evaluation, elimination, and control
Through analysis, design, and management
MIL-STD-882
-
System Safety Overview (2)
Analysis:
Hazard analysis and control is a continuous, iterative process throughout system development and use.
Design: Hazard resolution precedence
1. Eliminate the hazard
2. Prevent or minimize the occurrence of the hazard
3. Control the hazard if it occurs
4. Minimize damage
Management:
Audit trails, communication channels, etc.
-
System Safety in Software-Intensive
Systems
While the system safety approach was developed for and
works for complex, technologically advanced systems, new
methods are required
Software particularly stretches traditional methods
Need new hazard analysis and other approaches for
complex, software-intensive systems
Rest of day 1 shows a new way to implement the basic
system safety approach
-
STAMP
A new accident causation
model using Systems Theory
(vs. Reliability Theory)
-
Introduction to Systems Theory
Ways to cope with complexity
1. Analytic Reduction
2. Statistics
-
Analytic Reduction
Divide system into distinct parts for analysis
Physical aspects: separate physical components
Behavior: events over time
Examine parts separately
Assumes such separation possible:
1. The division into parts will not distort the
phenomenon
Each component or subsystem operates independently
Analysis results not distorted when considering components
separately
-
Analytic Reduction (2)
2. Components act the same when examined singly as
when playing their part in the whole
Components or events not subject to feedback loops and
non-linear interactions
3. Principles governing the assembling of components
into the whole are themselves straightforward
Interactions among subsystems simple enough that they can be
considered separately from the behavior of the subsystems themselves
Precise nature of interactions is known
Interactions can be examined pairwise
Called Organized Simplicity
-
Statistics
Treat system as a structureless mass with
interchangeable parts
Use Law of Large Numbers to describe behavior in
terms of averages
Assumes components are sufficiently regular and
random in their behavior that they can be studied
statistically
Called Unorganized Complexity
-
Complex, Software-Intensive Systems
Too complex for complete analysis
Separation into (interacting) subsystems distorts the results
The most important properties are emergent
Too organized for statistics
Too much underlying structure that distorts the statistics
Called Organized Complexity
-
Systems Theory
Developed for biology (von Bertalanffy) and engineering
(Norbert Wiener)
Basis of system engineering and system safety
ICBM systems of the 1950s
Developed to handle systems with organized complexity
-
Systems Theory (2)
Focuses on systems taken as a whole, not on parts taken separately
Some properties can only be treated adequately in their entirety, taking into account all social and technical aspects
These properties derive from relationships among the parts of the system
How they interact and fit together
Two pairs of ideas
1. Hierarchy and emergence
2. Communication and control
-
Hierarchy and Emergence
Complex systems can be modeled as a hierarchy of
organizational levels
Each level more complex than one below
Levels characterized by emergent properties
Irreducible
Represent constraints on the degree of freedom of
components at lower level
Safety is an emergent system property
It is NOT a component property
It can only be analyzed in the context of the whole
-
Example
Control
Structure
-
Communication and Control
Hierarchies characterized by control processes working at
the interfaces between levels
A control action imposes constraints upon the activity at a
lower level of the hierarchy
Open systems are viewed as interrelated components kept
in a state of dynamic equilibrium by feedback loops of
information and control
Control in open systems implies need for communication
-
Control processes operate between levels:
Controller (holds a Model of the Process) -> Control Actions -> Controlled Process
Controlled Process -> Feedback -> Controller
Process models must contain:
- Required relationship among process variables
- Current state (values of the process variables)
- The ways the process can change state
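The three required elements of a process model can be sketched as a small class (the valve/flow relationship is an invented illustration, not from the lecture):

```python
# Sketch of the three elements a process model must contain. The valve/flow
# relationship is invented for illustration.
class ProcessModel:
    def __init__(self):
        # 2. Current state: values of the process variables.
        self.state = {"valve": "closed", "flow": 0}

    def apply_feedback(self, measurements):
        # 3. Ways the process can change state: here the model's state
        # changes only when feedback reports new measured values.
        self.state.update(measurements)

    def consistent(self):
        # 1. Required relationship among process variables:
        # nonzero flow is possible only when the valve is open.
        return self.state["flow"] == 0 or self.state["valve"] == "open"

model = ProcessModel()
model.apply_feedback({"flow": 5})  # flow reported, valve still believed closed
```

When `consistent()` is false, the controller's picture of the plant has diverged from what is physically possible, which is exactly the condition the next slides tie to accidents.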
-
Relationship Between Safety and
Process Models
Accidents occur when models do not match the process and
Incorrect control commands given
Correct ones not given
Correct commands given at wrong time (too early, too late)
Control stops too soon
(Note the relationship to system accidents and to software's role in accidents)
-
Relationship Between Safety and
Process Models (2)
How do they become inconsistent?
Wrong from beginning
Missing or incorrect feedback
Not updated correctly
Time lags not accounted for
Resulting in
Uncontrolled disturbances
Unhandled process states
Inadvertently commanding system into a hazardous state
Unhandled or incorrectly handled system component failures
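A one-screen sketch of "missing or incorrect feedback" (tank levels invented for illustration): a command that is correct with respect to the controller's stale model is hazardous with respect to the actual process.

```python
# Illustrative sketch: the feedback that updates the controller's process
# model is lost, so the model drifts from the real process state.
real_level = 95    # actual tank level (%), unknown to the controller
model_level = 40   # controller's stale belief (last feedback received)

def pump_command(believed_level):
    # Controller logic is 'correct' for its model of the process...
    return "pump_on" if believed_level < 80 else "pump_off"

command = pump_command(model_level)  # issues "pump_on"
# ...but hazardous for the real process: pumping into a nearly full tank.
overflow_hazard = (command == "pump_on" and real_level >= 90)
```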
-
Relationship Between Safety and
Human Mental Models
Explains most human/computer interaction problems
Pilots and others do not understand the automation
Or don't get feedback to update mental models, or disbelieve it
What did it just do?
Why did it do that?
What will it do next?
How did it get us into this state?
How do I get it to do what I want?
Why won't it let us do that?
What caused the failure?
What can we do so it does not happen again?
-
Mental Models
-
Relationship Between Safety and
Human Mental Models (2)
Also explains developer errors. May have incorrect
model of
Required system or software behavior for safety
Development process
Physical laws
Etc.
-
STAMP
Systems-Theoretic Accident Model and Processes
Accidents are not simply an event or chain of events but involve a complex, dynamic process
Based on systems and control theory
Accidents arise from interactions among humans, machines, and the environment (not just component failures)
Goal is not to prevent failures but to enforce safety constraints on system behavior
-
STAMP (2)
View accidents as a control problem
O-ring did not control propellant gas release by sealing gap in field joint
Software did not adequately control descent speed of Mars
Polar Lander
Events are the result of the inadequate control
Result from lack of enforcement of safety constraints
-
A Broad View of Control
Does not imply need for a controller
Component failures and dysfunctional interactions may be controlled through design
(e.g., redundancy, interlocks, fail-safe design)
or through process
Manufacturing processes and procedures
Maintenance processes
Operations
Does imply the need to enforce the safety constraints in some way
New model includes what we do now and more
-
STAMP (3)
Safety is an emergent property that arises when
system components interact with each other within a
larger environment
A set of constraints related to behavior of system
components enforces that property
Accidents occur when interactions violate those
constraints (a lack of appropriate constraints on the
interactions)
Controllers embody or enforce those constraints
-
Example Safety Constraints
Build safety in by enforcing safety constraints on behavior
Controllers contribute to accidents not by failing but by:
1. Not enforcing safety-related constraints on behavior
2. Commanding behavior that violates safety constraints
System Safety Constraint:
Water must be flowing into reflux condenser whenever catalyst
is added to reactor
Software Safety Constraint:
Software must always open water valve before catalyst valve
-
STAMP (4)
Systems are not treated as a static design
A socio-technical system is a dynamic process
continually adapting to achieve its ends and to react
to changes in itself and its environment
Migration toward states of high risk
Preventing accidents requires designing a control
structure to enforce constraints on system behavior
and adaptation
-
Example
Control
Structure
-
Intelligent Cruise Control
-
Accident Causality
Accidents occur when
Control structure or control actions do not enforce
safety constraints
Unhandled environmental disturbances or conditions
Unhandled or uncontrolled component failures
Dysfunctional (unsafe) interactions among components
Control structure degrades over time (asynchronous
evolution)
Control actions inadequately coordinated among
multiple controllers
-
Dysfunctional Controller Interactions
Boundary areas
Overlap areas (side effects of decisions and control
actions)
(Figures: Controller 1 and Controller 2 each controlling its own process, with boundary areas between them; and Controller 1 and Controller 2 both controlling a single process, with overlap areas)
-
Uncoordinated Control Agents
SAFE STATE: ATC (control agent) provides coordinated instructions to both planes
SAFE STATE: TCAS (control agent) provides coordinated instructions to both planes
UNSAFE STATE: Both TCAS and ATC provide uncoordinated and independent instructions, with no coordination between the two control agents
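The unsafe state can be sketched in a few lines (instruction values invented to mirror the pattern seen at Überlingen): each agent's instruction pair is internally coordinated, but when each crew obeys a different agent, both aircraft can be sent the same way.

```python
# Illustrative sketch of uncoordinated control agents. Each agent issues an
# internally consistent instruction pair; the hazard appears only when the
# two crews obey different agents. Values are invented for illustration.
ATC_INSTRUCTIONS  = {"plane_A": "descend", "plane_B": "climb"}
TCAS_INSTRUCTIONS = {"plane_A": "climb",   "plane_B": "descend"}

def maneuver(plane, obeys):
    source = ATC_INSTRUCTIONS if obeys[plane] == "ATC" else TCAS_INSTRUCTIONS
    return source[plane]

obeys = {"plane_A": "ATC", "plane_B": "TCAS"}  # mixed compliance
maneuvers = {p: maneuver(p, obeys) for p in obeys}
# Both aircraft now descend toward each other, although neither agent,
# taken alone, issued an unsafe pair of instructions.
```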
-
Root Cause Analysis Example:
Exxon Valdez
(Control structure: Congress -> Coast Guard and Exxon -> Tanker Captain and Crew -> Tanker)
For each component identify:
Responsibility (safety requirements and constraints)
Inadequate control actions
Context in which decisions made
Mental model flaws
-
Applying STAMP
To understand accidents, need to examine the safety control structure itself to determine why it was inadequate to maintain safety constraints and why the events occurred
To prevent accidents, need to create an effective safety control structure to enforce the system safety constraints
Not a blame model but a "why" model
-
Modeling Accidents Using STAMP
Three types of models are used:
1. Static safety control structure
2. Dynamic safety control structure
Shows how control structure changed over time
3. Behavioral dynamics
Dynamic processes behind changes, i.e., why the
system changed over time
-
Simplified System Dynamics Model of Columbia Accident
-
STAMP vs. Traditional Accident Models
Examines inter-relationships rather than linear
cause-effect chains.
Looks at the processes behind the events
Includes entire socio-technical system
Includes behavioral dynamics (changes over time)
Want to not just react to accidents and impose controls for
a while, but understand why controls drift toward
ineffectiveness over time and
Change those factors if possible
Detect the drift before accidents occur
-
Uses for STAMP
Basis for new, more powerful hazard analysis techniques
(STPA)
Inform early architectural trade studies
Identify and prioritize hazards and risks
Identify system and component safety requirements and constraints
(to be used in design)
Perform hazard analyses on physical and social systems
Safety-driven design (physical, operational, organizational)
More comprehensive accident/incident investigation and
root cause analysis
-
Uses for STAMP (2)
Organizational and cultural risk analysis
Identifying physical and project risks
Defining safety metrics and performance audits
Designing and evaluating potential policy and structural improvements
Identifying leading indicators of increasing risk (canary in the coal
mine)
New holistic approaches to security
-
Does it Work? Is it Practical?
MDA risk assessment of inadvertent launch (technical)
Architectural trade studies for the space exploration
initiative (technical)
Safety-driven design of a NASA JPL spacecraft
(technical)
NASA Space Shuttle Operations (risk analysis of a new
management structure)
NASA Exploration Systems (risk management tradeoffs
among safety, budget, schedule, performance in
development of replacement for Shuttle)
-
Does it Work? Is it Practical? (2)
Accident analysis (spacecraft losses, bacterial
contamination of water supply, aircraft collision, oil
refinery explosion, train accident, etc.)
Pharmaceutical safety
Hospital safety (risks of outpatient surgery at Beth Israel
MC)
Corporate fraud (are controls adequate?, Sarbanes-
Oxley)
Food safety
Train safety (Japan)
-
Accident/Incident Investigation
and Causal Analysis
-
Using STAMP in Root Cause Analysis
Identify system hazard violated and the system safety design
constraints
Construct the safety control structure as it was designed to
work
Component responsibilities (requirements)
Control actions and feedback loops
For each component, determine if it fulfilled its responsibilities
or provided inadequate control.
If inadequate control, why? (including changes over time)
Determine the changes that could eliminate the inadequate
control (lack of enforcement of system safety constraints) in
the future.
-
Components surrounding
Controller in Zurich
-
Links degraded due to
poor and unsafe practices
-
Links lost due to
sectorization work
-
Links lost due to unusual situations
-
Tupolev Crew
Safety Requirements and Constraints
Must follow TCAS mandate
Context in which decisions were made
Flying over Western Europe (TCAS is mandatory)
TU crew doesn't have radio communication with Boeing crew
Flight crew has no simulator experience with TCAS
Flight training is unclear on what to do in case of conflict between ATC/TCAS
Flying at night
Inadequate Decisions and Control Actions
Reliance on optical contact
Ignores minority report from spare member of the crew
Follows controller instructions rather than TCAS
Mental Model Flaws
Optical illusion of distance
Belief that ATC is aware of everything that is happening
Belief that pilot, not TCAS, has the last say in the evasive action
Understanding of TCAS as a backup system rather than a final resort
-
Zurich ATC Operations
Safety Requirements and Constraints
Maintain safe separation between planes in airspace
Context in which decisions were made
Phone system prevented communication from other ATCs
Inadequate radar coverage
Insufficient personnel (only one controller)
Unaware of TCAS and/or impact of TCAS during a RA
Etc.
Inadequate Decision and Control Actions
Failure to communicate with DHL plane
Failure to adequately monitor situation
Mental Model Flaws
Unaware of conflicting TCAS procedures between Russian and European pilots
Etc.
-
Regulatory Agencies (FAA, CAA,
Eurocontrol)
(No significant influence on accident according to the report)
Safety Requirements (Responsibilities)
Clearly articulate procedures for compliance with TCAS RAs.
Clearly articulate right of way rules in airspace.
Define the role of air traffic controllers and pilots in resolving conflicts in the presence of TCAS.
Flawed Control Actions
AIP Germany regulations not up to date for current version of TCAS.
Procedural instruction for the actions to be taken by the pilots (from AIP Germany) in case of an RA not worded clearly enough.
LuftVO (Air Traffic Order): Pilots are granted a freedom of decision which is not compatible with the system philosophy of TCAS II, Version 7; use of the term "recommendation" is inadequate.
Reasons for Flawed Control Actions, Dysfunctional Interactions
Overlapping control authority by several nations & organizations.
Asynchronous evolution between regulatory guidance documents and adopted technology.
-
Filename: Überlingen per STAMP
(C) FMV 2007
ed. 9.3, 2008-03-04
Björn Koberstein
Überlingen, Operators: STAMP's Static Control Structure (Action/Feedback)
(Control-structure diagram, rendered as a list:)
ICAO (a UN agency): cooperative aviation regulation; standards and recommendations (including rules of the air, responsibilities of ATCOs and pilots, conflicts between ATCO and TCAS)
EUROCONTROL (European org): standardization and guidance; certification, education, information
Aircraft authorities (incl. FAA)
JAA (Joint Aviation Authorities): guidance and training (TCAS RAs take precedence over ATCO instructions)
Swiss Air Navigation Services (ATC Zürich management): air traffic control in Swiss airspace and in delegated airspaces of adjoining states; ATC safety policy (not fully implemented; SMOP in practice not prevented)
ATC operator: overloaded; insufficiently trained; unofficial practice deviating from the regular roster (night-shift SMOP)
CoC: safety, quality, and risk management; implementation of the safety and risk management procedures was delayed due to their in-house development; unaware of the sectorisation work
TCAS manufacturer: TCAS 2000 Pilots Guide (including TCAS-ATC conflicts)
Flight operators: TCAS training (B757-200/TU154M)* (obey RA unless/after conflict contact); TU154M Flight Operations (ATC has precedence over TCAS)
Other components: pilots, aircraft, TCAS, radar, radar display, ATC Station (ATC Zurich)
Several links in the diagram are marked as having missing or no direct feedback
* BFU Investigation Report pp. 62, 65
-
STPA
A new hazard analysis technique based
on the STAMP model of accident
causation
Copyright Nancy Leveson, Aug. 2006
-
STAMP-Based Hazard Analysis (STPA)
Supports a safety-driven design process where
Hazard analysis influences and shapes early design
decisions
Hazard analysis iterated and refined as design evolves
Goals (same as any hazard analysis)
Identification of system hazards and related safety
constraints necessary to ensure acceptable risk
Accumulation of information about how hazards can be
violated, which is used to eliminate, reduce and control
hazards in system design, development, manufacturing,
and operations
-
Safety-Driven Design
Define initial control structure, refining system safety constraints and design in parallel.
Identify potentially hazardous control actions by each of the system components that would violate system design constraints. Restate as component safety design requirements and constraints.
Perform hazard analysis using STPA to identify how safety-related requirements and constraints could be violated (the potential causes of inadequate control and enforcement of safety-related constraints).
Augment the basic design to eliminate, mitigate, or control potential unsafe control actions and behaviors.
Iterate over the process, i.e., perform STPA on the new augmented design and continue to refine the design until all hazardous scenarios are eliminated, mitigated, or controlled.
Document design rationale and trace requirements and constraints to the related design decisions.
-
Step 1: Identify hazards and translate into high-
level requirements and constraints on behavior
TCAS Hazards:
1. A near mid-air collision (NMAC): two controlled aircraft violate minimum separation standards
2. A controlled maneuver into ground
3. Loss of control of aircraft
4. Interference with other safety-related aircraft systems
5. Interference with the ground-based ATC system
6. Interference with ATC safety-related advisory
System Safety Design Constraints:
TCAS must not cause or contribute to an NMAC
TCAS must not cause or contribute to a controlled maneuver into the ground
-
Step 2: Define basic control structure
-
Component Responsibilities
TCAS:
Receive and update information about its own and other aircraft
Analyze information received and provide pilot with
Information about where other aircraft in the vicinity are located
An escape maneuver to avoid potential NMAC threats
Pilot
Maintain separation between own and other aircraft using visual
scanning
Monitor TCAS displays and implement TCAS escape maneuvers
Follow ATC advisories
Air Traffic Controller
Maintain separation between aircraft in controlled airspace by
providing advisories (control action) for pilot to follow
-
Aircraft components (e.g., transponders, antennas)
Execute control maneuvers
Receive and send messages to/from aircraft
Etc.
Airline Operations Management
Provide procedures for using TCAS and following TCAS
advisories
Train pilots
Audit pilot performance
Air Traffic Control Operations Management
Provide procedures
Train controllers
Audit performance of controllers
Audit performance of overall collision avoidance system
-
Step 3a: Identify potential inadequate control
actions that could lead to a hazardous state.
In general:
1. A required control action is not provided or not
followed
2. An incorrect or unsafe control action is provided
3. A potentially correct or inadequate control action is
provided too late or too early (at the wrong time)
4. A correct control action is stopped too soon.
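As a hedged illustration (not part of the STPA method itself; all class and function names are invented), the four types can be treated as a checklist that expands each control action into candidate unsafe control actions to be examined:

```python
# Illustrative sketch: the four general inadequate-control-action types
# as a checklist applied to each (controller, control action) pair.
# The TCAS contexts below restate examples from these slides.
from dataclasses import dataclass
from enum import Enum, auto

class UCAType(Enum):
    NOT_PROVIDED = auto()      # 1. required action not provided or not followed
    UNSAFE_PROVIDED = auto()   # 2. incorrect or unsafe action provided
    WRONG_TIMING = auto()      # 3. provided too late or too early
    STOPPED_TOO_SOON = auto()  # 4. correct action stopped too soon

@dataclass
class UCAEntry:
    controller: str
    action: str
    uca_type: UCAType
    context: str               # hazardous context in which the action is unsafe

def uca_table(controller: str, action: str, contexts: dict) -> list:
    """Expand one control action into one candidate UCA per type."""
    return [UCAEntry(controller, action, t, ctx) for t, ctx in contexts.items()]

entries = uca_table("TCAS", "Resolution Advisory (RA)", {
    UCAType.NOT_PROVIDED: "aircraft on near-collision course, no RA issued",
    UCAType.UNSAFE_PROVIDED: "RA degrades vertical separation",
    UCAType.WRONG_TIMING: "RA issued too late to avoid NMAC",
    UCAType.STOPPED_TOO_SOON: "RA removed before conflict resolved",
})
print(len(entries))  # 4 candidate unsafe control actions to analyze
```

Each resulting entry is then a prompt for Step 3b (restating as a constraint) and Step 4 (finding causal scenarios).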
-
For the NMAC hazard:
TCAS:
1. The aircraft are on a near collision course and TCAS does not
provide an RA
2. The aircraft are in close proximity and TCAS provides an RA that
degrades vertical separation.
3. The aircraft are on a near collision course and TCAS provides an
RA too late to avoid an NMAC
4. TCAS removes an RA too soon
Pilot:
1. The pilot does not follow the resolution advisory provided by TCAS
(does not respond to the RA)
2. The pilot incorrectly executes the TCAS resolution advisory.
3. The pilot applies the RA but too late to avoid the NMAC
4. The pilot stops the RA maneuver too soon.
-
Step 3b: Use identified inadequate control
actions to refine system safety design
constraints
When two aircraft are on a collision course, TCAS must
always provide an RA to avoid the collision
TCAS must not provide RAs that degrade vertical separation
The pilot must always follow the RA provided by TCAS
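One way to make such constraints precise is to express them as executable monitor predicates over recorded state. A minimal sketch, assuming a simplified aircraft-pair snapshot with invented field names (not an actual TCAS interface):

```python
# Hedged sketch: the three refined safety constraints re-expressed as
# checks over a simplified, invented state representation.
from dataclasses import dataclass

@dataclass
class Snapshot:
    on_collision_course: bool
    ra_active: bool            # TCAS currently providing an RA
    pilot_following_ra: bool
    vertical_sep_ft: float
    prev_vertical_sep_ft: float

def violated_constraints(s: Snapshot) -> list:
    v = []
    # "When two aircraft are on a collision course, TCAS must provide an RA"
    if s.on_collision_course and not s.ra_active:
        v.append("no RA on collision course")
    # "TCAS must not provide RAs that degrade vertical separation"
    if s.ra_active and s.vertical_sep_ft < s.prev_vertical_sep_ft:
        v.append("RA degrades vertical separation")
    # "The pilot must always follow the RA provided by TCAS"
    if s.ra_active and not s.pilot_following_ra:
        v.append("pilot not following RA")
    return v

print(violated_constraints(Snapshot(True, False, False, 900.0, 900.0)))
# ['no RA on collision course']
```

Writing constraints this way forces each vague phrase ("degrades separation") to be pinned to observable quantities.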
-
Step 4: Determine how potentially hazardous
control actions could occur (scenarios of how
constraints can be violated). Eliminate from design
or control in design or operations.
Step 4a: Augment control structure with process models for each control component.
Step 4b: For each inadequate control action, examine the parts of the control loop to see if they could cause it.
Guided by a set of generic control flaws
Step 4c: Design controls and mitigation measures
Step 4d: Consider how designed controls could degrade over time.
-
Generic Control Loop Flaws
1. Inadequate Enforcement of Constraints (inadequate
Control Actions)
- Design of control algorithm (process) does not enforce
constraints
Flaws in creation process
Process changes without appropriate change in control
algorithm (asynchronous evolution)
Incorrect modification or adaptation
- Inadequate coordination among controllers and decision
makers
-
- Process models inconsistent, incomplete, or incorrect
Flaws in creation process
Flaws in updating (inadequate or missing feedback)
Not provided in system design
Communication flaw
Time lag
Inadequate sensor operation (incorrect or no information provided)
Time lags and measurement inaccuracies not accounted
for
Expected process inputs are wrong or missing
Expected control inputs are wrong or missing
Disturbance model is wrong
Amplitude, frequency, or period is out of range
Unidentified disturbance
-
2. Inadequate Execution of Control Actions
Communication flaw
Inadequate actuator operation
Time lag
-
Comparison with Traditional HA
Techniques
Top-down (vs bottom-up like FMECA)
Considers more than just component failure and failure
events (includes these but more general)
Guidance in doing analysis (vs. FTA)
Handles dysfunctional interactions and system accidents,
software, management, etc.
-
Comparisons (2)
Concrete model (not just in head)
Not physical structure (HAZOP) but control (functional)
structure
General model of inadequate control (based on control
theory)
HAZOP guidewords based on model of accidents being
caused by deviations in system variables
Includes HAZOP model but more general
Compared with TCAS II Fault Tree (MITRE)
STPA results more comprehensive
Included Ueberlingen accident
-
Thermal Tile Robot Example
1. Identify high-level functional requirements and environmental constraints.
e.g., size of physical space, crowded area
2. Identify high-level hazards
a. Violation of minimum separation between mobile base and objects (including orbiter and humans)
b. Mobile robot becomes unstable (e.g., could fall over)
c. Manipulator arm hits something
d. Fire or explosion
e. Contact of human with DMES
f. Inadequate thermal control (e.g., damaged tiles not detected, DMES not applied correctly)
g. Damage to robot
-
3. Try to eliminate hazards from system conceptual design.
If not possible, then identify controls and new design
constraints.
For unstable base hazard
System Safety Constraint:
Mobile base must not be capable of falling over under worst case operational conditions
-
First try to eliminate:
1. Make base heavy
Could increase damage if it hits someone or something.
Difficult to move out of way manually in emergency
2. Make base long and wide
Eliminates hazard but violates environmental constraints
3. Use lateral stability legs that are deployed when manipulator arm extended but must be retracted when mobile base moves.
Two new design constraints:
Manipulator arm must move only when stabilizer legs are fully deployed
Stabilizer legs must not be retracted until manipulator arm is fully stowed.
-
Define preliminary control structure and refine
constraints and design in parallel.
-
Identify potentially hazardous control actions by
each of system components
1. A required control action is not provided or not followed
2. An incorrect or unsafe control action is provided
3. A potentially correct or inadequate control action is provided too late or too early (at the wrong time)
4. A correct control action is stopped too soon.
Hazardous control of stabilizer legs:
Legs not deployed before arm movement enabled
Legs retracted when manipulator arm extended
Legs retracted after arm movements are enabled or retracted before manipulator arm fully stowed
Leg extension stopped before the legs are fully extended
-
Restate as safety design constraints on components
1. Controller must ensure stabilizer legs are extended
whenever arm movement is enabled
2. Controller must not command a retraction of stabilizer legs
when manipulator arm extended
3. Controller must not command retraction of stabilizer legs
after arm movements are enabled. Controller must not
command retraction of legs before manipulator arm fully
stowed
4. Controller must not stop leg deployment before they are fully
extended
-
Do same for all hazardous commands:
e.g., Arm controller must not enable manipulator arm movement before stabilizer legs are completely extended.
At this point, may decide to have arm controller and
leg controller in same component
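These component constraints can be enforced in software as interlocks. A minimal sketch, assuming a combined arm/leg controller (all class and method names invented, not from the actual robot design):

```python
# Illustrative sketch: component safety constraints as software interlocks.
class TileRobotController:
    def __init__(self):
        self.legs = "retracted"      # retracted | extended
        self.arm = "stowed"          # stowed | extended
        self.arm_enabled = False

    def enable_arm(self):
        # Constraint: arm movement only when stabilizer legs fully deployed
        if self.legs != "extended":
            raise RuntimeError("interlock: legs not extended")
        self.arm_enabled = True

    def extend_legs(self):
        self.legs = "extended"

    def retract_legs(self):
        # Constraint: legs must not retract until arm fully stowed and disabled
        if self.arm != "stowed" or self.arm_enabled:
            raise RuntimeError("interlock: arm not stowed/disabled")
        self.legs = "retracted"

c = TileRobotController()
try:
    c.enable_arm()               # rejected: legs still retracted
except RuntimeError as e:
    print(e)                     # interlock: legs not extended
c.extend_legs()
c.enable_arm()                   # now permitted
```

The interlock checks mirror the constraints one-for-one, which keeps the traceability from constraint to design decision that the last step of the process calls for.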
-
To produce detailed scenarios for violation of
safety constraints, augment control structure with
process models
Arm Movement: Enabled / Disabled / Unknown
Stabilizer Legs: Extended / Retracted / Unknown
Manipulator Arm: Stowed / Extended / Unknown
How could become inconsistent with real state?
e.g. issue command to extend stabilizer legs but external
object could block extension or extension motor could fail
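The inconsistency can be made concrete in a few lines: a controller that updates its process model when it issues a command, rather than when feedback arrives, diverges from the plant as soon as extension is blocked (illustrative sketch, all names invented):

```python
# Sketch of a process model diverging from the real state.
class Plant:
    def __init__(self):
        self.legs = "retracted"
        self.blocked = False          # external object blocking extension
    def extend_legs(self):
        if not self.blocked:
            self.legs = "extended"

class Controller:
    def __init__(self, plant):
        self.plant = plant
        self.model_legs = "unknown"   # start in 'unknown' until feedback arrives
    def command_extend(self):
        self.plant.extend_legs()
        self.model_legs = "extended"  # FLAW: assumes the command succeeded
    def read_feedback(self):
        self.model_legs = self.plant.legs  # position sensor closes the loop

p = Plant(); p.blocked = True
c = Controller(p)
c.command_extend()
print(c.model_legs, p.legs)   # extended retracted  <- inconsistent
c.read_feedback()
print(c.model_legs, p.legs)   # retracted retracted <- consistent again
```

This is why the design needs both the "unknown" startup state and a feedback channel: the model must be driven by measured state, not by issued commands.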
-
Problems often in startup or shutdown:
e.g., emergency shutdown while servicing tiles. Stability legs manually retracted to move robot out of way. On restart, controller assumes stabilizer legs still extended and arm movement could be commanded. So use unknown state when starting up.
Do not need to know all causes, only safety constraints:
May decide to turn off arm motors when legs extended or when arm extended. Could use interlock or tell computer to power it off.
Must not move when legs extended? Power down wheel motors while legs extended.
Coordination problems
-
Some Examples and References
to Papers on Them
-
Example: Early System Architecture
Trades for Space Exploration
Part of an MIT/Draper Labs contract with NASA
Wanted to include risk, but little information available
Not possible to evaluate likelihood when no design information available
Can consider severity by using worst-case analysis associated with specific hazards.
Developed three step process:
Identify system-level hazards and associated severities
Identify mitigation strategies and associated impact
Calculate safety/risk metrics for each architecture
-
Sample
First identify system hazards and severities
-
Identified Hazards and their Severities (severity columns: H / M / Eq)

ID#  Phase               Hazard                                                                                                H  M  Eq
G1   General             Flammable substance in presence of ignition source (fire)                                             4  4  4
G2   General             Flammable substance in presence of ignition source in confined space (explosion)                      4  4  4
G3   General             Loss of life support (includes power, temperature, oxygen, air pressure, CO2, food, water, etc.)      4  4  4
G4   General             Crew injury or illness                                                                                4  4  1
G5   General             Solar or nuclear radiation exceeding safe levels                                                      3  3  2
G6   General             Collision (micrometeoroids, debris, with modules during rendezvous or separation maneuver, etc.)      4  4  4
G7   General             Loss of attitude control                                                                              4  4  4
G8   General             Engines do not ignite                                                                                 4  4  2
PL1  Pre-Launch          Damage to payload                                                                                     2  3  3
PL2  Pre-Launch          Launch delay (due to weather, pre-launch test failures, etc.)                                         1  4  1
L1   Launch              Incorrect propulsion/trajectory/control during ascent                                                 4  4  4
L2   Launch              Loss of structural integrity (due to aerodynamic loads, vibrations, etc.)                             4  4  4
L3   Launch              Incorrect stage separation                                                                            4  4  4
E1   EVA in Space        Lost in space                                                                                         4  4  1
A1   Assembly            Incorrect propulsion/control during rendezvous                                                        4  4  4
A2   Assembly            Inability to dock                                                                                     1  4  3
A3   Assembly            Inability to achieve airlock during docking                                                           1  4  3
A4   Assembly            Inability to undock                                                                                   4  4  3
T1   In-Space Transfer   Incorrect propulsion/trajectory/control during course change burn                                     4  4  3
D1   Descent             Inability to undock                                                                                   4  4  3
D2   Descent             Incorrect propulsion/trajectory/control during descent                                                4  4  4
D3   Descent             Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc.) 4  4  4
A1   Ascent              Incorrect stage separation (including ascent module disconnecting from descent stage)                 4  3  3
A2   Ascent              Incorrect propulsion/trajectory/control during ascent                                                 4  3  3
A3   Ascent              Loss of structural integrity (due to aerodynamic loads, vibrations, etc.)                             4  3  3
S1   Surface Operations  Crew members stranded on M surface during EVA                                                         4  3  3
S2   Surface Operations  Crew members lost on M surface during EVA                                                             4  3  3
S3   Surface Operations  Equipment damage (including related to lunar dust)                                                    2  3  3
NP1  Nuclear Power       Nuclear fuel released on earth surface                                                                4  4  2
NP2  Nuclear Power       Insufficient power generation (reactor doesn't work)                                                  4  3  3
NP3  Nuclear Power       Insufficient reactor cooling (leading to reactor meltdown)                                            4  3  3
RE1  Re-Entry            Inability to undock                                                                                   4  3  3
RE2  Re-Entry            Incorrect propulsion/trajectory/control during descent                                                4  3  3
RE3  Re-Entry            Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc.) 4  3  4
RE4  Re-Entry            Inclement weather                                                                                     4  2  2
-
For example, not performing a rendezvous in transit reduces hazard
of being unable to dock
-
Evaluate Each Architecture and Calculate
Safety/Risk Metrics
Create an architecture vector with all parameters for
that architecture (column C of spreadsheet)
Compute metric on architecture vector:
Calculate a Relative Hazard Mitigation Index
Calculate a Relative Severity Index
Combine into an Overall Safety/Risk Metric
Details in http://sunnyday.mit.edu/papers/issc05-
final.pdf
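The exact index definitions are in the linked paper; a toy sketch with invented weightings shows the shape of the computation (severity-weighted mitigation per candidate architecture):

```python
# Toy sketch only: hazard IDs reuse the table above, but the mitigation
# fractions and the weighting scheme here are invented, not the paper's.
hazards = {"G1": 4, "G4": 4, "PL2": 1}     # hazard id -> worst-case severity (1-4)

def safety_metric(mitigation: dict) -> float:
    """Severity-weighted average mitigation across hazards (0..1, higher = safer)."""
    total = sum(hazards.values())
    return sum(sev * mitigation.get(h, 0.0) for h, sev in hazards.items()) / total

arch_a = {"G1": 0.9, "G4": 0.5, "PL2": 1.0}   # architecture vectors: fraction of
arch_b = {"G1": 0.4, "G4": 0.9, "PL2": 0.2}   # each hazard mitigated by the design
print(round(safety_metric(arch_a), 3))
print(round(safety_metric(arch_b), 3))
```

The point of such a metric is comparative: it ranks architectures by severity-weighted mitigation without requiring likelihood estimates, which are unavailable this early.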
-
Sample Results
-
Ballistic Missile Defense System (BMDS)
Non-Advocate Safety Assessment using STPA
A layered defense to defeat all ranges of threats in all phases of flight (boost, mid-course, and terminal)
Made up of many existing systems (BMDS Elements)
Early warning radars
Aegis
Ground-Based Midcourse Defense (GMD)
Command and Control, Battle Management and Communications (C2BMC)
Others
MDA used STPA to evaluate the residual safety risk of inadvertent launch prior to deployment and test
-
[Figure: FMIS safety control structure. The Command Authority (doctrine, engagement criteria, training, TTP, workarounds) directs Operators, with exercise results, readiness status, and wargame results fed back. Operators command Fire Control (engage target, operational mode change, readiness state change, weapons free / weapons hold) and receive operational mode, readiness state, system status, track data, and weapon and system status. Radar supplies status and track data to Fire Control and accepts radar tasking and readiness mode changes; an Early Warning System exchanges status requests, launch reports, status reports, and heartbeats. Fire Control commands the Launch Station (fire enable / fire disable, operational mode change, readiness state change, interceptor tasking, task cancellation) and receives command responses, system status, and launch reports. The Launch Station commands the Launcher (launch position, stow position, perform BIT) and the Flight Computer / Interceptor Simulator (abort, arm, BIT command, task load, launch, operating mode, power, safe, software updates), receiving acknowledgements, BIT results, health and status, and launcher position. Interceptor H/W receives arm, safe, and ignite commands and returns BIT info, safe and arm status, breakwires, and voltages.]
Safety Control Structure Diagram for FMIS
-
Results
Deployment and testing held up for 6 months because so many scenarios identified for inadvertent launch (the only hazard considered so far). In many of these scenarios:
All components were operating exactly as intended
Complexity of component interactions led to unanticipated system behavior
STPA also identified component failures that could cause inadequate control (most analysis techniques consider only these failure events)
As changes are made to the system, the differences are assessed by updating the control structure diagrams and assessment analysis templates.
Adopted as primary safety approach for BMDS
-
Safety-driven Design of an Outer Planets
Explorer Spacecraft for JPL
Demonstration of approach on the design of a deep
space exploration mission spacecraft (Europa).
Defined mission hazards
Generated mission safety requirements and design
constraints
Created spacecraft control structure and system design
Performed STPA and generated component safety
requirements and design features to control hazards
http://sunnyday.mit.edu/papers/IEEE-Aerospace.pdf
(complete specifications also available)
-
Organizational and Cultural Risk
Analysis
-
Cultural and Organizational Risk
Analysis and Performance Monitoring
Apply STAMP and STPA at organizational level plus
system dynamics modeling and analysis
Goals:
Evaluating and analyzing risk
Designing and validating improvements
Monitoring risk (canary in the coal mine)
Identifying leading indicators of increasing or
unacceptable risk
-
System Dynamics
Created at MIT in 1950s by Forrester
Used a lot in Sloan School (management)
Grounded in non-linear dynamics and feedback control
Also draws on
Cognitive and social psychology
Organization theory
Economics
Other social sciences
Use to understand changes over time (dynamics of a system)
-
[Figure: simple system dynamics example. Two stocks, People who know and People who don't know, connected by one flow, the rate of sharing the news, which is driven by contacts between the two groups and the probability of contact with those in the know (a reinforcing loop). The accompanying plot shows S-shaped growth over 12 months: People who know rises toward 100 as People who don't know falls.]
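The word-of-mouth example can be simulated with its two stocks and one flow using simple Euler integration (all parameter values here are invented for illustration):

```python
# Minimal stock-and-flow simulation of the news-diffusion example:
# the flow rate depends on contacts between people who know and
# people who don't, producing the S-shaped curve in the plot.
N = 100.0              # total population
know, dont = 1.0, 99.0 # initial stocks
c = 1.0                # contact/adoption rate per month (invented)
dt = 0.25              # Euler integration step, in months

history = []
for step in range(int(12 / dt)):        # simulate 12 months
    rate = c * know * dont / N          # rate of sharing the news
    know += rate * dt
    dont -= rate * dt
    history.append(know)

print(round(know))  # nearly everyone knows by month 12
```

The reinforcing loop is visible in the code: a larger `know` raises `rate`, which raises `know` further, until the shrinking `dont` stock saturates the growth.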
-
Risk Analysis Process for Independent Technical Authority
1. Preliminary Hazard Analysis → system hazards; system safety requirements and constraints
2. Modeling the ITA Safety Control Structure → roles and responsibilities; feedback mechanisms
3. Mapping Requirements to Responsibilities → gap analysis
4. Detailed Hazard Analysis using STPA → system risks (inadequate controls)
5. Categorizing & Analyzing Risks → immediate and longer-term risks; risk factors
6. System Dynamics Modeling and Analysis → sensitivity; leading indicators; policy; structure
7. Findings and Recommendations → leading indicators and measures of effectiveness
-
1. Preliminary Hazard Analysis
System Hazard: Poor engineering and management decision-
making leading to an accident (loss).
System Safety Requirements and Constraints:
1. Safety considerations must be first and foremost in technical
decision-making.
2. Safety-related technical decision-making must be done by
eminently qualified experts with broad participation of the full
workforce.
3. Safety analyses must be available and used starting in the early
acquisition, requirements development, and design processes
and continuing through the system lifecycle.
4. The Agency must provide avenues for full expression of technical
conscience and a process for full and adequate resolution of
technical conflicts as well as conflicts between programmatic and
technical concerns.
-
Each of these was refined, e.g.,
1. Safety considerations must be first and foremost in technical decision-making.
a. State-of-the-art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.
b. Safety-related technical decision-making must be independent from programmatic considerations, including cost and schedule
c. Safety-related decision-making must be based on correct, complete, and up-to-date information.
d. Overall (final) decision-making must include transparent consideration of both safety and programmatic concerns.
e. The Agency must provide for effective assessment and improvement in safety-related decision-making.
The goal: to create a set of system safety requirements and constraints sufficient to eliminate or mitigate the hazard
-
2. Model the ITA Control Structure
-
For each component specified:
Inputs, outputs
Overall role and detailed responsibilities (requirements)
Potential inadequate control actions
Feedback requirements
For most added:
Environmental and behavior-shaping factors (context)
Mental model requirements
Controls
-
Example from System Technical
Warrant Holder
1. Establish and maintain technical policy, technical
standards, requirements, and processes for a
particular system or systems.
a. STWH shall ensure program identifies and imposes
appropriate technical requirements at
program/project formulation to ensure safe and
reliable operations.
b. STWH shall ensure inclusion of the consideration of
risk, failure, and hazards in technical requirements.
c. STWH shall approve the set of technical
requirements and any changes to them
d. STWH shall approve verification plans for the
system(s)
-
3. Map System Requirements to
Component Responsibilities
Took each of system safety requirements and
traced to component responsibilities
(requirements)
Identified omissions, conflicts, potential issues
Recommended additions and changes
Added responsibilities when missing in order for
risk analysis to be complete.
-
4. Hazard Analysis using STPA
General types of risks for ITA:
1. Unsafe decisions are made by or approved by ITA
2. Safe decisions are disallowed (overly conservative decision-making that undermines the goals of NASA and long-term support for ITA)
3. Decision-making takes too long, minimizing impact and also reducing support for ITA
4. Good decisions are made by ITA, but do not have adequate impact on system design, construction, and operation
Applied to each of component responsibilities
Identified basic and coordination risks
-
Example from Risks List
CE Responsibility: Develop, monitor, and maintain technical
standards and policy
Risks:
1. General technical and safety standards and
requirements are not created (IC)
2. Inadequate standards and requirements are created
(IC)
3. Standards degrade as changed over time due to
external pressures to weaken them. Process for
approving changes is flawed (LT).
4. Standards not changed or updated over time as the
environment changes (LT).
-
5. Categorize and Analyze Risks
Large number resulted so:
Categorized risks as
Immediate concern
Longer-term concern
Standard Process
Used system dynamics models to identify which risks
were most important to assess and measure
Provide most important assessment of current level of risk
Most likely to detect increasing risk early enough to prevent
significant losses (leading indicators)
-
[Figure: overview of the system dynamics model's nine interacting sub-models: Risk; Shuttle Aging and Maintenance; System Safety Efforts & Efficacy; Perceived Success by Administration; System Safety Knowledge, Skills & Staffing; Launch Rate; System Safety Resource Allocation; Incident Learning & Corrective Action; ITA.]
-
6. System Dynamics Modeling
Modified our NASA manned space program model
to include Independent Technical Authority (ITA)
Independently tested and validated the nine models,
then connected them
Ran analyses:
Sensitivity analyses to investigate impact of various
parameters on system dynamics and risk
System behavior mode investigation
Metrics evaluations
Additional scenarios and insights
-
Example Result
ITA has potential to significantly reduce risk and to
sustain an acceptable risk level
But also found significant risk of unsuccessful
implementation of ITA that needs to be monitored
200-run Monte-Carlo sensitivity analysis
Random variations of +/- 30% of baseline exogenous
parameter values
-
Successful vs. Unsuccessful ITA Implementation
[Figure: two plots over time, an indicator of effectiveness and credibility of ITA (scale 0 to 1) and system technical risk, each comparing a successful trajectory (1) with an unsuccessful one (2).]
-
Successful Scenarios
Self-sustaining for short period of time if conditions in place for early acceptance.
Provides foundation for a solid, sustainable ITA program implementation under right conditions.
In successful scenarios:
After period of high success, effectiveness slowly declines
Complacency
Safety seen as solved problem
Resources allocated to more urgent matters
But risk still at acceptable levels and extended period of nearly steady-state equilibrium with risk at low levels
-
Unsuccessful Implementation Scenarios
Effectiveness quickly starts to decline and reaches
unacceptable levels
Limited ability of ITA to have sustained effect on system
Hazardous events start to occur, safety increasingly
perceived as urgent problem
More resources allocated to safety but TA and TWHs have
lost so much credibility they cannot effectively contribute to
risk mitigation anymore.
Risk increases dramatically
ITA and safety staff overwhelmed with safety problems
Start to approve an increasing number of waivers so they can continue to fly.
-
Unsuccessful Scenario Factors
As effectiveness of ITA decreases, the number of problems increases
Investigation requirements increase
Corners may be cut to compensate
Results in lower-quality investigation resolutions and
corrective actions
TWHs and Trusted Agents become saturated and cannot attend
to each investigation in timely manner
Bottleneck created by requiring TWHs to authorize all safety-
related decisions, making things worse
Want to detect this reinforcing loop while interventions
still possible and not overly costly (resources, downtime)
-
Identification of Lagging vs. Leading
Indicators
Number of waivers issued: good indicator but lags rapid increase in risk
Incidents under investigation: a better leading indicator
[Figure: two plots of system technical risk over time, one against outstanding accumulated waivers, one against incidents under investigation.]
-
Modeling Exploration Enterprise (ESMD)
Built a large STAMP plus system dynamics model of Project Constellation
Development-oriented vs. the operations-oriented Space Shuttle model
Demonstrating how it can be used for risk
management decision-making
-
Risk Management in NASA's New Exploration Systems Mission Directorate
Created an executable model, using input from the NASA workforce, to analyze relative effects of management strategies on schedule, cost, safety, and performance
Developed scenarios to analyze risks identified by the Agency's workforce
Performed preliminary analysis on the effects of hiring constraints, management reserves, independence of safety decision-making, requirements changes, etc.
Derived preliminary recommendations to mitigate and monitor program-level risks
-
Structure of System Dynamics Model
[Figure: model structure linking Congress and White House decision-making; NASA Administration and ESMD decision-making; OSMA and OCE; Exploration Systems Engineering Management; technical personnel resources and experience; system development and safety analysis completion; efforts and efficacy of other technical personnel; engineering procurement; NESC; Safety and Mission Assurance (SMA status, efficacy, knowledge and skills); and Exploration Systems Program/Project Management (task completion, schedule pressure, resource allocation).]
-
[Figure: "Engineering - System Development Completion and Safety Analyses" system dynamics model, showing the safety rework cycle. Stocks and flows link design work (remaining, completed, and completed with undiscovered safety and integration problems), technology development tasks (pending, completed, abandoned, and used in design), and hazard analyses (pending, completed, used in design, or discarded) through rates for task completion, utilization, flaw discovery, unplanned rework, and acceptance of problems or unsatisfied requirements. Influencing factors include design schedule pressure from management, capacity for design and technology development work, fraction of hazard analyses too late to influence design, average hazard analysis quality, safety assurance (SMA) resources and efficacy, time to discover flaws, incentives to report flaws, efficacy of system integration, ability to perform contractor safety oversight, and system design overwork. Outputs include additional operations cost for safety and integration workarounds, system performance, and the safety of the operational system.]
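The rework-cycle structure at the heart of the model above can be illustrated with a minimal stock-and-flow simulation. This is only a sketch: the stock names mirror the diagram, but the function, all rates, and all parameter values are assumptions for illustration and are not taken from the NASA model.

```python
# Minimal sketch of the safety rework cycle: design work flows to
# "completed", but a fraction is finished with undiscovered safety and
# integration flaws; discovered flaws re-enter the backlog as unplanned
# rework. All parameter values are illustrative assumptions.

def simulate_rework(total_work=100.0, completion_rate=5.0,
                    flaw_fraction=0.3, discovery_rate=0.2,
                    months=60, dt=1.0):
    remaining = total_work      # Design Work Remaining (stock)
    completed_ok = 0.0          # Design Work Completed (stock)
    undiscovered = 0.0          # Completed with undiscovered flaws (stock)
    for _ in range(int(months / dt)):
        done = min(completion_rate * dt, remaining)
        remaining -= done
        # A fraction of finished work carries undiscovered flaws
        undiscovered += flaw_fraction * done
        completed_ok += (1 - flaw_fraction) * done
        # Flaw discovery returns work to the backlog (unplanned rework)
        found = discovery_rate * undiscovered * dt
        undiscovered -= found
        remaining += found
    return remaining, completed_ok, undiscovered

remaining, ok, hidden = simulate_rework()
```

Raising `flaw_fraction` or lowering `discovery_rate` keeps more flawed work hidden longer, which illustrates the dynamic the model captures: hazard analyses that arrive too late to influence design enlarge the rework cycle and its schedule, cost, and safety impacts.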
-
NASA ESMD Workforce Planning
[Plot: ESMD Employee Gap (0 to 4,000 employees) vs. Time (0 to 150 months), with reinforcing influences from Transfers from Shuttle and Limits on Hiring.]
Simulation varied:
- Initial experience distribution of ESMD civil servant workforce
- Maximum civil servant hiring rates
- Transfers from Shuttle ops during Shuttle retirement
Important Issues:
- Increase in retirements
- Hiring limits
- Transfers
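The workforce-gap dynamics in the plot above can be sketched as a simple month-by-month balance: hiring and Shuttle transfers close the gap while retirements widen it, and a hiring cap limits how fast the gap can close. The function and every parameter value below are illustrative assumptions, not the actual inputs or outputs of the ESMD simulation.

```python
# Illustrative month-by-month balance for the ESMD employee gap:
# gap shrinks with (capped) hiring and Shuttle transfers, grows with
# retirements. All parameter values are assumptions for illustration.

def employee_gap(initial_gap=4000.0, max_hires_per_month=40.0,
                 transfers_per_month=15.0, retirements_per_month=20.0,
                 months=150):
    gap = initial_gap
    history = [gap]
    for _ in range(months):
        inflow = max_hires_per_month + transfers_per_month  # staff gained
        # Retirements offset part of the inflow; gap cannot go negative
        gap = max(0.0, gap - inflow + retirements_per_month)
        history.append(gap)
    return history

trace = employee_gap()
```

Tightening `max_hires_per_month` or raising `retirements_per_month` delays gap closure, which is the kind of trade-off the simulation was used to explore.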
-
Example: Schedule Pressure and Safety Priority
in Developing the Shuttle Replacement
1. Overly aggressive schedule enforcement has little effect on completion time
-
Using Model for Policy Decisions
The results of the analyses can be used to make policy decisions, for example:
Reduce limitations (external and internal) that will impede civil servant hiring in the next few years
Monitor management reserves and use them to alleviate overwork
Enhance, monitor, and maintain influence of safety analysts on decision-making
Rotate Rising Stars in the Agency through the safety organization
Monitor overwork of SE&I and safety engineering, as they control the rework cycle (safety, cost, and schedule impact)
Continue planning to minimize downstream requirements changes, and allow for on/off ramps (technologies and designs) to reduce negative impact
-
Exploring Limits of Use
Medical error and medical safety (risk analysis of
outpatient surgery at Beth Israel Deaconess Hospital)
Safety in pharmaceutical testing and drug development
Food safety
Control of corporate fraud
-
For More Information
New book draft on STAMP
http://sunnyday.mit.edu/book2.html
(link to CER Early Trades paper also here)
NASA ITA Risk Analysis Final Report
http://sunnyday.mit.edu/ITA-Risk-Analysis.doc
NASA ESMD Risk Management Demonstration
http://sunnyday.mit.edu/ESMD-Final-Report.pdf
-
Summary and Conclusions
A more powerful approach to hazard analysis and system safety engineering
Based on a new, more comprehensive model of accident causation
Includes what we do now but also much more
Works for the complex, software-intensive systems (and systems-of-systems) we are building
Considers the entire socio-technical system
Can be used early in concept formation and development to guide design for safety
Has been validated and is being used on real systems
Potential for very powerful automated tools and assistance
-
Differences with Traditional Approaches
More comprehensive view of causality
A top-down systems approach to preventing losses
Includes organizational, social, and cultural aspects of risk as well as the physical system
Emphasizes non-probabilistic and qualitative approaches
Combines static (structural) and behavioral models
Looks at dynamics and changes over time
Migration toward states of increasing risk
Includes human decision making and mental models
Handles much more complex systems than traditional
safety engineering approaches