-
A New Approach to
System Safety Engineering
Nancy G. Leveson, MIT
System Safety Engineering: Back to the Future
http://sunnyday.mit.edu/book2.html
Copyright Nancy Leveson, Aug. 2006
-
Outline of Day 1
Why a new approach is needed
STAMP: A new accident model based on system theory
(control)
Uses
Accident and incident investigation
Hazard Analysis (STPA) and Design for Safety
Non-Advocate Safety Assessment
Cultural and organizational risk analysis
-
Why a new approach is needed
Traditional approaches developed for relatively
simple electro-mechanical systems
Accidents in complex, software-intensive systems
are changing their nature
We need more effective techniques in these new
systems
-
Chain-of-Events Model
Explains accidents in terms of multiple events, sequenced
as a forward chain over time.
Simple, direct relationship between events in chain
Events almost always involve component failure, human
error, or energy-related event
Forms the basis for most safety-engineering and reliability
engineering analysis:
e.g., FTA, PRA, FMECA, Event Trees, etc.
and design:
e.g., redundancy, overdesign, safety margins, etc.
-
Chain-of-events example
-
Accident with No Component Failures
-
Types of Accidents
Component Failure Accidents
Single or multiple component failures
Usually assume random failure
System Accidents
Arise in interactions among components
Related to interactive complexity and tight coupling
Exacerbated by introduction of computers and
software
New technology introduces unknowns and unk-unks
-
Interactive Complexity
Critical factor is intellectual manageability
A simple system has a small number of unknowns in
its interactions (within system and with environment)
Interactively complex (intellectually unmanageable)
when the level of interactions reaches the point where
they can no longer be thoroughly:
Planned
Understood
Anticipated
Guarded against
-
Tight Coupling
Tightly coupled system is one that is highly
interdependent
Each part linked to many other parts
Failure or unplanned behavior in one can rapidly affect status
of others
Processes are time-dependent and cannot wait
Little slack in system
Sequences are invariant, only one way to reach a goal
System accidents are caused by unplanned and
dysfunctional interactions
Coupling increases number of interfaces and potential
interactions
-
Other Types of Complexity
Non-linear complexity
Cause and effect not related in an obvious way
Dynamic Complexity
Related to changes over time
Decompositional
Structural decomposition not consistent with
functional decomposition
-
Limitations of Chain-of-Events Model
Social and organizational factors in accidents
System accidents
Software
Adaptation
Systems are continually changing
Systems and organizations migrate toward accidents
(states of high risk) under cost and productivity
pressures in an aggressive, competitive environment
-
Limitations (2)
Human error
Defined as deviation from normative procedures, but
operators always deviate from standard procedures
Normative vs. effective procedures
Sometimes violation of rules has prevented accidents
Cannot effectively model human behavior by
decomposing it into individual decisions and acts and
studying it in isolation from
Physical and social context
Value system in which it takes place
Dynamic work process
Less successful actions are natural part of search by
operator for optimal performance
-
Mental Models
-
Exxon Valdez
Shortly after midnight, March 24, 1989, tanker Exxon Valdez ran aground on Bligh Reef (Alaska)
11 million gallons of crude oil released
Over 1500 miles of shoreline polluted
Exxon and government put responsibility on tanker Captain Hazelwood, who was disciplined and fired
Was he to blame?
State-of-the-art iceberg monitoring equipment promised by oil industry, but never installed. Exxon Valdez traveling outside normal sea lane in order to avoid icebergs thought to be in area
Radar station in city of Valdez, which was responsible for monitoring the location of tanker traffic in Prince William Sound, had replaced its radar with much less powerful equipment. Location of tankers near Bligh Reef could not be monitored with this equipment.
-
Congressional approval of Alaska oil pipeline and tanker
transport network included an agreement by oil corporations to
build and use double-hulled tankers. Exxon Valdez did not
have a double hull.
Crew fatigue was typical on tankers
In 1977, average oil tanker operating out of Valdez had a crew of 40
people. By 1989, crew size had been cut in half.
Crews routinely worked 12-14 hour shifts, plus extensive overtime
Exxon Valdez had arrived in port at 11 pm the night before. The crew
rushed to get the tanker loaded for departure the next evening
Coast Guard at Valdez assigned to conduct safety inspections
of tankers. It did not perform these inspections. Its staff had
been cut by one-third.
-
Tanker crews relied on the Coast Guard to plot their position continually.
Coast Guard operating manual required this.
Practice of tracking ships all the way out to Bligh Reef had been discontinued.
Tanker crews were never informed of the change.
Spill response teams and equipment were not readily available. Seriously impaired attempts to contain and recover the spilled oil.
Summary:
Safeguards designed to avoid and mitigate effects of an oil spill were not in place or were not operational
By focusing exclusively on blame, the opportunity to learn from mistakes is lost.
Postscript:
Captain Hazelwood was tried for being drunk the night the Exxon Valdez went aground. He was found not guilty
-
Hierarchical models
-
Hierarchical analysis example
-
The Role of Software in Accidents
-
The Computer Revolution
Software is simply the design of a machine abstracted from its physical realization
Machines that were physically impossible or impractical to build become feasible
Design can be changed without retooling or manufacturing
Can concentrate on steps to be achieved without worrying about how steps will be realized physically
General-Purpose Machine + Software = Special-Purpose Machine
-
Advantages = Disadvantages
Computer is so powerful and useful because it has eliminated
many of the physical constraints of previous technology
Both its blessing and its curse
No longer have to worry about physical realization of
our designs
But no longer have physical laws that limit the
complexity of our designs.
-
The Curse of Flexibility
Software is the resting place of afterthoughts
No physical constraints
To enforce discipline in design, construction, and
modification
To control complexity
So flexible that we start working with it before fully
understanding what we need to do
And they looked upon the software and saw that it was
good, but they just had to add one other feature
-
Abstraction from Physical Design
Software engineers are doing physical design
Most operational software errors related to requirements (particularly incompleteness)
Software failure modes are different
Usually does exactly what you tell it to do
Problems occur from operation, not lack of operation
Usually doing exactly what software engineers wanted
Autopilot Expert -> Requirements -> Software Engineer -> Design of Autopilot
-
Safety vs. Reliability
Safety and reliability are NOT the same
Sometimes increasing one can even decrease the other.
Making all the components highly reliable will have no impact on system accidents.
For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.
But accidents in high-tech systems are changing their nature, and we must change our approaches to safety accordingly.
-
It's only a random
failure, sir! It will
never happen again.
-
Reliability Engineering Approach to Safety
Reliability: The probability an item will perform its required function in the specified manner over a given time period and under specified or assumed conditions.
(Note: Most accidents result from errors in specified requirements or functions and deviations from assumed conditions)
Concerned primarily with failures and failure rate reduction:
Redundancy
Safety factors and margins
Derating
Screening
Timed replacements
-
Reliability Engineering Approach to Safety
Assumes accidents are caused by component failure
Positive:
Techniques exist to increase component reliability
Failure rates in hardware are quantifiable
Negative:
Omits important factors in accidents
May decrease safety
Many accidents occur without any component failure
Caused by equipment operation outside parameters and time limits upon which reliability analyses are based.
Caused by interactions of components all operating according to specification.
Highly reliable components are not necessarily safe
-
Software-Related Accidents
Are usually caused by flawed requirements
Incomplete or wrong assumptions about operation of
controlled system or required operation of computer
Unhandled controlled-system states and
environmental conditions
Merely trying to get the software correct or to make
it reliable will not make it safer under these
conditions.
-
Software-Related Accidents (2)
Software may be highly reliable and correct and
still be unsafe:
Correctly implements requirements but specified
behavior unsafe from a system perspective.
Requirements do not specify some particular behavior
required for system safety (incomplete)
Software has unintended (and unsafe) behavior
beyond what is specified in requirements.
-
MPL Requirements Tracing Flaw
SYSTEM REQUIREMENTS
1. The touchdown sensors shall
be sampled at a 100-Hz rate.
2. The sampling process shall be
initiated prior to lander entry to
keep processor demand
constant.
3. However, the use of the
touchdown sensor data shall
not begin until 12 m above the
surface.
SOFTWARE REQUIREMENTS
1. The lander flight software shall
cyclically check the state of each of
the three touchdown sensors (one
per leg) at 100 Hz during EDL.
2. The lander flight software shall be
able to cyclically check the
touchdown event state with or
without touchdown event
generation enabled.
????
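The traceability gap flagged by the "????" above can be illustrated with a hedged sketch (illustrative Python only, not the actual MPL flight software; all names and numbers are invented): if touchdown events are sampled and latched throughout EDL, a spurious transient latched at leg deployment is trusted the moment the 12 m gate opens.

```python
# Illustrative sketch only -- not the MPL flight software. It shows how
# "sample at 100 Hz during EDL" plus "use data below 12 m" can go wrong when
# a spurious touchdown signal latched at leg deployment is never cleared.

def flawed_touchdown_monitor(sensor_events, altitudes):
    """Return the step index at which engines are cut, or None."""
    touchdown_latched = False
    for i, (event, alt) in enumerate(zip(sensor_events, altitudes)):
        if event:
            touchdown_latched = True      # latches even far above 12 m (the flaw)
        if alt <= 12 and touchdown_latched:
            return i                      # believes it has landed; cuts engines
    return None

def safe_touchdown_monitor(sensor_events, altitudes):
    """Same structure, but touchdown data is not trusted above 12 m."""
    touchdown_latched = False
    for i, (event, alt) in enumerate(zip(sensor_events, altitudes)):
        if alt > 12:
            touchdown_latched = False     # discard transients before the gate
        elif event:
            touchdown_latched = True
        if alt <= 12 and touchdown_latched:
            return i
    return None

# A transient at leg deployment (~1500 m) dooms the flawed version:
events    = [False, True, False, False, False]
altitudes = [2000, 1500, 100, 12, 5]
```

Both functions are "correct" against the software requirements as written; only the system requirement ("use shall not begin until 12 m") distinguishes them.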
-
Reliability Approach to Software Safety
Using standard engineering techniques of
Preventing failures through redundancy
Increasing component reliability
Reuse of designs and learning from experience
will not work for software and system accidents
-
Preventing Failures Through
Redundancy
Redundancy simply makes complexity worse
NASA experimental aircraft example
Any solutions that involve adding complexity will not solve
problems that stem from intellectual unmanageability and
interactive complexity
Majority of software-related accidents caused by
requirements errors
Does not work for software even if accident is
caused by a software implementation error
Software errors not caused by random wear-out failures
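A hedged sketch of the point about redundancy (the clamping rule below is an invented stand-in for a requirements error): when all channels implement the same flawed requirement, majority voting is confidently wrong.

```python
# Three 'redundant' channels, written independently but from the SAME flawed
# requirement: readings above 100 are assumed to be sensor noise and clamped.
# (The clamping rule is invented for illustration.)
def channel(reading):
    return min(reading, 100)

def voted_output(reading):
    votes = [channel(reading) for _ in range(3)]
    # Majority voting masks random single-channel failures...
    return max(set(votes), key=votes.count)

# ...but a genuine reading of 150 is unanimously 'corrected' to 100,
# because the error is common-mode, not random wear-out.
```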
-
Increasing Software Reliability (Integrity)
Appearing in many new international standards for software safety (e.g., IEC 61508)
Safety integrity level (SIL)
Sometimes given a reliability number (e.g., 10^-9)
Can software reliability be measured? What does it mean?
What does it have to do with safety?
Safety involves more than simply getting the software correct:
Example: altitude switch
1. Signal safety-increasing:
Require any of the three altimeters to report below threshold
2. Signal safety-decreasing:
Require all three altimeters to report below threshold
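The altitude-switch asymmetry can be sketched in a few lines (threshold and readings are made-up values): the same three altimeters call for different voting logic depending on whether the issued signal increases or decreases safety, so "correct" voting logic is meaningless without the system-level safety context.

```python
# Illustrative sketch of the altitude-switch example. Threshold and readings
# are invented values, not from the lecture.
THRESHOLD = 2000  # feet

def safety_increasing_signal(altimeters):
    # Fail toward safety: ANY altimeter below threshold triggers the signal.
    return any(a < THRESHOLD for a in altimeters)

def safety_decreasing_signal(altimeters):
    # Demand consensus: ALL altimeters must agree before the risky action.
    return all(a < THRESHOLD for a in altimeters)

readings = [1800, 2500, 2400]  # one altimeter reports below threshold
```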
-
Software Component Reuse
One of most common factors in software-related
accidents
Software contains assumptions about its environment
Accidents occur when these assumptions are incorrect
Therac-25
Ariane 5
U.K. ATC software
Mars Climate Orbiter
Most likely to change the features embedded in or
controlled by the software
COTS makes safety analysis more difficult
-
Safety and (component or system) reliability
are different qualities in complex systems!
Increasing one will not necessarily increase
the other.
So what do we do?
-
A Possible Solution
Enforce discipline and control complexity
Limits have changed from structural integrity and
physical constraints of materials to intellectual limits
Improve communication among engineers
Build safety in by enforcing constraints on behavior
Controller contributes to accidents not by failing but by:
1. Not enforcing safety-related constraints on behavior
2. Commanding behavior that violates safety constraints
-
Example
(Chemical Reactor)
System Safety Constraint:
Water must be flowing into reflux condenser
whenever catalyst is added to reactor
Software (Controller) Safety Constraint:
Software must always open water valve
before catalyst valve
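One way to read the software safety constraint is as an ordering invariant that the controller itself enforces. A minimal sketch (class and method names are hypothetical, not from the lecture):

```python
# Minimal sketch: the controller enforces the ordering constraint directly,
# rather than relying on every command sequence happening to respect it.
# Class and method names are hypothetical.
class ReactorController:
    def __init__(self):
        self.water_valve_open = False
        self.catalyst_valve_open = False

    def open_water_valve(self):
        self.water_valve_open = True

    def open_catalyst_valve(self):
        # Software safety constraint: water must be flowing into the reflux
        # condenser whenever catalyst is added to the reactor.
        if not self.water_valve_open:
            raise RuntimeError("safety constraint violated: open water valve first")
        self.catalyst_valve_open = True
```

The point is not the particular check but where it lives: the constraint is part of the controller's specification, not an implicit property of one command sequence.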
-
Conclusion
The primary safety problem in complex, software-
intensive systems is the lack of appropriate
constraints on design
The job of the system safety engineer is to
Identify the constraints necessary to maintain safety
Ensure the system (including software) design
enforces them
-
Introduction to System
Safety Engineering
-
A Non-System Safety Example:
Nuclear Power (Defense in Depth)
Multiple independent barriers to propagation of
malfunction
Emphasis on component reliability and use of lots of
redundancy
Handling single failures (no single failure of any
components will disable any barrier)
Protection (safety) systems: automatic system shutdown
Emphasis on reliability and availability of shutdown system
and physical system barriers (using redundancy)
-
Why is this effective?
Relatively slow pace of basic design changes
Use of well-understood and debugged designs
Ability to learn from experience
Conservatism in design
Slow introduction of new technology
Limited interactive complexity and coupling
-
System Safety
Grew out of ballistic missile systems of the 1960s
Emphasizes building in safety rather than adding it on to a
completed design
Looks at systems as a whole, not just components
A top-down systems approach to accident prevention
Takes a larger view of accident causes than just component
failures (includes interactions among components)
Emphasizes hazard analysis and design to eliminate or
control hazards
Emphasizes qualitative rather than quantitative approaches
-
System Safety Overview
A planned, disciplined, and systematic approach to
preventing or reducing accidents throughout the life cycle of
a system.
Organized common sense (Mueller, 1968)
Primary concern is the management of hazards
Hazard identification, evaluation, elimination, and control
Through analysis, design, and management
MIL-STD-882
-
System Safety Overview (2)
Analysis:
Hazard analysis and control is a continuous, iterative process throughout system development and use.
Design: Hazard resolution precedence
1. Eliminate the hazard
2. Prevent or minimize the occurrence of the hazard
3. Control the hazard if it occurs
4. Minimize damage
Management:
Audit trails, communication channels, etc.
-
System Safety in Software-Intensive
Systems
While the system safety approach was developed for and
works for complex, technologically advanced systems, new
methods are required
Software particularly stretches traditional methods
Need new hazard analysis and other approaches for
complex, software-intensive systems
Rest of day 1 shows a new way to implement the basic
system safety approach
-
STAMP
A new accident causation
model using Systems Theory
(vs. Reliability Theory)
-
Introduction to Systems Theory
Ways to cope with complexity
1. Analytic Reduction
2. Statistics
-
Analytic Reduction
Divide system into distinct parts for analysis
Physical aspects: separate physical components
Behavior: events over time
Examine parts separately
Assumes such separation possible:
1. The division into parts will not distort the
phenomenon
Each component or subsystem operates independently
Analysis results not distorted when considering components
separately
-
Analytic Reduction (2)
2. Components act the same when examined singly as
when playing their part in the whole
Components or events not subject to feedback loops and
non-linear interactions
3. Principles governing the assembling of components
into the whole are themselves straightforward
Interactions among subsystems simple enough that they can be
considered separately from the behavior of the subsystems themselves
Precise nature of interactions is known
Interactions can be examined pairwise
Called Organized Simplicity
-
Statistics
Treat system as a structureless mass with
interchangeable parts
Use Law of Large Numbers to describe behavior in
terms of averages
Assumes components are sufficiently regular and
random in their behavior that they can be studied
statistically
Called Unorganized Complexity
-
Complex, Software-Intensive Systems
Too complex for complete analysis
Separation into (interacting) subsystems distorts the results
The most important properties are emergent
Too organized for statistics
Too much underlying structure that distorts the statistics
Called Organized Complexity
-
Systems Theory
Developed for biology (von Bertalanffy) and engineering
(Norbert Wiener)
Basis of system engineering and system safety
ICBM systems of the 1950s
Developed to handle systems with organized complexity
-
Systems Theory (2)
Focuses on systems taken as a whole, not on parts taken separately
Some properties can only be treated adequately in their entirety, taking into account all social and technical aspects
These properties derive from relationships among the parts of the system
How they interact and fit together
Two pairs of ideas
1. Hierarchy and emergence
2. Communication and control
-
Hierarchy and Emergence
Complex systems can be modeled as a hierarchy of
organizational levels
Each level more complex than one below
Levels characterized by emergent properties
Irreducible
Represent constraints on the degree of freedom of
components at lower level
Safety is an emergent system property
It is NOT a component property
It can only be analyzed in the context of the whole
-
Example
Control
Structure
-
Communication and Control
Hierarchies characterized by control processes working at
the interfaces between levels
A control action imposes constraints upon the activity at a
lower level of the hierarchy
Open systems are viewed as interrelated components kept
in a state of dynamic equilibrium by feedback loops of
information and control
Control in open systems implies need for communication
-
Control processes operate between levels:
Controller (holds a Model of the Process) -> Control Actions -> Controlled Process
Controlled Process -> Feedback -> Controller
Process models must contain:
- Required relationship among process variables
- Current state (values of the process variables)
- The ways the process can change state
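The three required elements of a process model can be sketched as a small class (the valve/flow relationship is an invented illustration, not from the lecture):

```python
# Sketch of the three elements a process model must contain. The valve/flow
# relationship is invented for illustration.
class ProcessModel:
    def __init__(self):
        # 2. Current state: values of the process variables.
        self.state = {"valve": "closed", "flow": 0}

    def apply_feedback(self, measurements):
        # 3. Ways the process can change state: here the model's state
        # changes only when feedback reports new measured values.
        self.state.update(measurements)

    def consistent(self):
        # 1. Required relationship among process variables:
        # nonzero flow is possible only when the valve is open.
        return self.state["flow"] == 0 or self.state["valve"] == "open"

model = ProcessModel()
model.apply_feedback({"flow": 5})  # flow reported, valve still believed closed
```

When `consistent()` is false, the controller's picture of the plant has diverged from what is physically possible, which is exactly the condition the next slides tie to accidents.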
-
Relationship Between Safety and
Process Models
Accidents occur when models do not match the process and
Incorrect control commands given
Correct ones not given
Correct commands given at wrong time (too early, too late)
Control stops too soon
(Note the relationship to system accidents and to software's role in accidents)
-
Relationship Between Safety and
Process Models (2)
How do they become inconsistent?
Wrong from beginning
Missing or incorrect feedback
Not updated correctly
Time lags not accounted for
Resulting in
Uncontrolled disturbances
Unhandled process states
Inadvertently commanding system into a hazardous state
Unhandled or incorrectly handled system component failures
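A one-screen sketch of "missing or incorrect feedback" (tank levels invented for illustration): a command that is correct with respect to the controller's stale model is hazardous with respect to the actual process.

```python
# Illustrative sketch: the feedback that updates the controller's process
# model is lost, so the model drifts from the real process state.
real_level = 95    # actual tank level (%), unknown to the controller
model_level = 40   # controller's stale belief (last feedback received)

def pump_command(believed_level):
    # Controller logic is 'correct' for its model of the process...
    return "pump_on" if believed_level < 80 else "pump_off"

command = pump_command(model_level)  # issues "pump_on"
# ...but hazardous for the real process: pumping into a nearly full tank.
overflow_hazard = (command == "pump_on" and real_level >= 90)
```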
-
Relationship Between Safety and
Human Mental Models
Explains most human/computer interaction problems
Pilots and others do not understand the automation
Or don't get feedback to update mental models, or disbelieve it
What did it just do?
Why did it do that?
What will it do next?
How did it get us into this state?
How do I get it to do what I want?
Why won't it let us do that?
What caused the failure?
What can we do so it does not happen again?
-
Mental Models
-
Relationship Between Safety and
Human Mental Models (2)
Also explains developer errors. May have incorrect
model of
Required system or software behavior for safety
Development process
Physical laws
Etc.
-
STAMP
Systems-Theoretic Accident Model and Processes
Accidents are not simply an event or chain of events but involve a complex, dynamic process
Based on systems and control theory
Accidents arise from interactions among humans, machines, and the environment (not just component failures)
Goal is not to prevent failures but to enforce safety constraints on system behavior
-
STAMP (2)
View accidents as a control problem
O-ring did not control propellant gas release by sealing gap in field joint
Software did not adequately control descent speed of Mars
Polar Lander
Events are the result of the inadequate control
Result from lack of enforcement of safety constraints
-
A Broad View of Control
Does not imply need for a controller
Component failures and dysfunctional interactions may be controlled through design
(e.g., redundancy, interlocks, fail-safe design)
or through process
Manufacturing processes and procedures
Maintenance processes
Operations
Does imply the need to enforce the safety constraints in some way
New model includes what we do now and more
-
STAMP (3)
Safety is an emergent property that arises when
system components interact with each other within a
larger environment
A set of constraints related to behavior of system
components enforces that property
Accidents occur when interactions violate those
constraints (a lack of appropriate constraints on the
interactions)
Controllers embody or enforce those constraints
-
Example Safety Constraints
Build safety in by enforcing safety constraints on behavior
Controllers contribute to accidents not by failing but by:
1. Not enforcing safety-related constraints on behavior
2. Commanding behavior that violates safety constraints
System Safety Constraint:
Water must be flowing into reflux condenser whenever catalyst
is added to reactor
Software Safety Constraint:
Software must always open water valve before catalyst valve
-
STAMP (4)
Systems are not treated as a static design
A socio-technical system is a dynamic process
continually adapting to achieve its ends and to react
to changes in itself and its environment
Migration toward states of high risk
Preventing accidents requires designing a control
structure to enforce constraints on system behavior
and adaptation
-
Example
Control
Structure
-
Intelligent Cruise Control
-
Accident Causality
Accidents occur when
Control structure or control actions do not enforce
safety constraints
Unhandled environmental disturbances or conditions
Unhandled or uncontrolled component failures
Dysfunctional (unsafe) interactions among components
Control structure degrades over time (asynchronous
evolution)
Control actions inadequately coordinated among
multiple controllers
-
Dysfunctional Controller Interactions
Boundary areas
Overlap areas (side effects of decisions and control
actions)
(Figures: Controller 1 and Controller 2 each controlling its own process, with boundary areas between them; and Controller 1 and Controller 2 both controlling a single process, with overlap areas)
-
Uncoordinated Control Agents
SAFE STATE: ATC (control agent) provides coordinated instructions to both planes
SAFE STATE: TCAS (control agent) provides coordinated instructions to both planes
UNSAFE STATE: Both TCAS and ATC provide uncoordinated and independent instructions, with no coordination between the two control agents
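The unsafe state can be sketched in a few lines (instruction values invented to mirror the pattern seen at Überlingen): each agent's instruction pair is internally coordinated, but when each crew obeys a different agent, both aircraft can be sent the same way.

```python
# Illustrative sketch of uncoordinated control agents. Each agent issues an
# internally consistent instruction pair; the hazard appears only when the
# two crews obey different agents. Values are invented for illustration.
ATC_INSTRUCTIONS  = {"plane_A": "descend", "plane_B": "climb"}
TCAS_INSTRUCTIONS = {"plane_A": "climb",   "plane_B": "descend"}

def maneuver(plane, obeys):
    source = ATC_INSTRUCTIONS if obeys[plane] == "ATC" else TCAS_INSTRUCTIONS
    return source[plane]

obeys = {"plane_A": "ATC", "plane_B": "TCAS"}  # mixed compliance
maneuvers = {p: maneuver(p, obeys) for p in obeys}
# Both aircraft now descend toward each other, although neither agent,
# taken alone, issued an unsafe pair of instructions.
```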
-
Root Cause Analysis Example:
Exxon Valdez
(Control structure: Congress -> Coast Guard and Exxon -> Tanker Captain and Crew -> Tanker)
For each component identify:
Responsibility (safety requirements and constraints)
Inadequate control actions
Context in which decisions made
Mental model flaws
-
Applying STAMP
To understand accidents, need to examine the safety control structure itself to determine why it was inadequate to maintain safety constraints and why the events occurred
To prevent accidents, need to create an effective safety control structure to enforce the system safety constraints
Not a blame model but a "why" model
-
Modeling Accidents Using STAMP
Three types of models are used:
1. Static safety control structure
2. Dynamic safety control structure
Shows how control structure changed over time
3. Behavioral dynamics
Dynamic processes behind changes, i.e., why the
system changed over time
-
Simplified System Dynamics Model of Columbia Accident
-
STAMP vs. Traditional Accident Models
Examines inter-relationships rather than linear
cause-effect chains.
Looks at the processes behind the events
Includes entire socio-technical system
Includes behavioral dynamics (changes over time)
Want to not just react to accidents and impose controls for
a while, but understand why controls drift toward
ineffectiveness over time and
Change those factors if possible
Detect the drift before accidents occur
-
Uses for STAMP
Basis for new, more powerful hazard analysis techniques
(STPA)
Inform early architectural trade studies
Identify and prioritize hazards and risks
Identify system and component safety requirements and constraints
(to be used in design)
Perform hazard analyses on physical and social systems
Safety-driven design (physical, operational, organizational)
More comprehensive accident/incident investigation and
root cause analysis
-
Uses for STAMP (2)
Organizational and cultural risk analysis
Identifying physical and project risks
Defining safety metrics and performance audits
Designing and evaluating potential policy and structural improvements
Identifying leading indicators of increasing risk (canary in the coal
mine)
New holistic approaches to security
-
Does it Work? Is it Practical?
MDA risk assessment of inadvertent launch (technical)
Architectural trade studies for the space exploration
initiative (technical)
Safety-driven design of a NASA JPL spacecraft
(technical)
NASA Space Shuttle Operations (risk analysis of a new
management structure)
NASA Exploration Systems (risk management tradeoffs
among safety, budget, schedule, performance in
development of replacement for Shuttle)
-
Does it Work? Is it Practical? (2)
Accident analysis (spacecraft losses, bacterial
contamination of water supply, aircraft collision, oil
refinery explosion, train accident, etc.)
Pharmaceutical safety
Hospital safety (risks of outpatient surgery at Beth Israel
MC)
Corporate fraud (are controls adequate?, Sarbanes-
Oxley)
Food safety
Train safety (Japan)
-
Accident/Incident Investigation
and Causal Analysis
-
Using STAMP in Root Cause Analysis
Identify system hazard violated and the system safety design
constraints
Construct the safety control structure as it was designed to
work
Component responsibilities (requirements)
Control actions and feedback loops
For each component, determine if it fulfilled its responsibilities
or provided inadequate control.
If inadequate control, why? (including changes over time)
Determine the changes that could eliminate the inadequate
control (lack of enforcement of system safety constraints) in
the future.
-
Components surrounding
Controller in Zurich
-
Links degraded due to
poor and unsafe practices
-
Links lost due to
sectorization work
-
Links lost due to unusual situations
-
Tupolev Crew
Safety Requirements and Constraints
Must follow TCAS mandate
Context in which decisions were made
Flying over Western Europe (TCAS is mandatory)
TU crew doesn't have radio communication with Boeing crew
Flight crew has no simulator experience with TCAS
Flight training is unclear on what to do in case of conflict between ATC/TCAS
Flying at night
Inadequate Decisions and Control Actions
Reliance on optical contact
Ignores minority report from spare member of the crew
Follows controller instructions rather than TCAS
Mental Model Flaws
Optical illusion of distance
Belief that ATC is aware of everything that is happening
Belief that pilot, not TCAS, has the last say in the evasive action
Understanding of TCAS as a backup system rather than a final resort
-
Zurich ATC Operations
Safety Requirements and Constraints
Maintain safe separation between planes in airspace
Context in which decisions were made
Phone system prevented communication from other ATCs
Inadequate radar coverage
Insufficient personnel (only one controller)
Unaware of TCAS and/or impact of TCAS during a RA
Etc.
Inadequate Decision and Control Actions
Failure to communicate with DHL plane
Failure to adequately monitor situation
Mental Model Flaws
Unaware of conflicting TCAS procedures between Russian and European pilots
Etc.
-
Regulatory Agencies (FAA, CAA,
Eurocontrol)
(No significant influence on accident according to the report)
Safety Requirements (Responsibilities)
Clearly articulate procedures for compliance with TCAS RAs.
Clearly articulate right of way rules in airspace.
Define the role of air traffic controllers and pilots in resolving conflicts in the presence of TCAS.
Flawed Control Actions
AIP Germany regulations not up to date for current version of TCAS.
Procedural instruction for the actions to be taken by the pilots (from AIP Germany) in case of an RA not worded clearly enough.
LuftVO (Air Traffic Order): Pilots are granted a freedom of decision which is not compatible with the system philosophy of TCAS II, Version 7; use of the term "recommendation" is inadequate.
Reasons for Flawed Control Actions, Dysfunctional Interactions
Overlapping control authority by several nations & organizations.
Asynchronous evolution between regulatory guidance documents and adopted technology.
-
Filename: Überlingen per STAMP
(C) FMV 2007
ed. 9.3, 2008-03-04
Björn Koberstein
Überlingen, Operators: STAMP's Static Control Structure (Action/Feedback)
(Control-structure diagram, rendered as a list:)
ICAO (a UN agency): cooperative aviation regulation; standards and recommendations (including rules of the air, responsibilities of ATCOs and pilots, conflicts between ATCO and TCAS)
EUROCONTROL (European org): standardization and guidance; certification, education, information
Aircraft authorities (incl. FAA)
JAA (Joint Aviation Authorities): guidance and training (TCAS RAs take precedence over ATCO instructions)
Swiss Air Navigation Services (ATC Zürich management): air traffic control in Swiss airspace and in delegated airspaces of adjoining states; ATC safety policy (not fully implemented; SMOP in practice not prevented)
ATC operator: overloaded; insufficiently trained; unofficial practice deviating from the regular roster (night-shift SMOP)
CoC: safety, quality, and risk management; implementation of the safety and risk management procedures was delayed due to their in-house development; unaware of the sectorisation work
TCAS manufacturer: TCAS 2000 Pilots Guide (including TCAS-ATC conflicts)
Flight operators: TCAS training (B757-200/TU154M)* (obey RA unless/after conflict contact); TU154M Flight Operations (ATC has precedence over TCAS)
Other components: pilots, aircraft, TCAS, radar, radar display, ATC Station (ATC Zurich)
Several links in the diagram are marked as having missing or no direct feedback
* BFU Investigation Report pp. 62, 65
-
STPA
A new hazard analysis technique based
on the STAMP model of accident
causation
Copyright Nancy Leveson, Aug. 2006
-
STAMP-Based Hazard Analysis (STPA)
Supports a safety-driven design process where
Hazard analysis influences and shapes early design
decisions
Hazard analysis iterated and refined as design evolves
Goals (same as any hazard analysis)
Identification of system hazards and related safety
constraints necessary to ensure acceptable risk
Accumulation of information about how hazards can be
violated, which is used to eliminate, reduce and control
hazards in system design, development, manufacturing,
and operations
-
Safety-Driven Design
Define initial control structure, refining system safety constraints and design in parallel.
Identify potentially hazardous control actions by each of the system components that would violate system design constraints. Restate as component safety design requirements and constraints.
Perform hazard analysis using STPA to identify how safety-related requirements and constraints could be violated (the potential causes of inadequate control and enforcement of safety-related constraints).
Augment the basic design to eliminate, mitigate, or control potential unsafe control actions and behaviors.
Iterate over the process, i.e., perform STPA on the new augmented design and continue to refine the design until all hazardous scenarios are eliminated, mitigated, or controlled.
Document design rationale and trace requirements and constraints to the related design decisions.
-
Step 1: Identify hazards and translate into high-
level requirements and constraints on behavior
TCAS Hazards:
1. A near mid-air collision (NMAC): two controlled aircraft violate minimum separation standards
2. A controlled maneuver into ground
3. Loss of control of aircraft
4. Interference with other safety-related aircraft systems
5. Interference with the ground-based ATC system
6. Interference with ATC safety-related advisory
System Safety Design Constraints:
TCAS must not cause or contribute to an NMAC
TCAS must not cause or contribute to a controlled maneuver into the ground
-
Step 2: Define basic control structure
-
Component Responsibilities
TCAS:
Receive and update information about its own and other aircraft
Analyze information received and provide pilot with
Information about where other aircraft in the vicinity are located
An escape maneuver to avoid potential NMAC threats
Pilot
Maintain separation between own and other aircraft using visual
scanning
Monitor TCAS displays and implement TCAS escape maneuvers
Follow ATC advisories
Air Traffic Controller
Maintain separation between aircraft in controlled airspace by
providing advisories (control action) for pilot to follow
-
Aircraft components (e.g., transponders, antennas)
Execute control maneuvers
Receive and send messages to/from aircraft
Etc.
Airline Operations Management
Provide procedures for using TCAS and following TCAS
advisories
Train pilots
Audit pilot performance
Air Traffic Control Operations Management
Provide procedures
Train controllers
Audit performance of controllers
Audit performance of overall collision avoidance system
-
Step 3a: Identify potential inadequate control
actions that could lead to a hazardous state.
In general:
1. A required control action is not provided or not
followed
2. An incorrect or unsafe control action is provided
3. A potentially correct or inadequate control action is
provided too late or too early (at the wrong time)
4. A correct control action is stopped too soon.
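As a hedged illustration (not part of the STPA method itself; all class and function names are invented), the four types can be treated as a checklist that expands each control action into candidate unsafe control actions to be examined:

```python
# Illustrative sketch: the four general inadequate-control-action types
# as a checklist applied to each (controller, control action) pair.
# The TCAS contexts below restate examples from these slides.
from dataclasses import dataclass
from enum import Enum, auto

class UCAType(Enum):
    NOT_PROVIDED = auto()      # 1. required action not provided or not followed
    UNSAFE_PROVIDED = auto()   # 2. incorrect or unsafe action provided
    WRONG_TIMING = auto()      # 3. provided too late or too early
    STOPPED_TOO_SOON = auto()  # 4. correct action stopped too soon

@dataclass
class UCAEntry:
    controller: str
    action: str
    uca_type: UCAType
    context: str               # hazardous context in which the action is unsafe

def uca_table(controller: str, action: str, contexts: dict) -> list:
    """Expand one control action into one candidate UCA per type."""
    return [UCAEntry(controller, action, t, ctx) for t, ctx in contexts.items()]

entries = uca_table("TCAS", "Resolution Advisory (RA)", {
    UCAType.NOT_PROVIDED: "aircraft on near-collision course, no RA issued",
    UCAType.UNSAFE_PROVIDED: "RA degrades vertical separation",
    UCAType.WRONG_TIMING: "RA issued too late to avoid NMAC",
    UCAType.STOPPED_TOO_SOON: "RA removed before conflict resolved",
})
print(len(entries))  # 4 candidate unsafe control actions to analyze
```

Each resulting entry is then a prompt for Step 3b (restating as a constraint) and Step 4 (finding causal scenarios).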
-
For the NMAC hazard:
TCAS:
1. The aircraft are on a near collision course and TCAS does not
provide an RA
2. The aircraft are in close proximity and TCAS provides an RA that
degrades vertical separation.
3. The aircraft are on a near collision course and TCAS provides an
RA too late to avoid an NMAC
4. TCAS removes an RA too soon
Pilot:
1. The pilot does not follow the resolution advisory provided by TCAS
(does not respond to the RA)
2. The pilot incorrectly executes the TCAS resolution advisory.
3. The pilot applies the RA but too late to avoid the NMAC
4. The pilot stops the RA maneuver too soon.
-
Step 3b: Use identified inadequate control
actions to refine system safety design
constraints
When two aircraft are on a collision course, TCAS must
always provide an RA to avoid the collision
TCAS must not provide RAs that degrade vertical separation
The pilot must always follow the RA provided by TCAS
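One way to make such constraints precise is to express them as executable monitor predicates over recorded state. A minimal sketch, assuming a simplified aircraft-pair snapshot with invented field names (not an actual TCAS interface):

```python
# Hedged sketch: the three refined safety constraints re-expressed as
# checks over a simplified, invented state representation.
from dataclasses import dataclass

@dataclass
class Snapshot:
    on_collision_course: bool
    ra_active: bool            # TCAS currently providing an RA
    pilot_following_ra: bool
    vertical_sep_ft: float
    prev_vertical_sep_ft: float

def violated_constraints(s: Snapshot) -> list:
    v = []
    # "When two aircraft are on a collision course, TCAS must provide an RA"
    if s.on_collision_course and not s.ra_active:
        v.append("no RA on collision course")
    # "TCAS must not provide RAs that degrade vertical separation"
    if s.ra_active and s.vertical_sep_ft < s.prev_vertical_sep_ft:
        v.append("RA degrades vertical separation")
    # "The pilot must always follow the RA provided by TCAS"
    if s.ra_active and not s.pilot_following_ra:
        v.append("pilot not following RA")
    return v

print(violated_constraints(Snapshot(True, False, False, 900.0, 900.0)))
# ['no RA on collision course']
```

Writing constraints this way forces each vague phrase ("degrades separation") to be pinned to observable quantities.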
-
Step 4: Determine how potentially hazardous
control actions could occur (scenarios of how
constraints can be violated). Eliminate from design
or control in design or operations.
Step 4a: Augment control structure with process models for each control component.
Step 4b: For each inadequate control action, examine the parts of the control loop to see if they could cause it.
Guided by a set of generic control flaws
Step 4c: Design controls and mitigation measures
Step 4d: Consider how designed controls could degrade over time.
-
Generic Control Loop Flaws
1. Inadequate Enforcement of Constraints (inadequate
Control Actions)
- Design of control algorithm (process) does not enforce
constraints
Flaws in creation process
Process changes without appropriate change in control
algorithm (asynchronous evolution)
Incorrect modification or adaptation
- Inadequate coordination among controllers and decision
makers
-
- Process models inconsistent, incomplete, or incorrect
Flaws in creation process
Flaws in updating (inadequate or missing feedback)
Not provided in system design
Communication flaw
Time lag
Inadequate sensor operation (incorrect or no information provided)
Time lags and measurement inaccuracies not accounted
for
Expected process inputs are wrong or missing
Expected control inputs are wrong or missing
Disturbance model is wrong
Amplitude, frequency, or period is out of range
Unidentified disturbance
-
2. Inadequate Execution of Control Actions
Communication flaw
Inadequate actuator operation
Time lag
-
Comparison with Traditional HA
Techniques
Top-down (vs bottom-up like FMECA)
Considers more than just component failure and failure
events (includes these but more general)
Guidance in doing analysis (vs. FTA)
Handles dysfunctional interactions and system accidents,
software, management, etc.
-
Comparisons (2)
Concrete model (not just in head)
Not physical structure (HAZOP) but control (functional)
structure
General model of inadequate control (based on control
theory)
HAZOP guidewords based on model of accidents being
caused by deviations in system variables
Includes HAZOP model but more general
Compared with TCAS II Fault Tree (MITRE)
STPA results more comprehensive
Included Ueberlingen accident
-
Thermal Tile Robot Example
1. Identify high-level functional requirements and environmental constraints.
e.g., size of physical space, crowded area
2. Identify high-level hazards
a. Violation of minimum separation between mobile base and objects (including orbiter and humans)
b. Mobile robot becomes unstable (e.g., could fall over)
c. Manipulator arm hits something
d. Fire or explosion
e. Contact of human with DMES
f. Inadequate thermal control (e.g., damaged tiles not detected, DMES not applied correctly)
g. Damage to robot
-
3. Try to eliminate hazards from system conceptual design.
If not possible, then identify controls and new design
constraints.
For unstable base hazard
System Safety Constraint:
Mobile base must not be capable of falling over under worst case operational conditions
-
First try to eliminate:
1. Make base heavy
Could increase damage if it hits someone or something.
Difficult to move out of way manually in emergency
2. Make base long and wide
Eliminates hazard but violates environmental constraints
3. Use lateral stability legs that are deployed when manipulator arm extended but must be retracted when mobile base moves.
Two new design constraints:
Manipulator arm must move only when stabilizer legs are fully deployed
Stabilizer legs must not be retracted until manipulator arm is fully stowed.
-
Define preliminary control structure and refine
constraints and design in parallel.
-
Identify potentially hazardous control actions by
each of system components
1. A required control action is not provided or not followed
2. An incorrect or unsafe control action is provided
3. A potentially correct or inadequate control action is provided too late or too early (at the wrong time)
4. A correct control action is stopped too soon.
Hazardous control of stabilizer legs:
Legs not deployed before arm movement enabled
Legs retracted when manipulator arm extended
Legs retracted after arm movements are enabled or retracted before manipulator arm fully stowed
Leg extension stopped before the legs are fully extended
-
Restate as safety design constraints on components
1. Controller must ensure stabilizer legs are extended
whenever arm movement is enabled
2. Controller must not command a retraction of stabilizer legs
when manipulator arm extended
3. Controller must not command retraction of stabilizer legs
after arm movements are enabled. Controller must not
command retraction of legs before manipulator arm fully
stowed
4. Controller must not stop leg deployment before they are fully
extended
-
Do same for all hazardous commands:
e.g., Arm controller must not enable manipulator arm movement before stabilizer legs are completely extended.
At this point, may decide to have arm controller and
leg controller in same component
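These component constraints can be enforced in software as interlocks. A minimal sketch, assuming a combined arm/leg controller (all class and method names invented, not from the actual robot design):

```python
# Illustrative sketch: component safety constraints as software interlocks.
class TileRobotController:
    def __init__(self):
        self.legs = "retracted"      # retracted | extended
        self.arm = "stowed"          # stowed | extended
        self.arm_enabled = False

    def enable_arm(self):
        # Constraint: arm movement only when stabilizer legs fully deployed
        if self.legs != "extended":
            raise RuntimeError("interlock: legs not extended")
        self.arm_enabled = True

    def extend_legs(self):
        self.legs = "extended"

    def retract_legs(self):
        # Constraint: legs must not retract until arm fully stowed and disabled
        if self.arm != "stowed" or self.arm_enabled:
            raise RuntimeError("interlock: arm not stowed/disabled")
        self.legs = "retracted"

c = TileRobotController()
try:
    c.enable_arm()               # rejected: legs still retracted
except RuntimeError as e:
    print(e)                     # interlock: legs not extended
c.extend_legs()
c.enable_arm()                   # now permitted
```

The interlock checks mirror the constraints one-for-one, which keeps the traceability from constraint to design decision that the last step of the process calls for.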
-
To produce detailed scenarios for violation of
safety constraints, augment control structure with
process models
Arm Movement: Enabled / Disabled / Unknown
Stabilizer Legs: Extended / Retracted / Unknown
Manipulator Arm: Stowed / Extended / Unknown
How could become inconsistent with real state?
e.g. issue command to extend stabilizer legs but external
object could block extension or extension motor could fail
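The inconsistency can be made concrete in a few lines: a controller that updates its process model when it issues a command, rather than when feedback arrives, diverges from the plant as soon as extension is blocked (illustrative sketch, all names invented):

```python
# Sketch of a process model diverging from the real state.
class Plant:
    def __init__(self):
        self.legs = "retracted"
        self.blocked = False          # external object blocking extension
    def extend_legs(self):
        if not self.blocked:
            self.legs = "extended"

class Controller:
    def __init__(self, plant):
        self.plant = plant
        self.model_legs = "unknown"   # start in 'unknown' until feedback arrives
    def command_extend(self):
        self.plant.extend_legs()
        self.model_legs = "extended"  # FLAW: assumes the command succeeded
    def read_feedback(self):
        self.model_legs = self.plant.legs  # position sensor closes the loop

p = Plant(); p.blocked = True
c = Controller(p)
c.command_extend()
print(c.model_legs, p.legs)   # extended retracted  <- inconsistent
c.read_feedback()
print(c.model_legs, p.legs)   # retracted retracted <- consistent again
```

This is why the design needs both the "unknown" startup state and a feedback channel: the model must be driven by measured state, not by issued commands.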
-
Problems often in startup or shutdown:
e.g., emergency shutdown while servicing tiles. Stability legs manually retracted to move robot out of way. On restart, controller assumes stabilizer legs still extended and arm movement could be commanded. So use unknown state when starting up.
Do not need to know all causes, only safety constraints:
May decide to turn off arm motors when legs extended or when arm extended. Could use interlock or tell computer to power it off.
Must not move when legs extended? Power down wheel motors while legs extended.
Coordination problems
-
Some Examples and References
to Papers on Them
-
Example: Early System Architecture
Trades for Space Exploration
Part of an MIT/Draper Labs contract with NASA
Wanted to include risk, but little information available
Not possible to evaluate likelihood when no design information available
Can consider severity by using worst-case analysis associated with specific hazards.
Developed three step process:
Identify system-level hazards and associated severities
Identify mitigation strategies and associated impact
Calculate safety/risk metrics for each architecture
-
Sample
First identify system hazards and severities
-
Identified Hazards and their Severities (severity columns: H / M / Eq)

ID#  Phase               Hazard                                                                                                H  M  Eq
G1   General             Flammable substance in presence of ignition source (fire)                                             4  4  4
G2   General             Flammable substance in presence of ignition source in confined space (explosion)                      4  4  4
G3   General             Loss of life support (includes power, temperature, oxygen, air pressure, CO2, food, water, etc.)      4  4  4
G4   General             Crew injury or illness                                                                                4  4  1
G5   General             Solar or nuclear radiation exceeding safe levels                                                      3  3  2
G6   General             Collision (micrometeoroids, debris, with modules during rendezvous or separation maneuver, etc.)      4  4  4
G7   General             Loss of attitude control                                                                              4  4  4
G8   General             Engines do not ignite                                                                                 4  4  2
PL1  Pre-Launch          Damage to payload                                                                                     2  3  3
PL2  Pre-Launch          Launch delay (due to weather, pre-launch test failures, etc.)                                         1  4  1
L1   Launch              Incorrect propulsion/trajectory/control during ascent                                                 4  4  4
L2   Launch              Loss of structural integrity (due to aerodynamic loads, vibrations, etc.)                             4  4  4
L3   Launch              Incorrect stage separation                                                                            4  4  4
E1   EVA in Space        Lost in space                                                                                         4  4  1
A1   Assembly            Incorrect propulsion/control during rendezvous                                                        4  4  4
A2   Assembly            Inability to dock                                                                                     1  4  3
A3   Assembly            Inability to achieve airlock during docking                                                           1  4  3
A4   Assembly            Inability to undock                                                                                   4  4  3
T1   In-Space Transfer   Incorrect propulsion/trajectory/control during course change burn                                     4  4  3
D1   Descent             Inability to undock                                                                                   4  4  3
D2   Descent             Incorrect propulsion/trajectory/control during descent                                                4  4  4
D3   Descent             Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc.) 4  4  4
A1   Ascent              Incorrect stage separation (including ascent module disconnecting from descent stage)                 4  3  3
A2   Ascent              Incorrect propulsion/trajectory/control during ascent                                                 4  3  3
A3   Ascent              Loss of structural integrity (due to aerodynamic loads, vibrations, etc.)                             4  3  3
S1   Surface Operations  Crew members stranded on M surface during EVA                                                         4  3  3
S2   Surface Operations  Crew members lost on M surface during EVA                                                             4  3  3
S3   Surface Operations  Equipment damage (including related to lunar dust)                                                    2  3  3
NP1  Nuclear Power       Nuclear fuel released on earth surface                                                                4  4  2
NP2  Nuclear Power       Insufficient power generation (reactor doesn't work)                                                  4  3  3
NP3  Nuclear Power       Insufficient reactor cooling (leading to reactor meltdown)                                            4  3  3
RE1  Re-Entry            Inability to undock                                                                                   4  3  3
RE2  Re-Entry            Incorrect propulsion/trajectory/control during descent                                                4  3  3
RE3  Re-Entry            Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc.) 4  3  4
RE4  Re-Entry            Inclement weather                                                                                     4  2  2
-
For example, not performing a rendezvous in transit reduces hazard
of being unable to dock
-
Evaluate Each Architecture and Calculate
Safety/Risk Metrics
Create an architecture vector with all parameters for
that architecture (column C of spreadsheet)
Compute metric on architecture vector:
Calculate a Relative Hazard Mitigation Index
Calculate a Relative Severity Index
Combine into an Overall Safety/Risk Metric
Details in http://sunnyday.mit.edu/papers/issc05-
final.pdf
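The exact index definitions are in the linked paper; a toy sketch with invented weightings shows the shape of the computation (severity-weighted mitigation per candidate architecture):

```python
# Toy sketch only: hazard IDs reuse the table above, but the mitigation
# fractions and the weighting scheme here are invented, not the paper's.
hazards = {"G1": 4, "G4": 4, "PL2": 1}     # hazard id -> worst-case severity (1-4)

def safety_metric(mitigation: dict) -> float:
    """Severity-weighted average mitigation across hazards (0..1, higher = safer)."""
    total = sum(hazards.values())
    return sum(sev * mitigation.get(h, 0.0) for h, sev in hazards.items()) / total

arch_a = {"G1": 0.9, "G4": 0.5, "PL2": 1.0}   # architecture vectors: fraction of
arch_b = {"G1": 0.4, "G4": 0.9, "PL2": 0.2}   # each hazard mitigated by the design
print(round(safety_metric(arch_a), 3))
print(round(safety_metric(arch_b), 3))
```

The point of such a metric is comparative: it ranks architectures by severity-weighted mitigation without requiring likelihood estimates, which are unavailable this early.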
-
Sample Results
-
Ballistic Missile Defense System (BMDS)
Non-Advocate Safety Assessment using STPA
A layered defense to defeat all ranges of threats in all phases of flight (boost, mid-course, and terminal)
Made up of many existing systems (BMDS Elements)
Early warning radars
Aegis
Ground-Based Midcourse Defense (GMD)
Command and Control, Battle Management and Communications (C2BMC)
Others
MDA used STPA to evaluate the residual safety risk of inadvertent launch prior to deployment and test
-
[Figure: FMIS safety control structure. The Command Authority (doctrine, engagement criteria, training, TTP, workarounds) directs Operators, with exercise results, readiness status, and wargame results fed back. Operators command Fire Control (engage target, operational mode change, readiness state change, weapons free / weapons hold) and receive operational mode, readiness state, system status, track data, and weapon and system status. Radar supplies status and track data to Fire Control and accepts radar tasking and readiness mode changes; an Early Warning System exchanges status requests, launch reports, status reports, and heartbeats. Fire Control commands the Launch Station (fire enable / fire disable, operational mode change, readiness state change, interceptor tasking, task cancellation) and receives command responses, system status, and launch reports. The Launch Station commands the Launcher (launch position, stow position, perform BIT) and the Flight Computer / Interceptor Simulator (abort, arm, BIT command, task load, launch, operating mode, power, safe, software updates), receiving acknowledgements, BIT results, health and status, and launcher position. Interceptor H/W receives arm, safe, and ignite commands and returns BIT info, safe and arm status, breakwires, and voltages.]
Safety Control Structure Diagram for FMIS
-
Results
Deployment and testing held up for 6 months because so many scenarios identified for inadvertent launch (the only hazard considered so far). In many of these scenarios:
All components were operating exactly as intended
Complexity of component interactions led to unanticipated system behavior
STPA also identified component failures that could cause inadequate control (most analysis techniques consider only these failure events)
As changes are made to the system, the differences are assessed by updating the control structure diagrams and assessment analysis templates.
Adopted as primary safety approach for BMDS
-
Safety-driven Design of an Outer Planets
Explorer Spacecraft for JPL
Demonstration of approach on the design of a deep
space exploration mission spacecraft (Europa).
Defined mission hazards
Generated mission safety requirements and design
constraints
Created spacecraft control structure and system design
Performed STPA and generated component safety
requirements and design features to control hazards
http://sunnyday.mit.edu/papers/IEEE-Aerospace.pdf
(complete specifications also available)
-
Organizational and Cultural Risk
Analysis
-
Cultural and Organizational Risk
Analysis and Performance Monitoring
Apply STAMP and STPA at organizational level plus
system dynamics modeling and analysis
Goals:
Evaluating and analyzing risk
Designing and validating improvements
Monitoring risk (canary in the coal mine)
Identifying leading indicators of increasing or
unacceptable risk
-
System Dynamics
Created at MIT in 1950s by Forrester
Used a lot in Sloan School (management)
Grounded in non-linear dynamics and feedback control
Also draws on
Cognitive and social psychology
Organization theory
Economics
Other social sciences
Use to understand changes over time (dynamics of a system)
-
[Figure: simple system dynamics example. Two stocks, People who know and People who don't know, connected by one flow, the rate of sharing the news, which is driven by contacts between the two groups and the probability of contact with those in the know (a reinforcing loop). The accompanying plot shows S-shaped growth over 12 months: People who know rises toward 100 as People who don't know falls.]
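The word-of-mouth example can be simulated with its two stocks and one flow using simple Euler integration (all parameter values here are invented for illustration):

```python
# Minimal stock-and-flow simulation of the news-diffusion example:
# the flow rate depends on contacts between people who know and
# people who don't, producing the S-shaped curve in the plot.
N = 100.0              # total population
know, dont = 1.0, 99.0 # initial stocks
c = 1.0                # contact/adoption rate per month (invented)
dt = 0.25              # Euler integration step, in months

history = []
for step in range(int(12 / dt)):        # simulate 12 months
    rate = c * know * dont / N          # rate of sharing the news
    know += rate * dt
    dont -= rate * dt
    history.append(know)

print(round(know))  # nearly everyone knows by month 12
```

The reinforcing loop is visible in the code: a larger `know` raises `rate`, which raises `know` further, until the shrinking `dont` stock saturates the growth.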
-
Risk Analysis Process for Independent Technical Authority
1. Preliminary Hazard Analysis → system hazards; system safety requirements and constraints
2. Modeling the ITA Safety Control Structure → roles and responsibilities; feedback mechanisms
3. Mapping Requirements to Responsibilities → gap analysis
4. Detailed Hazard Analysis using STPA → system risks (inadequate controls)
5. Categorizing & Analyzing Risks → immediate and longer-term risks; risk factors
6. System Dynamics Modeling and Analysis → sensitivity; leading indicators; policy; structure
7. Findings and Recommendations → leading indicators and measures of effectiveness
-
1. Preliminary Hazard Analysis
System Hazard: Poor engineering and management decision-
making leading to an accident (loss).
System Safety Requirements and Constraints:
1. Safety considerations must be first and foremost in technical
decision-making.
2. Safety-related technical decision-making must be done by
eminently qualified experts with broad participation of the full
workforce.
3. Safety analyses must be available and used starting in the early
acquisition, requirements development, and design processes
and continuing through the system lifecycle.
4. The Agency must provide avenues for full expression of technical
conscience and a process for full and adequate resolution of
technical conflicts as well as conflicts between programmatic and
technical concerns.
-
Each of these was refined, e.g.,
1. Safety considerations must be first and foremost in technical decision-making.
a. State-of-the-art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.
b. Safety-related technical decision-making must be independent from programmatic considerations, including cost and schedule
c. Safety-related decision-making must be based on correct, complete, and up-to-date information.
d. Overall (final) decision-making must include transparent consideration of both safety and programmatic concerns.
e. The Agency must provide for effective assessment and improvement in safety-related decision-making.
The goal: to create a set of system safety requirements and constraints sufficient to eliminate or mitigate the hazard
-
2. Model the ITA Control Structure
-
For each component specified:
Inputs, outputs
Overall role and detailed responsibilities (requirements)
Potential inadequate control actions
Feedback requirements
For most added:
Environmental and behavior-shaping factors (context)
Mental model requirements
Controls
-
Example from System Technical
Warrant Holder
1. Establish and maintain technical policy, technical
standards, requirements, and processes for a
particular system or systems.
a. STWH shall ensure program identifies and imposes
appropriate technical requirements at
program/project formulation to ensure safe and
reliable operations.
b. STWH shall ensure inclusion of the consideration of
risk, failure, and hazards in technical requirements.
c. STWH shall approve the set of technical
requirements and any changes to them
d. STWH shall approve verification plans for the
system(s)
-
3. Map System Requirements to
Component Responsibilities
Took each of system safety requirements and
traced to component responsibilities
(requirements)
Identified omissions, conflicts, potential issues
Recommended additions and changes
Added responsibilities when missing in order for
risk analysis to be complete.
-
4. Hazard Analysis using STPA
General types of risks for ITA:
1. Unsafe decisions are made by or approved by ITA
2. Safe decisions are disallowed (overly conservative decision-making that undermines the goals of NASA and long-term support for ITA)
3. Decision-making takes too long, minimizing impact and also reducing support for ITA
4. Good decisions are made by ITA, but do not have adequate impact on system design, construction, and operation
Applied to each of component responsibilities
Identified basic and coordination risks
-
Example from Risks List
CE Responsibility: Develop, monitor, and maintain technical
standards and policy
Risks:
1. General technical and safety standards and
requirements are not created (IC)
2. Inadequate standards and requirements are created
(IC)
3. Standards degrade as changed over time due to
external pressures to weaken them. Process for
approving changes is flawed (LT).
4. Standards not changed or updated over time as the
environment changes (LT).
-
5. Categorize and Analyze Risks
Large number resulted so:
Categorized risks as
Immediate concern
Longer-term concern
Standard Process
Used system dynamics models to identify which risks
were most important to assess and measure
Provide most important assessment of current level of risk
Most likely to detect increasing risk early enough to prevent
significant losses (leading indicators)
-
[Figure: overview of the system dynamics model's nine interacting sub-models: Risk; Shuttle Aging and Maintenance; System Safety Efforts & Efficacy; Perceived Success by Administration; System Safety Knowledge, Skills & Staffing; Launch Rate; System Safety Resource Allocation; Incident Learning & Corrective Action; ITA.]
-
6. System Dynamics Modeling
Modified our NASA manned space program model
to include Independent Technical Authority (ITA)
Independently tested and validated the nine models,
then connected them
Ran analyses:
Sensitivity analyses to investigate impact of various
parameters on system dynamics and risk
System behavior mode investigation
Metrics evaluations
Additional scenarios and insights
-
Example Result
ITA has potential to significantly reduce risk and to
sustain an acceptable risk level
But also found significant risk of unsuccessful
implementation of ITA that needs to be monitored
200-run Monte-Carlo sensitivity analysis
Random variations of +/- 30% of baseline exogenous
parameter values
-
Successful vs. Unsuccessful ITA Implementation
[Figure: two plots over time, an indicator of effectiveness and credibility of ITA (scale 0 to 1) and system technical risk, each comparing a successful trajectory (1) with an unsuccessful one (2).]
-
Successful Scenarios
Self-sustaining for short period of time if conditions in place for early acceptance.
Provides foundation for a solid, sustainable ITA program implementation under right conditions.
In successful scenarios:
After period of high success, effectiveness slowly declines
Complacency
Safety seen as solved problem
Resources allocated to more urgent matters
But risk still at acceptable levels and extended period of nearly steady-state equilibrium with risk at low levels
-
Unsuccessful Implementation Scenarios
Effectiveness quickly starts to decline and reaches
unacceptable levels
Limited ability of ITA to have sustained effect on system
Hazardous events start to occur, safety increasingly
perceived as urgent problem
More resources allocated to safety but TA and TWHs have
lost so much credibility they cannot effectively contribute to
risk mitigation anymore.
Risk increases dramatically
ITA and safety staff overwhelmed with safety problems
Start to approve an increasing number of waivers so they can continue to fly.
-
Unsuccessful Scenario Factors
As effectiveness of ITA decreases, the number of problems increases
Investigation requirements increase
Corners may be cut to compensate
Results in lower-quality investigation resolutions and
corrective actions
TWHs and Trusted Agents become saturated and cannot attend
to each investigation in timely manner
Bottleneck created by requiring TWHs to authorize all safety-
related decisions, making things worse
Want to detect this reinforcing loop while interventions
still possible and not overly costly (resources, downtime)
-
Identification of Lagging vs. Leading
Indicators
Number of waivers issued: good indicator but lags rapid increase in risk
Incidents under investigation: a better leading indicator
[Figure: two plots of system technical risk over time, one against outstanding accumulated waivers, one against incidents under investigation.]
-
Modeling Exploration Enterprise (ESMD)
Built a large STAMP plus system dynamics model of Project Constellation
Development-oriented vs. the operations-oriented Space Shuttle model
Demonstrating how it can be used for risk
management decision-making
-
Risk Management in NASA's New Exploration Systems Mission Directorate
Created an executable model, using input from the NASA workforce, to analyze relative effects of management strategies on schedule, cost, safety, and performance
Developed scenarios to analyze risks identified by the Agency's workforce
Performed preliminary analysis on the effects of hiring constraints, management reserves, independence of safety decision-making, requirements changes, etc.
Derived preliminary recommendations to mitigate and monitor program-level risks
-
Structure of System Dynamics Model
[Figure: model structure linking Congress and White House decision-making; NASA Administration and ESMD decision-making; OSMA and OCE; Exploration Systems Engineering Management; technical personnel resources and experience; system development and safety analysis completion; efforts and efficacy of other technical personnel; engineering procurement; NESC; Safety and Mission Assurance (SMA status, efficacy, knowledge and skills); and Exploration Systems Program/Project Management (task completion, schedule pressure, resource allocation).]
-
[Figure: "Engineering - System Development Completion and Safety Analyses" system dynamics model, showing the safety rework cycle. Stocks and flows link design work (remaining, completed, and completed with undiscovered safety and integration problems), technology development tasks (pending, completed, abandoned, and used in design), and hazard analyses (pending, completed, used in design, or discarded) through rates for task completion, utilization, flaw discovery, unplanned rework, and acceptance of problems or unsatisfied requirements. Influencing factors include design schedule pressure from management, capacity for design and technology development work, fraction of hazard analyses too late to influence design, average hazard analysis quality, safety assurance (SMA) resources and efficacy, time to discover flaws, incentives to report flaws, efficacy of system integration, ability to perform contractor safety oversight, and system design overwork. Outputs include additional operations cost for safety and integration workarounds, system performance, and the safety of the operational system.]
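The rework-cycle structure at the heart of the model above can be illustrated with a minimal stock-and-flow simulation. This is only a sketch: the stock names mirror the diagram, but the function, all rates, and all parameter values are assumptions for illustration and are not taken from the NASA model.

```python
# Minimal sketch of the safety rework cycle: design work flows to
# "completed", but a fraction is finished with undiscovered safety and
# integration flaws; discovered flaws re-enter the backlog as unplanned
# rework. All parameter values are illustrative assumptions.

def simulate_rework(total_work=100.0, completion_rate=5.0,
                    flaw_fraction=0.3, discovery_rate=0.2,
                    months=60, dt=1.0):
    remaining = total_work      # Design Work Remaining (stock)
    completed_ok = 0.0          # Design Work Completed (stock)
    undiscovered = 0.0          # Completed with undiscovered flaws (stock)
    for _ in range(int(months / dt)):
        done = min(completion_rate * dt, remaining)
        remaining -= done
        # A fraction of finished work carries undiscovered flaws
        undiscovered += flaw_fraction * done
        completed_ok += (1 - flaw_fraction) * done
        # Flaw discovery returns work to the backlog (unplanned rework)
        found = discovery_rate * undiscovered * dt
        undiscovered -= found
        remaining += found
    return remaining, completed_ok, undiscovered

remaining, ok, hidden = simulate_rework()
```

Raising `flaw_fraction` or lowering `discovery_rate` keeps more flawed work hidden longer, which illustrates the dynamic the model captures: hazard analyses that arrive too late to influence design enlarge the rework cycle and its schedule, cost, and safety impacts.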
-
NASA ESMD Workforce Planning
[Plot: ESMD Employee Gap (0 to 4,000 employees) vs. Time (0 to 150 months), with reinforcing influences from Transfers from Shuttle and Limits on Hiring.]
Simulation varied:
- Initial experience distribution of ESMD civil servant workforce
- Maximum civil servant hiring rates
- Transfers from Shuttle ops during Shuttle retirement
Important Issues:
- Increase in retirements
- Hiring limits
- Transfers
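The workforce-gap dynamics in the plot above can be sketched as a simple month-by-month balance: hiring and Shuttle transfers close the gap while retirements widen it, and a hiring cap limits how fast the gap can close. The function and every parameter value below are illustrative assumptions, not the actual inputs or outputs of the ESMD simulation.

```python
# Illustrative month-by-month balance for the ESMD employee gap:
# gap shrinks with (capped) hiring and Shuttle transfers, grows with
# retirements. All parameter values are assumptions for illustration.

def employee_gap(initial_gap=4000.0, max_hires_per_month=40.0,
                 transfers_per_month=15.0, retirements_per_month=20.0,
                 months=150):
    gap = initial_gap
    history = [gap]
    for _ in range(months):
        inflow = max_hires_per_month + transfers_per_month  # staff gained
        # Retirements offset part of the inflow; gap cannot go negative
        gap = max(0.0, gap - inflow + retirements_per_month)
        history.append(gap)
    return history

trace = employee_gap()
```

Tightening `max_hires_per_month` or raising `retirements_per_month` delays gap closure, which is the kind of trade-off the simulation was used to explore.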
-
Example: Schedule Pressure and Safety Priority
in Developing the Shuttle Replacement
1. Overly aggressive schedule enforcement has little effect on completion time
-
Using Model for Policy Decisions
The results of the analyses can be used to make policy decisions, for example:
Reduce limitations (external and internal) that will impede civil servant hiring in the next few years
Monitor management reserves and use them to alleviate overwork
Enhance, monitor, and maintain influence of safety analysts on decision-making
Rotate Rising Stars in the Agency through the safety organization
Monitor overwork of SE&I and safety engineering, as they control the rework cycle (safety, cost, and schedule impact)
Continue planning to minimize downstream requirements changes, and allow for on/off ramps (technologies and designs) to reduce negative impact
-
Exploring Limits of Use
Medical error and medical safety (risk analysis of
outpatient surgery at Beth Israel Deaconess Hospital)
Safety in pharmaceutical testing and drug development
Food safety
Control of corporate fraud
-
For More Information
New book draft on STAMP
http://sunnyday.mit.edu/book2.html
(link to CER Early Trades paper also here)
NASA ITA Risk Analysis Final Report
http://sunnyday.mit.edu/ITA-Risk-Analysis.doc
NASA ESMD Risk Management Demonstration
http://sunnyday.mit.edu/ESMD-Final-Report.pdf
-
Summary and Conclusions
A more powerful approach to hazard analysis and system safety engineering
Based on a new, more comprehensive model of accident causation
Includes what we do now but also much more
Works for the complex, software-intensive systems (and systems-of-systems) we are building
Considers the entire socio-technical system
Can be used early in concept formation and development to guide design for safety
Has been validated and is being used on real systems
Potential for very powerful automated tools and assistance
-
Differences with Traditional Approaches
More comprehensive view of causality
A top-down systems approach to preventing losses
Includes organizational, social, and cultural aspects of risk as well as the physical system
Emphasizes non-probabilistic and qualitative approaches
Combines static (structural) and behavioral models
Looks at dynamics and changes over time
Migration toward states of increasing risk
Includes human decision making and mental models
Handles much more complex systems than traditional
safety engineering approaches