Software Safety in Embedded Systems &
Software Safety: Why, What, and How – Leveson
UC San Diego, CSE 294
Spring Quarter 2006, Barry Demchak
Previous Paper
System Safety in Computer-Controlled Automotive Systems – Leveson (2000)
Types of accidents
Safeware Methodology:
Project Management
Software Hazard Analysis
Software Requirements Specification & Analysis
Software Design & Analysis
Design & Analysis of Human-Machine Interaction
Software Verification
Feedback from Operational Experience
Change Control and Analysis
Roadmap
Safety definitions
Industrial safety and risk
Systems issues – hardware and software
Software safety
Analysis and modeling
Verification and validation
System safety engineering
Safety Before Computers
NASA: 10^-9 chance of failure over a 10-hour flight
British nuclear reactors: no single fault can cause a reactor to trip, and 10^-7 chance over 5000 hours of failure to meet a demand to trip
FAA: 10^-9 chance per flight hour (i.e., the failure is not expected even within the total life span of the entire fleet)
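The FAA target can be put in perspective with quick arithmetic; the fleet size, utilization, and service life below are illustrative assumptions, not figures from the paper:

```python
# Check that a 1e-9/hour catastrophic failure rate implies the event is
# not expected even once across an entire fleet's service life.
failure_rate = 1e-9      # catastrophic failures per flight hour (FAA target)
fleet_size = 1000        # aircraft (assumed)
hours_per_year = 3000    # flight hours per aircraft per year (assumed)
service_years = 30       # service life per aircraft (assumed)

total_hours = fleet_size * hours_per_year * service_years  # 90 million hours
expected_failures = failure_rate * total_hours             # about 0.09

print(total_hours, expected_failures)  # fewer than one expected failure
```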
Introduction of Computers
Nuclear Power Plants
Space Shuttle
Airbus Aircraft
Space Satellites
NORAD
Purpose: perform functions that are too dangerous, quick, or complex for humans
System Safety (def.)
Subdiscipline of systems engineering
Applies scientific, management, and engineering principles
Ensures adequate safety throughout the system life cycle
Constrained by operational effectiveness, time, and cost
MilSpec: “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property”
More Definitions
Accident
Unwanted and unexpected release of energy
Mishap (or failure)
Unplanned event or series of events resulting in death, injury, occupational illness, damage to or loss of equipment or property, or environmental harm
Hazard
A condition that can lead to a mishap
More Definitions (cont’d)
Risk
Probability of a hazardous state occurring
Probability of a hazardous state leading to a mishap
Perceived severity of the worst potential mishap that could result from a hazard
Components: hazard probability and hazard criticality (severity)
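The two components can be combined into a simple risk index used to rank hazards. A minimal sketch follows; the category names echo the hazard categorization later in these notes, but the numeric scores are assumptions:

```python
# Minimal hazard-risk-index sketch: risk combines how likely a hazardous
# state is with how severe the worst resulting mishap would be.
# The numeric scores are illustrative assumptions.

LIKELIHOOD = {"frequent": 4, "occasional": 3, "reasonably remote": 2, "remote": 1}
SEVERITY = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}

def risk_index(likelihood: str, severity: str) -> int:
    """Higher index -> higher risk; used to rank hazards for attention."""
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

print(risk_index("frequent", "catastrophic"))  # 16: address first
print(risk_index("remote", "negligible"))      # 1: lowest priority
```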
Early Approach
Operational or Industrial Safety
Examining the system during its operating life
Correcting unacceptable hazards
Ignores the crushing effect of a single catastrophe
Assumptions
All faults are caused by human errors and could be avoided completely, or located and removed prior to delivery and operation
Relatively low complexity of hardware
Ford Pinto (early 1970s)
Specifications: 2000 pounds, $2000 sale price
Use existing factory tooling
Safety issue with gas tank placement
Analysis: a death cost $200,000, a burn $67,000; cost to make the change $137M, benefit $49M
Ford engineer: “But you miss the point entirely. You see, safety isn't the issue, trunk space is. You have no idea how stiff the competition is over trunk space.”
Ford president: “Safety doesn’t sell”
Verdict: $100M
Anecdotes
Safety devices themselves have been responsible for losses or increasing chances of mishaps
Redundancy sometimes degrades safety
Nominally unrelated (but in fact coupled) systems cause errors
Later Approach
System Safety
Design in an acceptable safety level before actual production or operation
Optimize safety by applying scientific and engineering principles to identify and control hazards through analysis, design, and management procedures
Hazard analysis identifies and assesses:
Criticality level of hazards
Risks involved in the system design
Later approach (cont’d)
Assumptions
Complexity of software and hardware interaction causes a non-linear increase in human-error-induced faults
Impossible to demonstrate safety ahead of usage
Complexity and coupling are covariant
Hardware vs Systems
Hardware
Widgets have a long history of use and fault analysis … highly responsive to redundancy techniques
Infinite number of stable states
Software
No history with software … reuse is rare
Large number of discrete states without repetitive structure
Difficult to test under realistic conditions
More Systems Issues
Difficult to specify completely – what it does, and what it does not do
Cannot identify misunderstandings about requirements
Engineers assume perfect execution environments, don’t consider transient faults
Lack of system-level methods and viewpoints
Even Bigger Systems Issues
Specifying and implementing individual components is not the same as specifying and implementing the interactions between components
Between-component interactions grow exponentially and are often underrepresented in analyses
Components include:
Software
Hardware
Human operators
Still Bigger Systems Issues
More Components
Development Methodologies
Source Code Maintenance
Verification/Validation Methodologies
Stakeholder Values
Management
Individual Programmers
Customer
Human Users
Suppliers
Definitions
Reliability
Probability that the system will perform its intended function
Safety
Probability that a hazard will not lead to a mishap
Reliability = failure free; Safety = mishap free
Reliability and safety often conflict
Safety
Studied separately from security, reliability, or availability
Separation of concerns
Safety requirements are identified and separated from operational requirements
Conflicts resolved in a well-reasoned manner
Definitions
System
Sum total of all component parts
Software is only a part, and its correctness exists only in relation to other system components
Software Safety
Ensures software will execute within a system context without resulting in unacceptable risk
Safety-critical software functions
Directly or indirectly allow a hazardous system state to exist
Safety-critical software
Contains safety-critical functions
System Characteristics
Inputs and outputs over time
Control subsystem
Description of function to be performed
Specification of operating constraints (quality, capacity, process, and safety)
Safety constraints are hazards rewritten as constraints
Safety constraints are written, maintained, and audited separately
Analysis and Modeling
Preliminary Hazard Analysis (PHA)
Subsystem Hazard Analysis (SSHA)
System Hazard Analysis (SHA)
Operating and Support Hazard Analysis (OSHA)
Safeware – Leveson
Hazard Analysis
Start with a list of identifiable hazards
Work backward to discover the combinations of faults that produce each hazard
Categorization (most to least likely): Frequent, Occasional, Reasonably remote, Remote, …, Physically impossible
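The backward step can be sketched numerically. Assuming independent basic faults, an AND combination (all faults needed) multiplies probabilities, while an OR combination (any fault suffices) is one minus the product of the complements. The hazard, fault names, and probabilities below are hypothetical:

```python
# Fault-tree sketch: work backward from a top-level hazard to the fault
# combinations that produce it. Assumes independent basic events.
from math import prod

def p_and(*ps):  # all faults must occur together
    return prod(ps)

def p_or(*ps):   # any one fault suffices
    return 1 - prod(1 - p for p in ps)

# Hypothetical hazard: "relief valve fails to open on overpressure"
p_sensor = 1e-2          # one sensor channel fails
p_software_fault = 1e-4  # command software fails
p_valve_stuck = 1e-3     # valve mechanically stuck

p_both_sensors = p_and(p_sensor, p_sensor)  # redundant pair: 1e-4
p_top = p_or(p_both_sensors, p_software_fault, p_valve_stuck)

print(f"{p_top:.2e}")  # about 1.2e-3, dominated by the stuck valve
```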
Hazard Examples (Nuclear Weapons)
Inadvertent nuclear detonation
Inadvertent prearming, arming, launching, firing, or releasing
Deliberate prearming, arming, launching, firing, or releasing under inappropriate conditions
Software Requirements Analysis
Hard to do
Cubby-hole mentality
Rarely includes what the system should not do
Techniques
Fault Tree Analysis (FTA)
Real Time Logic (RTL)
Petri nets
Real Time Logic
Model the system in terms of events and actions (both data dependency and temporal ordering)
Generate predicates
Determine whether a safety assertion is a theorem derivable from the model
“Inherently unsafe” means that the assertion cannot be derived from the model
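RTL proper asks whether the assertion is a theorem of the model; that requires a theorem prover and is beyond a slide. In its spirit, the sketch below uses RTL's occurrence function (the time of the i-th occurrence of an event) to check one timing assertion against a single hypothetical trace; event names and the deadline are assumptions:

```python
# RTL-flavored sketch: events have occurrence times, and a safety
# assertion relates them. Here we only check the assertion against one
# hypothetical trace, not prove it as a theorem of the model.

# occurrence(E, i) = time of the i-th occurrence of event E
trace = {
    "alarm_raised": [2.0, 7.5, 11.0],
    "valve_closed": [2.3, 7.9, 11.2],
}

def occurrence(event: str, i: int) -> float:
    return trace[event][i - 1]  # RTL indexes occurrences from 1

def assertion_holds(deadline: float = 0.5) -> bool:
    """Every alarm is followed by valve closure within `deadline` seconds."""
    return all(
        occurrence("valve_closed", i) - occurrence("alarm_raised", i) <= deadline
        for i in range(1, len(trace["alarm_raised"]) + 1)
    )

print(assertion_holds())  # True for this trace
```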
Time Petri Nets
Mathematical modeling of discrete event systems in terms of conditions and events and the relationships between them
Facilitates backward analysis
Points to the failures and faults that are potentially most hazardous
Nontrivial to build and maintain
Research Question
What is the place of these analysis techniques in an agile development environment?
Safety Verification and Validation
Showing that a fault cannot occur
Showing that if a fault occurs, it is not dangerous
Only as good as the specifications
Specifications are usually incomplete, and hardware specifications are rare
Safety Verification and Validation
Methodologies
Proofs of adequacy
Software Fault Tree (proofs of fault tree analyses)
Determine safety requirements
Detect software logic errors
Identify multiple failure sequences involving different parts of the system
Inform critical runtime checks
Inform testing
Safety Verification and Validation
Methodologies
Nuclear Safety Cross Check Analysis (NSCCA)
Demonstrate that software will not contribute to a nuclear mishap
Multiple technical analyses demonstrate adherence to specifications
Demonstrate security and control measures
A lot of qualitative judgment regarding criticality
Software Common Mode Analysis
Sneak Software Analysis
Safety Analysis – Quantitative
Requires statistical histories, which may not exist
Applies mostly to physical systems
Single-valued Best Estimate
Information sufficient for determinate models
Probabilistic
Science is understood, but only limited parameters are available
Bounding
Putting a ceiling on the answer
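The three styles can be contrasted on a toy data set; every number below (the observed rates and the safety margin) is an illustrative assumption:

```python
# Three quantitative styles applied to a hypothetical component failure rate.
import statistics

observed = [1.2e-4, 0.8e-4, 1.0e-4]  # failures/hour from test history (assumed)

# Single-valued best estimate: data sufficient for a determinate model
best_estimate = sum(observed) / len(observed)          # 1.0e-4

# Probabilistic: science understood, limited parameters -> attach a spread
spread = statistics.pstdev(observed)

# Bounding: put a ceiling on the answer when even a distribution is shaky
ceiling = 10 * max(observed)                           # 1.2e-3, 10x margin

print(f"{best_estimate:.1e} +/- {spread:.1e}, bounded by {ceiling:.1e}")
```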
System Safety Engineering
Identify hazards
Assess hazards (likelihood and criticality)
Design to eliminate or control hazards
Assess risks that cannot be eliminated or controlled
Failure Mode Definitions
Fail-safe
Default to a safe mode; no attempt to execute the operational mission
Fail-operational
Default is to correct the fault and continue with the operational mission
Fail-soft
Default is to continue with degraded operations
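The three responses can be sketched as a dispatch on what the system knows about a detected fault; the predicates and their meanings below are hypothetical:

```python
# Sketch of the three failure responses as dispatch on a detected fault.
from enum import Enum

class Mode(Enum):
    FAIL_SAFE = "abandon mission, enter safe state"
    FAIL_OPERATIONAL = "mask fault, continue full mission"
    FAIL_SOFT = "continue mission with degraded function"

def respond(fault_masked: bool, degraded_ok: bool) -> Mode:
    if fault_masked:       # e.g. a redundant channel took over
        return Mode.FAIL_OPERATIONAL
    if degraded_ok:        # e.g. reduced performance is still acceptable
        return Mode.FAIL_SOFT
    return Mode.FAIL_SAFE  # otherwise default to the safe state

print(respond(fault_masked=True, degraded_ok=False).name)  # FAIL_OPERATIONAL
```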
Designing for Safety
Not possible to ensure safety by analysis or verification alone
Analysis and verification may be cost-prohibitive
Standard design hierarchy (in order of preference):
Intrinsically safe
Prevents or minimizes occurrence of hazards
Controls the hazard
Warns of presence of hazard
Safety Design Mechanisms
Lockout device
Prevents an event from occurring when a hazard is present
Lockin device
Maintains an event or condition
Interlock device
Ensures operations occur in the correct sequence
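The three mechanisms can be sketched on one hypothetical controller (a guarded press; the device and its rules are assumptions): a lockout blocks the press event while the hazard is present, a lockin holds the guard condition once established, and an interlock forces the close-then-arm ordering:

```python
class PressController:
    def __init__(self):
        self.guard_closed = False
        self.armed = False

    def close_guard(self):
        self.guard_closed = True  # lockin: guard condition is held

    def arm(self):
        # Interlock: arming is only legal after the guard is closed
        if not self.guard_closed:
            raise RuntimeError("interlock: close guard before arming")
        self.armed = True

    def cycle_press(self, operator_in_zone: bool) -> str:
        # Lockout: the press event cannot occur while the hazard is present
        if operator_in_zone or not self.armed:
            return "blocked"
        return "pressed"

p = PressController()
p.close_guard()
p.arm()
print(p.cycle_press(operator_in_zone=True))   # blocked
print(p.cycle_press(operator_in_zone=False))  # pressed
```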
Safety Design Principles
Provide leverage for certification
Avoid complexity where possible
Reduce risk by reducing hazard likelihood, severity, or both
Modularize to separate safety-critical functions from non-critical functions
Execute safety-critical functions under separate authority
Fail safe on a single-point failure
Safety Design Principles (cont’d)
Start out in a safe state, and take affirmative actions to reach higher-risk states
Check critical flags as close as possible to the actions they protect
Avoid complements: absence of “armed” is not “safe”
Use “true” values to indicate safety … “false” values can result from common hardware failures
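The flag principles can be sketched together; the register, bit pattern, and protected action below are hypothetical. A distinctive pattern stands in for a positive "true = safe" indication, since an all-zero word from failed hardware cannot masquerade as it, and the check sits immediately before the action it protects:

```python
# Sketch: positive, distinctive "safe" indication, tested right at the
# point of action rather than inferred from the absence of "armed".

SAFE_PATTERN = 0xA5A5  # distinctive value; stuck-at-zero hardware cannot fake it

def read_safety_status() -> int:
    """Stand-in for reading a hardware status register."""
    return SAFE_PATTERN

def fire_thruster() -> str:
    # Check the flag as close as possible to the action it protects --
    # not earlier, when the system state may since have changed.
    if read_safety_status() != SAFE_PATTERN:
        raise RuntimeError("not verifiably safe: aborting")
    return "fired"

print(fire_thruster())
```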
Safety Design Principles (cont’d)
Detection of unsafe states
Watchdog timer
Independent monitors
Asserts and exception handlers
Use backward recovery (return the system to a safe state) instead of forward recovery (plow ahead)
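The watchdog idea can be sketched in software; a real watchdog is usually independent hardware, and the timeouts and recovery action below are illustrative assumptions. The independent monitor forces backward recovery when the control loop stops "petting" it in time:

```python
# Watchdog sketch: an independent monitor triggers backward recovery
# (return to a safe state) if the main loop misses its deadline.
import threading
import time

class Watchdog:
    def __init__(self, timeout: float, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._last_pet = time.monotonic()
        threading.Thread(target=self._run, daemon=True).start()

    def pet(self):
        """Called periodically by the healthy control loop."""
        self._last_pet = time.monotonic()

    def _run(self):
        # Independent monitor: polls more often than the deadline
        while True:
            time.sleep(self.timeout / 4)
            if time.monotonic() - self._last_pet > self.timeout:
                self.on_expire()  # backward recovery: go to safe state
                return

dog = Watchdog(timeout=0.2, on_expire=lambda: print("revert to safe state"))
for _ in range(3):
    dog.pet()        # loop is alive: deadline keeps moving forward
    time.sleep(0.05)
time.sleep(0.5)      # loop hangs: watchdog fires the recovery action
```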
Human Factors
Define the partnership between human and computer
Avoid complacency
Avoid confusion
Avoid passive monitoring
Conclusion
Select a suite of techniques and tools spanning the entire software development process
Apply them conscientiously, consistently, and thoroughly
Consider implementation tradeoffs:
Low catastrophe, high cost alternatives
Moderate catastrophe, moderate cost alternatives
High catastrophe, low cost alternatives