SYSTEM SAFETY Using Probability for Risk Analysis Barbara Luckett – 20 October 2009


Page 1: System Safety

SYSTEM SAFETY

Using Probability for Risk Analysis

Barbara Luckett – 20 October 2009

Page 2: System Safety

Personal Background

Naval Surface Warfare Center Dahlgren Division (NSWCDD)
“…premier research and development center that serves as a specialty site for weapon system integration.” -- NSWCDD website

Platform System Safety Branch
Department of Defense (DoD) acquisition projects

http://www.austal.com/index.cfm?objectID=6B42CC62-65BF-EBC1-2E3E308BACC92365

Page 3: System Safety

System Safety Terms and Concepts

System – “a composite, at any level of complexity, of personnel, procedures, materials, tools, equipment, facilities, and software… used together in the intended operational or support environment to perform a given task or achieve a specific purpose.” – MIL-STD 882C

Safety – “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property, or damage to the environment.” – MIL-STD 882C

Page 4: System Safety

What is System Safety?

“The application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operational effectiveness, time, and cost throughout all phases of the system life cycle.” – MIL-STD 882C

“For almost any system, product, or service, the most effective means of limiting product liability and accident risks is to implement an organized system safety function beginning in the conceptual design phase, and continuing through to its development, fabrication, testing, production, use, and ultimate disposal.” – System Safety Society website

Page 5: System Safety

System Safety Terms and Concepts

Hazard – “any real or potential condition that can cause death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment.” – MIL-STD 882C

Mishap – “an unplanned event or series of events resulting in death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment.” – MIL-STD 882C

Effect – “the result of a mishap (ie: death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment).” – MIL-STD 882C

Page 6: System Safety
Page 7: System Safety
Page 8: System Safety
Page 9: System Safety

Mishap Severity

| Description | Category | Environmental, Safety, and Health Result Criteria |
|---|---|---|
| Catastrophic | 1 | Could result in: death, permanent total disability; system loss, loss exceeding $1M; or irreversible severe environmental damage that violates law or regulation |
| Critical | 2 | Could result in: permanent partial disability, injuries, or occupational illness that may result in hospitalization of at least three personnel; major system damage, loss exceeding $200K but less than $1M; or reversible environmental damage causing a violation of law or regulation |
| Marginal | 3 | Could result in: injury or occupational illness resulting in one or more lost workdays; minor system damage, loss exceeding $10K but less than $200K; or mitigable environmental damage without violation of law or regulation where restoration activities can be accomplished |
| Negligible | 4 | Could result in: injury or illness not resulting in a lost work day; insignificant system damage, loss exceeding $2K but less than $10K; or minimal environmental damage not violating law or regulation |

Page 10: System Safety

Mishap Probability

| Level | Description | Item Criteria | Fleet Criteria |
|---|---|---|---|
| A | Frequent | Likely to occur often in the life of an item, with a probability of occurrence greater than 10^-1 | Continuously experienced |
| B | Probable | Will occur several times in the life of an item, with a probability of occurrence less than 10^-1 but greater than 10^-2 in that life | Will occur frequently |
| C | Occasional | Likely to occur some time in the life of an item, with a probability of occurrence less than 10^-2 but greater than 10^-3 in that life | Will occur several times |
| D | Remote | Unlikely but possible to occur in the life of an item, with a probability of occurrence less than 10^-3 but greater than 10^-6 in that life | Unlikely but can reasonably be expected to occur |
| E | Improbable | So unlikely, it can be assumed occurrence may not be experienced, with a probability of occurrence less than 10^-6 | Unlikely to occur but possible |
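The item criteria above amount to a threshold lookup. The following is a minimal Python sketch of that lookup, not part of the original slides. The function name `probability_level` is hypothetical, and the handling of a value that sits exactly on a threshold (assigned here to the more frequent level) is an assumption, since the criteria only say "greater than" and "less than".

```python
def probability_level(p: float) -> str:
    """Map a probability of occurrence (per item life) to a mishap probability level."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    # Lower bounds of levels A-D, taken from the Item Criteria column.
    # Assumption: a value exactly on a bound goes to the more frequent level.
    lower_bounds = [
        (1e-1, "A: Frequent"),
        (1e-2, "B: Probable"),
        (1e-3, "C: Occasional"),
        (1e-6, "D: Remote"),
    ]
    for bound, level in lower_bounds:
        if p >= bound:
            return level
    return "E: Improbable"


if __name__ == "__main__":
    for p in (0.5, 0.05, 0.005, 1.43e-4, 1e-7):
        print(p, "->", probability_level(p))
```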

Page 11: System Safety

Mishap Risk Index (MRI)

| Mishap Probability | 1: Catastrophic | 2: Critical | 3: Marginal | 4: Negligible |
|---|---|---|---|---|
| A: Frequent | HIGH | HIGH | SERIOUS | MEDIUM |
| B: Probable | HIGH | HIGH | SERIOUS | MEDIUM |
| C: Occasional | HIGH | SERIOUS | MEDIUM | LOW |
| D: Remote | SERIOUS | MEDIUM | MEDIUM | LOW |
| E: Improbable | MEDIUM | MEDIUM | MEDIUM | LOW |

(Columns are the Mishap Severity Categories.)

| Risk Level | Suggested Criteria | Acceptance Authority |
|---|---|---|
| HIGH | Unacceptable | Service Acquisition Executive |
| SERIOUS | Undesirable | Program Executive Officer |
| MEDIUM | Acceptable with review | Program Manager |
| LOW | Acceptable without review | Program Manager |
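The Mishap Risk Index is a two-way lookup: a probability level (A to E) and a severity category (1 to 4) select a risk level, which in turn selects an acceptance authority. A minimal Python sketch of that lookup follows. The names `MRI`, `ACCEPTANCE`, and `mishap_risk` are illustrative and do not come from MIL-STD 882C or the slides.

```python
# Rows: probability level; columns: severity categories 1 (Catastrophic) to 4 (Negligible).
MRI = {
    "A": ["HIGH", "HIGH", "SERIOUS", "MEDIUM"],
    "B": ["HIGH", "HIGH", "SERIOUS", "MEDIUM"],
    "C": ["HIGH", "SERIOUS", "MEDIUM", "LOW"],
    "D": ["SERIOUS", "MEDIUM", "MEDIUM", "LOW"],
    "E": ["MEDIUM", "MEDIUM", "MEDIUM", "LOW"],
}

# Risk level -> (suggested criteria, acceptance authority).
ACCEPTANCE = {
    "HIGH": ("Unacceptable", "Service Acquisition Executive"),
    "SERIOUS": ("Undesirable", "Program Executive Officer"),
    "MEDIUM": ("Acceptable with review", "Program Manager"),
    "LOW": ("Acceptable without review", "Program Manager"),
}


def mishap_risk(probability_level: str, severity_category: int) -> str:
    """Return the risk level for a probability level (A-E) and severity category (1-4)."""
    if severity_category not in (1, 2, 3, 4):
        raise ValueError("severity category must be 1, 2, 3, or 4")
    return MRI[probability_level.upper()][severity_category - 1]


if __name__ == "__main__":
    risk = mishap_risk("D", 1)  # a Remote but Catastrophic mishap
    print(risk, ACCEPTANCE[risk])
```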

Page 12: System Safety

How do we get these values?

Severity values are obtained by brainstorming “worst credible” mishaps in each of three categories:
1. Personnel injury/death
2. Damage to system equipment
3. Environmental damage

Probability values are a little more technical…

Page 13: System Safety

Probability Terms and Concepts

“The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.” - COMAP

“The sample space S of a random phenomenon is the set of all possible outcomes” - COMAP

“An event is any outcome or set of outcomes of a random phenomenon… An event is a subset of the sample space.” - COMAP

-- COMAP text, 7th edition, pages 289-299

Page 14: System Safety

Probability Rules
1. 0 ≤ P(A) ≤ 1
2. P(S) = 1
3. $P(A^c) = 1 - P(A)$, where $A^c$ is the complement of A
4. P(A or B) = P(A) + P(B) – P(A and B)
5. P(A and B) = P(A) x P(B), when A and B are independent
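These rules can be checked on a small sample space. The sketch below uses a fair six-sided die, an illustrative example that is not in the original slides, and exact fractions so the comparisons are not affected by rounding. It also shows why the multiplication rule needs independence: events A and C are not independent, so the product does not equal P(A and C).

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # sample space: faces of a fair die (illustrative)


def P(event):
    """Probability of an event (a subset of S) when all outcomes are equally likely."""
    return Fraction(len(S & set(event)), len(S))


A = {2, 4, 6}   # roll is even
B = {1, 2}      # roll is 1 or 2 (independent of A)
C = {2, 4}      # roll is 2 or 4 (not independent of A)

assert 0 <= P(A) <= 1                        # rule 1
assert P(S) == 1                             # rule 2
assert P(S - A) == 1 - P(A)                  # rule 3: complement
assert P(A | B) == P(A) + P(B) - P(A & B)    # rule 4: addition
assert P(A & B) == P(A) * P(B)               # rule 5 holds: A and B are independent
assert P(A & C) != P(A) * P(C)               # rule 5 fails without independence

print("P(A and C) =", P(A & C), "but P(A) x P(C) =", P(A) * P(C))
```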

Page 15: System Safety

Methods of Obtaining Probability Values
• Fault Tree Analysis
• Historical Mishap Data
• Given Information
• Standard Calculations

Page 16: System Safety

Fault Tree Analysis (FTA)
• Originally developed by Bell Telephone Laboratories in 1962 for the U.S. Air Force
○ Used to analyze probabilities of inadvertent launch of Minuteman missiles
• Technique was expanded and improved upon by Boeing Company
• Fault Trees are now one of the most widely used methods in system reliability and failure probability analysis

Page 17: System Safety

Fault Tree Analysis (FTA)
• A Fault Tree is a top-down structured graphical representation of the interaction of failures and events within a system
○ Basic events (hazards and their causal factors) are at the bottom of the fault tree and are linked via logic symbols (known as gates) to a top event (mishap).
• Events in a Fault Tree are continually expanded until sub-events are created for which you can assign a probability.
• We can use known probability values for the basic events, together with knowledge of logic gates and Boolean logic, to calculate the probability of the mishap occurring.

Page 19: System Safety

AND Gate Logic

All on-site power fails iff Generator #1 fails and Generator #2 fails and Generator #3 fails

A = B and C and D
P(A) = P(B) x P(C) x P(D), assuming the generator failures are independent

[Diagram: AND gate with output A (All on-site power failed) and inputs B (Generator #1 fails), C (Generator #2 fails), D (Generator #3 fails)]
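As a minimal sketch of the AND gate calculation: the generator failure probabilities below are hypothetical values chosen only for illustration, and the product rule assumes the three failures are independent.

```python
from math import prod


def and_gate(probabilities):
    """Probability that all input events occur, assuming they are independent."""
    return prod(probabilities)


# Hypothetical failure probabilities for generators #1, #2, and #3.
p_b, p_c, p_d = 0.01, 0.02, 0.05

p_a = and_gate([p_b, p_c, p_d])
print(f"P(all on-site power fails) = {p_a:.2e}")
```

Because every input must fail, the output probability is never larger than the smallest input probability, which is why redundancy lowers the probability of the top event.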

Page 20: System Safety

OR Gate Logic

Elevator door ‘closed’ failed iff hardware failure or human error or software failure

A = B or C or D
P(A) = P(B) + P(C) + P(D) – P(B)P(C) – P(B)P(D) – P(C)P(D) + P(B)P(C)P(D), assuming the three causes are independent

[Diagram: OR gate with output A (Elevator door ‘closed’ failed) and inputs B (Hardware failure), C (Human error), D (Software failure)]
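The OR gate expression is the expanded form of "one minus the probability that none of the inputs occurs". The sketch below computes it both ways and confirms they agree. The input probabilities are hypothetical, and independence of the inputs is assumed.

```python
from math import prod


def or_gate(probabilities):
    """Probability that at least one input event occurs, assuming independence."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical probabilities for hardware failure, human error, and software failure.
b, c, d = 0.01, 0.02, 0.05

p_a = or_gate([b, c, d])
expanded = b + c + d - b * c - b * d - c * d + b * c * d  # inclusion-exclusion form

print(f"complement form: {p_a:.5f}")
print(f"expanded form:   {expanded:.5f}")
print(f"simple sum:      {b + c + d:.5f}  (upper bound, close when inputs are small)")
```

When the input probabilities are small, the cross terms are tiny and the simple sum is a slightly conservative approximation.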

Page 21: System Safety

FTA Methodology

Generally involves five steps:
1. Define the undesired top event (mishap)
2. Obtain an understanding of the system
3. Construct the fault tree
○ Deductively define all potential failure paths
4. Evaluate the probability of the top event
5. Analyze output and determine what is required to mitigate the top event

Page 22: System Safety

1. Define the undesired top event (mishap)
○ Fire Protection Systems Fail
2. Obtain an understanding of the system
○ Primary smoke detection system with secondary heat detection system
○ AFFF (Aqueous film-forming foam) fire suppression system
3. Construct the fault tree
○ Deductively define all potential failure paths

[Fault tree diagram]
Fire Protection Systems Fail
    Fire detection system fails
        Smoke detection system fails
        Heat detection system fails
    Fire suppression system fails
        Blocked nozzles
        Pump fails

Page 23: System Safety

Historical Mishap Data

Using the probability of an event occurring in the past to predict the probability of the event occurring in the future

EX) If we have a fleet of 5 ships (each with 6 freight elevators onboard) that have been in operation for 20 years (each elevator used approx. 35 hours/year), with 3 injuries caused by elevator malfunctions:

| Operational Hours per Year (per ship) | Total Operational Time in years | Total Operational Hours |
|---|---|---|
| 210 (6 elevators x 35 hours) | 100 (5 ships x 20 years) | 21000 |

The probability of a mishap can now be determined by dividing the number of times a mishap has occurred by the total operational hours

P(mishap) = # of mishaps / total hours = 3/21000 = $1.42857 \times 10^{-4}$ per operational hour

This falls into the REMOTE probability level
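The arithmetic of this example can be reproduced step by step. This is a minimal sketch using only the figures stated in the example. The variable names are illustrative.

```python
ships = 5
elevators_per_ship = 6
hours_per_elevator_per_year = 35
years_in_operation = 20
mishaps = 3

hours_per_ship_per_year = elevators_per_ship * hours_per_elevator_per_year  # 210
total_ship_years = ships * years_in_operation                               # 100
total_hours = hours_per_ship_per_year * total_ship_years                    # 21000

p_mishap = mishaps / total_hours  # mishaps per operational hour
print(f"total operational hours = {total_hours}")
print(f"P(mishap) = {p_mishap:.5e}")
```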

Page 24: System Safety

Given Information: Hardware Components

Often, probability of failure for a system’s hardware components may be available

EX) Consider a system with an operational function that is dependent on all four of the individual components working (i.e., the system function fails if any one of the components fails):

| Component name | Component type | P(failure) per 1 million operational hours | P(success) |
|---|---|---|---|
| 100 | A | 0.0643 | 0.9357 |
| 101 | B | 0.0917 | 0.9083 |
| 102 | C | 0.075 | 0.925 |
| 103 | B | 0.0917 | 0.9083 |

P(system failure) = 1 – [P(component A does not fail) x P(B does not fail) x P(C does not fail) x P(B does not fail)]
= 1 – [(0.9357)(0.9083)(0.925)(0.9083)]
= 1 – 0.71406
= 0.28594 per 1 million operational hours
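A system that needs all of its components to work is a series system: it succeeds only if every component succeeds. The sketch below repeats the calculation with the failure values from the table and assumes the component failures are independent. The dictionary layout is illustrative.

```python
from math import prod

# P(failure) per 1 million operational hours, keyed by component name.
p_failure = {
    "100 (type A)": 0.0643,
    "101 (type B)": 0.0917,
    "102 (type C)": 0.075,
    "103 (type B)": 0.0917,
}

# The system works only if every component works (independence assumed).
p_system_success = prod(1 - q for q in p_failure.values())
p_system_failure = 1 - p_system_success

print(f"P(system success) = {p_system_success:.5f}")
print(f"P(system failure) = {p_system_failure:.5f} per 1 million operational hours")
```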

Page 25: System Safety

Given Information: Test Scenarios

Operational tests can be conducted to provide an estimate of failure for certain system components

EX) We can run a series of tests on a fire suppression system and note when the fire is extinguished.
○ Define a success here as an event where the fire is extinguished in less than 60 seconds from system activation.
○ If we conduct 10 tests, and the system fails to extinguish the fire in under a minute once, we have P(failure) = 1/10 = 0.1
○ This estimate is not very accurate because the sample size is small

Page 26: System Safety

Standard Calculations: Event Types

Let $q_i(t)$ = P(Failure of unit i occurs at time t)

Different types of events:
1. Non-repairable unit
○ Unit i is not repaired when a failure occurs
○ Failure rate of $\lambda_i$
○ $q_i(t) = 1 - e^{-\lambda_i t} \approx \lambda_i t$ (the approximation holds when $\lambda_i t$ is small)
2. Repairable unit (repaired when failure occurs)
○ Unit i is repaired when a failure occurs and is assumed to be as good as new following a repair
○ Failure rate of $\lambda_i$
○ Mean Time to Repair of $MTTR_i$
○ $q_i(t) \approx \lambda_i \times MTTR_i$

Page 27: System Safety

Standard Calculations: Event Types

3. Periodically tested (hidden failures)
○ Unit i is tested periodically with test interval $\tau$
○ Failure may occur at any time in the test interval, but the failure is only detected in a test or if a demand for the unit occurs.
○ Typical for safety-critical units (i.e., smoke detectors)
○ Failure rate of $\lambda_i$
○ Test interval of $\tau_i$
○ $q_i(t) \approx \dfrac{\lambda_i \times \tau_i}{2}$
4. On-demand probability
○ Unit i is not active during normal operation, but may be subject to one or more demands
○ Often used for human (operator) error
○ $q_i(t)$ = P(i fails on request)
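The first three event types can be sketched as small Python functions. The forms used for the repairable unit (failure rate times MTTR) and the periodically tested unit (failure rate times test interval, divided by two) are the common small-value approximations and are assumed here to be the intended formulas. All function names and numeric inputs are hypothetical.

```python
import math


def q_non_repairable(failure_rate: float, t: float) -> float:
    """Probability that a non-repairable unit has failed by time t."""
    return 1 - math.exp(-failure_rate * t)


def q_repairable(failure_rate: float, mttr: float) -> float:
    """Approximate unavailability of a repairable unit (valid when rate x MTTR is small)."""
    return failure_rate * mttr


def q_periodically_tested(failure_rate: float, test_interval: float) -> float:
    """Approximate average unavailability of a unit with hidden failures."""
    return failure_rate * test_interval / 2


if __name__ == "__main__":
    # Hypothetical inputs: rates in failures per hour, times in hours.
    print(q_non_repairable(1e-5, 1000))    # close to the linear approximation 1e-5 * 1000
    print(q_repairable(1e-4, 8))
    print(q_periodically_tested(1e-5, 720))
```

The fourth type, on-demand probability, has no formula to compute: it is a single estimated value such as a human error probability.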

Page 28: System Safety

Standard Calculations: Why is Human Error important?

Human beings are an integral part of any system, so we cannot accurately estimate the probability of failure without taking people into consideration

“Estimates of the probability that a person will, for example, have a moment’s forgetfulness or lapse of attention and forget to close a valve or close the wrong valve, press the wrong button, make a mistake in arithmetic, and so on… They are not estimates of the probability of error due to poor training or instructions, lack of physical or mental ability, lack of motivation, or poor management”

“… Because so much judgment is involved, it is tempting for those who wish to do so to try to ‘jiggle’ the figures to get the answers they want… Anyone who uses estimates of human reliability outside the usual ranges should be expected to justify them.” – An Engineer’s View of Human Error by Trevor Kletz

Page 29: System Safety

Standard Calculations: Human Error Probability

Human Error Probability Parameters

| Type of Activity | K1 |
|---|---|
| Simple, routine | 0.001 |
| Requiring attention, routine | 0.01 |
| Not routine | 0.1 |

| Temporary Stress Factor for routine activities, seconds available | K2 |
|---|---|
| 2 | 10 |
| 10 | 1 |
| 20 | 0.5 |

| Temporary Stress Factor for non-routine activities, seconds available | K2 |
|---|---|
| 3 | 10 |
| 30 | 1 |
| 45 | 0.3 |
| 60 | 0.1 |

P (Human Error) ≈ K1 x K2 x K3 x K4 x K5

Page 30: System Safety

Standard Calculations: Human Error Probability

Human Error Probability Parameters, continued

| Operator Qualifications | K3 |
|---|---|
| Carefully selected, expert, well-trained | 0.5 |
| Average knowledge and training | 1 |
| Little knowledge, poorly trained | 3 |

| Activity Anxiety Factor | K4 |
|---|---|
| Situation of grave emergency | 3 |
| Situation of potential emergency | 2 |
| Normal situation | 1 |

| Activity Ergonomic Factor | K5 |
|---|---|
| Excellent microclimate, excellent interface with plant | 0.1 |
| Good microclimate, good interface with plant | 1 |
| Discrete microclimate, discrete interface with plant | 3 |
| Discrete microclimate, poor interface with plant | 7 |
| Worst microclimate, poor interface with plant | 10 |

Page 31: System Safety

Standard Calculations: Human Error Probability

Consider one scenario:
• Type of activity: Requiring attention, routine; K1 = 0.01
• Stress factor: More than 20 seconds available; K2 = 0.5
• Operator qualifications: Average knowledge and training; K3 = 1
• Activity anxiety factor: Potential emergency; K4 = 2
• Activity ergonomic factor: Good microclimate, good interface with plant; K5 = 1

P (Human Error) ≈ K1 x K2 x K3 x K4 x K5 = 0.01 x 0.5 x 1 x 2 x 1 = 0.01

In this situation, a person will fail 1% of the time

This falls into the PROBABLE category (at its lower boundary of 10^-2)
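The five-factor product can be sketched in Python with the K tables transcribed from the slides. The dictionary keys are shortened labels chosen for this sketch, and capping the result at 1 is an added assumption, since the product of the largest factors can exceed 1.

```python
K1_ACTIVITY = {"simple, routine": 0.001, "requiring attention, routine": 0.01, "not routine": 0.1}
K2_ROUTINE_SECONDS = {2: 10, 10: 1, 20: 0.5}
K2_NON_ROUTINE_SECONDS = {3: 10, 30: 1, 45: 0.3, 60: 0.1}
K3_OPERATOR = {"expert": 0.5, "average": 1, "poorly trained": 3}
K4_ANXIETY = {"grave emergency": 3, "potential emergency": 2, "normal": 1}
K5_ERGONOMIC = {
    "excellent/excellent": 0.1,
    "good/good": 1,
    "discrete/discrete": 3,
    "discrete/poor": 7,
    "worst/poor": 10,
}


def human_error_probability(k1, k2, k3, k4, k5):
    """P(Human Error) as the product of the five factors, capped at 1 (assumption)."""
    return min(1.0, k1 * k2 * k3 * k4 * k5)


# The scenario from the slide.
p = human_error_probability(
    K1_ACTIVITY["requiring attention, routine"],
    K2_ROUTINE_SECONDS[20],
    K3_OPERATOR["average"],
    K4_ANXIETY["potential emergency"],
    K5_ERGONOMIC["good/good"],
)
print(f"P(human error) = {p}")
```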

Page 32: System Safety

Back to a Fault Tree Example…

Page 33: System Safety

[Fault tree diagram]
Alarm clock does not wake you up
    Alarm clock failure
        Main (plug-in) clock failure
            Faulty clock
                Electrical Fault
                Mechanical Fault
            Power outage
            Forgot to set (or set incorrectly)
        Backup (wind-up) clock failure
            Faulty clock
            Forget to set (or set incorrectly)
            Forget to wind
    You don’t hear it

Page 34: System Safety

[Fault tree diagram with basic event probabilities]
Alarm clock does not wake you up
    Alarm clock failure
        Main (plug-in) clock failure
            Faulty clock
                Electrical Fault (P = 0.0003)
                Mechanical Fault (P = 0.0004)
            Power outage (P = 0.012)
            Forgot to set (or set incorrectly) (P = 0.008)
        Backup (wind-up) clock failure
            Faulty clock (P = 0.0004)
            Forget to set (or set incorrectly) (P = 0.008)
            Forget to wind (P = 0.012)
    You don’t hear it (negligible)

Page 35: System Safety

Probability that the Backup (wind-up) clock fails?

P (backup clock failure) = P (faulty clock) + P (forget to wind) + P (forget to set)

P (backup clock failure) = 0.0004 + 0.012 + 0.008

P (backup clock failure) = 0.0204

[Fault tree diagram]
Backup (wind-up) clock failure
    Faulty clock (P = 0.0004)
    Forget to set (or set incorrectly) (P = 0.008)
    Forget to wind (P = 0.012)

Page 36: System Safety

Probability that the Main (plug-in) clock fails?

P (main clock failure) = P (power outage) + P (faulty clock) + P (forget to set)

P(main clock failure) = 0.012 + (0.0003 +0.0004) + 0.008

P(main clock failure) = 0.012 + 0.0007 +0.008

P(main clock failure) = 0.0207

[Fault tree diagram]
Main (plug-in) clock failure
    Faulty clock
        Electrical Fault (P = 0.0003)
        Mechanical Fault (P = 0.0004)
    Power outage (P = 0.012)
    Forgot to set (or set incorrectly) (P = 0.008)

Page 37: System Safety

Probability that the Alarm Clock Does Not Wake You Up?

P (Alarm Clock Failure) = P (Main Clock Failure) x P (Backup Clock Failure) = 0.0207 x 0.0204 = 0.000422

P (Alarm Clock Does Not Wake You Up) = P (Alarm Clock Failure) + P (You Don’t Hear It)

P (Alarm Clock Does Not Wake You Up) = 0.000422 + negligible ≈ 0.000422 = $4.22 \times 10^{-4}$

This falls into the REMOTE category

[Fault tree diagram]
Alarm clock does not wake you up
    Alarm clock failure
        Main (plug-in) clock failure (P = 0.0207)
        Backup (wind-up) clock failure (P = 0.0204)
    You don’t hear it (negligible)
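The whole alarm clock tree can be evaluated bottom-up in a few lines of Python. This sketch is not from the slides. It assumes independent basic events, treats "you don't hear it" as probability 0, and uses the exact OR formula rather than the simple sums used in the worked example, so its result is slightly lower than the slide value. The gate types are inferred from the worked calculations: sums for OR, a product for AND.

```python
from math import prod


def or_gate(probabilities):
    """At least one input occurs (independent inputs)."""
    return 1 - prod(1 - p for p in probabilities)


def and_gate(probabilities):
    """All inputs occur (independent inputs)."""
    return prod(probabilities)


# Basic event probabilities from the fault tree.
faulty_main = or_gate([0.0003, 0.0004])           # electrical or mechanical fault
main_clock = or_gate([0.012, faulty_main, 0.008])  # outage, faulty clock, or not set
backup_clock = or_gate([0.0004, 0.008, 0.012])     # faulty, not set, or not wound

alarm_failure = and_gate([main_clock, backup_clock])  # both clocks must fail
top_event = or_gate([alarm_failure, 0.0])             # "don't hear it" taken as 0

slide_value = 0.0207 * 0.0204  # rare-event (simple sum) approximation from the slides
print(f"exact OR gates:     {top_event:.3e}")
print(f"slide sum approach: {slide_value:.3e}")
```

Both results fall between 10^-3 and 10^-6, so the choice of OR formula does not change the REMOTE classification here.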

Page 38: System Safety

Conclusions
• System Safety is a risk management strategy based on identifying, analyzing, and eliminating or mitigating hazards using a systems-based approach.
• Hazards are evaluated and analyzed based on the severity and probability values for their corresponding mishap.
• Probability values can be obtained by using basic probability rules and Boolean logic in addition to historical data, published failure values, an understanding of potential failure paths in a system, and some simple calculations.
• This allows us to quantitatively analyze risk levels and make an informed recommendation or decision.

Page 39: System Safety

Sources
• MIL-STD 882C
• Introduction to System Safety: Tutorial for the 19th International System Safety Conference by Dick Church
• An Engineer’s View of Human Error by Trevor Kletz
• For All Practical Purposes: Mathematical Literacy in Today’s World, 7th edition
• http://www.navsea.navy.mil/nswc/dahlgren/default.aspx
• http://www.system-safety.org/about/
• http://www.weibull.com/basics/fault-tree/index.htm
• http://www.fault-tree.net/papers/andrews-fta-tutor.pdf
• http://www.fault-tree-analysis-software.com/fault-tree-analysis-basics.html
• http://www.ntnu.no/ross/srt/slides/fta.pdf