© 2019, E.F. Weller Slide 1 Ver 1.0
Root Cause Analysis (RCA)
The Down Part of “Shift Left and Down”
Presented by
Ed Weller
Integrated Productivity Solutions
Introductions
• Who am I?
• Who are you?
– RCA can be used by people in these roles (and
others)
• Project managers
• Developers
• Testers
• Process
• Business analysts
Agenda
• What this short webinar should provide to attendees
• Definition of “Root Cause”
• Why are defects prevalent in software
• What does “Shift Left then Down” mean
• Techniques to achieve “then Down”
• Tips on starting to use RCA
Root Cause Definition (1)
• “Bug” is overused and inaccurate
– Error – a human action that produces an incorrect result
– Defect – a flaw in the component or system that may cause a failure
– Failure – deviation of the component from its expected result
• Failures occur in test or use
• Defects are found by reviews, inspections, and
static analyzers
• Errors are the mistakes we make
Above definitions taken from the ASTQB Glossary of Terms
Root Cause Definition (2)
• We use these definitions in software
development
– The cause of the failure is identified by the developer
or tester when debugging the failure in test or use
– The root cause is the underlying error or mistake
made by the developer, tester, business analyst, etc.
• By eliminating the root cause, we reduce the
number of errors, hence fewer defects
Shift Left Then Down
• Shift Left means remove defects sooner
– System Test before use
– Unit Test before System Test
– Reviews, Inspections, or pairing
• Shift Down means reducing the number of errors
that lead to defects
– Directly addresses the “Error” in the Error-Defect-Failure sequence
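One way to picture the difference between the two shifts is a toy model in which errors inject defects and each removal stage catches a fraction of what reaches it. This is only a sketch: the stage names come from the slide, but the function name `escaped_defects`, the defect counts, and the catch rates are all illustrative assumptions, not data from the webinar.

```python
def escaped_defects(injected, stage_effectiveness):
    """Return (stage, defects remaining after that stage) for each stage in order.

    injected: defects created by errors (hypothetical count)
    stage_effectiveness: list of (stage name, fraction caught) pairs (assumed values)
    """
    remaining = float(injected)
    trace = []
    for stage, caught in stage_effectiveness:
        remaining *= (1.0 - caught)
        trace.append((stage, remaining))
    return trace

# Assumed baseline: 100 injected defects, each stage catches half of what it sees.
stages = [("Reviews", 0.5), ("Unit Test", 0.5), ("System Test", 0.5)]
baseline = escaped_defects(100, stages)

# Shift Left: the earliest stage catches more, so fewer defects reach later stages.
shift_left = escaped_defects(
    100, [("Reviews", 0.75), ("Unit Test", 0.5), ("System Test", 0.5)]
)

# Shift Down: fewer errors are made, so fewer defects exist at every stage.
shift_down = escaped_defects(50, stages)
```

In this model Shift Left changes where defects are removed, while Shift Down changes how many exist to be removed in the first place.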
Setting the Stage (1)
• Defects in software development are a natural
consequence of the development processes
• That defects are a natural consequence does
NOT mean we have to accept them
– Many organizations have learned how to prevent or
detect defects to the point their products are nearly
defect free and failure free
Setting the Stage (2)
• Professionals learn over time how to reduce
their defects – some better than others
• Systematic approaches to defect prevention are
provably better than individual or haphazard
approaches
– 10:1 or better reductions in defect injection rates
– 100:1 or better on deliverable quality levels
• This webinar will cover the reasons for RCA and the
techniques used to perform it
Setting the Stage (3)
• Root cause analysis (RCA) for software really IS different
– Industrial accidents
• Single events
• High cost
• Examples: Three Mile Island, TWA 800, chemical plants
– Software failures
• Multiple failures caused by many defects
– 1-2 defects per KSLOC in million-line products (1000-2000 defects per million lines of code)
– In many cases lower cost per failure (but not always)
– Significant impact on the RCA process
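The defect-count arithmetic above can be checked with a few lines of Python. The function name `expected_defects` is an illustrative assumption; the densities and product size are the ones quoted on the slide.

```python
def expected_defects(sloc, defects_per_ksloc):
    """Expected defect count for a product of `sloc` source lines of code."""
    return sloc / 1000 * defects_per_ksloc

low = expected_defects(1_000_000, 1)   # 1 defect per KSLOC
high = expected_defects(1_000_000, 2)  # 2 defects per KSLOC
print(f"{low:.0f} to {high:.0f} defects per million lines of code")
```

With counts in this range, analyzing every defect individually is rarely practical, which is why the selection of what to analyze matters.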
Setting the Stage (4)
• Impact of large number of failures
– Cannot (economically) do RCA on all failures
– Suggests a systematic selection of failures to be
analyzed
– Looking for common causes (to eliminate multiple
defects with one analysis and solution)
– Velocity of change is important (time from RCA to
impact in development)
• If it isn’t going to happen again, don’t analyze
the cause!
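A systematic selection can be as simple as counting failures by a coarse cause category and analyzing only the categories that recur. The sketch below assumes defect records already carry a triage category; the record IDs, the category names, and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical defect records: (defect id, cause category assigned during triage)
defects = [
    ("D-101", "interface misunderstanding"),
    ("D-102", "missing requirement"),
    ("D-103", "interface misunderstanding"),
    ("D-104", "one-off typo"),
    ("D-105", "missing requirement"),
    ("D-106", "interface misunderstanding"),
]

def select_for_rca(records, min_count=2):
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(cause for _, cause in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]

selected = select_for_rca(defects)
```

Categories that appear only once are left out, in the spirit of "if it isn't going to happen again, don't analyze the cause".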
Setting the Stage (5)
• The goal of RCA can be simply stated:
– By analyzing defects and failures, eliminate the
introduction of faults into software work products by
understanding and eliminating the human error that
introduced the fault
– A secondary goal is the earlier detection and
correction of defects
RCA Process Overview
Defect and Problem Sources
• Defects from Production, Test, Reviews
• Issues raised in Retrospectives
• Other problems causing inefficiency in the
organization
Barriers to Defect Prevention
• Need defect data
– Production, test, and development problems
– Defect description / fix information or problem
description sufficient to allow root cause analysis
• Need open teams
– Participation of person who made the error and peers
– “What happens in Vegas stays in Vegas”
• Defect data is used to appraise personnel
– Making mistakes happens; failing to learn from
mistakes is another story
Prerequisites for Success of RCA
• Infrastructure to enable the following
– Project management: provides a basis for planning
the resources needed for successful RCA and
completion of action plans
– Engineering/Development: analysis of defect data
found in production, test, and development to
eliminate errors that cause defects
– Engineering/Test: analysis of defect data from
production and test to improve tests and detect
problems before entering production
Are There Alternatives to RCA?
• How can defects be prevented without analysis
and involvement of teams with domain
knowledge? (i.e., by a third party)
• How much defect prevention will occur without a
plan?
• Answer: Not really and not much!
What Will RCA Cost?
• Non-recurring costs include:
– Training
– Plans
– Procedures
• Recurring costs include:
– Causal analysis meetings
– Analyzing the defect data and proposing actions
– Revising checklists and processes
– Providing feedback to the organization
– Measuring the RCA activities
Where’s the Payback?
• Defect types are prevented
– One analysis session may eliminate a class of defects
– One cause analysis may lead to finding similar problems in production releases before failures occur
• Processes are revised based on data and analysis
• Returns from RCA
– 13:1 in a report at a Software Engineering Institute conference
– Mays-Jones reference, last slide (50% reduction in defects injected)
Defect “Maturity”
Defect Causes
• Understanding the causes of defects
– Communication
– Short Term Memory
– Cognitive Dissonance
– Complexity of Task
– Processes
Communication Noise
• 3 people: 3 paths
• 5 people: 10 paths
• 10 people: 45 paths
Is this likely to be lossless communication?
From: Software Inspections: Key Elements for Success, STARWest 2006, by E. F. Weller, Software Technology Transition
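The path counts above match the number of distinct pairs in a group, n(n-1)/2. A minimal sketch follows; the function name `communication_paths` is an illustrative choice.

```python
def communication_paths(people):
    """Number of distinct person-to-person paths among `people` team members."""
    return people * (people - 1) // 2

for n in (3, 5, 10):
    print(n, "people:", communication_paths(n), "paths")
```

The count grows roughly with the square of team size, so each added person adds more paths than the one before.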
Short Term Memory
• 7 +/- 2 (or 5 +/- 2)
• Interruptions
• Phone, Boss, Co-worker
• Multi-tasking
• Context switch and reload on each interrupt
Cognitive Dissonance
We tend to see what we “know”, not what is really there
(Figure labels: “What we think we see” vs. “What’s really there”)
D.R. Graham, “Test is a Four Letter Word”
1st International Software Testing Analysis & Review Conf
Complexity of Task
• The most persistent defects are related to
inherent complexity of the product to be
developed*
• Design complexity exceeds our level of
understanding
• Code developed does not match complexity of
the design (what does this say about code
complexity measures?)
* Robert Glass, “Persistent Software Errors: 10 Years Later”,
1st International Software Testing Analysis & Review Conf
Processes
• Process, or lack thereof, may cause up to 85%
of the defects in a product
• This means that personnel capability is NOT one
of the major contributors to defective work
products
• Unless these 5 defect causes are
understood and believed by the organization,
RCA will be limited to non-threatening
analysis and preventive actions that cannot
be realized
Using Defect Data
• Defect causes and root cause taxonomy indicate
people are (usually) not the cause
• If defect data is seen to be tied to individual
performance appraisals or a merit system,
defect prevention (as well as the inspection
process, and test data) will be compromised
• Do you have a policy governing the use of defect
data?
Varieties of RCA
Many Varieties
• Most common are
– “Light bulb” – informal, “It is obvious (maybe)”
– 5-Why
– Apollo
– Ishikawa (Fishbone) analysis
• With any of these, the trick is knowing when to
stop, or when you really are not finished
– Stop too soon and you do not have an actionable root
cause
– Go too far and you get caught in endless discussion
• Searching for “the perfect answer”
• Causes that you do not or cannot control
What Is the Expectation?
• While earlier detection is a good thing,
prevention is the goal
– “System test should have found it” and
“Requirements were incomplete” are not root causes
• But they may be out of your control and require
passing to a different team
– Know your limitations
– Group these into batches as the root cause may be
easier to identify with multiple defects
Why Look at Three Methods
• Understanding multiple techniques broadens
knowledge
• Fishbone method may presume sets of causes
and limit analysis
• 5-Why and Apollo (see Gano reference), while
different, may enable a better understanding of
how to think about root causes
– Similar, so we will look at Apollo first
– After Apollo/5-Why, Fishbone may be appropriate to
group similar causes
Apollo Method
• Built on the principle that there may be multiple causes for a defect; the solution we choose ultimately identifies “the” cause
• Method avoids presupposing a cause, or converging on a cause too soon
• For example – fire/explosion caused by
– Ignition source
– Combustibles
– Oxygen
• Which is the cause? Absent any one, you cannot have the fire/explosion (ref. TWA 800)
“Caused By”
• Key phrase used to connect elements in a cause-effect
diagram or, more properly stated, an
effect-cause diagram
(Diagram: the Primary Effect is “Caused By” a Condition Cause and an Action Cause, each supported by Evidence; these causes are in turn “Caused By” a further Condition Cause and Action Cause, each with Evidence. A branch is read as “and”.)
One or more of the 4 branches could be the prevention solution
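An effect-cause chart of this kind can be held in a small tree structure. The sketch below uses the fire/explosion example from the Apollo Method slide; the class and function names are hypothetical, and labeling the ignition source as an action and the other two as conditions is an assumption made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Cause:
    description: str
    kind: str            # "effect", "action", or "condition"
    evidence: str = ""   # left empty here; a real chart records evidence per cause
    caused_by: list = field(default_factory=list)  # sibling causes are read as "and"

def describe(node):
    """Render one level of the chart as a 'caused by' sentence."""
    if not node.caused_by:
        return f"{node.description} (no further causes recorded)"
    joined = " and ".join(c.description for c in node.caused_by)
    return f"{node.description} caused by {joined}"

def candidate_solutions(node):
    """Collect every cause below `node`; each one is a possible prevention point."""
    found = []
    for cause in node.caused_by:
        found.append(cause)
        found.extend(candidate_solutions(cause))
    return found

fire = Cause("Fire/explosion", "effect", caused_by=[
    Cause("ignition source", "action"),       # assumed classification
    Cause("combustibles", "condition"),       # assumed classification
    Cause("oxygen", "condition"),             # assumed classification
])
print(describe(fire))
```

Because siblings are joined by "and", removing any one of them prevents the effect, which is why every branch is kept as a candidate rather than naming a single cause up front.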
5-Why
• Less structured than Apollo method
– Action, Condition, and Evidence are not usually part
of the process
– The number “5” can lead to going too far, or not far
enough – 5 becomes the goal
• For more straightforward problems this may
work, particularly if you find the additional rigor
of the Apollo method is not returning value
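A 5-Why chain can be written down as an ordered list of answers, with the stopping rule based on whether a cause is actionable and under the team's control rather than on reaching five. This is a sketch under that assumption; the failure, the answers, and the two flags are a hypothetical software example, not one from the webinar.

```python
def stop_point(chain):
    """Return (depth, cause) for the first cause that is actionable and in our control.

    chain: list of dicts ordered from the failure outward, one per "why?" answer.
    Returns None if no such cause was reached, meaning the analysis is not finished.
    """
    for depth, link in enumerate(chain, start=1):
        if link["actionable"] and link["in_our_control"]:
            return depth, link["cause"]
    return None

# Hypothetical chain for the failure "service crashed on startup".
whys = [
    {"cause": "a config field was empty",
     "actionable": False, "in_our_control": True},
    {"cause": "the field was added without a default value",
     "actionable": False, "in_our_control": True},
    {"cause": "the config change checklist has no item about defaults",
     "actionable": True, "in_our_control": True},
]
result = stop_point(whys)
```

Here the chain stops at the third "why"; forcing two more questions would add discussion without adding an action.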
Ishikawa Method (1)
• More “restrictive” if you start with anticipated
categories in the cause-effect analysis
• Useful if RCA teams need more structure, and
history indicates the set of causes is repeatable and
limited
Figure from: www.isixsigma.com
Ishikawa Method (2)
• Presupposing causes may limit the team
• Fishbone diagram can be unwieldy
• More useful as a sorting and grouping technique
to refine outputs from Apollo or 5-Why
– Takes effort
– Limit to the significant few problems?
Exercise
One of two used in the full day workshop, as well
as real problems from the attendees
It Was a Cold, Dark, and Stormy Night
• Identify the root cause(s) that would eliminate
the primary effect (woodshed burning down) in
the following scenario
The Story
It was a dark, cold and stormy night as Seth staggered home from the local tavern. As Seth entered his house, he noticed the power was down and the house was frigid. When he went to the ready wood-pile on his porch, it was empty, meaning a trip in the dark to the woodshed. Grabbing a flashlight, Seth noticed the batteries were dead, and there were no spares. He lit his lantern, and staggered to the woodshed to get firewood.
Upon entering the woodshed, he tripped over a log left in front of the door, losing his balance and dropping the lantern. The lantern broke, spilling burning oil on the floor. The burning oil ignited the sawdust and woodchips on the floor, causing the woodshed to burn down. Seth fortunately escaped without injury.
Solutions Phase
Walk the Chart Backwards
• Start from the right, ask
– Why is the Cause there?
– What can we do to prevent the primary cause at the
lowest cost?
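(Diagram: the effect-cause chart again, with the Primary Effect “Caused By” a Conditional Cause and an Action Cause, each with Evidence; a branch is read as “and”. A “Start Here” marker shows where to begin walking back.)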
Solution Evaluation Criteria
• Prevent recurrence (or earlier detection if
prevention is deemed unlikely)
• Under our control
• Meet goals and objectives
– Does not cause unacceptable problems
– Applies to other problem causes
– Reasonable value for cost
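These criteria can be applied as a simple screen over candidate solutions. The sketch below treats four of the criteria as required and uses "applies to other problem causes" only to rank what passes; that split, the class and field names, and the candidate solutions are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    description: str
    prevents_or_detects_earlier: bool
    under_our_control: bool
    no_unacceptable_problems: bool
    reasonable_value_for_cost: bool
    applies_to_other_causes: bool = False

def passes(s):
    """True when the solution meets every required criterion."""
    return (s.prevents_or_detects_earlier and s.under_our_control
            and s.no_unacceptable_problems and s.reasonable_value_for_cost)

def rank(solutions):
    """Keep passing solutions; list those that apply to other causes first."""
    kept = [s for s in solutions if passes(s)]
    return sorted(kept, key=lambda s: s.applies_to_other_causes, reverse=True)

# Hypothetical candidates from an RCA session.
candidates = [
    Solution("Add a unit test for the missing default", True, True, True, True),
    Solution("Ask another team to rewrite their loader", True, False, True, True),
    Solution("Add a defaults item to the config change checklist",
             True, True, True, True, applies_to_other_causes=True),
]
ranked = [s.description for s in rank(candidates)]
```

The second candidate is dropped because it is not under the team's control, and the checklist change ranks first because it would also cover other causes.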
Solutions to Avoid
• Punish, reprimand, fire
• Any output that says “investigate, review,
analyze”
– This means you do not have an actionable solution
• Warning signs or slogans
• A new process if the old process wasn’t followed
• Fault transference (someone else/another group
should have caught it) as an out
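The "investigate, review, analyze" warning lends itself to a crude automated screen over proposed action items. This is a sketch only: the function name and sample items are hypothetical, and a keyword match will also flag legitimate actions that merely mention one of these words, so flagged items still need a human look.

```python
import re

# Words the slide names as signs of a non-actionable output.
NON_ACTIONABLE = ("investigate", "review", "analyze")

_PATTERN = re.compile(r"\b(" + "|".join(NON_ACTIONABLE) + r")\b", re.IGNORECASE)

def flag_non_actionable(action_items):
    """Return the action items that contain one of the warning words."""
    return [item for item in action_items if _PATTERN.search(item)]

# Hypothetical outputs from an RCA session.
items = [
    "Investigate why the build broke",
    "Add a defaults item to the config change checklist",
    "Review test coverage",
]
flagged = flag_non_actionable(items)
```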
Phrases to Avoid
• Too hard to find
• Not the way we do things
• We tried that before
• No budget
• Management won’t like it
• Good Idea, but…
• It will cost too much (depends on the return;
facts welcome, keep opinions to yourself)
• That failed before (need to know why it failed)
Time Lag for RCA of Production Failures
(Figure: timeline starting with a failure in Production Release N, followed by RCA and Action. Later releases, N+1 through N+4, are shown at different stages (Req, Design, Code, Test, Production) when the action lands, with labels ranging from “No effect” through “Minimal effect” to “Most effect”.)
References
• Card, D., “Understanding Causal Systems”, http://stsc.hill.af.mil/crosstalk/2004/10/0410Card.html
• Chillarege, Bhandare, Halliday, “A Case Study of Software Process Improvement During Development”, IEEE Transactions on SW Engineering, Vol 19, No 2, Dec 1993
• Gano, Dean L., Apollo Root Cause Analysis, Apollonian Publications, 2003
• Humphrey, Managing the Software Process, Addison-Wesley, 1989
• Jones, “A process-integrated approach to defect prevention”, IBM Systems Journal, Vol 24, No 2, 1985
• Mays, Jones, Holloway, and Studinski, “Experiences with Defect Prevention”, IBM Systems Journal, Vol 29, No 1, 1990
– http://domino.research.ibm.com/tchjr/journalindex.nsf/Home?OpenForm and search for “Defect Prevention”