root cause analysis workshop - rbcs, inc · 2019-03-25 · • root cause analysis (rca) for...
2019, E.F. Weller Slide 1 Ver 1.0
Root Cause Analysis (RCA) The Down Part of “Shift Left and Down”
Presented by
Ed Weller Integrated Productivity Solutions, LLC
Introductions
• Who am I?
• Who are you?
  – RCA can be used by people in these roles (and others)
    • Project managers
    • Developers
    • Testers
    • Process
    • Business analysts
Agenda
• What this short webinar should provide to attendees
• Definition of “Root Cause”
• Why are defects prevalent in software
• What does “Shift Left then Down” mean
• Techniques to achieve “then Down”
• Tips on starting to use RCA
Root Cause Definition (1)
• “Bug” is overused and inaccurate
  – Error: a human action that produces an incorrect result
  – Defect: a flaw in the component or system that may cause a failure
  – Failure: deviation of the component from its expected result
• Failures occur in test or use
• Defects are found by reviews, inspections, and static analyzers
• Errors are the mistakes we make
Above definitions taken from the ASTQB Glossary of Terms
Root Cause Definition (2)
• We use these definitions in software development
  – The cause of the failure is identified by the developer or tester when debugging the failure in test or use
  – The root cause is the underlying error or mistake made by the developer, tester, business analyst, etc.
• By eliminating the root cause, we reduce the number of errors, hence fewer defects
Shift Left Then Down
• Shift Left means removing defects sooner
  – System Test before use
  – Unit Test before System Test
  – Reviews, Inspections, or pairing
• Shift Down means reducing the number of errors that lead to defects
  – Directly addresses the “Error” in the Error-Defect-Failure sequence
Setting the Stage (1)
• Defects in software development are a natural consequence of the development processes
• That defects are a natural consequence does NOT mean we have to accept them
  – Many organizations have learned how to prevent or detect defects to the point their products are nearly defect free and failure free
Setting the Stage (2)
• Professionals learn over time how to reduce their defects – some better than others
• Systematic approaches to defect prevention are provably better than individual or haphazard approaches
  – 10:1 or better reductions in defect injection rates
  – 100:1 or better on deliverable quality levels
• This webinar will cover the reasons for, and the techniques used in, RCA
Setting the Stage (3)
• Root cause analysis (RCA) for software really IS different
  – Industrial accidents
    • Single events
    • High Cost
    • Examples: Three Mile Island, TWA 800, chemical plants
  – Software failures
    • Multiple failures caused by many defects
      – 1-2 defects per KSLOC in million line products (1000-2000 defects per million lines of code)
      – In many cases lower cost per failure (but not always)
      – Significant impact on the RCA process
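The defect counts quoted on this slide follow directly from the density figure, and the arithmetic can be checked with a short sketch. The function name `expected_defects` is hypothetical; the 1-2 defects per KSLOC rate and the million-line product size are the figures from the slide.

```python
def expected_defects(lines_of_code: int, defects_per_ksloc: float) -> float:
    """Estimate defect count from product size and a defect density per KSLOC."""
    ksloc = lines_of_code / 1000  # 1 KSLOC = 1,000 source lines of code
    return ksloc * defects_per_ksloc


# A million-line product at 1 and at 2 defects per KSLOC (figures from the slide).
low = expected_defects(1_000_000, 1)
high = expected_defects(1_000_000, 2)
print(f"{low:.0f} to {high:.0f} defects")
```

At these volumes each failure cannot be treated as a single investigated event, which is the contrast with industrial accidents drawn above.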
Setting the Stage (4)
• Impact of large number of failures
  – Cannot (economically) do RCA on all failures
  – Suggests a systematic selection of failures to be analyzed
  – Looking for common causes (to eliminate multiple defects with one analysis and solution)
  – Velocity of change is important (time from RCA to impact in development)
• If it isn’t going to happen again, don’t analyze the cause!
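One possible way to make “systematic selection” concrete is a Pareto-style count of failures by suspected cause category, skipping failures that are not expected to recur. This is only a sketch: the slides do not prescribe a selection method, and the records, category names, and the `select_for_rca` function are invented for illustration.

```python
from collections import Counter

# Hypothetical failure records: (failure id, suspected cause category, may recur?)
failures = [
    ("F-101", "interface misunderstanding", True),
    ("F-102", "missing requirement", True),
    ("F-103", "interface misunderstanding", True),
    ("F-104", "one-off data migration", False),
    ("F-105", "interface misunderstanding", True),
    ("F-106", "missing requirement", True),
]


def select_for_rca(records, top_n=2):
    """Return the top_n most frequent cause categories among failures that may recur."""
    recurring = [category for _, category, may_recur in records if may_recur]
    return Counter(recurring).most_common(top_n)


selected = select_for_rca(failures)
for category, count in selected:
    print(f"{category}: {count} failures")
```

Analyzing the most frequent category first is what lets one analysis and one solution remove several defects at once.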
Setting the Stage (5)
• The goal of RCA can be simply stated:
  – By analyzing defects and failures, eliminate the introduction of faults into software work products by understanding and eliminating the human error that introduced the fault
  – A secondary goal is the earlier detection and correction of defects
RCA Process Overview
Defect and Problem Sources
• Defects from Production, Test, Reviews
• Issues raised in Retrospectives
• Other problems causing inefficiency in the organization
Barriers to Defect Prevention
• Need defect data
  – Production, test, and development problems
  – Defect description / fix information or problem description sufficient to allow root cause analysis
• Need open teams
  – Participation of person who made the error and peers
  – “What happens in Vegas stays in Vegas”
• Defect data is used to appraise personnel
  – Making mistakes happens; failing to learn from mistakes is another story
Prerequisites for Success of RCA
• Infrastructure to enable the following
  – Project management: provides a basis for planning the resources needed for successful RCA and completion of action plans
  – Engineering/Development: analysis of defect data found in production, test, and development to eliminate errors that cause defects
  – Engineering/Test: analysis of defect data from production and test to improve tests and detect problems before entering production
Are There Alternatives to RCA?
• How can defects be prevented without analysis and involvement of teams with domain knowledge? (i.e., by a third party)
• How much defect prevention will occur without a plan?
• Answer: Not really and not much!
What Will RCA Cost?
• Non-recurring costs include:
  – Training
  – Plans
  – Procedures
• Recurring costs include:
  – Causal analysis meetings
  – Analyzing the defect data and proposing actions
  – Revising checklists and processes
  – Providing feedback to the organization
  – Measuring the RCA activities
Where’s the Payback?
• Defect types are prevented
  – One analysis session may eliminate a class of defects
  – One cause analysis may lead to finding similar problems in production releases before failures occur
• Processes are revised based on data and analysis
• Returns from RCA
  – 13:1 in a report at a Software Engineering Institute conference
  – Mays-Jones reference, last slide (50% reduction in defects injected)
Defect “Maturity”
Defect Causes
• Understanding the causes of defects
  – Communication
  – Short Term Memory
  – Cognitive Dissonance
  – Complexity of Task
  – Processes
Communication Noise
3 People: 3 Paths
5 People: 10 Paths
10 People: 45 Paths
Is this likely to be lossless communication?
From: Software Inspections: Key Elements for Success STARWest 2006 by E. F. Weller, Software Technology Transition
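The path counts on this slide match the number of distinct pairs among n people, n(n-1)/2. The slide does not state the formula, so the short check below is an interpretation; the function name `communication_paths` is hypothetical.

```python
def communication_paths(people: int) -> int:
    """Number of distinct person-to-person paths among `people` participants."""
    return people * (people - 1) // 2


for n in (3, 5, 10):
    print(n, "people:", communication_paths(n), "paths")
```

The count grows roughly with the square of team size, which is why adding a few people adds many more opportunities for a message to be garbled.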
Short Term Memory
• 7 +/- 2 (or 5 +/- 2)
• Interruptions
  • Phone, Boss, Co-worker
• Multi-tasking
• Context switch and reload on each interrupt
Cognitive Dissonance
We tend to see what we “know”, not what is really there
[Figure: “What we think we see” compared with “What’s really there”]
D.R. Graham, “Test is a Four Letter Word” 1st International Software Testing Analysis & Review Conf
Complexity of Task
• The most persistent defects are related to inherent complexity of the product to be developed*
• Design complexity exceeds our level of understanding
• Code developed does not match complexity of the design (what does this say about code complexity measures?)
* Robert Glass, “Persistent Software Errors: 10 Years Later”, 1st International Software Testing Analysis & Review Conf
Processes
• Process, or lack thereof, may cause up to 85% of the defects in a product
• This means that personnel capability is NOT one of the major contributors to defective work products
• Unless these 5 defect causes are understood and believed by the organization, RCA will be limited to non-threatening analysis and preventive actions that cannot be realized
Using Defect Data
• Defect causes and root cause taxonomy indicate people are (usually) not the cause
• If defect data is seen to be tied to individual performance appraisals or a merit system, defect prevention (as well as the inspection process, and test data) will be compromised
• Do you have a policy governing the use of defect data?
Varieties of RCA
Many Varieties
• Most common are
  – “Light bulb”: informal, “It is obvious (maybe)”
  – 5-Why
  – Apollo
  – Ishikawa (Fishbone) analysis
• With any of these, the trick is knowing when to stop, or when you really are not finished
  – Stop too soon and you do not have an actionable root cause
  – Go too far and you get caught in endless discussion
    • Searching for “the perfect answer”
    • Causes that you do not/cannot control
What Is the Expectation?
• While earlier detection is a good thing, prevention is the goal
  – “System test should have found it”
  – “Requirements were incomplete”
  – These are not root causes
• But they may be out of your control and require passing to a different team
  – Know your limitations
  – Group these into batches as the root cause may be easier to identify with multiple defects
Why Look at Three Methods
• Understanding multiple techniques broadens knowledge
• Fishbone method may presume sets of causes and limit analysis
• 5-Why and Apollo (see Gano reference), while different, may enable a better understanding of how to think about root causes
  – Similar, so we will look at Apollo first
  – After Apollo/5-Why, Fishbone may be appropriate to group similar causes
Apollo Method
• Built on the principle that there may be multiple causes for a defect; the solution we choose ultimately identifies “the” cause
• Method avoids presupposing a cause, or converging on a cause too soon
• For example, a fire/explosion is caused by
  – Ignition source
  – Combustibles
  – Oxygen
• Which is the cause? Absent any one, you cannot have the fire/explosion (ref. TWA 800)
“Caused By”
• Key phrase used to connect elements in a cause-effect diagram; or, more properly stated, an effect-cause diagram
[Diagram: a Primary Effect linked by “Caused By” to four branches (two Condition Causes and two Action Causes), each supported by Evidence. A “Caused By” branch is read as “and”.]
One or more of the 4 branches could be the prevention solution
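As a rough illustration of the “and” reading, an effect-cause chart can be held as a small tree in which the effect occurs only while every branch is present. This is a sketch under assumptions, not the Apollo method’s own notation or tooling: the `Node` class, the `still_occurs` function, the action/condition labels, and the evidence strings are all hypothetical. The fire/explosion example is the one from the Apollo Method slide.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One box in an effect-cause chart (hypothetical structure)."""
    description: str
    kind: str = "effect"            # "effect", "action", or "condition"
    evidence: str = ""              # what supports this box
    caused_by: List["Node"] = field(default_factory=list)  # branches read as "and"


def still_occurs(effect: Node, removed: Set[str]) -> bool:
    """Branches are joined by "and": the effect needs every cause to remain."""
    return all(cause.description not in removed for cause in effect.caused_by)


# Action/condition labels and evidence text below are illustrative assumptions.
fire = Node(
    "Fire/explosion",
    caused_by=[
        Node("Ignition source", "action", evidence="placeholder evidence"),
        Node("Combustibles", "condition", evidence="placeholder evidence"),
        Node("Oxygen", "condition", evidence="placeholder evidence"),
    ],
)

print(still_occurs(fire, removed=set()))             # every branch present
print(still_occurs(fire, removed={"Combustibles"}))  # one branch eliminated
```

Because eliminating any single branch prevents the effect, the choice among branches can be made on cost and control rather than on which one is “the” cause.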
5-Why
• Less structured than Apollo method
  – Action, Condition, and Evidence are not usually part of the process
  – The number “5” can lead to going too far, or not far enough
  – 5 becomes the goal
• For more straightforward problems this may work, particularly if you find the additional rigor of the Apollo method is not returning value
Ishikawa Method (1)
• More “restrictive” if you start with anticipated categories in the cause-effect analysis
• Useful if RCA teams need more structure, and history indicates the set of causes is repeatable and limited
From: www.isixsigma.com
Ishikawa Method (2)
• Presupposing causes may limit the team
• Fishbone diagram can be unwieldy
• More useful as a sorting and grouping technique to refine outputs from Apollo or 5-Why
  – Takes effort
  – Limit to the significant few problems?
Exercise
One of two used in the full day workshop, as well as real problems from the attendees
It Was a Cold, Dark, and Stormy Night
• Identify the root cause(s) that would eliminate the primary effect (woodshed burning down) in the following scenario
The Story
It was a dark, cold and stormy night as Seth staggered home from the local tavern. As Seth entered his house, he noticed the power was down and the house was frigid. When he went to the ready wood-pile on his porch, it was empty, meaning a trip in the dark to the woodshed. Grabbing a flashlight, Seth noticed the batteries were dead, and there were no spares. He lit his lantern, and staggered to the woodshed to get firewood. Upon entering the woodshed, he tripped over a log left in front of the door, losing his balance and dropping the lantern. The lantern broke, spilling burning oil on the floor. The burning oil ignited the sawdust and woodchips on the floor, causing the woodshed to burn down. Seth fortunately escaped without injury.
Solutions Phase
Walk the Chart Backwards
• Start from the right, ask
  – Why is the Cause there?
  – What can we do to prevent the primary cause at the lowest cost?
[Diagram: a Primary Effect linked by “Caused By” to a Conditional Cause and an Action Cause, each supported by Evidence, with a “Start Here” marker. A “Caused By” branch is read as “and”.]
Solution Evaluation Criteria
• Prevent recurrence (or earlier detection if prevention is deemed unlikely)
• Under our control
• Meet goals and objectives
  – Does not cause unacceptable problems
  – Applies to other problem causes
  – Reasonable value for cost
Solutions to Avoid
• Punish, reprimand, fire
• Any output that says “investigate, review, analyze”
  – This means you do not have an actionable solution
• Warning signs or slogans
• A new process if the old process wasn’t followed
• Fault transference (someone else/another group should have caught it) as an out
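The “investigate, review, analyze” warning lends itself to a simple check on the action items that come out of an RCA session. The sketch below flags items whose leading verb is one of the three words quoted on the slide; the function name `is_actionable` and the sample action items are hypothetical, and a leading-verb test is only a crude stand-in for a moderator’s judgment.

```python
NON_ACTIONABLE_VERBS = {"investigate", "review", "analyze"}  # words quoted on the slide


def is_actionable(action_item: str) -> bool:
    """Return False for empty items or items that start with a deferring verb."""
    words = action_item.strip().lower().split()
    return bool(words) and words[0] not in NON_ACTIONABLE_VERBS


# Hypothetical action items from an RCA session.
items = [
    "Investigate why the interface spec was misread",
    "Add the boundary-value case to the unit test checklist",
]
for item in items:
    print("OK  " if is_actionable(item) else "REDO", item)
```

An item flagged “REDO” signals that the session stopped before reaching a cause the team can act on.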
Phrases to Avoid
• Too hard to find
• Not the way we do things
• We tried that before
• No budget
• Management won’t like it
• Good Idea, but…
• It will cost too much (depends on the return; facts welcome, keep opinions to yourself)
• That failed before (need to know why it failed)
Time Lag for RCA of Production Failures
[Figure: timeline of a failure in Production Release N and the resulting RCA action, set against Releases N+1 through N+4 as they pass through Req, Design, Code, Test, and Production. Effect labels on the later releases: “No effect”, “No effect”, “Minimal Effect”, “Most effect”.]
Root Cause Analysis Myths
Remaining slides will not be covered in the webinar – they are fully addressed in onsite workshops
Contact Ed Weller at [email protected] or [email protected] or https://rbcs-us.com/contact/
Graybeards
• Assumption that a tiger team or gurus can identify root causes
  – Unless they made the initial mistake, they do not understand the causality that only the developer or tester knows
  – They are not working in the environment that is responsible for the defect
• Successful RCA must be done by the person or team that introduced the mistake
• Time is critical as the mistakes are the result of an intellectual process; memory fades with time
RCA Is for High Maturity Only
• Previously addressed, but these enabling factors must exist
  – RCA can be applied to any well defined activity
    • Does not require full life cycle to be well defined
    • But undefined processes will lead to sub-optimal solutions, reaction to noise rather than real defects or problems
  – Sufficient planning discipline to ensure corrective or preventive action occurs
  – Sufficient discipline to search for multiple causes and select the optimum solution
    • Sometimes early detection is better (lower cost) than prevention
Starting and Maintaining a Root Cause Analysis Program
Set Your Goals
• Productivity
• Quality
• Critical Issues
Plan - Determine What You Can Do
• Match the defect prevention activities to the organization’s capability
  – Process based:
    • Lessons Learned/Retrospectives
    • Process improvements
    • Checklists, etc.
  – Problem based:
    • Critical problems (in spite of earlier comments)
      – Buy-in easier
      – The big win
    • Problem Classes
    • Systematic, data analysis based
What Is Your Budget?
• How much effort will be spent on corrective action?
• Results of Lessons Learned and other RCA activities must be scaled to the resources available for corrective action
Do
• Consider the role of “Champion”
• Conduct the lessons learned and causal analysis meetings
  – Moderator training, education, and selection
  – Attendees (management generally excluded*)
• Track the action items to closure
*Unless the problem analysis is for a management process
Warning Signs (1)
• Participants say “Why are we meeting, we never do anything with the meeting results?”
• Lessons Learned or Causal Analysis meetings held without a budget reserve for corrective or preventive action
• Signals the RCA team isn’t digging deep enough
  – “That was too hard to prevent!”
  – “The system is too complex to understand”
  – Root Causes are not actionable
Warning Signs (2)
• The same problems show up in RCAs after the preventive actions should have become effective
  – “The first lesson of Lessons Learned is that we did not learn from the last Lessons Learned”
  – This also applies to RCA
• Problems corrected in one part of the organization occur elsewhere
• Repeating the same mistakes will introduce cynicism and a “Why am I here?” attitude
Act
• If the metrics indicate success, expand the change to the organization
• If the metrics indicate failure, probe for the cause of the failure
  – Bad initial analysis
  – Incorrect implementation
  – Other changes (product or process)
  – But do not give up!
• RCA works - it might not be apparent in one cycle
References
• Card, D., “Understanding Causal Systems” http://stsc.hill.af.mil/crosstalk/2004/10/0410Card.html
• Chillarege, Bhandare, Halliday, A Case Study of Software Process Improvement During Development, IEEE Transactions on SW Engineering, Vol 19, No 2, Dec 1993
• Gano, Dean L., Apollo Root Cause Analysis, Apollonian Publications, 2003
• Humphrey, Managing the Software Process, Addison-Wesley, 1989
• Jones, A process-integrated approach to defect prevention, IBM Systems Journal, Vol 24, No 2, 1985
• Mays, Jones, Holloway, and Studinski, Experiences with Defect Prevention, IBM Systems Journal, Vol 29, No 1, 1990
  – http://domino.research.ibm.com/tchjr/journalindex.nsf/Home?OpenForm and search for “Defect Prevention”