root cause failure


Page 1: Root Cause Failure

February 2008

2008 IPEIA Conference, Banff, Canada 1

An Overview of a Root Cause Failure Analysis (RCFA) Process

Roger Zavagnin
EnCana Corporation

• This presentation on the fundamentals of a Root Cause Failure Analysis (RCFA) process briefly describes
  • What RCFA is, and why it is done,
  • Types of root causes,
  • 7 generic steps in an RCFA investigation,
  • Challenges in setting up an RCFA process,
  • Setting up and sustaining an RCFA process.

Page 2: Root Cause Failure


What and Why RCFA?

A class of problem-solving methods to eliminate recurrence of failure, or manage the consequences of failure

Reactive method

Failure classifications
  Chronic
  Sporadic

• Root Cause Failure Analysis is a class of problem-solving methods that use a step-by-step approach to discover the basic causes of failure.
• Many commercial solutions exist, with associated training, consulting, and software costs, but all of them share the fundamentals covered in this presentation.
• RCFA is implemented for two fundamental purposes:
  • to eliminate the recurrence of a failure, or
  • to manage the consequences of a failure should it occur again.
• In many cases, it is not possible to completely eliminate the probability of a failure, which is why we consider failure management policies that reduce the consequences to a tolerable risk level.
• Note that RCFA requires a failure to occur before it can be investigated and analyzed. Thus, it is a reactive means of developing failure management policies. If the consequences of failure are intolerable, then a proactive method is required.
• Failures are typically classified as sporadic or chronic.
  • Sporadic failures are often one-time events that gain significant attention because they usually involve significant, unexpected, and severe consequences.
  • Chronic failures, unfortunately, are those that are accepted but may produce significant cumulative losses over a long period. Most failure events that occur more than once should be considered chronic.
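The rule of thumb above (anything that occurs more than once is chronic) can be sketched in a few lines of Python. This is only an illustration: the `FailureEvent` record, the equipment tags, and the loss figures are hypothetical and do not come from the presentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    equipment: str  # hypothetical equipment tag
    mode: str       # hypothetical short description of how it failed
    loss: float     # estimated loss for this one event, in any consistent unit


def classify(events):
    """Label each (equipment, mode) pair as 'chronic' or 'sporadic'.

    Applies the rule of thumb from the notes: a failure that has
    occurred more than once is treated as chronic.
    """
    counts = Counter((e.equipment, e.mode) for e in events)
    return {key: ("chronic" if n > 1 else "sporadic") for key, n in counts.items()}


def cumulative_loss(events):
    """Sum losses per (equipment, mode) pair to expose slow, accepted losses."""
    totals = {}
    for e in events:
        key = (e.equipment, e.mode)
        totals[key] = totals.get(key, 0.0) + e.loss
    return totals


# Illustrative history only; none of these figures are real data.
events = [
    FailureEvent("P-101", "seal leak", 2_000.0),
    FailureEvent("P-101", "seal leak", 2_500.0),
    FailureEvent("P-101", "seal leak", 1_500.0),
    FailureEvent("V-200", "elbow rupture", 5_000.0),
]
labels = classify(events)
totals = cumulative_loss(events)
```

In this made-up history the repeated seal leak is labelled chronic and its summed loss exceeds that of the single, more visible rupture, which is the point the notes make about cumulative losses.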

Page 3: Root Cause Failure


What are “Root Causes”?

Physical
  Physical component or material that failed

Human
  Human actions or decisions leading to failure event

Latent
  Reasons why actions or decisions were made

• What are Root Causes? The root cause is the most basic reason for the problem occurring that, if eliminated, will prevent the failure event from recurring. We define three types of root causes: physical, human, and latent. Often, a failure results from one or more root causes.
• Physical root causes are the physical components that caused the failure event. These are almost always present and are typically the overall physical reason the event occurred. Traditional “Failure Analysis” is key to determining the physical root cause. Unfortunately, if we rely only upon it, we will stop too early and implement a physical redesign because all we know is what physically failed. This is a common error in performing RCFA.
• Human root causes are the last human actions that led to the failure event. These are usually, but not always, present. Too often, organizations seek out the individual who took the wrong action and stop there. This is counterproductive and will make people unsupportive of RCFA because it becomes a “witch-hunt.”
• Latent root causes are the reasons why decisions were made that resulted in the error. There will usually be more than one latent root, and typically, if these did not exist, the human root would likely have been avoided. Examples are organizational systems and processes that led the person to think a certain way and take the improper action. Eliminating latent root causes will eliminate the failure event, and should be the focus of the investigation.
• Generally, analysis teams are hesitant to address the latent root causes because these are weaknesses in the existing organization. It is important that teams are supported in this approach, and that the method is fully understood by all individuals.

Page 4: Root Cause Failure


7 Steps of RCFA?

Scoping

Preserving evidence and collecting data

ORganizing

Analyzing

Documenting

Implementing

Confirming

• These seven steps describe the method used in almost any RCFA process. Each is described in following slides.
• Note the steps can be remembered by the acronym “SPORADIC” – a classification of failure discussed earlier.
• Not all of these steps are performed by the same individuals. It is important that RCFA is viewed as a problem-solving method that spans the organization and beyond!

Page 5: Root Cause Failure


Scoping

Consequences and risk

Who needs to be involved?
  Internal – non-technical functions?
  External – jurisdictional bodies?

Formal vs. informal methods

Local vs. external “principal analysts”

• Our method starts with a scoping of the failure. Scoping first evaluates the consequences of the failure and its risk. Evaluating the risk identifies the consequences of what could happen if the failure recurs, and the associated frequency or probability of recurrence. Doing so allows us to understand the reasonable, worst-case consequences and either eliminate or manage them.
• Scoping also considers the nature of the failure event and directs the efforts to include the internal and external functions as required. Internal functions, such as Environment / Health / Safety departments, Insurance, Communications, etc., or external functions such as jurisdictional bodies, may be required to participate in the investigation. In some cases, clearance to proceed with an investigation is necessary, and good policies defining scoping ensure appropriate steps are taken.
• The last purpose of scoping is to determine the level of formality. Small, relatively simple failure events often follow a straightforward investigative method using local facilitators or analysts.
• It is not uncommon for complex failures, or those involving multiple internal and external parties, to use an external and experienced facilitator to provide an unbiased approach. In some cases, these principal analysts come from an external source such as a jurisdictional body.

Page 6: Root Cause Failure


Preserving Evidence and Collecting Data

Most important step!

Basic skills using best practices

Typical tasks
  Coordinating activities among all parties
  Interviewing and taking notes
  Photographing
  Handling parts
  Collecting logs, databooks, alarm data, etc.

• Preserving evidence and collecting data is the most important step in RCFA. Too often we work in environments focused on repairing and returning equipment to service as soon as possible. Within minutes, key evidence is lost or altered.
• Without effective evidence preservation and data collection, an RCFA becomes lengthy and drawn out, may identify the wrong root causes, wastes resources, and allows failures to recur.
• Developing basic skills using best practices is viewed as a suitable responsibility for nearly all field operations, mechanical staff, and contractors. These individuals typically
  • have the most experience with the equipment,
  • are present when the failure occurs, or are first-responders, and
  • participate in or coordinate the repairs and/or clean-up.
• Common tasks during this step include
  • Coordinating activities to preserve evidence and collect data, both at the site and off-site (repair shops, labs, etc.),
  • Interviewing parties, taking notes, and collecting witness statements,
  • Photographing the overall site, the unaltered scene, damage, and all stages of disassembly,
  • Handling parts, including disassembly methods, cutting or torching in ways that avoid altering the evidence, and preservation / packaging,
  • Collecting logs, databooks, alarm data, drawings, manuals, etc.
• Because so many stages are involved with the disassembly, repair, and lab analysis, it is reasonable to expect this step to span weeks.

Page 7: Root Cause Failure


ORganizing the Analysis

Analysis team
  Facilitator (Principal Analyst)
  Participants

Reviewer(s)

Implementers

• Organizing the analysis team usually occurs during evidence preservation but is sometimes completed afterwards, when the failure is better understood.
• Typically, the analysis team is led by a Facilitator or “Principal Analyst” responsible for managing the team through the analysis stage and documenting the findings / recommendations. It is usually undesirable for the Facilitator to be a technical expert, since this may introduce bias. Ideally, the Facilitator thoroughly understands the RCFA method, plans the project, and manages the dynamics among the participants during the sessions.
• Participants are individuals with expertise in the equipment (its manufacture, fabrication, application, operation, servicing, and maintenance). The RCFA flows better and much more efficiently when Participants have basic training in the RCFA process. Participants are not expected to be Facilitators.
• Reviewers often are not on the analysis team, but later validate the technical conclusions leading to the root causes and the technical feasibility of the recommendations.
• Often the implementation of tasks is the responsibility of individuals other than the analysis team or reviewers. Communication is essential!

Page 8: Root Cause Failure


Analyzing – Sequence of Events

Good for simple, linear events with few root causes

Evidence-based

Identify means to break the chain of events

[Figure: sequence-of-events diagram for a gas leak at an eroded elbow. Box labels:]
• 20% LEL alarm trips (Sept 3/06, 09:45)
• Eroded wall at elbow found
• High gas concentration present
• Gas leak at elbow
• Site ESDs close (Sept 3/06, 09:45)
• Elbow erodes (between June 1/06 through Sept 3/06)
• Entrained sand entering separator (during regular operation)
• 4” sand found in separator
• Sand building up in separator (since cleaning on Apr 3/04)
• Level controller fouled by sand (around June 1/06)
• Level controller jammed by sand
• Liquid levels rising since April 30/06
• Sand carried in gas / liquid stream from separator (after June 1/06)
• Site logs do not have regular entries and inconsistent levels recorded

• The next step is the analysis. One method is the Sequence of Events method.
• Sequence of events analysis is very useful for
  • straightforward problems that have a known sequence of events leading to the failure event,
  • complex problems where combinations of root causes exist and the approach is to determine which cause(s) must be eliminated to break the chain,
  • establishing timelines and identifying which events require some other analysis tool such as a logic tree.
• It requires an understanding of what is controllable, and the resulting outcome of the control, action, or response.
• Approach:
  1. Map the sequence of events that lead to the failure.
  2. For each event, determine if it is controllable, and if so, what alternatives exist to change what happens.
  3. Compare the alternatives and identify which can be implemented to break the chain of events.
  4. Create recommendations for physical, human, and latent roots contributing to the sequence of events.
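The four-step approach can be sketched as a small data structure and a ranking function. This is a minimal illustration, not the presenter's tool: the `Event` class, the simplified chain (loosely based on the sand and elbow example), the alternatives, and the effort scores are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    description: str
    controllable: bool
    # Each alternative is (description, relative effort); lower effort is cheaper.
    # Both the alternatives and the effort scale are hypothetical.
    alternatives: list = field(default_factory=list)


def chain_breakers(chain):
    """Steps 2 and 3: collect alternatives for controllable events, cheapest first."""
    candidates = []
    for event in chain:
        if not event.controllable:
            continue  # nothing we can change at this link of the chain
        for alt_description, effort in event.alternatives:
            candidates.append((event.description, alt_description, effort))
    return sorted(candidates, key=lambda c: c[2])


# Step 1: map the sequence of events (a simplified, hypothetical chain).
chain = [
    Event("Sand enters separator with production", controllable=False),
    Event("Sand builds up in separator", True,
          [("Schedule periodic clean-outs", 2)]),
    Event("Level controller jams", True,
          [("Add level checks to operator rounds", 1),
           ("Replace controller type", 5)]),
    Event("Elbow erodes and leaks", True,
          [("Upgrade elbow material", 8)]),
]

ranked = chain_breakers(chain)
best = ranked[0]  # the cheapest way to break the chain, under these assumed scores
```

Step 4, turning the selected alternatives into recommendations for physical, human, and latent roots, remains a judgment made by the analysis team; the sketch only orders the candidates.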

Page 9: Root Cause Failure


Analyzing – Logic Tree

Good for complex / ambiguous problems with many root causes

Hypothesis development
Simple questions
Evidence-based confirmation
Accommodates “confidence”

• Another analysis method uses a “Logic Tree.” It is very well-suited for complex or ambiguous problems with many root causes.
• The analysis is managed by constructing a logic tree using structured and simple questions. These questions are used to
  • first define a failure event at the top of the tree,
  • identify possible hypotheses for the preceding cause of the failure,
  • test the hypotheses using evidence and data collected earlier.
• This cycle of hypothesis and verification continues until the trail can be traced back to a latent root for which a suitable failure management policy can be defined. Each level of hypotheses must clearly flow from its predecessor. If it is clear that a step is missing between causes, it is added in and evidence is sought to support its presence.
• Once the logic tree is completed and checked for logical flow, the team determines recommendations to prevent the sequence of causes and effects from recurring.
• This method also accommodates a confidence rating based on the accuracy or quality of collected evidence.
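A logic tree of this kind can be sketched as a recursive structure in which each hypothesis carries a confidence rating and only well-supported branches are followed down to a latent root. The `Node` class, the 0 to 1 confidence scale, the 0.7 threshold, and the pump example are all assumptions for illustration; the presentation does not prescribe a scale or a data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    statement: str
    root_type: Optional[str] = None  # "physical", "human", "latent", or None
    confidence: float = 0.0          # assumed 0..1 rating of evidence quality
    children: list = field(default_factory=list)


def verified_latent_roots(node, threshold=0.7, path=()):
    """Follow only hypotheses at or above the threshold.

    Returns one tuple of statements per path that ends in a latent root.
    The starting node (the failure event) is accepted as fact.
    """
    path = path + (node.statement,)
    if node.root_type == "latent":
        return [path]
    found = []
    for child in node.children:
        if child.confidence >= threshold:
            found.extend(verified_latent_roots(child, threshold, path))
    return found


# Hypothetical tree: one branch is supported by evidence, one is not.
event = Node("Pump does not deliver required flow", children=[
    Node("Bearing seized", "physical", 0.9, [
        Node("Wrong lubricant applied", "human", 0.8, [
            Node("Lubrication procedure does not list grades", "latent", 0.9),
        ]),
    ]),
    Node("Impeller eroded", "physical", 0.2, [
        Node("Inspection interval too long", "latent", 0.9),
    ]),
])

roots = verified_latent_roots(event)
```

Here the poorly supported impeller hypothesis is dropped along with everything beneath it, so only the lubrication path is returned.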

Page 10: Root Cause Failure


Analyzing – Logic Tree

[Figure: logic tree. Levels: Event; Modes 1 to 4; Hypotheses 1 to 6, with roots marked P, H, and L. Side labels: Factual Causes; Hypothesis and Verification. Guiding questions shown beside the tree:]
• What is the abnormal state of failure?
• How has this event occurred in the past?
• What evidence do we have at hand describing what caused the failure?
• How could the preceding event have occurred?
• What was the action or decision that allowed this physical root to occur?
• How do business practices and systems contribute to this thinking?

• This slide demonstrates the simple questions used throughout the logic tree. We will not discuss the questions in detail considering the time constraints around this presentation.
• In short, the failure event is a straightforward description of the loss of function, not of the failure itself.
• It is followed by asking “How has this event occurred in the past?” (for chronic failures), or “What evidence do we have at hand describing what caused the failure?” (for sporadic failures). In both cases, only the facts are listed without any guesses on the causes.
• Hypotheses for physical roots follow using the question “How could the preceding event have occurred?” so an educated guess can be made. Evidence is used to prove or disprove the hypothesis. If a hypothesis is disproved, or has a low confidence associated with it, then it is no longer pursued. Only the developing roots that are proven with high confidence are pursued. This prevents wasting resources chasing “red herrings.”
• At some point, the question for physical roots no longer makes sense. Usually this is when we transition into discovering human root causes, and the question becomes “What was the action or decision that allowed this physical root to occur?” As stated earlier, the analysis does not stop here. This question only allows us to understand the human root cause.
• Once the human root cause is identified, it becomes apparent that the more suitable question is “How do business practices and systems contribute to this thinking?” Both internal and external business practices and systems are within the scope of this question. Simply put, include your manufacturers, suppliers, vendors, engineers, packagers, distributors, shippers, constructors, commissioners, operators, and maintainers.
• Typical latent root causes include training, skills verification, operating procedures, standards of workmanship, time pressures, methods, drawing updates, communications, role and responsibility definition, work scope definition, work conditions, management of change, holdpoints, inspections, and procedures.

Page 11: Root Cause Failure


Documenting, Implementing, Confirming

3 stages of communication

Selecting recommendations
  Effort vs. likelihood to prevent recurrence
  Not all causes need a corrective action
  High payback is not uncommon, but surprising!

Long time periods to confirm results
  Investigate similar failures
    New causes?
    Originated before the RCFA?

• Communicating the analysis involves three stages.
  • 1st: a summary of the failure event, the root causes, and the associated recommendations coming out of the analysis;
  • 2nd: which recommendations were selected during the evaluation, how they will be implemented, when, and by whom;
  • 3rd: whether or not the implemented recommendations were successful.
• It is important to understand that not all causes need a corrective action applied to them to prevent recurrence or to adequately manage the consequences of failure. For example, a Sequence of Events analysis requires only that the sequence be broken, and often only a few high-impact recommendations require implementation. A Logic Tree analysis could identify a number of root causes, but only a few may have technically feasible recommendations, or a few may have such a high impact that the remaining risk of recurrence is tolerable.
• During the selection of recommendations, it is not uncommon to find payback in the range of 30:1 or higher! Because latent roots deal with organizational systems, policies, and procedures, the effort to change and manage those is significantly less than for complex physical redesigns.
• Lastly, note that it may take months, years, or decades to confirm whether the implementation was successful. Too often, organizations pursuing an RCFA program expect immediate results within a financial quarter or two. Many failure mechanisms commence thousands of hours before the failure is recognized. It is foolish to immediately conclude the implementation was unsuccessful without understanding the root cause of the failure!
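One way to make the selection step concrete is to compute a payback ratio for each recommendation and pick the highest ratios that fit the available effort. The sketch below is an assumption-laden illustration: the `Recommendation` fields, the greedy selection rule, the budget, and every figure are hypothetical, and the slide itself weighs effort against likelihood of preventing recurrence rather than prescribing a formula.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    name: str
    cost: float          # hypothetical one-time effort to implement
    avoided_loss: float  # hypothetical losses avoided over the evaluation period

    @property
    def payback(self):
        """Payback ratio, e.g. 30.0 means 30:1."""
        return self.avoided_loss / self.cost


def select(recommendations, budget):
    """Greedy pick by payback ratio until the budget is used (an assumed rule)."""
    chosen = []
    remaining = budget
    for rec in sorted(recommendations, key=lambda r: r.payback, reverse=True):
        if rec.cost <= remaining:
            chosen.append(rec)
            remaining -= rec.cost
    return chosen


# Illustrative figures only.
recommendations = [
    Recommendation("Update lubrication procedure", 1_000.0, 30_000.0),
    Recommendation("Redesign bearing housing", 40_000.0, 60_000.0),
    Recommendation("Add training module", 4_000.0, 20_000.0),
]
chosen = select(recommendations, budget=10_000.0)
names = [rec.name for rec in chosen]
```

With these made-up numbers the procedural and training changes are selected and the physical redesign is left out, which mirrors the observation that changes addressing latent roots often cost far less than redesigns.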

Page 12: Root Cause Failure


Challenges

Attitude – The failure is preventable or manageable!

Learning – History and experience have value

Capacity – Busy repairing vs. busy eliminating

Capability – Follow a simple, well-defined process with technical support

Expectations – Results may take a long time

Change – Already doing it to a degree

• Now we consider some challenges in setting up an RCFA process.
• First, the culture or status quo likely accepts failures because “stuff happens.” Starting on small, chronic failures allows for quick demonstration that failures are preventable or manageable and provides quick, real results.
• Engaging experienced employees through the process fosters a culture of learning from their experience, and gains “buy-in” by recognizing the importance of their experience.
• Often, being “too busy repairing” is a challenge. The question to be asked of employees, and demonstrated by their supervisors, is whether they intend to remain busy repairing, or get busy eliminating. Eliminating chronic failures tends to hit a critical mass where reduced repair time easily accommodates additional RCFA activities.
• Starting with a straightforward, simple RCFA process that everyone can comprehend, and in which they can identify their responsibilities, is key. Ensure experienced RCFA individuals are available to train, coach, and do analyses.
• Setting the right expectations is important. As stated earlier, it may take years to confirm the prevention of sporadic failures, or perhaps months for chronic failures. It is important that sponsors understand this duration. After implementation it is necessary to ensure the organization does not slip back to its bad practices.
• Lastly, fear of change is common. Most technical people already do ad-hoc RCFA, although not to the level of identifying latent roots (typically just physical roots, leading to expensive redesigns). Building upon existing thought processes is a good start to fine-tuning their skills for this more thorough analysis method.

Page 13: Root Cause Failure


Setting-Up & Sustaining RCFA

Dedicated Trainer / Coach resource

Training based on roles & responsibilities
  Preserving Evidence & Data Collection
  Participant / Facilitators / Reviewers

Leadership & Active Sponsorship
  Assign resources / select / implement / track

Starting right
  Chronic vs. Sporadic problems
  Sensible and achievable method
  Learn the method before a software tool
  Focused efforts in a “friendly sandbox”

• Here are some considerations to set up and sustain an RCFA process. First, dedicate at least one individual as a trainer, coach, and “doer” until a larger network of facilitators is established.
• Establish competencies among your field staff by setting up training for Preserving Evidence & Data Collection. Within EnCana, we have an online e-learning module supported with a Quick Reference Card to make training accessible to nearly everyone.
• Train your principal analysts (facilitators), reviewers, and participants (generally your technical specialists) in the RCFA method. Preparing them for the analysis ensures their fluency in RCFA terminology and working with a common methodology.
• Ensure you have leadership and active sponsorship for the RCFAs. Success during the first few analyses is essential to demonstrate the efficiency, simplicity, and effectiveness of the method.
• Focus on chronic problems since these have a faster confirmation of results.
• Pick a sensible and achievable method of doing your RCFA work. Many commercialized methods exist, but you must consider scalability costs and suitability of training for all roles and responsibilities.
• Learn the method before attempting to use software as a tool. Developing the thought processes is more important!
• If possible, start in a “friendly sandbox” surrounded with sponsors and peers who understand there will be glitches, but will accept these as the process is tuned.

Page 14: Root Cause Failure


Acknowledgements & Further Reading

Robert Latino, Kenneth Latino. “Root Cause Analysis: Improving Performance for Bottom-Line Results, 2nd ed.” ISBN 0-8493-1318-X

R. Keith Mobley. “Root Cause Failure Analysis.” ISBN 0-7506-7158-0

• The two books above are recommended if you are seeking additional information on RCFA. The first book (Latino) presents the logic tree analysis in the PROACT methodology. The second book (Mobley) presents the sequence of events analysis tool.

• In conclusion, I encourage you to implement an RCFA process if you have not done so yet. Strive to discover the latent root causes within your organization and externally. By doing so, you will avoid pursuing many expensive physical redesigns and realize significant reductions in your environmental / safety incidents and production costs.