
8/12/2019 Extreme Failutre Analysis

http://slidepdf.com/reader/full/extreme-failutre-analysis 1/12

Extreme failure analysis: never again a repeat failure 

Apply root-cause failure analysis to recurring reliability problems 

K. Bloch, Flint Hills Resources, L.P., Rosemount, Minnesota 

The ultimate purpose of this article is to significantly reduce the risk for catastrophic equipment failures. Readers may believe that having been trained in root-cause failure analysis (RCFA) is enough. Why, then, is some equipment allowed to repeatedly fail? Are low-consequence repeat failures discretionary maintenance opportunities, or precursors to more serious reliability and safety problems? What really constitutes effective RCFA? Let's consider real-life experiences to answer these questions.

For equipment failure analysis to be effective, our beliefs (and even the most reasonable of assumptions) must align with the facts. Unfortunately, an extreme failure (an explosion, fire, wreck or crash) often complicates matters by compromising much of the information that we would normally use to determine an accident's cause. The issue with an extreme failure is that although limited physical evidence remains, its consequences are devastating. Indeed, the consequences are so severe that it is unthinkable to take action without being certain that the problem will be solved.

Determining causes with scant physical evidence. Without physical evidence it can be very difficult to look at an effect and determine its cause. In contrast, predicting the effect of an observed cause is a relatively simple task. For example, consider the simple mental experiment¹ shown in Fig. 1. First, predict the outcome of a melting ice cube on hot concrete. Then look at the photo under it and explain how the water stain got there. Note that you would be mistaken to believe that an ice cube left behind this stain. In situations where conclusive physical evidence has been compromised, it is sometimes easier to pass failures off as acts of sabotage or conspiracy. Worse yet, events leaving behind no physical evidence are often dismissed as an "act of God," and the case is closed.

Fig. 1 Melting ice cubes leave a stain on concrete, but what left the other stain behind?

In reality, the evidence you need to solve the problem is most likely available but hidden from plain sight. Therefore, identifying a probable cause involves knowing where to find this evidence. Admittedly, resolving who or what left the water behind in Fig. 1 is hardly a matter of great consequence, but in extreme failures the implications are infinitely higher. Moreover, since there is usually low confidence in the physical evidence left behind by extreme failures, we must turn our attention to their latent, or hidden, causes.

Latent cause identification. Hidden but powerful forces within our organizations allow incremental mistakes to negatively impact safety and reliability. We must identify these latent causes to develop an action plan toward assured failure prevention. Latent cause identification is simplified somewhat by recognizing that a specific sequence of events is shared between many different extreme failures. The "extreme failure life cycle" shown in Fig. 2 represents the relationship between a failure, a repeat failure and an extreme failure. Underlying maintenance and design defects can usually be detected as the probable cause of many controversial failures when this pattern is kept in mind.

Fig. 2 Extreme failure life cycle showing the process a failure goes through to become an extreme failure. Notice the repeat failure's position.

Fact-based conclusions ultimately add more value than unproductive conspiracy and sabotage theory debates. Assigning blame instead of confronting the latent cause is a certain prescription for repeating the same problem. The extreme-failure life cycle indicates that when repeat reliability events are disregarded they eventually become the catalyst for progressively more serious and potentially highly dangerous equipment failures.

Repeat failures tell an important story. The role that a "repeat failure" plays in the life cycle of an extreme failure is of great interest. In a "hindsight is 20/20" world, we often wish we had acted differently after suffering the painful consequences of a decision under our control. Since repeat failures are the likely intermediate step leading up to an extreme failure, they are also reliable warning signals that precede many catastrophic equipment failures. Taking control over repeat failures to consciously prevent a catastrophic accident reinforces the precept that we are in charge of equipment reliability and not victims of the equipment's "unpredictable" behavior.

A repeat failure is simply defined as a recurring equipment difficulty that prevents the equipment from achieving its anticipated life expectancy. Repeat failures exist because we have perhaps concluded that a particular failure mechanism is more economical to manage than to correct. If allowed to persist, a repeat failure will eventually be perceived as a discretionary, low-risk nuisance with no potential safety or environmental consequence. This defective risk assessment approach is also known as "normalization of deviance" and must be resisted.²

Repeat failures build a reactive work order history in our maintenance management systems. More often than not, the entries abound with useless information such as "bearing replaced" when the entry "bearing failed due to oil starvation resulting from use of pressure-unbalanced constant-level lubricator" would have added real value. Regardless, repeat failure work orders tend to get buried under higher-priority items that represent a more immediate production constraint. Repeat failures are often addressed only as time allows and without asking why the failure occurred. Knowing why the failure occurred may require a failure analysis, and performing a failure analysis on something viewed as a low-consequence risk takes time away from addressing immediate production constraints that show up on the daily maintenance plan. In truth, this highly reactive "reliability strategy" is the trademark of a repair-focused organization. While they might claim to be reliability-focused, such organizations exhibit few, if any, of the requisite traits or do so in name only.

Extreme failures. While we are obviously not condoning repeat failures, extreme failures are much more offensive. Extreme failures are "extreme" in every sense of the word and are differentiated as:

•  Being of, or having the potential for, the most extreme consequences

•  Leaving behind extremely little physical evidence to readily expose a probable cause

•  Statistically, extremely improbable.

Also, because precursor repeat failures leave their tracks in the maintenance management system, extreme failures, in retrospect, always appear to be very predictable. Therefore, the maintenance management system contains not only evidence critical for investigating an extreme failure, but also reproof for not taking preventive action. The following examples illustrate the relationship between repeat and extreme failures.

The Hindenburg disaster: an extreme failure. The Hindenburg disaster is one of the most identifiable extreme failures in the history of modern machines. The circumstances behind this failure still stir considerable controversy and debate, led by various conspiracy and sabotage assertions that accompany most extreme failures. The purpose of examining it here is to demonstrate how the pattern shown in Fig. 2 applies to all extreme failures no matter where they occur. Only by associating the extreme failure with its adjunct repeat failure can we determine a fact-based credible scenario that moves us away from accepting theories fueled by speculation.

The Hindenburg airship was built with a lightweight metal airframe held rigid by a network of 0.125-in.-diameter steel bracing wires under tension. Its outer covering consisted of cotton linen painted with a metallic cellulose acetate butyrate "dope" to repel water and reflect sunlight. Sixteen inflatable bags were filled with 7 MMscf of hydrogen to lift the airship, since the preferred medium (helium) was not available.

Like every machine, the Hindenburg had an operating envelope and violating its limits would greatly increase the mechanical failure risk. Operating procedures were used to mitigate these failure risks, and the Zeppelin Company's enviable safety record was evidence of an effective training program. Top among these procedures were strict rules governing landing maneuvers to avoid exceeding the bracing wires' 1,000-lb tensile force limit in the tail-to-fuselage section, which absorbs the energy produced while turning the massive airship. Regardless, the Hindenburg's maintenance records contain a history of bracing wire failures in the tail-to-fuselage section.³

The Hindenburg's otherwise perfect transatlantic flight was spoiled by unexpected headwinds that put it 12 hours behind schedule upon its arrival in Lakehurst, New Jersey. Eager to land the ship without further delay, the captain ordered a risky sharp left turn after the wind suddenly changed direction to quickly reorient the airship's nose back into the wind. This violated landing procedures that required aborting the landing attempt if the wind shifted direction. Following procedures was needed to safely point the airship's nose back into the wind without exceeding the bracing wires' stress limit.

After making the sharp left turn, the captain noticed the Hindenburg suddenly becoming tail-heavy. Since procedures also required landing the airship horizontally to avoid damaging the tail fin, the captain released the remaining ton of water from the ship's rear ballast tanks (Fig. 3). Several minutes later, the captain ordered six crewmen to the front of the airship to counterbalance the tail section's continued downward slope. Next, he dropped the anchor ropes from the airship's nose.

Fig. 3 Rear ballast tanks are emptied to avoid hitting the ground after the Hindenburg unexpectedly becomes tail-heavy during landing maneuvers.

On the ground, everything appeared normal. The ground crew grabbed the anchor ropes and began walking the airship to the mooring mast. Before they were able to fasten the ropes to the mast, however, a fire broke out in front of the top tail fin, where evidence of a hydrogen leak (tail-heaviness) existed after the captain deviated from procedures by executing a sharp left turn after the wind changed direction. The entire airship burned from the tail forward, destroying all physical evidence within 32 seconds. Thirty-five of the 97 people on board were killed along with one ground crew member.

In hindsight, knowing that a repeat failure is somehow involved makes it easy to understand that a bracing wire probably broke upon exceeding its stress limit, just as expected. While this failure had occurred previously, this time the unstable wire penetrated a hydrogen bag and the airship's outer skin, which set off a sequence of events that resulted in one of history's most famous disasters. The repeat failure became extreme by an unlikely combination of contributing factors:

•  A very tight schedule, made even tighter by strong headwinds during the flight

•  Procedure deviation

•  Hydrogen containment was lost

•  The failure occurred during a critical phase of the landing procedure

•  Light rain was falling, which made the anchor ropes capable of conducting an electrical charge after becoming adequately moistened.

Some may wonder why the Zeppelin Company did not address the Hindenburg's design risk with something more reliable than an administrative control procedure, like stress-resistant materials in the vulnerable tail-to-fin section. But it is important to consider how the Zeppelin Company's perfect safety record influenced its risk tolerance for bracing wire failures. In hindsight, their maintenance records show that this repeat failure represented a discretionary maintenance nuisance that could be managed with little inconvenience. Living with the failure mechanism was, therefore, a more economical alternative. Would the choice to sacrifice a wire in the interest of preserving the airship's remaining turnaround time have been considered acceptable if the procedure deviation had not ended in an extreme failure? While the Zeppelin Company's safety record was indicative of a reliability-focused organization it was, in fact, guilty of making decisions associated with a repair-focused organization.

Inherently safe technology advocates will argue that the use of hydrogen instead of helium is what caused the accident, while minimizing the impact of maintenance practices that led to a loss of containment scenario. Whether or not helium was available to Germany in the mid-1930s is not the issue here. In modern times we must operate responsibly because it is not practical to make similar substitutions. To illustrate, let's turn our attention to industries where OSHA's Process Safety Management (PSM) Standard (29 CFR 1910.119) applies. The standard's purpose is to achieve safe and continuous containment of hazardous substances inherent to the manufacturing process.

Spent caustic tank explosion. Refineries use caustic (sodium hydroxide) to purify liquefied petroleum gas (LPG). As the caustic reacts with LPG contaminants, its concentration decreases. In other words, it becomes "spent."

To maintain the minimum caustic concentration needed to continue the reaction, spent caustic must be periodically removed and replaced with an equal volume of fresh caustic. In one refinery, the spent caustic is batched into a 35,000-gallon intermediate cone-roof storage tank. From there the caustic slowly drains to the waste treatment facility (Fig. 4). This disposal strategy absorbs large slugs of spent caustic that would otherwise upset the biological treatment system.

Fig. 4 A degassing vessel was installed to vent hydrocarbons from spent caustic before entering the storage tank.

In 2004, a spent-caustic system hazard and operability (HAZOP) study concluded that operator error could result in sending a large volume of LPG directly into the spent-caustic storage tank. Upon entering the tank, the LPG would vaporize and release a propane vapor cloud into the refinery. The history of fugitive vapor releases in refineries is not comforting; vapor releases continue to be responsible for extensive equipment damage and fatalities upon ignition. Therefore, a HAZOP action item was assigned to mitigate the risk for a vapor cloud release from the atmospheric spent-caustic storage tank pressure relief system.

A degassing vessel was retrofitted in front of the spent-caustic storage tank and commissioned on day 1 (actually in 2005). This system satisfied the HAZOP action item's purpose for hydrocarbon removal from the spent caustic entering the tank. For most of the time the system would operate in "fill" mode, where spent caustic from the upstream liquid/liquid LPG contact process would stagnate in the degassing vessel while venting hydrocarbons into the refinery flare header. After allowing sufficient time to pass, operators would perform a manual "dump" procedure by opening the discharge valve under nitrogen pressure to drain its degassed (vented) contents into the tank. Operators were expected to stand by the transfer valve during this manual procedure, to verify that the liquid seal above the degassing vessel's discharge nozzle inlet remained intact.

On day 529 (in 2007) the spent-caustic storage tank failed a leak detection and repair (LDAR) test, with over 2,000 ppm hydrocarbon measured exiting the tank's atmospheric pressure relief device (PRD). In compliance with refinery policy, a work order was issued to repair the leaking PRD within 15 days of discovery. The repair involved tightening the bolts around the PRD to stop the hydrocarbon leak.

After the repair, a second LDAR test was performed to confirm that the repair was successful so that the work order could be closed. However, the LDAR test failed again with over 2,000 ppm hydrocarbon being measured exiting the tank after the repair. In response, the results of the failed repair attempt were logged in the maintenance management system and another repair was scheduled. For the second repair, the PRD's sealing gasket was replaced.

The LDAR test failed again after the second repair attempt, with about 1,000 ppm hydrocarbon detected leaking out of the tank. The maintenance management system was again updated with the failure information, and a third repair attempt was scheduled. This repair was canceled, however, because a final LDAR test conducted before executing the work showed zero ppm hydrocarbon at the PRD.

On day 621 (2007) two contractors working near the tank both prematurely shut down their jobs at the same time, after a foul odor from an unidentified source invaded their work area. Operators were advised of the situation and they immediately responded by investigating the problem. However, the source for the release was not positively identified because the odor had dissipated by the time they entered the process unit to investigate the complaints. The contractors were allowed to resume working in the area and the odor did not return.

On day 628 (2007) the spent-caustic storage tank exploded suddenly and without warning shortly after operators initiated the procedure to drain spent caustic from the degassing vessel into the tank. Because the operator had left the valve to attend to another part of the process, there were no injuries or fatalities. However, the accident was severe. It caused the tank to become airborne, spread fire into the unit, and interrupted spent-caustic disposal operations. The damage imposed by the accident (Fig. 5) compromised any physical evidence that would expedite root-cause identification.


Fig. 5 Spent-caustic storage tank after explosion. 

Only after the incident were the repetitive LDAR failures and odor complaints recognized as warning signals that hydrocarbon was leaking through the degassing vessel into the tank. Remembering the ignition triangle, this satisfied the fuel requirement for an explosion. Although 50 years of reliable spent-caustic storage system operation had been experienced before the accident, the refinery was faced with compelling evidence that elements of a repair-based culture existed. This culture allowed three repeat failures (hydrocarbon vapor emission events) without investigating why hydrocarbons were entering the tank after commissioning the degassing vessel.

Fig. 6 Minimum nozzle submergence requirements (feet) to prevent vapor entrainment when draining liquid without a vortex breaker.

In the post-accident investigation, it was proven that the spent-caustic interface level did not drop below the degassing vessel's drain nozzle at the time of the accident. Therefore, attention shifted to alternative scenarios that would explain how hydrocarbons could penetrate the degassing vessel's liquid seal. By chasing down this thread, the investigation uncovered evidence that an unintended design condition existed, which allowed flare gas and LPG in the degassing vessel to contaminate the spent-caustic storage tank during the draining procedure. Since the degassing vessel was draining without a vortex breaker, it would have to operate according to the nozzle submergence requirements shown in Fig. 6 to avoid entraining hydrocarbon vapor in spent caustic. Archived process data provided evidence that the degassing vessel operated outside of these limits (Fig. 7). This means that hydrocarbon vapor was passing into the tank every time a transfer was made. The investigation uncovered additional systemic defects that explain how the failure became extreme. These conditions produced an unlikely combination of contributing factors:

•  A procedure deviation that made it possible for operators to transfer spent caustic without using nitrogen, which greatly increased the amount of hydrocarbon vapor in the degassing vessel headspace

•  The formation of a pyrophoric iron sulfide ignition source on the internal tank roof surface

•  Oxygen in the tank.

Both examples strongly reinforce repeat failures' involvement in extreme failures. In every case, a trustworthy and actionable cause emerges. It is based on evidence associated with a preceding repeat failure.

Recall, however, that the goal of a reliability-based organization is to recognize the warning signals and take action before an extreme failure triggers an accident investigation. The final example shows how this can be accomplished by taking appropriate intervention steps upon detecting a repeat failure.

Extreme failure avoidance. A five-stage, barrel-type, hydrogen recycle centrifugal compressor similar to the one shown in Fig. 8 is in service in a large Midwestern refinery's platformer unit. The compressor operates at 8,200 rpm and processes a recycle gas flow of about 97 MMscfd. The suction gas is contaminated with ammonium chloride. This situation is conducive to depositing salt on the rotor, which has been the presumed source for a series of recurring vibration events over the compressor's 30-year history.

Fig. 8 Typical barrel compressor internal bundle assembly after casing removal.

Fifteen months into a stable run after overhaul, the compressor tripped offline and coasted to a stop without lubrication following an unintended shutdown of both lube-oil supply pumps. After a warm restart, vibration appeared to be stable and in general very similar to conditions before the trip. Stable operation was interrupted a month later when the outboard radial bearing vibration suddenly jumped to 1.7 mils.

Vibration analysis indicated that subsynchronous vibration had developed due to a fluid instability problem that produced an "oil whirl" pattern. Two months later, the vibration profile deteriorated further into an "oil whip" pattern. This resulted in increasingly unstable and unpredictable vibration spikes exceeding 2 mils.

Reducing the frequency and severity of the vibration spikes was possible only by operating the compressor at speeds below 7,600 rpm. The speed curtailment resulted in a significant platformer unit rate cut. The economics favored shutting down the unit to repair the compressor rather than continuing to operate the machine below its normal running speed. The repair plan was limited to replacing the inboard and outboard floating-ring oil seals and tilt-pad radial bearings. These components were suspected to have been damaged by the accidental loss of lube oil. The repair plan also provided a rationale for the type of vibration experienced soon after, which indicated a fluid instability problem characterized by oil whip.

When the machine was opened for inspection, the maintenance staff was pleased to find radial-bearing and floating-ring oil seal damage consistent with their diagnosis. The damaged components were replaced and the compressor restarted. Unfortunately, the unstable subsynchronous vibration component remained at speeds above 7,600 rpm upon the compressor's return to service.

A second repair at considerable expense was scheduled in response to this unfortunate turn of events. Since the compressor barrel was to be opened for inspection, a complete overhaul was planned. A comprehensive vibration study was performed to narrow down the repair scope. An investigation was launched to determine if a repeat failure could explain this machine's long history of what appeared to be unrelated, but persistent, unstable vibration events at high speed.

Although the compressor is armed with an eddy-current type noncontacting shaft vibration monitoring and shutdown system, "unstable" and "high speed" are words that do not go well together in reliability and safety-based organizations. Therefore, refinery staff wanted to determine if rotor fouling and other discrete events were somehow related. The most recent of these events was the repair in which replacing the damaged components did nothing to correct the unstable vibration.

The vibration study provided evidence needed to determine both probable cause and, ultimately, avoidance of a repeat failure. Fig. 9 shows how the subsynchronous component adjusts to maintain a constant fractional relationship with the rotor speed. It is "locked-in" at a rotating frequency of 3,000 cpm that corresponds to the rotor's first natural fundamental frequency (critical speed). These characteristics apply to flexible rotors that operate above one or more shaft critical speeds.⁴ The compressor maintenance file contains a history of unstable vibration events at speeds above 7,600 rpm. These events date back to 1985 and consistently appeared within 18 to 24 months after overhaul. References document similar cases involving the aerodynamic excitation of a rotor's first natural fundamental frequency.⁵ This condition may be experienced with flexible rotors, due to the gradual deterioration of damping properties associated with normal operation after compressor overhaul.⁶
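As a rough illustration of the numbers quoted above, the short Python sketch below expresses the 3,000-cpm locked-in component as a fraction of running speed at the two speeds mentioned in the text (7,600 rpm and 8,200 rpm). The function name is hypothetical and the calculation is plain arithmetic on figures from the article; it is not a substitute for the cascade-plot analysis.

```python
def subsynchronous_ratio(component_cpm: float, running_rpm: float) -> float:
    """Return the vibration component frequency as a fraction of running speed (X)."""
    if running_rpm <= 0:
        raise ValueError("running speed must be positive")
    return component_cpm / running_rpm


# Values taken from the article text.
LOCKED_IN_CPM = 3000.0      # subsynchronous component / first critical speed
CURTAILED_RPM = 7600.0      # speed above which unstable vibration appeared
NORMAL_RPM = 8200.0         # normal operating speed

for rpm in (CURTAILED_RPM, NORMAL_RPM):
    ratio = subsynchronous_ratio(LOCKED_IN_CPM, rpm)
    print(f"{rpm:.0f} rpm: 3,000-cpm component sits at {ratio:.3f}X running speed")
```

At both speeds the component sits well below 1X, which is what makes it subsynchronous, and both running speeds are more than twice the 3,000-cpm critical speed.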


Fig. 7 Actual degassing vessel operation compared with minimum nozzle submergence requirements shows vapor entrainment occurring.

Fig. 9 Cascade plot showing a troublesome subsynchronous vibration component "locked-in" at 3,000 cpm along with expected synchronous (1X) vibration.

Aerodynamic rotor instability was thus identified as the probable cause for the history of compressor vibration events. This fact-based explanation developed the confidence management needed to approve the investigation team's long-term recommendation, i.e., to address the inherent instability by either redesigning or replacing the compressor. Most importantly, it interrupted an extreme failure's life cycle that might have resulted in unacceptable consequences, no matter what their relative "improbability." Bottom line: Tolerating repeat failures is inconsistent with reliability-focused thinking.

The science of warning signals. As these examples illustrate, rarely will an extreme failure occur simply based on a single, isolated event. Rather, extreme failures are produced when an existing repeat failure combines with other factors that are statistically unlikely to coexist. By way of analogy, repeat failures keep reappearing like bars on a gambling casino slot machine. Repeat failures are common, predictable events that independently represent low risk. But when all the bars line up, there is a payout. When certain deviations line up with repeat failures you get negative payout in the form of an extreme failure.


This is the basis for the "coupling" argument introduced by Charles Perrow in his classic Normal Accidents text. Perrow's basic premise is that complex systems are uniquely suited for two or more independent and innocuous conditions to combine at once to produce an unexpected catastrophic event.⁷ This principle is best reflected in our compressor example, where a flexible rotor (the latent cause) is no problem at all until it interacts with the contributing factors that align within 18 to 24 months of normal operation. Likewise, the normal deterioration from start-of-run conditions expected after 18 to 24 months would have little impact on a rigid rotor's aerodynamic stability operating in this specific service.

The benefit of recognizing and controlling a repeat failure is that eliminating only one of the coupling requirements can mitigate the risk for an extreme failure. For example, the accidents suffered in the case of the Hindenburg and the spent-caustic storage tank could have been prevented had the repeat failures (snapped bracing wires and hydrocarbon leakage, respectively) been resolved. It is more rewarding to trigger an investigation that prevents an accident rather than investigating the accident you could have prevented.
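The coupling argument can be put in numbers with a minimal sketch. It assumes the contributing factors are statistically independent, so the chance of all of them coinciding is the product of their individual probabilities. The factor names and probability values below are hypothetical, chosen only to show the arithmetic; they are not estimates for any of the cases described in this article.

```python
from math import isclose, prod


def coincidence_probability(factor_probabilities):
    """Probability that every independent factor is present at the same time."""
    probabilities = list(factor_probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(probabilities)


# Hypothetical per-demand probabilities, for illustration only.
factors = {
    "repeat_failure_active": 0.5,
    "procedure_deviation": 0.1,
    "ignition_source_present": 0.05,
}

baseline = coincidence_probability(factors.values())

# Resolving the repeat failure removes one coupling requirement entirely.
resolved_factors = {**factors, "repeat_failure_active": 0.0}
resolved = coincidence_probability(resolved_factors.values())

print(f"all factors tolerated:   {baseline:.4f}")
print(f"repeat failure resolved: {resolved:.4f}")
```

Because the terms multiply, driving any single factor to zero drives the product to zero. Under the independence assumption, that is the arithmetic behind eliminating only one of the coupling requirements.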

What can you do? Knowledge about the relationship between repeat failures and extreme failures adds value in two ways. First, it becomes possible to locate the facts we need to filter our beliefs, so that a credible probable cause can be identified when physical evidence has been compromised. Second, it promotes confidence that we control process reliability and safety and will not let it control us. By recognizing warning signals we can take deliberate actions to prevent extreme failures before suffering unacceptable consequences.

Since failure and accident prevention are the reliability-based organization's trademark, here area few suggestions:

•  Recognize repeat failures. Check reactive work orders and challenge the ones thatpop up regularly. Ask yourself, "Do I know why I'm working on this again?" Perform anRCFA if the answer is no.

•  Follow and enforce procedures. Shortcuts tend to introduce risks that procedures mitigate. Follow procedure steps in order. Before deviating from a procedure, communicate openly if you think there may be a better way to execute it, or if the steps do not make sense or seem out of order.

•  Use good judgment. When changing conditions or circumstances interfere with the plan, don't be afraid to enter a holding pattern or call time out. Stopping a job makes more sense than executing it unsafely.

•  Operate a near-miss awareness, reporting and investigation program. Ask employees to report things that don't look, sound or smell right. Follow up on employee concerns about unresolved problems. Resolve the issue and communicate findings back to them. Look for trends that indicate a bigger problem looming.

•  Develop and apply internal RCFA skills. Our biggest opportunity lies with correcting small failures to avoid the bigger ones. Ultimately, no time will be saved unless RCFA is performed.

•  RCFA triggers must be linked to repeat failures. Many organizations tier their RCFA levels according to safety, environmental and economic thresholds. Reserve a category for repeat failures and measure improvement (reduction) over time. The maintenance staff will appreciate reducing the backlog and their frustration over experiencing the same problems. You also benefit in knowing that you are systematically mitigating the risk for an improbable, yet far too costly, extreme failure (PSM incident).

•  Communicate and incorporate lessons learned. Lessons obtained by investigating repeat failures extend far beyond the equipment type on which they occur. They will benefit different units, areas, sites and even industries. Maximizing value from a single failure involves communicating lessons learned effectively throughout an organization. Lessons learned from outside resources can be obtained from numerous sources, such as the annual NPRA Safety Conference (www.npra.org), semiannual API/NPRA Operating Practices Symposium (www.api.org), the AIChE Spring National Meeting (www.aiche.org), and the US Chemical Safety Board (www.csb.gov).
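As one way to act on the first suggestion, the sketch below counts reactive work orders per equipment tag and flags the tags that keep coming back. The record layout, the equipment tags and the threshold of two orders are assumptions made for illustration only; they are not taken from any particular maintenance system or from the cases discussed in this article.

```python
from collections import Counter
from datetime import date

# Hypothetical reactive work-order records: (equipment tag, date closed).
# Tags and dates are invented for illustration.
WORK_ORDERS = [
    ("P-101", date(2019, 1, 14)),
    ("P-101", date(2019, 3, 2)),
    ("C-200", date(2019, 3, 20)),
    ("P-101", date(2019, 5, 9)),
    ("E-310", date(2019, 6, 1)),
    ("C-200", date(2019, 7, 15)),
]


def repeat_failures(orders, min_count=2):
    """Return {tag: order count} for tags with at least min_count reactive orders.

    min_count is an assumed threshold; each site would set its own trigger.
    """
    counts = Counter(tag for tag, _closed in orders)
    return {tag: n for tag, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Each flagged tag is a candidate for the question
    # "Do I know why I'm working on this again?"
    for tag, n in sorted(repeat_failures(WORK_ORDERS).items()):
        print(f"{tag}: {n} reactive work orders")
```

Running the same count over successive periods would give a simple way to track the reduction in repeat failures over time, provided the work orders are tagged consistently.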


Above all, remember that the machines we build perform and respond exactly as expected under the conditions to which they are exposed. Rarely, if ever, is the cause for a failure out of our control. Be convinced that answers and solutions will come to those who act on their responsibility to explain unacceptable equipment performance.

LITERATURE CITED 

1 Taleb, N. N., The Black Swan, Random House, New York, New York, p. 196, (ISBN 978-1-4000-6351-2), 2007.

2 Bloch, K. and S. Williams, "Normalize Deviance at Your Peril," Chemical Engineering, 111, No. 5, pp. 52–56, 2004.

3 "The Hindenburg Airship," Seconds From Disaster, Yavar Abbas, The National Geographic Channel, November 15, 2005.

4 Eisenmann, Sr., R., and R. Eisenmann, Jr., Machinery Malfunction Diagnosis and Correction, Prentice-Hall, Inc., Upper Saddle River, New Jersey, p. 436, (ISBN 0-13-240946-1, out of print), 1998.

5 Nicholas, J. C. and J. Kocur, "Rotordynamic Design of Centrifugal Compressors in Accordance with New API Stability Specifications," Proceedings of the Thirty-Fourth Turbomachinery Symposium, Turbomachinery Laboratory, Texas A&M University, College Station, Texas, pp. 25–34, 2005.

6 Eisenmann, op. cit., p. 436.

7 Perrow, C., Normal Accidents: Living With High-Risk Technologies, Princeton University Press, Princeton, New Jersey, p. 7, (ISBN 0-691-00412-9), 1999.

8 Lieberman, N., Troubleshooting Refinery Processes, PennWell Publishing Co., Tulsa, Oklahoma, p. 272, 1981.

The author  

Kenneth Bloch is lead process reliability engineer at Flint Hills Resources' Pine Bend Refinery in Rosemount, Minnesota. He is responsible for mitigating and investigating process-governed failures on refinery assets. A Certified API 510 Inspector, Mr. Bloch publishes articles on equipment failure analysis, life cycle extension, and reliability improvement in Hydrocarbon Processing and Chemical Engineering magazines, and is a regular participant and speaker at the semiannual API/NPRA Operating Practices Symposium and annual NPRA National Safety Conference. He holds a BS degree (honors) from Lamar University in Beaumont, Texas.