Case Studies in Process Safety: Lessons Learned from Software-Related Accidents

Terry L. Hardy, Safety & Risk Management, Great Circle Analytics, LLC, 1238 Race Street, Denver, CO, 80206; [email protected] (for correspondence)

Published online 30 October 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/prs.11638

Software and automation can play a significant role with respect to safety in the chemical process and energy production industries. Because computing systems are increasingly used to control critical functions, software may directly contribute to an accident. On the other hand, software can also be used as part of the hazard control strategy to reduce risks, and computing systems can provide valuable information to help make safety decisions. The importance of including software as part of an organization's efforts to analyze and manage hazards and risks seems clear, but for many organizations software is not effectively incorporated into process safety efforts. This article reviews lessons learned from accidents and incidents to illustrate the potential for a software-related accident even when process safety management tools and techniques are used. This discussion is intended to provide insights to help improve process safety and software safety efforts. © 2013 American Institute of Chemical Engineers Process Saf Prog 33: 124–130, 2014

Keywords: safety analysis; software safety; computing systems; process safety management

INTRODUCTION

Process Safety Management (PSM) generally refers to the application of management principles and systems to the identification, understanding, and control of process hazards to protect employees, facility assets, and the environment. PSM is focused on prevention of, preparedness for, mitigation of, response to, and restoration from catastrophic releases of chemicals or energy from a process associated with a facility. In Process Safety Management, a process is defined as any use, storage, manufacturing, handling, or the on-site movement of highly hazardous chemicals, or combination of these activities. PSM is regulated by OSHA through the PSM Standard.

Software and computing systems are becoming essential to the operation of most industries, often through the automation of tasks, and as such can have a major impact on process safety. Software and computing systems offer tremendous advantages that encourage the use of automation. Software can allow complex operations to be performed and can perform complicated calculations. Computing systems are reliable in the sense that a small number of components can perform a large number of actions. Computing systems are also flexible and can be modified to fit the application. From a safety perspective, automation can reduce the risks to operators by separating them from dangerous environments such as those in hazardous chemical processing facilities. In addition, automation can help by performing tasks that might pose a threat to humans, such as lifting large objects or operating high-speed cutting tools.

Software, computing systems, and automation have disadvantages as well. A major disadvantage to computing systems is that they often introduce increased complexity to systems, making them harder to understand, operate, maintain, and test. This complexity can also introduce new failure modes and interactions. Complex systems have a greater number of requirements, more interfaces, more subsystems, and more logical operations, increasing the potential for error. Computing systems can create a situation where no single operator can immediately foresee the consequences of a given action in the system. Computing systems can also increase coupling. Tight coupling occurs when processes are highly integrated and intrinsically time-dependent; for example, once a process has been set in motion, it must be completed within a certain period of time. Tightly coupled systems allow little time for operators to take action to prevent adverse effects. In addition, automation may require that operators think about their tasks in new ways, and may change the fundamental role of the operator from one who physically performs tasks to someone who monitors and communicates information. Finally, software will only do what a programmer has told it to do. Therefore, unlike humans, software cannot react to an abnormal situation unless specifically told to.

In spite of the fact that software is such an important part of complex systems, the analysis of hazards and risks from software has been inconsistent in the process industries. Safety analyses have historically been hardware-focused. Therefore, many analysts may not understand how to incorporate software into their system hazard analyses, and evaluators of those analyses may not understand what should be included and assessed. Organizations may be focused on compliance with regulations, which often do not address software, and as a result organizations may not properly assess or mitigate software risks. Organizations need to increase the attention given to addressing and analyzing the potential for hazards related to software and computing systems.

Software includes computer programs, procedures, scripts, rules, and associated documentation and data pertaining to the development and operation of a computer. Software can be developed by the organization implementing the system, by an outside software developer, or may be purchased as Commercial Off-The-Shelf software.

Software safety encompasses not just the software but also the computing system. A computing system includes the software and supporting hardware, sensors, effectors, humans who interact with the system, and data necessary for successful operation. Examples of computing systems include Programmable Logic Controllers (PLC) and Supervisory Control and Data Acquisition (SCADA) systems.

Through the review of hundreds of software-related accidents and incidents, and the author's personal experience in this discipline, common themes and lessons have been found [1]. This article will use case studies to illustrate those themes and lessons learned. The discussion will center on software safety as part of a broader process safety effort and will include recommendations that can be used to improve the safety of software-driven systems.

LESSONS LEARNED: SOFTWARE AND COMPUTING SYSTEM RELATED ACCIDENTS

This section discusses a number of lessons learned related to accidents and incidents where software and computing systems were contributors. These accidents are taken from detailed reports and investigative summaries from multiple industries and organizations and are included to broadly illustrate hazards and risks in software and computing systems. Note that in discussing these accidents, this article does not intend to oversimplify the events and conditions that led to the accidents or place blame on any individuals or organizations. There is rarely a single identifiable cause leading to an accident. Accidents are usually the result of complex factors that include hardware, software, human interactions, environments, and procedures. The descriptions are meant to provide examples of where the analysis process failed in some way. Readers are encouraged to review the complete accident and mishap investigation reports referenced to understand the often complex conditions and chain of events that led to each accident discussed here.

Decisions Made in the Acquisition and Planning Phases of Development Can Profoundly Affect Safety

Planning typically involves trade-offs between many different facets of the program, including cost, schedule, performance, and safety. Poor planning can lead to unexpected safety consequences, and many safety decisions are actually made in the planning and acquisition phase. For example, inadequate resources may be allocated to the software safety effort. This can result in a failure to perform hazard analyses and identify safety requirements early in the program, when these activities have the most impact. Therefore, safety personnel should be included in early phases of a program, especially when developing contractual requirements, and software safety considerations should be included in schedule and resource discussions.

On August 11, 1985, an Institute, West Virginia, chemical production facility leaked methylene chloride and aldicarb oxime, toxic chemicals used to manufacture the pesticide Temik. Six workers were injured, and more than a hundred residents living near the facility were sent to the hospital. The toxic chemical leak occurred when a tank overheated after a steam valve failed. Due to the failure, steam leaked into the heating jacket of a storage tank containing a mixture of methylene chloride and aldicarb oxime. The methylene chloride vaporized from the heating, leaving concentrated aldicarb oxime to settle on the bottom of the tank. Operators later pumped the material from the storage tank for processing. They thought they had emptied the tank, but concentrated aldicarb oxime remained in the storage tank because a level gauge was broken and the operators performed no additional checks. The concentrated aldicarb oxime continued to be heated over several days by the steam in the heating jacket, ultimately resulting in a runaway chemical reaction days later. The high pressure from the reaction led to release of aldicarb oxime through a flare, and large amounts of the chemical were also released to the atmosphere when a burst disc failed.

The root causes of the leak were found to be the use of a tank that was not designed to hold aldicarb oxime, faulty level indicators, defective safety valves, mistaken transmission of steam to a tank, and failure of control room operators to notice critical pressure and temperature changes. In addition, a computerized monitoring system contributed to the severity of the accident. Earlier in the year, the company had installed a new air-monitoring system that could automatically identify and detect releases of hazardous gases. That computerized system could also predict whether the gases would be contained within the release area or would migrate and potentially harm the public. This information was important because it could be used to mitigate risks through evacuation or other measures if a vapor cloud appeared to be spreading. Company officials initially stated that the computer system had failed and therefore they were not informed of the lack of containment of the aldicarb oxime gas cloud. However, they later admitted that they had not purchased a version of the software programmed to detect aldicarb oxime. Had the company invested in the more expensive version of the software, they could have predicted that a cloud of aldicarb oxime would not have been contained in the production area. However, the company had only ordered the basic software model, which did not contain aldicarb oxime properties [2,3].

Hazard Causes May Not Include Software and Computing Systems

Many organizations are improving their safety analyses with respect to software. However, software safety still lacks a sufficient number of qualified practitioners, and the methods for performing comprehensive analyses are not universally known. In many cases, software causes may not be included in hazard analyses. Or, if they are, the software causes may be stated in general terms in hazard analyses, such as including a generic hazard cause called "software error" instead of defining specifically what software functionality can lead to an undesirable outcome. In addition, the software hazard analyses may not pay enough attention to those cases where the software works exactly as intended but the implemented functionality was unsafe.

On October 18, 2005, an explosion occurred at a natural gas decompression/recompression facility near Empress, Alberta. Extensive damage occurred to the facility, but no injuries were reported. The facility decompressed and conditioned high-pressure gas from a pipeline to achieve pressures and temperatures consistent with what was needed in the Natural Gas Liquids (NGL) stripping facilities. At the time of the accident, the facility's "A" PLC lost communication with the NGL facility control room. Transportation Safety Board of Canada (TSB) investigators found that a processor card had failed, shutting down the PLC. When the PLC failed, its outputs were de-energized. However, some of the microprocessor functions required an energized signal to initiate physical shutdown of some functions. These functions included main and auxiliary lube oil pumps, purge/damper valves, and suction and discharge valves. As a result of this condition, the pumps did not shut down and valves did not close in a safe position when the PLC failed. Operators in the NGL control room received an alarm indicating that the communications were lost. They investigated the problem and believed that the failure in communication had triggered an emergency shutdown of the facility. The operators believed that the emergency shutdown meant that all components were in a safe state. However, because the recompressor motor did not shut down, a surge event occurred in the pipe. This surge caused a pipe to break, leading to a natural gas leak. The leaking natural gas ignited, resulting in an explosion. The TSB noted that the loss of a PLC had not been identified as a safety concern in the hazard analyses prior to the accident [4].
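The distinction at the heart of this accident, between outputs that fail to a safe state on loss of power and outputs that require a live controller to act, can be shown in a brief sketch. The following is a hypothetical illustration, not the Empress PLC program; the names and the output list are invented.

```python
# Hypothetical illustration of the Empress failure mode, not the actual
# PLC program. A de-energize-to-trip output moves to its safe state when
# the controller fails (output power is lost); an energize-to-trip output
# needs a live controller to command the safe state, so a dead PLC leaves
# it running.
from dataclasses import dataclass

@dataclass
class Output:
    name: str
    energize_to_trip: bool   # True: safe action requires an energized signal
    running: bool = True

    def on_plc_failure(self):
        """All outputs lose their drive signal when the processor card fails."""
        if not self.energize_to_trip:
            self.running = False  # de-energize-to-trip: loss of power = safe state
        # Energize-to-trip outputs never receive the shutdown signal:
        # they keep running even though the PLC is dead.

outputs = [
    Output("recompressor motor", energize_to_trip=True),
    Output("suction valve", energize_to_trip=True),
    Output("fuel gas solenoid", energize_to_trip=False),
]

for o in outputs:
    o.on_plc_failure()
    print(f"{o.name}: {'still running' if o.running else 'tripped safe'}")
# Only the de-energize-to-trip output reaches a safe state on controller failure.
```

A hazard analysis that asks "what state does each output assume when the controller itself dies?" would surface exactly the mismatch the TSB identified.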

Risks May Be Underestimated or Optimistically Evaluated

The risk assessment process helps decision makers in their risk decisions (e.g., accept the risk without change, reduce the risk to an acceptable level through hazard elimination and mitigation, transfer the risk, or forego the activity), assists in justifying the acceptance of residual risk, and allows communication of risks throughout the organization. A risk assessment can be either qualitative or quantitative, although the emphasis in process safety is often on qualitative risk assessment. Evaluating software and computing system risk can be difficult, especially if an extensive history does not exist with the automated equipment. Risk must be carefully evaluated based on known technical factors such as design complexity and maturity, degree of system testing, use of unproven technologies, potential for unexpected human intervention, and so on. Accidents have illustrated unfounded optimism in the use of software to assure safety.

On April 7, 2000, a pipeline failure occurred at the Chalk Point Generating Station in Maryland. The pipeline released approximately 140,400 gallons of fuel oil into surrounding wetlands, Swanson Creek, and the Patuxent River. The cost of the environmental response and cleanup exceeded $71 million. The U.S. National Transportation Safety Board (NTSB) determined that the probable cause of the accident was a fracture in a buckle in the pipe. The buckle went undiscovered because data from an in-line inspection tool were interpreted incorrectly. The NTSB stated that a contributor to the severity of the accident was inadequate operating procedures and practices for monitoring the magnitude of flow through the pipeline. When the leak occurred, the 12-inch-diameter pipeline was being cleaned. An automated pipeline monitoring system existed, but this automated system could not adequately monitor pipeline conditions during this operation because the locations of the meters, pressure-sensing points, and temperature-sensing points had not been configured to be in the direct liquid flow path. Therefore, the automated system did not provide alarms or other information to operations personnel. Field personnel manually recorded tank level measurements at the upstream and downstream points but did not use this information to evaluate whether any fuel oil had been lost. The manual monitoring procedures and lack of alarms from the automated system led to a 7-hour delay in responding to the leak. The field personnel finally realized there was a problem when the fuel pump began to cavitate. Using hand calculations, the crew then realized that they had not received over 3,000 barrels of fuel oil, and they shut the pipeline down. Following the accident, the company installed a SCADA system with software-based leak detection and radar tank gauges. In this case, the operators underestimated the risks associated with a leak and overestimated their ability to detect a failure [5].
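The hand calculation that finally revealed the leak is, at its core, a line balance: volume sent minus volume received, compared against a measurement tolerance. A minimal sketch of that reconciliation logic follows; the quantities and tolerance are illustrative, not values from the NTSB report.

```python
# Illustrative line-balance (volume reconciliation) leak check.
# The values and tolerance are hypothetical, not from the Chalk Point report.

def line_balance_leak_check(sent_bbl: float, received_bbl: float,
                            tolerance_bbl: float = 50.0) -> bool:
    """Return True if the apparent loss exceeds the measurement tolerance."""
    imbalance = sent_bbl - received_bbl
    return imbalance > tolerance_bbl

# Upstream tank drawdown says 12,000 bbl were pumped, but downstream tanks
# show only 9,000 bbl arrived: a 3,000 bbl shortfall flags a possible leak.
if line_balance_leak_check(sent_bbl=12_000.0, received_bbl=9_000.0):
    print("Imbalance exceeds tolerance: shut down and investigate")
```

Performed continuously by software, as in the SCADA-based leak detection installed after the accident, rather than by hand after a pump cavitates, the same arithmetic turns a 7-hour delay into minutes.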

Hazard Controls May Rely Only on Good Software Processes and Testing

Hazard controls are devices and approaches to mitigate the risk presented by a hazard by reducing either the severity of the hazard or the probability of its occurrence. Strong hazard control design follows the design order of precedence, where the first approach is to try to design out the hazard or minimize the risks through design selection. Many hazard reports will focus on the implementation of good software processes or extensive unit testing to prove that the design is safe. However, the software processes and testing will not help if the software design is flawed with respect to safe system operations (including hardware, software, processes, humans, and environments). In these cases, the software may operate exactly as designed, but the operation may be unsafe.

On February 11, 2003, an employee of a manufacturer in Gonzales, Texas, was fatally injured while performing maintenance on a reaction tank. The U.S. Mine Safety and Health Administration (MSHA) determined that the cause of the accident was a failure to close and secure a manual gate valve for a steam line and a failure to place the batch PLC in the stop mode. The company was a surface clay mill that purchased clay and blended, refined, milled, and processed the material into products used in paints, inks, and grease. On the day of the accident, the employee had been informed that there had been a product change in one of the batch processing systems. The employee was assigned to perform cleanup duties on a reactor tank. Two valves controlled steam entry into the tank: a manual gate valve and a butterfly valve with an automatic pneumatic actuator. The PLC controlled the functioning of the batch system based on sensors that monitored material flow. At the time of the accident, the PLC was in "slurry hold" mode. In this mode, the system was programmed to actuate the steam valve when the clay slurry level reached 5.5 feet. An aluminum extension ladder used by the employee caused the level sensor to falsely sense that slurry was in the reactor, which resulted in the PLC sending a command to open the steam valve. Because the manual valve had been left open, steam at 350°F then entered the tank, fatally burning the employee [6].
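The unsafe pattern here, a hazardous output commanded from a single spoofable sensor reading with no check of plant state, can be contrasted with a state-gated version in a short sketch. The function names, threshold, and gating signals below are hypothetical illustrations, not the plant's actual PLC logic; gating on a lockout signal is one common mitigation, named here plainly as the author's report does not prescribe it.

```python
# Hypothetical sketch: actuating a hazardous output from a single sensor
# reading versus denying it unless the plant state makes the demand
# plausible. Names and thresholds are invented, not from the MSHA report.

SLURRY_LEVEL_SETPOINT_FT = 5.5

def steam_valve_command_unsafe(level_ft: float) -> bool:
    # One spoofable reading (here, a ladder in the tank) is enough to
    # admit high-temperature steam.
    return level_ft >= SLURRY_LEVEL_SETPOINT_FT

def steam_valve_command_safer(level_ft: float,
                              maintenance_lockout: bool,
                              batch_active: bool) -> bool:
    # Deny the hazardous action unless the plant state supports it:
    # no lockout active, and a batch actually in progress.
    if maintenance_lockout or not batch_active:
        return False
    return level_ft >= SLURRY_LEVEL_SETPOINT_FT

# A ladder fools the level sensor either way, but only the unsafe version
# opens the steam valve during maintenance:
assert steam_valve_command_unsafe(6.0) is True
assert steam_valve_command_safer(6.0, maintenance_lockout=True,
                                 batch_active=False) is False
```

The sketch also shows why "the software worked as designed" is no defense: both functions faithfully implement their specifications, and only the specification of the second one is safe.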

Organizations May Fail to Consider the Possibility of Hazard Controls Not Working

Hazard controls are never 100 percent reliable. Therefore, it is important for the analyst to consider what could happen if the controls fail. Consideration should be given to unexpected hardware, software, and human interactions, and a "defense in depth" approach should be used, where multiple independent methods are used to prevent a hazard from becoming an accident. Controls should be carefully evaluated to identify what could happen if they do not work as expected, especially if those controls rely heavily on software and personnel actions.

On February 18, 2009, an employee was fatally injured at a coal preparation plant in the Hunter Valley region of New South Wales, Australia. The accident occurred when 10 tons of waste rock were inadvertently released from the reject bin and fell onto the cabin of the employee's truck. At the plant, raw coal was extracted from the mine, and usable coal was separated from waste rock. The waste rock was transferred approximately 2 kilometers on conveyers to the reject bin. The waste rock was then loaded from the reject bin onto trucks and hauled away. The process of loading the trucks with waste rock was controlled by a PLC system. The PLC system included truck detection sensors, traffic lights, bin capacity sensing, and remote control hand-held transmitters used by the truck drivers. On the day of the accident, the truck driver drove his truck under the reject bin delivery chute. A signal was sent from the handheld remote control to command the chute to open. The accident report stated that it was not clear whether the signal was sent inadvertently or intentionally. Opening the chute required that two of three lines of truck detection sensors be blocked, in addition to a command from the remote control, to assure that the truck was in the correct location. Each sensor line contained three sensors, and all three sensors had to be blocked for the entire line to be considered as blocked. At the time of the accident, the truck was obscuring one line of sensors, and a second line of sensors was obscured by dirt on the lenses and therefore was not working correctly. Because two of the sensor lines were blocked and the remote control signal had been sent, the PLC automatically opened the reject bin chute door and dropped 10 tons of material on the truck cab before the driver had safely cleared the chute, resulting in the fatal injury [7].
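A short sketch makes the weakness of this particular redundancy scheme concrete. The structure below is a simplified, hypothetical rendering of the 2-of-3 line vote described in the report: a sensor line whose failure mode (an occluded lens) is indistinguishable from "truck present" silently consumes one of the two required votes.

```python
# Illustrative sketch of the 2-of-3 sensor-line vote and why a line that
# fails in a state indistinguishable from "blocked" defeats the redundancy.
# Names and structure are hypothetical, not the plant's PLC program.

def line_blocked(sensors: list[bool]) -> bool:
    """A line counts as blocked only if all three of its sensors are blocked."""
    return all(sensors)

def chute_may_open(sensor_lines: list[list[bool]], remote_cmd: bool) -> bool:
    blocked_lines = sum(line_blocked(line) for line in sensor_lines)
    return remote_cmd and blocked_lines >= 2

# Truck straddles line 1; line 2 reads "blocked" only because dirt covers
# its lenses; line 3 is clear. The vote passes and the chute opens.
lines = [
    [True, True, True],    # line 1: truck genuinely present
    [True, True, True],    # line 2: dirty lenses, indistinguishable from a truck
    [False, False, False], # line 3: clear
]
print(chute_may_open(lines, remote_cmd=True))  # True -- the unsafe release
```

The vote implicitly assumes sensor failures are detectable. Because an occluded lens reads the same as a blocked beam, the redundancy only counts if failures are revealed some other way, for example by periodic proof tests or by sensors that fail to a distinguishable state.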

Testing May Not Provide Information on Subsystem and Component Interactions or Hazard Control Operations

The safety verification and validation process is intended to determine that the design solution has met all the safety requirements (verification) and that the correct system is built (validation). The verification and validation process, if performed correctly, will provide evidence that risk has been reduced. Testing is an important part of the verification and validation process, and comprehensive software testing should be conducted throughout the development cycle. Testing should include not only nominal conditions based on requirements but also abnormal conditions, such as improper inputs, inputs outside expected conditions, inputs exactly at and near the boundary conditions, and inputs stuck at some value. Safety-critical software must include full system integration testing of end-to-end events. That testing must include stressing of the software and should include interactions of the software with hardware, humans, and environments. A number of accidents have occurred when no component failed in the conventional sense, but the interaction of components caused a system failure. Testing must include verification of hazard controls, and it should assure that redundancy works when needed.
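As a concrete illustration of off-nominal coverage, the sketch below exercises a hypothetical trip check at, near, and far beyond its boundary, plus a stuck-input case. The function under test and its setpoint are invented for the example.

```python
# Hypothetical off-nominal test cases for a simple trip check. The function
# trip_required() is invented for illustration; the point is the input
# selection: at the boundary, just inside it, far outside the expected
# range, and stuck at a fixed value.
import unittest

TRIP_SETPOINT_PSI = 150.0

def trip_required(pressure_psi: float) -> bool:
    return pressure_psi >= TRIP_SETPOINT_PSI

class OffNominalTests(unittest.TestCase):
    def test_exactly_at_boundary(self):
        self.assertTrue(trip_required(150.0))

    def test_just_below_boundary(self):
        self.assertFalse(trip_required(149.99))

    def test_far_outside_expected_range(self):
        self.assertTrue(trip_required(10_000.0))  # absurd input must still trip

    def test_stuck_low_input(self):
        # A transmitter stuck at zero reads "safe" forever; catching this
        # requires a separate validity or rate-of-change check, which these
        # requirement-based tests alone will never exercise.
        self.assertFalse(trip_required(0.0))

if __name__ == "__main__":
    unittest.main()
```

Unit tests like these are necessary but, as the following accident shows, they say nothing about whether the tested component is even wired to the rest of the system.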

On October 24, 2002, a grinder exploded at a limestone processing plant in Foreman, Arkansas. An operator was killed when flammable waste fuel covered him and ignited. The operator had started the pump for solid waste fuel processing when the accident occurred. The MSHA stated in its report that the cause of the accident was that the safety monitoring system designed to shut off the waste fuel system pump had not been maintained so that it functioned properly. Kilns were used in processing activities at the plant, and these kilns were heated by burning coal, natural gas, and liquid waste fuel. The liquid waste fuel was delivered by truck or railcar and pumped into large storage tanks. From the storage area, it was pumped through a grinder to reduce the particle size of the solids in the fuel. Two independent systems monitored and controlled the waste fuel delivery. A Foxboro Intelligent Automation Distribution Control System monitored and recorded normal operating parameters. The Foxboro also issued audible and visual alarms that were available at the plant control room. A PLC provided basic startup and shutdown of the system and responded to commands from the Foxboro. On the day of the accident, the Foxboro sensed that the fuel delivery pressure was low, apparently due to blockage in the line. As designed, the Foxboro sent a command to the PLC to shut down the pumps. However, the PLC failed to respond, and the pumps kept running. Three months prior to the accident, this PLC had been installed; this was supposed to be a simple replacement of an older PLC of similar capability. However, the Foxboro had not been connected to the newer PLC, and the connections remained to the older, non-functioning PLC. The complete system had never been tested with the new PLC. A system test had been scheduled 3 days prior to the accident but had been aborted when a pump failed during the test; the test had never been rescheduled. The accident report stated that the blockage may have broken free just prior to the accident. With the pumps running, the pressure elevated significantly, and a "water hammer" effect caused overpressurization in the system at the grinder. The grinder was torn loose from its base, spraying fuel and pulling loose a 480-volt cable that ultimately served as an ignition source [8].

Software Change Management and Hazard Analyses Processes May Not Be Integrated

Engineering, by its very nature, is an activity that requires change. Changes can come about for a variety of reasons, including the discovery of problems during development, changes in requirements, or routine upgrades of software or hardware. As the development cycle proceeds and as an organization gains more operational experience, new hazards are often uncovered, and some hazards may no longer be relevant. If the hazard analysis is not updated to reflect these changes, then resources may be expended on previously identified hazards that may no longer be relevant, and new hazards may not be discovered as the design matures. In addition, routine changes and modifications to software and computing systems must be analyzed for potential hazards as part of the broader hazard analysis process.

On June 10, 1999, a 16-inch-diameter pipe carrying gasoline ruptured in Bellingham, Washington. The ruptured pipeline released approximately 237,000 gallons of gasoline into a nearby creek, according to the NTSB. That gasoline then ignited, burning approximately 1 1/2 miles along the creek. Three people died in the fire. One home and the Bellingham water treatment plant were also damaged in the accident.

The NTSB investigated the accident and determined that the probable cause of the rupture was damage to the pipe during a modification project performed in 1994. This damage weakened the pipeline, making it susceptible to rupture under increased pressure in the pipe. The NTSB also stated that inspections of the pipeline during the project were inadequate, and the company did not identify and repair damage. The report noted that in-line pipeline inspection data should have prompted the company to excavate and examine that section of pipeline, but the company failed to perform such work after reported anomalies. The SCADA system computers also played a role in the accident. The SCADA system was used for operation of the pipeline, for example, to open and close valves remotely as required or to operate pumps as needed. Just prior to the accident, the operator was preparing to initiate delivery of gasoline to a terminal in Seattle, diverting delivery from another facility. During the process of switching delivery destinations, the pressure in the pipeline began to increase, which is a normal condition but one that required the operator to start a pump to reduce pressure. However, when the operator tried to start that pump, the SCADA system failed to execute the start command issued by the operator. The operator soon found that the SCADA system was unresponsive to any commands, something that had never happened before. The report stated that "Had the controller been able to start the pump at Woodinville, it is probable that the pressure backup would have been alleviated and the pipeline operated routinely for the balance of the fuel delivery." Instead, the pressure in the pipe increased, and the increased pressure likely caused the damaged pipe to rupture.

The cause of the computer system failure was likely a change made to the system database just prior to the accident. The NTSB accident report stated that the SCADA system administrator entered new records into the live database at the time of the accident. The system administrator, however, did not check the records or test the system software to see if those changes introduced any problems. The computing system problem could not be replicated after the accident, and therefore the cause of the anomaly could not be definitively identified. The report stated, "The Safety Board concludes that had the SCADA database revisions that were performed shortly before the accident been performed and thoroughly tested on an off-line system instead of the primary on-line SCADA system, errors resulting from those revisions may have been identified and repaired before they could affect the operation of the pipeline [9]."

Human-Software Interactions Have Significant Safety Implications That Are Often Underestimated

Humans interact with hardware and software in a variety of ways. A number of accidents have occurred where these interactions have been a significant contributing cause. Of special concern is when a new computing system is implemented that changes what is expected of the operator in normal and emergency situations.

On August 28, 2008, an explosion occurred at a pesticide manufacturing plant in Institute, West Virginia. Two workers were killed in the explosion, and eight others were injured. The U.S. Chemical Safety and Hazard Investigation Board (CSB) found that the explosion was the result of a runaway chemical reaction. The company was starting up a methomyl unit for the first time after several months of down time for maintenance to install a new computer control system and reactor vessel. Normal operations called for dissolved methomyl and waste chemicals to be fed into a preheated residue treater vessel partially filled with solvent. The treater allowed the methomyl to decompose safely, after which it was mixed with other waste chemicals and used to fuel facility boilers. On the day of the accident, a methomyl-solvent mixture was added prematurely to the residue treater vessel, before solvent had been added to the tank. This event occurred when operators were troubleshooting equipment problems during the startup. This mixture was supposed to be added to the vessel only after that vessel was filled with clean solvent and heated to a minimum safe temperature. An interlock system existed as part of the automatic feed control system to prevent inadvertent introduction of methomyl. This interlock was password-protected to prevent inadvertent override, but operators intentionally overrode the interlock. Bypassing the interlocks had apparently been a practice condoned by management in the past as a workaround for operational problems. Once the methomyl decomposition reaction started, it could not be stopped, and the pressure rapidly rose in the vessel due to gas from the reaction, leading to the explosion. Uncrystallized methomyl existed in the tank due to equipment problems, and this material greatly increased the methomyl concentration in the residue treater, contributing to the runaway reaction.

The CSB stated that the company initiated this process startup before the company had completed critical checks, including valve lineups, equipment checkouts, and computer calibrations. The CSB said that the operating procedure for the startup was complex, but this procedure had not been reviewed or approved. In addition, the company had not performed training on the new computer system put in place as part of the maintenance. This new computer system offered significant improvements to help automatically control the operation. For example, the control system included graphical display screens that simulated the process flow. But the system was also complex, and the modifications changed the way operators interacted with the system. The operators thought the screens were difficult to navigate, and responding to troubleshooting alarms was difficult. The accident report stated that had the operators received adequate training on the new computer system, they may have been able to recognize problems in operation before the explosion.

The report faulted the company for not performing an adequate pre-startup safety review. System operators did not participate in the safety review, and review checklists showed items as completed when they were not. The CSB stated that the company had also failed to perform a thorough Process Hazard Analysis. The report stated that the Process Hazard Analysis was performed quickly because management had not allotted sufficient time for analysis. In addition, the CSB stated that the Process Hazard Analysis included invalid assumptions and said that the team did not apply the analysis tools properly, resulting in unmitigated accident scenarios [10].

Support Software, Including Models and Simulations, May Be As Critical to Safety As Control Software

While the focus of software safety efforts is usually on software directly controlling an application, support software may contribute to an accident or to the effectiveness of the emergency response. Examples of support software include databases used for maintenance activities, software to estimate load stability, computer-based models that provide design calculations or assurance information, and so on. A failure to analyze the hazards associated with this support software, including models and simulations, can result in unforeseen system failures.

On November 12, 2008, a 2-million-gallon liquid fertilizer tank in Chesapeake, Virginia, collapsed. Two workers performing welding operations at the site were seriously injured, and an adjacent neighborhood was partially flooded as a result of the accident. The CSB found that the company had not assured that welds to replace vertical joints met accepted industry standards, and the CSB faulted the company for its failure to perform inspections of the welds. The company was also faulted for not having proper procedures in place for filling the tanks following major facility modifications. In its report, the CSB also noted that the contractor hired by the company to calculate the maximum fill height had used some faulty assumptions. The maximum liquid level was supposed to be calculated in part based on the minimum measured shell thicknesses and the extent of the weld inspection (full, spot, or no radiography). The contractor used the maximum (not minimum) measured thickness and assumed full inspection of the welds. These assumptions led to an overestimation of the allowable liquid level. The tank failed at a fill level of 26.74 feet, below the calculated maximum of 27.01 feet. The CSB also noted a number of previous overfilling accidents. The CSB found 16 other tank failures at nine facilities in other states between 1995 and 2008. These 16 failures resulted in one death, four hospitalizations, one community evacuation, and two releases to waterways. Eleven occurred due to defective welding [11].
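The effect of the contractor's assumptions can be reproduced with a simplified hoop-stress estimate of allowable fill height. This sketch is illustrative only: all numbers are invented, and a real evaluation would follow API 653 rather than this bare thin-wall formula. The point is that using the maximum measured thickness and a full-radiography joint efficiency inflates the computed safe level.

```python
# Simplified hoop-stress estimate of allowable fill height (SI units).
# Illustrative only: real tank evaluations follow API 653, which this
# sketch deliberately simplifies, and all numbers here are invented.
G = 9.81  # gravitational acceleration, m/s^2

def max_fill_height_m(allow_stress_pa: float, joint_efficiency: float,
                      shell_thickness_m: float, radius_m: float,
                      density_kg_m3: float) -> float:
    # Hoop stress at the bottom shell course: sigma = rho * g * h * r / t.
    # Solve for h, derating the allowable stress by joint efficiency E.
    return (allow_stress_pa * joint_efficiency * shell_thickness_m) / (
        density_kg_m3 * G * radius_m)

args = dict(allow_stress_pa=137e6, radius_m=15.0, density_kg_m3=1300.0)

# Correct basis: minimum measured thickness, spot-radiography efficiency.
h_ok = max_fill_height_m(shell_thickness_m=0.0110, joint_efficiency=0.85, **args)
# The optimistic basis: maximum thickness, full-radiography efficiency.
h_bad = max_fill_height_m(shell_thickness_m=0.0127, joint_efficiency=1.0, **args)

print(f"conservative basis: {h_ok:.1f} m, optimistic basis: {h_bad:.1f} m")
# ~6.7 m versus ~9.1 m: a level the thinnest, least-inspected course cannot
# carry still appears to be "within limits" on the optimistic basis.
```

With inputs chosen optimistically, the calculated maximum exceeds what the weakest shell course can actually support, which is how a tank can fail below its "calculated" safe level.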

RECOMMENDATIONS FOR THE SOFTWARE SYSTEM SAFETY PROCESS

There is a great deal of information in the literature on how to improve software development processes. And standards such as IEC 61511, Functional Safety - Safety Instrumented Systems for the Process Industry Sector, and DO-178B, Software Considerations in Airborne Systems and Equipment Certification, provide invaluable information to help improve the safety of complex systems. However, the above themes and lessons learned indicate that improving software and computing system safety is more than just providing additional training, using standards, or implementing improved software development techniques or tools. Improving software safety requires a change in the way process safety practitioners think. Software safety must be fully integrated into the process safety analysis process. And process safety practitioners must begin to ask critical questions about their software safety efforts. Some examples of these questions include the following:

• Do plans reflect how business is really done? Are plans reviewed? Do plans have unrealistic schedules or resource allocations? Are software and computing systems considered in the planning and acquisition processes? Poor or unrealistic plans may reflect an organization that does not truly place a priority on safety activities.

• Is there a convincing story that the safety analysis is complete? Is there a sufficient description of the system to support an analysis, including the software and computing system? Are computing systems treated as "black boxes"? Have software causes been thoroughly evaluated using a systematic approach, or are the software causes listed as "software error" with no further explanation? Do the hazard analyses include support software, such as design models or databases? Failure to show that the problem is being looked at systematically could indicate that there are holes in the analysis, with potentially significant problems overlooked.

• Are the hazard reports detailed enough? Are causes descriptive? Are the recommendations clear and sufficiently detailed? Does the logic make sense, and is it complete? Do controls match up with the causes, showing a one-to-one or many-to-one relation? Lack of detail could be an indication of insufficient knowledge of the system or lack of information on the system.

• Do the risk assessments have a logical basis? Have the risk assessments considered software complexity, maturity, testing, use of unproven technologies, etc.? Are assessments overly optimistic based on what is known about the computing system? A failure to provide a basis for your risk assessments may result in unrealistic assessments.

• Are the hazard controls primarily procedural rather than design changes, safety features, or devices? Is there an overreliance on alarms and signage? Do software controls only rely on good software processes? Is there an overreliance on humans and software to "save the day"? Overreliance on operational controls may indicate a weak safety design.

• Are hazard control recommendations being implemented? Can the selected hazard control strategy actually be implemented and verified? Is the control strategy so complex that it will be impossible to determine whether it will work when needed? Complex controls, overlapping control strategies, or inadequate implementation may be an indication of a weak safety design.

• Has the risk assessment truly considered the worst case? What is the basis for the likelihood levels? Is the risk analyzed only for steady-state operation, or have startups and shutdowns also been considered? Failure to provide good answers to these questions indicates a potential misunderstanding of the risk.

• Are problems found in test and design included in the hazard reports and factored into the design? Have problems and incidents been fully investigated? Does a process exist for tracking action items, including whether recommendations have been implemented? Are changes to software and computing systems properly factored into the hazard analysis? Failure to incorporate problems, changes, and corrective actions is an indication of the potential to miss serious design flaws.

These questions help to identify whether the hazard analysis process is robust. However, we must also ask questions specifically related to the use of software and computing systems in complex systems. The best questions come from real-world examples of accidents where software has been a contributor. Some examples of questions are as follows, and others can be found in the literature [12]:

• Have safety-critical software commands and data been identified?

• Do hazard controls for software-related causes combine good practices and specific safeguards?

• Is software and system testing adequate, and do tests include sufficient off-nominal conditions?

• Is the computing system design overly complex?

• Is the design based on unproven technologies?

• What happens if the software locks up? (See the watchdog sketch following this list.)

• Are the sensors used for software decisions fault tolerant?

• Has software mode transition been considered?

• Has consideration been given to the order of commands and potential out-of-sequence inputs?

• Will the software and system start up and shut down in a known safe state?

• Are checks performed before initiating hazardous operations?

• Will the software properly handle spurious signals and power outages?
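"What happens if the software locks up?" is commonly answered with a watchdog: an independent timer that drives the system to a safe state if the control loop stops checking in. The sketch below is a minimal software illustration with a hypothetical timeout and safe-state action.

```python
# Minimal watchdog sketch: if the control loop stops checking in, an
# independent timer drives the outputs to a safe state. The timeout and
# safe-state action are hypothetical.
import threading

class Watchdog:
    def __init__(self, timeout_s: float, safe_state):
        self._timeout = timeout_s
        self._safe_state = safe_state
        self._timer = None

    def pet(self):
        """Called by the control loop every cycle; restarts the countdown."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self._timeout, self._safe_state)
        self._timer.start()

def drive_outputs_to_safe_state():
    print("Watchdog expired: de-energizing outputs")

wd = Watchdog(timeout_s=0.5, safe_state=drive_outputs_to_safe_state)
wd.pet()
# The control loop never pets the watchdog again (simulating a hang, like
# the unresponsive SCADA system at Bellingham), so after 0.5 s the timer
# fires and the safe-state action runs.
```

A software timer in the same process only illustrates the idea; a real design places the watchdog in independent hardware so that the mechanism survives failure of the very processor it guards.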

These are by no means all the questions a decision maker should ask, and positive answers to these questions provide no assurance that an accident will be prevented. These questions should encourage critical thinking about safety and generate additional questions to provide further insight on system risk. A failure to ask these questions could mean that the potential for an accident is higher than we had assumed.

It is up to all stakeholders to look for those conditions that could lead to an accident and to recognize that the worst can happen. This means we should all express concerns about software safety management and engineering when necessary, based on our knowledge, experience, and judgment, including lessons learned from accidents. We must ask questions to understand the potential for harm, to understand the steps taken to assure that the risks have been reduced, and to assure that there is proof that hazard controls are effective. We do not have to be software experts to ask questions, and in fact, a lack of expertise could be an advantage when trying to understand how a system works. Most importantly, when thinking of software and computing system safety, we must think of the system and its interactions between software, hardware, humans, processes, and environments.

CONCLUSION

PSM can provide immense benefits in reducing operational risks in the chemical process and energy production industries. By proactively identifying hazards, assessing and characterizing risks, and taking actions to reduce those risks, organizations can prevent accidents and reduce the potential for death, injury, property damage, and environmental impacts. Given the importance of software and computing systems in the design and operation of many systems, software must be included as part of the broader process safety effort. However, the accidents provided in this article, along with many others, illustrate that software is often not systematically considered as part of process safety activities. For example, software may not be identified as a hazard cause, risks may be underestimated, software-related hazard controls may be oversimplified, testing may not include sufficient safety scenarios, and system changes may not be factored into the design or hazard analysis. Process safety practitioners must take advantage of available software and computing system lessons learned, going beyond the examples presented here, to improve their own safety efforts and, most importantly, to prevent accidents.

LITERATURE CITED

1. T.L. Hardy, Software and System Safety: Accidents, Incidents, and Lessons Learned, AuthorHouse, Bloomington, IN, USA, 2012.

2. B. Pool, System Wrongfully Blamed in Union Carbide Leak; SAFER Fights Off Dark Cloud of Bad Publicity, Los Angeles Times, August 20, 1985.

3. N. Schlager, Breakdown: Deadly Technological Disasters, Visible Ink Press, Canton, MI, USA, 1995.

4. Transportation Safety Board of Canada, Programmable Logic Controller Failure, Foothills Pipe Lines Ltd. Decompression/Recompression Facility, BP Canada Energy Company, Empress Natural Gas Liquids Facility, Near Empress, Alberta, 18 October 2005, Report Number P05H0061, July 12, 2006.

5. U.S. National Transportation Safety Board, Rupture of Piney Point Oil Pipeline and Release of Fuel Oil Near Chalk Point, Maryland, April 7, 2000, Pipeline Accident Report NTSB/PAR-02/01, July 23, 2002.

6. U.S. Mine Safety and Health Administration, Report of Investigation, Fatal Other Accident (Steam Burns), February 11, 2003, Southern Clay Plants & Pits, Southern Clay Prod. Inc., Gonzales, Gonzales County, Texas, Mine ID No. 41-00298, 2003.

7. State of New South Wales (Australia), Department of Industry and Investment, Fatality involving David Hurst Oldknow, Ravensworth Underground Mine Coal Preparation Plant Reject Bin 802, 18 February 2009, May 2010.

8. U.S. Mine Safety and Health Administration, Report of Accident, Exploding Vessels Under Pressure Accident, October 24, 2002, Foreman Quarry and Plant, Ash Grove Cement Company, Foreman, Little River County, Arkansas, Mine ID No. 03-00256, 2003.

9. U.S. National Transportation Safety Board, Pipeline Rupture and Subsequent Fire in Bellingham, Washington, June 10, 1999, Pipeline Accident Report NTSB/PAR-02/02, October 8, 2002.

10. U.S. Chemical Safety and Hazard Investigation Board, Pesticide Chemical Runaway Reaction Pressure Vessel Explosion, Bayer CropScience, Institute, West Virginia, August 28, 2008, Report No. 2008-08-I-WV, January 2011.

11. U.S. Chemical Safety and Hazard Investigation Board, Investigation Report, Allied Terminals, Inc. - Catastrophic Tank Collapse, Allied Terminals, Inc., Chesapeake, Virginia, November 12, 2008, Report No. 2009-03-I-VA, May 2009.

12. T.L. Hardy, Essential Questions in System Safety: A Guide for Safety Decision Makers, AuthorHouse, Bloomington, IN, USA, 2011.

DOI 101002prs Process Safety Progress (Vol33 No2)130 June 2014 Published on behalf of the AIChE

Page 2: Case studies in process safety: Lessons learned from software-related accidents

purchased as Commercial Off-The-Shelf software Softwaresafety encompasses not just the software but also the com-puting system A computing system includes the softwareand supporting hardware sensors effectors humans whointeract with the system and data necessary for successfuloperation Examples of computing systems include Program-mable Logic Controllers (PLC) and Supervisory Control andData Acquisition (SCADA) systems

Through the review of hundreds of software-related acci-dents and incidents and the authorrsquos personal experience inthis discipline common themes and lessons have beenfound [1] This article will use case studies to illustrate thosethemes and lessons learned The discussion will center onsoftware safety as part of a broader process safety effort andwill include recommendations that can be used to improvethe safety of software-driven systems

LESSONS LEARNED SOFTWARE AND COMPUTING SYSTEM RELATED ACCIDENTS

This section discusses a number of lessons learned relatedto accidents and incidents where software and computingsystems were contributors These accidents are taken fromdetailed reports and investigative summaries from multipleindustries and organizations and are included to broadlyillustrate hazards and risks in software and computing sys-tems Note that in discussing these accidents this article doesnot intend to oversimplify the events and conditions that ledto the accidents or place blame on any individuals or organ-izations There is rarely a single identifiable cause leading toan accident Accidents are usually the result of complex fac-tors that include hardware software human interactionsenvironments and procedures The descriptions are meantto provide examples of where the analysis process failed insome way Readers are encouraged to review the completeaccident and mishap investigation reports referenced tounderstand the often complex conditions and chain of eventsthat led to each accident discussed here

Decisions Made in the Acquisition and PlanningPhases of Development Can Profoundly Affect Safety

Planning typically involves trade-offs between many dif-ferent facets of the program including cost schedule per-formance and safety Poor planning can lead to unexpectedsafety consequences and many safety decisions are actuallymade in the planning and acquisition phase For exampleinadequate resources may be allocated to the software safetyeffort This can result in a failure to perform hazard analysesand identify safety requirements early in the program whenthese activities have the most impact Therefore safety per-sonnel should be included in early phases of a programespecially when developing contractual requirements andsoftware safety considerations should be included in sched-ule and resource discussions

On August 11 1985 an Institute West Virginia chemicalproduction facility leaked methylene chloride and aldicarboxime toxic chemicals used to manufacture the pesticideTemik Six workers were injured and more than a hundredresidents living near the facility were sent to the hospitalThe toxic chemical leak occurred when a tank overheatedafter a steam valve failed Due to the failure steam leakedinto the heating jacket of a storage tank containing a mixtureof methylene chloride and aldicarb oxime The methylenechloride vaporized from the heating leaving concentratedaldicarb oxime to settle on the bottom of the tank Operatorslater pumped the material from the storage tank for process-ing They thought they had emptied the tank but concen-trated aldicarb oxime remained in the storage tank because alevel gauge was broken and the operators performedno additional checks The concentrated aldicarb oxime

continued to be heated over several days by the steam in theheating jacket ultimately resulting in a runaway chemicalreaction days later The high pressure from the reaction ledto release of aldicarb oxime through a flare and largeamounts of the chemical were also released to the atmos-phere when a burst disc failed

The root causes of the leak were found to be the use of atank that was not designed to hold aldicarb oxime faultylevel indicators defective safety valves mistaken transmis-sion of steam to a tank and failure of control room operatorsto notice critical pressure and temperature changes In addi-tion a computerized monitoring system contributed to theseverity of the accident Earlier in the year the company hadinstalled a new air-monitoring system that could automati-cally identify and detect releases of hazardous gases Thatcomputerized system could also predict whether the gaseswould be contained within the release area or would migrateand potentially harm the public This information was impor-tant because it could be used to mitigate risks through evac-uation or other measures if a vapor cloud appeared to bespreading Company officials initially stated that the com-puter system had failed and therefore they were notinformed of the lack of containment of the aldicarb oximegas cloud However they later admitted that they had notpurchased a version of the software programmed to detectaldicarb oxime Had the company invested in the moreexpensive version of the software they could have predictedthat a cloud of aldicarb oxime would not have been con-tained in the production area However the company hadonly ordered the basic software model that did not containaldicarb oxime properties [23]

Hazard Causes May Not Include Software andComputing Systems

Many organizations are improving their safety analyseswith respect to software However software safety still lackssufficient number of qualified practitioners and the methodsfor performing comprehensive analyses are not universallyknown In many cases software causes may not be includedin hazard analyses Or if they are the software causes maybe stated in general terms in hazard analyses such as includ-ing a generic hazard cause called ldquosoftware errorrdquo instead ofdefining specifically what software functionality can lead toan undesirable outcome In addition the software hazardanalyses may not pay enough attention to those cases wherethe software works exactly as intended but the implementedfunctionality was unsafe

On October 18, 2005, an explosion occurred at a natural gas decompression/recompression facility near Empress, Alberta. Extensive damage occurred to the facility, but no injuries were reported. The facility decompressed and conditioned high-pressure gas from a pipeline to achieve pressures and temperatures consistent with what was needed in the Natural Gas Liquids (NGL) stripping facilities. At the time of the accident, the facility's "A" PLC lost communication with the NGL facility control room. Transportation Safety Board of Canada (TSB) investigators found that a processor card had failed, shutting down the PLC. When the PLC failed, its outputs were de-energized. However, some of the microprocessor functions required an energized signal to initiate physical shutdown of some functions. These functions included the main and auxiliary lube oil pumps, purge/damper valves, and suction and discharge valves. As a result of this condition, the pumps did not shut down, and valves did not close in a safe position when the PLC failed. Operators in the NGL control room received an alarm indicating that communications were lost. They investigated the problem and believed that the failure in communication had triggered an emergency shutdown of the facility. The operators believed that the emergency shutdown meant that all components were in a safe state. However, because the recompressor motor did not shut down, a surge event occurred in the pipe. This surge caused a pipe to break, leading to a natural gas leak. The leaking natural gas ignited, resulting in an explosion. The TSB noted that the loss of a PLC had not been identified as a safety concern in the hazard analyses prior to the accident [4].
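The Empress failure mode, in which final elements that needed an energized signal to reach their safe states simply froze when the PLC outputs de-energized, can be modeled in a few lines. The following sketch is hypothetical rather than the actual control logic; it predicts where each element ends up when the controller output goes dead, which is exactly the question a hazard analysis should have asked:

    # Hypothetical final elements: each has a safe state and an actuation style.
    # "de-energize-to-trip" elements move to the safe state on loss of signal;
    # "energize-to-trip" elements need an active command, so they stay where
    # they are when the controller dies -- the Empress failure mode.
    ELEMENTS = {
        "suction_valve":   {"safe_state": "closed", "style": "energize-to-trip"},
        "discharge_valve": {"safe_state": "closed", "style": "energize-to-trip"},
        "lube_oil_pump":   {"safe_state": "off",    "style": "energize-to-trip"},
        "vent_valve":      {"safe_state": "open",   "style": "de-energize-to-trip"},
    }

    def state_after_plc_loss(current_states: dict) -> dict:
        """Predict each element's state when all PLC outputs de-energize."""
        result = {}
        for name, spec in ELEMENTS.items():
            if spec["style"] == "de-energize-to-trip":
                result[name] = spec["safe_state"]    # loss of signal = safe
            else:
                result[name] = current_states[name]  # frozen in last position
        return result

    # Running plant, then the PLC processor card fails:
    running = {"suction_valve": "open", "discharge_valve": "open",
               "lube_oil_pump": "on", "vent_valve": "closed"}
    after = state_after_plc_loss(running)
    unsafe = [n for n, s in after.items() if s != ELEMENTS[n]["safe_state"]]
    print("Elements NOT in safe state after PLC loss:", unsafe)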

Risks May Be Underestimated or Optimistically Evaluated

The risk assessment process helps decision makers in their risk decisions (e.g., accept the risk without change; reduce the risk to an acceptable level through hazard elimination and mitigation; transfer the risk; or forego the activity), assists in justifying the acceptance of residual risk, and allows communication of risks throughout the organization. A risk assessment can be either qualitative or quantitative, although the emphasis in process safety is often on qualitative risk assessment. Evaluating software and computing system risk can be difficult, especially if an extensive history does not exist with the automated equipment. Risk must be carefully evaluated based on known technical factors such as design complexity and maturity, degree of system testing, use of unproven technologies, potential for unexpected human intervention, and so on. Accidents have illustrated unfounded optimism in the use of software to assure safety.
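Qualitative assessments are often recorded as a severity-likelihood matrix. The sketch below uses hypothetical category names and cell values; the point of the technical factors listed above is that a software-intensive system with low maturity or limited testing should not be assigned an optimistic likelihood without a documented basis:

    # Hypothetical qualitative risk matrix: rows are severity, columns likelihood.
    SEVERITY = ["negligible", "marginal", "critical", "catastrophic"]
    LIKELIHOOD = ["improbable", "remote", "occasional", "probable", "frequent"]

    MATRIX = [  # risk category per (severity, likelihood) cell
        ["low",    "low",    "low",    "medium", "medium"],
        ["low",    "low",    "medium", "medium", "high"],
        ["low",    "medium", "medium", "high",   "high"],
        ["medium", "medium", "high",   "high",   "high"],
    ]

    def risk_category(severity: str, likelihood: str) -> str:
        """Look up the qualitative risk category for one hazard scenario."""
        return MATRIX[SEVERITY.index(severity)][LIKELIHOOD.index(likelihood)]

    # Unproven software with little operating history argues against an
    # optimistic likelihood claim: the basis for "remote" must be documented.
    print(risk_category("catastrophic", "remote"))    # medium
    print(risk_category("catastrophic", "probable"))  # high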

On April 7, 2000, a pipeline failure occurred at the Chalk Point Generating Station in Maryland. The pipeline released approximately 140,400 gallons of fuel oil into surrounding wetlands, Swanson Creek, and the Patuxent River. The cost of the environmental response and cleanup exceeded $71 million. The U.S. National Transportation Safety Board (NTSB) determined that the probable cause of the accident was a fracture in a buckle in the pipe. The buckle went undiscovered because data from an in-line inspection tool were interpreted incorrectly. The NTSB stated that a contributor to the severity of the accident was inadequate operating procedures and practices for monitoring the magnitude of flow through the pipeline. When the leak occurred, the 12-inch-diameter pipeline was being cleaned. An automated pipeline monitoring system existed, but this automated system could not adequately monitor pipeline conditions during this operation because the locations of the meters, pressure-sensing points, and temperature-sensing points had not been configured to be in the direct liquid flow path. Therefore, the automated system did not provide alarms or other information to operations personnel. Field personnel manually recorded tank level measurements at the upstream and downstream points but did not use this information to evaluate whether any fuel oil had been lost. The manual monitoring procedures and the lack of alarms from the automated system led to a 7-hour delay in responding to the leak. The field personnel finally realized there was a problem when the fuel pump began to cavitate. Using hand calculations, the crew then realized that they had not received over 3,000 barrels of fuel oil, and they shut the pipeline down. Following the accident, the company installed a SCADA system with software-based leak detection and radar tank gauges. In this case, the operators underestimated the risks associated with a leak and overestimated their ability to detect a failure [5].

Hazard Controls May Rely Only on Good Software Processes and Testing

Hazard controls are devices and approaches intended to mitigate the risk presented by a hazard by reducing either the severity of the hazard or the probability of its occurrence. Strong hazard control design follows the design order of precedence, where the first approach is to try to design out the hazard or minimize the risks through design selection. Many hazard reports will focus on the implementation of good software processes or extensive unit testing to prove that the design is safe. However, the software processes and testing will not help if the software design is flawed with respect to safe system operations (including hardware, software, processes, humans, and environments). In these cases, the software may operate exactly as designed, but the operation may be unsafe.

On February 11, 2003, an employee of a manufacturer in Gonzales, Texas, was fatally injured while performing maintenance on a reaction tank. The U.S. Mine Safety and Health Administration (MSHA) determined that the cause of the accident was a failure to close and secure a manual gate valve for a steam line and a failure to place the batch PLC in the stop mode. The company was a surface clay mill that purchased clay and blended, refined, milled, and processed the material into products used in paints, inks, and grease. On the day of the accident, the employee had been informed that there had been a product change in one of the batch processing systems. The employee was assigned to perform cleanup duties on a reactor tank. Two valves controlled steam entry into the tank: a manual gate valve and a butterfly valve with an automatic pneumatic actuator. The PLC controlled the functioning of the batch system based on sensors that monitored material flow. At the time of the accident, the PLC was in "slurry hold" mode. In this mode, the system was programmed to actuate the steam valve when the clay slurry level reached 5.5 feet. An aluminum extension ladder used by the employee caused the level sensor to falsely sense that slurry was in the reactor, which resulted in the PLC sending a command to open the steam valve. Because the manual valve had been left open, steam at 350°F then entered the tank, fatally burning the employee [6].

Organizations May Fail to Consider the Possibility of Hazard Controls Not Working

Hazard controls are never 100 percent reliable. Therefore, it is important for the analyst to consider what could happen if the controls fail. Consideration should be given to unexpected hardware, software, and human interactions, and a "defense in depth" approach should be used, where multiple independent methods are used to prevent a hazard from becoming an accident. Controls should be carefully evaluated to identify what could happen if they do not work as expected, especially if those controls rely heavily on software and personnel actions.

On February 18, 2009, an employee was fatally injured at a coal preparation plant in the Hunter Valley region of New South Wales, Australia. The accident occurred when 10 tons of waste rock were inadvertently released from the reject bin and fell onto the cabin of the employee's truck. At the plant, raw coal was extracted from the mine, and usable coal was separated from waste rock. The waste rock was transferred approximately 2 kilometers on conveyors to the reject bin. The waste rock was then loaded from the reject bin onto trucks and hauled away. The process of loading the trucks with waste rock was controlled by a PLC system. The PLC system included truck detection sensors, traffic lights, bin capacity sensing, and remote control hand-held transmitters used by the truck drivers. On the day of the accident, the truck driver drove his truck under the reject bin delivery chute. A signal was sent from the handheld remote control to command the chute to open. The accident report stated that it was not clear whether the signal was sent inadvertently or intentionally. Opening the chute required that two of three lines of truck detection sensors be blocked, in addition to a command from the remote control, to assure that the truck was in the correct location. Each sensor line contained three sensors, and all three sensors had to be blocked for the entire line to be considered blocked. At the time of the accident, the truck was obscuring one line of sensors, and a second line of sensors was obscured by dirt on the lenses and therefore was not working correctly. Because two of the sensor lines were blocked and the remote control signal had been sent, the PLC automatically opened the reject bin chute door and dropped 10 tons of material on the truck cab before the driver had safely cleared the chute, resulting in the fatal injury [7].
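The reject-bin interlock shows how a voting scheme can be silently defeated when a sensor line fails in the "blocked" direction. Below is a minimal, hypothetical sketch of the two-out-of-three logic; the dirty lenses supply a standing vote, so one real truck is enough to satisfy the permissive:

    def line_blocked(sensors: list[bool]) -> bool:
        """A line reads 'blocked' only when all three of its sensors are blocked."""
        return all(sensors)

    def chute_may_open(lines: list[list[bool]], remote_cmd: bool) -> bool:
        """Two of three blocked lines plus a remote command permits opening."""
        blocked = sum(line_blocked(line) for line in lines)
        return remote_cmd and blocked >= 2

    # Truck blocks line 0; line 1 reads 'blocked' only because dirty lenses
    # make its sensors read blocked continuously; line 2 is clear.
    lines = [[True, True, True],     # truck actually present
             [True, True, True],     # dirt on lenses -- a latent failure
             [False, False, False]]  # clear
    print(chute_may_open(lines, remote_cmd=True))  # True: interlock defeated

    # A diagnostic such as "a line blocked for hours with no truck movement
    # is suspect" would flag line 1 as invalid instead of counting its vote.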

Testing May Not Provide Information on Subsystem and Component Interactions or Hazard Control Operations

The safety verification and validation process is intended to determine that the design solution has met all the safety requirements (verification) and that the correct system is built (validation). The verification and validation process, if performed correctly, will provide evidence that risk has been reduced. Testing is an important part of the verification and validation process, and comprehensive software testing should be conducted throughout the development cycle. Testing should include not only nominal conditions based on requirements but also abnormal conditions, such as improper inputs, inputs outside expected conditions, inputs exactly at and near the boundary conditions, and inputs stuck at some value. Safety-critical software must include full system integration testing of end-to-end events. That testing must include stressing of the software and should include interactions of the software with hardware, humans, and environments. A number of accidents have occurred when no component failed in the conventional sense, but the interaction of components caused a system failure. Testing must include verifications of hazard controls, and it should assure that redundancy works when needed.
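As one concrete illustration of off-nominal testing, the sketch below (the function, setpoint, and test framework usage are illustrative assumptions, not any of the systems described here) exercises a trip check at, just inside, and just outside its boundary, plus stuck and nonsensical inputs, rather than only the nominal case:

    import pytest  # assumed available; any test framework works similarly

    HIGH_LIMIT = 5.5  # hypothetical trip setpoint, in feet

    def steam_valve_should_open(level_ft: float) -> bool:
        """Open steam only when the measured level is at or above the setpoint."""
        return level_ft >= HIGH_LIMIT

    @pytest.mark.parametrize("level,expected", [
        (3.0, False),     # nominal, below setpoint
        (5.49, False),    # just under the boundary
        (5.5, True),      # exactly at the boundary
        (5.51, True),     # just over the boundary
        (0.0, False),     # empty vessel
        (9999.0, True),   # stuck-high reading: should this open steam,
                          # or should it trip an instrument-fault alarm?
        (-1.0, False),    # negative reading is a sensor fault, not a level
    ])
    def test_boundary_and_off_nominal(level, expected):
        assert steam_valve_should_open(level) is expected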

On October 24, 2002, a grinder exploded at a limestone processing plant in Foreman, Arkansas. An operator was killed when flammable waste fuel covered him and ignited. The operator had started the pump for solid waste fuel processing when the accident occurred. The MSHA stated in its report that the cause of the accident was that the safety monitoring system designed to shut off the waste fuel system pump had not been maintained so that it functioned properly. Kilns were used in processing activities at the plant, and these kilns were heated by burning coal, natural gas, and liquid waste fuel. The liquid waste fuel was delivered by truck or railcar and pumped into large storage tanks. From the storage area it was pumped through a grinder to reduce the particle size of the solids in the fuel. Two independent systems monitored and controlled the waste fuel delivery. A Foxboro Intelligent Automation Distribution Control System monitored and recorded normal operating parameters. The Foxboro also issued audible and visual alarms that were available at the plant control room. A PLC provided basic startup and shutdown of the system and responded to commands from the Foxboro. On the day of the accident, the Foxboro sensed that the fuel delivery pressure was low, apparently due to blockage in the line. As designed, the Foxboro sent a command to the PLC to shut down the pumps. However, the PLC failed to respond, and the pumps kept running. Three months prior to the accident, this PLC had been installed; it was supposed to be a simple replacement of an older PLC of similar capability. However, the Foxboro had not been connected to the newer PLC, and the connections remained to the older, non-functioning PLC. The complete system had never been tested with the new PLC. A system test had been scheduled 3 days prior to the accident but had been aborted when a pump failed during the test; the test had never been rescheduled. The accident report stated that the blockage may have broken free just prior to the accident. With the pumps running, the pressure elevated significantly, and a "water hammer" effect caused overpressurization in the system at the grinder. The grinder was torn loose from its base, spraying fuel and pulling loose a 480-volt cable that ultimately served as an ignition source [8].

Software Change Management and Hazard Analysis Processes May Not Be Integrated

Engineering, by its very nature, is an activity that requires change. Changes can come about for a variety of reasons, including the discovery of problems during development, changes in requirements, or routine upgrades of software or hardware. As the development cycle proceeds, and as an organization gains more operational experience, new hazards are often uncovered, and some hazards may no longer be relevant. If the hazard analysis is not updated to reflect these changes, then resources may be expended on previously identified hazards that may no longer be relevant, and new hazards may not be discovered as the design matures. In addition, routine changes and modifications to software and computing systems must be analyzed for potential hazards as part of the broader hazard analysis process.
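One simple way to integrate software change management with hazard analysis is to make the change record itself carry the hazard disposition. The sketch below, with hypothetical fields and identifiers, refuses to treat a change as ready for release until it has been dispositioned against the hazard analysis, even when the disposition is "no hazard impact":

    from dataclasses import dataclass, field

    @dataclass
    class SoftwareChange:
        """A change record that cannot close without a hazard disposition."""
        change_id: str
        description: str
        affected_functions: list = field(default_factory=list)
        hazard_review_done: bool = False
        hazards_updated: list = field(default_factory=list)

        def ready_for_release(self) -> bool:
            # A "routine" change is still a change: the review flag is
            # mandatory, even if the disposition is "no hazard impact".
            return self.hazard_review_done

    change = SoftwareChange(
        change_id="CHG-1042",  # hypothetical identifier
        description="Revise SCADA point database for new delivery terminal",
        affected_functions=["pump start command routing"],
    )
    assert not change.ready_for_release()  # blocked until reviewed

    change.hazard_review_done = True
    change.hazards_updated = ["overpressure during delivery switch"]
    assert change.ready_for_release()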

On June 10, 1999, a 16-inch-diameter pipe carrying gasoline ruptured in Bellingham, Washington. The ruptured pipeline released approximately 237,000 gallons of gasoline into a nearby creek, according to the NTSB. That gasoline then ignited, burning approximately 1.5 miles along the creek. Three people died in the fire. One home and the Bellingham water treatment plant were also damaged in the accident.

The NTSB investigated the accident and determined that the probable cause of the rupture was damage to the pipe during a modification project performed in 1994. This damage weakened the pipeline, making it susceptible to rupture under increased pressure in the pipe. The NTSB also stated that inspections of the pipeline during the project were inadequate and that the company did not identify and repair the damage. The report noted that in-line pipeline inspection data should have prompted the company to excavate and examine that section of pipeline, but the company failed to perform such work after reported anomalies. The SCADA system computers also played a role in the accident. The SCADA system was used for operation of the pipeline, for example, to open and close valves remotely as required or to operate pumps as needed. Just prior to the accident, the operator was preparing to initiate delivery of gasoline to a terminal in Seattle, diverting delivery from another facility. During the process of switching delivery destinations, the pressure in the pipeline began to increase, which is a normal condition but one that required the operator to start a pump to reduce pressure. However, when the operator tried to start that pump, the SCADA system failed to execute the start command issued by the operator. The operator soon found that the SCADA system was unresponsive to any commands, something that had never happened before. The report stated that "Had the controller been able to start the pump at Woodinville, it is probable that the pressure backup would have been alleviated and the pipeline operated routinely for the balance of the fuel delivery." Instead, the pressure in the pipe increased, and the increased pressure likely caused the damaged pipe to rupture.

The cause of the computer system failure was likely a change made to the system database just prior to the accident. The NTSB accident report stated that the SCADA system administrator entered new records into the live database at the time of the accident. The system administrator, however, did not check the records or test the system software to see if those changes introduced any problems. The computing system problem could not be replicated after the accident, and therefore the cause of the anomaly could not be definitively identified. The report stated: "The Safety Board concludes that had the SCADA database revisions that were performed shortly before the accident been performed and thoroughly tested on an off-line system instead of the primary on-line SCADA system, errors resulting from those revisions may have been identified and repaired before they could affect the operation of the pipeline" [9].

Human-Software Interactions Have Significant Safety Implications That Are Often Underestimated

Humans interact with hardware and software in a variety of ways. A number of accidents have occurred where these interactions have been a significant contributing cause. Of special concern is when a new computing system is implemented that changes what is expected of the operator in normal and emergency situations.

On August 28, 2008, an explosion occurred at a pesticide manufacturing plant in Institute, West Virginia. Two workers were killed in the explosion, and eight others were injured. The U.S. Chemical Safety and Hazard Investigation Board (CSB) found that the explosion was the result of a runaway chemical reaction. The company was starting up a methomyl unit for the first time after several months of down time for maintenance to install a new computer control system and reactor vessel. Normal operations called for dissolved methomyl and waste chemicals to be fed into a preheated residue treater vessel partially filled with solvent. The treater allowed the methomyl to decompose safely, after which it was mixed with other waste chemicals and used to fuel facility boilers. On the day of the accident, a methomyl-solvent mixture was added prematurely to the residue treater vessel, before solvent had been added to the tank. This event occurred when operators were troubleshooting equipment problems during the startup. This mixture was supposed to be added to the vessel only after that vessel was filled with clean solvent and heated to a minimum safe temperature. An interlock system existed as part of the automatic feed control system to prevent inadvertent introduction of methomyl. This interlock was password-protected to prevent inadvertent override, but operators intentionally overrode the interlock. Bypassing the interlocks had apparently been a practice condoned by management in the past as a workaround for operational problems. Once the methomyl decomposition reaction started, it could not be stopped, and the pressure rapidly rose in the vessel due to gas from the reaction, leading to the explosion. Uncrystallized methomyl existed in the tank due to equipment problems, and this material greatly increased the methomyl concentration in the residue treater, contributing to the runaway reaction.
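The methomyl interlock was technically sound but organizationally defeated: overrides were condoned and left no trail. One design response, sketched below with hypothetical names and behavior, is to make an override an explicit, attributed, and time-limited event rather than a silent bypass:

    import time

    class Interlock:
        """Feed-permissive interlock with audited, time-limited overrides."""
        def __init__(self, override_window_s: float = 3600.0):
            self.override_until = 0.0
            self.audit_log = []  # in practice, an append-only external record
            self.window = override_window_s

        def override(self, who: str, justification: str) -> None:
            # Overriding requires a named person and a reason, and it expires.
            self.override_until = time.time() + self.window
            self.audit_log.append((time.time(), who, justification))

        def feed_permitted(self, solvent_charged: bool, temp_ok: bool) -> bool:
            if solvent_charged and temp_ok:
                return True
            return time.time() < self.override_until  # override path is visible

    ilk = Interlock()
    print(ilk.feed_permitted(solvent_charged=False, temp_ok=False))  # False
    ilk.override(who="shift_supervisor", justification="startup troubleshooting")
    print(ilk.feed_permitted(solvent_charged=False, temp_ok=False))  # True, logged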

The CSB stated that the company initiated this process startup before the company had completed critical checks, including valve lineups, equipment checkouts, and computer calibrations. The CSB said that the operating procedure for the startup was complex, but this procedure had not been reviewed or approved. In addition, the company had not performed training on the new computer system put in place as part of the maintenance. This new computer system offered significant improvements to help automatically control the operation. For example, the control system included graphical display screens that simulated the process flow. But the system was also complex, and the modifications changed the way operators interacted with the system. The operators thought the screens were difficult to navigate, and responding to troubleshooting alarms was difficult. The accident report stated that had the operators received adequate training on the new computer system, they may have been able to recognize problems in operation before the explosion.

The report faulted the company for not performing an adequate pre-startup safety review. System operators did not participate in the safety review, and review checklists showed items as completed when they were not. The CSB stated that the company had also failed to perform a thorough Process Hazard Analysis. The report stated that the Process Hazard Analysis was performed quickly because management had not allotted sufficient time for analysis. In addition, the CSB stated that the Process Hazard Analysis included invalid assumptions and said that the team did not apply the analysis tools properly, resulting in unmitigated accident scenarios [10].

Support Software, Including Models and Simulations, May Be As Critical To Safety As Control Software

While the focus of software safety efforts is usually on software directly controlling an application, support software may contribute to an accident or to the effectiveness of the emergency response. Examples of support software include databases used for maintenance activities, software to estimate load stability, computer-based models that provide design calculations or assurance information, and so on. A failure to analyze the hazards associated with this support software, including models and simulations, can result in unforeseen system failures.

On November 12, 2008, a 2-million-gallon liquid fertilizer tank in Chesapeake, Virginia, collapsed. Two workers performing welding operations at the site were seriously injured, and an adjacent neighborhood was partially flooded as a result of the accident. The CSB found that the company had not assured that welds to replace vertical joints met accepted industry standards, and the CSB faulted the company for its failure to perform inspections of the welds. The company was also faulted for not having proper procedures in place for filling the tanks following major facility modifications. In its report, the CSB also noted that the contractor hired by the company to calculate the maximum fill height had used some faulty assumptions. The maximum liquid level was supposed to be calculated in part based on the minimum measured shell thicknesses and the extent of the weld inspection (full, spot, or no radiography). The contractor used the maximum (not minimum) measured thickness and assumed full inspection of the welds. These assumptions led to an overestimation of the allowable liquid level. The tank failed at a fill level of 26.74 feet, below the calculated maximum of 27.01 feet. The CSB also noted a number of previous overfilling accidents. The CSB found 16 other tank failures at nine facilities in other states between 1995 and 2008. These 16 failures resulted in one death, four hospitalizations, one community evacuation, and two releases to waterways. Eleven occurred due to defective welding [11].
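The fill-height error is easy to reproduce numerically. The sketch below uses a simplified thin-wall hoop-stress relation and invented numbers for illustration (the contractor's actual method followed industry tank standards, and these are not the CSB's figures); because the allowable height scales with the shell thickness and the weld joint efficiency used, substituting the maximum measured thickness and full inspection directly inflates the answer:

    G = 9.81  # gravitational acceleration, m/s^2

    def max_fill_height_m(allow_stress_pa: float, thickness_m: float,
                          radius_m: float, density_kg_m3: float,
                          joint_efficiency: float) -> float:
        """Thin-wall hoop stress sigma = rho*g*h*r/(t*E), solved for h."""
        return (allow_stress_pa * joint_efficiency * thickness_m /
                (density_kg_m3 * G * radius_m))

    # Hypothetical values for a large fertilizer tank:
    stress = 120e6   # Pa, allowable shell stress
    radius = 7.0     # m
    rho = 1300.0     # kg/m^3; liquid fertilizer is denser than water

    h_wrong = max_fill_height_m(stress, 0.0080, radius, rho, 1.00)  # max t, full RT
    h_right = max_fill_height_m(stress, 0.0065, radius, rho, 0.85)  # min t, spot RT
    print(f"optimistic: {h_wrong:.2f} m, conservative: {h_right:.2f} m")
    # The optimistic inputs yield roughly 10.8 m versus 7.4 m: the same kind
    # of inflation that let the tank be filled beyond what the shell could hold.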

RECOMMENDATIONS FOR THE SOFTWARE SYSTEM SAFETY PROCESS

There is a great deal of information in the literature on how to improve software development processes. And standards such as IEC 61511, Functional Safety - Safety Instrumented Systems for the Process Industry Sector, and DO-178B, Software Considerations in Airborne Systems and Equipment Certification, provide invaluable information to help improve the safety of complex systems. However, the above themes and lessons learned indicate that improving software and computing system safety is more than just providing additional training, using standards, or implementing improved software development techniques or tools. Improving software safety requires a change in the way process safety practitioners think. Software safety must be fully integrated into the process safety analysis process. And process safety practitioners must begin to ask critical questions about their software safety efforts. Some examples of these questions include the following:

• Do plans reflect how business is really done? Are plans reviewed? Do plans have unrealistic schedules or resource allocations? Are software and computing systems considered in the planning and acquisition processes? Poor or unrealistic plans may reflect an organization that does not truly place a priority on safety activities.

• Is there a convincing story that the safety analysis is complete? Is there a sufficient description of the system to support an analysis, including the software and computing system? Are computing systems treated as "black boxes"? Have software causes been thoroughly evaluated using a systematic approach, or are the software causes listed as "software error" with no further explanation? Do the hazard analyses include support software, such as design models or databases? Failure to show that the problem is being looked at systematically could indicate that there are holes in the analysis, with potentially significant problems overlooked.

• Are the hazard reports detailed enough? Are causes descriptive? Are the recommendations clear and sufficiently detailed? Does the logic make sense, and is it complete? Do controls match up with the causes, showing a one-to-one or many-to-one relation? Lack of detail could be an indication of insufficient knowledge of the system or lack of information on the system.

• Do the risk assessments have a logical basis? Have the risk assessments considered software complexity, maturity, testing, use of unproven technologies, etc.? Are assessments overly optimistic based on what is known about the computing system? A failure to provide a basis for your risk assessments may result in unrealistic assessments.

• Are the hazard controls primarily procedural, rather than design changes or safety features or devices? Is there an overreliance on alarms and signage? Do software controls rely only on good software processes? Is there an overreliance on humans and software to "save the day"? Overreliance on operational controls may indicate a weak safety design.

• Are hazard control recommendations being implemented? Can the selected hazard control strategy actually be implemented and verified? Is the control strategy so complex that it will be impossible to determine whether it will work when needed? Complex controls, overlapping control strategies, or inadequate implementation may be an indication of a weak safety design.

• Has the risk assessment truly considered the worst case? What is the basis for the likelihood levels? Is the risk analyzed only for steady-state operation, or have startups and shutdowns also been considered? Failure to provide good answers to these questions indicates a potential misunderstanding of the risk.

• Are problems found in test and design included in the hazard reports and factored into the design? Have problems and incidents been fully investigated? Does a process exist for tracking action items, including whether recommendations have been implemented? Are changes to software and computing systems properly factored into the hazard analysis? Failure to incorporate problems, changes, and corrective actions is an indication of the potential to miss serious design flaws.

These questions help to identify whether the hazard analysis process is robust. However, we must also ask questions specifically related to the use of software and computing systems in complex systems. The best questions come from real-world examples of accidents where software has been a contributor. Some examples of questions are as follows, and others can be found in the literature [12]:

• Have safety-critical software commands and data been identified?

• Do hazard controls for software-related causes combine good practices and specific safeguards?

• Is software and system testing adequate, and do tests include sufficient off-nominal conditions?

• Is the computing system design overly complex?

• Is the design based on unproven technologies?

• What happens if the software locks up? (A watchdog sketch follows this list.)

• Are the sensors used for software decisions fault tolerant?

• Has software mode transition been considered?

• Has consideration been given to the order of commands and potential out-of-sequence inputs?

• Will the software and system start up and shut down in a known safe state?

• Are checks performed before initiating hazardous operations?

• Will the software properly handle spurious signals and power outages?
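For the lock-up question, a common defense is a watchdog that must be refreshed by the control loop and that drives outputs to a safe state when it is not. The sketch below models the idea in software with hypothetical names; in practice the watchdog is usually independent hardware:

    import threading
    import time

    class Watchdog:
        """Software model of a watchdog timer; real systems use hardware."""
        def __init__(self, timeout_s: float, on_timeout):
            self.timeout = timeout_s
            self.on_timeout = on_timeout
            self.last_kick = time.monotonic()
            threading.Thread(target=self._monitor, daemon=True).start()

        def kick(self):
            self.last_kick = time.monotonic()  # control loop proves it is alive

        def _monitor(self):
            while True:
                time.sleep(self.timeout / 4)
                if time.monotonic() - self.last_kick > self.timeout:
                    self.on_timeout()  # drive outputs to a known safe state
                    return

    def trip_to_safe_state():
        print("watchdog expired: de-energizing outputs to safe state")

    wd = Watchdog(timeout_s=0.5, on_timeout=trip_to_safe_state)
    for _ in range(3):
        wd.kick()
        time.sleep(0.1)   # healthy loop keeps kicking
    time.sleep(1.0)       # loop "locks up"; the watchdog trips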

These are by no means all the questions a decision maker should ask, and positive answers to these questions provide no assurance that an accident will be prevented. These questions should encourage critical thinking about safety and generate additional questions to provide further insight on system risk. A failure to ask these questions could mean that the potential for an accident is higher than we had assumed.

It is up to all stakeholders to look for those conditions that could lead to an accident and to recognize that the worst can happen. This means we should all express concerns about software safety management and engineering when necessary, based on our knowledge, experience, and judgment, including lessons learned from accidents. We must ask questions to understand the potential for harm, to understand the steps taken to assure that the risks have been reduced, and to assure that there is proof that hazard controls are effective. We do not have to be software experts to ask questions, and in fact a lack of expertise could be an advantage when trying to understand how a system works. Most importantly, when thinking of software and computing system safety, we must think of the system and its interactions between software, hardware, humans, processes, and environments.

CONCLUSION

PSM can provide immense benefits in reducing operational risks in the chemical process and energy production industries. By proactively identifying hazards, assessing and characterizing risks, and taking actions to reduce those risks, organizations can prevent accidents and reduce the potential for death, injury, property damage, and environmental impacts. Given the importance of software and computing systems in the design and operation of many systems, software must be included as part of the broader process safety effort. However, the accidents provided in this article, along with many others, illustrate that software is often not systematically considered as part of process safety activities. For example, software may not be identified as a hazard cause, risks may be underestimated, software-related hazard controls may be oversimplified, testing may not include sufficient safety scenarios, and system changes may not be factored into the design or hazard analysis. Process safety practitioners must take advantage of available software and computing system lessons learned, going beyond the examples presented here, to improve their own safety efforts and, most importantly, to prevent accidents.

LITERATURE CITED

1. T.L. Hardy, Software and System Safety: Accidents, Incidents, and Lessons Learned, AuthorHouse, Bloomington, IN, USA, 2012.

2. B. Pool, System Wrongfully Blamed in Union Carbide Leak: SAFER Fights Off Dark Cloud of Bad Publicity, Los Angeles Times, August 20, 1985.

3. N. Schlager, Breakdown: Deadly Technological Disasters, Visible Ink Press, Canton, MI, USA, 1995.

4. Transportation Safety Board of Canada, Programmable Logic Controller Failure, Foothills Pipe Lines Ltd. Decompression/Recompression Facility, BP Canada Energy Company, Empress Natural Gas Liquids Facility, Near Empress, Alberta, 18 October 2005, Report Number P05H0061, July 12, 2006.

5. U.S. National Transportation Safety Board, Rupture of Piney Point Oil Pipeline and Release of Fuel Oil Near Chalk Point, Maryland, April 7, 2000, Pipeline Accident Report NTSB/PAR-02/01, July 23, 2002.

6. U.S. Mine Safety and Health Administration, Report of Investigation: Fatal Other Accident (Steam Burns), February 11, 2003, Southern Clay Plants & Pits, Southern Clay Prod. Inc., Gonzales, Gonzales County, Texas, Mine ID No. 41-00298, 2003.

7. State of New South Wales (Australia), Department of Industry and Investment, Fatality Involving David Hurst Oldknow, Ravensworth Underground Mine, Coal Preparation Plant, Reject Bin 802, 18 February 2009, May 2010.

8. U.S. Mine Safety and Health Administration, Report of Accident: Exploding Vessels Under Pressure Accident, October 24, 2002, Foreman Quarry and Plant, Ash Grove Cement Company, Foreman, Little River County, Arkansas, Mine ID No. 03-00256, 2003.

9. U.S. National Transportation Safety Board, Pipeline Rupture and Subsequent Fire in Bellingham, Washington, June 10, 1999, Pipeline Accident Report NTSB/PAR-02/02, October 8, 2002.

10. U.S. Chemical Safety and Hazard Investigation Board, Pesticide Chemical Runaway Reaction Pressure Vessel Explosion, Bayer CropScience, Institute, West Virginia, August 28, 2008, Report No. 2008-08-I-WV, January 2011.

11. U.S. Chemical Safety and Hazard Investigation Board, Investigation Report: Allied Terminals, Inc. Catastrophic Tank Collapse, Allied Terminals, Inc., Chesapeake, Virginia, November 12, 2008, Report No. 2009-03-I-VA, May 2009.

12. T.L. Hardy, Essential Questions in System Safety: A Guide for Safety Decision Makers, AuthorHouse, Bloomington, IN, USA, 2011.

DOI 101002prs Process Safety Progress (Vol33 No2)130 June 2014 Published on behalf of the AIChE

Page 3: Case studies in process safety: Lessons learned from software-related accidents

shutdown of the facility The operators believed that theemergency shutdown meant that all components were in asafe state However because the recompressor motor did notshut down a surge event occurred in the pipe This surgecaused a pipe to break leading to a natural gas leak Theleaking natural gas ignited resulting in an explosion TheTSB noted that the loss of a PLC had not been identified as asafety concern in the hazard analyses prior to the accident[4]

Risks May Be Underestimated or OptimisticallyEvaluated

The risk assessment process helps decision makers intheir risk decisions (eg accept the risk without changereduce the risk to an acceptable level through hazard elimi-nation and mitigation transfer the risk or forego the activ-ity) assists in justifying the acceptance of residual risk andallows communication of risks throughout the organizationA risk assessment can either be qualitative or quantitativealthough the emphasis in process safety is often on qualita-tive risk assessment Evaluating software and computing sys-tem risk can be difficult especially if an extensive historydoes not exist with the automated equipment Risk must becarefully evaluated based on known technical factors suchas design complexity and maturity degree of system testinguse of unproven technologies potential for unexpectedhuman intervention and so on Accidents have illustratedunfounded optimism in the use of software to assure safety

On April 7 2000 a pipeline failure occurred at the ChalkPoint Generating Station in Maryland The pipeline releasedapproximately 140400 gallons of fuel oil into surroundingwetlands Swanson Creek and the Patuxent River The costof the environmental response and cleanup exceeded $71million The US National Transportation Safety Board(NTSB) determined that the probable cause of the accidentwas a fracture in a buckle in the pipe The buckle wentundiscovered because data from an in-line inspection toolwere interpreted incorrectly The NTSB stated that a contrib-utor to the severity of the accident was inadequate operatingprocedures and practices for monitoring the magnitude offlow through the pipeline When the leak occurred the 12-inch-diameter pipeline was being cleaned An automatedpipeline monitoring system existed but this automated sys-tem could not adequately monitor pipeline conditions duringthis operation because the locations of the meters pressure-sensing points and temperature-sensing points had not beenconfigured to be in the direct liquid flow path Thereforethe automated system did not provide alarms or other infor-mation to operations personnel Field personnel manuallyrecorded tank level measurements at the upstream anddownstream points but did not use this information to evalu-ate whether any fuel oil had been lost The manual monitor-ing procedures and lack of alarms from the automatedsystem led to a 7-hour delay in responding to the leak Thefield personnel finally realized there was a problem whenthe fuel pump began to cavitate Using hand calculations thecrew then realized that they had not received over 3000 bar-rels of fuel oil and they shut the pipeline down Followingthe accident the company installed a SCADA system withsoftware-based leak detection and radar tank gauges In thiscase the operators underestimated the risks associated with aleak and overestimated their ability to detect a failure [5]

Hazard Controls May Rely Only on Good SoftwareProcesses and Testing

Hazard controls are devices and approaches to mitigaterisk presented by a hazard by reducing either the severity ofthe hazard or the probability of its occurrence Strong hazard

control design follows the design order of precedencewhere the first approach is to try to design out the hazard orminimize the risks through design selection Many hazardreports will focus on the implementation of good softwareprocesses or extensive unit testing to prove that the designis safe However the software processes and testing will nothelp if the software design is flawed with respect to safe sys-tem operations (including hardware software processeshumans and environments) In these cases the software mayoperate exactly as designed but the operation may beunsafe

On February 11 2003 an employee of a manufacturer inGonzales Texas was fatally injured while performing mainte-nance on a reaction tank The US Mine Safety and HealthAdministration (MSHA) determined that the cause of the acci-dent was a failure to close and secure a manual gate valvefor a steam line and a failure to place the batch PLC in thestop mode The company was a surface clay mill that pur-chased clay and blended refined milled and processed thematerial into products used in paints inks and grease Onthe day of the accident the employee had been informedthat there had been a product change in one of the batchprocessing systems The employee was assigned to performcleanup duties on a reactor tank Two valves controlledsteam entry into the tank a manual gate valve and a butter-fly valve with an automatic pneumatic actuator The PLCcontrolled the functioning of the batch system based on sen-sors that monitored material flow At the time of the accidentthe PLC was in ldquoslurry holdrdquo mode In this mode the systemwas programmed to actuate the steam valve when the clayslurry level reached 55 feet An aluminum extension ladderused by the employee caused the level sensor to falselysense that slurry was in the reactor which resulted in thePLC sending a command to open the steam valve Becausethe manual valve had been left open steam at 350o F thenentered the tank fatally burning the employee [6]

Organizations May Fail to Consider the Possibility ofHazard Controls Not Working

Hazard controls are never 100 percent reliable Thereforeit is important for the analyst to consider what could happenif the controls fail Consideration should be given to unex-pected hardware software and human interactions and aldquodefense in depthrdquo approach should be used where multipleindependent methods are used to prevent a hazard frombecoming an accident Controls should be carefully eval-uated to identify what could happen if they do not work asexpected especially if those controls rely heavily on softwareand personnel actions

On February 18 2009 an employee was fatally injured ata coal preparation plant in the Hunter Valley region of NewSouth Wales Australia The accident occurred when 10 tonsof waste rock were inadvertently released from the reject binand fell onto the cabin of the employeersquos truck At the plantraw coal was extracted from the mine and usable coal wasseparated from waste rock The waste rock was transferredapproximately 2 kilometers on conveyers to the reject binThe waste rock was then loaded from the reject bin ontotrucks and hauled away The process of loading the truckswith waste rock was controlled by a PLC system The PLCsystem included truck detection sensors traffic lights bincapacity sensing and remote control hand-held transmittersused by the truck drivers On the day of the accident thetruck driver drove his truck under the reject bin deliverychute A signal was sent from the handheld remote controlto command the chute to open The accident report statedthat it was not clear whether the signal was sent inadver-tently or intentionally Opening the chute required that two

DOI 101002prs Process Safety Progress (Vol33 No2)126 June 2014 Published on behalf of the AIChE

of three lines of truck detection sensors be blocked in addi-tion to a command from the remote control to assure thatthe truck was in the correct location Each sensor line con-tained three sensors and all three sensors had to be blockedfor the entire line to be considered as blocked At the timeof the accident the truck was obscuring one line of sensorsand a second line of sensors was obscured by dirt on thelenses and therefore was not working correctly Because twoof the sensor lines were blocked and the remote control sig-nal had been sent the PLC automatically opened the rejectbin chute door and dropped 10 tons of material on the truckcab before the driver had safely cleared the chute resultingin the fatal injury [7]

Testing May Not Provide Information on Subsystemand Component Interactions or Hazard ControlOperations

The safety verification and validation process is intendedto determine that the design solution has met all the safetyrequirements (verification) and that the correct system isbuilt (validation) The verification and validation process ifperformed correctly will provide evidence that risk has beenreduced Testing is an important part of the verification andvalidation process and comprehensive software testingshould be conducted throughout the development cycleTesting should include not only nominal conditions basedon requirements but also abnormal conditions such asimproper inputs inputs outside expected conditions inputsexactly at and near the boundary conditions and inputsstuck at some value Safety-critical software must include fullsystem integration testing of end-to-end events That testingmust include stressing of the software and should includeinteractions of the software with hardware humans andenvironments A number of accidents have occurred whenno component failed in the conventional sense but the inter-action of components caused a system failure Testing mustinclude verifications of hazard controls and it should assurethat redundancy works when needed

On October 24 2002 a grinder exploded at a limestoneprocessing plant in Foreman Arkansas An operator waskilled when flammable waste fuel covered him and ignitedThe operator had started the pump for solid waste fuel proc-essing when the accident occurred The MSHA stated in itsreport that the cause of the accident was that the safety mon-itoring system designed to shut off the waste fuel systempump had not been maintained so that it functioned prop-erly Kilns were used in processing activities at the plantand these kilns were heated by burning coal natural gasand liquid waste fuel The liquid waste fuel was delivered bytruck or railcar and pumped into large storage tanks Fromthe storage area it was pumped through a grinder to reducethe particle size of the solids in the fuel Two independentsystems monitored and controlled the waste fuel delivery AFoxboro Intelligent Automation Distribution Control Systemmonitored and recorded normal operating parameters TheFoxboro also issued audible and visual alarms that wereavailable at the plant control room A PLC provided basicstartup and shutdown of the system and responded to com-mands from the Foxboro On the day of the accident theFoxboro sensed that the fuel delivery pressure was lowapparently due to blockage in the line As designed the Fox-boro sent a command to the PLC to shut down the pumpsHowever the PLC failed to respond and the pumps keptrunning Three months prior to the accident this PLC hadbeen installed this was supposed to be a simple replacementof an older PLC of similar capability However the Foxborohad not been connected to the newer PLC and the connec-tions remained to the older non-functioning PLC The

complete system had never been tested with the new PLC Asystem test had been scheduled 3 days prior to the accidentbut had been aborted when a pump failed during the testthe test had never been rescheduled The accident reportstated that the blockage may have broken free just prior tothe accident With the pumps running the pressure elevatedsignificantly and a ldquowater hammerrdquo effect caused overpressu-rization in the system at the grinder The grinder was tornloose from its base spraying fuel and pulling loose a 480-volt cable that ultimately served as an ignition source [8]

Software Change Management and Hazard AnalysesProcesses May Not Be Integrated

Engineering by its very nature is an activity that requireschange Changes can come about for a variety of reasonsincluding the discovery of problems during developmentchanges in requirements or routine upgrades of software orhardware As the development cycle proceeds and as anorganization gains more operational experience new haz-ards are often uncovered and some hazards may no longerbe relevant If the hazard analysis is not updated to reflectthese changes then resources may be expended on previ-ously identified hazards that may no longer be relevant andnew hazards may not be discovered as the design maturesIn addition routine changes and modifications to softwareand computing systems must be analyzed for potential haz-ards as part of the broader hazard analysis process

On June 10 1999 a 16-inch-diameter pipe carrying gaso-line ruptured in Bellingham Washington The ruptured pipe-line released approximately 237000 gallons of gasoline intoa nearby creek according to the NTSB That gasoline thenignited burning approximately 11=2 miles along the creekThree people died in the fire One home and the Bellinghamwater treatment plant were also damaged in the accident

The NTSB investigated the accident and determined thatthe probable cause of the rupture was damage to the pipeduring a modification project performed in 1994 This dam-age weakened the pipeline making it susceptible to ruptureunder increased pressure in the pipe The NTSB also statedthat inspections of the pipeline during the project were inad-equate and the company did not identify and repair damageThe report noted that in-line pipeline inspection data shouldhave prompted the company to excavate and examine thatsection of pipeline but the company failed to perform suchwork after reported anomalies The SCADA system com-puters also played a role in the accident The SCADA systemwas used for operation of the pipeline for example to openand close valves remotely as required or to operate pumpsas needed Just prior to the accident the operator was pre-paring to initiate delivery of gasoline to a terminal in Seattlediverting delivery from another facility During the processof switching delivery destinations the pressure in the pipe-line began to increase which is a normal condition but onethat required the operator to start a pump to reduce pres-sure However when the operator tried to start that pumpthe SCADA system failed to execute the start commandissued by the operator The operator soon found that theSCADA system was unresponsive to any commands some-thing that had never happened before The report statedthat ldquoHad the controller been able to start the pump atWoodinville it is probable that the pressure backup wouldhave been alleviated and the pipeline operated routinely forthe balance of the fuel deliveryrdquo Instead the pressure in thepipe increased and the increased pressure likely caused thedamaged pipe to rupture

The cause of computer system failure was likely a changemade to the system database just prior to the accident TheNTSB accident report stated that the SCADA system

Process Safety Progress (Vol33 No2) Published on behalf of the AIChE DOI 101002prs June 2014 127

administrator entered new records into the live database atthe time of the accident The system administrator howeverdid not check the records or test the system software to seeif those changes introduced any problems The computingsystem problem could not be replicated after the accidentand therefore the cause of the anomaly could not be defini-tively identified The report stated ldquoThe Safety Board con-cludes that had the SCADA database revisions that wereperformed shortly before the accident been performed andthoroughly tested on an off-line system instead of the pri-mary on-line SCADA system errors resulting from those revi-sions may have been identified and repaired before theycould affect the operation of the pipeline [9]rdquo

Human-Software Interactions Have Significant SafetyImplications That Are Often Underestimated

Humans interact with hardware and software in a varietyof ways A number of accidents have occurred where theseinteractions have been a significant contributing cause Ofspecial concern is when a new computing system is imple-mented that changes what is expected of the operator innormal and emergency situations

On August 28 2008 an explosion occurred at a pesticidemanufacturing plant in Institute West Virginia Two workerswere killed in the explosion and eight others were injuredThe US Chemical Safety and Hazard Investigation Board(CSB) found that the explosion was the result of a runawaychemical reaction The company was starting up a methomylunit for the first time after several months of down time formaintenance to install a new computer control system andreactor vessel Normal operations called for dissolvedmethomyl and waste chemicals to be fed into a preheatedresidue treater vessel partially filled with solvent The treaterallowed the methomyl to decompose safely after which itwas mixed with other waste chemicals and used to fuel facil-ity boilers On the day of the accident a methomyl-solventmixture was added prematurely to the residue treater vesselbefore solvent had been added to the tank This eventoccurred when operators were troubleshooting equipmentproblems during the startup This mixture was supposed tobe added to the vessel only after that vessel was filled withclean solvent and heated to a minimum safe temperature Aninterlock system existed as part of the automatic feed controlsystem to prevent inadvertent introduction of methomyl Thisinterlock was password-protected to prevent inadvertentoverride but operators intentionally overrode the interlockBypassing the interlocks had apparently been a practice con-doned by management in the past as a workaround foroperational problems Once the methomyl decompositionreaction started it could not be stopped and the pressurerapidly rose in the vessel due to gas from the reaction lead-ing to the explosion Uncrystallized methomyl existed in thetank due to equipment problems and this material greatlyincreased the methomyl concentration in the residue treatercontributing to the runaway reaction

The CSB stated that the company initiated this processstartup before the company had completed critical checksincluding valve lineups equipment checkouts and computercalibrations The CSB said that the operating procedure forthe startup was complex but this procedure had not beenreviewed or approved In addition the company had notperformed training on the new computer system put in placeas part of the maintenance This new computer systemoffered significant improvements to help automatically con-trol the operation For example the control system includedgraphical display screens that simulated the process flowBut the system was also complex and the modificationschanged the way operators interacted with the system The

operators thought the screens were difficult to navigate andresponding to troubleshooting alarms was difficult The acci-dent report stated that had the operators received adequatetraining on the new computer system they may have beenable to recognize problems in operation before theexplosion

The report faulted the company for not performing anadequate pre-startup safety review System operators did notparticipate in the safety review and review checklistsshowed items as completed when they were not The CSBstated that the company had also failed to perform a thor-ough Process Hazard Analysis The report stated that the Pro-cess Hazard Analysis was performed quickly becausemanagement had not allotted sufficient time for analysis Inaddition the CSB stated that the Process Hazard Analysisincluded invalid assumptions and said that the team did notapply the analysis tools properly resulting in unmitigatedaccident scenarios [10]

Support Software Including Models and SimulationsMay Be As Critical To Safety As Control Software

While the focus of software safety efforts is usually onsoftware directly controlling an application support softwaremay contribute to an accident or to the effectiveness of theemergency response Examples of support software includedatabases used for maintenance activities software to esti-mate load stability computer-based models that providedesign calculations or assurance information and so on Afailure to analyze the hazards associated with this supportsoftware including models and simulations can result inunforeseen system failures

On November 12 2008 a 2 million gallon liquid fertilizertank at in Chesapeake Virginia collapsed Two workers per-forming welding operations at the site were seriously injuredand an adjacent neighborhood was partially flooded as aresult of the accident The CSB found that the company hadnot assured that welds to replace vertical joints met acceptedindustry standards and the CSB faulted the company for itsfailure to perform inspections of the welds The companywas also faulted for not having proper procedures in placefor filling the tanks following major facility modifications Inits report the CSB also noted that the contractor hired by thecompany to calculate the maximum fill height had usedsome faulty assumptions The maximum liquid level wassupposed to be calculated in part based on the minimummeasured shell thicknesses and the extent of the weldinspection (full spot or no radiography) The contractorused the maximum (not minimum) measured thickness andassumed full inspection of the welds These assumptions ledto an overestimation of the allowable liquid level The tankfailed at a fill level of 2674 feet below the calculated maxi-mum of 2701 feet The CSB also noted a number of previ-ous overfilling accidents The CSB found 16 other tankfailures at nine facilities in other states between 1995 and2008 These 16 failures resulted in one death four hospital-izations one community evacuation and two releases towaterways Eleven occurred due to defective welding [11]

RECOMMENDATIONS FOR THE SOFTWARE SYSTEM SAFETY PROCESS

There is a great deal of information in the literature onhow to improve software development processes And stand-ards such as IEC 61511 Functional Safety - Safety Instru-mented Systems for the Process Industry Sector and DO-178B Software Considerations in Airborne Systems andEquipment Certification provide invaluable information tohelp improve the safety of complex systems However theabove themes and lessons learned indicate that improvingsoftware and computing system safety is more than just

DOI 101002prs Process Safety Progress (Vol33 No2)128 June 2014 Published on behalf of the AIChE

providing additional training using standards or implement-ing improved software development techniques or toolsImproving software safety requires a change in the way pro-cess safety practitioners think Software safety must be fullyintegrated into the process safety analysis process And pro-cess safety practitioners must begin to ask critical questionsabout their software safety efforts Some examples of thesequestions include the following

bull Do plans reflect how business is really done Are plansreviewed Do plans have unrealistic schedules orresource allocations Are software and computing sys-tems considering in the planning and acquisition proc-esses Poor or unrealistic plans may reflect anorganization that does not truly place a priority onsafety activities

bull Is there a convincing story that the safety analysis iscomplete Is there a sufficient description of the systemto support an analysis including the software and com-puting system Are computing systems treated as ldquoblackboxesrdquo Have software causes been thoroughly eval-uated using a systematic approach or are the softwarecauses listed as ldquosoftware errorrdquo with no further expla-nation Do the hazard analyses include support soft-ware such as design models or databases Failure toshow that the problem is being looked at systematicallycould indicate that there are holes in the analysis withpotentially significant problems overlooked

• Are the hazard reports detailed enough? Are causes descriptive? Are the recommendations clear and sufficiently detailed? Does the logic make sense, and is it complete? Do controls match up with the causes, showing a one-to-one or many-to-one relation? (A minimal automated check of this kind is sketched after this list.) Lack of detail could be an indication of insufficient knowledge of the system or lack of information on the system.

• Do the risk assessments have a logical basis? Have the risk assessments considered software complexity, maturity, testing, use of unproven technologies, etc.? Are assessments overly optimistic based on what is known about the computing system? A failure to provide a basis for your risk assessments may result in unrealistic assessments.

• Are the hazard controls primarily procedural, rather than design changes or safety features or devices? Is there an overreliance on alarms and signage? Do software controls only rely on good software processes? Is there an overreliance on humans and software to “save the day”? Overreliance on operational controls may indicate a weak safety design.

• Are hazard control recommendations being implemented? Can the selected hazard control strategy actually be implemented and verified? Is the control strategy so complex that it will be impossible to determine whether it will work when needed? Complex controls, overlapping control strategies, or inadequate implementation may be an indication of a weak safety design.

• Has the risk assessment truly considered the worst case? What is the basis for the likelihood levels? Is the risk analyzed only for steady-state operation, or have startups and shutdowns also been considered? Failure to provide good answers to these questions indicates a potential misunderstanding of the risk.

• Are problems found in test and design included in the hazard reports and factored into the design? Have problems and incidents been fully investigated? Does a process exist for tracking action items, including whether recommendations have been implemented? Are changes to software and computing systems properly factored into the hazard analysis? Failure to incorporate problems, changes, and corrective actions is an indication of the potential to miss serious design flaws.
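As a concrete illustration of the cause-to-control matching and action-item tracking that these questions probe, here is a minimal sketch of an automated audit over a hazard log. The record layout, field names, and rules are hypothetical assumptions for illustration, not a published PSM tool; a real hazard log would differ.

```python
# Hypothetical sketch: automated completeness check over a hazard log.
from dataclasses import dataclass, field

@dataclass
class HazardRecord:
    hazard: str
    causes: list = field(default_factory=list)         # descriptive cause statements
    controls: dict = field(default_factory=dict)       # cause -> list of controls
    open_actions: list = field(default_factory=list)   # recommendations not yet implemented

def audit(rec):
    """Flag the gaps the questions above describe: bare 'software error'
    causes, causes with no matching control, and open action items."""
    findings = []
    for cause in rec.causes:
        if cause.strip().lower() == "software error":
            findings.append(f"{rec.hazard}: cause 'software error' lacks explanation")
        if not rec.controls.get(cause):
            findings.append(f"{rec.hazard}: cause '{cause}' has no matching control")
    for action in rec.open_actions:
        findings.append(f"{rec.hazard}: open action item: {action}")
    return findings

# Hypothetical example record
rec = HazardRecord(
    hazard="Runaway reaction in residue treater",
    causes=["software error", "interlock bypassed during startup"],
    controls={"interlock bypassed during startup": ["password-protected bypass", "bypass logging"]},
    open_actions=["re-verify interlock after control-system change"],
)
print(*audit(rec), sep="\n")
```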

These questions help to identify whether the hazard analysis process is robust. However, we must also ask questions specifically related to the use of software and computing systems in complex systems. The best questions come from real-world examples of accidents where software has been a contributor. Some examples of questions are as follows, and others can be found in the literature [12]; a short sketch after the list illustrates several of the defensive patterns these questions probe:

• Have safety-critical software commands and data been identified?

• Do hazard controls for software-related causes combine good practices and specific safeguards?

• Is software and system testing adequate, and do tests include sufficient off-nominal conditions?

• Is the computing system design overly complex?

• Is the design based on unproven technologies?

• What happens if the software locks up?

• Are the sensors used for software decisions fault tolerant?

• Has software mode transition been considered?

• Has consideration been given to the order of commands and potential out-of-sequence inputs?

• Will the software and system start up and shut down in a known safe state?

• Are checks performed before initiating hazardous operations?

• Will the software properly handle spurious signals and power outages?
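Several of these questions translate directly into defensive design patterns. The fragment below is a minimal, hypothetical sketch, not code from any of the systems discussed, showing three of them: powering up into a known safe state, running pre-start permissive checks, and voting redundant sensors with a stale-data watchdog so that a lock-up or spurious signal drives the system back to the safe state.

```python
# Hypothetical sketch; names, thresholds, and I/O callables are assumptions.
import time

SAFE_STATE = {"pump": "off"}

def voted_high(readings, trip=150.0):
    """2-out-of-3 vote: one stuck or dirty sensor can neither trip
    nor mask a trip on its own (cf. fault-tolerant sensor inputs)."""
    return sum(r > trip for r in readings) >= 2

def prestart_ok(readings):
    """Permissive checks that must hold before the hazardous operation starts."""
    return readings is not None and not voted_high(readings)

def run_cycle(read_pressures, command_pump, watchdog_s=2.0):
    """One operating cycle that starts and ends in a known safe state."""
    state = dict(SAFE_STATE)            # power up into the safe state
    command_pump("off")
    if not prestart_ok(read_pressures()):   # checks before initiating operation
        return state                    # refuse to start; remain safe
    command_pump("on")
    state["pump"] = "on"
    deadline = time.monotonic() + watchdog_s
    while True:
        readings = read_pressures()     # None models a lost or spurious signal
        if readings is not None:
            deadline = time.monotonic() + watchdog_s   # fresh data feeds the watchdog
            if voted_high(readings):
                break                   # confirmed high pressure: trip
        elif time.monotonic() > deadline:
            break                       # stale data: assume lock-up, trip
        time.sleep(0.05)
    command_pump("off")                 # always shut down to the safe state
    state["pump"] = "off"
    return state

# Example: sensors read normal once after pre-start, then two of three read high.
script = iter([[100, 100, 100], [100, 100, 100], [160, 158, 100]])
final = run_cycle(lambda: next(script, None), lambda cmd: print("pump", cmd))
print(final)   # {'pump': 'off'}
```

In a real facility this logic would live in a safety PLC or safety instrumented system, but the structure is the point: the safe state is the default, and every abnormal path leads back to it.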

These are by no means all the questions a decision maker should ask, and positive answers to these questions provide no assurance that an accident will be prevented. These questions should encourage critical thinking about safety and generate additional questions to provide further insight on system risk. A failure to ask these questions could mean that the potential for an accident is higher than we had assumed.

It is up to all stakeholders to look for those conditions that could lead to an accident and to recognize that the worst can happen. This means we should all express concerns about software safety management and engineering when necessary, based on our knowledge, experience, and judgment, including lessons learned from accidents. We must ask questions to understand the potential for harm, to understand the steps taken to assure that the risks have been reduced, and to assure that there is proof that hazard controls are effective. We do not have to be software experts to ask questions, and in fact a lack of expertise could be an advantage when trying to understand how a system works. Most importantly, when thinking of software and computing system safety, we must think of the system and its interactions between software, hardware, humans, processes, and environments.

CONCLUSION

PSM can provide immense benefits in reducing operational risks in the chemical process and energy production industries. By proactively identifying hazards, assessing and characterizing risks, and taking actions to reduce those risks, organizations can prevent accidents and reduce the potential for death, injury, property damage, and environmental impacts. Given the importance of software and computing systems in the design and operation of many systems, software must be included as part of the broader process safety effort. However, the accidents described in this article, along with many others, illustrate that software is often not systematically considered as part of process safety activities. For example, software may not be identified as a hazard cause, risks may be underestimated, software-related hazard controls may be oversimplified, testing may not include sufficient safety scenarios, and system changes may not be factored into the design or hazard analysis. Process safety practitioners must take advantage of available software and computing system lessons learned, going beyond the examples presented here, to improve their own safety efforts and, most importantly, to prevent accidents.

LITERATURE CITED

1. T.L. Hardy, Software and System Safety: Accidents, Incidents, and Lessons Learned, AuthorHouse, Bloomington, IN, USA, 2012.

2. B. Pool, "System Wrongfully Blamed in Union Carbide Leak; SAFER Fights Off Dark Cloud of Bad Publicity," Los Angeles Times, August 20, 1985.

3. N. Schlager, Breakdown: Deadly Technological Disasters, Visible Ink Press, Canton, MI, USA, 1995.

4. Transportation Safety Board of Canada, Programmable Logic Controller Failure, Foothills Pipe Lines Ltd. Decompression/Recompression Facility, BP Canada Energy Company Empress Natural Gas Liquids Facility, Near Empress, Alberta, 18 October 2005, Report Number P05H0061, July 12, 2006.

5. U.S. National Transportation Safety Board, Rupture of Piney Point Oil Pipeline and Release of Fuel Oil Near Chalk Point, Maryland, April 7, 2000, Pipeline Accident Report NTSB/PAR-02/01, July 23, 2002.

6. U.S. Mine Safety and Health Administration, Report of Investigation: Fatal Other Accident (Steam Burns), February 11, 2003, Southern Clay Plants & Pits, Southern Clay Prod., Inc., Gonzales, Gonzales County, Texas, Mine ID No. 41-00298, 2003.

7. State of New South Wales (Australia), Department of Industry and Investment, Fatality Involving David Hurst Oldknow, Ravensworth Underground Mine, Coal Preparation Plant Reject Bin 802, 18 February 2009, May 2010.

8. U.S. Mine Safety and Health Administration, Report of Accident: Exploding Vessels Under Pressure Accident, October 24, 2002, Foreman Quarry and Plant, Ash Grove Cement Company, Foreman, Little River County, Arkansas, Mine ID No. 03-00256, 2003.

9. U.S. National Transportation Safety Board, Pipeline Rupture and Subsequent Fire in Bellingham, Washington, June 10, 1999, Pipeline Accident Report NTSB/PAR-02/02, October 8, 2002.

10. U.S. Chemical Safety and Hazard Investigation Board, Pesticide Chemical Runaway Reaction Pressure Vessel Explosion, Bayer CropScience, Institute, West Virginia, August 28, 2008, Report No. 2008-08-I-WV, January 2011.

11. U.S. Chemical Safety and Hazard Investigation Board, Investigation Report: Allied Terminals, Inc., Catastrophic Tank Collapse, Allied Terminals, Inc., Chesapeake, Virginia, November 12, 2008, Report No. 2009-03-I-VA, May 2009.

12. T.L. Hardy, Essential Questions in System Safety: A Guide for Safety Decision Makers, AuthorHouse, Bloomington, IN, USA, 2011.

DOI 101002prs Process Safety Progress (Vol33 No2)130 June 2014 Published on behalf of the AIChE

Page 4: Case studies in process safety: Lessons learned from software-related accidents

of three lines of truck detection sensors be blocked in addi-tion to a command from the remote control to assure thatthe truck was in the correct location Each sensor line con-tained three sensors and all three sensors had to be blockedfor the entire line to be considered as blocked At the timeof the accident the truck was obscuring one line of sensorsand a second line of sensors was obscured by dirt on thelenses and therefore was not working correctly Because twoof the sensor lines were blocked and the remote control sig-nal had been sent the PLC automatically opened the rejectbin chute door and dropped 10 tons of material on the truckcab before the driver had safely cleared the chute resultingin the fatal injury [7]

Testing May Not Provide Information on Subsystemand Component Interactions or Hazard ControlOperations

The safety verification and validation process is intendedto determine that the design solution has met all the safetyrequirements (verification) and that the correct system isbuilt (validation) The verification and validation process ifperformed correctly will provide evidence that risk has beenreduced Testing is an important part of the verification andvalidation process and comprehensive software testingshould be conducted throughout the development cycleTesting should include not only nominal conditions basedon requirements but also abnormal conditions such asimproper inputs inputs outside expected conditions inputsexactly at and near the boundary conditions and inputsstuck at some value Safety-critical software must include fullsystem integration testing of end-to-end events That testingmust include stressing of the software and should includeinteractions of the software with hardware humans andenvironments A number of accidents have occurred whenno component failed in the conventional sense but the inter-action of components caused a system failure Testing mustinclude verifications of hazard controls and it should assurethat redundancy works when needed

On October 24 2002 a grinder exploded at a limestoneprocessing plant in Foreman Arkansas An operator waskilled when flammable waste fuel covered him and ignitedThe operator had started the pump for solid waste fuel proc-essing when the accident occurred The MSHA stated in itsreport that the cause of the accident was that the safety mon-itoring system designed to shut off the waste fuel systempump had not been maintained so that it functioned prop-erly Kilns were used in processing activities at the plantand these kilns were heated by burning coal natural gasand liquid waste fuel The liquid waste fuel was delivered bytruck or railcar and pumped into large storage tanks Fromthe storage area it was pumped through a grinder to reducethe particle size of the solids in the fuel Two independentsystems monitored and controlled the waste fuel delivery AFoxboro Intelligent Automation Distribution Control Systemmonitored and recorded normal operating parameters TheFoxboro also issued audible and visual alarms that wereavailable at the plant control room A PLC provided basicstartup and shutdown of the system and responded to com-mands from the Foxboro On the day of the accident theFoxboro sensed that the fuel delivery pressure was lowapparently due to blockage in the line As designed the Fox-boro sent a command to the PLC to shut down the pumpsHowever the PLC failed to respond and the pumps keptrunning Three months prior to the accident this PLC hadbeen installed this was supposed to be a simple replacementof an older PLC of similar capability However the Foxborohad not been connected to the newer PLC and the connec-tions remained to the older non-functioning PLC The

complete system had never been tested with the new PLC Asystem test had been scheduled 3 days prior to the accidentbut had been aborted when a pump failed during the testthe test had never been rescheduled The accident reportstated that the blockage may have broken free just prior tothe accident With the pumps running the pressure elevatedsignificantly and a ldquowater hammerrdquo effect caused overpressu-rization in the system at the grinder The grinder was tornloose from its base spraying fuel and pulling loose a 480-volt cable that ultimately served as an ignition source [8]

Software Change Management and Hazard AnalysesProcesses May Not Be Integrated

Engineering by its very nature is an activity that requireschange Changes can come about for a variety of reasonsincluding the discovery of problems during developmentchanges in requirements or routine upgrades of software orhardware As the development cycle proceeds and as anorganization gains more operational experience new haz-ards are often uncovered and some hazards may no longerbe relevant If the hazard analysis is not updated to reflectthese changes then resources may be expended on previ-ously identified hazards that may no longer be relevant andnew hazards may not be discovered as the design maturesIn addition routine changes and modifications to softwareand computing systems must be analyzed for potential haz-ards as part of the broader hazard analysis process

On June 10 1999 a 16-inch-diameter pipe carrying gaso-line ruptured in Bellingham Washington The ruptured pipe-line released approximately 237000 gallons of gasoline intoa nearby creek according to the NTSB That gasoline thenignited burning approximately 11=2 miles along the creekThree people died in the fire One home and the Bellinghamwater treatment plant were also damaged in the accident

The NTSB investigated the accident and determined thatthe probable cause of the rupture was damage to the pipeduring a modification project performed in 1994 This dam-age weakened the pipeline making it susceptible to ruptureunder increased pressure in the pipe The NTSB also statedthat inspections of the pipeline during the project were inad-equate and the company did not identify and repair damageThe report noted that in-line pipeline inspection data shouldhave prompted the company to excavate and examine thatsection of pipeline but the company failed to perform suchwork after reported anomalies The SCADA system com-puters also played a role in the accident The SCADA systemwas used for operation of the pipeline for example to openand close valves remotely as required or to operate pumpsas needed Just prior to the accident the operator was pre-paring to initiate delivery of gasoline to a terminal in Seattlediverting delivery from another facility During the processof switching delivery destinations the pressure in the pipe-line began to increase which is a normal condition but onethat required the operator to start a pump to reduce pres-sure However when the operator tried to start that pumpthe SCADA system failed to execute the start commandissued by the operator The operator soon found that theSCADA system was unresponsive to any commands some-thing that had never happened before The report statedthat ldquoHad the controller been able to start the pump atWoodinville it is probable that the pressure backup wouldhave been alleviated and the pipeline operated routinely forthe balance of the fuel deliveryrdquo Instead the pressure in thepipe increased and the increased pressure likely caused thedamaged pipe to rupture

The cause of computer system failure was likely a changemade to the system database just prior to the accident TheNTSB accident report stated that the SCADA system

Process Safety Progress (Vol33 No2) Published on behalf of the AIChE DOI 101002prs June 2014 127

administrator entered new records into the live database atthe time of the accident The system administrator howeverdid not check the records or test the system software to seeif those changes introduced any problems The computingsystem problem could not be replicated after the accidentand therefore the cause of the anomaly could not be defini-tively identified The report stated ldquoThe Safety Board con-cludes that had the SCADA database revisions that wereperformed shortly before the accident been performed andthoroughly tested on an off-line system instead of the pri-mary on-line SCADA system errors resulting from those revi-sions may have been identified and repaired before theycould affect the operation of the pipeline [9]rdquo

Human-Software Interactions Have Significant SafetyImplications That Are Often Underestimated

Humans interact with hardware and software in a varietyof ways A number of accidents have occurred where theseinteractions have been a significant contributing cause Ofspecial concern is when a new computing system is imple-mented that changes what is expected of the operator innormal and emergency situations

On August 28 2008 an explosion occurred at a pesticidemanufacturing plant in Institute West Virginia Two workerswere killed in the explosion and eight others were injuredThe US Chemical Safety and Hazard Investigation Board(CSB) found that the explosion was the result of a runawaychemical reaction The company was starting up a methomylunit for the first time after several months of down time formaintenance to install a new computer control system andreactor vessel Normal operations called for dissolvedmethomyl and waste chemicals to be fed into a preheatedresidue treater vessel partially filled with solvent The treaterallowed the methomyl to decompose safely after which itwas mixed with other waste chemicals and used to fuel facil-ity boilers On the day of the accident a methomyl-solventmixture was added prematurely to the residue treater vesselbefore solvent had been added to the tank This eventoccurred when operators were troubleshooting equipmentproblems during the startup This mixture was supposed tobe added to the vessel only after that vessel was filled withclean solvent and heated to a minimum safe temperature Aninterlock system existed as part of the automatic feed controlsystem to prevent inadvertent introduction of methomyl Thisinterlock was password-protected to prevent inadvertentoverride but operators intentionally overrode the interlockBypassing the interlocks had apparently been a practice con-doned by management in the past as a workaround foroperational problems Once the methomyl decompositionreaction started it could not be stopped and the pressurerapidly rose in the vessel due to gas from the reaction lead-ing to the explosion Uncrystallized methomyl existed in thetank due to equipment problems and this material greatlyincreased the methomyl concentration in the residue treatercontributing to the runaway reaction

The CSB stated that the company initiated this processstartup before the company had completed critical checksincluding valve lineups equipment checkouts and computercalibrations The CSB said that the operating procedure forthe startup was complex but this procedure had not beenreviewed or approved In addition the company had notperformed training on the new computer system put in placeas part of the maintenance This new computer systemoffered significant improvements to help automatically con-trol the operation For example the control system includedgraphical display screens that simulated the process flowBut the system was also complex and the modificationschanged the way operators interacted with the system The

operators thought the screens were difficult to navigate andresponding to troubleshooting alarms was difficult The acci-dent report stated that had the operators received adequatetraining on the new computer system they may have beenable to recognize problems in operation before theexplosion

The report faulted the company for not performing anadequate pre-startup safety review System operators did notparticipate in the safety review and review checklistsshowed items as completed when they were not The CSBstated that the company had also failed to perform a thor-ough Process Hazard Analysis The report stated that the Pro-cess Hazard Analysis was performed quickly becausemanagement had not allotted sufficient time for analysis Inaddition the CSB stated that the Process Hazard Analysisincluded invalid assumptions and said that the team did notapply the analysis tools properly resulting in unmitigatedaccident scenarios [10]

Support Software Including Models and SimulationsMay Be As Critical To Safety As Control Software

While the focus of software safety efforts is usually onsoftware directly controlling an application support softwaremay contribute to an accident or to the effectiveness of theemergency response Examples of support software includedatabases used for maintenance activities software to esti-mate load stability computer-based models that providedesign calculations or assurance information and so on Afailure to analyze the hazards associated with this supportsoftware including models and simulations can result inunforeseen system failures

On November 12 2008 a 2 million gallon liquid fertilizertank at in Chesapeake Virginia collapsed Two workers per-forming welding operations at the site were seriously injuredand an adjacent neighborhood was partially flooded as aresult of the accident The CSB found that the company hadnot assured that welds to replace vertical joints met acceptedindustry standards and the CSB faulted the company for itsfailure to perform inspections of the welds The companywas also faulted for not having proper procedures in placefor filling the tanks following major facility modifications Inits report the CSB also noted that the contractor hired by thecompany to calculate the maximum fill height had usedsome faulty assumptions The maximum liquid level wassupposed to be calculated in part based on the minimummeasured shell thicknesses and the extent of the weldinspection (full spot or no radiography) The contractorused the maximum (not minimum) measured thickness andassumed full inspection of the welds These assumptions ledto an overestimation of the allowable liquid level The tankfailed at a fill level of 2674 feet below the calculated maxi-mum of 2701 feet The CSB also noted a number of previ-ous overfilling accidents The CSB found 16 other tankfailures at nine facilities in other states between 1995 and2008 These 16 failures resulted in one death four hospital-izations one community evacuation and two releases towaterways Eleven occurred due to defective welding [11]

RECOMMENDATIONS FOR THE SOFTWARE SYSTEM SAFETY PROCESS

There is a great deal of information in the literature onhow to improve software development processes And stand-ards such as IEC 61511 Functional Safety - Safety Instru-mented Systems for the Process Industry Sector and DO-178B Software Considerations in Airborne Systems andEquipment Certification provide invaluable information tohelp improve the safety of complex systems However theabove themes and lessons learned indicate that improvingsoftware and computing system safety is more than just

DOI 101002prs Process Safety Progress (Vol33 No2)128 June 2014 Published on behalf of the AIChE

providing additional training using standards or implement-ing improved software development techniques or toolsImproving software safety requires a change in the way pro-cess safety practitioners think Software safety must be fullyintegrated into the process safety analysis process And pro-cess safety practitioners must begin to ask critical questionsabout their software safety efforts Some examples of thesequestions include the following

bull Do plans reflect how business is really done Are plansreviewed Do plans have unrealistic schedules orresource allocations Are software and computing sys-tems considering in the planning and acquisition proc-esses Poor or unrealistic plans may reflect anorganization that does not truly place a priority onsafety activities

bull Is there a convincing story that the safety analysis iscomplete Is there a sufficient description of the systemto support an analysis including the software and com-puting system Are computing systems treated as ldquoblackboxesrdquo Have software causes been thoroughly eval-uated using a systematic approach or are the softwarecauses listed as ldquosoftware errorrdquo with no further expla-nation Do the hazard analyses include support soft-ware such as design models or databases Failure toshow that the problem is being looked at systematicallycould indicate that there are holes in the analysis withpotentially significant problems overlooked

bull Are the hazard reports detailed enough Are causesdescriptive Are the recommendations clear and suffi-ciently detailed Does the logic make sense and is itcomplete Do controls match up with the causesshowing a one-to-one or many-to-one relation Lack ofdetail could be an indication of insufficient knowledgeof the system or lack of information on the system

bull Do the risk assessments have a logical basis Have therisk assessments considered software complexity matu-rity testing use of unproven technologies etc Areassessments overly optimistic based on what is knownabout the computing system A failure to provide abasis for your risk assessments may result in unrealisticassessments

bull Are the hazard controls primarily procedural rather thandesign changes safety features or devices Is there anoverreliance on alarms and signage Do software con-trols only rely on good software processes Is there anoverreliance on humans and software to ldquosave the dayrdquoOverreliance on operational controls may indicate aweak safety design

bull Are hazard control recommendations being imple-mented Can the selected hazard control strategyactually be implemented and verified Is the controlstrategy so complex that it will be impossible to deter-mine whether it will work when needed Complex con-trols overlapping control strategies or inadequateimplementation may be an indication of a weak safetydesign

bull Has the risk assessment truly considered the worst caseWhat is the basis for the likelihood levels Is the riskanalyzed only for steady-state operation or have start-ups and shutdowns also been considered Failure toprovide good answers to these questions indicates apotential misunderstanding of the risk

bull Are problems found in test and design included in thehazard reports and factored into the design Have prob-lems and incidents been fully investigated Does aprocess exist for tracking action items includingwhether recommendations have been implementedAre changes to software and computing systems prop-

erly factored into the hazard analysis Failure to incor-porate problems changes and corrective actions is anindication of the potential to miss serious design flaws

These questions help to identify whether the hazard anal-ysis process is robust However we must also ask questionsspecifically related to the use of software and computing sys-tems in complex systems The best questions come fromreal-world examples of accidents where software has been acontributor Some examples of questions are as follows andothers can be found in the literature [12]

bull Have safety-critical software commands and data beenidentified

bull Do hazard controls for software-related causes combinegood practices and specific safeguards

bull Is software and system testing adequate and do testsinclude sufficient off-nominal conditions

bull Is the computing system design overly complexbull Is the design based on unproven technologiesbull What happens if the software locks upbull Are the sensors used for software decisions fault

tolerantbull Has software mode transition been consideredbull Has consideration been given to the order of commands

and potential out of sequence inputsbull Will the software and system start up and shut down in

a known safe statebull Are checks performed before initiating hazardous

operationsbull Will the software properly handle spurious signals and

power outages

These are by no means all the questions a decision makershould ask and positive answers to these questions provideno assurance that an accident will be prevented These ques-tions should encourage critical thinking about safety andgenerate additional questions to provide further insight onsystem risk A failure to ask these questions could mean thatthe potential for an accident is higher than we had assumed

It is up to all stakeholders to look for those conditionsthat could lead to an accident and to recognize that theworst can happen This means we should all express con-cerns about software safety management and engineeringwhen necessary based on our knowledge experience andjudgment including lessons learned from accidents We mustask questions to understand the potential for harm to under-stand the steps taken to assure that the risks have beenreduced and to assure that there is proof that hazard con-trols are effective We do not have to be software experts toask questions and in fact a lack of expertise could be anadvantage when trying to understand how a system worksMost importantly when thinking of software and computingsystem safety we must think of the system and its interac-tions between software hardware humans processes andenvironments

CONCLUSION

PSM can provide immense benefits in reducing opera-tional risks in the chemical process and energy productionindustries By proactively identifying hazards assessing andcharacterizing risks and taking actions to reduce those risksorganizations can prevent accidents and reduce the potentialfor death injury property damage and environmentalimpacts Given the importance of software and computingsystems in the design and operation of many systems soft-ware must be included as part of the broader process safetyeffort However the accidents provided in this article alongwith many others illustrate that software is often not system-atically considered as part of process safety activities For

Process Safety Progress (Vol33 No2) Published on behalf of the AIChE DOI 101002prs June 2014 129

example software may not be identified as a hazard causerisks may be underestimated software-related hazard con-trols may be oversimplified testing may not include suffi-cient safety scenarios and system changes may not befactored into the design or hazard analysis Process safetypractitioners must take advantage of available software andcomputing system lessons learned going beyond examplespresented here to improve their own safety efforts andmost importantly to prevent accidents

LITERATURE CITED

1 TL Hardy Software and System Safety Accidents Inci-dents and Lessons Learned AuthorHouse BloomingtonIN USA 2012

2 B Pool System Wrongfully Blamed in Union CarbideLeak SAFER Fights Off Dark Cloud of Bad Publicity LosAngeles Times August 20 1985

3 N Schlager Breakdown Deadly Technological DisastersVisible Ink Press Canton MI USA 1995

4 Transportation Safety Board of Canada ProgrammableLogic Controller Failure Foothills Pipe Lines Ltd Decom-pressionRecompression Facility BP Canada EnergyCompany Empress Natural Gas Liquids Facility NearEmpress Alberta 18 October 2005 Report NumberP05H0061 July 12 2006

5 US National Transportation Safety Board Rupture ofPiney Point Oil Pipeline and Release of Fuel Oil NearChalk Point Maryland April 7 2000 Pipeline AccidentReport NTSBPAR-0201 July 23 2002

6 US Mine Safety and Health Administration Report ofInvestigation Fatal Other Accident (Steam Burns) Febru-ary 11 2003 Southern Clay Plants amp Pits Southern ClayProd Inc Gonzales Gonzales County Texas Mine IDNo 41ndash00298 2003

7 State of New South Wales (Australia) Department ofIndustry and Investment Fatality involving David HurstOldknow Ravensworth Underground Mine Coal Prepara-tion Plant Reject bin 802 18 February 2009 May 2010

8 US Mine Safety and Health Administration Report ofAccident Exploding Vessels Under Pressure AccidentOctober 24 2002 Foreman Quarry and Plant Ash GroveCement Company Foreman Little River CountyArkansas Mine ID No 03-00256 2003

9 US National Transportation Safety Board Pipeline Rup-ture and Subsequent Fire in Bellingham WashingtonJune 10 1999 Pipeline Accident Report NTSBPAR-0202October 8 2002

10 US Chemical Safety and Hazard Investigation BoardPesticide Chemical Runaway Reaction Pressure VesselExplosion Bayer CropScience Institute West VirginiaAugust 28 2008 Report No 2008-08-I-WV January 2011

11 US Chemical Safety and Hazard Investigation BoardInvestigation Report Allied Terminals IncmdashCatastrophicTank Collapse Allied Terminals Inc Chesapeake Vir-ginia November 12 2008 Report No 2009-03-I-VA May2009

12 TL Hardy Essential Questions in System Safety A Guidefor Safety Decision Makers AuthorHouse BloomingtonIN USA 2011

DOI 101002prs Process Safety Progress (Vol33 No2)130 June 2014 Published on behalf of the AIChE

Page 5: Case studies in process safety: Lessons learned from software-related accidents

administrator entered new records into the live database atthe time of the accident The system administrator howeverdid not check the records or test the system software to seeif those changes introduced any problems The computingsystem problem could not be replicated after the accidentand therefore the cause of the anomaly could not be defini-tively identified The report stated ldquoThe Safety Board con-cludes that had the SCADA database revisions that wereperformed shortly before the accident been performed andthoroughly tested on an off-line system instead of the pri-mary on-line SCADA system errors resulting from those revi-sions may have been identified and repaired before theycould affect the operation of the pipeline [9]rdquo

Human-Software Interactions Have Significant SafetyImplications That Are Often Underestimated

Humans interact with hardware and software in a varietyof ways A number of accidents have occurred where theseinteractions have been a significant contributing cause Ofspecial concern is when a new computing system is imple-mented that changes what is expected of the operator innormal and emergency situations

On August 28 2008 an explosion occurred at a pesticidemanufacturing plant in Institute West Virginia Two workerswere killed in the explosion and eight others were injuredThe US Chemical Safety and Hazard Investigation Board(CSB) found that the explosion was the result of a runawaychemical reaction The company was starting up a methomylunit for the first time after several months of down time formaintenance to install a new computer control system andreactor vessel Normal operations called for dissolvedmethomyl and waste chemicals to be fed into a preheatedresidue treater vessel partially filled with solvent The treaterallowed the methomyl to decompose safely after which itwas mixed with other waste chemicals and used to fuel facil-ity boilers On the day of the accident a methomyl-solventmixture was added prematurely to the residue treater vesselbefore solvent had been added to the tank This eventoccurred when operators were troubleshooting equipmentproblems during the startup This mixture was supposed tobe added to the vessel only after that vessel was filled withclean solvent and heated to a minimum safe temperature Aninterlock system existed as part of the automatic feed controlsystem to prevent inadvertent introduction of methomyl Thisinterlock was password-protected to prevent inadvertentoverride but operators intentionally overrode the interlockBypassing the interlocks had apparently been a practice con-doned by management in the past as a workaround foroperational problems Once the methomyl decompositionreaction started it could not be stopped and the pressurerapidly rose in the vessel due to gas from the reaction lead-ing to the explosion Uncrystallized methomyl existed in thetank due to equipment problems and this material greatlyincreased the methomyl concentration in the residue treatercontributing to the runaway reaction

The CSB stated that the company initiated this processstartup before the company had completed critical checksincluding valve lineups equipment checkouts and computercalibrations The CSB said that the operating procedure forthe startup was complex but this procedure had not beenreviewed or approved In addition the company had notperformed training on the new computer system put in placeas part of the maintenance This new computer systemoffered significant improvements to help automatically con-trol the operation For example the control system includedgraphical display screens that simulated the process flowBut the system was also complex and the modificationschanged the way operators interacted with the system The

operators thought the screens were difficult to navigate andresponding to troubleshooting alarms was difficult The acci-dent report stated that had the operators received adequatetraining on the new computer system they may have beenable to recognize problems in operation before theexplosion

The report faulted the company for not performing anadequate pre-startup safety review System operators did notparticipate in the safety review and review checklistsshowed items as completed when they were not The CSBstated that the company had also failed to perform a thor-ough Process Hazard Analysis The report stated that the Pro-cess Hazard Analysis was performed quickly becausemanagement had not allotted sufficient time for analysis Inaddition the CSB stated that the Process Hazard Analysisincluded invalid assumptions and said that the team did notapply the analysis tools properly resulting in unmitigatedaccident scenarios [10]

Support Software Including Models and SimulationsMay Be As Critical To Safety As Control Software

While the focus of software safety efforts is usually onsoftware directly controlling an application support softwaremay contribute to an accident or to the effectiveness of theemergency response Examples of support software includedatabases used for maintenance activities software to esti-mate load stability computer-based models that providedesign calculations or assurance information and so on Afailure to analyze the hazards associated with this supportsoftware including models and simulations can result inunforeseen system failures

On November 12 2008 a 2 million gallon liquid fertilizertank at in Chesapeake Virginia collapsed Two workers per-forming welding operations at the site were seriously injuredand an adjacent neighborhood was partially flooded as aresult of the accident The CSB found that the company hadnot assured that welds to replace vertical joints met acceptedindustry standards and the CSB faulted the company for itsfailure to perform inspections of the welds The companywas also faulted for not having proper procedures in placefor filling the tanks following major facility modifications Inits report the CSB also noted that the contractor hired by thecompany to calculate the maximum fill height had usedsome faulty assumptions The maximum liquid level wassupposed to be calculated in part based on the minimummeasured shell thicknesses and the extent of the weldinspection (full spot or no radiography) The contractorused the maximum (not minimum) measured thickness andassumed full inspection of the welds These assumptions ledto an overestimation of the allowable liquid level The tankfailed at a fill level of 2674 feet below the calculated maxi-mum of 2701 feet The CSB also noted a number of previ-ous overfilling accidents The CSB found 16 other tankfailures at nine facilities in other states between 1995 and2008 These 16 failures resulted in one death four hospital-izations one community evacuation and two releases towaterways Eleven occurred due to defective welding [11]

RECOMMENDATIONS FOR THE SOFTWARE SYSTEM SAFETY PROCESS

There is a great deal of information in the literature onhow to improve software development processes And stand-ards such as IEC 61511 Functional Safety - Safety Instru-mented Systems for the Process Industry Sector and DO-178B Software Considerations in Airborne Systems andEquipment Certification provide invaluable information tohelp improve the safety of complex systems However theabove themes and lessons learned indicate that improvingsoftware and computing system safety is more than just

DOI 101002prs Process Safety Progress (Vol33 No2)128 June 2014 Published on behalf of the AIChE

providing additional training using standards or implement-ing improved software development techniques or toolsImproving software safety requires a change in the way pro-cess safety practitioners think Software safety must be fullyintegrated into the process safety analysis process And pro-cess safety practitioners must begin to ask critical questionsabout their software safety efforts Some examples of thesequestions include the following

bull Do plans reflect how business is really done Are plansreviewed Do plans have unrealistic schedules orresource allocations Are software and computing sys-tems considering in the planning and acquisition proc-esses Poor or unrealistic plans may reflect anorganization that does not truly place a priority onsafety activities

bull Is there a convincing story that the safety analysis iscomplete Is there a sufficient description of the systemto support an analysis including the software and com-puting system Are computing systems treated as ldquoblackboxesrdquo Have software causes been thoroughly eval-uated using a systematic approach or are the softwarecauses listed as ldquosoftware errorrdquo with no further expla-nation Do the hazard analyses include support soft-ware such as design models or databases Failure toshow that the problem is being looked at systematicallycould indicate that there are holes in the analysis withpotentially significant problems overlooked

bull Are the hazard reports detailed enough Are causesdescriptive Are the recommendations clear and suffi-ciently detailed Does the logic make sense and is itcomplete Do controls match up with the causesshowing a one-to-one or many-to-one relation Lack ofdetail could be an indication of insufficient knowledgeof the system or lack of information on the system

bull Do the risk assessments have a logical basis Have therisk assessments considered software complexity matu-rity testing use of unproven technologies etc Areassessments overly optimistic based on what is knownabout the computing system A failure to provide abasis for your risk assessments may result in unrealisticassessments

bull Are the hazard controls primarily procedural rather thandesign changes safety features or devices Is there anoverreliance on alarms and signage Do software con-trols only rely on good software processes Is there anoverreliance on humans and software to ldquosave the dayrdquoOverreliance on operational controls may indicate aweak safety design

bull Are hazard control recommendations being imple-mented Can the selected hazard control strategyactually be implemented and verified Is the controlstrategy so complex that it will be impossible to deter-mine whether it will work when needed Complex con-trols overlapping control strategies or inadequateimplementation may be an indication of a weak safetydesign

bull Has the risk assessment truly considered the worst caseWhat is the basis for the likelihood levels Is the riskanalyzed only for steady-state operation or have start-ups and shutdowns also been considered Failure toprovide good answers to these questions indicates apotential misunderstanding of the risk

bull Are problems found in test and design included in thehazard reports and factored into the design Have prob-lems and incidents been fully investigated Does aprocess exist for tracking action items includingwhether recommendations have been implementedAre changes to software and computing systems prop-

erly factored into the hazard analysis Failure to incor-porate problems changes and corrective actions is anindication of the potential to miss serious design flaws

These questions help to identify whether the hazard anal-ysis process is robust However we must also ask questionsspecifically related to the use of software and computing sys-tems in complex systems The best questions come fromreal-world examples of accidents where software has been acontributor Some examples of questions are as follows andothers can be found in the literature [12]

bull Have safety-critical software commands and data beenidentified

bull Do hazard controls for software-related causes combinegood practices and specific safeguards

bull Is software and system testing adequate and do testsinclude sufficient off-nominal conditions

bull Is the computing system design overly complexbull Is the design based on unproven technologiesbull What happens if the software locks upbull Are the sensors used for software decisions fault

tolerantbull Has software mode transition been consideredbull Has consideration been given to the order of commands

and potential out of sequence inputsbull Will the software and system start up and shut down in

a known safe statebull Are checks performed before initiating hazardous

operationsbull Will the software properly handle spurious signals and

power outages

These are by no means all the questions a decision makershould ask and positive answers to these questions provideno assurance that an accident will be prevented These ques-tions should encourage critical thinking about safety andgenerate additional questions to provide further insight onsystem risk A failure to ask these questions could mean thatthe potential for an accident is higher than we had assumed

It is up to all stakeholders to look for those conditionsthat could lead to an accident and to recognize that theworst can happen This means we should all express con-cerns about software safety management and engineeringwhen necessary based on our knowledge experience andjudgment including lessons learned from accidents We mustask questions to understand the potential for harm to under-stand the steps taken to assure that the risks have beenreduced and to assure that there is proof that hazard con-trols are effective We do not have to be software experts toask questions and in fact a lack of expertise could be anadvantage when trying to understand how a system worksMost importantly when thinking of software and computingsystem safety we must think of the system and its interac-tions between software hardware humans processes andenvironments

CONCLUSION

PSM can provide immense benefits in reducing opera-tional risks in the chemical process and energy productionindustries By proactively identifying hazards assessing andcharacterizing risks and taking actions to reduce those risksorganizations can prevent accidents and reduce the potentialfor death injury property damage and environmentalimpacts Given the importance of software and computingsystems in the design and operation of many systems soft-ware must be included as part of the broader process safetyeffort However the accidents provided in this article alongwith many others illustrate that software is often not system-atically considered as part of process safety activities For

Process Safety Progress (Vol33 No2) Published on behalf of the AIChE DOI 101002prs June 2014 129

example software may not be identified as a hazard causerisks may be underestimated software-related hazard con-trols may be oversimplified testing may not include suffi-cient safety scenarios and system changes may not befactored into the design or hazard analysis Process safetypractitioners must take advantage of available software andcomputing system lessons learned going beyond examplespresented here to improve their own safety efforts andmost importantly to prevent accidents

LITERATURE CITED

1 TL Hardy Software and System Safety Accidents Inci-dents and Lessons Learned AuthorHouse BloomingtonIN USA 2012

2 B Pool System Wrongfully Blamed in Union CarbideLeak SAFER Fights Off Dark Cloud of Bad Publicity LosAngeles Times August 20 1985

3 N Schlager Breakdown Deadly Technological DisastersVisible Ink Press Canton MI USA 1995

4 Transportation Safety Board of Canada ProgrammableLogic Controller Failure Foothills Pipe Lines Ltd Decom-pressionRecompression Facility BP Canada EnergyCompany Empress Natural Gas Liquids Facility NearEmpress Alberta 18 October 2005 Report NumberP05H0061 July 12 2006

5 US National Transportation Safety Board Rupture ofPiney Point Oil Pipeline and Release of Fuel Oil NearChalk Point Maryland April 7 2000 Pipeline AccidentReport NTSBPAR-0201 July 23 2002

6 US Mine Safety and Health Administration Report ofInvestigation Fatal Other Accident (Steam Burns) Febru-ary 11 2003 Southern Clay Plants amp Pits Southern ClayProd Inc Gonzales Gonzales County Texas Mine IDNo 41ndash00298 2003

7 State of New South Wales (Australia) Department ofIndustry and Investment Fatality involving David HurstOldknow Ravensworth Underground Mine Coal Prepara-tion Plant Reject bin 802 18 February 2009 May 2010

8 US Mine Safety and Health Administration Report ofAccident Exploding Vessels Under Pressure AccidentOctober 24 2002 Foreman Quarry and Plant Ash GroveCement Company Foreman Little River CountyArkansas Mine ID No 03-00256 2003

9 US National Transportation Safety Board Pipeline Rup-ture and Subsequent Fire in Bellingham WashingtonJune 10 1999 Pipeline Accident Report NTSBPAR-0202October 8 2002

10 US Chemical Safety and Hazard Investigation BoardPesticide Chemical Runaway Reaction Pressure VesselExplosion Bayer CropScience Institute West VirginiaAugust 28 2008 Report No 2008-08-I-WV January 2011

11 US Chemical Safety and Hazard Investigation BoardInvestigation Report Allied Terminals IncmdashCatastrophicTank Collapse Allied Terminals Inc Chesapeake Vir-ginia November 12 2008 Report No 2009-03-I-VA May2009

12 TL Hardy Essential Questions in System Safety A Guidefor Safety Decision Makers AuthorHouse BloomingtonIN USA 2011

DOI 101002prs Process Safety Progress (Vol33 No2)130 June 2014 Published on behalf of the AIChE

Page 6: Case studies in process safety: Lessons learned from software-related accidents

providing additional training using standards or implement-ing improved software development techniques or toolsImproving software safety requires a change in the way pro-cess safety practitioners think Software safety must be fullyintegrated into the process safety analysis process And pro-cess safety practitioners must begin to ask critical questionsabout their software safety efforts Some examples of thesequestions include the following

bull Do plans reflect how business is really done Are plansreviewed Do plans have unrealistic schedules orresource allocations Are software and computing sys-tems considering in the planning and acquisition proc-esses Poor or unrealistic plans may reflect anorganization that does not truly place a priority onsafety activities

bull Is there a convincing story that the safety analysis iscomplete Is there a sufficient description of the systemto support an analysis including the software and com-puting system Are computing systems treated as ldquoblackboxesrdquo Have software causes been thoroughly eval-uated using a systematic approach or are the softwarecauses listed as ldquosoftware errorrdquo with no further expla-nation Do the hazard analyses include support soft-ware such as design models or databases Failure toshow that the problem is being looked at systematicallycould indicate that there are holes in the analysis withpotentially significant problems overlooked

bull Are the hazard reports detailed enough Are causesdescriptive Are the recommendations clear and suffi-ciently detailed Does the logic make sense and is itcomplete Do controls match up with the causesshowing a one-to-one or many-to-one relation Lack ofdetail could be an indication of insufficient knowledgeof the system or lack of information on the system

bull Do the risk assessments have a logical basis Have therisk assessments considered software complexity matu-rity testing use of unproven technologies etc Areassessments overly optimistic based on what is knownabout the computing system A failure to provide abasis for your risk assessments may result in unrealisticassessments

bull Are the hazard controls primarily procedural rather thandesign changes safety features or devices Is there anoverreliance on alarms and signage Do software con-trols only rely on good software processes Is there anoverreliance on humans and software to ldquosave the dayrdquoOverreliance on operational controls may indicate aweak safety design

bull Are hazard control recommendations being imple-mented Can the selected hazard control strategyactually be implemented and verified Is the controlstrategy so complex that it will be impossible to deter-mine whether it will work when needed Complex con-trols overlapping control strategies or inadequateimplementation may be an indication of a weak safetydesign

bull Has the risk assessment truly considered the worst caseWhat is the basis for the likelihood levels Is the riskanalyzed only for steady-state operation or have start-ups and shutdowns also been considered Failure toprovide good answers to these questions indicates apotential misunderstanding of the risk

bull Are problems found in test and design included in thehazard reports and factored into the design Have prob-lems and incidents been fully investigated Does aprocess exist for tracking action items includingwhether recommendations have been implementedAre changes to software and computing systems prop-

erly factored into the hazard analysis Failure to incor-porate problems changes and corrective actions is anindication of the potential to miss serious design flaws

These questions help to identify whether the hazard anal-ysis process is robust However we must also ask questionsspecifically related to the use of software and computing sys-tems in complex systems The best questions come fromreal-world examples of accidents where software has been acontributor Some examples of questions are as follows andothers can be found in the literature [12]

bull Have safety-critical software commands and data beenidentified

bull Do hazard controls for software-related causes combinegood practices and specific safeguards

bull Is software and system testing adequate and do testsinclude sufficient off-nominal conditions

bull Is the computing system design overly complexbull Is the design based on unproven technologiesbull What happens if the software locks upbull Are the sensors used for software decisions fault

tolerantbull Has software mode transition been consideredbull Has consideration been given to the order of commands

and potential out of sequence inputsbull Will the software and system start up and shut down in

a known safe statebull Are checks performed before initiating hazardous

operationsbull Will the software properly handle spurious signals and

power outages

These are by no means all the questions a decision makershould ask and positive answers to these questions provideno assurance that an accident will be prevented These ques-tions should encourage critical thinking about safety andgenerate additional questions to provide further insight onsystem risk A failure to ask these questions could mean thatthe potential for an accident is higher than we had assumed

It is up to all stakeholders to look for those conditionsthat could lead to an accident and to recognize that theworst can happen This means we should all express con-cerns about software safety management and engineeringwhen necessary based on our knowledge experience andjudgment including lessons learned from accidents We mustask questions to understand the potential for harm to under-stand the steps taken to assure that the risks have beenreduced and to assure that there is proof that hazard con-trols are effective We do not have to be software experts toask questions and in fact a lack of expertise could be anadvantage when trying to understand how a system worksMost importantly when thinking of software and computingsystem safety we must think of the system and its interac-tions between software hardware humans processes andenvironments

CONCLUSION

PSM can provide immense benefits in reducing operational risks in the chemical process and energy production industries. By proactively identifying hazards, assessing and characterizing risks, and taking actions to reduce those risks, organizations can prevent accidents and reduce the potential for death, injury, property damage, and environmental impacts. Given the importance of software and computing systems in the design and operation of many systems, software must be included as part of the broader process safety effort. However, the accidents described in this article, along with many others, illustrate that software is often not systematically considered as part of process safety activities. For example, software may not be identified as a hazard cause, risks may be underestimated, software-related hazard controls may be oversimplified, testing may not include sufficient safety scenarios, and system changes may not be factored into the design or hazard analysis. Process safety practitioners must take advantage of available software and computing system lessons learned, going beyond the examples presented here, to improve their own safety efforts and, most importantly, to prevent accidents.

LITERATURE CITED

1. T.L. Hardy, Software and System Safety: Accidents, Incidents, and Lessons Learned, AuthorHouse, Bloomington, IN, USA, 2012.

2. B. Pool, System Wrongfully Blamed in Union Carbide Leak: SAFER Fights Off Dark Cloud of Bad Publicity, Los Angeles Times, August 20, 1985.

3. N. Schlager, Breakdown: Deadly Technological Disasters, Visible Ink Press, Canton, MI, USA, 1995.

4. Transportation Safety Board of Canada, Programmable Logic Controller Failure, Foothills Pipe Lines Ltd. Decompression/Recompression Facility, BP Canada Energy Company, Empress Natural Gas Liquids Facility, Near Empress, Alberta, 18 October 2005, Report Number P05H0061, July 12, 2006.

5. U.S. National Transportation Safety Board, Rupture of Piney Point Oil Pipeline and Release of Fuel Oil Near Chalk Point, Maryland, April 7, 2000, Pipeline Accident Report NTSB/PAR-02/01, July 23, 2002.

6. U.S. Mine Safety and Health Administration, Report of Investigation, Fatal Other Accident (Steam Burns), February 11, 2003, Southern Clay Plants & Pits, Southern Clay Prod., Inc., Gonzales, Gonzales County, Texas, Mine ID No. 41-00298, 2003.

7. State of New South Wales (Australia) Department of Industry and Investment, Fatality Involving David Hurst Oldknow, Ravensworth Underground Mine, Coal Preparation Plant, Reject Bin 802, 18 February 2009, May 2010.

8. U.S. Mine Safety and Health Administration, Report of Accident, Exploding Vessels Under Pressure Accident, October 24, 2002, Foreman Quarry and Plant, Ash Grove Cement Company, Foreman, Little River County, Arkansas, Mine ID No. 03-00256, 2003.

9. U.S. National Transportation Safety Board, Pipeline Rupture and Subsequent Fire in Bellingham, Washington, June 10, 1999, Pipeline Accident Report NTSB/PAR-02/02, October 8, 2002.

10. U.S. Chemical Safety and Hazard Investigation Board, Pesticide Chemical Runaway Reaction, Pressure Vessel Explosion, Bayer CropScience, Institute, West Virginia, August 28, 2008, Report No. 2008-08-I-WV, January 2011.

11. U.S. Chemical Safety and Hazard Investigation Board, Investigation Report: Allied Terminals, Inc. - Catastrophic Tank Collapse, Allied Terminals, Inc., Chesapeake, Virginia, November 12, 2008, Report No. 2009-03-I-VA, May 2009.

12. T.L. Hardy, Essential Questions in System Safety: A Guide for Safety Decision Makers, AuthorHouse, Bloomington, IN, USA, 2011.
