TRANSCRIPT
HUMAN ERROR: THE BIGGEST CHALLENGE TO AVAILABILITY
James Soh 苏旭江, Newwit Consultancy 新想咨询
January 2017
1
AVAILABILITY
Systems will fail, even supposedly “fool-proof” ones; hence redundancy, BCP and IT DRP.
Availability rests on good engineering design; rigor of implementation, testing and commissioning; automated monitoring; rigorous operations and maintenance; and proper use (capacity/resiliency).
Human Error:
Means that something has been done that deviates from intention, expectation or desirability.
It is a misconception that a Tier 4 data center can withstand an incident caused by human error.
2
TWO KINDS OF HUMAN ERROR
• Active error: an unsafe act whose effect is felt immediately (e.g., an operator’s slip during a switching operation)
• Latent error: a dormant flaw in design, procedure or maintenance that surfaces only when triggered later
• Outcome: monetary loss, reputational damage, loss of life
CASE
4
• Trending up: the average cost of an outage rose 28% (from 2013 to 2016) to USD 740,357
- According to the 2016 Ponemon Institute Research Report
The root cause of downtime can often be directly or indirectly linked to human error.
COST OF DOWNTIME
5
CONSEQUENCE OF HUMAN ERROR AND UNAVAILABILITY DURATION
• Weakest link: it takes only one person to trigger an incident
• An incident can be just a near-miss or minor (not impacting critical operations)...
• ...or it can have major consequences
• Interaction with cascading failures can turn it into a major outage and unavailability
• Service recovery is not complete merely when power returns or a setting is restored
6
ANALYSIS OF OUTAGES CAUSED BY HUMAN ERRORS
• Inherent design / setting flaw; outdated / “Swiss cheese” situation; requires analysis and manual intervention; error-producing conditions (EPC)
• Weakness in manual processes; inadequate automation; inadequate training / familiarity; inadequate operations procedures
• Insufficient information / knowledge; capacity; inadequate training / knowledge; inadequate documentation
• Insufficient risk assessment; MOS / RA, risk matrix; vendor experience
7
WHAT HAVE OTHERS DONE TO MINIMIZE HUMAN ERROR?
• Airline’s Crew Resource Management
• US Nuclear Regulatory Commission
• Standardized Plant Analysis Risk - Human Reliability Analysis (SPAR-H), a method to take account of the potential for human error
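As an illustrative sketch of the SPAR-H idea (not the full NUREG method): a nominal human error probability (HEP) for a task is scaled by the product of performance shaping factor (PSF) multipliers, with an adjustment so the result stays a valid probability when several PSFs are negative. The nominal values and multipliers below are examples, not prescriptions.

```python
from math import prod

def spar_h_hep(nominal_hep: float, psf_multipliers: list[float]) -> float:
    """SPAR-H style calculation: scale the nominal HEP by the product of
    the PSF multipliers.  When three or more PSFs are negative
    (multiplier > 1), apply the adjustment formula so the result
    remains below 1.0."""
    s = prod(psf_multipliers)
    negative = sum(1 for m in psf_multipliers if m > 1)
    if negative >= 3:
        return (nominal_hep * s) / (nominal_hep * (s - 1) + 1)
    return min(nominal_hep * s, 1.0)

# An action task (nominal HEP 0.001) under high stress (x2) with
# barely adequate time (x10): the HEP rises from 0.1% to 2%.
print(spar_h_hep(0.001, [2, 10]))  # 0.02
```

Even this toy calculation shows why error-producing conditions matter: a few unfavorable PSFs can raise the error probability by an order of magnitude or more.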
• OECD’s Nuclear Energy Agency: ways to avoid human error, e.g.,
• designing systems to limit the need for human intervention;
• distinctive and consistent labelling of equipment, control panels and documents;
• displaying information on the state of the plant so that operators do not need to guess and make a faulty diagnosis;
• designing systems to give unambiguous responses to operator actions so incorrect actions can be identified easily; and
• training operators better for plant emergencies, including the use of simulators.
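The “unambiguous responses” point translates directly to data center operations. A minimal sketch, assuming a hypothetical ops script (not any specific vendor tool): before a destructive action, the operator must retype the exact equipment label from the panel tag, so acting on the wrong unit fails loudly instead of silently.

```python
def confirm_target(expected_label: str, typed_label: str) -> bool:
    """Forcing function: the operator retypes the exact equipment label
    (e.g. from the physical panel tag) before the action is allowed."""
    return typed_label.strip() == expected_label

def open_breaker(label: str, typed_confirmation: str) -> str:
    """Refuse the operation unless the confirmation matches exactly,
    so a near-miss like 'PDU-2A' vs 'PDU-2B' is caught immediately."""
    if not confirm_target(label, typed_confirmation):
        return f"REFUSED: '{typed_confirmation}' does not match '{label}'"
    return f"OK: breaker {label} opened"

print(open_breaker("PDU-2A", "PDU-2B"))  # refused: wrong unit typed
```

This only works if labelling is distinctive and consistent, which is exactly why the two recommendations appear together.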
8
RISK MITIGATION
9
• PDCA cycle (adapted from BS 25999), with a shared knowledge base at its centre:
• Plan: establish policy and improvement targets; baseline, documentation and records; current design and infrastructure; capacity; past incidents; resources (internal, external); compliance with regulations; skills gap analysis; knowledge gap
• Do: training and development; incident response exercise; table-top exercise; MOS / RA risk matrix review
• Check: internal audit; external audit / review; management review; system and operations review
• Act: update SOP; update BCP / IT DRP; review and carry out corrections
CAN WE DO SOMETHING ABOUT IT?
• Review existing data center processes (planning, commissioning, deployment and service) to identify opportunities for automation, and areas where missing data or skills create inefficiencies in light of newly available technologies.
• Analyze the skill set of the data center management team against future requirements to identify opportunities for collaboration, and areas where internal skills need to be supplemented with new hires or outside resources.
• Review current infrastructure technologies, configurations and capacity for their ability to meet efficiency, availability and scalability goals.
• Ensure the foundation for effective management is in place: an up-to-date visual model of the data center and centralized monitoring of infrastructure systems. This will likely include deploying a BMS / DCIM platform.
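Centralized monitoring ultimately reduces to evaluating each reading against a limit and alerting before capacity or environmental thresholds are breached. A minimal sketch, with hypothetical sensor names and limits rather than a real BMS / DCIM API:

```python
# Hypothetical limits; a real BMS / DCIM would hold these in its config.
LIMITS = {"ups_load_pct": 80.0, "cold_aisle_temp_c": 27.0}

def check_readings(readings: dict[str, float]) -> list[str]:
    """Return an alert line for every reading that exceeds its limit."""
    return [
        f"ALERT {name}={value} exceeds limit {LIMITS[name]}"
        for name, value in readings.items()
        if name in LIMITS and value > LIMITS[name]
    ]

alerts = check_readings({"ups_load_pct": 91.5, "cold_aisle_temp_c": 24.1})
print(alerts)  # one alert: UPS load over 80%
```

The value is not the check itself but its centralization: one place to set thresholds, one place to audit them, no reliance on an operator noticing a gauge.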
10
CAN WE DO SOMETHING ABOUT IT?
• Error Reduction Strategies, ordered by power (leverage), highest first:
• Fail-safes and constraints (highest)
• Forcing functions
• Automation and computerization
• Human error reduction tools
• Standardization
• Redundancies
• Reminders and checklists
• Rules and policies
• Education and information (lowest)
11
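Why do fail-safes and constraints sit at the top of the leverage list? Because they make the safe path the default, requiring no vigilance from the human. A minimal sketch, using a hypothetical admin function (not a real platform API):

```python
def delete_vm(vm_id: str, *, confirmed: bool = False) -> str:
    """Fail-safe default: without an explicit confirmed=True, the call
    only reports what it *would* do.  The destructive path requires a
    deliberate extra step; forgetting the flag destroys nothing."""
    if not confirmed:
        return f"DRY RUN: would delete {vm_id} (pass confirmed=True to execute)"
    return f"DELETED {vm_id}"

print(delete_vm("vm-042"))  # dry run, nothing destroyed
```

Contrast this with a reminder or a policy document (bottom of the list): those depend on memory and attention, the two things error precursors erode first.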
EARLY SIGNS - ERROR PRECURSORS
Human Nature
• Limited short-term memory
• Mental shortcuts (biases)
• Inaccurate risk perception (Pollyanna)
• Mindset (“tuned” to see)
• Complacency / overconfidence
• Assumptions (inaccurate mental picture)
• Habit patterns
• Stress (limits attention)
Work Environment
• Personality conflicts
• Lack of alternative indication
• Unexpected equipment conditions
• Hidden system response
• Workarounds / OOS instruments
• Confusing displays or controls
• Changes / departures from routine
• Distractions / interruptions
Individual Capabilities
• Illness / fatigue
• “Hazardous” attitude for critical task
• Indistinct problem-solving skills
• Lack of proficiency / inexperience
• Imprecise communication habits
• New technique not used before
• Lack of knowledge (mental model)
• Unfamiliarity w/ task / first time
Task Demands
• Lack of or unclear standards
• Unclear goals, roles, & responsibilities
• Interpretation requirements
• Irrecoverable acts
• Repetitive actions, monotonous
• Simultaneous, multiple tasks
• High workload (memory requirements)
• Time pressure (in a hurry)
Source: http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
HUMAN PERFORMANCE TOOLS
• Communications
• Critical Steps
• Enhanced Pre-Job Briefing
• Self Check
• Peer Check
• Independent Verification
• Error Traps
• Just Culture
• Effective Communication
• Questioning Attitude
• Enhanced Turnover
• Error Precursors
• Performance/Error Modes
• Tagging
• Record Keeping
• Poka Yoke
• SAFE Dialogue
• STAR
• Training
CAN WE DO SOMETHING ABOUT IT?
• Organization
• Resource
• Catch-all
• Capacity
• MOS / RA
• A robust and regularly tested incident and problem management process
• No blame: actively
• anticipate errors;
• try to understand the potential causes of error; and
• make sure we learn from past mistakes so that we can minimize their occurrence and impact on our critical systems.
• Invest in training
• Written documentation
14
KEY TAKEAWAYS
• It is worthwhile to commit resources to reducing errors
• We can improve our resiliency and thereby our uptime
• There are proven methods and tools
15
REFERENCES
• https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
• http://news.delta.com/chief-operating-officer-gives-delta-operations-update
• https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
• http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
• https://www.oecd-nea.org/brief/brief-02.html
16