TRANSCRIPT
HUMAN ERROR: THE BIGGEST CHALLENGE TO AVAILABILITY
James Soh 苏旭江, Newwit Consultancy 新想咨询
January 2017
1
AVAILABILITY
Systems will fail, even supposedly “fool-proof” ones; hence redundancy, BCP and IT DRP.
Availability rests on good engineering design; rigor of implementation, testing and commissioning; automated monitoring; rigorous operations and maintenance; and proper use (capacity/resiliency).
Human Error:
Means that something has been done that deviates from intention, expectation or desirability.
It is a misconception that a Tier 4 data center can withstand an incident caused by human error.
2
TWO KINDS OF HUMAN ERROR
• Active error: an unsafe act whose effect is felt immediately (e.g., an operator’s slip during a switching operation)
• Latent error: a dormant flaw in design, procedure or maintenance that surfaces only when triggered later
• Outcome: monetary loss, reputational damage, loss of life
CASE
4
• Trending up: the average cost of an outage rose 28% (from 2013 to 2016) to USD 740,357
- According to the 2016 Ponemon Institute Research Report
The root cause of downtime can often be directly or indirectly linked to human error.
COST OF DOWNTIME
5
CONSEQUENCE OF HUMAN ERROR AND UNAVAILABILITY DURATION
• Weakest link: it takes only one person to trigger an incident
• An incident can be just a near-miss or minor (not impacting critical operations)...
• ...or it can have major consequences
• Interaction with cascading failures can turn it into a major outage and unavailability
• Service recovery is not complete merely when power returns or a setting is restored
6
ANALYSIS OF OUTAGES CAUSED BY HUMAN ERRORS
• Inherent design / setting flaw; outdated / “Swiss cheese” situation; requires analysis and manual intervention; error-producing conditions (EPC)
• Weakness in manual processes; inadequate automation; inadequate training / familiarity; inadequate operations procedures
• Insufficient information / knowledge; capacity; inadequate training / knowledge; inadequate documentation
• Insufficient risk assessment; MOS / RA, risk matrix; vendor experience
7
WHAT HAVE OTHERS DONE TO MINIMIZE HUMAN ERROR?
• Airline’s Crew Resource Management
• US Nuclear Regulatory Commission
• Standardized Plant Analysis Risk - Human Reliability Analysis (SPAR-H), a method to take account of the potential for human error
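As an illustrative sketch of the SPAR-H idea (not the full NUREG method): a nominal human error probability (HEP) for a task is scaled by the product of performance shaping factor (PSF) multipliers, with an adjustment so the result stays a valid probability when several PSFs are negative. The nominal values and multipliers below are examples, not prescriptions.

```python
from math import prod

def spar_h_hep(nominal_hep: float, psf_multipliers: list[float]) -> float:
    """SPAR-H style calculation: scale the nominal HEP by the product of
    the PSF multipliers.  When three or more PSFs are negative
    (multiplier > 1), apply the adjustment formula so the result
    remains below 1.0."""
    s = prod(psf_multipliers)
    negative = sum(1 for m in psf_multipliers if m > 1)
    if negative >= 3:
        return (nominal_hep * s) / (nominal_hep * (s - 1) + 1)
    return min(nominal_hep * s, 1.0)

# An action task (nominal HEP 0.001) under high stress (x2) with
# barely adequate time (x10): the HEP rises from 0.1% to 2%.
print(spar_h_hep(0.001, [2, 10]))  # 0.02
```

Even this toy calculation shows why error-producing conditions matter: a few unfavorable PSFs can raise the error probability by an order of magnitude or more.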
• OECD’s Nuclear Energy Agency: ways to avoid human error, e.g.,
• designing systems to limit the need for human intervention;
• distinctive and consistent labelling of equipment, control panels and documents;
• displaying information on the state of the plant so that operators do not need to guess and make a faulty diagnosis;
• designing systems to give unambiguous responses to operator actions so incorrect actions can be identified easily; and
• training operators better for plant emergencies, including the use of simulators.
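The “unambiguous responses” point translates directly to data center operations. A minimal sketch, assuming a hypothetical ops script (not any specific vendor tool): before a destructive action, the operator must retype the exact equipment label from the panel tag, so acting on the wrong unit fails loudly instead of silently.

```python
def confirm_target(expected_label: str, typed_label: str) -> bool:
    """Forcing function: the operator retypes the exact equipment label
    (e.g. from the physical panel tag) before the action is allowed."""
    return typed_label.strip() == expected_label

def open_breaker(label: str, typed_confirmation: str) -> str:
    """Refuse the operation unless the confirmation matches exactly,
    so a near-miss like 'PDU-2A' vs 'PDU-2B' is caught immediately."""
    if not confirm_target(label, typed_confirmation):
        return f"REFUSED: '{typed_confirmation}' does not match '{label}'"
    return f"OK: breaker {label} opened"

print(open_breaker("PDU-2A", "PDU-2B"))  # refused: wrong unit typed
```

This only works if labelling is distinctive and consistent, which is exactly why the two recommendations appear together.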
8
RISK MITIGATION
9
• PDCA cycle (adapted from BS 25999), with a shared knowledge base at its centre:
• Plan: establish policy and improvement targets; baseline, documentation and records; current design and infrastructure; capacity; past incidents; resources (internal, external); compliance with regulations; skills gap analysis; knowledge gap
• Do: training and development; incident response exercise; table-top exercise; MOS / RA risk matrix review
• Check: internal audit; external audit / review; management review; system and operations review
• Act: update SOP; update BCP / IT DRP; review and carry out corrections
CAN WE DO SOMETHING ABOUT IT?
• Review existing data center processes (planning, commissioning, deployment and service) to identify opportunities for automation, and areas where missing data or skills create inefficiencies in light of newly available technologies.
• Analyze the skill set of the data center management team against future requirements to identify opportunities for collaboration, and areas where internal skills need to be supplemented with new hires or outside resources.
• Review current infrastructure technologies, configurations and capacity for their ability to meet efficiency, availability and scalability goals.
• Ensure the foundation for effective management is in place: an up-to-date visual model of the data center and centralized monitoring of infrastructure systems. This will likely include deploying a BMS / DCIM platform.
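Centralized monitoring ultimately reduces to evaluating each reading against a limit and alerting before capacity or environmental thresholds are breached. A minimal sketch, with hypothetical sensor names and limits rather than a real BMS / DCIM API:

```python
# Hypothetical limits; a real BMS / DCIM would hold these in its config.
LIMITS = {"ups_load_pct": 80.0, "cold_aisle_temp_c": 27.0}

def check_readings(readings: dict[str, float]) -> list[str]:
    """Return an alert line for every reading that exceeds its limit."""
    return [
        f"ALERT {name}={value} exceeds limit {LIMITS[name]}"
        for name, value in readings.items()
        if name in LIMITS and value > LIMITS[name]
    ]

alerts = check_readings({"ups_load_pct": 91.5, "cold_aisle_temp_c": 24.1})
print(alerts)  # one alert: UPS load over 80%
```

The value is not the check itself but its centralization: one place to set thresholds, one place to audit them, no reliance on an operator noticing a gauge.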
10
CAN WE DO SOMETHING ABOUT IT?
• Error Reduction Strategies, ordered by power (leverage), highest first:
• Fail-safes and constraints (highest)
• Forcing functions
• Automation and computerization
• Human error reduction tools
• Standardization
• Redundancies
• Reminders and checklists
• Rules and policies
• Education and information (lowest)
11
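Why do fail-safes and constraints sit at the top of the leverage list? Because they make the safe path the default, requiring no vigilance from the human. A minimal sketch, using a hypothetical admin function (not a real platform API):

```python
def delete_vm(vm_id: str, *, confirmed: bool = False) -> str:
    """Fail-safe default: without an explicit confirmed=True, the call
    only reports what it *would* do.  The destructive path requires a
    deliberate extra step; forgetting the flag destroys nothing."""
    if not confirmed:
        return f"DRY RUN: would delete {vm_id} (pass confirmed=True to execute)"
    return f"DELETED {vm_id}"

print(delete_vm("vm-042"))  # dry run, nothing destroyed
```

Contrast this with a reminder or a policy document (bottom of the list): those depend on memory and attention, the two things error precursors erode first.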
EARLY SIGNS - ERROR PRECURSORS
Human Nature
• Limited short-term memory
• Mental shortcuts (biases)
• Inaccurate risk perception (Pollyanna)
• Mindset (“tuned” to see)
• Complacency / overconfidence
• Assumptions (inaccurate mental picture)
• Habit patterns
• Stress (limits attention)
Work Environment
• Personality conflicts
• Lack of alternative indication
• Unexpected equipment conditions
• Hidden system response
• Workarounds / OOS instruments
• Confusing displays or controls
• Changes / departures from routine
• Distractions / interruptions
Individual Capabilities
• Illness / fatigue
• “Hazardous” attitude for critical task
• Indistinct problem-solving skills
• Lack of proficiency / inexperience
• Imprecise communication habits
• New technique not used before
• Lack of knowledge (mental model)
• Unfamiliarity w/ task / first time
Task Demands
• Lack of or unclear standards
• Unclear goals, roles, & responsibilities
• Interpretation requirements
• Irrecoverable acts
• Repetitive actions, monotonous
• Simultaneous, multiple tasks
• High workload (memory requirements)
• Time pressure (in a hurry)
Source: http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
HUMAN PERFORMANCE TOOLS
• Communications
• Critical Steps
• Enhanced Pre-Job Briefing
• Self Check
• Peer Check
• Independent Verification
• Error Traps
• Just Culture
• Effective Communication
• Questioning Attitude
• Enhanced Turnover
• Error Precursors
• Performance/Error Modes
• Tagging
• Record Keeping
• Poka Yoke
• SAFE Dialogue
• STAR
• Training
CAN WE DO SOMETHING ABOUT IT?
• Organization
• Resource
• Catch-all
• Capacity
• MOS / RA
• A robust and regularly tested incident and problem management process
• No blame: actively
• anticipate errors;
• try to understand the potential causes of error; and
• make sure we learn from past mistakes so that we can minimize their occurrence and impact on our critical systems.
• Invest in training
• Written documentation
14
KEY TAKEAWAYS
• It is worthwhile to commit resources to reducing errors
• We can improve our resiliency and thereby our uptime
• There are proven methods and tools
15
REFERENCES
• https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
• http://news.delta.com/chief-operating-officer-gives-delta-operations-update
• https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
• http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
• https://www.oecd-nea.org/brief/brief-02.html
16