Responding to Outages Maturely John Allspaw SVP, Tech Ops Code As Craft, Berlin Tuesday, April 24, 12

Upload: john-allspaw

Post on 01-Sep-2014


TRANSCRIPT

Page 1: Responding to Outages Maturely

Responding to Outages Maturely

John Allspaw, SVP, Tech Ops

Code As Craft, Berlin

Tuesday, April 24, 12

Page 2: Responding to Outages Maturely

OPERABILITY


Page 3: Responding to Outages Maturely

PRODUCTION


Page 4: Responding to Outages Maturely

http://WhoOwnsMyAvailability.com



Page 6: Responding to Outages Maturely

How important is this?



Page 19: Responding to Outages Maturely

How important is this?


Page 20: Responding to Outages Maturely

How Can This Happen?


Page 21: Responding to Outages Maturely

Complicated? Complex?


Page 22: Responding to Outages Maturely

Complex Systems

• Cascading Failures

• Difficult to determine boundaries

• Complex systems may be open

• Complex systems may have a memory

• Complex systems may be nested

• Dynamic network of multiplicity

• May produce emergent phenomena

• Relationships are non-linear

• Relationships contain feedback loops

Page 23: Responding to Outages Maturely

How Can This Happen?

It does happen.

And it will again.

And again.


Page 25: Responding to Outages Maturely

Optimization

MTBF

MTTR
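The slide names only the two metrics. As a rough sketch of how they relate, the Python below computes MTBF and MTTR from a list of outages and combines them with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The `Outage` record, the function names, and the sample data are assumptions made for illustration; none of them come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hour at which service was restored


def mttr(outages):
    """Mean time to repair: average outage duration, in hours."""
    return sum(o.end_hour - o.start_hour for o in outages) / len(outages)


def mtbf(outages, window_hours):
    """Mean time between failures: total uptime divided by failure count."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return (window_hours - downtime) / len(outages)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical data: three outages in a 720-hour (30-day) window.
outages = [Outage(100.0, 101.0), Outage(300.0, 300.5), Outage(600.0, 601.5)]

repair = mttr(outages)
between = mtbf(outages, 720.0)
print(f"MTTR={repair} h, MTBF={between} h, "
      f"availability={availability(between, repair):.4f}")
```

With the failure count held fixed, lowering MTTR raises availability, which is one way to read the slide's pairing of the two metrics under "Optimization".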

Page 26: Responding to Outages Maturely

http://www.flickr.com/photos/sparktography/75499095/

Page 27: Responding to Outages Maturely

How does team troubleshooting happen?

Page 28: Responding to Outages Maturely

Timeline (axis label: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem

Page 29: Responding to Outages Maturely

Timeline (axis labels: Time, Stress): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
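One way to make the timeline on these two slides concrete is to record a timestamp for each phase and measure the gaps between them. The sketch below is only an illustration: the phase names come from the slides, but the `phase_durations` helper and the incident timestamps are hypothetical. PostMortem is left out because it is treated here as a later review, not a moment in the response, which is a simplifying assumption.

```python
from datetime import datetime

# Phase names as they appear on the slide, in timeline order.
PHASES = [
    "Problem Starts", "Detection", "Evaluation", "Response",
    "Stable", "Confirmation", "All Clear",
]


def phase_durations(timestamps):
    """Return the minutes elapsed between each consecutive pair of phases."""
    durations = {}
    for earlier, later in zip(PHASES, PHASES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        durations[f"{earlier} -> {later}"] = delta.total_seconds() / 60
    return durations


# Hypothetical incident, invented for illustration only.
incident = {
    "Problem Starts": datetime(2012, 4, 24, 9, 0),
    "Detection": datetime(2012, 4, 24, 9, 7),
    "Evaluation": datetime(2012, 4, 24, 9, 15),
    "Response": datetime(2012, 4, 24, 9, 20),
    "Stable": datetime(2012, 4, 24, 9, 50),
    "Confirmation": datetime(2012, 4, 24, 10, 5),
    "All Clear": datetime(2012, 4, 24, 10, 15),
}

for gap, minutes in phase_durations(incident).items():
    print(f"{gap}: {minutes:.0f} min")
```

Breaking an outage down this way shows where the time went (detection, evaluation, or the response itself), which a single time-to-repair figure hides.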

Page 30: Responding to Outages Maturely

Forced beyond learned roles

Actions whose consequences are both important and difficult to see

Cognitively and perceptively noisy

Coordinative load increases exponentially


Page 32: Responding to Outages Maturely

So What Can We Do?


Page 33: Responding to Outages Maturely

We Learn From Others


Page 34: Responding to Outages Maturely

Characteristics of response to escalating scenarios


Page 35: Responding to Outages Maturely

...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980


Page 36: Responding to Outages Maturely

...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
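A small numeric sketch can show why exponential developments are hard to judge. It takes two observations of some metric and projects it forward twice: once assuming a constant increase per minute, and once assuming a constant growth ratio. The metric (a queue depth), the numbers, and the function names are hypothetical and chosen only for illustration.

```python
def linear_forecast(v0, v1, step_minutes, t_minutes):
    """Project forward assuming the same absolute increase per minute."""
    slope = (v1 - v0) / step_minutes
    return v0 + slope * t_minutes


def exponential_forecast(v0, v1, step_minutes, t_minutes):
    """Project forward assuming the same growth ratio per step."""
    ratio = v1 / v0
    return v0 * ratio ** (t_minutes / step_minutes)


# Hypothetical: a queue depth of 100 that has doubled to 200 in 5 minutes.
print(linear_forecast(100, 200, 5, 30))       # linear guess at 30 minutes
print(exponential_forecast(100, 200, 5, 30))  # guess if the doubling continues
```

Both projections agree at the 5-minute mark, so the first two observations cannot tell them apart. By 30 minutes the doubling projection is several times larger than the linear one.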


Page 37: Responding to Outages Maturely

...inclined to think in causal series, instead of causal nets.

A therefore B,

instead of

A, therefore B and C (therefore D and E), etc.

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
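The difference between a causal series and a causal net can be sketched as a graph walk. In the Python below, a net is a mapping from each cause to its direct effects, and `downstream_effects` collects everything reachable from a starting event. The edges are an assumption: the slide does not say which of B or C leads to D and E, so the example assigns one to each.

```python
from collections import deque


def downstream_effects(causal_net, start):
    """Breadth-first walk returning every effect reachable from `start`."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for effect in causal_net.get(node, []):
            if effect != start and effect not in seen:
                seen.append(effect)
                queue.append(effect)
    return seen


# Causal series: A therefore B.
series = {"A": ["B"]}

# Causal net (edges assumed for illustration): A therefore B and C,
# B therefore D, C therefore E.
net = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}

print(downstream_effects(series, "A"))
print(downstream_effects(net, "A"))
```

Thinking in a series stops at B. Walking the net from the same event also reaches C, D, and E, which is the wider set of consequences the slide says responders tend to miss.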


Page 38: Responding to Outages Maturely

Thematic Vagabonding

Pitfalls


Page 39: Responding to Outages Maturely

Pitfalls

Goal Fixation (encystment)

Page 40: Responding to Outages Maturely

Pitfalls

Refusal to make decisions


Page 41: Responding to Outages Maturely

Non-communicating lone wolf-isms

Heroism


Page 42: Responding to Outages Maturely

Irrelevant noise in comm channels

Distraction


Page 43: Responding to Outages Maturely

Jens Rasmussen, 1983

Senior Member, IEEE

“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models,” IEEE Transactions On Systems, Man, and Cybernetics, May 1983

Page 44: Responding to Outages Maturely

SKILL - BASED

Simple, routine

RULE - BASED

Knowable, but unfamiliar

KNOWLEDGE - BASED

WTF IS GOING ON?

(Reason, 1990)

Page 45: Responding to Outages Maturely

• Which causes did you consider first?

• Which ones did you not consider at all?

• How much of what you considered comes from recent history?

• How much comes from observations from other team members?

Team Troubleshooting


Page 46: Responding to Outages Maturely

• How effective is the response team in communicating to other groups? Users?

• How long does it take to exhaust obvious cause(s)?

Team Troubleshooting


Page 47: Responding to Outages Maturely

Team Dynamics


Page 48: Responding to Outages Maturely

• Air Traffic Control

• Naval Air Operations At Sea

• Electrical Power Systems

• Etc.

High Reliability Organizations

• Complex Socio-Technical systems

• Efficiency <-> Thoroughness

• Time/Resource Constrained

• Engineering-driven



Page 50: Responding to Outages Maturely

“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea,” Rochlin, La Porte, and Roberts. Naval War College Review 1987

http://govleaders.org/reliability.htm



Page 52: Responding to Outages Maturely

Close interdependence between groups


Page 53: Responding to Outages Maturely

Close reciprocal coordination and information sharing, resulting in overlapping knowledge


Page 54: Responding to Outages Maturely

High redundancy: multiple people observing the same event and sharing information


Page 55: Responding to Outages Maturely

Broad definition of who belongs to the team.


Page 56: Responding to Outages Maturely

Teammates are included in the communication loops rather than excluded.


Page 57: Responding to Outages Maturely

Lots of error correction.


Page 58: Responding to Outages Maturely

High levels of situation comprehension: maintain constant awareness of the possibility of accidents.


Page 59: Responding to Outages Maturely

High levels of interpersonal skills


Page 60: Responding to Outages Maturely

Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.


Page 61: Responding to Outages Maturely

Patterns of authority are changed to meet the demands of the events: organizational flexibility.


Page 62: Responding to Outages Maturely

The reporting of errors and faults is rewarded, not punished.


Page 63: Responding to Outages Maturely

So What Else Can We Do?

Page 64: Responding to Outages Maturely

We Drill


Page 65: Responding to Outages Maturely

We GameDay



Page 67: Responding to Outages Maturely

We Learn To Improvise


Page 68: Responding to Outages Maturely

IMPROVISATION


Page 69: Responding to Outages Maturely

IMPROVISATION


Page 70: Responding to Outages Maturely

We Learn From Our Mistakes


Page 71: Responding to Outages Maturely

Postmortems

• Full timelines: What happened, when, who involved

• Review in public, everyone invited

• Search for “second stories” instead of “human error”

• Cultivating a blameless environment

• Giving requisite authority to individuals to improve things


Page 72: Responding to Outages Maturely

High signal:noise in comm channels?

Troubleshooting fatigue?

Troubleshooting handoff?

All tools on-hand and working?

Improvised tooling or solutions?

Metrics visibility?

Collaborative and skillful communication?

Qualifying Response


Page 73: Responding to Outages Maturely

Remediation


Page 74: Responding to Outages Maturely

We Share Near-Miss Events

Page 75: Responding to Outages Maturely

Near Misses

Hey everybody -

Don’t be like me. I tried to X, but that wasn’t a good idea.

It almost exploded everyone.

So, don’t do: (details about X)

Love, Joe


Page 76: Responding to Outages Maturely

• Can act like “vaccines” - help system safety without actually hurting anything

• Happen more often, so provide more data on latent failures

• Powerful reminder of hazards, and slows down the process of forgetting to be afraid

Near Misses


Page 77: Responding to Outages Maturely

Practice!

• How we troubleshoot in the moment, as a distributed team

• How we handle time pressure

• How we Observe/Orient/Decide/Act

• How we communicate during emergencies

• How we trust (or not) each other during emergencies

• How we relate to emergencies when things are normal

• How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)


Page 78: Responding to Outages Maturely

Resilient Response

• Can learn from other fields

• Can train for outages

• Can learn from mistakes

• Can learn from successes as well as failures


Page 79: Responding to Outages Maturely

http://www.flickr.com/photos/sparktography/75499095/

Page 80: Responding to Outages Maturely

THE END


Page 81: Responding to Outages Maturely

A parting word

A parting challenge

Page 82: Responding to Outages Maturely

Two Propositions


Page 83: Responding to Outages Maturely

100 changes

6 change-related issues

Page 84: Responding to Outages Maturely

100 > 6


Page 85: Responding to Outages Maturely

Proposition #1

“Ways in which things go right are special cases of the ways in which things go wrong.”


Page 86: Responding to Outages Maturely

Proposition #1

Successes = failures gone wrong

Study the failures, generalize from that.

Potential data sources: 6 out of 100


Page 87: Responding to Outages Maturely

Proposition #2

“Ways in which things go wrong are special cases of the ways in which things go right.”


Page 88: Responding to Outages Maturely

Proposition #2

Failures = successes gone wrong

Study the successes, generalize from that.

Potential data sources: 94 out of 100

Page 89: Responding to Outages Maturely

94/100 ?

OR

6/100 ?

Page 90: Responding to Outages Maturely

What and WHY Do Things Go RIGHT?


Page 91: Responding to Outages Maturely

Not just: why did we fail?

But also: why did we succeed?


Page 92: Responding to Outages Maturely

Mature Role of Automation

http://www.bainbrdg.demon.co.uk/Papers/Ironies.html

“Ironies of Automation” - Lisanne Bainbridge


Page 93: Responding to Outages Maturely

Mature Role of Automation

• Moves humans from manual operator to supervisor

• Extends and augments human abilities, doesn’t replace them

• Doesn’t remove “human error”

• Is brittle

• Recognizes that there is always discretionary space for humans

• Recognizes the Law of Stretched Systems


Page 94: Responding to Outages Maturely

Law of Stretched Systems

“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”

D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006
