Handling incidents

How to handle incidents, downtime & outages
Devopsdays, Amsterdam 2015
David Mytton, Founder, Server Density

Upload: server-density

Post on 06-Aug-2015


Cost of uptime?

$2.9bn (Q1 2015)

$870m (Q1 2015)

$4.1bn (Q1 2015)

How much are you spending?

Expect downtime

• Prepare

• Respond

• Postmortem

Prepare

• On call

  • Primary/secondary

  • Reachability

• Off call

• Docs

  • Searchable

  • Independent
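
The primary/secondary on-call arrangement can be sketched as a simple escalation loop. This is only an illustration under stated assumptions: `Responder`, `page_until_acknowledged` and the `page` callback are hypothetical names, not part of any tool mentioned in the talk, and the acknowledgement timeout is left to the callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str
    role: str  # "primary" or "secondary"


def page_until_acknowledged(
    rota: Sequence[Responder],
    page: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in rota order; return the first who acknowledges.

    `page` is a hypothetical callback that sends the alert and returns True
    if the responder acknowledged within whatever timeout the team chose.
    """
    for responder in rota:
        if page(responder):
            return responder
    return None  # nobody was reachable: escalate to the wider team


# Example: the primary is unreachable, so the secondary picks up the alert.
rota = [Responder("alice", "primary"), Responder("bob", "secondary")]
reachable = {"alice": False, "bob": True}
owner = page_until_acknowledged(rota, lambda r: reachable[r.name])
print(owner.name if owner else "no one acknowledged")
```

Passing the paging mechanism in as a callback keeps the rota logic testable without sending real alerts.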

Prepare

• Key info

  • Team contacts

  • Vendor contacts

  • Key credentials

• Unexpected situations

  • Communication

  • Internet access

  • Support access

Respond

• First responder

1. Load incident response checklist

2. Log into Ops War Room

3. Log incident in JIRA

4. Begin investigation
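
One way to make the first responder steps hard to skip is to encode them as an ordered checklist that records when each step was completed. The sketch below is an assumption for illustration: the `Checklist` class is hypothetical, and the steps are plain strings rather than real integrations with a chat room or JIRA.

```python
from datetime import datetime, timezone

# The four first responder steps, in the order given on the slide.
FIRST_RESPONDER_STEPS = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]


class Checklist:
    """Ordered checklist that timestamps each completed step (hypothetical helper)."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []  # list of (step, UTC timestamp) tuples

    def next_step(self):
        """Return the next pending step, or None when every step is done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete_next(self):
        """Mark the next pending step as done and return its description."""
        step = self.next_step()
        if step is None:
            raise RuntimeError("checklist already complete")
        self.completed.append((step, datetime.now(timezone.utc)))
        return step


checklist = Checklist(FIRST_RESPONDER_STEPS)
while checklist.next_step() is not None:
    print("done:", checklist.complete_next())
```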

Respond

• Key response principles

• Log everything

• Frequent public updates

• Gather the team

• Escalate!
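
"Log everything" and "Frequent public updates" can be combined in a single timestamped incident log that also flags when a public update is overdue. This is a minimal sketch: `IncidentLog`, its methods, and the 30-minute interval used in the example are assumptions for illustration, not figures from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    at: datetime
    author: str
    message: str
    public: bool = False  # True if this entry was also posted publicly


@dataclass
class IncidentLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, author: str, message: str, public: bool = False,
               at: Optional[datetime] = None) -> LogEntry:
        """Append a timestamped entry; defaults to the current UTC time."""
        entry = LogEntry(at or datetime.now(timezone.utc), author, message, public)
        self.entries.append(entry)
        return entry

    def public_update_overdue(self, now: datetime, max_gap: timedelta) -> bool:
        """True if no public update exists or the latest is older than max_gap."""
        public_times = [e.at for e in self.entries if e.public]
        if not public_times:
            return True
        return now - max(public_times) > max_gap


# Example with fixed times; the 30-minute gap is an assumed team policy.
start = datetime(2015, 6, 25, 12, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.record("first responder", "Alert received", at=start)
log.record("first responder", "Investigating elevated error rates",
           public=True, at=start + timedelta(minutes=5))
print(log.public_update_overdue(start + timedelta(minutes=40),
                                timedelta(minutes=30)))
```

Keeping internal notes and public updates in one log means the postmortem timeline can later be rebuilt from a single source.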

Postmortem

• Within a few days

• Tell the story

• Appropriate technical detail

• What failed, why?

• How it’s going to be fixed
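
The postmortem points above map naturally onto a small template that can be checked for completeness before publishing. The sketch below is hypothetical: the `Postmortem` class and its field names are chosen for illustration and are not taken from the talk.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    story: str = ""        # narrative of what happened, at appropriate technical detail
    what_failed: str = ""  # the component or process that failed
    why: str = ""          # the cause of the failure
    fix: str = ""          # how it is going to be fixed

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


draft = Postmortem(story="Primary database ran out of disk during a backup.")
print(draft.missing_sections())
```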


Thank you very much

[email protected]

@davidmytton