Incident Management and ChatOps - USENIX

Incident Management and ChatOps Daniella Niyonkuru (@niyodanie)

Post on 27-Mar-2018

Incident Management and ChatOps❤🔥💬🤖

Daniella Niyonkuru (@niyodanie)

Production Engineering

🔥

INCIDENT MANAGER ON-CALL (IMOC)

IMOC

Support Response Manager (SRM)

Component Experts

Incident Command System (ICS)

The IMOC is on-call for Incident Response;

NOT on-duty for fixing production issues.

Incident Response Funnel

➡ Shit breaks ➡ Detection ➡ Start Incident ➡ Communicate ➡ Fix ➡ Stop Incident ➡ Document (Service Disruption) ➡ Investigation ➡ Root Cause Analysis (RCA) ➡ Action Items ➡ Resolution

Credit: John Arthorne

Pager Anxiety; What if …

• Forget I’m on-call

• Phone in silent mode

• Forget to update the status page

• Don’t know who to ping

• Too much context switching, can't focus

• Forget the incident response procedure

😱

CHATOPS

Conversation-Driven Collaboration

ChatOps at Shopify

INCIDENTS AND CHATOPS

Incident Management Tools

Three main sets of commands:

• spy page

• spy incident

• spy status
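The three command families above could be routed by a small dispatcher. What follows is a minimal sketch, not Spy's actual implementation: the handler functions, their return strings, and the `dispatch` helper are all hypothetical, since the talk names the commands but does not show how they are wired up.

```python
# Hypothetical handlers: the slides name the command families only.
def handle_page(args):
    return "paging: " + " ".join(args)


def handle_incident(args):
    return "incident: " + " ".join(args)


def handle_status(args):
    return "status: " + " ".join(args)


# Map the word after "spy" to the function that handles it.
HANDLERS = {
    "page": handle_page,
    "incident": handle_incident,
    "status": handle_status,
}


def dispatch(message):
    """Route a chat message such as 'spy page imoc ...' to its handler.

    Returns None when the message is not addressed to the bot or the
    command family is unknown.
    """
    words = message.split()
    if len(words) < 2 or words[0] != "spy":
        return None
    handler = HANDLERS.get(words[1])
    if handler is None:
        return None
    return handler(words[2:])
```

Keeping the routing in one table means that adding a fourth command family is a single new entry rather than a change to the parsing logic.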

Shit breaks

➡ spy page imoc “order notifications not going out”

Start incident

➡ spy incident start me order fraud analysis outage

Communicate

➡ spy incident tldr

Other Teams

➡ spy incident tell :team message

➡ spy page datastores

Actions

Third-Party Services

➡ spy status

➡ spy status :provider :status for :feature

➡ spy pager imoc res 123
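Command templates such as `status :provider :status for :feature` use `:name` placeholders for the parts the responder fills in. One way such a template could be matched against chat input is sketched below. This is an assumption about the mechanism, not Spy's real parser: `compile_template` and `match_command` are hypothetical helpers, and the provider name `acme` in the usage example is made up.

```python
import re


def compile_template(template):
    """Turn 'status :provider :status for :feature' into a regex.

    Words starting with ':' become named groups. Each matches a single
    word, except the final placeholder, which takes the rest of the line.
    All other words must match literally.
    """
    words = template.split()
    parts = []
    for index, word in enumerate(words):
        if word.startswith(":"):
            name = word[1:]
            is_last = index == len(words) - 1
            body = ".+" if is_last else r"\S+"
            parts.append("(?P<" + name + ">" + body + ")")
        else:
            parts.append(re.escape(word))
    return re.compile(r"\s+".join(parts))


def match_command(template, text):
    """Return the placeholder values as a dict, or None if no match."""
    match = compile_template(template).fullmatch(text.strip())
    return match.groupdict() if match else None


# Usage (hypothetical provider and feature names):
# match_command("status :provider :status for :feature",
#               "status acme degraded for checkout payments")
```

Letting the last placeholder absorb the remainder of the line is a design choice that keeps free-text arguments, such as a feature description, from needing quotes.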

Reminders

- when: [30, stop]
  command: :check_status_page
- when: 120
  command: :notify_support_atc
  message: 'Spy has notified the Support Response Manager (SRM) on your behalf.'
- when: 120
  command: :srm_fill_out_doc
- when: 300
  message: 'You should coordinate external comms with the support incident responder.'
- when: 600
  command: :srm_checking_in
- when: [3600]
  command: :notify_imoc_team
- when: stop
  message: 'Please create a Service Disruptions report.'

Milestones
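A reminder schedule of this shape could be evaluated on a periodic tick. The sketch below is illustrative only and rests on assumptions the slide does not confirm: that numeric `when` values are seconds since the incident started, that `stop` fires when the incident is stopped, and that the bot polls at intervals. The helper names `triggers`, `due`, and `on_stop` are hypothetical, and only a subset of the entries is reproduced.

```python
# A subset of the schedule from the slide, as plain Python data.
# Assumption: numeric values are seconds since the incident started.
REMINDERS = [
    {"when": [30, "stop"], "command": "check_status_page"},
    {"when": 120, "command": "notify_support_atc"},
    {"when": 300, "message": "You should coordinate external comms "
                             "with the support incident responder."},
    {"when": "stop", "message": "Please create a Service Disruptions report."},
]


def triggers(entry):
    """Normalise 'when' to a list, since it may be a scalar or a list."""
    when = entry["when"]
    return when if isinstance(when, list) else [when]


def due(reminders, last_tick, now):
    """Entries with a numeric trigger in the window (last_tick, now]."""
    return [
        entry for entry in reminders
        if any(isinstance(t, int) and last_tick < t <= now
               for t in triggers(entry))
    ]


def on_stop(reminders):
    """Entries that should fire when the incident is stopped."""
    return [entry for entry in reminders if "stop" in triggers(entry)]
```

Using a half-open window `(last_tick, now]` means each numeric reminder fires exactly once even when ticks are irregular, as long as consecutive windows share their boundary.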

Stop incident

➡ spy incident stop

And much more

• SD content generation (`spy incident note`)

• Preventing on-call fatigue (`spy incident handoff`)

• Reducing context switching (`spy pager stfu`)

• Reminders (before, during and after the incident)
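As an illustration of how notes taken during an incident might feed a Service Disruption (SD) draft, here is a minimal sketch. It is not how `spy incident note` is implemented: the `IncidentLog` class, its fields, and the report layout are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Collects timestamped notes and renders them as a report draft."""
    title: str
    notes: list = field(default_factory=list)

    def note(self, author, text, at=None):
        # Default to the current UTC time when no timestamp is supplied.
        at = at or datetime.now(timezone.utc)
        self.notes.append((at, author, text))

    def draft_report(self):
        lines = ["Service Disruption: " + self.title, "", "Timeline:"]
        # Sort by timestamp so late-entered notes land in the right place.
        for at, author, text in sorted(self.notes, key=lambda n: n[0]):
            lines.append(f"- {at:%H:%M} UTC {author}: {text}")
        return "\n".join(lines)
```

Capturing notes in chat as the incident unfolds means the timeline is already written when the incident stops, instead of being reconstructed from memory afterwards.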

HOW DID SPY AFFECT IMOCS?

Benefits

• Increased sharing and focus

• Shortened feedback loop

• Eliminated manual toil

• Smoother incident handling

• Faster onboarding experience

Fears, What if …

• Forget I’m on-call

• Phone in silent mode

• Forget to update the status page

• Don’t know who to ping

• Too much context switching, can't focus

• Forget the incident response procedure

spy pre-oncall reminders

spy check reminders

spy oncall

spy cmd #war-room

spy incident

🤖

• Flexible and powerful

• A very important member of our team

• Enables us to really lead an incident response

• Reduce incident impact and duration

Questions?

THANK YOU!

@niyodanie

Shopify Talks

Thursday, 5:00 pm to 6:00 pm: Six Ways a Culture of Communication Strengthens Your Team’s Resiliency (Lightning Talk) - Jaime Woo

Friday, 11:30 am to 12:00 pm: Building an On-Premise Kubernetes Cluster For a Large Web Application - Daniel Turner

Check out our blog at engineering.shopify.com

Follow us on Twitter at @shopifyeng