fixing on-call when nobody thinks it’s (too) broken › sites › default › files › conference...

34
Fixing On-Call When Nobody Thinks it’s (too) Broken Tony Lykke Hudson River Trading

Upload: others

Post on 27-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Fixing On-Call When Nobody Thinks it’s (too) Broken

Tony LykkeHudson River Trading

Page 2: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Jan 2012 to Oct 201871,317 high urgency pages

Average: 201/weekWorst month: 2,327

Nov 2018 to Feb 20191,015 high urgency pages

Average: 56/week

Page 3: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

What This Talk is (not)

IS:

● Guide to reducing pager noise

IS NOT:

● Explanation of why you should

Page 4: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

What This Talk is (not)

IS:

● General advice you can (hopefully) apply regardless of platform

IS NOT:

● Technical deep-dive into Nagios and PagerDuty

Page 5: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

A Brief History

Atlassian (Sydney)

Service Ops (SOPS)

Jira SRE

Hudson River Trading (NYC)

Trade Systems SRE

Page 6: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

HRT at a Glance

Automated trading firm

>5% of daily US equities volume

Milliseconds are slowwww

Everything is quantifiable

~300 people across 10 offices

Page 7: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Trade Systems Overview

Thousands of machines

Tens of colos

Hundreds of pages per week

Page 8: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Hundreds of Pages a Week

Fewer machines More machines

1600

800

0

Page 9: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Me: “You don’t have to get paged so much”

Them:

Page 10: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Why?

“That’s just how it’s always been”

“Each of our colos is configured slightly differently, so it’s hard to set good thresholds”

“Reducing noise that much just isn’t realistic without Big Corp™ resources”

“It’s better than it used to be”

Page 11: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

9 Really Hard Steps to Reduce Pager Noise

1) Understand your audience

2) Understand the problem

3) Understand the system

4) Devise a game plan

5) Get permission (optional)

6) Lay the groundwork

7) Fix the lowest hanging fruit

8) Communicate, communicate, communicate

9) GOTO 7

Page 12: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 1: Understand Your Audience

Stockholm syndrome + alert fatigue = ???

Culture of alerts meaning things are working

Disagreement that this “problem” is fixable

Priorities not aligned

Preference for conservative approaches

Page 13: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 2: Understand the Problem

Page 14: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 2: Understand the Problem

ApproximateUS market hours

Daily dataconsolidation

Page 15: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 3: Understand the System

Nagios 3

30 satellites, 40,000+ checks

200,000 lines of configuration

Huge amounts of stale and duplicated config

Manually deployed via rsync

Adding or changing checks tedious and error-prone

Page 16: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 4: Devise a Game Plan

Doesn’t have to be comprehensive

Low risk and high impact changes first

Communicate the plan and ask for feedback

Listen to the data

Page 17: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 4: Devise a Game Plan

Satellites

Master

Already exists,make smarterUse templates

to simplify

Page 18: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Ask for forgiveness instead of permission?

Step 5: Get Permission (Optional)

Use the data you’ve collected

15-20 mins per page that isn’t outright ignored

Be annoyingly persistent

Does your team trust you?

How long until they ask questions?

Will you have answers?

Page 19: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 5a: When in Doubt, Over-Communicate

“Preference for conservative approaches” - Me, earlier

You will break things

Page 20: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 6: Lay the Groundwork

Neglect creates technical debt

Page 21: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 6: Lay the Groundwork

Neglect creates technical debt

Standardise config

Templated 15k lines of duplicate config

Automated build and deploy

Page 22: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 7: Fix the Lowest Hanging Fruit

Minimise:

Risk

Time

Maximise:

Impact

Page 23: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 7: Fix the Lowest Hanging Fruit

Page 24: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 7: Fix the Lowest Hanging Fruit

ApproximateUS market hours

Daily dataconsolidation

Page 25: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 7: Fix the Lowest Hanging Fruit

Page 26: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

“What’s it feel like to be a hero?”

Page 27: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 8: Communicate, Communicate, Communicate

Blog posts

RFCs

Documentation

Comprehensive pull request messages

Documented rollout plans

???

Page 28: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 8: Communicate, Communicate, Communicate

Silence made people uncomfortable

Decade of alerts being ok

Needed to wean our team off the noise

Page 29: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 8: Communicate, Communicate, Communicate

Page 30: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Step 9: GOTO 7

Page 31: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Then vs Now

Satellites

Master

Page 32: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Then vs NowThen vs Now

Satellites

MasterDrop

Downgrade

Group

Page 33: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Ten Second Review

Gather data

Review data

Make a plan

Communicate

???

Talk at SREcon about it

Page 34: Fixing On-Call When Nobody Thinks it’s (too) Broken › sites › default › files › conference › protected … · Fixing On-Call When Nobody Thinks it’s (too) Broken Tony

Sound Fun? We’re Hiring

Come talk to us at our booth! We’re at: Booth 41, Northside Ballroom

Visit our careers page: hudson-trading.com/careers

[email protected]

@tl on SREcon Slack