Pitfalls in Measuring SLOs
Danyel Fisher (@fisherdanyel)
Liz Fong-Jones (@lizthegrey)
What do you do when things break?
How bad was this break?
Build new features!
We need to improve quality!
Management
Engineering
Clients and Users
How broken is “too broken”?
What does “good enough” mean?
Combatting alert fatigue
A telemetry system produces events that correspond to real world use
We can describe some of these events as eligible
We can describe some of them as good
Given an event, is it eligible? Is it good?
Eligible: “Had an HTTP status code”
Good: “... that was a 200, and was served under 500 ms”
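The two questions above can be sketched as a pair of predicates. This is only an illustration: it assumes each telemetry event is a dict with hypothetical `status_code` and `duration_ms` fields, which are not names from the talk.

```python
from typing import Optional

def is_eligible(event: dict) -> bool:
    """An event counts toward the SLO only if it has an HTTP status code."""
    return event.get("status_code") is not None

def is_good(event: dict) -> Optional[bool]:
    """Return None for ineligible events, otherwise whether the event was good.

    Good means: status 200 and served in under 500 ms.
    A missing duration is treated as too slow.
    """
    if not is_eligible(event):
        return None
    duration_ms = event.get("duration_ms", float("inf"))
    return event["status_code"] == 200 and duration_ms < 500

events = [
    {"status_code": 200, "duration_ms": 120},  # eligible and good
    {"status_code": 200, "duration_ms": 900},  # eligible, but too slow
    {"status_code": 500, "duration_ms": 40},   # eligible, but an error
    {"duration_ms": 5},                        # no status code: not eligible
]
results = [is_good(e) for e in events]
```

Keeping "not eligible" separate from "bad" matters: ineligible events should drop out of the denominator instead of counting against the SLO.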
SLO: minimum quality ratio over a period of time
Error budget: number of bad events allowed
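One way to make these two quantities concrete, sketched under the assumption that the quality ratio is good events divided by eligible events. The function and field names are illustrative, not from the talk.

```python
def error_budget(target: float, eligible_events: int, bad_events: int) -> dict:
    """Summarize an event-based error budget for one SLO window.

    target: minimum quality ratio, e.g. 0.999
    eligible_events: events that count toward the SLO in the window
    bad_events: eligible events that were not good
    """
    if eligible_events <= 0:
        raise ValueError("need at least one eligible event")
    allowed_bad = (1.0 - target) * eligible_events
    return {
        "allowed_bad": allowed_bad,               # number of bad events allowed
        "remaining": allowed_bad - bad_events,    # negative once the budget is blown
        "compliance": (eligible_events - bad_events) / eligible_events,
    }

report = error_budget(target=0.999, eligible_events=1_000_000, bad_events=400)
```

With these made-up numbers the window allows about 1,000 bad events, so 400 failures leave roughly 600 in the budget.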
Deploy faster
Room for experimentation
Opportunity to tighten SLO
We always store incoming user data: 99.99% (~4.3 minutes)
Default dashboards usually load in < 1 s: 99.9% (45 minutes)
Queries often return in < 10 s: 99% (7.3 hours)
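The durations on this slide can be read as the time-equivalent of each error budget. A minimal sketch of that conversion, assuming a 30-day window and uniform traffic (the slide does not state the window, and its figures look rounded):

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of total outage that would use up the whole budget for the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes

# Roughly 432 min (7.2 h), 43.2 min, and 4.32 min for a 30-day window.
budgets = {t: downtime_budget_minutes(t) for t in (0.99, 0.999, 0.9999)}
```

Each extra nine cuts the allowance by a factor of ten, which is why the data-storage target leaves only minutes of slack.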
User Data Throughput
We blew through three months’ budget in those 12 minutes.
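A rough check of that claim, assuming the affected SLO is the 99.99% data-storage target, a 30-day budget window, and that the 12 minutes were a complete outage under uniform traffic (none of which the slide states explicitly):

```latex
\[
\text{budget per 30 days} = (1 - 0.9999) \times 30 \times 24 \times 60\ \text{min} = 4.32\ \text{min},
\qquad
\frac{12\ \text{min}}{4.32\ \text{min per 30 days}} \approx 2.8\ \text{windows}
\]
```

Under those assumptions, 12 minutes of dropped data uses a little under three 30-day budgets, which matches the "three months" figure.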
We dropped customer data
We rolled it back (manually)
We communicated to customers
We halted deploys
We checked in code that didn’t build.
We had experimental CI build wiring.
Our scripts deployed empty binaries.
There was no health check and rollback.
We stopped writing new features
We prioritized stability
We mitigated risks
SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair
Learning from SLOs
SLOs, as espoused at Google vs in practice
We told you "SLOs are magical".
But not everyone has Google's tooling.
And SLO practice at Google is uneven.
Debugging SLOs is often divorced from reporting SLOs.
There had to be a better way, leveraging Honeycomb...
Using a Design Thinking perspective:
● Expressing and Viewing SLOs
● Burndown Alerts and Responding
● Learning from our Experiences
Displays and Views
See where the burndown was happening, explain why, and remediate
Expressing SLOs
Time-based: “How many 15-minute periods had a P99(duration) < 50 ms?”
Event-based: “How many events had a duration < 500 ms?”
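The difference between the two styles shows up when the same events are scored both ways. The sketch below assumes events arrive as hypothetical `(timestamp_seconds, duration_ms)` tuples and uses a nearest-rank percentile. It is an illustration, not how any particular product computes P99.

```python
from collections import defaultdict
import math

def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def time_based_compliance(events, period_s=15 * 60, p99_limit_ms=50):
    """Fraction of periods (with at least one event) whose P99 is under the limit."""
    periods = defaultdict(list)
    for timestamp_s, duration_ms in events:
        periods[timestamp_s // period_s].append(duration_ms)
    good = sum(1 for durations in periods.values() if p99(durations) < p99_limit_ms)
    return good / len(periods)

def event_based_compliance(events, limit_ms=500):
    """Fraction of individual events whose duration is under the limit."""
    good = sum(1 for _, duration_ms in events if duration_ms < limit_ms)
    return good / len(events)

# Two 15-minute periods: the first is fast, the second contains a slow burst.
events = [(t, 10) for t in range(0, 900, 9)]           # 100 fast events
events += [(900 + t, 10) for t in range(0, 900, 10)]   # 90 fast events
events += [(1000 + i, 2000) for i in range(10)]        # 10 slow events
time_ratio = time_based_compliance(events)    # one of two periods passes
event_ratio = event_based_compliance(events)  # 190 of 200 events pass
```

The time-based view marks the whole second period as bad, while the event-based view counts only the ten slow requests, so the two styles can report very different compliance for the same incident.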
Status of an SLO
How have we done?
Where did it go?
When did the errors happen?
What went wrong?
High dimensional data
High cardinality data
Why did it happen?
User Feedback
“I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in and have some alerts on the average error rate but they’re a little spiky to be useful.”
Burndown Alerts
How is my system doing?
Am I over budget?
When will my alarm fail?
When will I fail?
User goal: get alerts on time to budget exhaustion
Human-digestible units
24 hours: “I’ll take a look in the morning”
4 hours: “All hands on deck!”
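A minimal sketch of an exhaustion-time forecast, assuming a simple linear extrapolation of the recent burn rate. The function names and the "ticket" and "page" routing labels are hypothetical; only the 24-hour and 4-hour thresholds come from the slide.

```python
def hours_to_exhaustion(budget_remaining: float, burned_recently: float,
                        lookback_hours: float) -> float:
    """Extrapolate the recent burn rate to estimate when the budget reaches zero.

    budget_remaining: bad events still allowed in the SLO window
    burned_recently: bad events seen during the lookback period
    """
    if budget_remaining <= 0:
        return 0.0
    if burned_recently <= 0:
        return float("inf")
    burn_per_hour = burned_recently / lookback_hours
    return budget_remaining / burn_per_hour

def alert_level(hours_left: float) -> str:
    """Map the forecast onto the two human-digestible thresholds."""
    if hours_left <= 4:
        return "page"    # "All hands on deck!"
    if hours_left <= 24:
        return "ticket"  # "I'll take a look in the morning"
    return "ok"

slow_burn = alert_level(hours_to_exhaustion(600, 30, 1))   # 20 hours left
fast_burn = alert_level(hours_to_exhaustion(600, 300, 1))  # 2 hours left
```

Expressing the alert in hours until exhaustion, not as a raw error rate, is what lets a responder decide between waiting until morning and acting now.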
Implementing Burn Alerts
Run a 30-day query at a 5-minute resolution, every minute
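To give a sense of the work involved, here is a sketch of the burndown computation over one such query result. It assumes the query returns hypothetical `(eligible, bad)` counts per 5-minute bucket, oldest first; the real query engine and its result format are not described on the slide.

```python
from itertools import accumulate

WINDOW_DAYS = 30
RESOLUTION_MINUTES = 5
BUCKETS = WINDOW_DAYS * 24 * 60 // RESOLUTION_MINUTES  # 8640 buckets per query

def burndown(buckets, target):
    """Remaining budget fraction after each bucket.

    1.0 means the budget is untouched; 0.0 or below means it is exhausted.
    """
    total_eligible = sum(eligible for eligible, _ in buckets)
    allowed_bad = (1.0 - target) * total_eligible
    if allowed_bad == 0:
        return [1.0 for _ in buckets]
    cumulative_bad = accumulate(bad for _, bad in buckets)
    return [1.0 - bad_so_far / allowed_bad for bad_so_far in cumulative_bad]

# Toy data: 1,000 events per bucket, one bad event in every 10th bucket.
sample = [(1000, 1 if i % 10 == 0 else 0) for i in range(BUCKETS)]
remaining = burndown(sample, target=0.999)
```

Re-running this every minute means scanning all 8,640 buckets each time, so the cost of the underlying query matters as much as the arithmetic.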
Learning from Experience
Volume is important
Tolerate at least dozens of bad events per day
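A quick way to check whether a service has enough traffic for a given target. The threshold of 24 is only a stand-in for "dozens"; the talk does not give an exact number, and the function names are illustrative.

```python
def allowed_bad_per_day(target: float, daily_events: int) -> float:
    """Bad events per day that the SLO target tolerates at this traffic level."""
    return (1.0 - target) * daily_events

def has_enough_volume(target: float, daily_events: int,
                      min_bad_per_day: float = 24) -> bool:
    """True if the daily budget allows at least `min_bad_per_day` bad events."""
    return allowed_bad_per_day(target, daily_events) >= min_bad_per_day

low_volume = has_enough_volume(0.999, 10_000)    # about 10 bad events allowed per day
high_volume = has_enough_volume(0.999, 100_000)  # about 100 bad events allowed per day
```

When the allowance is only a handful of events, a single retry storm or one unlucky user can exhaust the budget, so alerts on it are mostly noise.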
Faults
SLOs for Customer Service
Remember that user having a bad day?
Blackouts are easy… but brownouts are much more interesting
Timeline
1:29 am SLO alerts. “Maybe it’s just a blip”
1.5% brownout for 20 minutes
4:21 am Minor incident. “It might be an AWS problem”
6:25 am SLO alerts again. “Could it be ALB compat?”
9:55 am “Why is our system uptime dropping to zero?”
It’s out of memory
We aren’t alerting on that crash
10:32 am Fixed
Cultural Change
It’s hard to replace alerts with SLOs
But a clear incident can help
Reduce Alarm Fatigue
Focus on user-affecting SLOs
Focus on actionable alarms
Conclusion
SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair
You can do it too
And maybe avoid our mistakes
Pitfalls in Measuring SLOs
Email: [email protected] / [email protected]
Twitter: @fisherdanyel & @lizthegrey
Come talk to us in Slack!
hny.co/danyel