Webinar - Data driven postmortems - Jason Yee
![Page 1: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/1.jpg)
DATA-DRIVEN POSTMORTEMS
JASON YEE, DATADOG @GITBISECT
![Page 2: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/2.jpg)
“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.”
- Henry Ford
TW: @gitbisect @datadoghq
![Page 3: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/3.jpg)
@gitbisect
▸ Technical Writer/Evangelist
▸ “Docs & Talks”
▸ Travel Hacker & Whiskey Hunter
@datadoghq
▸ SaaS-based monitoring
▸ Trillions of data points per day
▸ http://jobs.datadoghq.com
![Page 4: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/4.jpg)
“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”
Internal Datadog Developer Guide
![Page 5: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/5.jpg)
BLAMELESS POSTMORTEMS
![Page 6: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/6.jpg)
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
![Page 7: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/7.jpg)
![Page 8: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/8.jpg)
![Page 9: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/9.jpg)
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
OUR FOCUS AREA
▸ Culture
▸ Sharing
![Page 10: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/10.jpg)
![Page 11: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/11.jpg)
CULTURE & SHARING RESOURCES
BLAMELESS POSTMORTEMS
▸ Blameless Postmortems by John Allspaw http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave Zwieback http://bit.ly/human-postmortem
![Page 12: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/12.jpg)
CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT METRICS?
![Page 13: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/13.jpg)
![Page 14: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/14.jpg)
COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE
SO INSTRUMENT ALL THE THINGS!
![Page 15: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/15.jpg)
4 QUALITIES OF GOOD METRICS
NOT ALL METRICS ARE EQUAL
![Page 16: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/16.jpg)
1. WELL UNDERSTOOD
![Page 17: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/17.jpg)
![Page 18: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/18.jpg)
![Page 19: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/19.jpg)
2. SUFFICIENT GRANULARITY
![Page 20: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/20.jpg)
1 second Peak 46%
1 minute Peak 36%
5 minutes Peak 12%
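The shrinking peak on this slide is what averaging does to a short spike. A minimal sketch, assuming invented 1-second readings (a flat 10% baseline with one 30-second burst at 46%); the numbers are illustrative and are not the data behind the slide, so the coarser peaks will not match 36% and 12%:

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical 1-second CPU readings (%) over 10 minutes.
per_second = [10.0] * 600
per_second[120:150] = [46.0] * 30  # a single 30-second burst

per_minute = downsample(per_second, 60)
per_five_minutes = downsample(per_second, 300)

peak_1s = max(per_second)
peak_1m = max(per_minute)
peak_5m = max(per_five_minutes)

print(f"1 second peak:  {peak_1s:.1f}%")
print(f"1 minute peak:  {peak_1m:.1f}%")
print(f"5 minutes peak: {peak_5m:.1f}%")
```

The burst is the same in all three series; only the averaging window changes, and the wider the window, the more the spike is diluted by the baseline around it.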
![Page 21: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/21.jpg)
![Page 22: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/22.jpg)
![Page 23: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/23.jpg)
3. TAGGED & FILTERABLE
![Page 24: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/24.jpg)
![Page 25: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/25.jpg)
![Page 26: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/26.jpg)
![Page 27: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/27.jpg)
Query Based Monitoring
“What’s the average throughput of application:nginx per version?”
“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”
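Questions like these become simple filter-and-group operations once every data point carries tags. A minimal sketch in plain Python, not Datadog's query language; the data points, the `average_by` helper, and the version numbers are all hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: (requests per second, tags).
points = [
    (120.0, {"application": "nginx", "version": "1.10", "region": "us-west-1"}),
    (100.0, {"application": "nginx", "version": "1.10", "region": "us-east-1"}),
    (180.0, {"application": "nginx", "version": "1.12", "region": "us-west-1"}),
    (40.0, {"application": "postgresql", "role": "accounting-app", "region": "us-west-1"}),
    (60.0, {"application": "postgresql", "role": "accounting-app", "region": "us-east-1"}),
]


def average_by(points, group_key, **filters):
    """Average the values whose tags match `filters`, grouped by `group_key`."""
    groups = defaultdict(list)
    for value, tags in points:
        matches = all(tags.get(key) == wanted for key, wanted in filters.items())
        if matches and group_key in tags:
            groups[tags[group_key]].append(value)
    return {name: mean(values) for name, values in groups.items()}


# "What's the average throughput of application:nginx per version?"
nginx_by_version = average_by(points, "version", application="nginx")

# role:accounting-app running application:postgresql, compared across regions
accounting_by_region = average_by(
    points, "region", application="postgresql", role="accounting-app"
)

print(nginx_by_version)
print(accounting_by_region)
```

Without tags, each of these questions would need its own pre-aggregated metric; with tags, the same raw points answer both.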
![Page 28: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/28.jpg)
4. LONG-LIVED
![Page 29: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/29.jpg)
METRICS 101
HOW LONG?
▸ AWS CloudWatch: Up to 15 months at 1h granularity
▸ MS Azure Monitoring Service: Up to 90d at 1d granularity
▸ Google Stackdriver: Up to 6 weeks at 1m granularity
▸ Datadog: Up to 15 months at 1s granularity
![Page 30: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/30.jpg)
![Page 31: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/31.jpg)
![Page 32: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/32.jpg)
![Page 33: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/33.jpg)
P.S. - June 1! Mark your calendar!
![Page 34: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/34.jpg)
RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES
![Page 35: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/35.jpg)
There is no singular “Root Cause”
![Page 36: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/36.jpg)
HUMAN ELEMENT
TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES
![Page 37: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/37.jpg)
IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM
![Page 38: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/38.jpg)
HUMAN DATA
DATA COLLECTION: WHO?
▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users
![Page 39: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/39.jpg)
HUMAN DATA
DATA COLLECTION: WHAT?
▸ Their perspective
▸ What they did
▸ What they thought
▸ Why they thought/did it
![Page 40: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/40.jpg)
“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”
RICHARD GUINDON
![Page 41: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/41.jpg)
TELLING STORIES
“A PICTURE IS WORTH A THOUSAND WORDS” - ANCIENT PROVERB
![Page 42: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/42.jpg)
HUMAN DATA
DATA COLLECTION: WHEN?
▸ As soon as possible.
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
▸ Get your project managers involved!
![Page 43: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/43.jpg)
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Stress
▸ Sleep deprivation
▸ Burnout
![Page 44: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/44.jpg)
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Blame
▸ Fear of punitive action
![Page 45: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/45.jpg)
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Bias
  ▸ Anchoring
  ▸ Hindsight
  ▸ Outcome
  ▸ Availability (Recency)
  ▸ Bandwagon Effect
![Page 46: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/46.jpg)
HOW WE DO POSTMORTEMS AT DATADOG
![Page 47: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/47.jpg)
DATADOG POSTMORTEMS
A FEW NOTES
▸ Postmortems are emailed company-wide
▸ Scheduled recurring postmortem meetings
![Page 48: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/48.jpg)
DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT HAPPENED?
▸ Describe what happened here at a high level. Think of it as the abstract of a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
![Page 49: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/49.jpg)
![Page 50: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/50.jpg)
![Page 51: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/51.jpg)
DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE OUTAGE DETECTED?
▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
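The detection questions above can be checked against the data after the fact. A minimal sketch, assuming a hypothetical error-rate metric sampled once per minute and a hypothetical monitor rule (three consecutive points above 5%); the `first_breach` helper, the timestamps, and the thresholds are invented for illustration:

```python
from datetime import datetime, timedelta


def first_breach(series, threshold, consecutive):
    """Return the timestamp at which `consecutive` points in a row exceed `threshold`."""
    run = 0
    for timestamp, value in series:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return timestamp
    return None  # the monitor would never have fired


start = datetime(2017, 1, 1, 12, 0)  # arbitrary example date
values = [1, 1, 2, 9, 12, 15, 14, 13]  # hypothetical error rate (%)
series = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

incident_began = series[3][0]  # assumed start: the first bad point, 12:03
alert_fired = first_breach(series, threshold=5, consecutive=3)
time_to_detect = alert_fired - incident_began

print(f"Alert fired at {alert_fired:%H:%M}, {time_to_detect} after the incident began")
```

Replaying a proposed monitor over the recorded metric is one way to answer "would we catch the same issue if it were to repeat?" before the next incident does it for you.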
![Page 52: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/52.jpg)
![Page 53: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/53.jpg)
![Page 54: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/54.jpg)
DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE RESPOND?
▸ Who was the incident owner & who else was involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
![Page 55: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/55.jpg)
*Names changed
![Page 56: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/56.jpg)
CHATOPS ARCHIVES FTW!
![Page 57: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/57.jpg)
TRACK LEARNINGS AS YOU GO
![Page 58: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/58.jpg)
DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?
▸ Deep dive into the cause
▸ Examples from this incident:
  ▸ http://bit.ly/dd-statuspage
  ▸ http://bit.ly/alq-postmortem
![Page 59: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/59.jpg)
DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow-up notes
![Page 60: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/60.jpg)
![Page 61: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/61.jpg)
DATADOG’S POSTMORTEM TEMPLATE
RECAP:
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
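The five sections in this recap can be captured as a small data structure so that an unfinished postmortem is easy to spot. A minimal sketch, not Datadog's actual template or tooling; the `Postmortem` class, its field names, and the example incident are all hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Section headings taken from the recap; the field names are assumptions.
SECTIONS = [
    ("summary", "What happened (summary)?"),
    ("detection", "How did we detect it?"),
    ("response", "How did we respond?"),
    ("cause", "Why did it happen (deep dive)?"),
    ("next_steps", "Actionable next steps"),
]


@dataclass
class Postmortem:
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    cause: str = ""
    next_steps: List[str] = field(default_factory=list)

    def missing_sections(self):
        """Return the headings of sections that are still empty."""
        return [heading for name, heading in SECTIONS if not getattr(self, name)]

    def render(self):
        """Render the postmortem as plain text, marking empty sections as TODO."""
        lines = [self.title]
        for name, heading in SECTIONS:
            content = getattr(self, name)
            if isinstance(content, list):
                content = "\n".join(f"- {item}" for item in content)
            lines += ["", heading, content or "TODO"]
        return "\n".join(lines)


draft = Postmortem(
    title="Example outage",
    summary="Checkout API returned errors for 20 minutes.",
    next_steps=["Add a monitor on the error-rate metric"],
)
missing = draft.missing_sections()
print(draft.render())
```

Keeping the section list in one place means the rendered document and the completeness check cannot drift apart.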
![Page 62: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/62.jpg)
KEEP LEARNING
MORE RESOURCES
▸ Postmortem Template http://bit.ly/postmortem-template
▸ Post-Incident Reviews by Jason Hand http://bit.ly/post-incident-review
![Page 63: Webinar - Data driven postmortems - Jason Yee](https://reader034.vdocuments.site/reader034/viewer/2022051710/5aaae78e7f8b9a586f8b45fb/html5/thumbnails/63.jpg)
QUESTIONS?
LET’S TALK!
@GITBISECT
@DATADOGHQ