Five Causes of Alert Fatigue -- and how to prevent them
Alert Fatigue - and what to do about it
Elik Eizenberg, VP R&D
http://www.bigpanda.io
alert fatigue
noun
A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.
Synonyms: alert overload, alert spam
Poor Signal-to-Noise Ratio
Delayed Response
Wrong Prioritization
Constant Context Switching
Common Pitfalls
Alert Per Host
What you see: 20 critical Nagios / Zabbix alerts, all at once
What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster
In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down
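The "one alert per cluster" idea above can be sketched in a few lines. This is an illustration only, not how Nagios or Zabbix work internally: the `HostAlert` record, the `summarize_cluster` helper and the `page_threshold` knob are all hypothetical names, and the cluster size is assumed to be known up front.

```python
from dataclasses import dataclass


@dataclass
class HostAlert:
    # Hypothetical normalized alert record; real tools use their own formats.
    host: str
    cluster: str
    check: str


def summarize_cluster(alerts, cluster_size, page_threshold=0.5):
    """Collapse per-host alerts for one cluster into a single summary.

    Returns (message, should_page). page_threshold is an assumed setting:
    the fraction of hosts that must be affected before anyone is paged.
    """
    affected = {a.host for a in alerts}          # count each host once
    fraction = len(affected) / cluster_size
    message = f"{len(affected)}/{cluster_size} hosts ({fraction:.0%}) alerting"
    return message, fraction >= page_threshold


# 20 per-host alerts from a 25-host cluster become one summary line.
alerts = [HostAlert(f"web-{i:02d}", "web", "http_latency") for i in range(20)]
message, should_page = summarize_cluster(alerts, cluster_size=25)
print(message, "| page:", should_page)
```

With two affected hosts out of 25, the same call would still produce a summary but would not page, which is the "don't wake me up" half of the slide.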
Important != Urgent
What you see: Low disk space alert on a MongoDB host
What happened:
- DB disk is slowly filling up as expected
- Will become urgent in a few weeks
In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me
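One way to separate "important" from "urgent" for the disk example is to estimate how long until the disk fills and route on that estimate. The sketch below assumes a simple linear growth rate between two samples; `days_until_full`, `route` and the two-day cutoff are hypothetical, and the "ticket" result stands in for whatever ticketing integration you use (no Jira API is called here).

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk is full from (day, used_gb) samples.

    Uses the average growth rate between the first and last sample.
    Returns None if usage is flat or shrinking.
    """
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (d1 - d0)                 # GB per day
    if rate <= 0:
        return None
    return (capacity_gb - u1) / rate


def route(days_left, urgent_within_days=2):
    """Page only when the problem is imminent; otherwise open a ticket."""
    if days_left is None:
        return "ignore"
    return "page" if days_left <= urgent_within_days else "ticket"


samples = [(0, 860.0), (7, 895.0)]               # made-up data: 5 GB/day growth
days_left = days_until_full(samples, capacity_gb=1000.0)
print(days_left, route(days_left))               # 21.0 ticket
```

A disk that fills in three weeks lands in the ticket queue; the same check only pages once the estimate drops to the assumed two-day cutoff.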
Non-Adaptive Thresholds
What you see: The same high-load alerts, every Monday after lunch
What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day
In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)
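A minimal sketch of a threshold that adapts to the day of the week, assuming you keep a history of past load readings. It uses a plain mean plus `k` standard deviations per weekday; this is a stand-in for illustration and is not the algorithm etsy/skyline uses. The `weekday_threshold` helper and the sample readings are made up.

```python
from statistics import mean, stdev


def weekday_threshold(history, weekday, k=3.0):
    """Threshold for one weekday: mean + k * stdev of past readings
    taken on that same weekday. history is a list of (weekday, load).
    Needs at least two readings for the given weekday."""
    same_day = [load for day, load in history if day == weekday]
    return mean(same_day) + k * stdev(same_day)


# weekday 0 = Monday, 1 = Tuesday; values are made-up load readings
history = [(0, 80), (0, 90), (0, 85), (0, 85),
           (1, 40), (1, 45), (1, 50), (1, 45)]

monday = weekday_threshold(history, 0)
tuesday = weekday_threshold(history, 1)

load = 88
print("alert on Monday: ", load > monday)    # False: normal for a Monday
print("alert on Tuesday:", load > tuesday)   # True: unusual for a Tuesday
```

The same reading of 88 is quiet on Monday and noisy on Tuesday, which is the behaviour a single static threshold cannot give you.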
Same Issue, Different System
What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…
What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error shows up at the server, application & user levels
In an ideal world:
- Auto-correlate highly related alerts from different systems
- Show me one high-level incident, instead of low-level alerts
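Cross-system correlation can be approximated by grouping alerts that hit the same service within a short time window. The sketch below assumes alerts from every tool have already been normalized into dicts with `source`, `service` and `ts` fields; that shared `service` tag, the `correlate` helper and the five-minute window are all assumptions for illustration, and real correlation engines use richer signals.

```python
from collections import defaultdict


def correlate(alerts, window_seconds=300):
    """Group alerts into incidents when they share a service and each alert
    starts within window_seconds of the previous alert in the group.

    Each alert is a dict with "source", "service" and "ts" (epoch seconds).
    """
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        by_service[alert["service"]].append(alert)

    incidents = []
    for items in by_service.values():
        current = [items[0]]
        for alert in items[1:]:
            if alert["ts"] - current[-1]["ts"] <= window_seconds:
                current.append(alert)        # same burst, same incident
            else:
                incidents.append(current)    # gap too large: close it
                current = [alert]
        incidents.append(current)
    return incidents


alerts = [
    {"source": "Nagios", "service": "orders-db", "ts": 1000},
    {"source": "NewRelic", "service": "orders-db", "ts": 1060},
    {"source": "Pingdom", "service": "orders-db", "ts": 1150},
    {"source": "Nagios", "service": "orders-db", "ts": 9000},
]
incidents = correlate(alerts)
for incident in incidents:
    print([a["source"] for a in incident])
```

Here the first three alerts collapse into one incident and the much later Nagios alert starts a new one.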
Transient Alerts
What you see: Issue pops up for a couple of minutes, then disappears.
What happened:
- Maybe a cronjob overutilizes the network
- Or a random race condition in the app
- Or a rarely used product feature that causes the backend to crash
In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts
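The report-instead-of-alert idea can be sketched over a log of past alert events. This version works on historical data only; a live system would also have to hold each notification back until the minimum duration has passed. The `split_transients` helper, the event tuple layout and the five-minute cutoff are hypothetical.

```python
from collections import Counter


def split_transients(events, min_duration=300):
    """Separate alerts that cleared quickly from ones worth notifying on.

    events: list of (check_name, opened_ts, resolved_ts) in epoch seconds;
    resolved_ts is None while the alert is still open.
    Returns (to_notify, transient_counts).
    """
    to_notify = []
    transient_counts = Counter()
    for name, opened, resolved in events:
        if resolved is not None and resolved - opened < min_duration:
            transient_counts[name] += 1      # goes in the periodic report
        else:
            to_notify.append(name)
    return to_notify, transient_counts


events = [
    ("network_saturation", 0, 120),
    ("network_saturation", 86400, 86490),
    ("backend_crash", 5000, 5060),
    ("disk_io", 7000, None),                 # still open: notify
]
to_notify, report = split_transients(events)
print("notify:", to_notify)
print("report:", report.most_common())
```

Only the still-open alert reaches a human right away; the short-lived ones are counted so that recurring offenders stand out in the monthly report.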
Give us a try - http://www.bigpanda.io
http://twitter.com/bigpanda
Thanks for listening!