Monitoring by Zabbix: the Final Frontier
Detect problems way before end users notice
Agenda
Programming languages we use to build our software
Standard approach to monitoring
How does Zabbix do it?
Who am I?
Alexei Vladishev
Creator of Zabbix
CEO and Architect
@avladishev
Riga | Tokyo | New York
Runtime issues
Memory leaks
Uninitialised pointers
Require discipline!
Runtime issues
Out of memory
GC affects execution
Runtime issues
Out of memory
Slow execution
Hard to predict resource usage
No guarantees: performance, resource usage, availability, etc.
Confluence KB: How to fix out of memory errors by increasing available memory?
We aren't really able to give a concrete recommendation for the amount of memory to allocate, because that will depend greatly on your server setup, the size of your user base, and their behaviour. You will need to find a value that works for you, ie no noticeable GC pauses, and no OutOfMemory errors.
Solution: Increase Xmx in small increments (eg 512mb at a time), until you no longer experience the OutOfMemory error.
Too many bad things may happen at runtime
That’s why we need monitoring!
Monitoring is about describing abnormal behaviour of our systems
How to detect it?
Typical approach
[Chart: CPU load, 0 to 10, from 10:00 to 10:50]
CPU load > 5
Events on the chart: Problem, Recovery, Problem, Recovery, Problem
Too sensitive
Flapping
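The flapping on this slide can be reproduced with a minimal sketch. The function name and the sample values are hypothetical, chosen only so that the load hovers around the threshold of 5; this is an illustration of a last-value threshold check, not how any particular tool implements it.

```python
def naive_threshold_events(samples, threshold=5.0):
    """Emit an event each time the latest value crosses the threshold."""
    events = []
    in_problem = False
    for timestamp, value in samples:
        if value > threshold and not in_problem:
            in_problem = True
            events.append((timestamp, "PROBLEM"))
        elif value <= threshold and in_problem:
            in_problem = False
            events.append((timestamp, "RECOVERY"))
    return events


# Hypothetical CPU load samples hovering around the threshold.
samples = [
    ("10:00", 3.0), ("10:05", 5.5), ("10:10", 4.8), ("10:15", 5.2),
    ("10:20", 4.9), ("10:25", 6.0), ("10:30", 6.5),
]

events = naive_threshold_events(samples)
for event in events:
    print(event)
```

With these samples the check fires three Problem events and two Recovery events in under half an hour, which is the alert noise the slide calls flapping.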
Zabbix does it the smart way
Zabbix server
Data collection
History
Analysis
Alerts
Analyse history
[Chart: CPU load, 0 to 10, from 10:00 to 11:10, annotated "Problem!" and "Recovery"]
CPU load for the last 10 minutes > 5
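A history-based condition such as "CPU load for the last 10 minutes > 5" could be sketched as follows. This assumes the condition means that every sample in the window is above 5 (a minimum over the window); an average over the window would be another reasonable reading. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def min_over_window(history, now, window):
    """Return the minimum value among samples in (now - window, now], or None."""
    values = [value for ts, value in history if now - window < ts <= now]
    return min(values) if values else None


def sustained_high_load(history, now, threshold=5.0, window=timedelta(minutes=10)):
    """True only if every sample in the window is above the threshold."""
    lowest = min_over_window(history, now, window)
    return lowest is not None and lowest > threshold


def load_at(minute):
    """Hypothetical load: one short spike at 10:05, sustained load 10:20 to 10:35."""
    if minute == 5:
        return 7.0
    if 20 <= minute <= 35:
        return 6.0
    return 2.0


start = datetime(2024, 1, 1, 10, 0)
history = [(start + timedelta(minutes=m), load_at(m)) for m in range(41)]

spike = sustained_high_load(history, start + timedelta(minutes=5))
sustained = sustained_high_load(history, start + timedelta(minutes=35))
print(spike, sustained)
```

The one-minute spike does not raise a problem because the window still contains low samples, while the sustained load does.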
Problem disappeared != problem is resolved
Problem: free disk space <= 10%
Now free disk space is 10.001%
Have we resolved our problem?
Different conditions
[Chart: CPU load, 0 to 10, from 10:00 to 10:50]
Problem: CPU load > 5
Recovery: CPU load < 1
Problem!
Recovery
No flapping!
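Separate problem and recovery conditions (hysteresis) can be sketched in a few lines. The thresholds 5 and 1 come from the slide; the function name and the sample values are hypothetical.

```python
def hysteresis_events(samples, problem_above=5.0, recover_below=1.0):
    """Open a problem above one threshold, close it only below a lower one."""
    events = []
    in_problem = False
    for timestamp, value in samples:
        if not in_problem and value > problem_above:
            in_problem = True
            events.append((timestamp, "PROBLEM"))
        elif in_problem and value < recover_below:
            in_problem = False
            events.append((timestamp, "RECOVERY"))
    return events


# Hypothetical CPU load that hovers around 5 before finally dropping below 1.
samples = [
    ("10:00", 3.0), ("10:05", 5.5), ("10:10", 4.8), ("10:15", 5.2),
    ("10:20", 4.9), ("10:25", 6.0), ("10:30", 2.0), ("10:35", 0.5),
]

events = hysteresis_events(samples)
for event in events:
    print(event)
```

Values between the two thresholds change nothing, so the load hovering around 5 produces a single Problem and a single Recovery.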
Smarter approach
Problem if free disk space < 10%
Recovery if Free disk space > 30% for the last 15 minutes
Problem if 3 consecutive checks of REST service failed
Recovery if 10 consecutive checks of REST service are OK
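The consecutive-check rule for the REST service could look roughly like this. The counts 3 and 10 come from the slide; the class name and its interface are hypothetical, and each call to `update` stands for one completed check.

```python
class ConsecutiveCheckTrigger:
    """Problem after N consecutive failures, recovery after M consecutive successes."""

    def __init__(self, fail_count=3, ok_count=10):
        self.fail_count = fail_count
        self.ok_count = ok_count
        self.in_problem = False
        self._last_ok = None  # result of the previous check
        self._streak = 0      # length of the current run of identical results

    def update(self, check_ok):
        """Record one check result and return the current problem state."""
        if check_ok == self._last_ok:
            self._streak += 1
        else:
            self._last_ok = check_ok
            self._streak = 1
        if not self.in_problem and not check_ok and self._streak >= self.fail_count:
            self.in_problem = True
        elif self.in_problem and check_ok and self._streak >= self.ok_count:
            self.in_problem = False
        return self.in_problem


# Hypothetical check results: a short glitch, then a real outage, then recovery.
results = [True, False, False, True, False, False, False] + [True] * 10
trigger = ConsecutiveCheckTrigger()
states = [trigger.update(result) for result in results]
print(states)
```

Two failed checks in a row are ignored, the third consecutive failure opens the problem, and it stays open until ten successful checks in a row have been seen.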
Anomaly detection
[Chart: values from 0 to 10, from 10:00 to 11:10]
Compare current system state with the past
Anomaly!
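One simple way to compare the current state with the past is to compare the average of a recent window with the average of the same window from an earlier period, for example the same hour yesterday. This is only a sketch of that idea: the function name, the ratio limit of 2.0, and the sample values are all assumptions, not something the slides specify.

```python
from statistics import mean


def is_anomaly(current_window, past_window, max_ratio=2.0):
    """Flag the current window if its average exceeds max_ratio times the past average."""
    baseline = mean(past_window)
    current = mean(current_window)
    if baseline == 0:
        # No meaningful ratio; treat any non-zero activity as unusual.
        return current > 0
    return current / baseline > max_ratio


# Hypothetical CPU load samples: same hour yesterday versus two candidate "today" windows.
same_hour_yesterday = [2.0, 2.5, 2.0]
normal_today = [2.2, 2.4, 2.1]
unusual_today = [8.0, 9.0, 8.5]

print(is_anomaly(normal_today, same_hour_yesterday))
print(is_anomaly(unusual_today, same_hour_yesterday))
```

The design choice here is that no fixed threshold on the metric itself is needed: the past behaviour of the same system serves as the baseline.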
Forecasting
[Chart: values from 0 to 50, from 7:00 to 21:00, with a fitted trend line]
y = -2.9455x + 48.309
Predict when a threshold will be reached, and what the value will be after a given period of time
Problem in the future
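Both questions, "what value after a period of time" and "when", can be answered from a fitted trend line. The sketch below uses an ordinary least-squares straight line, matching the linear equation on the slide; the function names and the sample points are hypothetical, and real data may call for a different model.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope, intercept, x):
    """Predicted value at position x on the fitted line."""
    return slope * x + intercept


def time_left(slope, intercept, current_x, threshold):
    """Distance along x until the line reaches threshold, or None if it never will."""
    if slope == 0:
        return None
    x_hit = (threshold - intercept) / slope
    return x_hit - current_x if x_hit >= current_x else None


# Hypothetical free disk space (%) measured once per hour for six hours.
points = [(hour, 48.0 - 3.0 * hour) for hour in range(6)]

slope, intercept = fit_line(points)
value_in_five_hours = forecast(slope, intercept, 10)
hours_until_full = time_left(slope, intercept, current_x=5, threshold=0.0)
print(slope, intercept, value_in_five_hours, hours_until_full)
```

A problem can then be raised as soon as the predicted time left drops below a chosen margin, before the threshold is actually crossed.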
Conclusion
Monitoring by Zabbix is your best friend
Use smart problem detection, do not spam DevOps
Detect problems way before end users notice
Anomalies
Forecasting