StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
TRANSCRIPT
![Page 1: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/1.jpg)
Introduction to Monitoring
![Page 2: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/2.jpg)
![Page 3: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/3.jpg)
![Page 4: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/4.jpg)
![Page 5: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/5.jpg)
Monitoring is both the process and the set of tools for finding problems before
your users do, minimizing the monetary impact of failure, and enabling fast recovery.
![Page 6: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/6.jpg)
Efficient monitoring aims to notify the right person at the right time (and only at the right time) with the most precise
information.
![Page 7: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/7.jpg)
What monitoring is
• Measure
• Aggregate & Visualize
• Alert
![Page 8: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/8.jpg)
Webapp DB
![Page 9: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/9.jpg)
Webapp DB
What to Measure?
End user experience / performance
![Page 10: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/10.jpg)
End User Monitoring
• Validates our application is running from “outside”
• Measure “real user” performance
• Geo-distributed – including real latency
• Many tools offer such solutions
  – Measure, visualize, alerts
![Page 11: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/11.jpg)
End User Monitoring
• When is a page fully loaded?
• Take care - some tools are biased
![Page 12: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/12.jpg)
![Page 13: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/13.jpg)
![Page 14: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/14.jpg)
End User Monitoring
• Measure yourself
• Using
  – Resource Timing API
  – User Timing API
  – Custom JS
• Send metrics from browsers to your own sync server
  – all users / samples
![Page 15: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/15.jpg)
End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by geo & application
• Group by browser
![Page 16: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/16.jpg)
End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
  – From a specific browser
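The "sudden drop in traffic from a certain geo" alert can be sketched as a comparison between a baseline and the current page-view count per geo. This is a minimal illustration, not the speakers' implementation: the function name, the 50% threshold, and the sample numbers are all assumptions.

```python
from typing import Dict, List


def geos_with_traffic_drop(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.5,  # assumed threshold: alert when more than 50% of traffic is gone
) -> List[str]:
    """Return the geos whose page views fell by more than max_drop versus the baseline."""
    dropped = []
    for geo, expected in baseline.items():
        if expected <= 0:
            continue  # no meaningful baseline for this geo
        observed = current.get(geo, 0.0)  # a geo missing from current counts as zero traffic
        if (expected - observed) / expected > max_drop:
            dropped.append(geo)
    return sorted(dropped)


# Hypothetical page views per geo for the same time slot.
baseline_views = {"US": 1000, "DE": 400, "IL": 250}
current_views = {"US": 980, "DE": 120, "IL": 240}
print(geos_with_traffic_drop(baseline_views, current_views))
```

The same comparison, with the inequality reversed, would cover the "sudden increase in traffic" case.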
![Page 17: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/17.jpg)
Webapp DB
What to Measure?
Is Alive?
![Page 18: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/18.jpg)
Is Alive
• Measure a process's liveness
  – Is the process running?
• Measure a process's responsiveness
  – Does the process respond to a request?
• Alert on instance down
  – And auto restart it
• Alert on all instances down
![Page 19: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/19.jpg)
Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for responsiveness tests
• Is alive != Availability
  – Is alive is per host
  – Availability is about the system as a whole
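A responsiveness test that calls a designated URL might look roughly like the sketch below. It uses only the Python standard library; the host names, the `/is_alive` path, and the `stub_fetch` helper are hypothetical and exist only so the example runs without a network.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code


def is_responsive(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """True when the designated URL answers with a 2xx status."""
    try:
        return 200 <= fetch(url) < 300
    except OSError:
        # Connection refused, DNS failure, or timeout: treat as not responding.
        return False


def check_hosts(urls: Iterable[str], fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Is-alive is per host, so report one result per URL."""
    return {url: is_responsive(url, fetch) for url in urls}


def stub_fetch(url: str) -> int:
    # Hypothetical stand-in for the network so the sketch runs offline.
    if "host-b" in url:
        raise ConnectionRefusedError("connection refused")
    return 500 if "host-c" in url else 200


results = check_hosts(
    ["http://host-a/is_alive", "http://host-b/is_alive", "http://host-c/is_alive"],
    fetch=stub_fetch,
)
print(results)
print("all instances down:", not any(results.values()))
```

Passing the fetch function as a parameter keeps the per-host check separate from any decision about the availability of the system as a whole.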
![Page 20: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/20.jpg)
Webapp DB
What to Measure?
Request performance
![Page 21: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/21.jpg)
Request Monitoring
• Measure how your application performs
  – Regardless of networking to the user
  – Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
  – Measure, visualize, alerts
![Page 22: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/22.jpg)
Request Monitoring
• But many tools miss the branching point
  – Branching point – the point in your code at which your code decides what branch of execution to perform for a request
• Issues with aggregation, what is monitored, alert flexibility
• But still, there are some great tools
![Page 23: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/23.jpg)
Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type, HTTP response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
![Page 24: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/24.jpg)
Do not use Average
• Don’t use Average for performance
• Instead, use median, 95th percentile and 99th percentile.
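A small worked example shows why the average misleads for latency. The sketch below uses a nearest-rank percentile, which is one of several common definitions, and a made-up set of 100 response times in which most requests are fast and a few are very slow.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 90 fast, 8 slow, 2 very slow.
latencies_ms = [20] * 90 + [200] * 8 + [2000] * 2

print("average:", mean(latencies_ms))            # 74 ms, a value no request actually had
print("median: ", percentile(latencies_ms, 50))  # what a typical request sees
print("p95:    ", percentile(latencies_ms, 95))  # the slow tail
print("p99:    ", percentile(latencies_ms, 99))  # the worst of the tail
```

With this data the average is 74 ms while the median is 20 ms, the 95th percentile is 200 ms, and the 99th percentile is 2000 ms, so the average hides both the typical case and the tail.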
![Page 25: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/25.jpg)
Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
  – Median, 95th percentile and 99th percentile on a moving window
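Percentiles on a moving window can be sketched by keeping only the samples from the last N seconds and recomputing over them. The class below is illustrative: its name, the 60-second window, and the assumption that timestamps arrive in non-decreasing order are not from the talk.

```python
import math
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindow:
    """Keeps (timestamp, value) samples from the last window_seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp: float, value: float) -> None:
        # Assumes timestamps are non-decreasing, so old samples sit at the left.
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def count(self) -> int:
        """Number of requests in the window (the request rate when the window is one minute)."""
        return len(self._samples)

    def percentile(self, pct: int) -> Optional[float]:
        """Nearest-rank percentile over the samples currently in the window."""
        ordered = sorted(value for _, value in self._samples)
        if not ordered:
            return None
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]


window = SlidingWindow(window_seconds=60)
window.add(0, 100)    # falls out of the window once time reaches 60
window.add(30, 200)
window.add(61, 300)
print(window.count(), window.percentile(50), window.percentile(99))
```

Sorting on every query is fine for a sketch; a production system would more likely use a histogram or a streaming quantile structure.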
![Page 26: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/26.jpg)
Request Monitoring
What to Visualize
• Errors
  – Rate, percent (compared to request rate)
  – Top X errors by percent
  – Separate system and application errors
  – You will always have application errors
  – You should have exactly 0 system errors
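The error views listed above (percent of requests, top X errors, system versus application) can be derived from one list of tagged errors. The sketch below assumes each error is recorded as a `(kind, name)` pair where `kind` is either `"system"` or `"application"`; that encoding and the error names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def error_report(request_count: int, errors: List[Tuple[str, str]], top_n: int = 3) -> Dict[str, object]:
    """Summarize errors relative to the number of requests in the same period."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    by_name = Counter(name for _, name in errors)
    system_errors = sum(1 for kind, _ in errors if kind == "system")
    return {
        "error_percent": 100 * len(errors) / request_count,
        "system_errors": system_errors,  # the slide's target for this number is exactly 0
        "top_errors": [
            (name, 100 * count / request_count)
            for name, count in by_name.most_common(top_n)
        ],
    }


# Hypothetical errors observed during 1000 requests.
observed = (
    [("application", "InvalidInput")] * 30
    + [("application", "NotFound")] * 15
    + [("system", "DbTimeout")] * 5
)
report = error_report(1000, observed)
print(report)
```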
![Page 27: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/27.jpg)
Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors
![Page 28: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/28.jpg)
Webapp DB
What to Measure?
Resource Utilization
![Page 29: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/29.jpg)
Resources
• System resources
  – CPU, Memory, IO, Storage, network
• Resource pools
  – Database connection pools
  – HTTP connection pools
  – Thread pools
  – Other resource pools
![Page 30: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/30.jpg)
Resource Monitoring
What to measure
• Measure resource utilization
  – Percent of resource used
• Measure resource acquisition queue
  – Time to acquire
  – Acquire timeouts
  – Usage timeouts
![Page 31: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/31.jpg)
Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)
Alert on
• Resource over utilization – avg usage over XX% in a time window
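The over-utilization rule (average usage over XX% in a time window) can be sketched as follows. The 80% threshold, the pool size of 20, and the sample values are assumptions chosen for illustration; the slide leaves the threshold as XX%.

```python
from typing import Sequence


def utilization_percent(in_use: int, capacity: int) -> float:
    """Percent of a resource pool currently in use."""
    return 100 * in_use / capacity


def is_over_utilized(samples_percent: Sequence[float], threshold_percent: float) -> bool:
    """True when the average utilization across the window exceeds the threshold."""
    if not samples_percent:
        return False  # no data in the window, nothing to alert on
    return sum(samples_percent) / len(samples_percent) > threshold_percent


# Hypothetical database connection pool of size 20, sampled four times in the window.
pool_size = 20
in_use_samples = [14, 17, 19, 18]
window_percent = [utilization_percent(n, pool_size) for n in in_use_samples]
print(window_percent, is_over_utilized(window_percent, threshold_percent=80))
```

Averaging over the window keeps a single short spike from triggering the alert, which fits the goal of notifying only at the right time.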
![Page 32: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/32.jpg)
Webapp DB
What to Measure?
Database Monitor
![Page 33: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/33.jpg)
Database monitoring
Depends on the database, but in general:
• Storage
• Replication “lag”
• Slow operations
• Resource usage
![Page 34: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/34.jpg)
Monitoring at Wix
![Page 35: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/35.jpg)
Precise information
Alert the right person
Automation
![Page 36: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/36.jpg)
Service is alive
• Is my application alive on the minimum number of instances required by my SLA?
• 2 out of 5 instances of my-app are not responding to isAlive
• my-app requires a minimum of 3 instances to meet the SLA
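The my-app example can be worked through in a few lines: with 5 instances, 2 not responding, and a minimum of 3, the SLA is still met, but only just. The function below is a hypothetical sketch of that comparison; the status strings and instance names are assumptions and do not describe how Sensu or the Wix setup reports it.

```python
from typing import Dict


def sla_status(alive_by_instance: Dict[str, bool], min_instances: int) -> str:
    """Compare the number of instances answering isAlive with the SLA minimum."""
    alive = sum(1 for ok in alive_by_instance.values() if ok)
    total = len(alive_by_instance)
    if alive == 0:
        return "critical: all instances down"
    if alive < min_instances:
        return f"alert: {alive}/{total} alive, SLA needs {min_instances}"
    if alive < total:
        return f"warning: {alive}/{total} alive, SLA of {min_instances} still met"
    return "ok"


# The numbers from the slide: 2 out of 5 instances of my-app are not responding.
my_app = {
    "my-app-1": True,
    "my-app-2": True,
    "my-app-3": True,
    "my-app-4": False,
    "my-app-5": False,
}
print(sla_status(my_app, min_instances=3))
```

Losing one more instance would move the result from a warning to an alert, which is the point at which the service owner needs to be notified.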
![Page 37: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/37.jpg)
Alert
• Sensu: queries Nginx, alert & SLA
• ZooKeeper: planned configuration
• Nginx: service load balancer
• Is-alive
• Service owner
![Page 38: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/38.jpg)
Alert the right person
Precise information
Automation
![Page 39: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/39.jpg)
Service anomalies
• Backend anomalies
• Identify unhealthy KPIs per endpoint
• Abnormal increase in error rate for class.method.get
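To make "abnormal increase in error rate" concrete, here is a deliberately simple z-score check against recent history. It is only a sketch of the idea: dedicated anomaly detection products use far more sophisticated models, and the threshold of 3 standard deviations and the sample rates are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_abnormal_increase(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """True when latest is more than threshold standard deviations above the historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest > mu  # a perfectly flat history: any increase is unusual
    return (latest - mu) / sigma > threshold


# Hypothetical per-minute error rates for one endpoint.
recent_error_rates = [0.010, 0.012, 0.011, 0.009, 0.010]
print(is_abnormal_increase(recent_error_rates, latest=0.011))
print(is_abnormal_increase(recent_error_rates, latest=0.050))
```

The check is one-sided on purpose, since the slide is concerned with increases in error rate rather than drops.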
![Page 40: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/40.jpg)
Anomaly Alert
• JVM servers: metrics library, metrics / 1m
• statsd: stats aggregation, forwarding metrics
• Anodot: time series anomaly detection, alerts & graphs
• Graphs
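For readers unfamiliar with statsd, metrics reach it as short text lines of the form `name:value|type`, usually over UDP. The sketch below formats a counter and a timing in that line format; the metric names are hypothetical, and the host and port in `send` are assumptions (8125 is the conventional statsd port). The talk's pipeline uses a JVM metrics library, so this Python version only illustrates the wire format.

```python
import socket
from typing import Iterable


def counter(name: str, value: int = 1) -> str:
    """statsd counter line, for example an error count."""
    return f"{name}:{value}|c"


def timing(name: str, milliseconds: int) -> str:
    """statsd timing line, for example a request duration."""
    return f"{name}:{milliseconds}|ms"


def send(lines: Iterable[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send newline-separated metrics in one UDP datagram (fire and forget)."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names for one endpoint.
metrics = [
    counter("my-app.class.method.get.errors"),
    timing("my-app.class.method.get.duration", 37),
]
print(metrics)
# send(metrics)  # uncomment when a statsd daemon is listening on the assumed address
```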
![Page 41: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/41.jpg)
Precise information
Alert the right person
Automation
![Page 42: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/42.jpg)
Service anomalies
• Frontend anomalies
• Browser (client) generated KPIs
• User experience - are users affected or not? How and where?
![Page 43: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/43.jpg)
Anomaly Alert
• Client: JS in browser, events
• Logger: flume, events
• Storm & Esper: realtime streaming processing, metrics / 1m
• Anodot: time series anomaly detection, alerts & graphs
• Graphs
![Page 44: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/44.jpg)
Precise information
Alert the right person
Automation
![Page 45: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/45.jpg)
Alert management
• What are the active alerts?
• What is the root cause?
• Is it correlated to a change?
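One simple way to approach the question of whether an alert is correlated to a change is to list the changes that happened shortly before the alert fired. The sketch below is a naive time-window match with an assumed 30-minute lookback; the change descriptions are made up, and this is not a description of how any alert management product does it.

```python
from typing import List, Tuple

Change = Tuple[float, str]  # (unix timestamp, description)


def changes_before_alert(alert_time: float, changes: List[Change], lookback_seconds: float = 1800) -> List[Change]:
    """Return changes made within lookback_seconds before the alert, most recent first."""
    candidates = [
        change for change in changes
        if 0 <= alert_time - change[0] <= lookback_seconds
    ]
    return sorted(candidates, key=lambda change: change[0], reverse=True)


# Hypothetical change log: a deployment, a Chef upload, and a feature toggle.
change_log = [
    (1000.0, "deployment: my-app"),
    (4000.0, "chef upload: nginx cookbook"),
    (4900.0, "feature toggle: new-checkout on"),
    (6000.0, "deployment: other-app"),  # happens after the alert, so it cannot be the cause
]
suspects = changes_before_alert(alert_time=5000.0, changes=change_log)
print(suspects)
```

A time match only suggests where to look first; it does not establish the root cause.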
![Page 46: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/46.jpg)
Alert
• BigPanda: central alerts & changes
• Changes: deployments, Chef uploads, A/B, F-Toggle, Exp.
• Alerts: NewRelic, Sensu, Nagios, PingDom
• Web UI
![Page 47: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/47.jpg)
Precise information
Alert the right person
Automation
![Page 48: StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis](https://reader035.vdocuments.site/reader035/viewer/2022070513/588579051a28abbb7e8b5bef/html5/thumbnails/48.jpg)
Questions?