Practical Monitoring Techniques

Upload: ariel-moskovich

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Practical Monitoring Techniques

Practical Monitoring Techniques

Page 2: Practical Monitoring Techniques

Today's Talk

● Our Mission
● Current Tools
● Increasing Coverage
● PD Schedules
● Automatic Self Healing
● Bots and Alerts Channels
● Events Dashboard
● Dashboard Accessibility
● Best Practices Summary

Page 3: Practical Monitoring Techniques

Our Mission

Back up culture with the proper tools to support it

Page 4: Practical Monitoring Techniques

Current Tools

● Metrics collection: collectd, StatsD, CloudWatch
● Monitoring: Sensu, NewRelic
● Alert channels: PagerDuty, email, Slack
● Dashboards: Grafana, CloudWatch, NewRelic
● Application testing: E2E Testing System
● Internal tools: Sensu mobile, events system, Sensu bar and more

Page 5: Practical Monitoring Techniques

Increasing Coverage

● Automatic collection of basic system and 3rd party metrics for new instances
● Alerts are added automatically for each new instance of an existing subscriber
● Each Developer / DevOps engineer is responsible for monitoring their own application / infrastructure
● Easy method to add new alerts and dashboards
● Automatic events flow
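
The automatic-alerts bullet above can be sketched as a lookup from subscription names to check templates. This is only an illustration, not the speaker's implementation: the `Check` type, the `TEMPLATES` table, and the placeholder commands are all hypothetical, and a real setup would define these in the monitoring system's own configuration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Check:
    name: str
    command: str      # placeholder command, not a real plugin invocation
    interval_s: int


# Hypothetical check templates keyed by subscription name.
TEMPLATES = {
    "base": [
        Check("cpu", "check_cpu --warn 80 --crit 90", 60),
        Check("disk", "check_disk --warn 85 --crit 95", 300),
    ],
    "webserver": [
        Check("http_5xx", "check_http_errors --crit 5", 60),
    ],
}


def checks_for(subscriptions: Iterable[str]) -> List[Check]:
    """Return the checks a new instance gets from its subscriptions.

    "base" checks are always included; duplicates (by name) are dropped,
    and unknown subscriptions contribute nothing.
    """
    seen = {}
    for sub in ["base", *subscriptions]:
        for check in TEMPLATES.get(sub, []):
            seen.setdefault(check.name, check)
    return list(seen.values())
```

With this shape, a new instance only declares its subscriptions and never lists alerts by hand, which is the point of the bullet.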

Page 6: Practical Monitoring Techniques

Pager Schedules

● Divided into logical groups of ownership
● Each schedule has an escalation point
● On call should be able to connect and respond to issues in their area
● Easy method to override a schedule
● Ability to contact the relevant on call
● Ability to page the relevant on call
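
A minimal sketch of how a schedule with an override and an escalation point might resolve who gets paged. The `Schedule` and `Override` types and the weekly rotation rule are assumptions made for illustration; a hosted paging service handles this internally.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Override:
    person: str
    start: datetime
    end: datetime


@dataclass
class Schedule:
    group: str                 # logical ownership group, e.g. "db"
    rotation: List[str]        # assumed: one person per ISO week, rotating
    escalation: str            # who is paged if on call does not respond
    overrides: List[Override] = field(default_factory=list)

    def on_call(self, now: datetime) -> str:
        # An active override always wins over the regular rotation.
        for o in self.overrides:
            if o.start <= now < o.end:
                return o.person
        week = now.isocalendar()[1]
        return self.rotation[week % len(self.rotation)]

    def page_chain(self, now: datetime) -> List[str]:
        """Order in which people are paged: on call first, then escalation."""
        return [self.on_call(now), self.escalation]
```

Keeping overrides as plain time ranges is one way to make "easy method to override" cheap: swapping a shift is a single appended record.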

Page 7: Practical Monitoring Techniques

Automatic Self Healing

● Better MTTR
● Avoid waking On Call if possible
● Log activity to surface recurring issues
● Limit the healing to avoid restart loops
● Make sure to keep Healer ↔ Alert in sync
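
One way to limit healing and log activity, sketched under assumptions: the `Healer` class, its parameters, and the three-restarts-per-ten-minutes default are hypothetical. The `restart` callable stands in for whatever actually restarts the service.

```python
import time
from collections import deque


class Healer:
    """Restart a service at most `max_attempts` times per sliding window."""

    def __init__(self, restart, max_attempts=3, window_s=600,
                 clock=time.monotonic, log=print):
        self._restart = restart
        self._max_attempts = max_attempts
        self._window_s = window_s
        self._clock = clock
        self._log = log
        self._attempts = {}   # service name -> deque of attempt timestamps

    def heal(self, service):
        """Try to heal; return False when the limit is hit and a human is needed."""
        now = self._clock()
        attempts = self._attempts.setdefault(service, deque())
        # Forget attempts that have fallen out of the window.
        while attempts and now - attempts[0] >= self._window_s:
            attempts.popleft()
        if len(attempts) >= self._max_attempts:
            self._log(f"healing limit reached for {service}; escalating to on call")
            return False
        attempts.append(now)
        self._log(f"restarting {service} (attempt {len(attempts)} in window)")
        self._restart(service)
        return True
```

A `False` return is the signal that the alert should go through to on call instead of being absorbed by the healer, which is one reading of keeping the healer and the alert in sync. The log lines give the trail needed to spot recurring issues.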

Page 8: Practical Monitoring Techniques

Bots, Integrations and Alerts Channels

● Alerts channels: email, Slack, PD mobile, SMS, calls
● Integrations: Sensu to PD/Slack, CloudWatch to PD, 3rd party (e.g. CouchBase, NewRelic, etc.) to PD

● Slack Bot:

Page 9: Practical Monitoring Techniques

Events Dashboard

● Simple REST API for sending events
● Clean timeline view to spot production events
● Connections between events (“depends on” and “dependents”)
● Detailed view for each event
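
The talk does not show the events API itself, so the following is only a guess at its shape: the URL, the JSON field names, and both helper functions are hypothetical. It illustrates sending an event over REST and deriving "dependents" from each event's "depends on" list.

```python
import json
import urllib.request
from collections import defaultdict

# Hypothetical endpoint; the real events system's URL is not given in the talk.
EVENTS_URL = "https://events.example.com/api/events"


def event_request(event_id, title, depends_on=(), url=EVENTS_URL):
    """Build (but do not send) a POST request describing one event."""
    body = json.dumps({
        "id": event_id,
        "title": title,
        "depends_on": list(depends_on),
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dependents_index(events):
    """Invert "depends_on" links: map each event id to the events that depend on it."""
    index = defaultdict(list)
    for event in events:
        for parent in event["depends_on"]:
            index[parent].append(event["id"])
    return dict(index)
```

Sending would be `urllib.request.urlopen(event_request(...))`. Storing only "depends on" and computing "dependents" on read keeps the two directions from drifting apart.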

Page 10: Practical Monitoring Techniques

Accessibility

● Available from everywhere by mobile
● Easy to ack, resolve, mute alerts
● Slack bots to reach help
● Automatically get graph with the alert
● Ability to search, edit, copy, etc. alerts
● Treat alerts management as code (SVC, DB, backups, etc.)

Page 11: Practical Monitoring Techniques

Best Practices Summary

● Share the pain
● Automate base metrics
● Automate healing
● Make help reachable
● Make it easy to add alerts and dashboards
● Use warning levels as soft events to avoid phone calls at night
● Automate graphs in alerts
● Positive alerting system check each day
● Dependencies between alerts
● Postmortems
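
The "warning levels as soft events" practice can be sketched as severity-based routing. The channel names and the `ROUTES` table below are assumptions for illustration, not the deck's actual configuration.

```python
# Hypothetical routing table: warnings stay in chat and email,
# only criticals reach the pager (and therefore the phone).
ROUTES = {
    "warning": ["slack", "email"],
    "critical": ["slack", "email", "pagerduty"],
}


def route(severity):
    """Return the channels an alert of this severity is sent to.

    Unknown severities are treated as critical so a typo in a check
    definition cannot silently downgrade an alert.
    """
    return list(ROUTES.get(severity, ROUTES["critical"]))
```

For example, `route("warning")` returns `["slack", "email"]`, so a disk that is filling slowly shows up in the channel during the day instead of paging someone at night.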

Page 12: Practical Monitoring Techniques

Questions