Real-Time Metrics and Distributed Monitoring - Jeff Pierce, Change.org - DevOpsDays Tel Aviv 2015
DevOps Days 2015
Real Time Metrics and Distributed Monitoring
Jeff Pierce
Senior DevOps Engineer, Change.org
[email protected]
https://github.com/jeffpierce
@Th3Technomancer
Background
● Consulted for Citigroup on their high-frequency trading servers
● Stints at:
  ○ Apple
  ○ Rackspace
● Project lead on Cassabon (https://github.com/jeffpierce/cassabon)
About Change.org
● Global platform where people start and win campaigns for change
● 120 million users worldwide
● Rapidly expanding user base and engineering team
● Spiky, unpredictable traffic based on current events and viral petitions
Why not outsource it?
We tried!
We weren’t happy with the price
We weren’t happy with the resolution of the stats we were capturing
Why do we need distributed monitoring and high-res metrics?
In a cloud world, centralized services are asking for failure
High resolution metrics are awesome!
Faster response time to outages
Able to autoscale on our own terms
What else influenced our decision?
● We were pretty understaffed!
● Low implementation time was key
● We needed to rely on the knowledge the team already had
● We needed something with low maintenance and relatively easy scalability
Searching For A Solution
First Attempt: Try other providers!
● Unable to find a provider that met both our price and resolution requirements
● None that we investigated had reasonable pricing for temporary, autoscaling pool hosts
● Decided to see what we could come up with in-house!
Requirements For A DIY Stack
● Leverage tools team members were familiar with
● Relatively low maintenance
● Flexible, resilient, distributed
● Cost-competitive with outsourced services and with higher resolution
● Uses many parts that we were already using in our infrastructure
We settled on...
● collectd with statsd plugin (http://collectd.org)
● Cyanite (https://github.com/pyr/cyanite)
● graphite-api (https://github.com/brutasse/graphite-api)
● Grafana (http://grafana.org)
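To make the first hop of that stack concrete, here is a minimal sketch of an application emitting a stat to a local collectd statsd listener. The talk does not show client code, so the function names and the example metric name are hypothetical, and the host and port (127.0.0.1, UDP 8125) are assumed defaults that should be checked against the actual collectd configuration.

```python
import socket

# Metric types in the statsd line protocol: counter, timer, gauge, set.
STATSD_TYPES = {"c", "ms", "g", "s"}


def format_statsd(name: str, value: float, metric_type: str) -> bytes:
    """Build one statsd datagram of the form <name>:<value>|<type>."""
    if metric_type not in STATSD_TYPES:
        raise ValueError(f"unsupported statsd type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumptions, not from the talk."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name: count one signed petition.
    send_statsd(format_statsd("petitions.signed", 1, "c"))
```

UDP keeps the application from blocking on the metrics path: if the listener is down, the stat is simply dropped.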
JSON Dashboards Are A Big Deal!
● Developers often know better which stats and graphs are important
● Takes work off the plate of DevOps
● Can be checked in with app code
● Can also be generated via change control with custom libraries
● JSON is a familiar format to devs, increasing adoption rate
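As an illustration of generating dashboards from code, the sketch below builds a dashboard definition from a mapping of panel titles to Graphite targets and serializes it as JSON that could be checked in next to the app. The field names (`rows`, `panels`, `targets`) are a simplified assumption modeled on a row/panel dashboard layout, not a verified Grafana schema; compare against a dashboard exported from your own Grafana version before relying on them. The function names and metric paths are hypothetical.

```python
import json


def build_dashboard(title, metrics):
    """Return a dashboard dict with one graph panel per (panel title, target) pair."""
    panels = [
        {"title": name, "type": "graph", "targets": [{"target": target}]}
        for name, target in metrics.items()
    ]
    return {"title": title, "rows": [{"title": "App stats", "panels": panels}]}


def render(dashboard):
    """Serialize with sorted keys so diffs in version control stay small."""
    return json.dumps(dashboard, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical metric paths for illustration only.
    print(render(build_dashboard("Petitions", {
        "Signatures per second": "app.petitions.signed",
        "Sign latency": "app.petitions.sign_time",
    })))
```

Stable, sorted output matters here because the generated file is meant to live under change control alongside the code that emits the stats.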
[Architecture diagram. Labels: App Servers; “Central” Monitor; Ext. Stat Gatherer; TCP 2003; Cyanite (×4); Cassandra (×6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
The Monitoring Side
Monitoring Implementation Goals
● Write/run simple scripts to query Cyanite
● Use PagerDuty for alerting/paging
● Only use external monitoring to check application-wide or aggregate stats
● Try to use external monitoring services as little as possible
● Template as many checks as possible for easy management by change control
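A "simple script" of the kind described above might look like the following sketch. It is not the talk's actual check code. It assumes the stats are queried through a Graphite-compatible `/render` endpoint with `format=json` (as served by graphite-api) rather than through Cyanite's own HTTP API, and it leaves paging to a caller-supplied `alert` callback, a hypothetical stand-in for whatever PagerDuty integration is in use. The base URL, target, and threshold are placeholders a templated check would fill in.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, window="-5min"):
    """Build a Graphite-style render URL that returns JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def latest_value(series):
    """Return the newest non-null value; datapoints are [value, timestamp] pairs."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(series_list, threshold):
    """Return the targets whose latest value is above the threshold."""
    breaches = []
    for series in series_list:
        value = latest_value(series)
        if value is not None and value > threshold:
            breaches.append(series["target"])
    return breaches


def run_check(base_url, target, threshold, alert):
    """Fetch the series and call alert(message) once per breaching target."""
    with urlopen(render_url(base_url, target), timeout=10) as response:
        series_list = json.load(response)
    for name in evaluate(series_list, threshold):
        alert(f"{name} is above {threshold}")
```

Because a check is just `(target, threshold)`, a list of such pairs can be kept in a versioned file and looped over, which is one way to template checks for change control.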
Getting Developer Buy-In
● Make it simple to add stats and monitors so that we get a high adoption rate
● Make importable code in commonly used languages
● Demo ease of use
● Consult individual, influential developers on importance of getting stats everywhere
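One shape "importable code" could take is a small timing helper that developers wrap around any block. The sketch below is hypothetical and not Change.org's library: `timed` and its parameters are illustrative names, and `emit` is any callable that accepts a statsd-formatted string, so the transport stays pluggable.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, emit, clock=time.monotonic):
    """Time the wrapped block and emit '<name>:<milliseconds>|ms' when it exits."""
    start = clock()
    try:
        yield
    finally:
        # Emit even if the block raised, so slow failures are still visible.
        elapsed_ms = (clock() - start) * 1000.0
        emit(f"{name}:{elapsed_ms:.3f}|ms")


if __name__ == "__main__":
    # Demo with print as the emitter; a real emitter would send the line to statsd.
    with timed("petition.sign", emit=print):
        time.sleep(0.01)
```

Adding a stat is then a one-line change at the call site, which is the kind of low friction the adoption goal above depends on.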
What We Got From All This Work
Wins Thus Far
● Faster code!
● Faster and fewer rollbacks!
● Finding problem instances is easier than ever!
● Faster, easier troubleshooting!
And The Biggest Win...
Increased Communication Between Feature Developers and DevOps!
● App developers have an increased sense of ownership: they choose what stats to capture and which dashboards matter
● When something is wrong, it’s easier to accept it from stats than from the Ops person
Winners Ask Questions!