real-time metrics and distributed monitoring - jeff pierce, change.org - devopsdays tel aviv 2015

DevOps Days 2015

Real Time Metrics and Distributed Monitoring

Jeff Pierce
Senior DevOps Engineer @ Change.org
[email protected]
https://github.com/jeffpierce
@Th3Technomancer


● Consulted for Citigroup on their High Frequency Trading Servers

● Stints at:
  ○ Apple
  ○ Rackspace

● Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)


Background

About Change.org

● Global platform where people start and win campaigns for change

● 120 million users worldwide

● Rapidly expanding user base and engineering team

● Spiky, unpredictable traffic based on current events and viral petitions

Why not outsource it?


We tried!

We weren’t happy with the price

We weren’t happy with the resolution of the stats we were capturing

Why do we need distributed monitoring and high-resolution metrics?


In a cloud world, centralized services are asking for failure



High resolution metrics are awesome!




Faster response time to outages

Able to autoscale on our own terms

What else influenced our decision?


● We were pretty understaffed!

● Low implementation time was key

● We needed to rely on the knowledge the team already had

● We needed something with low maintenance and relatively easy scalability

Searching For A Solution

First Attempt: Try other providers!


● Unable to find a provider that met both our price and resolution requirements



● None that we investigated had reasonable pricing for temporary, autoscaling pool hosts




● Decided to see what we could come up with in-house!

Requirements For A DIY Stack


● Leverage tools team members were familiar with



● Relatively low maintenance

● Flexible, resilient, distributed

● Cost-competitive with outsourced services and with higher resolution

● Uses many parts that we were already using in our infrastructure

We settled on...

● collectd with statsd plugin (http://collectd.org)

● Cyanite (https://github.com/pyr/cyanite)

● graphite-api (https://github.com/brutasse/graphite-api)

● Grafana (http://grafana.org)

JSON Dashboards Are A Big Deal!


● Developers often know better which stats and graphs are important



● Takes work off DevOps' plate




● Can be checked in with app code

● Can also be generated via change control with custom libraries

● JSON is a familiar format to devs, increasing adoption rate
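As an illustration of generating dashboards from code, here is a minimal Python sketch. The helper names (graph_panel, dashboard, render) and the metric paths are hypothetical. The field layout (rows containing panels containing targets) only approximates the general shape of a Grafana dashboard export; the exact schema depends on the Grafana version and should be checked against a real exported dashboard.

```python
import json


def graph_panel(title, targets):
    """Build one graph panel; each target is a Graphite-style query string."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"target": t} for t in targets],
    }


def dashboard(title, panels):
    """Wrap panels in a single row. The layout here is an assumption."""
    return {"title": title, "rows": [{"title": "Overview", "panels": panels}]}


def render(dash):
    """Serialize deterministically so diffs in version control stay small."""
    return json.dumps(dash, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Metric paths below are made up for the example.
    dash = dashboard(
        "Signup service",
        [
            graph_panel("Requests", ["apps.signup.requests.count"]),
            graph_panel("Latency", ["apps.signup.latency.p95"]),
        ],
    )
    print(render(dash))
```

Because the output is plain JSON with sorted keys, the generated file can be committed next to the application code and reviewed like any other change.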

[Architecture diagram. Labels recoverable from the slide:]

Senders: App Servers, “Central” Monitor, Ext. Stat Gatherer
TCP 2003: Cyanite (4 nodes shown)
Storage: Cassandra (6 nodes shown)
TCP 8080
Elastic Search
Grafana + Graphite-API
TCP 80: Dashboard Requests
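TCP 2003 is the port conventionally used by Graphite's plaintext protocol, where each metric is one line of the form "path value timestamp". The Python sketch below shows that wire format under the assumption that the senders in the diagram talk to Cyanite this way. The host name and metric path are placeholders, and the slides do not say which client actually writes to the port.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one Graphite plaintext line, terminated by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="cyanite.example.internal", port=2003, timeout=2.0):
    """Send pre-formatted lines to a carbon-compatible TCP listener.

    The host is a placeholder; 2003 is the port shown in the diagram.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Only formats the line; call send_metrics() against a real listener.
    print(format_metric("apps.web01.requests.count", 42, 1430000000), end="")
```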

The Monitoring Side

Monitoring Implementation Goals

● Write/run simple scripts to query Cyanite



● Use PagerDuty for alerting/paging

● Only use external monitoring to check application-wide or aggregate stats










● Try to use external monitoring services as little as possible

● Template as many checks as possible for easy management by change control
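The goals above can be sketched as one small templated check in Python. Everything named here is an assumption made for illustration: the check template, the metric path, the threshold, and the base URL are hypothetical. The query goes through a Graphite-compatible /render endpoint (as served by graphite-api) rather than Cyanite's own HTTP API, and the event dictionary loosely follows the PagerDuty Events API v2 shape. The slides do not say which interfaces the real scripts used.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical template: one definition, stamped out per application.
CHECK_TEMPLATE = {
    "name": "{app} error rate",
    "target": "apps.{app}.errors.rate",
    "threshold": 5.0,
}


def make_check(app, routing_key):
    """Fill the template for one app; non-string fields pass through as-is."""
    check = {
        key: (val.format(app=app) if isinstance(val, str) else val)
        for key, val in CHECK_TEMPLATE.items()
    }
    check["routing_key"] = routing_key
    return check


def render_url(base_url, target, window="-5min"):
    """Build a Graphite render-API URL that returns JSON."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    return f"{base_url}/render?{query}"


def fetch_series(url, timeout=5.0):
    """Fetch a list of {"target": ..., "datapoints": [[value, ts], ...]}."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def latest_value(series):
    """Most recent non-null value, or None if the window is empty."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(check, series_list):
    """Return one trigger event per series whose latest value is too high."""
    events = []
    for series in series_list:
        value = latest_value(series)
        if value is not None and value > check["threshold"]:
            summary = (
                f"{check['name']}: {series['target']} = {value} "
                f"(> {check['threshold']})"
            )
            events.append({
                "routing_key": check["routing_key"],
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "source": series["target"],
                    "severity": "critical",
                },
            })
    return events
```

A caller would chain the pieces as evaluate(check, fetch_series(render_url(base, check["target"]))) and post each returned event to the paging service. Keeping evaluate free of network calls makes the threshold logic easy to test and to template across many apps.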

Getting Developer Buy-In


● Make it simple to add stats and monitors so that we get a high adoption rate



● Make importable code in commonly used languages




● Demo ease of use

● Consult individual, influential developers on importance of getting stats everywhere
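To make "importable code" concrete, here is a minimal Python sketch of the kind of helper a team might hand to developers. It emits statsd-format lines over UDP, assuming a statsd listener such as collectd's statsd plugin on its default port 8125. The Stats class, its method names, and the "app" prefix are hypothetical and not taken from the talk.

```python
import socket
import time
from contextlib import contextmanager


def counter(name, value=1):
    """statsd counter line: <name>:<value>|c"""
    return f"{name}:{value}|c"


def timing(name, millis):
    """statsd timer line: <name>:<milliseconds>|ms"""
    return f"{name}:{millis}|ms"


class Stats:
    """Fire-and-forget statsd client; all names here are illustrative."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="app"):
        self._addr = (host, port)
        self._prefix = prefix
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            data = f"{self._prefix}.{line}".encode("ascii")
            self._sock.sendto(data, self._addr)
        except OSError:
            pass  # stats must never break the application

    def incr(self, name, value=1):
        self._send(counter(name, value))

    @contextmanager
    def timer(self, name):
        """Time the enclosed block and report it in milliseconds."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self._send(timing(name, elapsed_ms))
```

With a helper like this, adding a stat is one line at the call site, for example stats.incr("signups") or a "with stats.timer('db.query'):" block, which is the low-friction experience the bullets above aim for.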

What We Got From All This Work

Wins Thus Far

● Faster code!

● Faster and fewer rollbacks!

● Finding problem instances is easier than ever!

● Faster, easier troubleshooting!

And The Biggest Win...

Increased Communication Between Feature Developers and DevOps!




● App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter

● When something is wrong, it’s easier to accept it from stats than from the Ops person

Winners Ask Questions!
