real-time metrics and distributed monitoring - jeff pierce, change.org - devopsdays tel aviv 2015

Post on 15-Apr-2017

TRANSCRIPT

DevOps Days 2015

Real Time Metrics and Distributed Monitoring

Jeff Pierce

Senior DevOps Engineer @ Change.org

jpierce@change.org

https://github.com/jeffpierce

@Th3Technomancer

Background

● Consulted for Citigroup on their High Frequency Trading Servers

● Stints at:
  ○ Apple
  ○ Rackspace

● Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)

About Change.org

● Global platform where people start and win campaigns for change

● 120 million users worldwide

● Rapidly expanding user base and engineering team

● Spiky, unpredictable traffic based on current events and viral petitions

Why not outsource it?

● We tried!

● We weren’t happy with the price

● We weren’t happy with the resolution of the stats we were capturing

Why do we need our monitoring distributed and high res metrics?

In a cloud world, centralized services are asking for failure

High resolution metrics are awesome!

Faster response time to outages

Able to autoscale on our own terms

What else influenced our decision?

● We were pretty understaffed!

● Low implementation time was key

● We needed to rely on the knowledge the team already had

● We needed something with low maintenance and relatively easy scalability

Searching For A Solution

First Attempt: Try other providers!

● Unable to find a provider that met both our price and resolution requirements

● None that we investigated had reasonable pricing for temporary, autoscaling pool hosts

● Decided to see what we could come up with in-house!

Requirements For A DIY Stack

● Leverage tools team members were familiar with

● Relatively low maintenance

● Flexible, resilient, distributed

● Cost-competitive with outsourced services and with higher resolution

● Uses many parts that we were already using in our infrastructure

We settled on...

● collectd with statsd plugin (http://collectd.org)

● Cyanite (https://github.com/pyr/cyanite)

● graphite-api (https://github.com/brutasse/graphite-api)

● Grafana (http://grafana.org)
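
As an illustration of the first item in this list, here is a minimal sketch of how an application might emit metrics to a local collectd instance running the statsd plugin. The `name:value|type` line format is the standard StatsD wire format. The host, the port (8125 is the conventional StatsD port), the function names, and the metric names are assumptions made for this example, not details from the talk.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD datagram, e.g. b'petitions.signed:1|c'.

    metric_type is 'c' (counter), 'ms' (timer) or 'g' (gauge).
    """
    if metric_type not in ("c", "ms", "g"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the app does not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A caller would then write something like `send_statsd(format_statsd("petitions.signed", 1, "c"))`. UDP is used here because a dropped datagram costs one data point rather than blocking a request.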

JSON Dashboards Are A Big Deal!

● Developers often know better which stats and graphs are important

● Takes work off of the plate of DevOps

● Can be checked in with app code

● Can also be generated via change control with custom libraries

● JSON is a familiar format to devs, increasing adoption rate
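
To make the "generated via change control with custom libraries" idea concrete, here is a rough sketch of a helper that builds a dashboard definition from a short list of metrics and emits it as JSON. The field names (`rows`, `panels`, `targets`) only loosely follow the row/panel layout Grafana dashboards used at the time and should be treated as assumptions; check them against the schema of the Grafana version in use. The helper names and metric paths are hypothetical.

```python
import json


def graph_panel(title, targets):
    """One graph panel; each target is a Graphite-style query string."""
    return {
        "type": "graph",
        "title": title,
        "targets": [{"target": t} for t in targets],
    }


def build_dashboard(app, metrics):
    """Render a dashboard definition as stable, diff-friendly JSON.

    metrics maps a panel title to the list of queries shown in that panel.
    """
    dashboard = {
        "title": f"{app} overview",
        "rows": [
            {
                "title": app,
                "panels": [graph_panel(t, q) for t, q in metrics.items()],
            }
        ],
    }
    # sort_keys keeps the output deterministic so diffs in review stay small.
    return json.dumps(dashboard, indent=2, sort_keys=True)
```

Because the output is deterministic, the generated file can live next to the application code and be reviewed like any other change.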

[Architecture diagram. Labels: App Servers, “Central” Monitor, Ext. Stat Gatherer; TCP 2003; Cyanite (four nodes); Cassandra (six nodes); Elastic Search; TCP 8080; Grafana + Graphite-API; TCP 80; Dashboard Requests]
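
The diagram labels the metrics input as TCP 2003, the port conventionally used by Graphite's plaintext protocol. Assuming that protocol is what is spoken here, each metric is a single line of the form `path value timestamp`. The sketch below formats and sends such lines; the function names and the example metric path are hypothetical, and the host is left to the caller.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one metric as 'path value timestamp' plus a newline."""
    if " " in path:
        raise ValueError("metric paths must not contain spaces")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n".encode("ascii")


def send_lines(lines, host, port=2003, timeout=2.0):
    """Send a batch of pre-formatted lines over a single TCP connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"".join(lines))
```

Batching several lines into one connection keeps connection overhead low when a host reports many metrics per interval.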

The Monitoring Side

Monitoring Implementation Goals

● Write/run simple scripts to query Cyanite

● Use PagerDuty for alerting/paging

● Only use external monitoring to check application-wide or aggregate stats

● Try to use external monitoring services as little as possible

● Template as many checks as possible for easy management by change control
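
A templated check along these lines could look like the sketch below. The talk does not say how the scripts queried Cyanite, so this assumes the data arrives in the JSON shape of the Graphite render API (a list of series, each with a `target` and `datapoints` of `[value, timestamp]` pairs). The alert dict follows the general shape of a PagerDuty trigger event; the check fields, the service key, and the function names are placeholders, and nothing is actually sent.

```python
def latest_value(series):
    """Return the newest non-null value from one series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def run_check(check, render_response):
    """Evaluate one templated check against a parsed render response.

    Returns one alert dict per series whose latest value exceeds check['max'].
    """
    alerts = []
    for series in render_response:
        value = latest_value(series)
        if value is not None and value > check["max"]:
            target = series["target"]
            alerts.append({
                "service_key": check["service_key"],  # placeholder key
                "event_type": "trigger",
                "incident_key": f"{check['name']}:{target}",
                "description": f"{check['name']}: {target} = {value} "
                               f"(max {check['max']})",
            })
    return alerts
```

Since a check is plain data (`name`, `max`, `service_key`), many checks can be kept in one reviewed file and run through the same function.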

Getting Developer Buy-In

● Make it simple to add stats and monitors so that we get a high adoption rate

● Make importable code in commonly used languages

● Demo ease of use

● Consult individual, influential developers on importance of getting stats everywhere

What We Got From All This Work

Wins Thus Far

● Faster code!

● Faster and fewer rollbacks!

● Finding problem instances is easier than ever!

● Faster, easier troubleshooting!

And The Biggest Win...

Increased Communication Between Feature Developers and DevOps!

● App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter

● When something is wrong, it’s easier to accept it from stats than from the Ops person

Winners Ask Questions!
