metrics driven engineering (velocity 2011)

METRICS-DRIVEN ENGINEERING at Kellan Elliott-McCrea, VP of Eng. [email protected] @kellan Tuesday, June 5, 12

Post on 12-Sep-2014

Page 1: Metrics driven engineering (velocity 2011)

METRICS-DRIVEN ENGINEERING at

Kellan Elliott-McCrea, VP of Eng. [email protected] @kellan

Tuesday, June 5, 12

Page 4: Metrics driven engineering (velocity 2011)

What is Etsy?

Page 5: Metrics driven engineering (velocity 2011)

8.5+ million items in the marketplace

Page 6: Metrics driven engineering (velocity 2011)

400,000+ active

Page 7: Metrics driven engineering (velocity 2011)

$300+ million in sales in 2010

~$41 million/month

Page 8: Metrics driven engineering (velocity 2011)

> $1000 / minute

Page 9: Metrics driven engineering (velocity 2011)

> 1 billion page views / month

Page 10: Metrics driven engineering (velocity 2011)

business in over 150 countries

Page 11: Metrics driven engineering (velocity 2011)

deploy the site, every ~20 minutes

Page 12: Metrics driven engineering (velocity 2011)

engineering team grew ~4x in 2010

Page 13: Metrics driven engineering (velocity 2011)

Metrics?

Page 14: Metrics driven engineering (velocity 2011)

Logs, Graphs, Trends, and Correlations

Page 15: Metrics driven engineering (velocity 2011)

Metrics Driven?

Page 16: Metrics driven engineering (velocity 2011)

Making Decisions

Page 17: Metrics driven engineering (velocity 2011)

How many visitors are using this thing?

Page 18: Metrics driven engineering (velocity 2011)

Can we deploy that to 100% of our visitors?

Page 19: Metrics driven engineering (velocity 2011)

Did we make it faster?

Page 20: Metrics driven engineering (velocity 2011)

Did I just break something?

Page 21: Metrics driven engineering (velocity 2011)

Q. WHO MAKES THESE GRAPHS?

A. Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...

Page 22: Metrics driven engineering (velocity 2011)

but... Engineers build the application.

Page 23: Metrics driven engineering (velocity 2011)

Dev + Ops

Page 24: Metrics driven engineering (velocity 2011)

ACCESS

Page 25: Metrics driven engineering (velocity 2011)

Yes! No.

Page 26: Metrics driven engineering (velocity 2011)

“Engineers are too busy!”

Page 27: Metrics driven engineering (velocity 2011)

Here’s the BIG SECRET...

Page 28: Metrics driven engineering (velocity 2011)

... MAKE IT EASY!

Page 29: Metrics driven engineering (velocity 2011)

Simple, open source tools

Page 30: Metrics driven engineering (velocity 2011)

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Page 31: Metrics driven engineering (velocity 2011)

Ganglia
★ cluster oriented
★ huge community contributed recipes
★ 2.0 released today (including several Flickr and Etsy patches!)
★ gmetad makes it easy to track custom metrics

Page 33: Metrics driven engineering (velocity 2011)

Graphite
★ super flexible collection and display
★ per metrics buckets
★ single instance
★ super easy to write and use custom display functions

Page 34: Metrics driven engineering (velocity 2011)

Logging

Page 35: Metrics driven engineering (velocity 2011)

Logger::log_error("User login failed. Reason: $msg for $username", "login");

Page 36: Metrics driven engineering (velocity 2011)

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...

Page 39: Metrics driven engineering (velocity 2011)

web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...

Page 43: Metrics driven engineering (velocity 2011)

Logster

Page 44: Metrics driven engineering (velocity 2011)

Logster

https://github.com/etsy/logster

Page 45: Metrics driven engineering (velocity 2011)

Forked from ganglia-logtailer:

- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts

Page 46: Metrics driven engineering (velocity 2011)

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

Page 47: Metrics driven engineering (velocity 2011)

Fatals Errors Warnings
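
Turning a wall of log lines into three graphable series comes down to counting lines per severity level. A minimal sketch of that step, assuming lines shaped like the sample error-log lines in this deck (host, bracketed timestamp, bracketed level). The name `count_levels` and the regular expression are illustrative assumptions, not Logster's parser API.

```python
import re
from collections import Counter

# Matches e.g. "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!"
# The pattern is inferred from the sample lines; it is not a documented format.
LINE_RE = re.compile(r"^(?P<host>\S+) \[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\]")


def count_levels(lines):
    """Count log lines per severity level, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


sample = [
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!",
    "web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!",
    "web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!",
    "web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!",
]
print(count_levels(sample))
```

Each count, emitted once per collection interval, becomes one point on the corresponding graph line.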

Page 48: Metrics driven engineering (velocity 2011)

★ runs out of cron
★ maintains a cursor into log files
★ supports Ganglia and Graphite
★ custom parsers much easier to write than gmetad
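
"Maintains a cursor into log files" means each cron run only reads what was appended since the previous run. A rough sketch of that idea, assuming the cursor is a byte offset stored in a small state file; `read_new_lines` and the state-file layout are hypothetical and are not how Logster itself is implemented.

```python
import os


def read_new_lines(log_path, state_path):
    """Return lines appended to log_path since the offset saved in state_path."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as state:
            offset = int(state.read().strip() or 0)

    # If the log is now smaller than the saved offset (for example after
    # rotation), start again from the beginning of the file.
    if offset > os.path.getsize(log_path):
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
        new_offset = log.tell()

    # Persist the cursor so the next run resumes where this one stopped.
    with open(state_path, "w") as state:
        state.write(str(new_offset))

    return data.decode("utf-8", errors="replace").splitlines()
```

This sketch ignores partially written last lines and concurrent runs, which a production tool would need to handle.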

Page 49: Metrics driven engineering (velocity 2011)

Apache access logs

Page 50: Metrics driven engineering (velocity 2011)

LogFormat "%h %l %u %t \"%r\" %>s %b" common
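
Each directive in that format string becomes one field in the access log, so a line can be pulled apart with a single regular expression. A small sketch, assuming well-formed lines with no escaped quotes inside the request; `parse_common` and the sample line are illustrative, not taken from the talk.

```python
import re

# Field order follows the "common" format: %h %l %u %t "%r" %>s %b
COMMON_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_common(line):
    """Parse one Common Log Format line into a dict, or return None."""
    match = COMMON_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # %b is written as "-" when no response body bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


example = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(parse_common(example))
```

Adding more directives to the LogFormat extends the same idea: every extra field is one more named group to aggregate on.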

Page 51: Metrics driven engineering (velocity 2011)

LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

Page 52: Metrics driven engineering (velocity 2011)

%{etsy_ab_selections}n

Page 53: Metrics driven engineering (velocity 2011)

%{etsy_uaid}n

Page 54: Metrics driven engineering (velocity 2011)

Graphs

Page 55: Metrics driven engineering (velocity 2011)

“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner

http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/

Page 57: Metrics driven engineering (velocity 2011)

StatsD

Page 59: Metrics driven engineering (velocity 2011)

StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);
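
Under the hood, calls like these send a tiny UDP datagram in the StatsD line format: `name:value|c` for counters and `name:value|ms` for timers. A minimal Python sketch of the same two calls; the helper names, the localhost address, and the 320 ms value are assumptions for illustration (8125 is the conventional StatsD port).

```python
import socket


def counter_packet(name, delta=1):
    """Build a StatsD counter datagram, e.g. b"logins.success:1|c"."""
    return f"{name}:{delta}|c".encode("ascii")


def timing_packet(name, millis):
    """Build a StatsD timer datagram, e.g. b"gearman.time:320|ms"."""
    return f"{name}:{millis}|ms".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: no connection, no reply, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


if __name__ == "__main__":
    send(counter_packet("logins.success"))
    send(timing_packet("gearman.time", 320))
```

Because the transport is UDP, instrumented code does not block on a reply from the metrics server, which is part of what makes sprinkling these calls through an application cheap.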

Page 60: Metrics driven engineering (velocity 2011)

StatsD::timing("gearman.time", $msec);

90th pct

average

lower
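
The three lines on that graph are summaries of all the timing samples received in one interval. A sketch of how such summaries could be computed, using a nearest-rank percentile; this is one simple definition chosen for illustration and is not claimed to be StatsD's exact calculation. The function and key names are hypothetical.

```python
import math


def summarize_timings(samples_ms, percentile=90):
    """Return the lowest sample, the mean, and a nearest-rank percentile."""
    if not samples_ms:
        raise ValueError("need at least one timing sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest value with at least `percentile` percent
    # of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return {
        "lower": ordered[0],
        "mean": sum(ordered) / len(ordered),
        f"upper_{percentile}": ordered[rank - 1],
    }


print(summarize_timings([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))
```

Plotting a high percentile next to the average makes slow outliers visible that the average alone would hide.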

Page 61: Metrics driven engineering (velocity 2011)

Ad hoc

name value timestamp

Page 62: Metrics driven engineering (velocity 2011)

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
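
The one-liner above writes a single `name value timestamp` line to Graphite's plaintext listener on port 2003. A rough Python equivalent, assuming the same plaintext protocol; the helper names and the `graphite.example.com` host in the comment are placeholders.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """Format one metric as "name value timestamp\\n" (Unix seconds)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{name} {value} {ts}\n"


def send_metric(line, host, port=2003):
    """Open a TCP connection, write the line, and close."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example call (needs a reachable Graphite host, so it is left commented out):
# send_metric(plaintext_line("events.deploy.site", 1), "graphite.example.com")
print(plaintext_line("events.deploy.site", 1, 1300000000), end="")
```

A deploy script can call this once per deploy to record the event as a data point.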

Page 63: Metrics driven engineering (velocity 2011)

Correlations

Page 64: Metrics driven engineering (velocity 2011)

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003

Page 65: Metrics driven engineering (velocity 2011)

Trends + Events

target=drawAsInfinite(events.deploy.site)

Page 66: Metrics driven engineering (velocity 2011)

What Happened?

Page 67: Metrics driven engineering (velocity 2011)

Holt-Winters

Page 68: Metrics driven engineering (velocity 2011)

"Forecasting Sales by Exponentially Weighted Moving Averages". Peter

Page 69: Metrics driven engineering (velocity 2011)

"Aberrant Behavior Detection in Time Series for Network Monitoring".

Page 70: Metrics driven engineering (velocity 2011)

"Holt-Winters Forecasting Applied to Poisson Processes in Real-Time".

Page 71: Metrics driven engineering (velocity 2011)

holtWintersConfidence(Upper|Lower)

Page 72: Metrics driven engineering (velocity 2011)

holtWintersAberration

Page 73: Metrics driven engineering (velocity 2011)

business metrics with confidence bands == alertable business metrics
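
To make the idea concrete: Holt-Winters forecasts each point from a level, a trend, and a seasonal term, tracks how far off recent forecasts were, and flags a value as aberrant when it leaves the resulting band. A simplified sketch of additive Holt-Winters with deviation bands in the style of the aberrant-behavior paper cited in this deck. The smoothing parameters, the band width `delta`, and the crude initialization are illustrative assumptions; this is not Graphite's implementation.

```python
def holt_winters_bands(series, season_len, alpha=0.1, beta=0.0035,
                       gamma=0.1, delta=3.0):
    """Return (forecast, lower, upper) lists, one entry per input point."""
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]                 # crude initialization (assumption)
    trend = 0.0
    seasonal = [0.0] * season_len     # additive seasonal offsets
    deviation = [0.0] * season_len    # smoothed absolute forecast error
    forecast, lower, upper = [], [], []
    for i, actual in enumerate(series):
        s = i % season_len
        predicted = level + trend + seasonal[s]
        forecast.append(predicted)
        lower.append(predicted - delta * deviation[s])
        upper.append(predicted + delta * deviation[s])
        # Update the components only after the prediction has been recorded.
        previous_level = level
        level = alpha * (actual - seasonal[s]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[s] = gamma * (actual - level) + (1 - gamma) * seasonal[s]
        deviation[s] = gamma * abs(actual - predicted) + (1 - gamma) * deviation[s]
    return forecast, lower, upper


def aberration(series, lower, upper):
    """Distance outside the band per point; 0.0 while inside the band."""
    result = []
    for value, lo, hi in zip(series, lower, upper):
        if value > hi:
            result.append(value - hi)
        elif value < lo:
            result.append(value - lo)
        else:
            result.append(0.0)
    return result
```

An alert can then fire whenever the aberration series is non-zero, which is what turns a business metric with a band into an alertable one.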

Page 74: Metrics driven engineering (velocity 2011)

16,000 metrics in GRAPHITE

(plus 32,000 metrics in GANGLIA)

Page 76: Metrics driven engineering (velocity 2011)

Dashboards

Page 79: Metrics driven engineering (velocity 2011)

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Hard

Page 80: Metrics driven engineering (velocity 2011)

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
$g->showDeploys(true);
echo $g->getDashboardHTML(280, 220);

Easy!
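
The difference between "Hard" and "Easy!" is a small wrapper that assembles the render URL for you. A sketch of what such a helper might look like in Python, using only query parameters that appear in the hand-written URL (`from`, `width`, `height`, `title`, `yMin`, `target`). The function name, its defaults, and the example host are assumptions and do not reproduce Etsy's PHP class.

```python
from urllib.parse import urlencode


def render_url(base, targets, title, width=280, height=220,
               deploy_metrics=()):
    """Build a Graphite render URL with optional deploy markers."""
    params = [
        ("from", "-1hours"),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
    ]
    params += [("target", target) for target in targets]
    # Deploy events are drawn as vertical lines over the metric.
    params += [("target", f"drawAsInfinite({metric})")
               for metric in deploy_metrics]
    return f"{base}/render?{urlencode(params)}"


print(render_url(
    "http://graphite.example.com",
    ["webs.errorLog.notExist"],
    "File Not Found",
    deploy_metrics=["deploys.web.production"],
))
```

Hiding the URL escaping and the repeated `target` parameters behind a few calls is the kind of convenience that lets engineers add a dashboard in minutes.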

Page 81: Metrics driven engineering (velocity 2011)

48 dashboards by 32 engineers

Page 82: Metrics driven engineering (velocity 2011)

Application health

Page 83: Metrics driven engineering (velocity 2011)

High-level visibility

Page 84: Metrics driven engineering (velocity 2011)

Low MTTD

Page 85: Metrics driven engineering (velocity 2011)

Confidence

Page 86: Metrics driven engineering (velocity 2011)

Make metrics

Page 89: Metrics driven engineering (velocity 2011)

Not that much

Page 90: Metrics driven engineering (velocity 2011)

codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster
bitbucket.org/maplebed/ganglia-logtailer

Page 91: Metrics driven engineering (velocity 2011)

Questions?
