metrics driven engineering (velocity 2011)

Post on 12-Sep-2014

METRICS-DRIVEN ENGINEERING at Etsy

Kellan Elliott-McCrea, VP of Eng. kellan@etsy.com @kellan

Tuesday, June 5, 12

What is Etsy?

8.5+ million items in the marketplace

400,000+ active

$300+ million in sales in 2010

~$41 million/month

> $1000 / minute

> 1 billion page views / month

business in over 150 countries

deploy the site, every ~20 minutes

engineering team grew ~4x in 2010

Metrics?

Logs, Graphs, Trends, and Correlations

Metrics Driven?

Making Decisions

How many visitors are using this thing?

Can we deploy that to 100% of our visitors?

Did we make it faster?

Did I just break something?

Q. WHO MAKES THESE GRAPHS?

A. Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...

but... Engineers build the application.

Dev + Ops

ACCESS

Yes! No.

“Engineers are too busy!”

Here’s the BIG SECRET...

... MAKE IT EASY!

Simple, open source tools

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Ganglia
★ cluster oriented
★ huge community contributed recipes
★ 2.0 released today (including several Flickr and Etsy patches!)
★ gmetad makes it easy to track custom metrics

Graphite
★ super flexible collection and display
★ per metrics buckets
★ single instance
★ super easy to write and use custom display functions

Logging

Logger::log_error("User login failed. Reason: $msg for $username", "login");

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...

web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...

Logster

Logster: https://github.com/etsy/logster

Forked from ganglia-logtailer:

- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

Fatals Errors Warnings

★ runs out of cron
★ maintains a cursor into log files
★ supports ganglia and graphite
★ custom parsers much easier to write than gmetad
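
The cron-plus-cursor idea in these bullets can be sketched roughly as below. This is not Logster's actual code: the function name `count_levels`, the byte-offset cursor, and the regular expression are assumptions made for illustration, and the line layout is taken from the sample log lines in this deck.

```python
import re
from collections import Counter

# Assumed layout, based on the sample lines in this deck:
#   web0001 [04:28:54 2011] [error] [client 10.101.x.x] message
LEVEL_RE = re.compile(r"^\S+ \[[^\]]*\] \[(fatal|error|warning)\] ")


def count_levels(path, offset=0):
    """Count log levels in the bytes appended since `offset`.

    Returns (counts, new_offset). A cron job would persist new_offset
    between runs so that each line is counted only once.
    """
    counts = Counter()
    with open(path, "rb") as fh:
        fh.seek(offset)
        for raw in fh:
            match = LEVEL_RE.match(raw.decode("utf-8", "replace"))
            if match:
                counts[match.group(1)] += 1
        new_offset = fh.tell()
    return counts, new_offset
```

Each run would then report the three counts (fatals, errors, warnings) to Ganglia or Graphite as one data point per interval. Log rotation and half-written last lines are ignored in this sketch.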

Apache access logs

LogFormat "%h %l %u %t \"%r\" %>s %b" common

LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
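
One way to pull numbers back out of such a line is to read only the three trailing fields, which avoids parsing the quoted request and user-agent strings. The sketch below is an assumption about how this could be done, not Etsy's tooling: `tail_fields` is a hypothetical helper and the sample line in the usage note is invented.

```python
def tail_fields(line):
    """Return (php_memory_bytes, php_time_us, total_time_us) for one line.

    Relies only on the last three fields of the extended LogFormat:
        %{php_memory_usage_bytes}n %{php_time_microsec}n %D
    %D is Apache's time to serve the request, in microseconds.
    A field logged as "-" (value not set) is returned as None.
    """
    mem, php_us, total_us = line.rstrip("\n").rsplit(" ", 3)[1:]

    def to_int(field):
        return None if field == "-" else int(field)

    return to_int(mem), to_int(php_us), to_int(total_us)
```

For a made-up line ending in `... 5242880 81234 95012`, this returns `(5242880, 81234, 95012)`, so the gap between PHP time and total request time can be graphed per request.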

%{etsy_ab_selections}n

%{etsy_uaid}n

Graphs

“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner

http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/

StatsD

StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);

StatsD::timing("gearman.time", $msec);

90th pct

average

lower
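
What a client call like the two above puts on the wire, and how a flush interval of timings might be reduced to the three lines on the graph, can be sketched as follows. The datagram format (`bucket:value|c` for counters, `bucket:value|ms` for timers, sent over UDP) is StatsD's; the helper names, the placeholder host and port defaults, and the nearest-rank percentile rule are assumptions, and StatsD's own percentile arithmetic may differ in detail.

```python
import socket


def statsd_packet(bucket, value, kind):
    """Format one StatsD datagram: "<bucket>:<value>|<type>"."""
    return f"{bucket}:{value}|{kind}".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; host and port are placeholder defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()


def summarize(timings_ms, pct=90):
    """Reduce one interval of timings to lower, mean, and the upper
    bound of the fastest `pct` percent (nearest-rank rule)."""
    ordered = sorted(timings_ms)
    keep = max(1, round(len(ordered) * pct / 100))
    return {
        "lower": ordered[0],
        "mean": sum(ordered) / len(ordered),
        f"upper_{pct}": ordered[keep - 1],
    }
```

Because the transport is UDP, a missing or slow collector cannot block the application, which is part of what makes instrumenting code this cheap.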

Ad hoc
name value timestamp

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
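
The same "name value timestamp" line can be sent from application code instead of the shell. A minimal sketch, assuming a Graphite plaintext listener on TCP port 2003 as in the one-liner; the function names are hypothetical and the host is supplied by the caller.

```python
import socket
import time


def graphite_line(name, value, timestamp=None):
    """Format one plaintext-protocol line: "<name> <value> <timestamp>\n"."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_event(name, host, port=2003, value=1):
    """Open a TCP connection, send a single metric line, and close."""
    line = graphite_line(name, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A deploy script could call `send_event("events.deploy.site", host)` once per push, giving every deploy a timestamped data point.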

Correlations

Trends + Events
target=drawAsInfinite(events.deploy.site)

What Happened?

Holt-Winters

"Forecasting Sales by Exponentially Weighted Moving Averages". Peter

"Aberrant Behavior Detection in Time Series for Network Monitoring".

"Holt-Winters Forecasting Applied to Poisson

Processes in Real-Time".

holtWintersConfidence(Upper|Lower)

holtWintersAberration
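
The idea behind confidence bands and aberration detection can be sketched with additive Holt-Winters smoothing plus a smoothed absolute error per seasonal slot, in the style of the aberrant-behavior paper cited above. This is an illustration, not Graphite's implementation: the function names, the smoothing constants, and the band width `delta` are assumed values.

```python
def holt_winters_bands(series, period, alpha=0.1, beta=0.0035,
                       gamma=0.1, delta=3.0):
    """Return (forecast, lower, upper) lists aligned with `series`.

    forecast[t] is the one-step-ahead prediction made before seeing
    series[t]; the band is forecast +/- delta * smoothed deviation.
    """
    level = series[0]
    trend = 0.0
    season = [0.0] * period      # additive seasonal offsets
    deviation = [0.0] * period   # smoothed absolute error per slot
    forecast, lower, upper = [], [], []
    for t, actual in enumerate(series):
        slot = t % period
        predicted = level + trend + season[slot]
        band = delta * deviation[slot]
        forecast.append(predicted)
        lower.append(predicted - band)
        upper.append(predicted + band)
        # Update the smoothed components with the observed value.
        prev_level = level
        level = alpha * (actual - season[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[slot] = gamma * (actual - level) + (1 - gamma) * season[slot]
        deviation[slot] = (gamma * abs(actual - predicted)
                           + (1 - gamma) * deviation[slot])
    return forecast, lower, upper


def aberrations(series, lower, upper):
    """Indices where the observed value falls outside the band."""
    return [t for t, v in enumerate(series) if v < lower[t] or v > upper[t]]
```

The components start from crude initial values, so the first few seasons of output are warm-up and should not be alerted on. After that, a point outside the band is a candidate alert for the metric.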

business metrics with confidence bands

==alertable business metrics

16,000 metrics in GRAPHITE

(plus 32,000 metrics in GANGLIA)

Dashboards

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Hard

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
$g->showDeploys(true);
echo $g->getDashboardHTML(280, 220);

Easy!
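
What a helper like this hides is mostly URL assembly. A rough sketch of that step, assuming the render parameters visible in the hand-written URL on the "Hard" slide (`from`, `width`, `height`, `title`, `yMin`, repeated `target`); `render_url` is a hypothetical function, not the `Graphite` PHP class from the slide, and the base URL in the usage note is a placeholder.

```python
from urllib.parse import urlencode


def render_url(base, title, metric, deploy_events=(), width=280,
               height=220, since="-1hours"):
    """Build a Graphite render URL for one metric plus deploy markers.

    Each deploy event series is wrapped in drawAsInfinite() so it is
    drawn as a vertical line rather than as a value.
    """
    params = [
        ("from", since),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
        ("target", metric),
    ]
    params += [("target", f"drawAsInfinite({name})") for name in deploy_events]
    return f"{base}/render?{urlencode(params)}"
```

Calling `render_url("http://graphite.example.com", "File Not Found", "webs.errorLog.notExist", ["deploys.web.production"])` yields a URL that can go straight into an `<img src=...>` on a dashboard page, with the escaping handled once instead of by hand.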

48 dashboards by 32 engineers

Application health

High-level visibility

Low MTTD

Confidence

Make metrics

Not that much

codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster

bitbucket.org/maplebed/ganglia-logtailer

Questions?
