metrics-driven engineering

Mike Brittain (@mikebrittain), Director of Engineering, Infrastructure. Metrics-Driven Engineering, October 13, 2011

Upload: mike-brittain

Post on 08-Sep-2014


DESCRIPTION

Presented at Web 2.0 Expo, Oct. 13 2011

TRANSCRIPT

Page 1: Metrics-Driven Engineering

Mike Brittain @ mikebrittain

Director of engineering, Infrastructure

Metrics-Driven Engineering

October 13, 2011

Page 2: Metrics-Driven Engineering

Tools and Process at Etsy

Page 3: Metrics-Driven Engineering

How do people use Etsy?

How many new visits?
How many listings created?
How many registrations?
How many convos sent?
How many purchases?
How many new shops?

Page 4: Metrics-Driven Engineering

What is the application doing?

Search indexing?
How fast are pages generating?
Async tasks currently in queue?
Developer API auth and rate limiting?
Images resized and stored?
Error and warning rates?

Page 5: Metrics-Driven Engineering

Are the servers in good shape?

Replication slave lag?
Memcache hits/misses?
Available connections?
Database queries per second?
Total outgoing bandwidth?
CPU, Memory, I/O?

Page 6: Metrics-Driven Engineering

Business Metrics

Page 7: Metrics-Driven Engineering

Application Metrics

Page 8: Metrics-Driven Engineering

System Metrics

Page 9: Metrics-Driven Engineering

Visibility EVERYWHERE

Page 10: Metrics-Driven Engineering

Constant Change

Page 11: Metrics-Driven Engineering
Page 12: Metrics-Driven Engineering

$314 Million GMS 2010
$180 Million GMS 2009
$87 Million GMS 2008
$26 Million GMS 2007

credit: pentarux (flickr)

Page 13: Metrics-Driven Engineering

25 Million Unique Visitors
1 Billion page views per month

credit: pentarux (flickr)

Page 14: Metrics-Driven Engineering

Engineering team grew 500% over 18 months

credit: martin_heigan (flickr)

Page 15: Metrics-Driven Engineering

Less talk, more do.

Page 16: Metrics-Driven Engineering

Always Be Shipping

credit: ibailemon (flickr)

Page 17: Metrics-Driven Engineering

Always Be Shipping
(even if it’s your first day)

credit: ibailemon (flickr)

Page 18: Metrics-Driven Engineering
Page 19: Metrics-Driven Engineering

90+ Engineers
40+ Deploys / day

credit: misswired (flickr)

Page 20: Metrics-Driven Engineering

credit: digidave (flickr)

Page 21: Metrics-Driven Engineering

Code Reviews

Page 22: Metrics-Driven Engineering

Automated Tests

Page 23: Metrics-Driven Engineering

$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);

Config Flags
Enable and disable features quickly

Page 24: Metrics-Driven Engineering

Config Flags
Enable and disable features quickly
Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc.
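A flag lookup with a percentage ramp-up could be sketched roughly as follows. This is a minimal Python illustration, not Etsy's implementation: only the `enabled` key with `on`/`off` values appears on the slide, while the `rampup` state, the `percent` and `whitelist` keys, and the `bucket` and `is_enabled` helpers are hypothetical names made up for this example.

```python
import zlib

# Hypothetical flag config. Only 'enabled' => 'on'/'off' comes from the slide;
# the 'rampup' state and the 'percent' / 'whitelist' keys are assumptions.
CFG = {
    "checkout": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_profiles": {"enabled": "rampup", "percent": 10, "whitelist": {"admin"}},
}


def bucket(feature: str, user: str) -> int:
    """Map (feature, user) to a stable bucket in the range 0..99."""
    return zlib.crc32(f"{feature}:{user}".encode("utf-8")) % 100


def is_enabled(cfg: dict, feature: str, user: str) -> bool:
    """Return True if `feature` should be shown to `user`."""
    flag = cfg.get(feature, {"enabled": "off"})  # unknown features stay off
    state = flag.get("enabled", "off")
    if state == "on":
        return True
    if state == "rampup":
        if user in flag.get("whitelist", ()):
            return True
        # The same user always lands in the same bucket, so raising
        # 'percent' only ever adds users to the enabled group.
        return bucket(feature, user) < flag.get("percent", 0)
    return False
```

Hashing the feature name together with the user keeps the ramp-up groups of different features independent of each other.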

Page 25: Metrics-Driven Engineering

Failure is not an option

Page 26: Metrics-Driven Engineering

Failure is not an option

inevitable!

Page 27: Metrics-Driven Engineering

Failure is not an option

inevitable!

a learning opportunity!

Page 28: Metrics-Driven Engineering

Failure is not an option

inevitable!

a learning opportunity!

DETECTABLE!

Page 29: Metrics-Driven Engineering

Access

Page 30: Metrics-Driven Engineering
Page 31: Metrics-Driven Engineering
Page 32: Metrics-Driven Engineering
Page 33: Metrics-Driven Engineering

Detect problems quickly

Page 34: Metrics-Driven Engineering

CONFIDENCE

Page 35: Metrics-Driven Engineering
Page 36: Metrics-Driven Engineering

A: Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...

Page 37: Metrics-Driven Engineering

Engineers build the application

Page 38: Metrics-Driven Engineering

OPS

Logging
Graphing
Trending
Alerting

ENG

Page 39: Metrics-Driven Engineering

“Engineers are too busy writing features to build metrics.”

Page 40: Metrics-Driven Engineering

Metrics are part of every feature...and so are config flags

Page 41: Metrics-Driven Engineering

Dead Simple

Page 42: Metrics-Driven Engineering

Simple, open source tools

Page 43: Metrics-Driven Engineering

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Logging
Logster
StatsD

Page 44: Metrics-Driven Engineering

Ganglia

Page 45: Metrics-Driven Engineering

Cluster-oriented
Huge community contributed recipes
Custom metrics (gmetad)

Ganglia

Page 46: Metrics-Driven Engineering

Graphite

Page 47: Metrics-Driven Engineering

Single-instance
Create new metrics on-the-fly

Customize via URLs and display functions

Graphite

Page 48: Metrics-Driven Engineering

Logging

Page 49: Metrics-Driven Engineering

It’s 2:48 PM.

Do you know where your logs are?

Page 50: Metrics-Driven Engineering

Logger::log_error("User login failed. Reason: $msg for $username", "login");

Page 52: Metrics-Driven Engineering

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...

Page 58: Metrics-Driven Engineering

LogFormat "%h %l %u %t \"%r\" %>s %b" common

Page 59: Metrics-Driven Engineering

LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

Page 60: Metrics-Driven Engineering

apache_note()

Page 64: Metrics-Driven Engineering

grep "/listing/" access.log | \
    awk '{sum=sum+$(NF-2)} END {print sum/NR}'
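For anyone who would rather do the same aggregation in a script, the pipeline above can be approximated in Python. This is a sketch under assumptions: the `mean_field` helper and its parameters are invented for illustration, and unlike the awk version it skips lines whose field is not numeric instead of counting them as zero.

```python
from typing import Iterable, Optional


def mean_field(lines: Iterable[str], needle: str = "/listing/",
               offset_from_end: int = 3) -> Optional[float]:
    """Average one numeric field over the lines that contain `needle`.

    offset_from_end=3 selects the third-from-last whitespace-separated
    field, which is the field awk addresses as $(NF-2).
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue  # same filtering role as the grep step
        fields = line.split()
        if len(fields) < offset_from_end:
            continue
        try:
            value = float(fields[-offset_from_end])
        except ValueError:
            continue  # skip lines where the field is not a number
        total += value
        count += 1
    return total / count if count else None
```

With a real log this could be called as `mean_field(open("access.log"))`, where the file name is taken from the command above.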

Page 65: Metrics-Driven Engineering

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

Page 66: Metrics-Driven Engineering

Fatals Errors Warnings

Logster

Page 67: Metrics-Driven Engineering

github.com/etsy

Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers

Logster
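The "keeps a cursor" idea can be illustrated with a few lines of Python. This is only a sketch of the general technique and not Logster's own mechanism: the `read_new_lines` helper and the cursor-file layout (a plain text file holding a byte offset) are assumptions made for this example.

```python
import os


def read_new_lines(log_path: str, cursor_path: str) -> list:
    """Return the complete lines appended to log_path since the last call.

    The byte offset reached by the previous run is stored in cursor_path.
    If the log is now shorter than that offset (for example after
    rotation), reading restarts from the beginning of the file.
    """
    offset = 0
    if os.path.exists(cursor_path):
        with open(cursor_path) as f:
            text = f.read().strip()
            offset = int(text) if text.isdigit() else 0
    if offset > os.path.getsize(log_path):
        offset = 0  # file was truncated or rotated

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Consume only up to the last newline so that a line still being
    # written is picked up whole on the next run.
    end = data.rfind(b"\n") + 1
    data = data[:end]

    with open(cursor_path, "w") as f:
        f.write(str(offset + end))
    return data.decode("utf-8", errors="replace").splitlines()
```

Run from cron once a minute, each invocation would see only the lines written since the previous one, which is what makes per-interval rates possible.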

Page 68: Metrics-Driven Engineering

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...

Page 69: Metrics-Driven Engineering

^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]

Page 70: Metrics-Driven Engineering

if fields['log_level'] == "fatal":
    self.fatals += 1
elif fields['log_level'] == "error":
    self.errors += 1
elif fields['log_level'] == "warning":
    self.warnings += 1

...

Page 71: Metrics-Driven Engineering

MetricObject("fatals", (self.fatals / self.duration), "per sec")

MetricObject("errors", (self.errors / self.duration), "per sec")

MetricObject("warning", (self.warnings / self.duration), "per sec")
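Put together, the fragments on the last few slides suggest a parser shaped roughly like the one below. Treat it as a self-contained sketch rather than the real Logster plugin: `MetricObject` here is a local stand-in defined with `namedtuple`, the class name `ErrorLevelParser` is hypothetical, and the pattern uses negated character classes so that the capture stops at the first closing bracket.

```python
import re
from collections import namedtuple

# Local stand-in for the metric container shown on the slides; the field
# names are assumptions made for this sketch.
MetricObject = namedtuple("MetricObject", ["name", "value", "units"])


class ErrorLevelParser:
    """Count fatal/error/warning lines and report them as per-second rates."""

    # host, [timestamp], [log_level]; each bracketed group ends at its own ']'
    LINE_RE = re.compile(r"^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]")

    def __init__(self):
        self.fatals = 0
        self.errors = 0
        self.warnings = 0

    def parse_line(self, line):
        """Update the counters from a single log line."""
        match = self.LINE_RE.match(line)
        if match is None:
            return  # ignore lines that do not look like application logs
        level = match.group("log_level")
        if level == "fatal":
            self.fatals += 1
        elif level == "error":
            self.errors += 1
        elif level == "warning":
            self.warnings += 1

    def get_state(self, duration):
        """Return rates over `duration` seconds, one metric per log level."""
        duration = float(duration)
        return [
            MetricObject("fatals", self.fatals / duration, "per sec"),
            MetricObject("errors", self.errors / duration, "per sec"),
            MetricObject("warning", self.warnings / duration, "per sec"),
        ]
```

Dividing by the elapsed duration rather than reporting raw counts keeps the graph comparable even when the interval between runs varies.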

Page 72: Metrics-Driven Engineering

Fatals Errors Warnings

Page 73: Metrics-Driven Engineering

StatsD

Page 74: Metrics-Driven Engineering

github.com/etsy

StatsD
Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code
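What "one line of code" does on the wire can be sketched in Python. StatsD accepts plain-text datagrams of the form `name:value|type`, with `c` for counters and `ms` for timers, and listens on UDP port 8125 by default. The helper names and the `127.0.0.1` default host below are assumptions for this illustration, not part of any official client.

```python
import socket


def format_counter(name: str, delta: int = 1) -> bytes:
    """Build a counter datagram, e.g. b'logins.success:1|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def format_timing(name: str, msec) -> bytes:
    """Build a timer datagram, e.g. b'gearman.time:320|ms'."""
    return f"{name}:{msec}|ms".encode("ascii")


def send(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send over UDP.

    UDP needs no connection or reply, and any socket error is swallowed
    so that a metrics outage can never break the calling code.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        pass
```

A call such as `send(format_counter("logins.success"))` is cheap enough to sprinkle through application code, which is the point of choosing UDP here.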

Page 75: Metrics-Driven Engineering

StatsD::increment("logins.success");

Page 76: Metrics-Driven Engineering

StatsD::increment("logins.success");

logins

Page 77: Metrics-Driven Engineering

StatsD::timing("gearman.time", $msec);

Page 78: Metrics-Driven Engineering

StatsD::timing("gearman.time", $msec);

90th pct

average

lower
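The three series in the graph legend (90th pct, average, lower) are summaries of all timings received during one flush interval. A rough sketch of how such summaries can be derived is shown below. It uses a simple nearest-rank percentile, which is an assumption of this example; StatsD's exact calculation may differ, and the `summarize_timings` name and the dictionary keys are hypothetical.

```python
def summarize_timings(samples, pct=90):
    """Summarize one flush interval of timing samples (milliseconds).

    Returns None when there are no samples, otherwise a dict with the
    smallest value, the largest value, the mean, the nearest-rank
    percentile and the sample count.
    """
    if not samples:
        return None
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank percentile: ceil(pct * n / 100) in integer arithmetic,
    # clamped to at least 1 so a single sample is its own percentile.
    rank = max(1, (pct * n + 99) // 100)
    return {
        "lower": ordered[0],
        "upper": ordered[-1],
        "average": sum(ordered) / n,
        f"upper_{pct}": ordered[rank - 1],
        "count": n,
    }
```

Plotting a high percentile next to the average makes slow outliers visible that the average alone would smooth away.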

Page 79: Metrics-Driven Engineering

Ad hoc
name value timestamp

Page 80: Metrics-Driven Engineering

echo "events.deploy.site 1 `date +%s`" | \
    nc graphite.etsycorp.com 2003
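The shell one-liner writes a single `name value timestamp` line to Graphite's plaintext listener on port 2003. Where shelling out to `nc` is awkward, the same thing might be done from Python as sketched below. The function names are invented for this example, and the caller has to supply a host of their own.

```python
import socket
import time


def format_metric(name: str, value, timestamp=None) -> bytes:
    """Build one plaintext line: 'name value unix_timestamp' plus newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{name} {value} {int(timestamp)}\n".encode("ascii")


def send_to_graphite(line: bytes, host: str, port: int = 2003,
                     timeout: float = 2.0) -> None:
    """Open a TCP connection, send the line and close the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line)
```

A deploy script could call `send_to_graphite(format_metric("events.deploy.site", 1), host)` as its last step to mark each deploy.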

Page 81: Metrics-Driven Engineering

Vertical Line Technology!
target=drawAsInfinite(events.deploy.site)

Page 82: Metrics-Driven Engineering
Page 83: Metrics-Driven Engineering

We could stare at graphs all day...

Page 85: Metrics-Driven Engineering

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
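The raw response above has a small header (`name,start,end,step`) followed by `|` and a comma-separated list of values, with `None` marking intervals that have no data. A minimal parser, assuming exactly that layout, could look like this. The function name `parse_raw_data` is made up for the example.

```python
def parse_raw_data(text: str) -> dict:
    """Parse rawData output into {series_name: [(timestamp, value), ...]}.

    Values are floats, or None where the response says 'None'.
    """
    series = {}
    for line in text.strip().splitlines():
        header, _, body = line.partition("|")
        # Split from the right so that commas inside the series name
        # (for example in function targets) stay part of the name.
        name, start, _end, step = header.rsplit(",", 3)
        start, step = int(start), int(step)
        points = []
        for i, raw in enumerate(body.split(",")):
            raw = raw.strip()
            value = None if raw == "None" else float(raw)
            points.append((start + i * step, value))
        series[name] = points
    return series
```

Once the values are plain numbers, a script or monitoring check can reason about them without anyone staring at the graph.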

Page 86: Metrics-Driven Engineering

Holt-Winters Confidence Bands

lower

upper

Page 87: Metrics-Driven Engineering

Holt-Winters Aberration

Page 88: Metrics-Driven Engineering

Business metrics
+ Confidence bands
_____________
Alertable metrics
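Turning a metric plus its confidence band into an alert comes down to checking whether recent points fall outside the band. The sketch below assumes the three series (values, lower band, upper band) have already been fetched as equal-length lists. The function names and the `window` and `min_points` thresholds are hypothetical choices for illustration.

```python
def find_aberrations(values, lower, upper):
    """Return (index, value) pairs where the value lies outside the band.

    Points where the value or either band edge is None are skipped.
    """
    out = []
    for i, (v, lo, hi) in enumerate(zip(values, lower, upper)):
        if v is None or lo is None or hi is None:
            continue
        if v < lo or v > hi:
            out.append((i, v))
    return out


def should_alert(values, lower, upper, window=5, min_points=3):
    """Alert when at least min_points of the last `window` points are
    outside the band, so a single noisy sample does not page anyone."""
    start = max(0, len(values) - window)
    recent = find_aberrations(values[start:], lower[start:], upper[start:])
    return len(recent) >= min_points
```

Requiring several out-of-band points trades a little detection latency for far fewer false alarms.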

Page 89: Metrics-Driven Engineering

40,000+ metrics at Etsy
Systems, Applications, Business

Page 90: Metrics-Driven Engineering

Dashboards

Page 91: Metrics-Driven Engineering

Dashboards

Page 92: Metrics-Driven Engineering

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Kind of Hard :-/

Page 93: Metrics-Driven Engineering

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);

Super Easy!
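The helper shown on the slide is internal to Etsy, but the idea is easy to approximate: collect a title, targets and colors, then generate the long render URL and the thumbnail markup automatically. The Python sketch below is such an approximation. The class and method names and the `graphite.example.com` host in the usage note are hypothetical; the query parameters are the ones visible in the hand-written HTML on the previous slide.

```python
from html import escape
from urllib.parse import urlencode


class GraphiteGraph:
    """Build render URLs and thumbnail HTML for one graph."""

    def __init__(self, base_url: str, time_range: str = "-1hours"):
        self.base_url = base_url.rstrip("/")
        self.time_range = time_range
        self.title = ""
        self.targets = []
        self.colors = []

    def set_title(self, title: str) -> None:
        self.title = title

    def add_metric(self, target: str, color: str) -> None:
        self.targets.append(target)
        self.colors.append(color)

    def render_url(self, width: int, height: int,
                   hide_legend: bool = False) -> str:
        params = [("from", self.time_range), ("width", width),
                  ("height", height), ("title", self.title), ("yMin", 0)]
        if hide_legend:
            params.append(("hideLegend", 1))
        params += [("target", t) for t in self.targets]
        if self.colors:
            params.append(("colorList", ",".join(self.colors)))
        # urlencode takes care of escaping '#', '(', ')' and spaces.
        return f"{self.base_url}/render?{urlencode(params)}"

    def dashboard_html(self, width: int, height: int,
                       full_width: int = 800, full_height: int = 600) -> str:
        """Small thumbnail that links to a larger version of the graph."""
        thumb = escape(self.render_url(width, height, hide_legend=True))
        full = escape(self.render_url(full_width, full_height))
        return f'<a href="{full}"><img src="{thumb}"></a>'
```

Usage mirrors the PHP snippet: create `GraphiteGraph("http://graphite.example.com")`, call `set_title` and `add_metric`, then print `dashboard_html(280, 220)`.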

Page 94: Metrics-Driven Engineering

Metrics!

Page 95: Metrics-Driven Engineering

Metrics!
Metrics + Events

Page 96: Metrics-Driven Engineering

Metrics!
Metrics + Events
Metrics + Alerts

Page 97: Metrics-Driven Engineering

Metrics!
Metrics + Events
Metrics + Alerts

Metrics + Metrics

Page 98: Metrics-Driven Engineering

High-level, real-time visibility

Page 99: Metrics-Driven Engineering

Detect problems quickly

Page 100: Metrics-Driven Engineering

CONFIDENCE

Page 101: Metrics-Driven Engineering

Make them required features

Page 102: Metrics-Driven Engineering

Make them dead simple

Page 103: Metrics-Driven Engineering

Make them accessible

Page 104: Metrics-Driven Engineering

Make them!

Page 105: Metrics-Driven Engineering

Thank You

Homework
codeascraft.etsy.com
github.com/etsy

We’re always looking for people who are interested in this kind of stuff...

etsy.com/careers

Get in touch
mike @ etsy . com

@ mikebrittain

Page 106: Metrics-Driven Engineering