metrics driven engineering

Metrics-Driven Engineering Mike Brittain ENGINEERING DIRECTOR, ETSY @mikebrittain

Upload: mike-brittain

Post on 08-May-2015


DESCRIPTION

Presented at USI 2013 in Paris, France. In this talk I discuss how Etsy's engineering team collects and uses real-time metrics to add confidence to our Continuous Deployment culture. Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: [email protected]. http://www.etsy.com/careers

TRANSCRIPT

Page 1: Metrics Driven Engineering

Metrics-Driven Engineering
Mike Brittain
ENGINEERING DIRECTOR, ETSY
@mikebrittain

Page 2: Metrics Driven Engineering

PROCESS AND TOOLS

Supporting a culture of Continuous Deployment

Page 3: Metrics Driven Engineering
Page 4: Metrics Driven Engineering
Page 5: Metrics Driven Engineering

How do people use Etsy?

• How many new visitors?
• How many listings created?
• How many registrations?
• How many messages sent?
• How many purchases?
• How many new shops?

Page 6: Metrics Driven Engineering

How is the application behaving?

• Search indexing?
• How fast are pages generating?
• Async tasks currently in queue?
• Developer API auth and rate limiting?
• Images resized and stored?
• Error and warning rates?

Page 7: Metrics Driven Engineering

Are the servers and network OK?

• Replication slave lag?
• Memcache hits/misses?
• Available connections?
• Database queries per second?
• Total outgoing bandwidth?
• CPU, Memory, I/O?

Page 8: Metrics Driven Engineering

Business Metrics

Page 9: Metrics Driven Engineering

Application Metrics

Page 10: Metrics Driven Engineering

System Metrics

Page 11: Metrics Driven Engineering

System Metrics

Page 12: Metrics Driven Engineering

Visibility EVERYWHERE

Page 13: Metrics Driven Engineering

Metrics help you identify goals

Page 14: Metrics Driven Engineering

Metrics help you identify goals... but also tell you when you’ve broken something.

Page 15: Metrics Driven Engineering

Always Be Shipping

credit: ibailemon (flickr)

Page 16: Metrics Driven Engineering

1st day: Put yourself on the web site.

Page 17: Metrics Driven Engineering

2nd day: Complete tax, insurance, and benefits forms.

credit: ktpupp (flickr)

Page 18: Metrics Driven Engineering

Dev Sandbox Trunk / master Production

You!

Test

Page 19: Metrics Driven Engineering

7e9a814 -> 63a2bb3

Deploy to Production

Page 20: Metrics Driven Engineering

50+ Deploys / day

• 200+ Committers
• 15 Product teams
• 8 Infrastructure teams

credit: misswired (flickr)

Page 21: Metrics Driven Engineering

credit: digidave (flickr)

Page 22: Metrics Driven Engineering

Peer Review

Code reviews, Architecture reviews, Operability reviews

Page 23: Metrics Driven Engineering

Automated Tests

Static analysis, Unit tests, Integration tests, Functional tests

Page 24: Metrics Driven Engineering

May 2013

$102.9 Million in goods sold
1.37 Billion page views

https://www.etsy.com/blog/news/2013/etsy-statistics-may-2013-weather-report/

Page 25: Metrics Driven Engineering

Failure is not an option

Page 26: Metrics Driven Engineering

Failure is inevitable (replacing the struck-through "not an option")

Page 27: Metrics Driven Engineering

Failure is inevitable and detectable! (replacing the struck-through "not an option")

Page 28: Metrics Driven Engineering

Access

Page 29: Metrics Driven Engineering
Page 30: Metrics Driven Engineering

Q: Sounds like a lot of work, who’s going to build all of this?

Page 31: Metrics Driven Engineering

Q: Sounds like a lot of work, who’s going to build all of this?

A: Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...

Page 32: Metrics Driven Engineering

Engineers build the application

Page 33: Metrics Driven Engineering

OPS

Logging, Graphing, Trending, Alerting

ENG

Page 34: Metrics Driven Engineering

Metrics are part of every feature (and so are config flags)

Page 35: Metrics Driven Engineering

Make it DEAD SIMPLE

Page 36: Metrics Driven Engineering

Simple, open-source tools

• Ganglia (application, servers, network)
• Logster* (application, servers)
• Cacti (network, SNMP)
• FITB* (network)
• Graphite (application)
• StatsD* (application)
• Log formats (application, servers)
• Nagios (alerting)

* github.com/etsy

Page 37: Metrics Driven Engineering

Ganglia

Page 38: Metrics Driven Engineering

Ganglia

• Cluster-oriented
• Huge community contributed recipes
• Custom metrics (gmetad)

Page 39: Metrics Driven Engineering

Graphite

Page 40: Metrics Driven Engineering

Graphite

• Single-instance
• Create new metrics on-the-fly
• Customize via URLs and display functions

http://www.aosabook.org/en/graphite.html

Page 41: Metrics Driven Engineering

Log Formats

Page 42: Metrics Driven Engineering

Access Logs

Time, remote address, HTTP method, request URI, referrer, user-agent, response size, response code, execution time, memory consumed, plus custom fields...

• Signed-in/out (user_id vs. “-”)
• Display mode (“desktop” vs. “mobile”)
• l10n/i18n (“en-US”)
• etc.

Page 43: Metrics Driven Engineering

LogFormat "%l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{custom_field}n ..."

apache_note("custom_field", $whatever);

Page 44: Metrics Driven Engineering

LogFormat "%{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{display_mode}n %{user_id}n %{php_bytes}n %{php_usec}n %D"

web0060 66.249.71.110 - - [11/May/2011:17:08:53 +0000] "GET /listing/12189259/tropical-etched-pair-of-lampwork-glass HTTP/1.1" 200 11034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" desktop - 13399576 505780 554876
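To illustrate why the custom fields are worth logging, the sample line above can be split back into named values. This is only a sketch in Python under stated assumptions: the leading `web0060` is taken to be a server name prepended when logs are collected (it is not part of the LogFormat shown), the trailing fields are assumed to follow the LogFormat order, and the names `ACCESS_RE`, `parse_access_line`, and `total_usec` are hypothetical.

```python
import re

# Hypothetical pattern for the sample line: a server name (assumed to be
# added by log collection), then the LogFormat fields in order.
ACCESS_RE = re.compile(
    r'^(?P<server>\S+) (?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<display_mode>\S+) (?P<user_id>\S+) (?P<php_bytes>\d+) '
    r'(?P<php_usec>\d+) (?P<total_usec>\d+)$'
)


def parse_access_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A "-" in the user_id slot means the request was signed out.
    fields["signed_in"] = fields["user_id"] != "-"
    for key in ("status", "php_bytes", "php_usec", "total_usec"):
        fields[key] = int(fields[key])
    return fields


SAMPLE = ('web0060 66.249.71.110 - - [11/May/2011:17:08:53 +0000] '
          '"GET /listing/12189259/tropical-etched-pair-of-lampwork-glass HTTP/1.1" '
          '200 11034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)" desktop - 13399576 505780 554876')

fields = parse_access_line(SAMPLE)
print(fields["display_mode"], fields["signed_in"], fields["php_usec"])
```

Once the fields are named, page generation time can be aggregated by display mode or by signed-in state without touching the application.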

Page 45: Metrics Driven Engineering


Page 46: Metrics Driven Engineering

Logger::error("User login failed. Reason: $msg for $email_addr", "login");

Method name denotes log “level”: error, fatal, warning, notice, debug.

A “namespace” parameter is provided so we can aggregate log entries with similar concerns.

Page 47: Metrics Driven Engineering

Logger::error("User login failed. Reason: $msg for $email_addr", "login");

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for [email protected]

Fields, in order: server name, date and time, level, namespace, unique request ID.
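The entry layout above (server name, date and time, level, namespace, unique request ID, message) can be sketched as a small formatter. This is an illustration in Python, not Etsy's PHP Logger; `format_entry` and its parameters are hypothetical names.

```python
from datetime import datetime

# Levels named on the earlier slide.
LEVELS = ("debug", "notice", "warning", "error", "fatal")


def format_entry(server, level, namespace, request_id, message, when):
    """Build one log line in the bracketed layout shown on the slide."""
    if level not in LEVELS:
        raise ValueError("unknown log level: %s" % level)
    # Same timestamp shape as the sample: "Fri Mar 04 16:27:48 2011".
    stamp = when.strftime("%a %b %d %H:%M:%S %Y")
    return "%s [%s] [%s] [%s] [%s] %s" % (
        server, stamp, level, namespace, request_id, message)


line = format_entry("web0054", "error", "login", "mk04gw1p71",
                    "User login failed.", datetime(2011, 3, 4, 16, 27, 48))
print(line)
```

Keeping every field in a fixed, bracketed position is what makes the later pattern matching cheap: a parser can pick out the level or namespace without understanding the message.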

Page 48: Metrics Driven Engineering

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] Invalid charset conversion
web0102 [Fri Mar 04 16:27:48 2011] [warning] [login] [47dd608551] User login failed. Reason
web0012 [Fri Mar 04 16:27:48 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0081 [Fri Mar 04 16:27:48 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0100 [Fri Mar 04 16:27:49 2011] [fatal] [register] [f9c2b23702] Invalid charset conversion
web0003 [Fri Mar 04 16:27:49 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0050 [Fri Mar 04 16:27:49 2011] [error] [register] [2e468a9bb6] Duplicate user ID encountered
web0054 [Fri Mar 04 16:27:49 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:49 2011] [error] [login] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:49 2011] [error] [login] [47dd608551] Duplicate user ID encountered
web0012 [Fri Mar 04 16:27:49 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:49 2011] [fatal] [login] [mk04gw1p71] Invalid charset conversion
web0012 [Fri Mar 04 16:27:49 2011] [error] [login] [2f297b40a5] User login failed. Reason
web0025 [Fri Mar 04 16:27:49 2011] [warning] [register] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:49 2011] [warning] [register] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [warning] [register] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:50 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [2f297b40a5] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [error] [login] [2e468a9bb6] User login failed. Reason
web0054 [Fri Mar 04 16:27:50 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [47dd608551] Invalid charset conversion
web0012 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:50 2011] [fatal] [login] [mk04gw1p71] Invalid charset conversion
web0012 [Fri Mar 04 16:27:50 2011] [error] [register] [2f297b40a5] Duplicate user ID encountered
web0025 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:50 2011] [warning] [login] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:51 2011] [warning] [login] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:51 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:51 2011] [error] [login] [2f297b40a5] User login failed. Reason

Page 49: Metrics Driven Engineering


Page 50: Metrics Driven Engineering


FATALS ERRORS WARNINGS

Logster

Page 51: Metrics Driven Engineering

Logster

github.com/etsy/logster

• Run by cron (e.g. 1m intervals)
• Keeps a cursor on your log file
• Parse and aggregate values however you want
• Output to Ganglia, Graphite, Amazon CloudWatch
• Simple parsers

Page 52: Metrics Driven Engineering

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for [email protected]

^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]

1. Pattern match on fields of interest

Page 53: Metrics Driven Engineering

if fields['log_level'] == "fatal":
    self.fatals += 1
elif fields['log_level'] == "error":
    self.errors += 1
elif fields['log_level'] == "warning":
    self.warnings += 1

...

2. Aggregate values (sum, average, percentile, etc.)

Page 54: Metrics Driven Engineering

MetricObject("fatals", (self.fatals / self.duration), "per sec")

MetricObject("errors", (self.errors / self.duration), "per sec")

MetricObject("warning", (self.warnings / self.duration), "per sec")

3. Send the values as “metric objects” to the collectors
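The three steps (match, aggregate, emit) can be put together in one self-contained sketch. It deliberately does not import Logster: `Metric` is a local stand-in for the metric objects on the slide, and `LogLevelParser`, `parse_line`, and `get_state` are names assumed for illustration. The pattern is a bracket-bounded form, so the capture stops at the level field rather than running on to a later bracket.

```python
import re
from collections import namedtuple

# Local stand-in for the slide's metric objects; not the Logster class.
Metric = namedtuple("Metric", ["name", "value", "units"])


class LogLevelParser:
    """Counts fatal/error/warning lines and reports per-second rates."""

    # Server name, [timestamp], then [level]; each bracket is bounded.
    LINE_RE = re.compile(r"^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]")

    def __init__(self):
        self.counts = {"fatal": 0, "error": 0, "warning": 0}

    def parse_line(self, line):
        # Step 1: pattern match on the field of interest.
        match = self.LINE_RE.match(line)
        if match is None:
            return
        level = match.group("log_level")
        # Step 2: aggregate (here, a plain count per level).
        if level in self.counts:
            self.counts[level] += 1

    def get_state(self, duration):
        # Step 3: convert counts to rates for the collector.
        return [
            Metric("fatals", self.counts["fatal"] / duration, "per sec"),
            Metric("errors", self.counts["error"] / duration, "per sec"),
            Metric("warnings", self.counts["warning"] / duration, "per sec"),
        ]


parser = LogLevelParser()
for entry in [
    "web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed.",
    "web0100 [Fri Mar 04 16:27:49 2011] [fatal] [register] [f9c2b23702] Invalid charset",
    "web0012 [Fri Mar 04 16:27:49 2011] [warning] [login] [32976da59c] User login failed.",
    "web0081 [Fri Mar 04 16:27:48 2011] [error] [register] [39e08e6692] Duplicate user ID",
]:
    parser.parse_line(entry)
for metric in parser.get_state(duration=2.0):
    print(metric.name, metric.value, metric.units)
```

The `duration` argument stands for the seconds elapsed since the previous run, which is what turns raw counts into comparable rates when the cron interval varies.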

Page 55: Metrics Driven Engineering

github.com/etsy/logster

FATALS ERRORS WARNINGS

Logster

Page 56: Metrics Driven Engineering

StatsD

Page 57: Metrics Driven Engineering

github.com/etsy/statsd

StatsD

• Network daemon (node.js)
• Accepts data over UDP
• Flushes to Graphite every 10 sec
• One line of code

Page 58: Metrics Driven Engineering

StatsD::increment("logins.success");

Page 59: Metrics Driven Engineering

StatsD::increment("logins.success");

Logins

Page 60: Metrics Driven Engineering

StatsD::timing("profile.time", $msec);
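Behind one-liners like these, a StatsD client only has to format a short text payload and send it in a UDP datagram. A minimal Python sketch, assuming the standard StatsD line format (`name:value|c` for counters, `name:value|ms` for timers) and the default port 8125; the helper names and the localhost address are hypothetical.

```python
import socket


def format_counter(name, count=1):
    """Counter payload, e.g. 'logins.success:1|c'."""
    return "%s:%d|c" % (name, count)


def format_timer(name, msec):
    """Timer payload in milliseconds, e.g. 'profile.time:320|ms'."""
    return "%s:%d|ms" % (name, msec)


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP means no connection setup and no reply to wait for."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


print(format_counter("logins.success"))
print(format_timer("profile.time", 320))
# send_metric(format_counter("logins.success"))  # needs a reachable StatsD
```

Because the send is a single datagram with no acknowledgement, instrumenting a code path costs very little and does not block the request if the daemon is unavailable.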

Page 61: Metrics Driven Engineering

StatsD::timing("profile.time", $msec);

90th pct

average

lower
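The three series on the graph (90th percentile, average, lower) are summaries of the timing samples received in each flush interval. A hedged Python sketch of that kind of summary follows; it uses a nearest-rank percentile, which is an assumption and may differ from how StatsD itself picks the threshold value, and `summarize_timings` and the result keys are hypothetical names.

```python
def summarize_timings(samples_ms, pct=90):
    """Summarize one flush interval's timing samples (milliseconds)."""
    if not samples_ms:
        return None
    ordered = sorted(samples_ms)
    count = len(ordered)
    # Nearest-rank percentile: integer ceiling of pct% of the sample count.
    rank = max(1, (pct * count + 99) // 100)
    return {
        "lower": ordered[0],
        "average": sum(ordered) / count,
        "upper_%d" % pct: ordered[rank - 1],
    }


summary = summarize_timings([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(summary)
```

Plotting the percentile next to the average is useful because a few slow requests can move the 90th percentile well before they move the mean.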

Page 62: Metrics Driven Engineering

Ad hoc

name value timestamp

Page 63: Metrics Driven Engineering

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
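The same "name value timestamp" line can be produced from application code. A small Python sketch, assuming Graphite's plaintext listener on TCP port 2003 as in the shell command above; `format_event`, `send_event`, and the example host are hypothetical.

```python
import socket
import time


def format_event(name, value=1, timestamp=None):
    """Build one plaintext line: 'name value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_event(line, host, port=2003):
    """Open a TCP connection, write the line, and close."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


print(format_event("events.deploy.site", 1, 1318444930), end="")
# send_event(format_event("events.deploy.site"), "graphite.example.com")
```

A deploy script can call this once per push, which gives every graph a marker for "code changed here".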

Page 64: Metrics Driven Engineering

Vertical Line Technology!

target=drawAsInfinite(events.deploy.site)

Page 65: Metrics Driven Engineering

User Logins

Page 66: Metrics Driven Engineering

PHP Warnings

Page 67: Metrics Driven Engineering

PHP Fatal Errors

Page 68: Metrics Driven Engineering

250,000+ metrics at Etsy

Systems, Applications, Business

Page 69: Metrics Driven Engineering

github.com/etsy/dashboard

Dashboards

Page 70: Metrics Driven Engineering

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Kind of Hard :-/

github.com/etsy/dashboard

Page 71: Metrics Driven Engineering

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);

Super Easy!

github.com/etsy/dashboard
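The point of a wrapper like this is that the long render URL is assembled by code rather than by hand. A rough Python equivalent is sketched below; it is not the etsy/dashboard API, and `render_url`, its parameters, and the example host are assumptions for illustration. The query parameters themselves (`from`, `width`, `height`, `title`, `yMin`, `target`) are the ones visible in the hand-written URL on the previous slide.

```python
from urllib.parse import urlencode


def render_url(base, title, metric, deploy_metrics=(),
               width=280, height=220, hours=1):
    """Assemble a Graphite render URL with optional deploy markers."""
    params = [
        ("from", "-%dhours" % hours),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
        ("target", metric),
    ]
    # Each deploy series is drawn as vertical lines over the main metric.
    for name in deploy_metrics:
        params.append(("target", "drawAsInfinite(%s)" % name))
    return base.rstrip("/") + "/render?" + urlencode(params)


print(render_url("http://graphite.example.com", "File Not Found",
                 "webs.errorLog.notExist", ["deploys.web.production"]))
```

With the escaping handled in one place, adding a graph to a dashboard becomes a matter of naming a metric and a title.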

Page 72: Metrics Driven Engineering

But, you said...

“250,000+ metrics at Etsy”

Systems, Applications, Business

Page 74: Metrics Driven Engineering

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
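The raw output above is easy to consume from a script: a header of name, start, end, and step, then a pipe, then comma-separated values where `None` marks a missing point. A minimal Python sketch of a parser for that one-line layout; `parse_raw_data` is a hypothetical helper name.

```python
def parse_raw_data(line):
    """Parse 'name,start,end,step|v1,v2,...' into a dict."""
    header, _, body = line.strip().partition("|")
    # Split from the right so a target containing commas stays intact.
    name, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in body.split(",")]
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "step": int(step),
        "values": values,
    }


series = parse_raw_data(
    "webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,None")
present = [v for v in series["values"] if v is not None]
print(series["name"], series["step"], max(present))
```

Once the values are plain numbers, a monitoring check can compare the latest points against a threshold, which is how a graphed metric becomes an alertable one.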

Page 75: Metrics Driven Engineering

Holt-Winters Confidence Bands

lower

upper

Page 76: Metrics Driven Engineering

Holt-Winters Aberration

Page 77: Metrics Driven Engineering

Business metrics + Confidence bands = Alertable metrics
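To show how a band turns a metric into something alertable, here is a deliberately simplified Python sketch. It is not Holt-Winters: it swaps in plain exponential smoothing with no trend or seasonal terms, and the parameters `alpha`, `gamma`, and `delta` and both function names are assumptions for illustration. The aberration is the amount by which a point falls outside its band, and zero when it is inside.

```python
def smoothed_bands(values, alpha=0.5, gamma=0.5, delta=3.0):
    """Return a (lower, upper) band per point, built from earlier points only.

    Simplified stand-in for Holt-Winters: exponential smoothing of the
    level and of the absolute deviation, without trend or seasonality.
    """
    forecast = float(values[0])
    deviation = 0.0
    bands = []
    for value in values:
        bands.append((forecast - delta * deviation,
                      forecast + delta * deviation))
        # Update the estimates only after the band for this point is fixed.
        deviation = gamma * abs(value - forecast) + (1 - gamma) * deviation
        forecast = alpha * value + (1 - alpha) * forecast
    return bands


def aberration(values, bands):
    """Distance outside the band for each point, or 0.0 when inside it."""
    result = []
    for value, (lower, upper) in zip(values, bands):
        if value > upper:
            result.append(value - upper)
        elif value < lower:
            result.append(value - lower)
        else:
            result.append(0.0)
    return result


logins = [10, 10, 10, 10, 30]
print(aberration(logins, smoothed_bands(logins)))
```

An alert can then fire whenever the aberration is non-zero for a few consecutive points, without anyone choosing a fixed threshold for each business metric.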

Page 78: Metrics Driven Engineering

Metrics!

• Metrics + Events
• Metrics + Alerts
• Metrics + Metrics

Page 79: Metrics Driven Engineering

High-level, real-time visibility

Page 80: Metrics Driven Engineering

Detect problems early, and resolve them quickly.

Page 81: Metrics Driven Engineering

• Make them accessible
• Make them required features
• Make them dead simple

Page 82: Metrics Driven Engineering

Merci!

These slides will be available at mikebrittain.com/talks

codeascraft.etsy.com
github.com/etsy

Say “Hello!” [email protected]

@mikebrittain
