Metrics-Driven Engineering

Post on 08-May-2015

Category:

Technology

DESCRIPTION

Presented at USI 2013 in Paris, France. In this talk I discuss how Etsy's engineering team collects and uses real-time metrics to add confidence to our Continuous Deployment culture. Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com. http://www.etsy.com/careers

TRANSCRIPT

Metrics-Driven Engineering
Mike Brittain
ENGINEERING DIRECTOR, ETSY
@mikebrittain

PROCESS AND TOOLS
Supporting a culture of Continuous Deployment

How do people use Etsy?
• How many new visitors?
• How many listings created?
• How many registrations?
• How many messages sent?
• How many purchases?
• How many new shops?

How is the application behaving?
• Search indexing?
• How fast are pages generating?
• Async tasks currently in queue?
• Developer API auth and rate limiting?
• Images resized and stored?
• Error and warning rates?

Are the servers and network OK?
• Replication slave lag?
• Memcache hits/misses?
• Available connections?
• Database queries per second?
• Total outgoing bandwidth?
• CPU, Memory, I/O?

Business Metrics

Application Metrics

System Metrics

Visibility EVERYWHERE

Metrics help you identify goals

Metrics help you identify goals... but also tell you when you’ve broken something.

Always Be Shipping

credit: ibailemon (flickr)

1st day: Put yourself on the web site.

2nd day: Complete tax, insurance, and benefits forms.

credit: ktpupp (flickr)

Dev Sandbox Trunk / master Production

You!

Test

7e9a814 -> 63a2bb3

Deploy to Production

50+ Deploys / day

200+ Committers

15 Product teams

8 Infrastructure teams

credit: misswired (flickr)

credit: digidave (flickr)

Peer Review
Code reviews, Architecture reviews, Operability reviews

Automated Tests
Static analysis, Unit tests, Integration tests, Functional tests

May 2013

$102.9 Million in goods sold
1.37 Billion page views

https://www.etsy.com/blog/news/2013/etsy-statistics-may-2013-weather-report/

Failure is not an option

Failure is inevitable

Failure is inevitable and detectable!

Access

Q: Sounds like a lot of work. Who’s going to build all of this?

A: Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...

Q: Sounds like a lot of work. Who’s going to build all of this?

Engineers build the application

OPS

Logging, Graphing, Trending, Alerting

ENG

Metrics are part of every feature (and so are config flags)

Make it DEAD SIMPLE

Ganglia (application, servers, network)

Logster* (application, servers)

Cacti (network, SNMP)

FITB* (network)

* github.com/etsy

Simple, open-source tools

Graphite (application)

Statsd* (application)

Log formats (application, servers)

Nagios (alerting)

Ganglia

• Cluster-oriented
• Huge set of community-contributed recipes
• Custom metrics (gmetric)

Graphite

• Single-instance
• Create new metrics on-the-fly
• Customize via URLs and display functions

http://www.aosabook.org/en/graphite.html

Log Formats

Time, remote address, http method, request uri, referrer, user-agent, response size, response code, execution time, memory consumed, plus custom fields...

• Signed-in/out (user_id vs. “-”)
• display mode (“desktop” vs. “mobile”)
• l10n/i18n (“en-US”)
• etc.

Access Logs

LogFormat "%l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{custom_field}n ..."

apache_note("custom_field", $whatever);

LogFormat "%{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{display_mode}n %{user_id}n %{php_bytes}n %{php_usec}n %D"

web0060 66.249.71.110 - - [11/May/2011:17:08:53 +0000] "GET /listing/12189259/tropical-etched-pair-of-lampwork-glass HTTP/1.1" 200 11034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" desktop - 13399576 505780 554876
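A line in this format can be split back into named fields with a single regular expression. The following is a minimal Python sketch, not Etsy's tooling: the pattern, the field names, and `parse_access_line` are assumptions derived from the sample line above, including the leading server name, which is not part of the LogFormat shown and was presumably prepended when the logs were collected.

```python
import re

# Hypothetical pattern for the sample access-log line; field names are illustrative.
ACCESS_LINE = re.compile(
    r'^(?P<server>\S+) (?P<client_ip>\S+) (?P<ident>\S+) (?P<remote_user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<display_mode>\S+) (?P<user_id>\S+) '
    r'(?P<php_bytes>\d+) (?P<php_usec>\d+) (?P<total_usec>\d+)$'
)


def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ACCESS_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A user_id of "-" marks a signed-out request.
    fields["signed_in"] = fields["user_id"] != "-"
    # Numeric fields become integers so they can be summed or averaged.
    for key in ("status", "php_bytes", "php_usec", "total_usec"):
        fields[key] = int(fields[key])
    return fields
```

With fields parsed this way, page generation time can be averaged separately for, say, desktop and mobile requests, or for signed-in and signed-out visitors.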

Logger::error("User login failed. Reason: $msg for $email_addr", "login");

Method name denotes log “level”: error, fatal, warning, notice, debug.

A “namespace” parameter is provided so we can aggregate log entries with similar concerns.

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for mike@etsy.com

Fields, left to right: server name, date and time, level, namespace, unique request ID.
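Because every entry carries the same bracketed fields, entries can be parsed and grouped by any of them. The sketch below is an illustration in Python under the assumption that all lines follow the layout of the sample above; `parse_log_line` and `count_by` are hypothetical helper names, not part of Etsy's logger.

```python
import re
from collections import Counter

# Assumed layout: server [time] [level] [namespace] [request_id] message
LOG_LINE = re.compile(
    r'^(?P<server>\S+) \[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[(?P<namespace>\w+)\] \[(?P<request_id>\w+)\] (?P<message>.*)$'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by(lines, *keys):
    """Count parsed entries grouped by the given field names."""
    counts = Counter()
    for line in lines:
        fields = parse_log_line(line)
        if fields is not None:
            counts[tuple(fields[key] for key in keys)] += 1
    return counts
```

For example, `count_by(lines, "namespace", "level")` gives the number of errors and warnings per area of the application, and grouping on `"request_id"` collects everything logged during a single request.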

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] Invalid charset convertion
web0102 [Fri Mar 04 16:27:48 2011] [warning] [login] [47dd608551] User login failed. Reason
web0012 [Fri Mar 04 16:27:48 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0081 [Fri Mar 04 16:27:48 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0100 [Fri Mar 04 16:27:49 2011] [fatal] [register] [f9c2b23702] Invalid charset convertion
web0003 [Fri Mar 04 16:27:49 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0050 [Fri Mar 04 16:27:49 2011] [error] [register] [2e468a9bb6] Duplicate user ID encountered
web0054 [Fri Mar 04 16:27:49 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:49 2011] [error] [login] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:49 2011] [error] [login] [47dd608551] Duplicate user ID encountered
web0012 [Fri Mar 04 16:27:49 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:49 2011] [fatal] [login] [mk04gw1p71] Invalid charset convertion
web0012 [Fri Mar 04 16:27:49 2011] [error] [login] [2f297b40a5] User login failed. Reason
web0025 [Fri Mar 04 16:27:49 2011] [warning] [register] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:49 2011] [warning] [register] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [warning] [register] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:50 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [2f297b40a5] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [error] [login] [2e468a9bb6] User login failed. Reason
web0054 [Fri Mar 04 16:27:50 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [47dd608551] Invalid charset convertion
web0012 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:50 2011] [fatal] [login] [mk04gw1p71] Invalid charset convertion
web0012 [Fri Mar 04 16:27:50 2011] [error] [register] [2f297b40a5] Duplicate user ID encountered
web0025 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:50 2011] [warning] [login] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:51 2011] [warning] [login] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:51 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:51 2011] [error] [login] [2f297b40a5] User login failed. Reason

FATALS ERRORS WARNINGS

Logster

github.com/etsy/logster

• Run by cron (e.g. 1m intervals)
• Keeps a cursor on your log file
• Parse and aggregate values however you want
• Output to Ganglia, Graphite, Amazon CloudWatch
• Simple parsers

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for mike@etsy.com

^\S+ \[[^\]]+\] \[(?P<log_level>\w+)\]

1. Pattern match on fields of interest

if fields['log_level'] == "fatal":
    self.fatals += 1
elif fields['log_level'] == "error":
    self.errors += 1
elif fields['log_level'] == "warning":
    self.warnings += 1
...

2. Aggregate values (sum, average, percentile, etc.)

MetricObject("fatals", (self.fatals / self.duration), "per sec")

MetricObject("errors", (self.errors / self.duration), "per sec")

MetricObject("warning", (self.warnings / self.duration), "per sec")

3. Send the values as “metric objects” to the collectors
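The three steps above can be put together in one small class. This is a self-contained Python sketch that does not import Logster: `MetricObject` here is a stand-in namedtuple, and `LogLevelParser` with its `parse_line` and `get_state` methods is only loosely modeled on the shape of the snippets above, so treat every name as an assumption. The pattern is stricter than a run of `.+` groups because a greedy `.+` can stretch across several bracketed fields and capture the wrong one.

```python
import re
from collections import namedtuple

# Stand-in for Logster's metric object; NOT the real class.
MetricObject = namedtuple("MetricObject", ["name", "value", "units"])


class LogLevelParser:
    """Counts fatal/error/warning entries and reports them as per-second rates."""

    # server name, [date and time], then the [level] we care about
    LINE = re.compile(r'^\S+ \[[^\]]+\] \[(?P<log_level>\w+)\]')

    def __init__(self):
        self.counts = {"fatal": 0, "error": 0, "warning": 0}

    def parse_line(self, line):
        # Step 1: pattern match on the field of interest.
        match = self.LINE.match(line)
        if match is None:
            return
        level = match.group("log_level")
        if level in self.counts:
            self.counts[level] += 1

    def get_state(self, duration):
        # Step 2: aggregate counts into rates over the elapsed seconds.
        # Step 3: hand the values back as metric objects for a collector.
        return [
            MetricObject(level + "s", count / float(duration), "per sec")
            for level, count in sorted(self.counts.items())
        ]
```

A caller would feed each new log line to `parse_line` and, once per interval, pass the interval length in seconds to `get_state` and forward the result to Ganglia or Graphite.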

StatsD

github.com/etsy/statsd

• Network daemon (node.js)
• Accepts data over UDP
• Flushes to Graphite every 10 sec
• One line of code

StatsD::increment("logins.success");

Logins

StatsD::timing("profile.time", $msec);

90th pct

average

lower
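Under the hood, a StatsD client call is a single UDP datagram in a short text format: `name:value|c` for a counter and `name:value|ms` for a timer. The Python sketch below illustrates that wire format; the function names are hypothetical, and the default host and port (`127.0.0.1`, 8125) are assumptions about where a StatsD daemon might be listening.

```python
import socket


def format_counter(name, delta=1):
    """Counter increment in StatsD's text format, e.g. 'logins.success:1|c'."""
    return "%s:%d|c" % (name, delta)


def format_timing(name, msec):
    """Timer sample in milliseconds, e.g. 'profile.time:320|ms'."""
    return "%s:%d|ms" % (name, msec)


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram, no reply expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Because UDP needs no connection and no acknowledgement, a slow or absent StatsD daemon does not hold up the request that emitted the metric, which is a large part of why instrumenting a feature can stay a one-liner.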

Ad hoc

name value timestamp

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003

Vertical Line Technology!

target=drawAsInfinite(events.deploy.site)
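The same `name value timestamp` line can be sent from application or deploy code instead of the shell. A minimal Python sketch, assuming a Graphite listener that accepts the plaintext format on TCP port 2003 as in the `nc` command above; `format_datapoint` and `send_datapoint` are hypothetical names.

```python
import socket
import time


def format_datapoint(name, value, timestamp=None):
    """Build one 'name value timestamp' line, newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_datapoint(line, host, port=2003):
    """Open a TCP connection, write the line, and close."""
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```

A deploy script could call `send_datapoint(format_datapoint("events.deploy.site", 1), host)` at the moment code goes out, which is what makes the vertical deploy lines appear on every graph that includes the `drawAsInfinite` target.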

User Logins

PHP Warnings

PHP Fatal Errors

250,000+ metrics at Etsy
Systems, Applications, Business

github.com/etsy/dashboard

Dashboards

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Kind of Hard :-/

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);

Super Easy!
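What such a helper saves is mostly URL assembly: repeated `target` parameters, percent-encoding, and the deploy markers added to every graph. The Python sketch below shows that idea only; it is not the etsy/dashboard code, and `render_url`, `with_deploy_lines`, and the `graphite.example.com` host used in the usage note are all hypothetical.

```python
from urllib.parse import urlencode


def with_deploy_lines(metric, deploy_events):
    """Return the metric followed by one drawAsInfinite() target per deploy event."""
    return [metric] + ["drawAsInfinite(%s)" % event for event in deploy_events]


def render_url(base, targets, params=None):
    """Build a Graphite render URL with one 'target' parameter per series."""
    query = list((params or {}).items())
    query += [("target", target) for target in targets]
    # urlencode percent-encodes parentheses and other reserved characters.
    return base.rstrip("/") + "/render?" + urlencode(query)
```

Calling `render_url("http://graphite.example.com", with_deploy_lines("webs.errorLog.notExist", ["deploys.web.production"]), {"from": "-1hours", "width": 280, "height": 220})` yields a URL of the same shape as the hand-written one above.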

But, you said...

“250,000+ metrics at Etsy”
Systems, Applications, Business

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
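The raw response is easy to consume from a script: a header of `name,start,end,step`, a `|`, then comma-separated values, with `None` where no datapoint exists yet. The Python sketch below parses one such line and applies a simple threshold to the most recent value. The function names and the threshold check are illustrative assumptions, not a description of how Etsy wired this into alerting.

```python
def parse_raw_data(line):
    """Parse 'name,start,end,step|v1,v2,...' into a dict."""
    header, _, body = line.strip().partition("|")
    # Split from the right: a target name may itself contain commas.
    name, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in body.split(",")]
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "step": int(step),
        "values": values,
    }


def latest_value(series):
    """Most recent datapoint that is not None, or None if there is none."""
    for value in reversed(series["values"]):
        if value is not None:
            return value
    return None


def exceeds(series, threshold):
    """True when the latest known value is above the threshold."""
    value = latest_value(series)
    return value is not None and value > threshold
```

A check like `exceeds(parse_raw_data(line), 10)` is the kind of test a monitoring plugin could run against any one of the many metrics without a person watching the graph.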

Holt-Winters Confidence Bands

lower

upper

Holt-Winters Aberration

Business metrics + Confidence bands = Alertable metrics
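The alerting step can be sketched as follows, assuming the lower and upper confidence bands have already been computed elsewhere (for example by Graphite's `holtWintersConfidenceBands` function). The code does not implement Holt-Winters forecasting; it only measures how far each point falls outside the bands, and `aberration` and `should_alert` are hypothetical names.

```python
def aberration(values, lower, upper):
    """Distance of each point outside its band; 0.0 when inside, None when missing."""
    result = []
    for value, low, high in zip(values, lower, upper):
        if value is None:
            result.append(None)
        elif high is not None and value > high:
            result.append(value - high)   # positive: above the upper band
        elif low is not None and value < low:
            result.append(value - low)    # negative: below the lower band
        else:
            result.append(0.0)
    return result


def should_alert(values, lower, upper):
    """True if any known point lies outside its confidence band."""
    return any(
        delta is not None and delta != 0
        for delta in aberration(values, lower, upper)
    )
```

Because the bands follow the metric's usual daily and weekly rhythm, a business metric such as logins or purchases can raise an alert when it drops unexpectedly, without a fixed threshold that would misfire every night.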

Metrics!
• Metrics + Events
• Metrics + Alerts
• Metrics + Metrics

High-level, real-time visibility

Detect problems early, and resolve them quickly.

• Make them accessible
• Make them required features
• Make them dead simple

Merci!

These slides will be available at mikebrittain.com/talks

codeascraft.etsy.com
github.com/etsy

Say “Hello!”
mike@etsy.com

@mikebrittain
