Metrics-Driven Engineering at Etsy

Post on 08-May-2015

Category:

Technology

TRANSCRIPT

1. Metrics-driven Engineering at Etsy
    Mike Brittain, mike@etsy.com, @mikebrittain
2. Logs, Graphs, Trends, and Correlations
3. Making Decisions
4. How many visitors are using this thing?
5. Can we deploy that to 100% of our visitors?
6. Did we make it faster?
7. Did I just break something?
8. Q. Who makes the graphs?
    A. Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
9. (but...) Engineers build the application.
10. Dev + Ops
11. Access
12. Yes No
13. Engineers are too busy meeting our product deadlines.
14. Here's the big secret...
15. Cacti (network, SNMP)
    Ganglia (machines)
    Graphite (application)
    Splunk (log analysis, nightly reports)
    Nagios (alerting)
16. Logging
17. Logger::log_error("User login failed. Reason: $msg for $username", 'login');
18-22. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
23. Logster
24. Forked from ganglia-logtailer...
    - Daemon mode (only cron mode)
    + Support for Graphite
    + Simplified parsing scripts
25. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
    web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
    web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
    web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
    web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
    web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
    web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
26. Fatals, Errors, Warnings
27. StatsD
28. StatsD::increment("logins.success");
    StatsD::timing("gearman.time", $msec);
29. 90th pct, average, lower
    StatsD::timing("gearman.time", $msec);
30. Ad hoc
    name value timestamp\n
31. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
32. Trends + Events
    target=drawAsInfinite(events.deploy.site)
33. What Happened?
34. 16,000 metrics in Graphite (plus 32,000 metrics in Ganglia)
35. Dashboards
36. Mix & Match Dashboards
37. Hard
38. Easy
    $g = new Graphite($time);
    $g->setTitle('File Not Found');
    $g->addMetric('webs.errorLog.notExist', '#00cc00');
    $g->showDeploys(true);
    echo $g->getDashboardHTML(280, 220);
39. 20 dashboards by 25 engineers
40. Application health correlated with events
41. High-level visibility
42. Low MTTD
43. Validation
44. Confidence
45. codeascraft.etsy.com
    github.com/etsy/statsd
    github.com/etsy/logster
    bitbucket.org/maplebed/ganglia-logtailer
46. Q&A
    Does this sound like fun? Get in touch with us.
    chad@etsy.com kellan@etsy.com kastner@etsy.com mike@etsy.com
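Slides 25 and 26 go from a wall of raw error-log lines to three graphed series (fatals, errors, warnings). A minimal Python sketch of that reduction follows. It is not Logster's actual parser interface; the names `SEVERITY_RE` and `count_severities` are hypothetical, and the regular expression assumes the bracketed layout shown on slide 25.

```python
import re
from collections import Counter

# Assumed layout (from slide 25): host [timestamp] [severity] [client ip] message
SEVERITY_RE = re.compile(r"\] \[(fatal|error|warning)\] ")


def count_severities(lines):
    """Count log lines per severity; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.",
    "web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!",
    "web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.",
    "web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!",
]
counts = count_severities(sample)
```

Run once per interval over the newly appended lines, the three counts become three data points that can be graphed next to each other.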
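The two calls on slide 28 each send a small text datagram to a StatsD daemon over UDP. The sketch below shows roughly what such a client does, using StatsD's `bucket:value|c` counter and `bucket:value|ms` timer formats. The function names are hypothetical, the host is a placeholder, port 8125 is StatsD's default, and sample rates are left out.

```python
import socket


def increment_payload(bucket, count=1):
    """Counter datagram, e.g. 'logins.success:1|c'."""
    return f"{bucket}:{count}|c"


def timing_payload(bucket, msec):
    """Timer datagram in milliseconds, e.g. 'gearman.time:320|ms'."""
    return f"{bucket}:{msec}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions for this sketch."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
```

Because the transport is UDP, the application never waits on the metrics daemon, which is what makes it cheap to instrument code paths liberally.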
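Slide 29 labels three series derived from a single timer: 90th percentile, average, and lower. The sketch below computes those three values for one batch of samples using a nearest-rank percentile. StatsD's own calculation may differ in detail, and the function and key names here are hypothetical.

```python
def summarize_timings(samples_ms, pct=90):
    """Return the lower bound, average, and pct-th percentile of a batch of timings."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: ceil(pct * n / 100), computed with integers.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return {
        "lower": ordered[0],
        "average": sum(ordered) / len(ordered),
        "pct_90": ordered[rank - 1],
    }


summary = summarize_timings([120, 80, 95, 110, 400, 90, 105, 100, 85, 115])
```

In this sample the single 400 ms outlier pulls the average up to 130 ms while the 90th percentile stays at 120 ms, which is one reason to graph both.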
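Slides 30 to 32 describe writing an ad hoc event to Graphite as a `name value timestamp\n` line on port 2003, then overlaying it on a trend with `drawAsInfinite`. The sketch below builds both pieces. `graphite.example.com` is a placeholder host, the helper names are hypothetical, and the `/render` path with repeated `target` parameters is Graphite's render API as I understand it.

```python
import socket
import time
from urllib.parse import urlencode


def plaintext_line(name, value, timestamp=None):
    """Format one metric in the 'name value timestamp\\n' form from slide 30."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {timestamp}\n"


def send_line(line, host, port=2003):
    """Python equivalent of piping the line to nc, as on slide 31."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


def render_url(base, targets, params=None):
    """Build a render URL with one 'target' parameter per series."""
    query = [("target", t) for t in targets] + list((params or {}).items())
    return f"{base}/render?{urlencode(query)}"


line = plaintext_line("events.deploy.site", 1, 1299274068)
url = render_url(
    "http://graphite.example.com",
    ["webs.errorLog.notExist", "drawAsInfinite(events.deploy.site)"],
)
```

Plotting the deploy event as a vertical line on the same graph as the error rate is what lets someone answer "What happened?" at a glance.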