Patchwork Data at Etsy

Post on 20-May-2015

Category: Technology

DESCRIPTION

Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us at this talk for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power such applications today.

TRANSCRIPT

<ul><li> 1. Patchwork Data at Etsy, Matt Walker</li></ul> <p> 2. Etsy: June (timeline: 2005, 2007, 2009, 2011, 2013)
3. What happened?
4. We don't like to talk about it
5. Okay, we do: http://codeascraft.etsy.com https://www.etsy.com/codeascraft/talks http://kongscreenprinting.com
6. Catch Phrases: Continuous deployment; Blameless postmortems; Measure everything; Continuous experimentation
7. Metrics-Driven Development: Ganglia; StatsD/Graphite; Splunk
8. Scaling a Traditional RDBMS: Sharded MySQL; memcached; Object-relational mapping in PHP
9. December (timeline: 2005, 2007, 2009, 2011, 2013)
10. Adtuitive: Online advertising network; Match forum post with rich product advertisements; Unafraid of scaling across Etsy sellers
11. Adtuitive: Amazon Web Services; JRuby; Rails
12. LAMP Stack for Big Data: HDFS; Pig; MapReduce; Oozie; HBase; Avro; Hive; Zookeeper; Flume; JDBC/ODBC; Hue (http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/)
13. LAMP Stack for Big Data: HDFS; Pig; MapReduce; Oozie; HBase; Avro; Hive; Zookeeper; Flume; JDBC/ODBC; Hue (http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/)
14. LAMP Stack for Big Data: HDFS; S3; Pig; Cascading; MapReduce (Elastic); Oozie; HBase; Avro; TupleSerialization; Hive; Zookeeper; Flume; JDBC/ODBC; Hue
15. Powered by MapReduce: ETL; Analytics; A/B testing; Recommenders; Search
16. Applications: Log ETL; A/B Analyzer; Database snapshotter; Catapult; TasteTest; Distributed search indexing; Facebook Gift Recommender; Fast Game (search index); Complementary/similar listings; Search autosuggest; Funnel Cake; SearchAds; Feature Funnel; SCRAM ETL (fraud detection)
17. Applications: Log ETL; A/B Analyzer; Database snapshotter; Catapult; TasteTest; Distributed search indexing; Facebook Gift Recommender; Fast Game (search index); Complementary/similar listings; Search autosuggest; Funnel Cake; SearchAds; Feature Funnel; SCRAM ETL (fraud detection)
18. Catapult: End-to-end success story; Extremely valuable for a web shop
19. Relevancy Thursdays: January (timeline: 2005, 2007, 2009, 2011, 2013)
20.
Relevancy Thursdays: Switch default sort order to relevance; Each Thursday in January
21. Relevancy Thursdays: Default search order was recency; Relisting was our equivalent of advertising; $0.20 updated your listing's timestamp
22. Relevancy Thursdays: Recency was meant to support freshness in search results; Search originated as PostgreSQL query; Converted to Solr to scale
23. What happens if we switch to relevance?
24. Relevancy Thursdays: No A/B testing framework; No event logs; Limping along with Google Analytics
25. First Log Analysis: February (timeline: 2005, 2007, 2009, 2011, 2013)
26. First Log Analysis: Raw web access logs; URL- and ref tag-based; Regex parser
27. Heyday of Tooling: A/B framework; Front end event logger; Database snapshotter; Barnum and Bailey; Custom operator library; Loaders
28. LAMP Stack for Big Data: HDFS; S3; Pig; Cascading; MapReduce (Elastic); Oozie; HBase; Avro; TupleSerialization; Hive; Zookeeper; Flume; JDBC/ODBC; Hue
29. LAMP Stack for Big Data: HDFS; S3; Pig; Cascading; MapReduce (Elastic); Oozie; Barnum; HBase; Avro; TupleSerialization; Hive; Zookeeper; Flume; Akamai; JDBC/ODBC; snapshotter/loaders; Hue
30. A/B Framework: Ramp-ups + A/B testing; Feature flag development
31. Self-service analytics for any A/B test on the site
32. A/B Framework: June (timeline: 2005, 2007, 2009, 2011, 2013)
33. A/B Analyzer: November (timeline: 2005, 2007, 2009, 2011, 2013)
34. Why did it take so long? Non-web developers learning the PHP stack; Failed experiments with easier-to-use MapReduce tools; Realizing self-service analytics was what Etsy needed
35. Catapult: February (timeline: 2005, 2007, 2009, 2011, 2013)
36. Catapult: A/B Analyzer + Launch Calendar; Full product lifecycle
37. LAMP Stack for Big Data: HDFS; S3; Pig; Cascading; MapReduce (Elastic); Oozie; Barnum; HBase; Avro; TupleSerialization; Hive; Zookeeper; Flume; Akamai; JDBC/ODBC; snapshotter/loaders; Hue
38. LAMP Stack for Big Data: HDFS; Pig; Cascading; MapReduce; Oozie; HBase; Avro; TupleSerialization; Hive; Vertica; Zookeeper; Flume; logrotate; JDBC/ODBC; snapshotter/loaders; Hue
39. Computation Models: Batch; Interactive; Streaming
40. Batch
41.
Cascading
42. RDBMS / Cascading: SQL / cascading.jruby; Query Planner/Optimizer / Cascading; Execution Engine / MapReduce; Storage / HDFS
43. cascading.jruby
44. cascading.jruby: Productivity: no compile; Reuse: factor out structure; Efficiency: no JRuby runtime; Optimization: move aggregations map-side
45. A nice constructor
46. cascading.jruby
47. Productivity: Job templates; Reloader; Cascading local mode; Sampled data
48. Reuse
49. Reuse
50. Field Names
51. Efficiency: Just a constructor; Calls into Cascading API; No JRuby runtime on cluster
52. Optimization
53. Tuple Data Model
54. UDFs
55. Scalding: Distributed collections; Function literals replace UDFs
56. Interactive
57. Vertica
58. Sharded MySQL: Borrowed from Flickr; Works
59. Thou Shalt Not Join
60. Hive: January (timeline: 2005, 2007, 2009, 2011, 2013)
61. Hive Turned Off: April (timeline: 2005, 2007, 2009, 2011, 2013)
62. Hive: Slow; Sensitive; Operational burden; Educational burden
63. Vertica: Offline copy of shards, master, auxiliary databases; Joins are easy; Reasonable latency
64. Vertica: November (timeline: 2005, 2007, 2009, 2011, 2013)
65. Vertica: Game changer at Etsy; High demand for joins; Rapid prototyping data pipelines
66. RDBMS / Cascading: SQL / cascading.jruby; Query Planner/Optimizer / Cascading; Execution Engine / MapReduce; Storage / HDFS
67. Back to MapReduce: Event logs; Schedule; Load data in prod; Scale
68. Vertica: Not Hive, Impala, Shark, etc.; May change our minds
69. Streaming
70. Not Powered by MapReduce: Activity Feed; Shop Stats
71. Etsyweb; memcached; Gearman; Sharded MySQL
72. Use cases: Trending; Fraud detection; ?
73. Turns out people don't make product decisions in real time (http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics)
74. Summing Up: Be glad you're living in the future; Automated tools for the common case; Don't be afraid to experiment
75.
Image Credits:
http://kongscreenprinting.com/what-we-do-
http://www.globaltimes.cn/showcase SPECIALCOVERAGE/Top10Peopleof2011.aspx
http://animal.discovery.com
http://www.theculturemap.com/scream-time-edvard-munch-museum/
http://www.rallyrace.com/turning-over-the-stone-event-production-basics/
http://www.repentamerica.com/webelieve.html
http://www.flickr.com/photos/bbalaji/
https://soundcloud.com/tearland/tl-hive2443820505/
http://pocketnow.com/2012/08/02/wifi-vs-data-
http://www.madeyoulaugh.com/funny_photos/speed-vs-battery-life/bush-scratching-headcaveman_harley/caveman_harley.jpg
http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
76. Contact / Reference: Matt Walker, @data_daddy, http://codeascraft.etsy.com/ http://www.etsy.com/codeascraft/talks</p>
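
Slide 30 describes the A/B framework as covering both ramp-ups and A/B testing. The talk does not say how visitors are assigned to variants, so the Python below is only a sketch of one common approach, deterministic hash-based bucketing. The function names `bucket` and `assign`, the choice of SHA-256, and the `ramp` parameter are assumptions made for illustration and are not taken from Etsy's framework.

```python
import hashlib


def bucket(visitor_id: str, experiment: str) -> float:
    """Map a visitor to a stable point in [0, 1) for one experiment.

    Hypothetical helper: hashing the experiment name together with the
    visitor id gives each experiment its own independent split.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).digest()
    # The first 4 bytes form an integer in [0, 2**32); dividing by 2**32
    # is exact in floating point and always strictly less than 1.0.
    return int.from_bytes(digest[:4], "big") / 2**32


def assign(visitor_id: str, experiment: str, ramp: float) -> str:
    """Return "treatment" or "control" for a visitor.

    ramp is the fraction of traffic exposed to the new feature
    (assumed semantics): 0.01 is a cautious ramp-up, 0.5 is an even
    A/B split, and 1.0 means the feature is fully launched.
    """
    if not 0.0 <= ramp <= 1.0:
        raise ValueError("ramp must be between 0 and 1")
    return "treatment" if bucket(visitor_id, experiment) < ramp else "control"


if __name__ == "__main__":
    for ramp in (0.01, 0.5, 1.0):
        print(ramp, assign("visitor-42", "relevance-sort", ramp))
```

Because a visitor's bucket never changes, raising `ramp` only adds visitors to the treatment group and never moves anyone back to control. Under these assumptions, that property is what lets a single mechanism serve both a gradual ramp-up and a 50/50 test.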