Patchwork Data at Etsy
Matt Walker
2005 2007 2009 2011 2013
June
Etsy
What happened?
We don’t like to talk about it
Okay, we do
• http://codeascraft.etsy.com
• https://www.etsy.com/codeascraft/talks
• http://kongscreenprinting.com
Catch Phrases
• Continuous deployment
• Blameless postmortems
• Measure everything
• Continuous experimentation
Metrics-Driven Development
• Ganglia
• StatsD/Graphite
• Splunk
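As a rough illustration of what "measure everything" looks like in code, the sketch below formats a metric in the StatsD line protocol (`<bucket>:<value>|<type>`) and sends it over UDP. The metric names, host, and helper function names are hypothetical; only the wire format and the conventional port 8125 come from StatsD itself.

```python
import socket


def statsd_packet(bucket: str, value: int, metric_type: str = "c") -> bytes:
    """Format one metric in the StatsD line protocol: <bucket>:<value>|<type>.

    "c" is a counter and "ms" is a timer in milliseconds.
    """
    return f"{bucket}:{value}|{metric_type}".encode("ascii")


def send_metric(packet: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP, so a slow or absent collector never blocks a request."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


# Hypothetical metric names, for illustration only.
login_counter = statsd_packet("logins.success", 1)
search_timer = statsd_packet("search.query_time", 320, "ms")
```

Calling `send_metric(login_counter)` would emit the counter to a StatsD daemon on the local host, which would then flush aggregates to Graphite.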
Scaling a Traditional RDBMS
• Sharded MySQL
• memcached
• Object-relational mapping in PHP
2005 2007 2009 2011 2013
December
Adtuitive
• Online advertising network
• Match forum posts with rich product advertisements
• Unafraid of scaling across Etsy sellers
Adtuitive
• Amazon Web Services
• JRuby
• Rails
LAMP Stack for Big Data
• HDFS
• MapReduce
• HBase
• Hive
• Flume
• JDBC/ODBC
• Hue
• Pig
• Oozie
• Avro
• Zookeeper
http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/
LAMP Stack for Big Data
• HDFS → S3
• MapReduce (Elastic)
• HBase
• Hive
• Flume
• JDBC/ODBC
• Hue
• Pig → Cascading
• Oozie
• Avro → TupleSerialization
• Zookeeper
Powered by MapReduce
• ETL
• Analytics
• A/B testing
• Recommenders
• Search
Applications
• Log ETL
• Database snapshotter
• TasteTest
• Facebook Gift Recommender
• Complementary/similar listings
• Funnel Cake
• Feature Funnel
• A/B Analyzer
• Catapult
• Distributed search indexing
• Fast Game (search index)
• Search autosuggest
• SearchAds
• SCRAM ETL (fraud detection)
Catapult
• End-to-end success story
• Extremely valuable for a web shop
2005 2007 2009 2011 2013
January
Relevancy Thursdays
Relevancy Thursdays
• Switch default sort order to relevance
• Each Thursday in January
Relevancy Thursdays
• Default search order was recency
• Relisting was our equivalent of advertising
• $0.20 updated your listing’s timestamp
Relevancy Thursdays
• Recency was meant to support “freshness” in search results
• Search originated as a PostgreSQL query
• Converted to Solr to scale
What happens if we switch to relevance?
Relevancy Thursdays
• No A/B testing framework
• No event logs
• Limping along with Google Analytics
2005 2007 2009 2011 2013
February
First Log Analysis
First Log Analysis
• Raw web access logs
• URL- and ref tag-based
• Regex parser
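A minimal sketch of this kind of regex-based log parsing follows. The deck does not show the actual parser, so the log layout (Apache combined format), the sample line, and the assumption that the ref tag travels as a `ref` query parameter are all illustrative guesses.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed layout: Apache "combined" access log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line; the listing URL and ref tag value are made up.
SAMPLE = (
    '203.0.113.7 - - [03/Feb/2011:10:15:32 +0000] '
    '"GET /listing/12345/blue-scarf?ref=sr_gallery_3 HTTP/1.1" 200 5120 '
    '"http://www.example.com/search?q=scarf" "Mozilla/5.0"'
)


def parse_line(line: str):
    """Return the page path, ref tag, and status, or None if the line is malformed."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    parts = urlsplit(match["url"])
    ref = parse_qs(parts.query).get("ref", [None])[0]
    return {"path": parts.path, "ref": ref, "status": int(match["status"])}


parsed = parse_line(SAMPLE)
```

Returning `None` for unparseable lines lets a batch job count and skip bad records instead of failing on them.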
Heyday of Tooling
• A/B framework
• Front end event logger
• Database snapshotter
• Barnum and Bailey
• Custom operator library
• Loaders
LAMP Stack for Big Data
• HDFS → S3
• MapReduce (Elastic)
• HBase
• Hive
• Flume → Akamai
• JDBC/ODBC → snapshotter/loaders
• Hue
• Pig → Cascading
• Oozie → Barnum
• Avro → TupleSerialization
• Zookeeper
A/B Framework
• Ramp-ups + A/B testing
• Feature flag development
Self-service analytics for any A/B test on the site
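One common way to combine ramp-ups with A/B testing is deterministic hashing: each user hashes to a stable bucket, the ramp percentage decides who is enrolled, and a second hash picks the arm. The sketch below assumes that approach; the deck does not describe Etsy's actual implementation, and the function names and experiment name are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str) -> float:
    """Map (salt, user) to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def variant(user_id: str, experiment: str, ramp_percent: float) -> str:
    """Return "off" for users outside the ramp, else "control" or "treatment".

    Enrollment and arm use separate hashes, so raising the ramp only adds
    users; nobody already enrolled switches arms.
    """
    if bucket(user_id, experiment + ":ramp") * 100 >= ramp_percent:
        return "off"
    if bucket(user_id, experiment + ":arm") < 0.5:
        return "treatment"
    return "control"


# Hypothetical experiment name, ramped to 10% of users.
assignment = variant("user42", "search_ranking_v2", 10)
```

Because assignment depends only on the user and experiment name, any server can compute it without shared state, and an offline analysis job can recompute it from event logs.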
2005 2007 2009 2011 2013
A/B Framework
June
2005 2007 2009 2011 2013
A/B Analyzer
November
Why did it take so long?
• Non-web developers learning the PHP stack
• Failed experiments with “easier to use” MapReduce tools
• Realizing self-service analytics was what Etsy needed
2005 2007 2009 2011 2013
February
Catapult
Catapult
• A/B Analyzer + Launch Calendar
• Full product lifecycle
LAMP Stack for Big Data
• HDFS
• MapReduce
• HBase
• Hive → Vertica
• Flume → logrotate
• JDBC/ODBC → snapshotter/loaders
• Hue
• Pig → Cascading
• Oozie
• Avro → TupleSerialization
• Zookeeper
Computation Models
• Batch
• Interactive
• Streaming
Batch
Cascading
RDBMS                     Cascading
SQL                       cascading.jruby
Query Planner/Optimizer   Cascading
Execution Engine          MapReduce
Storage                   HDFS
cascading.jruby
cascading.jruby
• Productivity: no compile
• Reuse: factor out structure
• Efficiency: no JRuby runtime
• Optimization: move aggregations map-side
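To make the "move aggregations map-side" point concrete, the sketch below simulates it in plain Python. It is not cascading.jruby or the Cascading API; the event names and input splits are invented. Each map task pre-aggregates its own split, so fewer records cross the shuffle to the reducers.

```python
from collections import Counter
from typing import Iterable, List


def map_side_counts(split: Iterable[str]) -> Counter:
    """Runs once per input split: pre-aggregate before anything is shuffled."""
    return Counter(split)


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Merge the partial counts coming from every map task."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two hypothetical input splits of event names.
splits: List[List[str]] = [
    ["view", "view", "cart", "view"],
    ["view", "purchase", "cart"],
]

partials = [map_side_counts(split) for split in splits]

# Without map-side aggregation every record is shuffled;
# with it, only one record per distinct key per split is.
shuffled_naive = sum(len(split) for split in splits)
shuffled_combined = sum(len(partial) for partial in partials)
totals = reduce_counts(partials)
```

The saving grows with the number of repeated keys per split, which is why the optimization matters most for counts and sums over high-volume event logs.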
A nice constructor
cascading.jruby
Productivity
• Job templates
• Reloader
• Cascading local mode
• Sampled data
Reuse
Reuse
Field Names
Efficiency
• Just a constructor
• Calls into Cascading API
• No JRuby runtime on cluster
Optimization
Tuple Data Model
UDFs
Scalding
• Distributed collections
• Function literals replace UDFs
Interactive
Vertica
Sharded MySQL
• Borrowed from Flickr
• Works
Thou Shalt Not Join
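When rows live on different shards, a SQL JOIN is not available, so the join moves into application code. The sketch below shows one way that can look, using in-memory dictionaries as stand-ins for shards. The data, the modulo shard routing, and the function names are all hypothetical simplifications, not Etsy's schema or routing scheme.

```python
from collections import defaultdict

NUM_SHARDS = 2

# Hypothetical stand-ins for a shops table split across two shards.
SHOP_SHARDS = [
    {10: {"shop_id": 10, "name": "Wool Works"}},   # shard 0: even shop ids
    {11: {"shop_id": 11, "name": "Kiln and Co"}},  # shard 1: odd shop ids
]

LISTINGS = [
    {"listing_id": 1, "shop_id": 10, "title": "Blue scarf"},
    {"listing_id": 2, "shop_id": 11, "title": "Stoneware mug"},
    {"listing_id": 3, "shop_id": 10, "title": "Red mittens"},
]


def fetch_shops(shop_ids):
    """One batched primary-key lookup per shard instead of a SQL JOIN."""
    ids_by_shard = defaultdict(set)
    for shop_id in shop_ids:
        ids_by_shard[shop_id % NUM_SHARDS].add(shop_id)
    shops = {}
    for shard_no, ids in ids_by_shard.items():
        shard = SHOP_SHARDS[shard_no]
        shops.update({sid: shard[sid] for sid in ids if sid in shard})
    return shops


def listings_with_shop_names(listings):
    """Join listings to shops in application code."""
    shops = fetch_shops({listing["shop_id"] for listing in listings})
    return [
        {**listing, "shop_name": shops[listing["shop_id"]]["name"]}
        for listing in listings
    ]


joined = listings_with_shop_names(LISTINGS)
```

This works well for serving pages but is awkward for ad hoc analysis, which is the gap an offline copy with real joins fills.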
2005 2007 2009 2011 2013
Hive
January
2005 2007 2009 2011 2013
Hive Turned Off
April
Hive
• Slow
• Sensitive
• Operational burden
• Educational burden
Vertica
• Offline copy of shards, master, auxiliary databases
• Joins are easy
• Reasonable latency
2005 2007 2009 2011 2013
Vertica
November
Vertica
• Game changer at Etsy
• High demand for joins
• Rapid prototyping data pipelines
RDBMS                     Cascading
SQL                       cascading.jruby
Query Planner/Optimizer   Cascading
Execution Engine          MapReduce
Storage                   HDFS
Back to MapReduce
• Event logs
• Schedule
• Load data in prod
• Scale
Vertica
• Not Hive, Impala, Shark, etc.
• May change our minds
Streaming
Not Powered by MapReduce
• Activity Feed
• Shop Stats
Etsyweb
• memcached
• Gearman
• Sharded MySQL
Use Cases
• Trending
• Fraud detection
• ?
Turns out people don’t make product decisions in real time
http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics
Summing Up
• Be glad you’re living in the future
• Automated tools for the common case
• Don’t be afraid to experiment
Image Credits
• http://kongscreenprinting.com/what-we-do-showcase
• http://animal.discovery.com
• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/
• http://www.flickr.com/photos/bbalaji/2443820505/
• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg
• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx
• http://www.theculturemap.com/scream-time-edvard-munch-museum/
• http://www.repentamerica.com/webelieve.html
• https://soundcloud.com/tearland/tl-hive
• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head
Contact / Reference
• Matt Walker
• @data_daddy
• http://codeascraft.etsy.com/
• http://www.etsy.com/codeascraft/talks