Patchwork Data at Etsy Matt Walker

Post on 20-May-2015

DESCRIPTION

Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power such applications today.

TRANSCRIPT

Page 1: Patchwork Data at Etsy

Patchwork Data at Etsy

Matt Walker

Page 2: Patchwork Data at Etsy
Page 3: Patchwork Data at Etsy
Page 4: Patchwork Data at Etsy

2005 2007 2009 2011 2013

June

Etsy

Page 5: Patchwork Data at Etsy

What happened?

Page 6: Patchwork Data at Etsy

We don’t like to talk about it

Page 7: Patchwork Data at Etsy

Okay, we do

• http://codeascraft.etsy.com

• https://www.etsy.com/codeascraft/talks

• http://kongscreenprinting.com

Page 8: Patchwork Data at Etsy

Catch Phrases

• Continuous deployment

• Blameless postmortems

• Measure everything

• Continuous experimentation

Page 9: Patchwork Data at Etsy

Metrics-Driven Development

• Ganglia

• StatsD/Graphite

• Splunk
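
As a minimal sketch of the "measure everything" idea, the snippet below builds a counter and a timer in the plain-text StatsD line format and provides a helper that sends a line over UDP. The metric names, host, and port are hypothetical and are not taken from the talk.

```python
import socket

def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'checkout.completed:1|c'."""
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send. Host and port are assumed values."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()

# Hypothetical metrics: a counter ("c") and a timer in milliseconds ("ms").
counter_line = format_metric("checkout.completed", 1, "c")
timer_line = format_metric("search.latency", 320, "ms")
```

Calling `send_metric(counter_line)` would emit the counter to a StatsD daemon listening at the assumed address. Because the transport is UDP, instrumented code does not block on the metrics system.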

Page 10: Patchwork Data at Etsy

Scaling a Traditional RDBMS

• Sharded MySQL

• memcached

• Object-relational mapping in PHP

Page 11: Patchwork Data at Etsy

2005 2007 2009 2011 2013

December

Page 12: Patchwork Data at Etsy

Adtuitive

• Online advertising network

• Match forum posts with rich product advertisements

• Unafraid of scaling across Etsy sellers

Page 13: Patchwork Data at Etsy

Adtuitive

• Amazon Web Services

• JRuby

• Rails

Page 14: Patchwork Data at Etsy
Page 15: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig

• Oozie

• Avro

• Zookeeper

http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/

Page 17: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 18: Patchwork Data at Etsy

Powered by MapReduce

• ETL

• Analytics

• A/B testing

• Recommenders

• Search

Page 19: Patchwork Data at Etsy

Applications

• Log ETL

• Database snapshotter

• TasteTest

• Facebook Gift Recommender

• Complementary/similar listings

• Funnel Cake

• Feature Funnel

• A/B Analyzer

• Catapult

• Distributed search indexing

• Fast Game (search index)

• Search autosuggest

• SearchAds

• SCRAM ETL (fraud detection)

Page 20: Patchwork Data at Etsy

Applications

• Log ETL

• Database snapshotter

• TasteTest

• Facebook Gift Recommender

• Complementary/similar listings

• Funnel Cake

• Feature Funnel

• A/B Analyzer

• Catapult

• Distributed search indexing

• Fast Game (search index)

• Search autosuggest

• SearchAds

• SCRAM ETL (fraud detection)

Page 21: Patchwork Data at Etsy

Catapult

• End-to-end success story

• Extremely valuable for a web shop

Page 22: Patchwork Data at Etsy

2005 2007 2009 2011 2013

January

Relevancy Thursdays

Page 23: Patchwork Data at Etsy

Relevancy Thursdays

• Switch default sort order to relevance

• Each Thursday in January

Page 24: Patchwork Data at Etsy

Relevancy Thursdays

• Default search order was recency

• Relisting was our equivalent of advertising

• $0.20 updated your listing’s timestamp

Page 25: Patchwork Data at Etsy

Relevancy Thursdays

• Recency was meant to support “freshness” in search results

• Search originated as PostgreSQL query

• Converted to Solr to scale

Page 26: Patchwork Data at Etsy

What happens if we switch to relevance?

Page 27: Patchwork Data at Etsy

Relevancy Thursdays

• No A/B testing framework

• No event logs

• Limping along with Google Analytics

Page 28: Patchwork Data at Etsy
Page 29: Patchwork Data at Etsy
Page 30: Patchwork Data at Etsy

2005 2007 2009 2011 2013

February

First Log Analysis

Page 31: Patchwork Data at Etsy

First Log Analysis

• Raw web access logs

• URL- and ref tag-based

• Regex parser
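
A regex parser over raw access logs might look roughly like the sketch below. It assumes the Apache common/combined log layout and a `ref` query parameter carrying the ref tag. The sample line, field names, and output shape are illustrative assumptions, not Etsy's actual parser.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: ip, two ignored fields, [timestamp], "METHOD url PROTO", status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return path, ref tag, and status for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    parts = urlsplit(match.group("url"))
    # parse_qs returns lists of values; take the first ref value if present.
    ref = parse_qs(parts.query).get("ref", [None])[0]
    return {"path": parts.path, "ref": ref, "status": int(match.group("status"))}

# Made-up sample line for illustration only.
sample = ('203.0.113.7 - - [10/Feb/2010:13:55:36 -0500] '
          '"GET /listing/123?ref=search_results HTTP/1.1" 200 5120')
event = parse_line(sample)
```

Lines that do not match the assumed layout return `None`, so a job can count and skip them instead of failing on malformed input.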

Page 32: Patchwork Data at Etsy
Page 33: Patchwork Data at Etsy
Page 34: Patchwork Data at Etsy

Heyday of Tooling

• A/B framework

• Front-end event logger

• Database snapshotter

• Barnum and Bailey

• Custom operator library

• Loaders

Page 35: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 36: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume → Akamai

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie → Barnum

• Avro → TupleSerialization

• Zookeeper

Page 37: Patchwork Data at Etsy

A/B Framework

• Ramp-ups + A/B testing

• Feature flag development
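
One common way to combine ramp-ups with A/B testing behind a feature flag is deterministic hashing of a user identifier into buckets. The sketch below illustrates that general technique under stated assumptions. The function names, the SHA-256 choice, and the 100-bucket scheme are hypothetical and do not describe Etsy's framework.

```python
import hashlib

def bucket(user_id, experiment, buckets=100):
    """Deterministically map a user to a bucket in [0, buckets) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def variant(user_id, experiment, percent_enabled):
    """Users in the first percent_enabled buckets see the treatment, the rest see control."""
    if bucket(user_id, experiment) < percent_enabled:
        return "treatment"
    return "control"

# Hypothetical ramp-up: the same flag evaluated at 10% and later at 50%.
early = variant(42, "new_search", 10)
later = variant(42, "new_search", 50)
```

Because a user's bucket never changes, raising `percent_enabled` only adds users to the treatment group, so a ramp-up does not reshuffle people who already saw the new feature.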

Page 38: Patchwork Data at Etsy

Self-service analytics for any A/B test on the site
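
The talk does not say which statistics the A/B Analyzer reported. As one hedged illustration of what self-service analysis of an A/B test can compute, the sketch below compares conversion rates between control and treatment with a two-proportion z-test. All counts are invented.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Assumes both groups are non-empty and the pooled rate is strictly
    between 0 and 1, so the standard error is non-zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented example: 100/1000 conversions in control, 150/1000 in treatment.
z, p_value = two_proportion_z(100, 1000, 150, 1000)
```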

Page 39: Patchwork Data at Etsy

2005 2007 2009 2011 2013

A/B Framework

June

Page 40: Patchwork Data at Etsy

2005 2007 2009 2011 2013

A/B Analyzer

November

Page 41: Patchwork Data at Etsy

Why did it take so long?

• Non-web developers learning the PHP stack

• Failed experiments with “easier to use” MapReduce tools

• Realizing self-service analytics was what Etsy needed

Page 42: Patchwork Data at Etsy
Page 43: Patchwork Data at Etsy
Page 44: Patchwork Data at Etsy

2005 2007 2009 2011 2013

February

Catapult

Page 45: Patchwork Data at Etsy

Catapult

• A/B Analyzer + Launch Calendar

• Full product lifecycle

Page 46: Patchwork Data at Etsy
Page 47: Patchwork Data at Etsy
Page 48: Patchwork Data at Etsy
Page 49: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume → Akamai

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie → Barnum

• Avro → TupleSerialization

• Zookeeper

Page 50: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive → Vertica

• Flume → logrotate

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 51: Patchwork Data at Etsy

Computation Models

• Batch

• Interactive

• Streaming

Page 52: Patchwork Data at Etsy
Page 53: Patchwork Data at Etsy

Batch

Page 54: Patchwork Data at Etsy

Cascading

Page 55: Patchwork Data at Etsy

RDBMS                      Cascading
SQL                        cascading.jruby
Query Planner/Optimizer    Cascading
Execution Engine           MapReduce
Storage                    HDFS

Page 56: Patchwork Data at Etsy

cascading.jruby

Page 57: Patchwork Data at Etsy

cascading.jruby

• Productivity: no compile

• Reuse: factor out structure

• Efficiency: no JRuby runtime

• Optimization: move aggregations map-side
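
The last bullet, moving aggregations map-side, can be illustrated in plain Python. This is a sketch of the general idea (partial aggregation before the shuffle), not of how cascading.jruby or Cascading implement it. The event names and input splits are made up.

```python
from collections import Counter
from itertools import chain

def map_with_partial_aggregation(records):
    """Mapper that pre-sums counts per key, emitting one pair per distinct key."""
    partial = Counter()
    for key in records:
        partial[key] += 1
    return list(partial.items())

def reduce_counts(pairs):
    """Reducer that sums the partial counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical input splits, one per mapper.
splits = [["view", "view", "cart"], ["view", "purchase", "cart"]]
mapped = [map_with_partial_aggregation(split) for split in splits]

# Pairs crossing the shuffle: one per distinct key per split, not one per record.
shuffled_pairs = sum(len(pairs) for pairs in mapped)
totals = reduce_counts(chain.from_iterable(mapped))
```

With six input records, only five pairs reach the reducer here. The saving grows when keys repeat heavily within a split.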

Page 58: Patchwork Data at Etsy

A nice constructor

Page 59: Patchwork Data at Etsy

cascading.jruby

Page 60: Patchwork Data at Etsy

Productivity

• Job templates

• Reloader

• Cascading local mode

• Sampled data

Page 61: Patchwork Data at Etsy

Reuse

Page 62: Patchwork Data at Etsy

Reuse

Page 63: Patchwork Data at Etsy

Field Names

Page 64: Patchwork Data at Etsy
Page 65: Patchwork Data at Etsy

Efficiency

• Just a constructor

• Calls into Cascading API

• No JRuby runtime on cluster

Page 66: Patchwork Data at Etsy

Optimization

Page 67: Patchwork Data at Etsy

Tuple Data Model

Page 68: Patchwork Data at Etsy

UDFs

Page 69: Patchwork Data at Etsy

Scalding

• Distributed collections

• Function literals replace UDFs

Page 70: Patchwork Data at Etsy
Page 71: Patchwork Data at Etsy
Page 72: Patchwork Data at Etsy
Page 73: Patchwork Data at Etsy

Interactive

Page 74: Patchwork Data at Etsy

Vertica

Page 75: Patchwork Data at Etsy

Sharded MySQL

• Borrowed from Flickr

• Works

Page 76: Patchwork Data at Etsy

Thou Shalt Not Join
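
When related rows live on different shards, the database cannot join them, so the join moves into application code. The sketch below shows that pattern with in-memory dictionaries standing in for shards. The modulo shard function, table layout, and data are hypothetical and are not a description of Etsy's sharding scheme.

```python
NUM_SHARDS = 2

# Each list element stands in for one shard: shard index = id % NUM_SHARDS.
user_shards = [
    {2: {"name": "alice"}},
    {1: {"name": "bob"}},
]
listing_shards = [
    {10: {"seller_id": 1, "title": "Quilt"}},
    {11: {"seller_id": 2, "title": "Mug"}},
]

def shard_for(key):
    """Pick a shard index for a numeric id (assumed modulo scheme)."""
    return key % NUM_SHARDS

def get_user(user_id):
    return user_shards[shard_for(user_id)][user_id]

def get_listing(listing_id):
    return listing_shards[shard_for(listing_id)][listing_id]

def listing_with_seller(listing_id):
    """Application-side join: two point lookups, possibly on different shards."""
    listing = get_listing(listing_id)
    seller = get_user(listing["seller_id"])
    return {"title": listing["title"], "seller": seller["name"]}

result = listing_with_seller(10)
```

Each lookup is cheap, but ad hoc questions that span many rows become many round trips, which is where an offline copy with real joins becomes attractive.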

Page 77: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Hive

January

Page 78: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Hive Turned Off

April

Page 79: Patchwork Data at Etsy

Hive

• Slow

• Sensitive

• Operational burden

• Educational burden

Page 80: Patchwork Data at Etsy

Vertica

• Offline copy of shards, master, auxiliary databases

• Joins are easy

• Reasonable latency

Page 81: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Vertica

November

Page 82: Patchwork Data at Etsy

Vertica

• Game changer at Etsy

• High demand for joins

• Rapid prototyping data pipelines

Page 83: Patchwork Data at Etsy
Page 84: Patchwork Data at Etsy

RDBMS                      Cascading
SQL                        cascading.jruby
Query Planner/Optimizer    Cascading
Execution Engine           MapReduce
Storage                    HDFS

Page 85: Patchwork Data at Etsy

Back to MapReduce

• Event logs

• Schedule

• Load data in prod

• Scale

Page 86: Patchwork Data at Etsy

Vertica

• Not Hive, Impala, Shark, etc.

• May change our minds

Page 87: Patchwork Data at Etsy

Streaming

Page 88: Patchwork Data at Etsy

Not Powered by MapReduce

• Activity Feed

• Shop Stats

Page 89: Patchwork Data at Etsy

Etsyweb

• memcached

• Gearman

• Sharded MySQL

Page 90: Patchwork Data at Etsy

Use cases

• Trending

• Fraud detection

• ?

Page 91: Patchwork Data at Etsy
Page 92: Patchwork Data at Etsy

Turns out people don’t make product decisions in real time

http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics

Page 93: Patchwork Data at Etsy

Summing Up

• Be glad you’re living in the future

• Automated tools for the common case

• Don’t be afraid to experiment

Page 94: Patchwork Data at Etsy

Image Credits

• http://kongscreenprinting.com/what-we-do-showcase

• http://animal.discovery.com

• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/

• http://www.flickr.com/photos/bbalaji/2443820505/

• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg

• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/

• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx

• http://www.theculturemap.com/scream-time-edvard-munch-museum/

• http://www.repentamerica.com/webelieve.html

• https://soundcloud.com/tearland/tl-hive

• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head

Page 95: Patchwork Data at Etsy

Contact / Reference

• Matt Walker

• @data_daddy

• http://codeascraft.etsy.com/

• http://www.etsy.com/codeascraft/talks