Patchwork Data at Etsy

Post on 20-May-2015

DESCRIPTION

Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power such applications today.

TRANSCRIPT

Patchwork Data at Etsy

Matt Walker

2005 2007 2009 2011 2013

June

Etsy

What happened?

We don’t like to talk about it

Okay, we do

• http://codeascraft.etsy.com

• https://www.etsy.com/codeascraft/talks

• http://kongscreenprinting.com

Catch Phrases

• Continuous deployment

• Blameless postmortems

• Measure everything

• Continuous experimentation

Metrics-Driven Development

• Ganglia

• StatsD/Graphite

• Splunk

Scaling a Traditional RDBMS

• Sharded MySQL

• memcached

• Object-relational mapping in PHP

2005 2007 2009 2011 2013

December

Adtuitive

• Online advertising network

• Match forum posts with rich product advertisements

• Unafraid of scaling across Etsy sellers

Adtuitive

• Amazon Web Services

• JRuby

• Rails

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig

• Oozie

• Avro

• Zookeeper

http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Powered by MapReduce

• ETL

• Analytics

• A/B testing

• Recommenders

• Search

Applications

• Log ETL

• Database snapshotter

• TasteTest

• Facebook Gift Recommender

• Complementary/similar listings

• Funnel Cake

• Feature Funnel

• A/B Analyzer

• Catapult

• Distributed search indexing

• Fast Game (search index)

• Search autosuggest

• SearchAds

• SCRAM ETL (fraud detection)

Catapult

• End-to-end success story

• Extremely valuable for a web shop

2005 2007 2009 2011 2013

January

Relevancy Thursdays

Relevancy Thursdays

• Switch default sort order to relevance

• Each Thursday in January

Relevancy Thursdays

• Default search order was recency

• Relisting was our equivalent of advertising

• $0.20 updated your listing’s timestamp

Relevancy Thursdays

• Recency was meant to support “freshness” in search results

• Search originated as PostgreSQL query

• Converted to Solr to scale

What happens if we switch to relevance?

Relevancy Thursdays

• No A/B testing framework

• No event logs

• Limping along with Google Analytics

2005 2007 2009 2011 2013

February

First Log Analysis

First Log Analysis

• Raw web access logs

• URL- and ref tag-based

• Regex parser
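A regex parser over raw access logs, keyed on the URL and its ref tag, might look roughly like the sketch below. The talk does not show the actual log layout or parser, so the Apache-style "combined" line format, the `ref` query parameter name, and the sample line (including its date) are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed Apache-style access log layout; the real format is not given in the talk.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a small event dict for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    url = urlsplit(match.group("url"))
    query = parse_qs(url.query)
    return {
        "ip": match.group("ip"),
        "status": int(match.group("status")),
        "path": url.path,
        # Hypothetical "ref" parameter: records which page element sent the visitor here.
        "ref": query.get("ref", [None])[0],
    }

# Made-up sample line, not real Etsy data.
sample = (
    '10.0.0.1 - - [01/Jan/2000:10:15:32 +0000] '
    '"GET /listing/123?ref=search_results HTTP/1.1" 200 5120'
)
event = parse_line(sample)
```

Analysis built this way depends entirely on what the URL and ref tag happen to encode, which is one reason a dedicated front end event logger is attractive.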

Heyday of Tooling

• A/B framework

• Front end event logger

• Database snapshotter

• Barnum and Bailey

• Custom operator library

• Loaders

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume → Akamai

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie → Barnum

• Avro → TupleSerialization

• Zookeeper

A/B Framework

• Ramp-ups + A/B testing

• Feature flag development

Self-service analytics for any A/B test on the site
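Ramp-ups and A/B tests can share one mechanism if each visitor is hashed into a stable bucket and a feature is enabled for buckets below a configured fraction. The sketch below assumes that approach; the talk does not describe Etsy's actual implementation, and the function names, the SHA-256 hash, and the `new_search_ranking` feature name are hypothetical.

```python
import hashlib

def bucket(feature, user_id):
    """Map a (feature, user) pair to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def assign(feature, user_id, enabled_fraction):
    """Return "on" if the user falls inside the ramped-up slice, else "off".

    enabled_fraction = 0.01 is a cautious ramp-up, 0.5 is a 50/50 A/B test,
    and 1.0 is a full launch. Raising the fraction only adds users, so nobody
    who already saw the feature loses it mid-ramp.
    """
    return "on" if bucket(feature, user_id) < enabled_fraction else "off"

# Hypothetical feature name and user ids, for illustration only.
arms = [assign("new_search_ranking", uid, 0.5) for uid in range(1000)]
treated = arms.count("on")
```

Because assignment depends only on the feature name and the user id, an offline analysis job can recompute which arm a visitor was in without any extra lookup table.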

2005 2007 2009 2011 2013

A/B Framework

June

2005 2007 2009 2011 2013

A/B Analyzer

November

Why did it take so long?

• Non-web developers learning the PHP stack

• Failed experiments with “easier to use” MapReduce tools

• Realizing self-service analytics was what Etsy needed

2005 2007 2009 2011 2013

February

Catapult

Catapult

• A/B Analyzer + Launch Calendar

• Full product lifecycle

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive → Vertica

• Flume → logrotate

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Computation Models

• Batch

• Interactive

• Streaming

Batch

Cascading

RDBMS / Cascading

• SQL: cascading.jruby

• Query Planner/Optimizer: Cascading

• Execution Engine: MapReduce

• Storage: HDFS

cascading.jruby

cascading.jruby

• Productivity: no compile

• Reuse: factor out structure

• Efficiency: no JRuby runtime

• Optimization: move aggregations map-side
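Moving aggregations map-side means each map task pre-aggregates its own input before the shuffle, so fewer records cross the network to the reducers. The sketch below illustrates the idea in plain Python with made-up event names; it does not use the Cascading or cascading.jruby APIs, and the function names are hypothetical.

```python
from collections import Counter

def map_naive(records):
    """Emit one (key, 1) pair per input record."""
    return [(key, 1) for key in records]

def map_with_partial_aggregation(records):
    """Pre-sum counts inside the map task, emitting one pair per distinct key."""
    return list(Counter(records).items())

def reduce_counts(pairs):
    """Sum the partial counts for each key, as the reducer would."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Two input splits of made-up event types, one split per map task.
splits = [["view", "view", "cart", "view"], ["cart", "view", "purchase"]]

naive_shuffle = [pair for split in splits for pair in map_naive(split)]
combined_shuffle = [pair for split in splits for pair in map_with_partial_aggregation(split)]
```

Both shuffles reduce to the same totals, but the pre-aggregated one carries 5 pairs instead of 7. The saving grows with the number of repeated keys per split.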

A nice constructor

cascading.jruby

Productivity

• Job templates

• Reloader

• Cascading local mode

• Sampled data

Reuse

Reuse

Field Names

Efficiency

• Just a constructor

• Calls into Cascading API

• No JRuby runtime on cluster

Optimization

Tuple Data Model

UDFs

Scalding

• Distributed collections

• Function literals replace UDFs

Interactive

Vertica

Sharded MySQL

• Borrowed from Flickr

• Works

Thou Shalt Not Join
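One reading of this rule, offered here as an assumption rather than something the slides spell out, is that related rows can live on different shards, so the application has to stitch them together itself instead of asking the database for a join. The sketch below uses in-memory dicts as stand-in shards and modulo placement as a simplification; the table names, ids, and titles are invented.

```python
NUM_SHARDS = 2

# Stand-in shards: each holds its own slice of the "listings" and "favorites" tables.
shards = [{"listings": {}, "favorites": {}} for _ in range(NUM_SHARDS)]

def shard_for(user_id):
    """Pick the shard that owns a user's rows (modulo placement is an assumption)."""
    return shards[user_id % NUM_SHARDS]

def add_listing(seller_id, listing_id, title):
    shard_for(seller_id)["listings"][listing_id] = {"seller_id": seller_id, "title": title}

def add_favorite(user_id, seller_id, listing_id):
    shard_for(user_id)["favorites"].setdefault(user_id, []).append((seller_id, listing_id))

def favorite_titles(user_id):
    """Application-side join: read favorites from the user's shard,
    then fetch each listing from its seller's shard and merge in code."""
    favorites = shard_for(user_id)["favorites"].get(user_id, [])
    return [
        shard_for(seller_id)["listings"][listing_id]["title"]
        for seller_id, listing_id in favorites
    ]

add_listing(2, 100, "Knitted scarf")   # seller 2 lives on shard 0
add_listing(3, 101, "Ceramic mug")     # seller 3 lives on shard 1
add_favorite(1, 2, 100)                # user 1 lives on shard 1
add_favorite(1, 3, 101)
titles = favorite_titles(1)
```

This works well for page rendering, where the access path is known in advance, but it makes ad hoc analytical questions expensive to answer.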

2005 2007 2009 2011 2013

Hive

January

2005 2007 2009 2011 2013

Hive Turned Off

April

Hive

• Slow

• Sensitive

• Operational burden

• Educational burden

Vertica

• Offline copy of shards, master, auxiliary databases

• Joins are easy

• Reasonable latency

2005 2007 2009 2011 2013

Vertica

November

Vertica

• Game changer at Etsy

• High demand for joins

• Rapid prototyping data pipelines

RDBMS / Cascading

• SQL: cascading.jruby

• Query Planner/Optimizer: Cascading

• Execution Engine: MapReduce

• Storage: HDFS

Back to MapReduce

• Event logs

• Schedule

• Load data in prod

• Scale

Vertica

• Not Hive, Impala, Shark, etc.

• May change our minds

Streaming

Not Powered by MapReduce

• Activity Feed

• Shop Stats

Etsyweb

• memcached

• Gearman

• Sharded MySQL

Use Cases

• Trending

• Fraud detection

• ?

Turns out people don’t make product decisions in real time

http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics

Summing Up

• Be glad you’re living in the future

• Automated tools for the common case

• Don’t be afraid to experiment

Image Credits

• http://kongscreenprinting.com/what-we-do-showcase

• http://animal.discovery.com

• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/

• http://www.flickr.com/photos/bbalaji/2443820505/

• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg

• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/

• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx

• http://www.theculturemap.com/scream-time-edvard-munch-museum/

• http://www.repentamerica.com/webelieve.html

• https://soundcloud.com/tearland/tl-hive

• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head

Contact / Reference

• Matt Walker

• @data_daddy

• http://codeascraft.etsy.com/

• http://www.etsy.com/codeascraft/talks
