Using Joyent Manta to Scale Event-based Data Collection and Analysis at Wanelo

Proprietary and Confidential Presenter: Konstantin Gredeskoul CTO, Wanelo.com Based on work of Atasay Gökkaya and other engineers "It's a Unix System! I know this!" Using Manta to Scale Event-based Data Collection and Analysis @kig @kigster

Upload: konstantin-gredeskoul

Post on 15-Jan-2015


DESCRIPTION

Data aggregation and analysis problems become notoriously thorny as traffic scales up: conventional databases break down at scale, and map/reduce frameworks such as Hadoop carry a substantial developer and operational complexity burden. Wanelo, an online community for all the world's shopping that brings together stores, products and 10M users in one social platform, became frustrated that the aggregation and analysis tools it used when data was small (venerable Unix data processing utilities like grep, awk, cut, sed, uniq and sort) couldn't be used once data became large. Upon discovering Manta, a new cloud-based object storage system that stores data and processes it in the same place, Wanelo had a solution that no longer required moving data between storage and compute. Building on Manta, Wanelo has developed a system for data analysis that allows the team to tackle big data analysis using Unix utilities, resulting in a cost-effective and scalable solution. In this talk Konstantin discussed Wanelo's experiences building their system on Manta, including their motivations and the alternatives they considered, which led to a Manta-based implementation of fully-parallelized cohort retention analysis in four lines of shell.

TRANSCRIPT

Page 1: Using Joyent Manta to Scale Event-based Data Collection and Analysis at Wanelo

Proprietary and Confidential

Presenter: Konstantin Gredeskoul, CTO, Wanelo.com

Based on work of Atasay Gökkaya and other engineers

"It's a Unix System! I know this!"

Using Manta to Scale Event-based Data Collection and Analysis

@kig

@kigster

Page 2

■ Wanelo (“Wah-nee-lo”, from Want, Need, Love) is a global platform for all the world’s shopping

Page 3: How Wanelo Works

■ Users find products on online stores

■ They post these products to Wanelo via a JavaScript “bookmarklet”

■ Others discover these products on Wanelo via feed, trending, search, etc.

■ Users then save products they discovered to their own collections

Page 4

Page 5: Wanelo is a Social Network

■ Users can follow other users. Following is bi-directional, like Twitter, and public

■ Besides following other users, you can follow individual stores on Wanelo

■ Result is a personalized shopping feed, much like Twitter’s information feed

■ After seeing a product on Wanelo, users can buy the product on the original site

Page 6

Mobile: iOS + Android 60K ratings

Page 7: Backend Stack & Key Vendors

■ MRI Ruby 2.0 & Rails 3

■ PostgreSQL 9.2, solr, redis, memcached, twemproxy, nginx, haproxy

■ Joyent Cloud, SmartOS: ZFS, ARC cache, raw I/O performance, SMF, Zones, DTrace

■ Joyent Manta: Analytics and Backups

■ Chef, Opscode Enterprise: full server automation, zero manual installs

■ Images: AWS S3 behind Fastly CDN

■ Circonus, NewRelic, statsd, Boundary

Page 8: Final word about Wanelo...

We are slightly obsessed with cat pictures =)

Page 9: Recording User Events: Why?

■ Let’s say a user saves a product

■ Naturally we create a row in our main data store (PostgreSQL)

■ But we also want to record this event to an append-only log table, for future analysis

■ In the ideal world, this append-only table has every user-generated event of interest

Page 10: Hey, What’s the Scale Here?

■ 10M users

■ 7M products saved over 1B times

■ 200K+ stores

■ Backend peaks at 200,000 requests per minute (RPM)

■ Generating between 5M and 20M user events per day
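To put these figures in per-second terms, here is a rough back-of-the-envelope calculation. It assumes events are spread evenly across the day, which real traffic is not, so treat the results as averages rather than peaks. The variable names are only illustrative.

```shell
# Convert the daily and per-minute figures from this slide into per-second rates.
# Shell arithmetic is integer-only, so results are rounded down.
seconds_per_day=$((24 * 60 * 60))

low_events_per_sec=$((5000000 / seconds_per_day))
high_events_per_sec=$((20000000 / seconds_per_day))
peak_requests_per_sec=$((200000 / 60))

echo "average events/sec: ${low_events_per_sec} to ${high_events_per_sec}"
echo "peak requests/sec:  ${peak_requests_per_sec}"
```

Even at the high end, the average event rate is a small fraction of the peak request rate, which is why a cheap, buffered logging path is enough.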

Page 11: Recording Events: Stupidly

■ We are just starting: what’s the simplest thing we can do? Our traffic is still pretty low.

■ Let’s create a database table and append to that. Simple? Yes.

■ Scalable? Hell No.

■ One month after launch, we hit the wall.

Page 12: Let’s Scale Data Collection

■ OK, so inserting 10M records into PostgreSQL per day is pretty stupid. Even I know that.

■ We looked around for various options. There were many. Flume, Fluentd, Scribe. Meh.

■ We chose rsyslog: clients can buffer records, send cheap UDP packets.

■ More than one log collector for redundancy

Page 13: Scaling Event Data Collection

■ rsyslog rocks. We are now sending 20M events per day from 40+ hosts

■ rsyslog is dumping them into an ASCII pipe-delimited file

■ logadm rotates the file daily. We get a 1GB+ file per day of activity

■ We have solved the data collection problem for a long time to come, and very cheaply.
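As a rough illustration of what one pipe-delimited record might look like, here is a minimal sketch. `format_event` is a hypothetical helper, not Wanelo's code, and the field order is taken from the sample log shown in this talk. In production the record would be handed to rsyslog rather than appended to a local file.

```shell
# format_event (hypothetical): print one pipe-delimited event record.
# Arguments: user_id platform action_type object object_id secondary_object sec_obj_id
# The timestamp is the current time in seconds since the epoch.
format_event() {
    printf '%s|%s|%s|%s|%s|%s|%s|%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$(date +%s)"
}

# Append-only: records are only ever added with >>, never rewritten.
log_file=$(mktemp)
format_event 8524264 ipad SaveAction Product 5757428 Collection 29399687 >> "$log_file"
format_event 1285811 ipad SessionAction User 1285811 "" "" >> "$log_file"

cat "$log_file"
```

Empty secondary fields are kept as empty columns, so every record has the same number of fields and stays easy to parse with awk or cut.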

Page 14

Page 15: Now What?

■ So now we have 100s of files, closing in on 500GB of data

■ We want to ask some intelligent questions

■ For example: how many people who signed up four weeks ago are still active? (cohort retention)

■ How many products saved does it take for a user to become engaged?

Page 16: Let’s Dive Deeper

■ Here is an example of our log file (spaces/alignment added for readability)

user_id platform action_type   object  object_id secondary_object sec_obj_id timestamp
---------------------------------------------------------------------------------------
8524264|ipad    |SaveAction   |Product|5757428  |Collection      |29399687  |1368341942
7555287|android |SaveAction   |Product|5758908  |GiftsCollection |26680024  |1368341942
3924118|iphone  |SaveAction   |Product|1979020  |Collection      |29463107  |1368341942
1285811|ipad    |SessionAction|User   |1285811  |                |          |1368341942
8246365|ipod    |SaveAction   |Product|7930662  |Collection      |28523544  |1378895196
1233612|desktop |SessionAction|User   |1233612  |                |          |1378895196
9654098|desktop |PostAction   |Product|7962904  |Store           |158163    |1378895197
9654098|desktop |SaveAction   |Product|7962904  |GiftsCollection |34407722  |1378895197
843456 |iphone  |SessionAction|User   |843456   |                |          |1378895197
9005146|android |SaveAction   |Product|6389593  |GiftsCollection |32117206  |1378895197
6721497|desktop |CommentAction|Product|7930418  |Comment         |37304732  |1378895197
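A short sketch of how this format can be sliced with standard tools. The sample rows are copied from the log excerpt above without the readability padding; the temporary file and variable names are only for illustration.

```shell
# Write a few unpadded sample records to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
8524264|ipad|SaveAction|Product|5757428|Collection|29399687|1368341942
7555287|android|SaveAction|Product|5758908|GiftsCollection|26680024|1368341942
1285811|ipad|SessionAction|User|1285811|||1368341942
9654098|desktop|PostAction|Product|7962904|Store|158163|1378895197
6721497|desktop|CommentAction|Product|7930418|Comment|37304732|1378895197
EOF

# Events per action type (field 3), most frequent first.
cut -d'|' -f3 "$sample" | sort | uniq -c | sort -rn

# Number of SaveAction events, matching on the exact field value.
save_count=$(awk -F'|' '$3 == "SaveAction"' "$sample" | wc -l | tr -d ' ')

# Number of distinct platforms (field 2).
platform_count=$(cut -d'|' -f2 "$sample" | sort -u | wc -l | tr -d ' ')

echo "saves: $save_count, platforms: $platform_count"
```

Matching on a specific field with awk is slightly stricter than a plain grep, because it cannot be fooled by the same string appearing in another column.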

Page 17: Parsing ASCII files is simple

■ What we get with this file format is simplicity

■ grep, sort, uniq, comm, awk, wc

■ These UNIX tools have been optimized for four decades! I challenge you to write a faster grep!

Page 18: Have YOU brushed up on your AWK skillz?

Page 19: Let’s Ask Some Questions

■ How many unique users followed someone or something on iPad on 06/26/2013?

cat user_actions_20130626.log | \
  awk -F'|' '{ if ($2 == "ipad" && $3 == "FollowAction") { print $1 } }' | \
  sort | \
  uniq | \
  wc -l

Page 20: What About Registrations?

■ How many total user registrations happened across all platforms on the same day, 06/26/2013?

cat user_actions_20130626.log | \
  grep -F -e '|RegisterAction|' | \
  wc -l

Page 21: How fast is it really?

■ It takes about 10 seconds to grep through a 1.5GB file (a single day of recorded events)

> time gunzip -c user_actions.log.20130512.gz | \
>     /usr/bin/grep SaveAction | wc -l
......

real    0m  9.584s
user    0m 12.195s
sys     0m  1.672s

Page 22: Can we go back a whole year?

■ On one hand, we know how to do it...

■ The problem is: 10 seconds x 360 files

■ Sounds like a data warehouse! /run query; /come back the next day

■ Now we are talking hours of parsing!
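The single-machine approach can be sketched as a plain loop over the daily files. This is only an illustration: three tiny generated files stand in for a year of logs, and the file names and contents are made up. The estimate at the end uses the figures from this slide (about 10 seconds per file, 360 files).

```shell
# Build three tiny gzipped "daily" files as stand-ins for a year of logs.
dir=$(mktemp -d)
for day in 20130510 20130511 20130512; do
    printf '1|ipad|SaveAction|Product|10|Collection|20|1368341942\n2|iphone|SessionAction|User|2|||1368341942\n' |
        gzip -c > "$dir/user_actions.log.$day.gz"
done

# Sequential scan: one file after another, summing the per-file counts.
total=0
for f in "$dir"/user_actions.log.*.gz; do
    n=$(gunzip -c "$f" | grep -c 'SaveAction')
    total=$((total + n))
done
echo "SaveAction events: $total"

# Estimated wall-clock time for one pass over a year of real files.
estimate_seconds=$((10 * 360))
estimate_minutes=$((estimate_seconds / 60))
echo "estimated: ${estimate_seconds}s (about ${estimate_minutes} minutes) per pass"
```

One pass already takes about an hour, and a question that needs several passes over the data multiplies that, because nothing in the loop runs in parallel.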

Page 23: Map/Reduce

■ Google published this model in 2004

■ It describes a way to parallelize algorithms across huge data sets

Page 24: Map/Reduce

■ Decidedly, Map/Reduce requires a new way of thinking

■ Today we have many related projects, such as Hadoop, HDFS, Spark, Hive, Pig

■ Which means that it also requires learning these (somewhat) new tools

Page 25: On Demand or Permanent?

■ With Hadoop, one practical question is that of infrastructure lifecycle:

■ One can create an “on-demand” Hadoop cluster to run analytics

■ The “on-demand” solution is cheap: once queried, the Hadoop cluster can be killed

■ But this requires copying lots (TBs) of data from storage (typically S3), which takes time

Page 26: Static Hadoop Cluster

■ With a continuously running Hadoop cluster, the biggest issue is cost

■ It’s very expensive to keep a large cluster around, sitting on top of a copy of a giant dataset

Page 27: Enter Joyent’s Manta

■ Distributed Object Store, sort of like S3

■ UNIX-like file system semantics for objects, and supports directories (YES!!!!)

■ Native compute on top of objects!

■ Strongly consistent, rather than eventually consistent

Page 28: Detailed look at Manta later at Surge 2013

Mark Cavage and David Pacheco (Joyent) will discuss building Manta in the “Scaling the Unix Philosophy to Big Data” talk on Friday @ 10am

Page 29: User Events → Joyent Manta

■ Instead of saving daily event logs to NFS, we now push them as objects to Manta

■ One object = one file = one day of events

■ Let’s look at an example...

Page 30: Uploading and Downloading

> mmkdir /wanelo/stor/user_actions

> mput -f user_actions.20130911 \
    /wanelo/stor/user_actions/20130911

> mget \
    /wanelo/stor/user_actions/20130911 > user_actions.20130911

Page 31: Listing Uploaded User Events

> mls /wanelo/stor/user_actions
....
20130909
20130910
20130911
20130912

Page 32: Beyond Object Store

■ What makes Manta unique is native compute on top of our objects

■ We submit a compute job to Manta

■ Manta creates many virtual instances in seconds (or even milliseconds)

■ We even get root access!

■ We parse our event objects in parallel

Page 33: Manta’s “Map/Reduce”

■ Streams objects into initial phase

■ Pipes output of initial phase into the input of the next phase (like UNIX!)

■ Each phase is either one-to-one (map phase), or many-to-one (reduce)

Page 34: Manta’s “Map/Reduce”

[Diagram: three input objects each become a filtered object, and the filtered objects merge into one combined result; the stages are labeled map phase 1, map phase 2, reduce phase]

It’s very familiar, because it’s so similar to piping on a single machine

Page 35: Real Example

■ Let’s ask a more computationally expensive question:

■ How many times was a store followed in the last three months?

Page 36: Aggregating Store Follows

■ Map phase:

grep -F -e '|FollowAction|' | \
  grep -F -e '|Store|' | \
  wc -l

■ Reduce phase (sum up all the numbers):

awk '{ total += $1 } END { print total }'
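To see how the two phases fit together, here is a local, single-machine emulation using the same map and reduce commands. It is a sketch, not a Manta job: the two sample files and their contents are invented, and a sequential loop stands in for what Manta would run in parallel, one map task per stored object. How the job is actually submitted is not shown on the slides.

```shell
# Two made-up daily objects.
dir=$(mktemp -d)
cat > "$dir/20130909" <<'EOF'
1|ipad|FollowAction|Store|100|||1378700000
2|iphone|FollowAction|User|7|||1378700001
3|desktop|SaveAction|Product|55|Collection|9|1378700002
EOF
cat > "$dir/20130910" <<'EOF'
4|android|FollowAction|Store|100|||1378800000
5|ipad|FollowAction|Store|200|||1378800001
EOF

# Map phase: runs once per object and emits a single count for that object.
map_phase() {
    grep -F -e '|FollowAction|' | grep -F -e '|Store|' | wc -l
}

# Reduce phase: receives every map output (one number per line) and sums them.
reduce_phase() {
    awk '{ total += $1 } END { print total }'
}

store_follows=$(
    for object in "$dir"/*; do
        map_phase < "$object"
    done | reduce_phase
)
echo "store follows: $store_follows"
```

Because each map task only emits one small number, the reduce phase has almost no work to do, regardless of how large the input objects are.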

Page 37: Cohort Retention Analysis

■ We can save output of map/reduce jobs in another stored object

■ “Cohort” is a set of unique users sharing a particular property

■ Let’s save a unique set of users who registered between 21 and 28 days ago into a temporary object

Page 38: Cohort Retention Analysis, ctd

■ Map phase runs only on the 7 days of the given week

awk -F'|' '{ if ($3 == "RegisterAction") { print $1 } }'

■ Reduce phase saves the result into a temporary object

sort | \
  uniq | \
  mtee /wanelo/stor/tmp/cohort_user_ids

Page 39: Cohort Retention Analysis, ctd

■ Now we just need to get unique users active this week, and intersect them with the temporary object

■ Map phase runs on the last 7 days

awk -F'|' '{ print $1 }'

■ Reduce phase intersects

sort | \
  uniq > period_uniq_ids && \
  comm -12 period_uniq_ids \
    /assets/wanelo/stor/tmp/cohort_user_ids | \
  wc -l
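The whole cohort calculation can be tried locally with the same commands. This is a sketch under stated assumptions: two small invented files stand in for the registration week and the most recent week, and ordinary local files replace the stored temporary object (written with `mtee` in the real job) and its `/assets` path.

```shell
work=$(mktemp -d)

# Stand-in for the 7 daily objects from the registration week.
cat > "$work/cohort_week.log" <<'EOF'
101|ipad|RegisterAction|User|101|||1376000000
102|iphone|RegisterAction|User|102|||1376000100
103|desktop|RegisterAction|User|103|||1376000200
900|desktop|SaveAction|Product|5|Collection|6|1376000300
EOF

# Stand-in for the 7 daily objects from the most recent week.
cat > "$work/this_week.log" <<'EOF'
101|ipad|SaveAction|Product|7|Collection|8|1378000000
101|ipad|SessionAction|User|101|||1378000100
103|desktop|SessionAction|User|103|||1378000200
900|desktop|SessionAction|User|900|||1378000300
EOF

# Job 1: map extracts registering user ids, reduce de-duplicates and saves them.
awk -F'|' '{ if ($3 == "RegisterAction") { print $1 } }' "$work/cohort_week.log" |
    sort | uniq > "$work/cohort_user_ids"

# Job 2: map extracts every active user id, reduce de-duplicates and intersects.
# comm -12 prints only lines common to both files; both inputs must be sorted.
awk -F'|' '{ print $1 }' "$work/this_week.log" |
    sort | uniq > "$work/period_uniq_ids"
retained=$(comm -12 "$work/period_uniq_ids" "$work/cohort_user_ids" | wc -l | tr -d ' ')

cohort_size=$(wc -l < "$work/cohort_user_ids" | tr -d ' ')
echo "retained $retained of $cohort_size registered users"
```

User 900 is active this week but did not register in the cohort week, so the intersection correctly leaves them out.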

Page 40: Other Uses of Manta @ Wanelo

■ We can migrate user images to Manta instead of S3, and serve them via CDN

■ If we need to create a new image format, we submit a job that uses CLI tools to generate the new format or thumbnail size

■ We can (and do!) push database backups and PostgreSQL archive logs to Manta

Page 41: Conclusion

■ We were able to create a very cost-efficient way to store a massive amount of events

■ Manta allows us to perform complex algebraic queries on our event data, very fast and cheaply

Page 42

And we are just scratching the surface of what’s possible with Manta...

Page 43

Thanks!

apidocs.joyent.com/manta

github.com/wanelo

github.com/wanelo-chef

Wanelo’s technical blog: building.wanelo.com

@kig

@kigster