hadoop @ foursquare

Post on 24-Oct-2014

Hadoop @ Foursquare
Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake

What is Foursquare?
An app that helps you explore your city and connect with friends
A platform for location-based services and data

What is Foursquare?
People use foursquare to:
share with friends
discover new places
get tips
get deals
earn points and badges
keep track of visits

What is Foursquare?
Mobile, Social, Local

Stats

20,000,000+ people
30,000,000+ places
2,000,000,000+ check-ins
1500+ actions/second

Video: http://vimeo.com/29323612

Overview
Intro to Foursquare
Data Mining Signals from Check-ins
Data Pipeline
Data Infrastructure: Past, Present, Future

Explore
A social recommendation engine built from check-in data

What is a place?
Central Park, JFK

Time signatures for places

Ice cream?
[Figure: temperature (°F) and % of check-ins at ice cream shops, by month in 2011 (New York)]

Check-ins and the weather
Warm weather spots: ice cream shops, lakes, roof decks, boats or ferries, harbors or marinas, sculpture gardens, tracks, basketball courts, parks
Cold weather spots: basketball stadiums, hockey stadiums, art galleries, skating rinks, boutiques, steakhouses, ramen or noodle house

Finding similar items
Critical for our recommendation engine
Large sparse k-nearest neighbor problem
Items can be places, people, brands
Different distance metrics
Need to exploit sparsity, otherwise intractable

Finding similar items
Metrics we find work best for recommending:
Places: cosine similarity, $\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$
Friends: intersection, $\mathrm{sim}(A, B) = |A \cap B|$
Brands: Jaccard similarity, $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
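
The three metrics on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used at Foursquare: the function names are made up here, and the choice to store sparse vectors as {index: value} dicts and to return 0.0 for empty inputs is an assumption.

```python
import math


def cosine_similarity(x_i, x_j):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller vector; only shared indices add to the dot product.
    if len(x_i) > len(x_j):
        x_i, x_j = x_j, x_i
    dot = sum(value * x_j[key] for key, value in x_i.items() if key in x_j)
    norm_i = math.sqrt(sum(value * value for value in x_i.values()))
    norm_j = math.sqrt(sum(value * value for value in x_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_i * norm_j)


def intersection_similarity(a, b):
    """Size of the overlap of two sets, |A ∩ B|."""
    return len(a & b)


def jaccard_similarity(a, b):
    """Overlap divided by union, |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets have similarity 0
    return len(a & b) / len(union)
```

Storing only the non-zero entries is what makes the cosine computation proportional to the number of check-ins rather than to the number of users.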

Computing venue similarity

$X \in \mathbb{R}^{n \times d}$: one row for each of the 30m venues...; each entry is the log(# of checkins at place i by user j)
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$
$K \in \mathbb{R}^{n \times n}$

Computing venue similarity
Naive solution for computing $K$: $O(n^2 d)$
Requires ~4.5m machines to compute in < 24 hours!!! and 3.6PB to store!
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$, $K \in \mathbb{R}^{n \times n}$
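
The storage figure can be checked with a little arithmetic. The sketch below assumes each entry of the dense n x n matrix is a 4-byte float, which the slide does not state; under that assumption 30m venues come out at the quoted 3.6PB. The machine count depends on per-machine throughput that the slide does not give, so it is not reproduced here.

```python
def dense_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to store a dense n x n matrix (4-byte entries are an assumption)."""
    return n * n * bytes_per_entry


n_venues = 30_000_000  # "30m venues" from the slide
petabytes = dense_matrix_bytes(n_venues) / 10**15  # decimal petabytes
print(f"{petabytes:.1f} PB")
```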

Venue similarity w/ map reduce
map: input is keyed by user, with that user's visited venues; emit all pairs of visited venues (vi, vj) with a score for each user
reduce: input is keyed by venue pair (vi, vj); sum up each user's score contribution to this pair of venues to get the final score
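
A minimal in-memory sketch of the map and reduce steps described above. It is a stand-in for a real Hadoop job, and several details are assumptions: the function names, the per-user score being the product of the two venue weights, and the final division by the venue norms to turn the summed scores into cosine similarity.

```python
import math
from collections import defaultdict
from itertools import combinations


def map_user(venue_weights):
    """Map: for one user, emit ((v_i, v_j), score) for every pair of visited venues."""
    for (v_i, w_i), (v_j, w_j) in combinations(sorted(venue_weights.items()), 2):
        yield (v_i, v_j), w_i * w_j


def reduce_pairs(emitted):
    """Reduce: sum each user's score contribution for every venue pair."""
    totals = defaultdict(float)
    for pair, score in emitted:
        totals[pair] += score
    return dict(totals)


def venue_similarity(checkins_by_user):
    """checkins_by_user maps user -> {venue: weight}; returns {(v_i, v_j): cosine}."""
    norms_sq = defaultdict(float)
    emitted = []
    for venue_weights in checkins_by_user.values():
        for venue, weight in venue_weights.items():
            norms_sq[venue] += weight * weight
        emitted.extend(map_user(venue_weights))
    similarities = {}
    for (v_i, v_j), dot in reduce_pairs(emitted).items():
        denominator = math.sqrt(norms_sq[v_i] * norms_sq[v_j])
        if denominator > 0.0:  # skip venues whose weights are all zero
            similarities[(v_i, v_j)] = dot / denominator
    return similarities
```

Only venue pairs that at least one user has visited together are ever emitted, which is how this formulation exploits sparsity instead of filling in all n x n entries.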

Data pipeline - stats

1,500,000,000+ log events / week
2,500,000,000,000+ bytes / week
[Chart: GB / day (compressed) for api collection; May 2011, Nov 2011, May 2012]

Data pipeline - stats
And lots of people are using it!
100+ hive users.
several users with > 100 jobs.
~ 700 MR jobs / day.

Data pipeline - stats

Data pipeline - background
Foursquare's technology stack:
Amazon EC2
MongoDB
Solr / elasticsearch
Scala
Lift web framework
Flume (0.9.x aka old-gen)
Amazon S3

Data pipeline - overview
API / WWW -> Flume Collector -> JSON -> S3 (.../collection-name/dt=2012-06-19/...)
mongodb -> Export Process -> S3
S3 -> Hive, MapReduce

Data pipeline - logs
API / WWW -> Flume Collector -> JSON -> S3 (.../collection-name/dt=2012-06-19/...)
Applications log JSON: some common fields (e.g. event id, timestamp, host).
Data is partitioned by collection and date in S3.
One table per collection in Hive.
Flume for data transport.
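
The partitioning by collection and date can be illustrated with a short sketch. Everything specific in it is an assumption: the JSON field names "collection" and "timestamp", millisecond timestamps, and UTC dates. It builds only the collection-name/dt=YYYY-MM-DD part of the key, since the rest of the path is elided on the slide.

```python
import json
from datetime import datetime, timezone


def partition_key(collection, timestamp_ms):
    """Return the 'collection/dt=YYYY-MM-DD' segment for one event (UTC is assumed)."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{collection}/dt={moment.strftime('%Y-%m-%d')}"


def partition_events(json_lines):
    """Group JSON log lines by partition key; the field names are hypothetical."""
    partitions = {}
    for line in json_lines:
        event = json.loads(line)
        key = partition_key(event["collection"], event["timestamp"])
        partitions.setdefault(key, []).append(event)
    return partitions
```

A dt=... directory per day is what lets a Hive table over the collection read one day of data without scanning the rest.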

Data pipeline - mongodb
Mongo data is nice to work with in MapReduce:
info in logs can be stale.
certain attributes not in logs.
can scan much less data.
mongodb -> Export Process -> S3
MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.

Data pipeline - analytics
Automated reporting: typically a Hive query -> google docs spreadsheet.
Ad hoc reporting: hive dashboard for entering query and receiving an email when results are ready. RoR

Data pipeline - beekeeper

Data pipeline - Summary
Log data and snapshots of mongo data are stored in S3.
Users query/analyze the data using Hive, Pig, and MapReduce.
Compiled data is inserted to mongo or google spreadsheets for reporting.

Data infrastructure - past
Elastic MapReduce, but we were keeping clusters continuously running.
Rudimentary workflow management:
start daily reporting at time X.
Hope that data is there.
difficult to monitor.

Data infrastructure - past
Scaling the number of users was troublesome:
most of the company uses hive.
hive server continuously crashed.
lots of memory issues.
resource contention.
Mongo data converted to delimited records, which doesn't always make sense:
incremental dumps - some data not consistent (e.g. if two venues are merged).
basic schema detection.
single threaded per-collection.

Data infrastructure - past
Hive and EMR flows supported for automated reporting.
lots of mapreduce tools written in ruby:
everything else is scala
want to use common utilities
installing gems on system is brittle

Data infrastructure - present
Introduced a lot of new systems:
Cloudera's Hadoop Distro - CDH3u3
Oozie for workflow / data management
Pig for reporting
Scaled back ruby / hive dashboard
BSON mongo dumps
Scala MapReduce: Scoobi

Data infrastructure - CDH3u3
12-node cluster in EC2 on cc2.8xlarge instances
data is in S3
fair scheduler (jobs run as submitting user)
performance improvements:
skew means slowest reducer often defines wall-clock time
signs of virtualization
cpu bound (data compression)

Data infrastructure - oozie
Pros:
better monitoring (though not perfect).
coordinators for dataset management are great.
oozie distributes job submission via map tasks.
SPoF but recovers after a restart (state stored in DB).
Cons:
deployment is not ideal, it's difficult to version workflows.
configuration via XML - lots of boilerplate.
we have a scaffolding script to bootstrap a workflow.

Oozie coordinator (the good)
Workflow F depends on Dataset A.
The coordinator checks the dataset instance in S3 / HDFS: does the data exist yet?
Yes? Kick off the workflow!
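
The coordinator's behavior on this slide amounts to "check whether the dataset instance exists, and start the workflow once it does". The sketch below shows that logic in plain Python; it is not Oozie's API, and the function name, the callbacks, and the bounded number of checks are all assumptions made for illustration.

```python
def wait_and_kickoff(dataset_instance, exists, kickoff, max_checks, wait=lambda: None):
    """Check a dataset instance up to max_checks times; start the workflow when it exists.

    exists(path) -> bool and kickoff(path) are hypothetical callbacks supplied by
    the caller; wait() is called between unsuccessful checks.
    """
    for _ in range(max_checks):
        if exists(dataset_instance):
            kickoff(dataset_instance)
            return True
        wait()
    return False  # the data never showed up within the allowed checks
```

Compared with starting daily reporting at a fixed time and hoping the data is there, the workflow here only runs once its input is known to exist.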

Oozie XML (the bad)
Hello World in Oozie: just invoking HelloWorld#main

Pig for reporting
Converted some ruby streaming to Pig + Scala UDFs.
More natural than Hive for some reports, especially those that output to multiple locations.
Elephant-Bird (twitter), Piggybank (apache), Data-fu (LinkedIn) all great UDF resources.

Ad hoc reporting dashboard
Uses hive thrift server to validate syntax (via EXPLAIN).
Submits jobs as Oozie workflows. The query is a parameter to the workflow.
Queries run as the users that submit them.
Beekeeper receives $QUERY, then:
1. EXPLAIN $QUERY (Thrift Server)
2. submit workflow, query=$QUERY (Oozie REST)
3. (repeatedly) is workflow done? (Oozie REST)
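
The three numbered steps can be sketched as one function. This is not Beekeeper's code and does not call the real Hive Thrift or Oozie REST APIs: explain, submit_workflow and is_done are hypothetical callbacks, and the convention that explain returns None for a valid query or an error message otherwise is an assumption.

```python
def run_query(query, explain, submit_workflow, is_done, max_polls=100, wait=lambda: None):
    """Validate a query, submit it as a workflow, and poll until it finishes."""
    # 1. Validate syntax by asking the Hive server to EXPLAIN the query.
    error = explain("EXPLAIN " + query)
    if error is not None:
        return ("invalid", error)
    # 2. Submit a workflow with the query passed in as a parameter.
    job_id = submit_workflow({"query": query})
    # 3. Repeatedly ask whether the workflow is done.
    for _ in range(max_polls):
        if is_done(job_id):
            return ("done", job_id)
        wait()
    return ("timeout", job_id)
```

Validating with EXPLAIN first means a syntax error is reported immediately instead of after a workflow has been scheduled.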

Hive dashboard - error

BSON data dumps
Full loads each day, parallelized.
mongodb's native format is BSON:
Binary JSON
some extensions to JSON
schema-less
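
To make "Binary JSON" concrete, here is a minimal hand-written encoder for a tiny subset of BSON (UTF-8 strings, 32-bit integers, nested documents), following the public BSON specification. It is only a sketch for illustration; real pipelines would use an existing BSON library, and the function names are made up here.

```python
import struct


def encode_cstring(text):
    """Encode an element name as a NUL-terminated UTF-8 string."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("element names may not contain NUL bytes")
    return data + b"\x00"


def encode_element(name, value):
    """Encode one key/value pair; only str, int32 and dict values are supported."""
    if isinstance(value, bool):
        raise TypeError("booleans are not supported in this sketch")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # Type 0x02: int32 byte length (including the trailing NUL), then the bytes.
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Type 0x10: little-endian int32; struct.pack raises if it does not fit.
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    if isinstance(value, dict):
        # Type 0x03: an embedded document.
        return b"\x03" + encode_cstring(name) + encode_document(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def encode_document(doc):
    """Encode a dict as: int32 total length, the elements, then a NUL terminator."""
    body = b"".join(encode_element(key, value) for key, value in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body
```

Every document carries its own field names and type tags, which is why the format is schema-less, and the length prefixes are what make it possible to skip through a dump record by record.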

BSON data infrastructure
Mongo stores BSON on EBS. Periodic EBS snapshots.
Oozie workflow to mount snapshot, split data, compress, upload to S3.
BSON InputFormat converts to BSONObject.
Scala codegen converts to Thrift object.
InputFormat for Thrift objects to use in MR.
Hive SerDe and Scoobi inputs.
[Diagram labels, top to bottom: Recordv2 SerDe / Scoobi Input; ThriftBsonInputFormat; Thrift (scala codegen); BSONObject; BSON; Split / LZO compress; BSON; EBS Snapshots]

Scooby

Not that Scooby!

Scoobi
A strongly-typed data flow language written in Scala.
Much easier than writing MapReduce, but still very flexible.
https://github.com/nicta/scoobi

Scoobi Example
Counting checkins
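
The code from this slide did not survive in the transcript. As a rough stand-in, the sketch below counts checkins in Python using the same map, group-by-key, combine shape that data flow libraries like Scoobi express. It is not Scoobi code, and counting per venue (via a hypothetical "venue_id" field) is an assumption.

```python
from collections import defaultdict


def count_checkins(checkins):
    """Count checkins per venue with explicit map, group and combine steps."""
    # map: turn every checkin into a (key, 1) pair
    pairs = ((checkin["venue_id"], 1) for checkin in checkins)
    # group by key: collect the ones emitted for each venue
    grouped = defaultdict(list)
    for venue_id, one in pairs:
        grouped[venue_id].append(one)
    # combine: sum the ones for each venue
    return {venue_id: sum(ones) for venue_id, ones in grouped.items()}
```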

Data infrastructure - Data Joins
Joins in MapReduce are cumbersome. Do them once!

Data infrastructure - Data Joins
[Diagram: Venue, Checkins, Tips and Likes are fed into a Data Join, which outputs each Venue together with its Checkins, Tips and Likes]
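
The "do them once" join in the diagram can be sketched as a single grouping pass that attaches every checkin, tip and like to its venue. This in-memory Python version only illustrates the shape of the output; the field names "id" and "venue_id" are hypothetical, and dropping records whose venue is unknown is an assumption.

```python
def join_by_venue(venues, checkins, tips, likes):
    """Return {venue id: {"venue": ..., "checkins": [...], "tips": [...], "likes": [...]}}."""
    joined = {
        venue["id"]: {"venue": venue, "checkins": [], "tips": [], "likes": []}
        for venue in venues
    }
    for name, records in (("checkins", checkins), ("tips", tips), ("likes", likes)):
        for record in records:
            row = joined.get(record["venue_id"])
            if row is not None:  # assumption: skip records for unknown venues
                row[name].append(record)
    return joined
```

Downstream jobs can then read the pre-joined records instead of each repeating the same reduce-side join.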

Future Work
HCatalog: makes Hive tables (including input formats and serdes) available to Pig and MapReduce. Add support for Scoobi.
Indexing / Hive-indexing
Relational / MPP database for analytics dashboarding
Key-value store for easily serving hadoop data in prod.
Replacing Flume 0.9.4

Open Source
Let us know what you might find useful

Join us!
foursquare is hiring! 115+ people and growing
foursquare.com/jobs
Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake