Eventbrite Data Platform Talk for SFDM
DESCRIPTION
Slides for Eventbrite's data platform talk at SF data mining meetup.
TRANSCRIPT
![Page 2: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/2.jpg)
A social event ticketing and discovery platform
![Page 3: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/3.jpg)
$1B total sales
68M tickets sold
1.4M events hosted
0.5M organizers served
23M attendees served
12 countries
![Page 4: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/4.jpg)
Event Lifecycle
![Page 5: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/5.jpg)
Frictionless is the mantra!
![Page 6: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/6.jpg)
Data Platform and Discovery
![Page 7: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/7.jpg)
![Page 8: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/8.jpg)
![Page 9: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/9.jpg)
![Page 10: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/10.jpg)
![Page 11: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/11.jpg)
![Page 12: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/12.jpg)
![Page 13: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/13.jpg)
![Page 14: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/14.jpg)
Analytics
• Ad hoc queries by analysts
![Page 15: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/15.jpg)
Fraud and Spam
![Page 16: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/16.jpg)
Data Platform
![Page 17: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/17.jpg)
![Page 18: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/18.jpg)
Hadoop Cluster
• 30 persistent EC2 High-Memory instances
• 30 TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase
![Page 19: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/19.jpg)
Infrastructure
• Search
  • Solr
  • Incremental updates, moving towards event-driven
• Recommendation/Graph
  • Hadoop
  • Native Java MapReduce
  • Bash for workflow
• Social
  • Cassandra
  • Denormalized view
• Persistence
  • MySQL
  • HDFS
  • HBase
  • MongoDB (moving to Cassandra)
![Page 20: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/20.jpg)
Infrastructure
• Stream
  • RabbitMQ
  • Internal firehose
  • Storm
• Offline
  • MapReduce
  • Streaming
  • Hive
  • Hue
![Page 21: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/21.jpg)
Discovery
Social, Interest, Local
![Page 22: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/22.jpg)
![Page 23: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/23.jpg)
Categorization - Prism
Tech
Music
Conference
Sports
![Page 24: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/24.jpg)
Prism - Features
• Supervised learning
• Logistic regression using MLE
• Pairwise classification into 20 categories
• High precision, lower recall
• Use MapReduce for feature extraction
• Use for clustering as well
![Page 25: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/25.jpg)
Prism – Training Data
• Binary classification for each category
• Training data needed for positive and negative
  • Conference and not Conference
  • Sports and not Sports
• Samasource and CrowdFlower
• Stem words to create initial set
• Positive, negative, negative with stem words
![Page 26: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/26.jpg)
Prism - Features
• Convert Event and Organizer data into a feature vector
  • Event details, Organizer details, Ticket details
• Boolean representation of predefined attributes
  • Words – tf-idf, dictionaries
  • Phrases
  • Domains
  • Rules – regular expressions
  • Functions – business logic, e.g. ticket price between $10 and $20
  • Compounds – boolean combination of features with & and || rules
    – <COMPOUND1>:techcrunch & disrupt & techcrunch.com
    – <COMPOUND2>:COMPOUND2 && after && party
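The feature types listed on this slide can be sketched in Python as below. This is only an illustration, not Prism's actual code: the event dictionary fields (`title`, `description`, `ticket_price`), the regular expression, and the feature names such as `RULE_ORDINAL_ANNUAL` and `PRICE_10_20` are all hypothetical.

```python
import re

def extract_features(event):
    """Return a dict of boolean features for one event (hypothetical schema)."""
    text = (event["title"] + " " + event["description"]).lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    features = {}

    # Words: presence of dictionary terms (a real system might weight by tf-idf).
    for term in ("techcrunch", "disrupt", "after", "party"):
        features[term] = term in words

    # Domains: a domain name mentioned anywhere in the text.
    features["techcrunch.com"] = "techcrunch.com" in text

    # Rules: a regular expression over the text (illustrative pattern).
    features["RULE_ORDINAL_ANNUAL"] = bool(
        re.search(r"\b\d+(st|nd|rd|th) annual\b", text)
    )

    # Functions: business logic, e.g. ticket price between $10 and $20.
    features["PRICE_10_20"] = 10 <= event["ticket_price"] <= 20

    # Compounds: boolean combination of other features.
    features["COMPOUND1"] = (
        features["techcrunch"]
        and features["disrupt"]
        and features["techcrunch.com"]
    )
    return features
```

Calling `extract_features` on an event whose title mentions "TechCrunch Disrupt" and whose description links to techcrunch.com would set `COMPOUND1` to true.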
![Page 27: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/27.jpg)
Prism - Features
• Each feature is represented in various contexts
  • Event Title, Event Description, Organizer Title, Organizer Description
• Each feature has meta info – Termclass
  • <LANG_EN>, <CONF_LANG_EN>, <ADULT_LANG_EN>
  • <SPORTS_LANG_EN>:<EVENT_TITLE>ball
• Feature vector is represented as a sparse vector
+1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1
693|||||warrior spirit's 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
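The first sample line above has the shape `<label> <index>:<value> ...`, which resembles the SVMlight/LIBSVM text format. A minimal sketch of reading such a line into a label and a sparse dictionary follows; the function name `parse_sparse_line` is hypothetical, and treating the leading `+1` as a class label is an assumption based on that resemblance.

```python
def parse_sparse_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])          # int("+1") evaluates to 1
    vector = {}
    for pair in parts[1:]:
        index, value = pair.split(":")
        vector[int(index)] = float(value)   # only non-zero entries are stored
    return label, vector
```

Storing only the non-zero entries keeps each event's vector small even when the full feature set has hundreds of thousands of indices.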
![Page 28: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/28.jpg)
Prism - Training
• Binary classifier
• Multiclass less accurate
• Each event gets classified into 20 categories
• MapReduce for creating sparse matrix
• MapReduce for batch classification
• Distributed cache for feature set and models
• We can use the same sparse matrix for clustering
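One way to picture "a binary classifier per category" is a one-vs-rest set of logistic regression models over sparse vectors. The sketch below is an assumption about the general technique, not Prism's implementation: it fits each model by plain stochastic gradient ascent on the log-likelihood, and the function names, learning rate, and epoch count are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score(weights, bias, vec):
    return bias + sum(weights.get(i, 0.0) * v for i, v in vec.items())

def train_binary(examples, epochs=200, lr=0.5):
    """examples: list of (label in {0, 1}, {feature_index: value})."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for label, vec in examples:
            # label - p is the gradient of the log-likelihood w.r.t. the score.
            error = label - sigmoid(score(weights, bias, vec))
            bias += lr * error
            for i, v in vec.items():
                weights[i] = weights.get(i, 0.0) + lr * error * v
    return weights, bias

def predict(model, vec):
    weights, bias = model
    return sigmoid(score(weights, bias, vec))

def train_one_vs_rest(labeled, categories):
    """labeled: list of (category, vector). Trains one binary model per category."""
    return {
        cat: train_binary([(1 if c == cat else 0, vec) for c, vec in labeled])
        for cat in categories
    }
```

Each event is then scored by every category model independently, so a threshold can be tuned per category to favor precision over recall.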
![Page 29: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/29.jpg)
Attendee
• What are your interests? – Prism
• Who are your friends? – Explicit and Implicit
• What are the interests of your friends? – Prism
• Which of your friends share your interests? – IBG
• Location of users and events
  • Purchase events location
  • Facebook location
  • Our database
  • Other signals – IP, mobile app, etc.
![Page 30: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/30.jpg)
You will like to attend this event
![Page 31: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/31.jpg)
Recommendation Engines
Item Hierarchy (You bought a camera so you need batteries – Amazon)
Collaborative Filtering – User-User Similarity (People who bought a camera also bought batteries – Amazon)
Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface – Netflix)
Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga; PYMK – Facebook, LinkedIn)
Interest Graph Based (Your friends who like rock music like you are attending an Eric Clapton event – Eventbrite)
![Page 32: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/32.jpg)
Why Interest?
Events are Social
Events are Interest
Dense Graph is Irrelevant
Interests are Changing
![Page 33: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/33.jpg)
How do we know your Interest?
• We ask you
• Based on your activity
  • Events Attended
  • Events Browsed (in future)
• Facebook Interests
  • User Interest has to match Event category
  • Static
• Prism
![Page 34: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/34.jpg)
Model Based vs Clustering
Building Social Graph is Clustering Step
Social Graph Recommendation is a Ranking Problem
Item-Item vs User-User
![Page 35: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/35.jpg)
Implicit Social Graph
[Diagram: user nodes U1 to U5 and event nodes E1 to E4]
![Page 36: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/36.jpg)
Mixed Social Graph
[Diagram: user nodes U1 to U5, event nodes E1 to E3, plus FB and LI nodes]
![Page 37: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/37.jpg)
23M * 260 * 260 = 1.5 Trillion Edges
6 Billion edges ranked
Each node is a feature vector representing a User
Each edge is a feature vector representing a Relationship
![Page 38: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/38.jpg)
Feature Generation
• Mixed Features
• A series of MapReduce jobs
• Output on HDFS in flat files; input to subsequent jobs
• Orders = Event Attendees
  • MAP: eid:uid
  • REDUCE: eid:[uid]
• Attendees Social Graph
  • Input: eid:[uid]
  • MAP: uid_i:[uid]
  • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc.
• Upload feature values to HBase
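The two jobs on this slide (orders to event attendees, then event attendees to a per-user neighbor list) can be sketched in plain Python. This is an in-memory stand-in for the MapReduce flow, not the actual Hadoop jobs: the order record fields `eid` and `uid` follow the slide, while the function names are hypothetical.

```python
from collections import defaultdict

def map_orders(orders):
    """MAP: emit (eid, uid) for each order."""
    for order in orders:
        yield order["eid"], order["uid"]

def reduce_attendees(pairs):
    """REDUCE: eid -> set of attending uids."""
    attendees = defaultdict(set)
    for eid, uid in pairs:
        attendees[eid].add(uid)
    return attendees

def map_attendees(attendees):
    """MAP: for each event, emit (uid_i, [all other attendees of that event])."""
    for eid, uids in attendees.items():
        for uid_i in uids:
            yield uid_i, [u for u in uids if u != uid_i]

def reduce_neighbors(pairs):
    """REDUCE: uid -> union of co-attendees across all events."""
    neighbors = defaultdict(set)
    for uid, others in pairs:
        neighbors[uid].update(others)
    return neighbors

def build_social_graph(orders):
    attendees = reduce_attendees(map_orders(orders))
    return reduce_neighbors(map_attendees(attendees))
```

For example, if U1 and U2 attend E1 while U2 and U3 attend E2, `build_social_graph` links U2 to both U1 and U3, but does not link U1 to U3.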
![Page 39: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/39.jpg)
HBase
• Why HBase?
  • To process 6B edges, look up features for each node and each edge
  • At 1,000 lookups/sec: 6B / 1000 / 86400 ≈ 70 days!!
  • At 1M lookups/sec: ≈ 1.7 hrs
  • Processing 1.3 TB of data with MapReduce
• Collect data from multiple MapReduce jobs
• Stores entire social graph
• Features for each node and edge
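The throughput argument on this slide can be checked with a few lines of arithmetic. The sketch assumes that the 1000 and 1M figures are feature lookups per second and that one lookup is needed per edge; the names are illustrative only.

```python
EDGES = 6_000_000_000
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

def processing_seconds(edges, lookups_per_second):
    """Time to touch every edge once at a given lookup rate."""
    return edges / lookups_per_second

# At 1,000 lookups per second: about 69.4 days.
slow_days = processing_seconds(EDGES, 1_000) / SECONDS_PER_DAY

# At 1,000,000 lookups per second: about 1.67 hours.
fast_hours = processing_seconds(EDGES, 1_000_000) / SECONDS_PER_HOUR
```

The thousandfold gap between the two rates is what turns a job measured in months into one measured in hours.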
![Page 40: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/40.jpg)
Data Model
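| Rowkey | U | UU |
|---|---|---|
| uid1 | f1, f2, f3 | uid2:f4, uid2:f5, uid3:f4 |

| rowid | neighbors | events | featureX |
|---|---|---|---|
| 2718282 | 101 | 3 | 0.3678795 |

| rowid | 314159:n | 314159:e | 314159:fx | 161803:n | 161803:e | 161803:fx |
|---|---|---|---|---|---|---|
| 2718282 | 31 | 1 | 0.3183 | 83 | 2 | 0.618 |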
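The layout above (one row per user, a `U` family for the user's own features and a `UU` family whose column names are prefixed with a neighbor's id) can be mimicked with nested dictionaries. This sketch uses plain Python dicts as a stand-in and does not call any HBase client API; the helper names are hypothetical, and the values are the sample values from the slide.

```python
# row key -> column family -> column qualifier -> value
table = {
    "2718282": {
        "U": {"neighbors": 101, "events": 3, "featureX": 0.3678795},
        "UU": {
            "314159:n": 31, "314159:e": 1, "314159:fx": 0.3183,
            "161803:n": 83, "161803:e": 2, "161803:fx": 0.618,
        },
    }
}

def node_features(table, uid):
    """Features of the user itself (family U)."""
    return table[uid]["U"]

def edge_features(table, uid, neighbor_uid):
    """Features of the edge uid -> neighbor_uid (family UU, qualifier prefix)."""
    prefix = neighbor_uid + ":"
    return {
        qualifier[len(prefix):]: value
        for qualifier, value in table[uid]["UU"].items()
        if qualifier.startswith(prefix)
    }
```

With this shape, a single row read returns a user's node features together with the features of every edge leaving that user.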
![Page 41: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/41.jpg)
[Diagram: user nodes U1, U2, U3]
![Page 42: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/42.jpg)
HBase
![Page 43: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/43.jpg)
Hadoop Tips & Tricks
• Joins
  • Distributed cache
  • Hive map-side joins
• Hive
  • Nice set of statistical functions
  • Lots of Hive queries
• HBase
  • Lots of memory
  • WAL
  • LZO
  • Proper configs
  • Avoid hot region servers
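The "Joins" tip relies on a general pattern: when one side of a join is small, ship it to every mapper (for example through the distributed cache) and join in the map phase, so the large side never goes through a shuffle. A minimal in-memory sketch of that idea follows; the function name and the sample record fields are hypothetical.

```python
def map_side_join(large_records, small_table, key):
    """Join each record of the large input with a small in-memory lookup table."""
    for record in large_records:
        match = small_table.get(record[key])
        if match is not None:      # inner join: skip records with no match
            yield {**record, **match}
```

Here `small_table` plays the role of the cached dataset, for instance a mapping from event id to category, while `large_records` would be the order stream read by each mapper.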
![Page 44: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/44.jpg)
Hadoop Tips & Tricks
• Combiners did not work
• Shuffle and Merge
![Page 45: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/45.jpg)
More Innovation
• Rethink everything
• Add social to search
• Add time series features
• Real-time updates using firehose and Storm
• Various sorts of data
![Page 46: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/46.jpg)
Developers! Developers! Developers!
• Interested in scaling, messaging, data, machine learning, mobile, services
• We will continue to push the boundaries of hard problems
![Page 47: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/47.jpg)
Storm at Eventbrite
Tuesday August 21, 2012 at Eventbrite HQ
How we are using Storm for real time processing of our data
Andrew Whang [email protected]
http://www.eventbrite.com/event/4010290888
![Page 48: Eventbrite Data Platform Talk foir SFDM](https://reader033.vdocuments.site/reader033/viewer/2022051611/54b769114a795971038b45e1/html5/thumbnails/48.jpg)
Questions?