![Page 1: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/1.jpg)
Big Data Use Cases in the Cloud
Peter Sirota, GM Elastic MapReduce
@petersirota
![Page 2: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/2.jpg)
What is Big Data?
![Page 3: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/3.jpg)
Computer generated data
• Application server logs (web sites, games)
• Sensor data (weather, water, smart grids)
• Images/videos (traffic, security cameras)
![Page 4: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/4.jpg)
Human generated data
• Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
• Blogs/Reviews/Emails/Pictures
• Social graphs: Facebook, LinkedIn, contacts
![Page 5: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/5.jpg)
Big Data is full of valuable, unanswered questions!
![Page 6: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/6.jpg)
Why is Big Data Hard (and Getting Harder)?
![Page 7: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/7.jpg)
Why is Big Data Hard (and Getting Harder)?
Data Volume
• Unconstrained growth
• Current systems don’t scale
![Page 8: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/8.jpg)
Why is Big Data Hard (and Getting Harder)?
Data Structure
• Need to consolidate data from multiple data sources, in multiple formats, across multiple businesses
![Page 9: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/9.jpg)
Why is Big Data Hard (and Getting Harder)?
Changing Data Requirements
• Faster response times on fresher data
• Sampling is not good enough and history is important
• Increasing complexity of analytics
• Users demand inexpensive experimentation
![Page 10: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/10.jpg)
We need tools built specifically for Big Data!
![Page 11: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/11.jpg)
Innovation #1:
Apache Hadoop
• The MapReduce computational paradigm
• Open source, scalable, fault-tolerant, distributed system
Hadoop lowers the cost of developing a distributed system for data processing
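As a rough illustration of the MapReduce paradigm named on this slide, the sketch below counts page hits in application server logs using three phases: map, shuffle, and reduce. It runs in a single Python process and is not Hadoop code. The `user url` log format and all function names are assumptions made for the example. In a real cluster, the framework would distribute the map and reduce calls across nodes and handle the shuffle itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (url, 1) per log line. Assumes a hypothetical 'user url' format."""
    parts = line.split()
    if len(parts) == 2:  # silently skip malformed lines
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    logs = ["u1 /home", "u2 /home", "u1 /cart"]
    print(run_job(logs))
```

Because each map call depends only on its own input line, and each reduce call only on the values for one key, the work can be split across many machines.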
![Page 12: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/12.jpg)
Innovation #2:
Amazon Elastic Compute Cloud (EC2) “provides resizable compute capacity in the cloud.”
Amazon EC2 lowers the cost of operating a distributed system for data processing
![Page 13: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/13.jpg)
Amazon Elastic MapReduce =
Amazon EC2 + Hadoop
![Page 14: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/14.jpg)
Elastic MapReduce applications
• Targeted advertising / Clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / Recommendations
• Data warehousing / BI
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs, video encoding)
• Web indexing
![Page 15: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/15.jpg)
Clickstream Analysis –
Big Box Retailer came to Razorfish
• 3.5 billion records
• 71 million unique cookies
• 1.7 million targeted ads required per day
Problem: Improve Return on Ad Spend (ROAS)
![Page 16: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/16.jpg)
Clickstream Analysis –
Targeted Ad
User recently purchased a sports movie and is searching for video games (1.7 Million per day)
![Page 17: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/17.jpg)
Clickstream Analysis –
Lots of experimentation, but the final design: a 100-node on-demand Elastic MapReduce cluster running Hadoop
![Page 18: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/18.jpg)
Clickstream Analysis –
Processing time dropped from 2+ days to 8 hours (with lots more data)
![Page 19: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/19.jpg)
Clickstream Analysis –
Increased Return On Ad Spend by 500%
![Page 20: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/20.jpg)
World’s largest handmade marketplace
• 8.9 million items
• 1 billion page views per month
• $320MM 2010 GMS
![Page 21: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/21.jpg)
• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes
[Diagram labels: Production DB snapshots; Web event logs; ETL – Step 1; ETL – Step 2; Job (repeated)]
![Page 23: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/23.jpg)
Recommendations
etsy.com/gifts
Gift Ideas for Facebook Friends
![Page 24: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/24.jpg)
Yelp
• Yelp generates close to 400 GB of logs per day
![Page 25: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/25.jpg)
MapReduce at Yelp
• Yelp does not have a physical MapReduce cluster
• Running 250 production clusters per week
• All of those run on Elastic MapReduce
![Page 26: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/26.jpg)
Features driven by MapReduce
![Page 27: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/27.jpg)
Features driven by MapReduce
![Page 28: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/28.jpg)
More MapReduce uses
• Analyze ad stats (reporting, billing, algorithm inputs)
• Analyze A/B test results
• Detect duplicate business listings
• Email bounce processing
• Identify bots based on traffic patterns
![Page 29: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/29.jpg)
9/23/2011 Amazon EMR Strata Justin Moore - @injust
Big Data @ foursquare
![Page 30: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/30.jpg)
How do we use EMR?
• Map-Reduce
  – Run algorithms on our entire dataset
  – Streaming jobs, complex analyses
• Hive
  – Business intelligence
  – Exploratory analyses
  – Infographics!
![Page 31: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/31.jpg)
How big is our data?
• Global reach (North Pole, Space)
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
![Page 32: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/32.jpg)
Our Stack
![Page 33: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/33.jpg)
Computing venue-to-venue similarity
• Spin up 40 node cluster
• Submit Ruby streaming job
  – Invert User x Venue matrix
  – Grab Co-occurrences
  – Compute similarity
• Spin down cluster
• Load data to app server
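The three job steps listed above (invert, co-occurrences, similarity) can be sketched in memory as follows. This is not the Ruby streaming job from the talk. It is a small single-process Python illustration, and several details are assumptions: the input is taken to be `(user, venue)` check-in pairs, "inverting" is read as grouping venues by user, and cosine similarity on binary visit vectors stands in for the similarity measure, which the slide does not specify. The function name `venue_similarity` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def venue_similarity(
    checkins: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Return a similarity score for every venue pair that shares a visitor."""
    # Step 1 (assumed meaning of "invert"): group check-ins so each user
    # maps to the set of venues they visited.
    venues_by_user: Dict[str, Set[str]] = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)

    # Step 2: count distinct visitors per venue and co-occurrences per pair.
    visitors: Dict[str, int] = defaultdict(int)
    co_occurrences: Dict[Tuple[str, str], int] = defaultdict(int)
    for venues in venues_by_user.values():
        for venue in venues:
            visitors[venue] += 1
        # Sorting gives each unordered pair one canonical key.
        for a, b in combinations(sorted(venues), 2):
            co_occurrences[(a, b)] += 1

    # Step 3: cosine similarity on binary visit vectors (an assumed measure).
    return {
        (a, b): shared / math.sqrt(visitors[a] * visitors[b])
        for (a, b), shared in co_occurrences.items()
    }


if __name__ == "__main__":
    sample = [("u1", "cafe"), ("u1", "gym"), ("u2", "cafe"),
              ("u2", "gym"), ("u3", "cafe")]
    print(venue_similarity(sample))
```

In a streaming job, step 1 would typically be one map/reduce pass keyed by user, and steps 2 and 3 a second pass keyed by venue pair. The in-memory dictionaries here stand in for those shuffles.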
![Page 34: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/34.jpg)
Who is checking in?
![Page 35: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/35.jpg)
What are people doing?
![Page 36: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/36.jpg)
Where are our users?
![Page 37: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/37.jpg)
When do people go to a place?
Thursday Friday Saturday Sunday
![Page 38: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/38.jpg)
Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
![Page 39: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/39.jpg)
How can we leverage these insights?
![Page 40: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/40.jpg)
Join us!
foursquare is hiring
www.foursquare.com/jobs
Justin Moore
@injust
![Page 41: Big data use cases in the cloud presentation](https://reader034.vdocuments.site/reader034/viewer/2022051414/55cd0106bb61ebae098b47f1/html5/thumbnails/41.jpg)
http://aws.amazon.com/elasticmapreduce/