Building and Improving Products with Hadoop
DESCRIPTION
The terms "big data" and "Hadoop" are often reserved for conversations about business analytics. I posit instead that these technologies are most powerful when used to build new products and improve existing ones. Measurement is a fundamental part of that process, but more importantly I will walk through an effective toolchain that can be used to: a) build unique new products based on data, and b) test improvements to a product. At Foursquare, we've used a Hadoop-based toolchain to build new products (like social recommendations) and to improve existing features through initiatives such as experimentation and offline data generation. These products and improvements are fundamental to our core business, yet they would not be possible without Hadoop. I will pull examples from Foursquare and other companies to demonstrate these points, and outline the infrastructure components needed to accomplish them.
TRANSCRIPT
![Page 1: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/1.jpg)
2013
Building and
Improving Products
with Hadoop
Matthew Rathbone
![Page 2: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/2.jpg)
What is Foursquare
Foursquare helps you explore the world around you.
Meet up with friends, discover new places, and save money using your phone.
4 billion check-ins
35 million users
50 million points of interest (POI)
150 employees
1 TB+ of data per day
![Page 3: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/3.jpg)
FIRST, A STORY
http://www.flickr.com/photos/shannonpatrick17
![Page 4: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/4.jpg)
The Right Tool for the Job
• Nginx – Serving static files
• Perl – Regular expressions
• XML – Frustrating people
• Hadoop (MapReduce) – Counting
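To make "counting" concrete, here is a minimal sketch of a count written in the map/reduce shape, run locally in plain Python rather than on a cluster. The check-in records, the function names, and `run_local` are all hypothetical and only stand in for what a real framework would do.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical check-in records: (user_id, venue_id). Not real data.
CHECKINS: List[Tuple[str, str]] = [
    ("u1", "coffee-shop"),
    ("u2", "coffee-shop"),
    ("u1", "park"),
    ("u3", "coffee-shop"),
]


def map_checkin(record: Tuple[str, str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (venue_id, 1) for every check-in."""
    _user_id, venue_id = record
    yield venue_id, 1


def reduce_counts(venue_id: str, ones: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the 1s emitted for a single venue."""
    return venue_id, sum(ones)


def run_local(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Stand-in for the framework: map, group by key (the shuffle), reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_checkin(record):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


venue_counts = run_local(CHECKINS)
```

The map and reduce functions know nothing about how records are distributed, which is what lets a framework spread the same logic over many machines.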
![Page 5: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/5.jpg)
COUNTING – WHAT IS IT GOOD FOR
http://www.flickr.com/photos/blaahhi/
![Page 6: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/6.jpg)
![Page 7: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/7.jpg)
![Page 8: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/8.jpg)
![Page 9: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/9.jpg)
![Page 10: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/10.jpg)
![Page 11: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/11.jpg)
Statistically Improbable Phrases
![Page 12: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/12.jpg)
SIPS use cases
• menu extraction
• sentiment analysis
• venue ratings
• specific recommendations
• search indexing
• pricing data
• facility information
![Page 13: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/13.jpg)
How is SIPS built?
Basically lots of counting.
![Page 14: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/14.jpg)
SIPS
• Tokenize data with a language model (into n-grams)
  • built using tips, shouts, menu items, likes, etc.
• Apply a TF-IDF algorithm (term frequency, inverse document frequency)
  • Global phrase count
  • Local phrase count (in a venue)
• Some filtering and ranking
• Re-compute & deploy nightly
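The steps above can be sketched as a small TF-IDF-style scorer. This is an illustration under assumptions, not the production pipeline: it uses the textbook definition (a phrase's frequency within a venue times the log of inverse venue frequency), and the venues, phrases, and function names are made up.

```python
import math
from collections import Counter

# Hypothetical phrases per venue, assumed to be already tokenized into n-grams.
VENUE_PHRASES = {
    "pizzeria": ["thin crust", "great service", "thin crust", "wood oven"],
    "diner": ["great service", "bottomless coffee", "great service"],
    "cafe": ["great service", "flat white"],
}


def score_phrases(venue_phrases):
    """Score each phrase per venue: local frequency times inverse venue frequency."""
    num_venues = len(venue_phrases)
    # Global count: in how many venues does each phrase appear at all?
    venues_with_phrase = Counter()
    for phrases in venue_phrases.values():
        venues_with_phrase.update(set(phrases))

    scores = {}
    for venue, phrases in venue_phrases.items():
        local_counts = Counter(phrases)  # local count: occurrences within this venue
        total = len(phrases)
        scores[venue] = {
            phrase: (count / total) * math.log(num_venues / venues_with_phrase[phrase])
            for phrase, count in local_counts.items()
        }
    return scores


def top_phrases(scores, venue, k=2):
    """Rank a venue's phrases by score, highest first."""
    ranked = sorted(scores[venue].items(), key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]


SCORES = score_phrases(VENUE_PHRASES)
```

A phrase that shows up at every venue ("great service" here) scores zero, while a phrase concentrated in one venue rises to the top. Both counts are plain aggregations, which is why they map well onto MapReduce.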
![Page 15: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/15.jpg)
WHY USE HADOOP?
http://www.flickr.com/photos/dbrekke/
![Page 16: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/16.jpg)
SIPS – Without Hadoop
Potential Problems
• Database Query Throttling
• Venues are out of sync
• Altering the algorithm could take forever to populate for all venues
• Where would you store the results?
• What about debug data?
• Does it scale to 10x, 100x?
• What about other, similar workflows?
![Page 17: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/17.jpg)
SIPS – Hadoop Benefits
• Quick Deployment
• Modular & Reusable
• Arbitrarily complex combination of many datasets
• Every step of the workflow creates value
![Page 18: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/18.jpg)
Apple Store - Downtown San Francisco
1 tip mentions "haircuts"
Search for "haircuts" in "san francisco": Apple Store???
Fixed by looking at % of tips and overall frequency
“Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac”
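A rough sketch of that kind of fix: only treat a phrase as relevant to a venue when enough of the venue's tips mention it. The thresholds, the sample tips, and the function name are invented for illustration, and the real filter may combine the signals differently.

```python
def phrase_is_relevant(tips, phrase, min_share=0.05, min_mentions=3):
    """Keep a phrase only if enough of the venue's tips mention it.

    min_share and min_mentions are made-up thresholds for illustration.
    """
    mentions = sum(1 for tip in tips if phrase in tip.lower())
    share = mentions / len(tips) if tips else 0.0
    return mentions >= min_mentions and share >= min_share


# Hypothetical tips: one stray mention at a busy venue versus a venue about haircuts.
apple_store_tips = ["less shiny pizzazz and fancy haircuts"] + ["great genius bar"] * 199
barber_tips = ["best haircuts in town", "quick haircuts", "haircuts are cheap", "friendly staff"]

apple_matches = phrase_is_relevant(apple_store_tips, "haircuts")
barber_matches = phrase_is_relevant(barber_tips, "haircuts")
```

With these sample inputs a single mention among 200 tips is filtered out, while a phrase mentioned in most of a venue's tips is kept.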
![Page 19: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/19.jpg)
Data & Modularity
![Page 20: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/20.jpg)
![Page 21: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/21.jpg)
![Page 22: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/22.jpg)
![Page 23: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/23.jpg)
ACTUALLY, IT’S A BIT MORE COMPLICATED
http://www.flickr.com/photos/bfishadow
![Page 24: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/24.jpg)
These benefits require infrastructure
![Page 25: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/25.jpg)
Dependency Management
Many options
• Oozie (Apache)
• Azkaban (LinkedIn)
• Luigi (Spotify, we <3 this)
• Hamake (Codeminders)
• Chronos (Airbnb)
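What these tools have in common is running jobs in dependency order. The sketch below shows only that core idea in plain Python; it does not use the API of any tool listed above, the task names are hypothetical, and it leaves out scheduling, retries, and cycle detection.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set


class Task:
    """A named unit of work plus the names of the tasks it depends on."""

    def __init__(self, name: str, action: Callable[[], None], requires: Sequence[str] = ()):
        self.name = name
        self.action = action
        self.requires = list(requires)


def run_with_dependencies(
    tasks: Dict[str, Task],
    target: str,
    done: Optional[Set[str]] = None,
    log: Optional[List[str]] = None,
) -> List[str]:
    """Run `target` after everything it requires, running each task at most once."""
    done = set() if done is None else done
    log = [] if log is None else log
    if target in done:
        return log
    for upstream in tasks[target].requires:
        run_with_dependencies(tasks, upstream, done, log)
    tasks[target].action()
    done.add(target)
    log.append(target)  # record the order in which tasks actually ran
    return log


# Hypothetical workflow loosely shaped like the SIPS steps; the actions do nothing here.
TASKS = {
    "ingest_tips": Task("ingest_tips", lambda: None),
    "tokenize": Task("tokenize", lambda: None, ["ingest_tips"]),
    "global_counts": Task("global_counts", lambda: None, ["tokenize"]),
    "local_counts": Task("local_counts", lambda: None, ["tokenize"]),
    "score": Task("score", lambda: None, ["global_counts", "local_counts"]),
}

ORDER = run_with_dependencies(TASKS, "score")
```

Because `tokenize` is shared by both count tasks, it runs once and both consumers reuse it, which is the modularity argument made earlier in the talk.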
![Page 26: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/26.jpg)
![Page 27: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/27.jpg)
Database / Log Ingestion
• Sqoop
• Mongo-Hadoop
• Kafka
• Flume
• Scribe
• etc
![Page 28: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/28.jpg)
![Page 29: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/29.jpg)
MapReduce Friendly Datastore
A few obvious ones:
• HBase
• Cassandra
• Voldemort
We built our own; it’s very similar to Voldemort and uses the HFile API.
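As a loose illustration of what "MapReduce friendly" can mean, the sketch below serves a batch of key-sorted records as an immutable lookup table. It is not the store described above and does not use the HFile format or API; the class name and sample keys are hypothetical.

```python
import bisect


class SortedReadOnlyStore:
    """Immutable key-value lookup over a batch of (key, value) pairs sorted by key."""

    def __init__(self, pairs):
        # A batch job would typically emit keys already sorted; sort here to be safe.
        ordered = sorted(pairs, key=lambda kv: kv[0])
        self._keys = [key for key, _ in ordered]
        self._values = [value for _, value in ordered]

    def get(self, key, default=None):
        """Binary-search the sorted keys and return the matching value, if any."""
        index = bisect.bisect_left(self._keys, key)
        if index < len(self._keys) and self._keys[index] == key:
            return self._values[index]
        return default


# Hypothetical nightly output: top phrases per venue.
store = SortedReadOnlyStore([
    ("venue:2", ["flat white"]),
    ("venue:1", ["thin crust", "wood oven"]),
])
```

Since the data never changes after it is built, a nightly job can produce a fresh batch and the serving side only has to swap the old one for the new one.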
![Page 30: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/30.jpg)
![Page 31: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/31.jpg)
Getting started without all that stuff
![Page 32: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/32.jpg)
Components you likely don’t have
![Page 33: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/33.jpg)
The best way to start
Don’t use Hadoop.
*but pretend you do
![Page 34: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/34.jpg)
Other reasons to not use Hadoop
• Your idea might not be very good
• Hadoop will slow you down to start with
• You don’t have enough infrastructure yet
  • build it when you need it
• V1 might not be that complex
• V1 could be a spreadsheet
![Page 35: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/35.jpg)
![Page 36: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/36.jpg)
![Page 37: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/37.jpg)
SIPS
Version 1
• Off-the-shelf language model
• A subset of Venues & Tips
• Did not use MapReduce
• Did not push to production at all
![Page 38: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/38.jpg)
SIPS
Version 2
• Started building our own language model
• Rewritten as a MapReduce job
• Manually loaded data to production
• Filters for English data only
Tweak, improve, etc.
![Page 39: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/39.jpg)
SIPS
Version 3
• Incorporated more data sources into our language model
• Automatic deployment to KV store
• Incorporated lots of debug output
• Language pipeline also feeds sentiment analysis
Now we’re in the perfect place to iterate & improve
![Page 40: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/40.jpg)
…to explore data
![Page 41: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/41.jpg)
In Summary
• Hadoop is good for counting, so use it for counting
• Move quickly whenever possible and don’t worry about automation
• Bring in new production services as you need them
• Freedom!
![Page 42: Building and Improving Products with Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/42.jpg)
Thanks!
@rathboma
Bonus:
http://hadoopweekly.com
from my colleague, Joe Crobak (presenting later!)