Rapid Data Exploration With Hadoop
DESCRIPTION
LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio. To rapidly develop these products, LinkedIn leverages a number of technologies, including open source, 3rd party solutions, and some we've had to invent along the way. This LinkedIn talk at the NYC Hadoop Meetup, held 3/18 at ContextWeb, focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.
TRANSCRIPT
![Page 1: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/1.jpg)
Rapid Data Exploration With Hadoop
Peter Skomoroch
Senior Data Scientist
@peteskomoroch
![Page 2: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/2.jpg)
Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
  - Spatial Analytics Pig Code
  - Trend detection with Pig & Python
  - R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights
![Page 3: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/3.jpg)
Connect the world’s professionals to make them more productive and successful
![Page 4: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/4.jpg)
Professional Identity
![Page 5: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/5.jpg)
LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009 - 1 billion people searches
• Average age: 41
• Household income $107,000
• 42% are “decision makers”
![Page 6: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/6.jpg)
How International?
• More than 50% international (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)
![Page 7: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/7.jpg)
How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client
![Page 8: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/8.jpg)
Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile
What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...
Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles
Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook
![Page 9: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/9.jpg)
Hadoop at LinkedIn
![Page 10: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/10.jpg)
Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files
Example member definition:
{ 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
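A typed definition like the one above can be used to check records before they are written. The following is only a minimal Python sketch of that idea, not Voldemort's actual serialization code: `MEMBER_SCHEMA`, `conforms`, and `validate` are hypothetical names, only the `int32` and `string` types from the slide are handled, and the fields the slide elides with "…" are left out.

```python
# Hypothetical schema mirroring the slide's member definition.
# The slide elides further fields with an ellipsis; they are not guessed here.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def conforms(value, type_name):
    """Return True if value fits the named type (only int32 and string are sketched)."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX
        )
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError("unsupported type in this sketch: " + type_name)


def validate(record, schema):
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append("missing field: " + field)
        elif not conforms(record[field], type_name):
            problems.append("bad type for " + field + ": expected " + type_name)
    return problems


good = {"member_id": 1, "first_name": "Ann", "last_name": "Lee", "age": 30}
bad = {"member_id": "1", "first_name": "Ann", "last_name": "Lee"}
print(validate(good, MEMBER_SCHEMA))
print(validate(bad, MEMBER_SCHEMA))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record during a bulk load.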
![Page 11: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/11.jpg)
Getting Data In
• From databases (user data, news, jobs etc.)
  - Need a way to get data reliably and periodically
  - Need tests to verify data
  - Support for incremental replication
  - Solution: Transmogrify Driver Program
    - InputReader: JDBCReader, CSVReader
    - OutputWriter: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks etc.)
  - Weblog files are rsynced and loaded into HDFS
  - Hadoop jobs for data cleaning and transformation
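The reader/writer split and the incremental replication requirement above can be illustrated with a small Python sketch. This is not the Transmogrify driver's real interface, which the slides do not show: `csv_reader`, `ListWriter`, `transfer`, the `id` column, and the sample rows are all assumptions made for illustration, and an in-memory list stands in for a JDBC or HDFS writer.

```python
import csv
import io


def csv_reader(text_stream):
    """Yield one dict per CSV row (a stand-in for a CSV input reader)."""
    yield from csv.DictReader(text_stream)


class ListWriter:
    """In-memory stand-in for a JDBC or HDFS output writer."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def transfer(reader, writer, last_seen_id=0):
    """Copy rows whose id is above last_seen_id.

    Returns (rows_copied, new_high_water_mark) so the next run can
    resume from where this one stopped (incremental replication).
    """
    copied = 0
    high = last_seen_id
    for row in reader:
        row_id = int(row["id"])
        if row_id <= last_seen_id:
            continue  # already replicated in an earlier run
        writer.write(row)
        copied += 1
        high = max(high, row_id)
    return copied, high


# Made-up sample data; a real run would read from a database or file.
source = "id,name\n1,ann\n2,bob\n3,cy\n"
writer = ListWriter()
copied, mark = transfer(csv_reader(io.StringIO(source)), writer, last_seen_id=1)

# A simple verification step: the writer holds exactly what was reported.
assert copied == len(writer.rows)
print(copied, mark)
```

Keeping the reader and writer behind narrow interfaces means the same `transfer` loop could be reused for any source and destination pairing.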
![Page 12: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/12.jpg)
Getting Data Out
![Page 13: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/13.jpg)
Giving Back: Open Source
http://sna-projects.com/sna/
![Page 14: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/14.jpg)
Analytics Technologies
![Page 15: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/15.jpg)
We Build Things With Data
Give smart people great tools, enable them to solve problems
![Page 16: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/16.jpg)
Prototyping Culture
![Page 17: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/17.jpg)
How does Hadoop enable rapid data exploration?
![Page 18: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/18.jpg)
Pig for Spatial Analytics
![Page 19: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/19.jpg)
US County HeatMap
![Page 20: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/20.jpg)
Pig for Trend Detection
![Page 21: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/21.jpg)
Python Streaming Script
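The script shown on this slide is not captured in the transcript. As a rough illustration of the kind of Python script that Pig or Hadoop streaming can pipe rows through, the sketch below scores each term by comparing its latest daily count with the average of its earlier days. The input layout (tab-separated `term`, `day`, `count`, already grouped by term), the scoring rule, and the names `parse` and `score_trends` are all assumptions, not the talk's actual method.

```python
from itertools import groupby


def parse(line):
    """Split one tab-separated line into (term, day, count)."""
    term, day, count = line.rstrip("\n").split("\t")
    return term, day, int(count)


def score_trends(lines):
    """Yield (term, score) pairs.

    score = latest day's count / mean count of the earlier days.
    Assumes lines for the same term are adjacent, as they would be
    after a GROUP or sort step, and that days are ISO dates so that
    string order matches time order.
    """
    records = (parse(line) for line in lines if line.strip())
    for term, group in groupby(records, key=lambda r: r[0]):
        counts = [count for _, _, count in sorted(group, key=lambda r: r[1])]
        if len(counts) < 2:
            continue  # no history to compare against
        baseline = sum(counts[:-1]) / (len(counts) - 1)
        if baseline > 0:
            yield term, counts[-1] / baseline


# Made-up sample rows for illustration only.
sample = [
    "hadoop\t2010-03-01\t10\n",
    "hadoop\t2010-03-02\t10\n",
    "hadoop\t2010-03-03\t30\n",
    "pig\t2010-03-01\t5\n",
    "pig\t2010-03-02\t5\n",
]
results = list(score_trends(sample))
for term, score in results:
    print(term + "\t" + str(score))
```

In a streaming job the same function would be fed `sys.stdin` instead of the sample list, with one tab-separated output line printed per term.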
![Page 22: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/22.jpg)
Sort Output & Display
![Page 23: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/23.jpg)
R Streaming Also Easy
*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
![Page 24: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/24.jpg)
Let’s Talk Data
![Page 25: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/25.jpg)
Business is recognizing the importance of analytics
![Page 26: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/26.jpg)
What data do we start with?
![Page 27: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/27.jpg)
We can also leverage...
• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...
![Page 28: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/28.jpg)
How do we think of Analytics?
Data Jujitsu
![Page 29: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/29.jpg)
Lots of Medium can be more powerful than Big
![Page 30: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/30.jpg)
Reconstruct Reality from Data Exhaust
![Page 31: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/31.jpg)
Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
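One way to read "build smaller consistent samples" is to pick the sample by hashing a stable key, so the same members land in the sample on every run and across every dataset that shares the key. The Python sketch below assumes that interpretation; `in_sample` and the use of a member id as the key are hypothetical choices, not something the slides specify.

```python
import hashlib


def in_sample(key, percent):
    """Return True if key falls in a deterministic sample of roughly percent%.

    The key is hashed with MD5 and mapped to a bucket from 0 to 99, so the
    decision depends only on the key, never on row order or a random seed.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


member_ids = range(1, 10001)  # made-up ids for illustration
small = [m for m in member_ids if in_sample(m, 10)]
larger = [m for m in member_ids if in_sample(m, 50)]
print(len(small), len(larger))
```

Because membership is decided by a bucket threshold, a smaller sample is always a subset of a larger one, which lets code tested on 1% of the data be rerun on 10% without the earlier rows changing.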
![Page 32: Rapid Data Exploration With Hadoop](https://reader033.vdocuments.site/reader033/viewer/2022052619/55501f8cb4c90555618b5303/html5/thumbnails/32.jpg)
Where did the bankers go?