Large Scale Modeling Overview
Ferris Jumah
Prediction Analytics Innovation Summit 2013, November 15th, 2013
Large Scale Modeling
• What does large scale modeling mean to you?
“Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”
LinkedIn News
• Anytime a user lands on their homepage, a few items from our news product are recommended to them
• This is powered by a large scale recommendation engine
• For every user, at LinkedIn Scale
3M+ Company Pages
2 new Members per second
184 M+ Monthly Unique Visitors
2.5 B+ Monthly PageViews
The World’s Largest Professional Network 259,000,000 +
Use It All
• Use all of the data you have
• Why not store, process, and model all of it?
• “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples”
• Not using it is losing competitive edge
Norvig, The Unreasonable Effectiveness of Data, 2013
Classic Justification
More Data Beats Better Algorithms
Banko and Brill, 2001
More Data Beats Better Algorithms
• As data set size increases, your specific model and the tuning matter a lot less
• Can worry less about sample size, biases, and generalizing
• Spend your time on
– Exploratory Analysis
– Feature Engineering
Exploratory Analysis
• With large amounts of data, insights and hypotheses present themselves
• Group By And Count
• With large amounts of data, you can worry less about the distribution being reflective of the population
• Summary Statistics
• Simple Correlations
• Constantly Visualize
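The exploratory steps listed above (group by and count, summary statistics, simple correlations) can be sketched in a few lines of plain Python. This is only an illustration: the `members` records, their fields, and the `pearson` helper are hypothetical and not taken from the talk.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median

# Hypothetical member records: (industry, connections, profile_views).
members = [
    ("software", 320, 45),
    ("software", 150, 20),
    ("finance", 500, 80),
    ("finance", 410, 60),
    ("healthcare", 90, 10),
    ("software", 275, 38),
]

# Group by and count: how many members fall in each industry?
counts = Counter(industry for industry, _, _ in members)

# Summary statistics for one numeric column.
connections = [c for _, c, _ in members]
views = [v for _, _, v in members]
summary = {
    "mean": mean(connections),
    "median": median(connections),
    "min": min(connections),
    "max": max(connections),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(connections, views)

print(counts.most_common())
print(summary)
print(f"correlation(connections, views) = {r:.3f}")
```

At real scale the same three operations would typically run as distributed jobs rather than in memory, but the questions being asked of the data are the same.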
Exploratory Analysis Across LinkedIn Members
• Grouped by name letter length and title and counted
• Noticed that name length is heavily correlated with industry
• Able to start bootstrapping models
• Quickly validate or invalidate a model hypothesis
• Generalized the results into development of the title standardization models used today
Go Deep
• Massive datasets lend themselves well to very granular demographic slicing or bucketing
• Get a very strong sense for customer segments
• Reduce the size of your data without losing too much information
• Notice very specific trends that you can be confident are real
• Personalize deeply
Go Deep
Say LinkedIn wants to sell me something…
Keep Going
• When operating with massive data sets, combine several
• Tells you more than each would individually
Pitfalls Still Apply
Simpson’s paradox
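As a reminder of how Simpson's paradox shows up when slicing data, here is a small sketch. The counts are purely illustrative (they are not LinkedIn data), and the variant and segment names are hypothetical. Variant X has the higher click rate inside every segment, yet variant Y looks better once the segments are pooled, because the two variants were shown to very different mixes of members.

```python
# Illustrative counts only: (clicks, impressions) for two hypothetical
# recommendation variants, split by member segment.
results = {
    "X": {"new members": (81, 87), "long-time members": (192, 263)},
    "Y": {"new members": (234, 270), "long-time members": (55, 80)},
}

def rate(clicks, impressions):
    """Click-through rate for one cell."""
    return clicks / impressions

def overall(variant):
    """Click-through rate after pooling all segments of a variant."""
    clicks = sum(c for c, _ in results[variant].values())
    impressions = sum(n for _, n in results[variant].values())
    return rate(clicks, impressions)

# Per-segment comparison: X wins in each segment.
for segment in results["X"]:
    x = rate(*results["X"][segment])
    y = rate(*results["Y"][segment])
    print(f"{segment}: X={x:.3f} Y={y:.3f}")

# Pooled comparison: Y wins overall.
print(f"overall: X={overall('X'):.3f} Y={overall('Y'):.3f}")
```

More data does not remove this effect; checking results both per segment and in aggregate is what catches it.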
Large Datasets Allow More Creativity with Features
Mapping LinkedIn Skills, +1 to Edge Weight When Listed Concurrently
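The skills map described above (add 1 to an edge's weight whenever two skills are listed together) can be sketched as a co-occurrence count. The `profiles` data below is made up for illustration, and this is one plausible reading of the slide rather than LinkedIn's actual pipeline.

```python
from collections import Counter
from itertools import combinations

# Hypothetical skill lists, one per member profile.
profiles = [
    ["python", "hadoop", "statistics"],
    ["python", "statistics"],
    ["hadoop", "java"],
    ["python", "hadoop"],
]

edge_weights = Counter()
for skills in profiles:
    # Sort and de-duplicate so (a, b) and (b, a) map to the same undirected edge.
    for a, b in combinations(sorted(set(skills)), 2):
        edge_weights[(a, b)] += 1  # +1 each time two skills are listed together

for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting weighted graph can then feed features such as "skills most related to X" or be clustered into skill groups.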
Feature Engineering
Can Your Infrastructure Hang?
First question…..
Online or Offline?
If the problem domain can be scoped into an offline system, it usually should be
Online is appropriate when
• Data is best modeled in transient data streams rather than persistent relations
• Data relevance or freshness fades fast
• Too much data to store (infra, latency etc) and must be tossed
• News, Advertising, Gaming (A.I.), Stock Markets
Online or Offline?
Benefits
• Instant Gratification
– Immediate integration of data into modeling outcomes
– Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms
• Mine more
– In some systems it’s only possible to use all of your data in an online setting because there is simply too much
• Highly relevant now (matters for news)
• Personalized + Real time = Great User Experience
Online or Offline?
Challenges
• YOLO (You Only Learn Once)
• Specific expertise
• Evaluate/Interpret is Harder
– YOLO makes it difficult to evaluate why a model is performing poorly and, relatedly, why a result is what it is
• Difficult to maintain
– Data changing, adapting to new features, latency, evaluation
• Infrastructure that can support it. Supporting real time learning is a whole different ballgame
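To make the "you only learn once" constraint concrete, here is a minimal sketch of online learning: a logistic regression updated by stochastic gradient descent, where each event is used for exactly one update and then discarded. The class, the `event_stream` generator, and the synthetic labeling rule are all hypothetical; this is not the system described in the talk.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Update on a single (features, label) pair; the pair is not stored."""
        error = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

def event_stream(n, seed=0):
    """Hypothetical stream: label is 1 when the first feature exceeds the second."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, 1 if x[0] > x[1] else 0

model = OnlineLogisticRegression(n_features=2)
for x, y in event_stream(5000):
    model.learn_one(x, y)  # each event is seen exactly once

print(model.predict_proba([0.9, 0.1]), model.predict_proba([0.1, 0.9]))
```

Because no event is kept, there is nothing to replay when the model misbehaves, which is one reason evaluation and interpretation are harder online than offline.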
Big Data Tech is Young
Google Trends Hadoop & NOSQL
LinkedIn Open Source Data Tech
Developing Bleeding Edge Tech is Great
….What About Using It?
It can be a pain to use…..
As a user
High-level infrastructure needs
• AB testing platform
• Data/schema viewer
• Workflow manager
• Access
• Modeling algorithms implementation
Is the system set up to iterate and test new models as fast as possible?
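One small piece of fast iteration is being able to read an A/B test result quickly. A common way to compare two conversion rates is a two-proportion z-test, sketched below. The talk does not say which test LinkedIn's platform uses, so treat this as a generic illustration; the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), where z > 0 means variant B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: clicks out of impressions for control (A) and new model (B).
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test itself is cheap; the infrastructure question is how quickly a new model can be put in front of enough traffic to produce these counts.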
High-level LinkedIn Data Flow
Evaluating Models
CROWDSOURCE!!! Is this real?
Are we using feedback?
Summary
• Large-scale modeling
– Isn’t easy but takes advantage of the large amounts of data we are storing
– Sees noticeable increases in solution quality
• More data beats better algorithms
– Spend more time on exploratory analysis and feature engineering
– Benefits from large scale data
• Build infrastructure that lets you iterate and AB test as fast as possible