large scale modeling overview

Large Scale Modeling Overview, Ferris Jumah, Prediction Analytics Innovation Summit 2013, November 15th, 2013

Upload: ferris-jumah

Post on 05-Aug-2015


TRANSCRIPT

Page 1: Large Scale Modeling Overview

Large Scale Modeling Overview

Ferris Jumah

Prediction Analytics Innovation Summit 2013, November 15th, 2013

Page 2: Large Scale Modeling Overview

Large Scale Modeling

• What does large scale modeling mean to you?

“Building models that consume and process data sets so large that it is difficult to use current modeling tools and methods”

Page 3: Large Scale Modeling Overview

LinkedIn News

Page 4: Large Scale Modeling Overview

LinkedIn News

• Anytime a user lands on their homepage, a few items from our news product are recommended to them

• This is powered by a large scale recommendation engine

• For every user, at LinkedIn scale

Page 5: Large Scale Modeling Overview

3M+ Company Pages

2 new members per second

184M+ monthly unique visitors

2.5B+ monthly page views

The World’s Largest Professional Network: 259,000,000+

Page 6: Large Scale Modeling Overview

Use It All

• Use all of the data you have
• Why not store, process, and model all of it?
• “The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples”
• Not using it is losing competitive edge

     

Page 7: Large Scale Modeling Overview

Norvig, The Unreasonable Effectiveness of Data, 2013

Classic Justification

Page 8: Large Scale Modeling Overview

More Data Beats Better Algorithms

Banko and Brill, 2001

Page 9: Large Scale Modeling Overview

More Data Beats Better Algorithms

• As data set size increases, your specific model and its tuning matter a lot less
• Can worry less about sample size, biases, and generalizing
• Spend your time on
  • Exploratory Analysis
  • Feature Engineering

     

Page 10: Large Scale Modeling Overview

Exploratory Analysis

• With large amounts of data, insights and hypotheses present themselves
• Group By And Count
• With large amounts of data, you can worry less about the distribution being reflective of the population
• Summary Statistics
• Simple Correlations
• Constantly Visualize
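The three workhorses on this slide (group by and count, summary statistics, simple correlations) can be sketched with nothing but the Python standard library. This is a minimal illustration, not the tooling used at LinkedIn: the records, field layout, and helper names are all made up for the example.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median

# Hypothetical member records: (industry, name_length, connections).
records = [
    ("software", 5, 320),
    ("software", 6, 410),
    ("finance", 7, 150),
    ("finance", 8, 180),
    ("software", 4, 290),
    ("healthcare", 9, 90),
]

def group_by_and_count(rows, key):
    """Count rows per group, the GROUP BY ... COUNT(*) of exploration."""
    return Counter(key(row) for row in rows)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

counts = group_by_and_count(records, key=lambda r: r[0])
name_lengths = [r[1] for r in records]
connections = [r[2] for r in records]
summary = {"mean": mean(connections), "median": median(connections)}
corr = pearson(name_lengths, connections)

print(counts.most_common())
print(summary)
print(round(corr, 3))
```

At real scale the same three operations would run in a distributed query engine, but the questions asked of the data stay the same.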

     

Page 11: Large Scale Modeling Overview

Exploratory Analysis Across LinkedIn Members

Page 12: Large Scale Modeling Overview

Exploratory Analysis Across LinkedIn Members

• Grouped by name letter length and title and counted
• Noticed that name length is heavily correlated with industry
• Able to start bootstrapping models
• Quickly validate or invalidate a model hypothesis
• Generalized the results into development of the title standardization models used today

     

Page 13: Large Scale Modeling Overview

Go Deep

• Massive datasets lend themselves well to very granular demographic slicing or bucketing
  • Get a very strong sense for customer segments
  • Reduce the size of your data without losing too much information
• Notice very specific trends that you can be confident are real
• Personalize deeply
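Bucketing as a way to shrink data can be sketched as follows: raw rows are collapsed into one summary per segment, keeping a count and a rate instead of every event. The event fields, the ten-year age buckets, and the function names are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical raw events: (age, region, clicked), where clicked is 0 or 1.
events = [
    (23, "US", 1), (27, "US", 0), (24, "US", 1),
    (41, "EU", 0), (45, "EU", 0), (38, "EU", 1),
    (29, "EU", 1),
]

def age_bucket(age, width=10):
    """Map an exact age to a coarse bucket label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def summarize_segments(rows):
    """Collapse raw rows into one (count, click_rate) summary per segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [rows, clicks]
    for age, region, clicked in rows:
        seg = (age_bucket(age), region)
        totals[seg][0] += 1
        totals[seg][1] += clicked
    return {seg: (n, clicks / n) for seg, (n, clicks) in totals.items()}

segments = summarize_segments(events)
for seg, (n, rate) in sorted(segments.items()):
    print(seg, n, round(rate, 2))
```

Keeping the per-segment count next to the rate matters: with massive data even narrow segments hold enough rows for the rate to be trusted, and the count shows when they do not.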

     

Page 14: Large Scale Modeling Overview

Go Deep

Say LinkedIn wants to sell me something…

Page 15: Large Scale Modeling Overview
Page 16: Large Scale Modeling Overview
Page 17: Large Scale Modeling Overview

Keep Going

• When operating with massive sets, combine several

• Tells you more than each would individually

Page 18: Large Scale Modeling Overview
Page 19: Large Scale Modeling Overview
Page 20: Large Scale Modeling Overview
Page 21: Large Scale Modeling Overview

Pitfalls Still Apply

Page 22: Large Scale Modeling Overview

Simpson’s paradox
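A small worked example shows why the paradox still bites at scale. The counts below are invented for illustration (they are not LinkedIn data): variant A wins inside every segment, yet variant B wins once the segments are pooled, because the two variants were exposed to very different mixes of segments.

```python
# Made-up counts for illustration: segment -> variant -> (successes, trials).
data = {
    "easy segment": {"A": (90, 100),   "B": (800, 1000)},
    "hard segment": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    return successes / trials

def pooled(variant):
    """Success rate for a variant after lumping all segments together."""
    s = sum(data[seg][variant][0] for seg in data)
    n = sum(data[seg][variant][1] for seg in data)
    return rate(s, n)

# Winner inside each segment, judged on that segment's own rates.
per_segment_winner = {
    seg: max(("A", "B"), key=lambda v: rate(*data[seg][v])) for seg in data
}
# Winner when segment membership is ignored.
overall_winner = max(("A", "B"), key=pooled)

print(per_segment_winner)
print(overall_winner)
```

More rows do not fix this; only slicing by the confounding segment does, which is one more reason to pair big data with granular bucketing.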

Page 23: Large Scale Modeling Overview

Large Datasets Allow More Creativity with Features

Page 24: Large Scale Modeling Overview

Mapping LinkedIn Skills, +1 to Edge Weight When Listed Concurrently
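The edge-weight rule on this slide can be sketched as a co-occurrence count: every time two skills appear on the same profile, the edge between them gains 1. The profiles and the function name are hypothetical, and the real skills graph may well apply normalization or filtering that the slide does not describe.

```python
from collections import Counter
from itertools import combinations

# Hypothetical member profiles, each a list of listed skills.
profiles = [
    ["python", "hadoop", "statistics"],
    ["python", "statistics"],
    ["hadoop", "pig", "python"],
    ["statistics"],
]

def build_skill_graph(skill_lists):
    """Return undirected edge weights: +1 per profile listing both skills."""
    edges = Counter()
    for skills in skill_lists:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(skills)), 2):
            edges[(a, b)] += 1
    return edges

graph = build_skill_graph(profiles)
for (a, b), weight in graph.most_common(3):
    print(a, b, weight)
```

Heavier edges then suggest related skills, which can feed features such as "skills similar to the ones this member already lists".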

Page 25: Large Scale Modeling Overview

Feature Engineering

Page 26: Large Scale Modeling Overview

Can Your Infrastructure Hang?

First question…..

Page 27: Large Scale Modeling Overview

Online or Offline?

If the problem domain can be scoped into an offline system, it usually should be

Online is appropriate when
• Data is best modeled in transient data streams rather than persistent relations
• Data relevance or freshness fades fast
• Too much data to store (infra, latency, etc.) and must be tossed
• News, Advertising, Gaming (A.I.), Stock Markets
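When data must be tossed after it is seen, statistics have to be maintained incrementally. One common way to do that, shown here as a sketch rather than as anything the talk prescribes, is Welford's online algorithm, which tracks mean and variance in constant memory. The stream contents and class name are made up for the example.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

def event_stream():
    """Stand-in for a live feed (hypothetical dwell times in seconds)."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for value in event_stream():
    stats.update(value)  # each event is seen once, then discarded

print(stats.n, stats.mean, round(stats.variance, 3))
```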

Page 28: Large Scale Modeling Overview

Online or Offline?

Benefits
• Instant Gratification
  – Immediate integration of data into modeling outcomes
  – Yahoo invented S4 to process user feedback in real-time to optimize search advertising ranking algorithms
• Mine more
  – In some systems it’s only possible to use all of your data in an online setting because there is simply too much
• Highly relevant now (matters for news)
• Personalized + Real time = Great User Experience

 

Page 29: Large Scale Modeling Overview

Online or Offline?

Challenges
• YOLO (You Only Learn Once)
• Specific expertise
• Evaluate/Interpret is Harder
  – YOLO makes it difficult to evaluate why a model is performing poorly and, inherently related, why a result is what it is
• Difficult to maintain
  – Data changing, adapting to new features, latency, evaluation
• Infrastructure that can support it. Supporting real time learning is a whole different ballgame
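To make "You Only Learn Once" concrete, here is a minimal sketch of online learning: a logistic regression that takes one gradient step per incoming example and never stores it. The model choice, learning rate, and synthetic stream are assumptions for illustration; the talk does not say which online learner LinkedIn uses. Because each example is seen once, the sketch scores every example before learning from it (progressive validation), which is one way to soften the evaluation problem listed above.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogisticRegression:
    """Logistic regression updated by one SGD step per incoming example."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def learn_one(self, x, y):
        """Update on a single (x, y) pair; the example is not stored."""
        error = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

def synthetic_stream(n, seed=0):
    """Hypothetical click stream: label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 1 if x[0] + x[1] > 0 else 0

model = OnlineLogisticRegression(n_features=2)
correct = seen = 0
for x, y in synthetic_stream(5000):
    # Progressive validation: score each example before learning from it.
    correct += int((model.predict_proba(x) >= 0.5) == bool(y))
    seen += 1
    model.learn_one(x, y)

print(f"progressive accuracy: {correct / seen:.3f}")
```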

Page 30: Large Scale Modeling Overview

Big Data Tech is Young

Page 31: Large Scale Modeling Overview

Google Trends: Hadoop & NoSQL

Page 32: Large Scale Modeling Overview

LinkedIn Open Source Data Tech

Page 33: Large Scale Modeling Overview

Developing Bleeding Edge Tech is Great

….What About Using It?

Page 34: Large Scale Modeling Overview

It can be a pain to use…..

As a user

Page 35: Large Scale Modeling Overview

High-level infrastructure needs

AB testing platform

Data/schema viewer

Workflow manager

Access

Modeling algorithms implementation

Page 36: Large Scale Modeling Overview

Is the system set up to iterate and test new models as fast as possible?
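Testing a new model quickly usually ends in an A/B comparison. As a hedged sketch of the statistics behind such a comparison (not a description of LinkedIn's AB testing platform), a two-proportion z-test checks whether the candidate's conversion rate differs from the control's by more than chance. The counts and function name below are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: control model A vs. candidate model B.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(round(z, 2), round(p_value, 4))
```

A positive z means B converted better than A; how small the p-value must be before shipping is a policy choice the platform should make explicit.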

Page 37: Large Scale Modeling Overview

High-level LinkedIn Data Flow

Page 38: Large Scale Modeling Overview

Evaluating Models

Page 39: Large Scale Modeling Overview

Evaluating Models

CROWDSOURCE!!!

Is this real?

Are we using feedback?

Page 40: Large Scale Modeling Overview

Summary

• Large-scale modeling
  • Isn’t easy but takes advantage of the large amounts of data we are storing
  • Sees noticeable increases in solution quality
• More data beats better algorithms
• Spend more time on exploratory analysis and feature engineering
• Benefits from large scale data
• Build infrastructure that lets you iterate and AB test as fast as possible

 

Page 41: Large Scale Modeling Overview

[email protected]