Building Data Products

Building Data Products, Josh Wills, Senior Director of Data Science

Upload: cloudera-inc

Post on 18-Jul-2015


TRANSCRIPT

Page 1: Building Data Products

Building Data Products
Josh Wills, Senior Director of Data Science

Page 2: Building Data Products

About Me

Page 3: Building Data Products

What Do Data Scientists Do?

Page 4: Building Data Products

What I Think I Do

Page 5: Building Data Products

What Other People Think I Do

Page 6: Building Data Products

What I Actually Do

Page 7: Building Data Products

Data Science and Data Products

Page 8: Building Data Products

Thinking About Data Products

Page 9: Building Data Products

The Best Way To Find Insights

Page 10: Building Data Products

Build A Team

Page 11: Building Data Products

Measure Everything

Page 12: Building Data Products

Solve the Right Problem

Page 13: Building Data Products

Building Data Products with Hadoop

Page 14: Building Data Products

Hadoop as a Platform for Data Products

Page 15: Building Data Products

ETL, Data Science, and Machine Learning

Page 16: Building Data Products

Changing the Unit of Analysis

Page 17: Building Data Products

Machine Learning and You

Page 18: Building Data Products

The Five Questions

1. When should I use it?

2. What does the input look like?

3. What does the output look like?

4. How many parameters do I have to tune?

5. Why will it fail?

Page 19: Building Data Products

1. Collaborative Filtering

Page 20: Building Data Products

Collaborative Filtering (cont.)

1. To see things that are hidden.

2. <user_id>,<item_id>,<weight>

3. <item1>,<item2>,<score>

4. The distance metric and the weight calculations.

5. If the input data is too sparse.
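The input and output shapes listed above can be made concrete with a minimal item-to-item sketch. It assumes cosine similarity as the distance metric and uses the weights as given; the slide names neither choice, and the function name `item_similarities` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(ratings):
    """Turn (user_id, item_id, weight) triples into (item1, item2, score) triples.

    Cosine similarity is an assumed metric; the slide does not name one.
    """
    # Group weights by user; a repeated (user, item) pair keeps the last weight.
    by_user = defaultdict(dict)
    for user, item, weight in ratings:
        by_user[user][item] = weight

    norms = defaultdict(float)  # sum of squared weights per item
    dots = defaultdict(float)   # dot product per co-rated item pair
    for items in by_user.values():
        for item, weight in items.items():
            norms[item] += weight * weight
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    scores = []
    for (a, b), dot in sorted(dots.items()):
        denom = math.sqrt(norms[a] * norms[b])
        if denom > 0:  # skip items whose weights are all zero
            scores.append((a, b, dot / denom))
    return scores
```

Only item pairs that share at least one user get a score at all, which is one way to see the sparsity failure in answer 5: with too few overlapping users, most pairs never appear in the output.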

Page 21: Building Data Products

Collaborative Filtering on Hadoop

Page 22: Building Data Products

2. K-Means Clustering

Page 23: Building Data Products

K-Means Clustering (cont.)

1. To find anomalous events.

2. Vectors of normally distributed values.

3. Cluster centroids.

4. The choice(s) of K.

5. The points aren’t even remotely normally distributed.
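As a rough illustration of the answers above, here is a small sketch of Lloyd's algorithm with an anomaly score defined as the distance to the nearest centroid. The evenly spaced initialization, the fixed iteration count, and the names `kmeans` and `anomaly_score` are assumptions made for the example, not details from the talk.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns k centroids as tuples."""
    # Assumed initialization: k evenly spaced input points (deterministic).
    centroids = [tuple(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            index, _ = nearest(p, centroids)
            clusters[index].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return nearest(point, centroids)[1]
```

A point far from every centroid gets a high score. How high counts as anomalous depends on K and on how well the data fit the roughly normal shape the slide assumes.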

Page 24: Building Data Products

K-Means on Hadoop

Page 25: Building Data Products

3. Random Forests

Page 26: Building Data Products

Random Forests (cont.)

1. To classify and predict.

2. A dependent variable and many independent variables.

3. Lots and lots of little trees.

4. The number of variables to consider at each level.

5. Too many independent variables.

Page 27: Building Data Products

Random Forests on Hadoop

• R’s randomForest and rhadoop tools

• Map: partition the input data among the reducers

• Reduce: fit the random forests to each partition

• Re-combine the resulting trees in the client
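The partition, fit, and combine steps above can be sketched in plain Python. This is a toy stand-in, not the R code from the talk: ordinary functions play the roles of map and reduce, one-split stumps stand in for full trees, and the round-robin partitioning and all function names are assumptions.

```python
import random
from collections import Counter


def map_partition(rows, num_partitions):
    """Map step: spread rows across partitions (round-robin is an assumed scheme)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def majority(labels, default=None):
    """Most common label, or default when there are no labels."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0] if counts else default


def fit_stump(rows, feature):
    """Fit a one-split tree on one feature, thresholding at the feature mean."""
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    overall = majority(y for _, y in rows)
    left = majority((y for x, y in rows if x[feature] <= threshold), overall)
    right = majority((y for x, y in rows if x[feature] > threshold), overall)
    return (feature, threshold, left, right)


def reduce_fit_forest(partition, num_trees, rng):
    """Reduce step: fit stumps, each on a bootstrap sample and a random feature."""
    num_features = len(partition[0][0])
    forest = []
    for _ in range(num_trees):
        sample = [rng.choice(partition) for _ in partition]
        forest.append(fit_stump(sample, rng.randrange(num_features)))
    return forest


def train(rows, num_partitions=2, trees_per_partition=5, seed=0):
    """Client: run map, run each reduce, then concatenate the resulting trees."""
    rng = random.Random(seed)
    forest = []
    for partition in map_partition(rows, num_partitions):
        if partition:  # skip empty partitions
            forest.extend(reduce_fit_forest(partition, trees_per_partition, rng))
    return forest


def predict(forest, x):
    """Majority vote over all stumps in the combined forest."""
    return majority(left if x[f] <= t else right for f, t, left, right in forest)
```

Each row is a `(features, label)` pair. Because every stump sees only one partition, the combined forest is only as good as the partitions are representative, which is worth checking before trusting the result.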

Page 28: Building Data Products

The Art of Model Design

Page 29: Building Data Products

Caution: Mind the Gap

Page 30: Building Data Products

The Joy of Experiments

Page 31: Building Data Products

Introduction to Data Science: Building Recommender Systems
http://university.cloudera.com/

Page 32: Building Data Products

Josh Wills, Director of Data Science, Cloudera
@josh_wills

Thank you!