Cloudera User Group - From the Lab to the Factory
DESCRIPTION
This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) meetings in Chicago on 12/3/13 and in NYC on 12/5/13.
TRANSCRIPT
![Page 1: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/1.jpg)
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
![Page 2: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/2.jpg)
One Other Thing About Me
![Page 3: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/3.jpg)
Data Science: Another Definition
![Page 4: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/4.jpg)
Data Scientists Build Data Products.
![Page 5: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/5.jpg)
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
![Page 6: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/6.jpg)
All* Products Become Data Products
![Page 7: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/7.jpg)
Identifying the Bottlenecks
![Page 8: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/8.jpg)
Oryx: Model Building and Serving
• Algorithms
  • ALS Recommenders
  • K-Means Parallel
  • Random Decision Forests (RDF)
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models
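The split between batch model building and a real-time scoring server can be sketched roughly as follows. This is a minimal illustration, not Oryx's actual API: the names `Generation`, `build_generation`, and `ModelServer` are hypothetical, and plain Lloyd's k-means on an in-memory list stands in for the MapReduce build (it does not implement the k-means parallel initialization).

```python
import threading
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class Generation:
    """An immutable model snapshot produced by one batch build (hypothetical)."""
    number: int
    centroids: Tuple[Point, ...]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def build_generation(number, data, initial, iterations=10):
    """Batch step: plain Lloyd's k-means, standing in for a MapReduce job."""
    centroids = list(initial)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in data:
            clusters[nearest(centroids, p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return Generation(number, tuple(centroids))


class ModelServer:
    """Serving step: scores requests against the newest published generation."""

    def __init__(self, generation):
        self._lock = threading.Lock()
        self._generation = generation

    def score(self, point):
        with self._lock:
            gen = self._generation
        return gen.number, nearest(gen.centroids, point)

    def publish(self, generation):
        """Swap in a new generation; older generations are ignored."""
        with self._lock:
            if generation.number > self._generation.number:
                self._generation = generation
```

Because each generation is immutable, a request that is being scored while a new model is published simply finishes against the snapshot it started with.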
![Page 9: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/9.jpg)
Oryx Design
![Page 10: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/10.jpg)
Generational Thinking
![Page 11: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/11.jpg)
The Limits of Our Models
![Page 12: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/12.jpg)
Space Exploration
![Page 13: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/13.jpg)
Data Science Needs DevOps
![Page 14: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/14.jpg)
Introducing Gertrude
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
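One common way to run several independent experiments on every request is to divert the request separately in each experiment layer, hashing the request identifier together with the layer name. The sketch below illustrates that idea under stated assumptions: `Layer`, `Experiment`, `bucket`, and `divert_all` are hypothetical names, not Gertrude's API, and the bucket count of 1000 is arbitrary.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

NUM_BUCKETS = 1000  # assumed granularity of traffic allocation


def bucket(request_id: str, layer: str) -> int:
    """Deterministic bucket for a request within one layer.

    Salting the hash with the layer name makes the assignment in one
    layer effectively independent of the assignment in any other layer.
    """
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


@dataclass(frozen=True)
class Experiment:
    name: str
    start: int  # first bucket, inclusive
    end: int    # last bucket, exclusive


@dataclass(frozen=True)
class Layer:
    name: str
    experiments: Tuple[Experiment, ...]

    def divert(self, request_id: str) -> Optional[str]:
        """Experiment this request falls into, or None if undiverted."""
        b = bucket(request_id, self.name)
        for exp in self.experiments:
            if exp.start <= b < exp.end:
                return exp.name
        return None


def divert_all(layers, request_id: str) -> Dict[str, Optional[str]]:
    """One request can be in one experiment per layer at the same time."""
    return {layer.name: layer.divert(request_id) for layer in layers}
```

With a `ui` layer split 50/50 between two experiments and a `ranking` layer that gives 10% of traffic to one experiment, every request gets a `ui` assignment and roughly one in ten also gets the `ranking` experiment, regardless of its `ui` assignment.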
![Page 15: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/15.jpg)
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
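The relationship between a flag declared in code and config rules that compute its per-request value might look something like this. The rule format (a list of `when`/`value` entries, first match wins) and the names `Flag`, `MAX_RESULTS`, `CONFIG`, and `evaluate` are assumptions made for illustration, not Gertrude's actual config syntax.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Flag:
    """A setting declared in code whose value can vary per request."""
    name: str
    default: Any


# Declared once in application code (hypothetical flag).
MAX_RESULTS = Flag("max_results", 10)

# Would normally be loaded from a config file; hypothetical format.
# Rules are checked in order and the first matching rule wins.
CONFIG: Dict[str, List[Dict[str, Any]]] = {
    "max_results": [
        {"when": {"country": "US", "device": "mobile"}, "value": 5},
        {"when": {"country": "US"}, "value": 20},
    ]
}


def matches(when: Dict[str, Any], request: Dict[str, Any]) -> bool:
    """A rule matches when every listed attribute equals the request's."""
    return all(request.get(key) == value for key, value in when.items())


def evaluate(flag: Flag, config, request: Dict[str, Any]) -> Any:
    """Value of `flag` for this request, falling back to the default."""
    for rule in config.get(flag.name, []):
        if matches(rule["when"], request):
            return rule["value"]
    return flag.default
```

Keeping the rules this simple (equality checks only) is a deliberate choice in the sketch: the config stays easy to validate and reason about, while anything more complex lives in compiled code.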
![Page 16: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/16.jpg)
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
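For the file-based option, a server could periodically check the config file, validate any new version, and swap it in without a code deploy. The following is a rough sketch under assumptions: the JSON layout (each flag name mapped to a list of rules with `when` and `value` keys) and the names `validate` and `ConfigHolder` are hypothetical, and a ZooKeeper-backed variant would replace the modification-time check with a watch.

```python
import json
import os
import threading


class ConfigError(ValueError):
    """Raised when a config file does not have the expected shape."""


def validate(config):
    """Check the assumed layout: {flag_name: [{"when": ..., "value": ...}]}."""
    if not isinstance(config, dict):
        raise ConfigError("top level must be an object")
    for flag, rules in config.items():
        if not isinstance(rules, list):
            raise ConfigError(f"rules for {flag!r} must be a list")
        for rule in rules:
            if not isinstance(rule, dict) or not {"when", "value"} <= rule.keys():
                raise ConfigError(f"bad rule for {flag!r}: {rule!r}")
    return config


class ConfigHolder:
    """Holds the live config and reloads it when the file changes."""

    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._config = {}
        self._lock = threading.Lock()

    def refresh(self) -> bool:
        """Reload if the file changed; returns True if a new config went live.

        An invalid file is rejected and the previous config keeps serving.
        """
        mtime = os.stat(self._path).st_mtime_ns
        if mtime == self._mtime:
            return False
        self._mtime = mtime  # do not re-read the same version again
        try:
            with open(self._path, encoding="utf-8") as f:
                candidate = validate(json.load(f))
        except (ConfigError, json.JSONDecodeError):
            return False
        with self._lock:
            self._config = candidate
        return True

    def current(self):
        with self._lock:
            return self._config
```

A server would call `refresh()` on a timer and read `current()` on each request, so a bad push costs nothing more than a rejected file.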
![Page 17: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/17.jpg)
The Experiments Dashboard
![Page 18: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/18.jpg)
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
![Page 19: Cloudera User Group - From the Lab to the Factory](https://reader033.vdocuments.site/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/19.jpg)
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!