Bringing Deep Learning into Production
Brief introduction
• CTO & co-founder of Agile Lab
• Data & tech addict
• Contributor to Spark Notebook
• Spark early adopter
• Certified Cassandra Architect
• Deep Learning enthusiast
Who is Agile Lab?
GO BIG (data) or GO HOME
http://www.meetup.com/it-IT/Torino-Scala-Programming-Big-Data-Meetup/
What we do
• Applications: high scalability
• Decision Support Systems: data engineering, data mining and data «meaning»
• Big Data Strategies
• Training: Reactive, NoSQL, Big Data, Machine learning
Why Deep Learning
Deep Learning is trending
What is Deep Learning
• Deep learning is just another name for artificial neural networks
• An algorithm is deep if the input passes through several non-linearities before reaching the output
• Deep learning is discovering the features that best represent the problem, rather than just a way to combine them
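To make "several non-linearities" concrete, here is a minimal sketch in plain Python. It is not taken from any of the frameworks discussed in this deck. The layer sizes and weights are hypothetical, hand-picked values, and `tanh` is just one possible non-linearity.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer followed by a tanh non-linearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through every (weights, biases) layer in order."""
    activation = inputs
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical parameters: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2], [-0.1, 0.3, 0.6]], [0.05, -0.05]),
    ([[1.0, -1.0]], [0.0]),
]

# The input crosses three non-linearities before it becomes the output.
output = forward([1.0, 2.0], LAYERS)
```

In a real network the weights are learned from data, which is where the "discovering the features" part comes from. Here they are fixed only to show the structure.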
Deep Learning: Use cases
Do you want to start with Deep Learning?
Let’s choose the right tools!
Deep Learning Frameworks
• Deeplearning4J
• TensorFlow
• Caffe
• Theano
• Torch
• Spark ML MultilayerPerceptrons
• H2O
• CNTK
• MATLAB
• maxDNN
And many others
How to choose
Background
Target Environment
Vision
Background
Productivity!
• Big Data Engineer: Scala, Java
• Math Engineer: Java, Python
• Statistician: R, Python
Target Environment
Trained model should be deployable!
[Diagram: Trained Model moving from Dev Env to Prod Env]
Target Environment
[Diagram: Prod Env and Dev Env, with Training, Data Cleaning, ETL, Scheduling and ML Pipeline]
- Track model performance over time
- Care about SLA
- Continuous tweaks
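As one possible reading of "track model performance over time", the sketch below keeps a rolling window of prediction outcomes and flags the model when accuracy drops under a threshold. The class name, the window size and the threshold are all hypothetical; a production setup would likely rely on the monitoring tools already present in the target environment.

```python
from collections import deque

class PerformanceTracker:
    """Keeps a rolling window of prediction outcomes and flags degradation."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        # True means the prediction matched the observed outcome.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when the windowed accuracy falls below the agreed threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy

# Hypothetical usage with a tiny window so the effect is easy to see.
tracker = PerformanceTracker(window_size=4, min_accuracy=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    tracker.record(predicted, actual)
healthy_before = not tracker.needs_attention()

for predicted, actual in [(1, 0), (0, 1)]:  # two wrong predictions arrive
    tracker.record(predicted, actual)
degraded_after = tracker.needs_attention()
```

A flag like this is the trigger for the "continuous tweaks" step: retrain, retune or roll back.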
Enterprise Architecture
[Architecture diagram. Components: Hadoop, Online DataStore, Enterprise Service Bus, Data Integration Layer, External Sources, Internal Business Sources, Internal System Sources, Analytics, Value Added Services, API Services, DeepLearning]
Easy Wins
Training pipeline should run on Spark or Hadoop
Trained Model should be represented in Java objects
Vision: keep in mind Scaling
High-level dynamic languages are incredibly productive for prototyping and data exploration
Scaling to larger data sets quickly runs into performance limitations
Keep scaling requirements in mind from the beginning
Vision: simplify the pipeline
Copy & Sample data from Dev Env to Data Scientist Env
Prototype in Python or R
Train model
Predict on validation Data
Translate the model to match the Prod Env (Java, MapReduce, Spark)
Deploy training pipeline and model
Easy Wins
Data scientists should work directly on the distributed environment
Data scientists and big data engineers should cooperate on the same platform
SWOT Analysis
TensorFlow
Strengths:
- Powered by Google
- Nice UI
Weaknesses:
- Powered by Google
- No support for “inline” matrix operations
- Slow
Opportunities:
- Awesome community
Threats:
- No Scala or Java integration
- No commercial support
Theano
Strengths:
- Granddaddy of deep learning
- RNN and CNN
- Computational graph abstraction
- Python
Weaknesses:
- No support for Hadoop or Spark
- No plug & play nets
Opportunities:
- Great community
Threats:
- No Scala or Java integration
- No commercial support
Torch
Strengths:
- GPU support
- Lots of pretrained models and packages
- Easy to use
Weaknesses:
- Lua language
Opportunities:
- Backed by DeepMind and Facebook
Threats:
- No Scala or Java integration
- No commercial support
Caffe
Strengths:
- C++ & Python
- Good performance
- GPU support
Weaknesses:
- Focused on image processing
Opportunities:
- Backed by Yahoo for Spark integration
- GPU clustering
Threats:
- No commercial support
DeepLearning4j
Strengths:
- GPU support
- Java and Scala
- Full DNN set
- Supports Hadoop, Spark & Akka
Weaknesses:
- Not for dummies
Opportunities:
- Commercial support (SkyMind)
Threats:
- Not so sexy for data scientists because of Java/Scala
H2O
• Easy to use Web UI
• Multi-language API
• Runs directly on HDFS or S3
• Model is a Java POJO
• Big Data ready
• Really fast
• Compressed data
• Regularization
• Grid Search
• GPU is still on the roadmap
• CNN and RNN too
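Regularization and grid search appear in the list above as features. The sketch below shows the general idea in plain Python and does not use the H2O API. It fits a one-dimensional ridge regression for each candidate L2 penalty and keeps the value with the lowest validation error. The data and the candidate grid are made up for illustration.

```python
def fit_ridge(xs, ys, l2):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(x^2) + l2)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x against y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train, valid, l2_grid):
    """Try every candidate penalty and keep the one with the lowest validation error."""
    results = []
    for l2 in l2_grid:
        w = fit_ridge(train[0], train[1], l2)
        results.append((mse(w, valid[0], valid[1]), l2, w))
    return min(results)  # tuples compare by validation error first

# Toy data, invented for this example: y is roughly 2 * x with noise in training.
train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
valid = ([4.0, 5.0], [8.0, 10.0])

best_error, best_l2, best_w = grid_search(train, valid, [0.0, 0.1, 1.0, 10.0])
```

On this toy data a small penalty beats both no penalty and a large one, which is the trade-off a grid search is meant to expose.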
H2O - Flow
H2O – Sparkling Water
• Python, R and Scala API
• Best Kagglers use H2O
• Tons of tools for profiling and tuning
• Leverages Spark
• Best-in-class, battle-tested algorithms
• Regularization
• Grid search
H2O – Sparkling Water
Workflow
[Diagram: Training Set → training → Java POJO]
Embeddable in:
• J2EE App
• Spark Job
• MR Job
• DWH as UDF
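The point of the workflow above is that the trained model becomes a plain object with no dependency on the training platform, so any host application can call it. The sketch below is only a Python analogue of that idea. A real H2O export is generated Java code, and the class name, weights and `score` method here are hypothetical.

```python
import math

class ExportedModel:
    """Hypothetical stand-in for an exported model: parameters are plain literals."""

    WEIGHTS = [0.8, -0.4]
    BIAS = 0.1

    def score(self, features):
        """Logistic score in (0, 1) computed with the standard library only."""
        z = sum(w * x for w, x in zip(self.WEIGHTS, features)) + self.BIAS
        return 1.0 / (1.0 + math.exp(-z))

def score_batch(model, rows):
    """The same object could be called from a web app, a batch job or a UDF."""
    return [model.score(row) for row in rows]

model = ExportedModel()
scores = score_batch(model, [[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]])
```

Because scoring needs nothing but the object itself, the training cluster does not have to be reachable from the production application.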
Spark as middleware
Using Spark as middleware, you can leverage:
• Deeplearning4J
• H2O
• TensorFlow (Arimo extension)
• Caffe (Yahoo extension)
• ML MultilayerPerceptrons and future implementations
No tech provider lock-in
Our Stack for Enterprise
• Ready for Enterprise and the Hadoop world
• Deployable into a Java environment
• Notebook (Flow)
• H2O for out-of-the-box algorithms
• Deeplearning4J for advanced DNN and n-dimensional array manipulation
• Good usability for both data scientists and big data engineers
• Enterprise support along the whole stack