Bringing Deep Learning into Production

Brief introduction

• CTO & co-founder of Agile Lab

• Data & Tech addicted

• Contributor to Spark Notebook

• Spark early adopter

• Certified Cassandra Architect

• Deep Learning enthusiast

Who is Agile Lab ?

GO BIG (data) or GO HOME

http://www.meetup.com/it-IT/Torino-Scala-Programming-Big-Data-Meetup/

What we do

• Applications: high scalability

• Decision Support Systems: data engineering, data mining and data «meaning»

• Big Data Strategies

• Training: Reactive, NoSQL, Big Data, Machine learning

Why Deep Learning

Deep Learning is trending

What is Deep Learning

• Deep learning is just another name for artificial neural networks

• An algorithm is deep if the input is passed through several non-linearities before being output

• Deep learning is discovering the features that best represent the problem, rather than just a way to combine them
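
To make "several non-linearities" concrete, here is a minimal sketch of a forward pass through three stacked layers, each applying a different non-linear function. The weights are hand-picked, hypothetical values rather than a trained model, and the helper names (`dense`, `forward`) are illustrative, not taken from any framework in this deck.

```python
import math

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through every layer in turn."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], relu),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.2]], [0.05, 0.0], math.tanh),
    ([[1.0, -1.0]], [0.0], sigmoid),
]

prediction = forward([1.0, 2.0], layers)
print(prediction)
```

Removing the activation functions would collapse the three layers into a single linear map, which is why the non-linearities, and not the layer count alone, are what make the network deep in the sense used above.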

Deep Learning: Use cases

Do you want to start with Deep Learning?

Let’s choose the right tools !!

Deep Learning Frameworks

• Deeplearning4J
• TensorFlow
• Caffe
• Theano
• Torch
• Spark ML MultilayerPerceptrons
• H2O
• CNTK
• MatLab
• maxDNN

And many others

How to choose

Background

Target Environment

Vision

Background

Productivity !!

Big Data Engineer
• Scala
• Java

Math Engineer
• Java
• Python

Statistician
• R
• Python

Target Environment

Trained model should be deployable !!

[Diagram: Trained Model moves from Dev Env to Prod Env]

Target Environment

[Diagram: Dev Env and Prod Env; ML Pipeline with Data, Cleaning, ETL, Scheduling and Training]

- Track model performance over time
- Care about SLA
- Continuous tweaks
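
Tracking model performance over time can be sketched as a rolling accuracy check against a baseline. This is a minimal illustration only: the class name `PerformanceTracker`, the window size and the tolerance are assumptions made for the example, not part of any tool named in this deck.

```python
from collections import deque

class PerformanceTracker:
    """Keeps the accuracy of the most recent scoring batches (hypothetical helper)."""

    def __init__(self, baseline_accuracy, window=3, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # oldest batch drops out automatically

    def record(self, predictions, labels):
        """Store and return the accuracy of one batch of predictions."""
        correct = sum(1 for p, y in zip(predictions, labels) if p == y)
        accuracy = correct / len(labels)
        self.recent.append(accuracy)
        return accuracy

    def needs_retraining(self):
        """True once a full window of batches averages below baseline - tolerance."""
        if len(self.recent) < self.recent.maxlen:
            return False
        rolling = sum(self.recent) / len(self.recent)
        return rolling < self.baseline - self.tolerance

tracker = PerformanceTracker(baseline_accuracy=0.9, window=2)
tracker.record([1, 0, 1, 1], [1, 0, 1, 1])   # all four predictions correct
tracker.record([1, 0, 0, 0], [1, 0, 1, 1])   # two of four predictions correct
print(tracker.needs_retraining())
```

In a production pipeline the same check would run on a schedule, and a positive result would trigger the "continuous tweaks" step instead of a print statement.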

Enterprise Architecture

[Architecture diagram. Labels: Hadoop, Online DataStore, Enterprise Service Bus, Data Integration Layer, External Sources, Internal Business Sources, Internal System Sources, Analytics, Value Added Services, API Services, DeepLearning]

Easy Wins

Training pipeline should run on Spark or Hadoop

Trained Model should be represented in Java objects

Vision: keep in mind Scaling

High-level dynamic languages are incredibly productive for prototyping and data exploration

Scaling on larger data sets quickly runs into performance limitations

Keep scaling requirements in mind from the beginning

Vision: simplify the pipeline

Copy & Sample data from Dev Env to Data Scientist Env

Prototype in Python or R

Train model

Predict on validation Data

Translate model to match Prod Env (Java, MapReduce, Spark)

Deploy training pipeline and model
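
The steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data is synthetic, the "model" is a one-variable least-squares fit, and the JSON export merely stands in for the hand-off to the production team.

```python
import json
import random

def sample(rows, fraction, seed=42):
    """Copy a random subset of the rows, as a stand-in for sampling Dev Env data."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def train(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

def mean_squared_error(model, rows):
    return sum((predict(model, x) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic stand-in for data copied from the Dev Env: y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]

sampled = sample(data, fraction=0.5)                  # copy & sample
split = int(len(sampled) * 0.8)
train_rows, validation_rows = sampled[:split], sampled[split:]

model = train(train_rows)                             # prototype and train
validation_error = mean_squared_error(model, validation_rows)  # validate
exported = json.dumps(model)                          # hand-off to Prod Env
print(exported, validation_error)
```

The last line is where the cost hides: whatever receives `exported` has to re-implement `predict` in the production language, and that translation grows with every extra preprocessing step the prototype relies on.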

Easy Wins

Data scientists should work directly on the distributed environment

Data scientists and big data engineers should co-operate on the same platform

SWOT Analysis

TensorFlow

Strengths:
- Powered by Google
- Nice UI

Weaknesses:
- Powered by Google
- No support for “inline” matrix operations
- Slow

Opportunities:
- Awesome community

Threats:
- No Scala or Java integration
- No commercial support

Theano

Strengths:
- Grand Daddy of deep learning
- RNN and CNN
- Computational graph abstraction
- Python

Weaknesses:
- No support for Hadoop or Spark
- No plug & play nets

Opportunities:
- Great community


Torch

Strengths:
- GPU support
- Lots of pretrained models and packages
- Easy to use

Weaknesses:
- Lua language

Opportunities:
- Backed by DeepMind and Facebook


Caffe

Strengths:
- C++ & Python
- Good performance
- GPU support

Weaknesses:
- Focused on image processing

Opportunities:
- Backed by Yahoo for Spark integration
- GPU clustering

Threats:
- No commercial support

DeepLearning4j

Strengths:
- GPU support
- Java and Scala
- Full DNN set
- Supports Hadoop, Spark & Akka

Weaknesses:
- Not for dummies

Opportunities:
- Commercial support (SkyMind)

Threats:
- Not so sexy for data scientists because of Java/Scala

H2O

• Easy to use Web UI
• Multi language API
• Runs directly on HDFS or S3
• Model is a Java POJO
• Big Data ready
• Really fast
• Compressed data
• Regularization
• Grid search

• GPU is still on roadmap
• CNN and RNN too

H2O - Flow

H2O – Sparkling Water

• Python, R and Scala API
• Best Kagglers use H2O
• Tons of tools for profiling and tuning
• Spark leverage
• Best in class algorithms, battle tested
• Regularization
• Grid search
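
Regularization and grid search, both listed above, work together: the grid search tries several regularization strengths and keeps the one that scores best on held-out data. The sketch below shows the idea on a one-variable ridge regression in plain Python. It does not use or imitate the H2O API; the function names and the toy data are assumptions made for illustration.

```python
def fit_ridge(rows, l2):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(x^2) + l2)."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + l2)

def mse(w, rows):
    """Mean squared error of the prediction w * x against y."""
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)

def grid_search(train_rows, validation_rows, grid):
    """Fit one model per l2 value and return (error, l2, weight) of the best one."""
    results = []
    for l2 in grid:
        w = fit_ridge(train_rows, l2)
        results.append((mse(w, validation_rows), l2, w))
    return min(results)  # tuples compare by validation error first

# Hypothetical noisy training data around y = 3x, clean validation data
train_rows = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)]
validation_rows = [(4.0, 12.0), (5.0, 15.0)]

best_error, best_l2, best_w = grid_search(
    train_rows, validation_rows, grid=[0.0, 0.1, 1.0, 10.0]
)
print(best_l2, best_w, best_error)
```

On this toy data a small penalty beats both the unregularized fit and the heavier penalties, which is the trade-off a grid search is meant to expose.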

H2O – Sparkling Water

Workflow

[Diagram: Training Set → training → Java POJO]

Embeddable in:
• J2EE App
• Spark Job
• MR Job
• DWH as UDF

Spark as middleware

Using Spark as middleware, you can leverage:

• Deeplearning4J
• H2O
• TensorFlow (Arimo extension)
• Caffe (Yahoo extension)
• ML MultilayerPerceptrons and future implementations

NO tech provider lock-in

Our Stack for Enterprise

• Ready for Enterprise and Hadoop World
• Deployable into Java Env
• Notebook (Flow)
• H2O for out of the box algorithms
• DeepLearning4J for advanced DNN and n-dimensional array manipulation
• Good usability for both Data Scientists and Big Data Engineers
• Enterprise Support along the whole stack

Thanks!

We are hiring !

[email protected]
