DL4J at Workday Meetup
TRANSCRIPT
DL4J: Deep Learning for the JVM and Enterprise
David C. Kale and Ruben Fiszel, Skymind
Workday Data Science Meetup, August 10, 2016
Who are we?
• Deeplearning4j: open source deep learning on the JVM
• Skymind: deep learning for the enterprise
  • fighting the good fight vs. the Python deep learning mafia
  • founded by Adam Gibson
  • CEO: Chris Nicholson
• Dave Kale: developer, Skymind (Scala API)
  • also PhD student, USC; research: deep learning for healthcare
• Ruben Fiszel: intern, Skymind (reinforcement learning, RL4J)
  • also MS student, EPFL
Outline
• Overview of deep learning
• Tour of DL4J
• Scaling up DL4J
• DL4J versus…
• Preview of DL4J Scala API
• Preview of RL4J
What is Deep Learning?
• Compositions of (deterministic) differentiable functions, some parameterized
  • compute transformations of the data
  • eventually emit an output
  • can have multiple paths
• The architecture is end-to-end differentiable w.r.t. its parameters (the w's)
• Training:
  • define targets and a loss function
  • apply gradient methods: use the chain rule to get component-wise updates
[Diagram: computation graph. x1 → f1(x1; w1) → z1 → f2(z1) → z2; x2 → f4(x2; w4) → z4; [z2, z4] → f3([z2, z4]; w3) → y → Loss(y, t), with target t.]
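The chain-rule bookkeeping behind training can be made concrete with a tiny composed function. The following is an illustrative, self-contained sketch (not DL4J code; all names are invented for this example): f1(x; w) = w·x, f2 = tanh, and a squared-error loss, with the analytic gradient checked against a finite-difference estimate.

```java
// A network is just nested differentiable functions, so dLoss/dw falls out
// of the chain rule. Scalar case for clarity; real layers use matrices.
public class ChainRuleDemo {
    // Forward pass: y = tanh(w * x), loss = 0.5 * (y - t)^2
    public static double loss(double x, double w, double t) {
        double y = Math.tanh(w * x);
        return 0.5 * (y - t) * (y - t);
    }

    // Analytic gradient dLoss/dw via the chain rule:
    // dLoss/dy = (y - t), dy/dz = 1 - tanh(z)^2, dz/dw = x
    public static double gradW(double x, double w, double t) {
        double y = Math.tanh(w * x);
        return (y - t) * (1 - y * y) * x;
    }

    public static void main(String[] args) {
        double x = 0.7, w = 1.3, t = 0.2, eps = 1e-6;
        double numeric = (loss(x, w + eps, t) - loss(x, w - eps, t)) / (2 * eps);
        System.out.printf("analytic=%.6f numeric=%.6f%n", gradW(x, w, t), numeric);
    }
}
```

The same pattern scales to every node in the graph above: each function contributes one local derivative factor, and backpropagation multiplies them along each path.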
Example: multilayer perceptron
• Classic "neural net" architecture, a powerful nonlinear function approximator
• Zero or more fully connected ("dense") layers of "neurons"
  • ex. neuron: h = f(Wx + b) for some nonlinearity f (e.g., ReLU(a) = max(a, 0))
• Predicts y from fixed-size, not-too-large x with no structure
  • classify digits in MNIST (digits are generally centered and upright)
  • model risk of mortality in patients with pneumonia
• Special case: logistic regression (zero hidden layers)
http://deeplearning4j.org/mnist-for-beginners
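The per-layer computation h = f(Wx + b) is small enough to write out directly. A minimal sketch (plain Java, not the DL4J API; names are illustrative):

```java
// One fully connected ("dense") layer: h = ReLU(W x + b).
// An MLP is just a stack of these; with zero hidden layers and a sigmoid
// output it degenerates to logistic regression.
public class DenseLayerDemo {
    public static double relu(double a) { return Math.max(a, 0.0); }

    public static double[] dense(double[][] W, double[] x, double[] b) {
        double[] h = new double[W.length];
        for (int i = 0; i < W.length; i++) {
            double a = b[i];                       // start from the bias
            for (int j = 0; j < x.length; j++)
                a += W[i][j] * x[j];               // weighted sum of inputs
            h[i] = relu(a);                        // nonlinearity
        }
        return h;
    }
}
```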
Variation of the MLP: the autoencoder
• "Unsupervised" training: no separate target y
• Learns to accurately reconstruct x from a succinct latent z
• Probabilistic generative variants (e.g., deep belief nets) can generate novel x's by first sampling z from a prior probability distribution p(z)
http://deeplearning4j.org/deepautoencoder
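The encode/reconstruct round trip can be sketched with a linear, tied-weight toy (illustrative only; a real DL4J autoencoder learns W by minimizing the reconstruction error rather than using a fixed matrix):

```java
// Minimal linear autoencoder with tied weights:
// encode z = W x (latent dim < input dim), decode xHat = W^T z.
public class AutoencoderDemo {
    public static double[] encode(double[][] W, double[] x) {
        double[] z = new double[W.length];
        for (int i = 0; i < W.length; i++)
            for (int j = 0; j < x.length; j++)
                z[i] += W[i][j] * x[j];
        return z;
    }

    public static double[] decode(double[][] W, double[] z) {
        double[] xHat = new double[W[0].length];
        for (int j = 0; j < xHat.length; j++)
            for (int i = 0; i < W.length; i++)
                xHat[j] += W[i][j] * z[i];         // tied weights: W transposed
        return xHat;
    }

    // Squared reconstruction error, the quantity training would minimize
    public static double reconError(double[] x, double[] xHat) {
        double e = 0;
        for (int j = 0; j < x.length; j++)
            e += (x[j] - xHat[j]) * (x[j] - xHat[j]);
        return e;
    }
}
```

Because z is lower-dimensional than x, perfect reconstruction is impossible in general; training squeezes x through the bottleneck so that z captures its most important structure.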
Example: convolutional (neural) networks
• Convolution layers "filter" x to extract features
  • filters exploit (spatially) local regularities while preserving spatial relationships
• Subsampling (pooling) layers combine local information and reduce resolution
  • pooling gives translational invariance (i.e., the classifier is robust to shifts in x)
• Predicts y from x with local structure (e.g., images, short time series)
  • 2D: classify images of, e.g., cats; the cat may appear in different locations
  • 1D: diagnose patients from lab time series; symptoms occur at different times
• Special case: fully convolutional network with no MLP at the "top" (filters for variable-sized x's)
http://deeplearning4j.org/convolutionalnets
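The two layer types above reduce to a few lines each in the 1D case. A self-contained sketch (not DL4J code; kernel and pool width are illustrative):

```java
public class ConvDemo {
    // Valid 1D convolution: the SAME kernel slides over every position,
    // which is exactly the parameter sharing the slides describe.
    public static double[] conv1d(double[] x, double[] k) {
        double[] out = new double[x.length - k.length + 1];
        for (int i = 0; i < out.length; i++)
            for (int j = 0; j < k.length; j++)
                out[i] += x[i + j] * k[j];
        return out;
    }

    // Max pooling with stride == width: combines local information and
    // reduces resolution, giving some robustness to small shifts.
    public static double[] maxPool(double[] x, int width) {
        double[] out = new double[x.length / width];
        for (int i = 0; i < out.length; i++) {
            double m = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < width; j++)
                m = Math.max(m, x[i * width + j]);
            out[i] = m;
        }
        return out;
    }
}
```

A kernel like {1, -1} acts as a simple edge detector; stacking conv and pooling layers builds progressively more abstract, more translation-invariant features.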
CONVOLUTIONAL NET
Share the same parameters across different locations: convolutions with learned kernels
(CVPR 2012 Tutorial, pt. 3, by M.A. Ranzato)
http://deeplearning.net/tutorial/lenet.html
Example: recurrent neural networks
• Recurrent connections between hidden units: h_{t+1} = f(W x_t + V h_t)
• Gives the neural net a form of memory for capturing long-term dependencies
• More elaborate RNNs (e.g., LSTMs) learn when and what to remember or forget
• Predicts y from sequential x (natural language, video, time series)
• Among the most flexible and powerful learning algorithms available
• Also can be among the most challenging to train
http://deeplearning4j.org/recurrentnetwork
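The recurrence h_{t+1} = f(W x_t + V h_t) is easy to see in the scalar case. A minimal sketch (plain Java, not DL4J; a real RNN uses weight matrices and vector states):

```java
public class RnnStepDemo {
    // One recurrent step: the new hidden state mixes the current input
    // (through w) with the previous hidden state (through v).
    public static double step(double w, double v, double x, double h) {
        return Math.tanh(w * x + v * h);
    }

    // Unroll over a sequence starting from h = 0; the hidden state is the
    // network's "memory" of everything it has seen so far.
    public static double run(double w, double v, double[] xs) {
        double h = 0.0;
        for (double x : xs)
            h = step(w, v, x, h);
        return h;
    }
}
```

With v = 0 the memory is severed and the final state depends only on the last input; training adjusts w and v so that the state retains exactly the history the task needs.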
RNNs: flexible input-to-output modeling
• Diagnose patients from temporal data (Lipton & Kale, ICLR 2016)
• Predict the next word or character (language modeling)
• Generate a beer review from category and score (Strata NY talk)
• Translate from English to French (machine translation)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Let’s get crazy with architectures
• How about automatically captioning videos?
• Recall: we are just composing functions that transform inputs
• Compose ConvNets with RNNs
• You can do this with DL4J today!
(Venugopalan, et al., NAACL 2015)
Machine learning in the deep learning era
• Architecture design + hyperparameter tuning replace iterative feature engineering
• Easier to transfer "knowledge" across problems
  • direct: can adapt a generic image classifier into, e.g., a tumor classifier
  • indirect: analogies across problems point to architectures
• Often better able to leverage Big Data:
  • start with a high-capacity neural net
  • add regularization and tuning
• None of the following is true:
  • your Big Data problems will all be solved magically
  • the machines are going to take over
  • the Singularity is right around the corner
DL4J architecture for image captioning
[Diagram: video frames → Conv → MLP → LSTM → predicted caption ("Shawn marries Mr. Feeny. Some lady is there."), compared against the target caption ("Shawn marries Maya’s Mom. Mr. Feeny officiates.") by the Loss.]
The pieces map onto the ecosystem as follows:
• DL4J: MultiLayerNetwork
• DL4J: ConvolutionLayer
• DL4J: DenseLayer
• DL4J: GravesLSTM
• DL4J: RnnOutputLayer
• DataVec: RecordReader
• ND4J: LossFunction
• DL4J: OptimizationAlgorithm (backpropagation)
DL4J ecosystem for scalable DL
• Arbiter
  • platform-agnostic model evaluation
  • includes randomized grid search
• Spark API
  • wraps core DL4J classes; designing and configuring the model architecture is identical
  • currently provides data parallelism
  • scales to massive datasets; accelerated, distributed training
  • DataVec is compatible with Spark RDDs
• Core
  • efficient numpy-like numerical framework (ND4J)
  • ND4J backends for CUDA, ATLAS, MKL, OpenBLAS
  • multi-GPU support
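Arbiter's actual API is not shown in the slides; the randomized-search idea itself can be sketched generically. In this toy (all names hypothetical), evalLoss stands in for training a model and measuring validation loss:

```java
import java.util.Random;

public class RandomSearchDemo {
    // Toy "validation loss" over (learning rate, layer size): a stand-in
    // for actually training and evaluating a network at each config.
    public static double evalLoss(double lr, int size) {
        return Math.abs(Math.log10(lr) + 2) + Math.abs(size - 64) / 64.0;
    }

    // Randomized search: sample configurations, keep the best one seen.
    // Returns {bestLoss, bestLearningRate, bestLayerSize}.
    public static double[] search(long seed, int trials) {
        Random rng = new Random(seed);
        double bestLoss = Double.POSITIVE_INFINITY, bestLr = 0;
        int bestSize = 0;
        for (int i = 0; i < trials; i++) {
            double lr = Math.pow(10, -4 + 3 * rng.nextDouble()); // 1e-4 .. 1e-1
            int size = 16 << rng.nextInt(4);                      // 16..128
            double loss = evalLoss(lr, size);
            if (loss < bestLoss) { bestLoss = loss; bestLr = lr; bestSize = size; }
        }
        return new double[]{bestLoss, bestLr, bestSize};
    }
}
```

Sampling the learning rate on a log scale, as here, is the usual practice; each trial is independent, which is what makes this kind of search trivially parallel across a cluster.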
Scalable DL with the Spark API
• Uses the Downpour SGD model from (Dean et al., NIPS 2012)
• Data parallelism
  • training data is sharded across workers
  • workers each have a complete model and train in parallel on disjoint minibatches
• Parameter averaging
  • master stores the "canonical" model parameters
  • workers send parameter updates (gradients) to the master
  • workers periodically ask the master for updated parameters
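The parameter-averaging step can be sketched in miniature (scalar parameters and a toy per-worker quadratic loss; this is illustrative, not the actual Spark API):

```java
public class ParamAveragingDemo {
    // Each worker computes a gradient on its own minibatch. Here the
    // "minibatch" is summarized by a single target, with loss
    // 0.5 * (w - target)^2, so the gradient is simply (w - target).
    public static double workerGrad(double w, double target) {
        return w - target;
    }

    // Master step: average the workers' gradients, apply one SGD update
    // to the canonical parameters, which workers then pull back down.
    public static double masterUpdate(double w, double[] targets, double lr) {
        double g = 0;
        for (double t : targets)
            g += workerGrad(w, t);
        g /= targets.length;
        return w - lr * g;
    }
}
```

Averaging makes the update equivalent to one SGD step on the union of the workers' minibatches, which is why training stays (approximately) faithful to single-machine SGD while scaling out.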
Example: LeNet image classifier
LeNet on github
Example: train LeNet on multi-GPU server
multi-GPU example on github
Example: distributed training of LeNet on Spark
Spark LeNet example on github
DL4J versus…
• For comparisons of frameworks, see:
  • the DL4J comparison page
  • the Karpathy lecture
  • a zillion billion other blog posts and articles
DL4J versus… my two cents
• Using the Java Big Data ecosystem (Hadoop, Spark, etc.): DL4J
• Want robust data preprocessing tools/pipelines: DL4J
  • esp. natural language, images, video
• Custom layers, loss functions, etc.: Theano/TF + keras/lasagne
  • grad student trying to publish NIPS papers
  • trying to win a Kaggle competition with an OpenAI model from NIPS (keras)
  • prototype an idea before implementing gradients by hand in DL4J
• Use published CV models from the Caffe zoo: Caffe
• Python shop and don't mind being hostage to Google Cloud: TF
• Good news: this is a false choice, like most things (see the Scala API)
DL4J Scala API (coming soon)
• Scala API for DL4J that emulates the keras user experience
• Goal: reduce friction when moving between keras and DL4J
  • make it easy to mimic keras architectures
  • load keras-trained models using a common model format
DL4J Scala API Preview
[Code screenshots: DL4J Scala API side by side with Keras]
Thank you!
• DL4J: http://deeplearning4j.org/
• Skymind: https://skymind.io/
• Dave
  • email: [email protected]
  • Twitter: @davekale
  • website: http://www-scf.usc.edu/~dkale
  • MLHC Conference: http://mucmd.org
• Ruben
  • email: [email protected]
  • website: http://rubenfiszel.github.io/
Gibson & Patterson. Deep Learning: A Practitioner’s Approach. O’Reilly, Q2 2016.