H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14



DESCRIPTION

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. http://docs.0xdata.com/datascience/deeplearning.html

TRANSCRIPT

Page 1: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Deep Learning with H2O

H2O.ai
Scalable In-Memory Machine Learning

H2O Meetup, Typesafe, San Francisco, 4/3/14

Arno Candel

Page 2: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Who am I?

PhD in Computational Physics (2005) from ETH Zurich, Switzerland

6 years at SLAC: Accelerator Physics Modeling
2 years at Skytree, Inc.: Machine Learning
3 months at 0xdata/H2O: Machine Learning

10+ years in HPC, C++, MPI, Supercomputing

Arno Candel

Page 3: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Outline

Intro

Theory

Implementation

Results

MNIST handwritten digits classification

Live Demo

Prostate cancer classification and age regression

Text classification

Page 4: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O: Open Source In-Memory Prediction Engine for Big Data

Distributed in-memory math platform
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy-to-use SDK / API
➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)
➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions

Page 5: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

About H2O (aka 0xdata)

Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community!

Page 6: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O with or without Hadoop

[Diagram: three deployment modes: Standalone, Over YARN, On MRv1 (Hadoop MapReduce); H2O nodes read from HDFS; APIs: R, Java, Scala, JSON, Python]

Page 7: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O Architecture

[Diagram: Prediction Engine components: in-memory K-V store, memory manager, compression, MapReduce, Machine Learning algorithms (e.g. Deep Learning), R engine, nano-fast Scoring Engine]

Page 8: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

What is Deep Learning?

Wikipedia: "Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations."

Facebook DeepFace (LeCun): "Almost as good as humans at recognising faces"

Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)

FBI FACE: $1 billion face recognition project

Example: input data (facial image) ➔ prediction (person's ID)

Page 9: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Deep Learning is trending

[Figure: Google Trends for "deep learning", rising from 2011 through 2013]

Page 10: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Deep Learning History (slides by Yann LeCun, now at Facebook)

Deep Learning wins competitions AND

makes humans, businesses and machines (cyborgs!?) smarter

Page 11: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

What is NOT Deep

Linear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (no feature hierarchy)

SVMs and kernel methods are not deep (2 layers: kernel + linear)

Classification trees are not deep (operate on the original input space)

Page 12: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)

+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)

+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)

+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)

= Top-notch prediction engine!

Page 13: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Example Neural Network

A "fully connected" directed graph of neurons: information flows from the input layer (age, income, employment) through two hidden layers to the output layer (married / not married).

#neurons per layer: 3, 4, 3, 2
#connections between layers: 3x4, 4x3, 3x2

Page 14: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Prediction: Forward Propagation

"Neurons activate each other via weighted sums." From the inputs x_i (age, income, employment) to the per-class probabilities p_l (married / not married), with sum_l(p_l) = 1:

hidden layer 1: y_j = tanh(sum_i(x_i*u_ij) + b_j)
hidden layer 2: z_k = tanh(sum_j(y_j*v_jk) + c_k)
output layer: p_l = softmax(sum_k(z_k*w_kl) + d_l)

softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

b_j, c_k, d_l: bias values (independent of the inputs)

Activation function: tanh. Alternative: x -> max(0,x), the "rectifier".

p_l is a non-linear function of x_i: with enough layers, it can approximate ANY function!
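To make the weighted sums concrete, here is a minimal Scala sketch of one tanh hidden layer and the softmax output layer (an illustration under assumed array shapes, not H2O's actual code; weights are indexed w(input)(output)):

```scala
object ForwardProp {
  // y_j = tanh(sum_i(x_i*u_ij) + b_j): one fully connected tanh layer
  def tanhLayer(x: Array[Double], u: Array[Array[Double]], b: Array[Double]): Array[Double] =
    b.indices.map(j => math.tanh(x.indices.map(i => x(i) * u(i)(j)).sum + b(j))).toArray

  // softmax(x_k) = exp(x_k) / sum_k(exp(x_k)); max is subtracted for numerical stability
  def softmax(x: Array[Double]): Array[Double] = {
    val m = x.max
    val e = x.map(v => math.exp(v - m))
    val s = e.sum
    e.map(_ / s)
  }

  // p_l = softmax(sum_k(z_k*w_kl) + d_l): per-class probabilities that sum to 1
  def outputLayer(z: Array[Double], w: Array[Array[Double]], d: Array[Double]): Array[Double] =
    softmax(d.indices.map(l => z.indices.map(k => z(k) * w(k)(l)).sum + d(l)).toArray)
}
```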

Page 15: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Data Preparation & Initialization

Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Standardize the inputs x_i: mean = 0, stddev = 1

Encode categorical variables as binary indicator columns ("horizontalize"), e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed

Poor man's initialization: random weights

Better: uniform distribution in +/- sqrt(6 / (#units + #units_previous_layer))
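Both preparation steps can be sketched in a few lines of Scala (illustration only; H2O performs standardization and initialization internally):

```scala
import scala.util.Random

object Prep {
  // Standardize a column: mean = 0, stddev = 1
  def standardize(col: Array[Double]): Array[Double] = {
    val mean = col.sum / col.length
    val std  = math.sqrt(col.map(v => (v - mean) * (v - mean)).sum / col.length)
    col.map(v => if (std == 0) 0.0 else (v - mean) / std)
  }

  // Uniform initialization in +/- sqrt(6 / (#units + #units_previous_layer))
  def initWeights(fanIn: Int, fanOut: Int, rng: Random): Array[Array[Double]] = {
    val limit = math.sqrt(6.0 / (fanIn + fanOut))
    Array.fill(fanIn, fanOut)(rng.nextDouble() * 2 * limit - limit)
  }
}
```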

Page 16: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Stochastic Gradient Descent

For each training row, we make a prediction and compare it with the actual label (supervised training), e.g. predicted = (0.8, 0.2) vs. actual = (1, 0) for the classes (married, not married):

Mean Square Error = (0.2^2 + 0.2^2)/2 "penalize differences per-class"
Cross-entropy = -log(0.8) "strongly penalize non-1-ness"

Objective: minimize the prediction error (MSE or cross-entropy)

SGD: improve the weights and biases for EACH training row:

w <- w - rate * ∂E/∂w
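As a sketch, both error measures and the per-weight update look like this in Scala (illustration only; `actual` is the one-hot label vector from the example above):

```scala
object Loss {
  // Mean Square Error: ((1-0.8)^2 + (0-0.2)^2) / 2 = 0.04 for the example above
  def mse(predicted: Array[Double], actual: Array[Double]): Double =
    predicted.indices.map(i => math.pow(predicted(i) - actual(i), 2)).sum / predicted.length

  // Cross-entropy: -log(predicted probability of the true class), here -log(0.8)
  def crossEntropy(predicted: Array[Double], actual: Array[Double]): Double =
    -predicted.indices.map(i => actual(i) * math.log(predicted(i))).sum

  // One SGD step on a single weight: w <- w - rate * dE/dw
  def sgdStep(w: Double, dEdw: Double, rate: Double): Double =
    w - rate * dEdw
}
```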

Page 17: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Backward Propagation

How to compute ∂E/∂wi for the update wi <- wi - rate * ∂E/∂wi ?

Naive: for every i, evaluate E twice at (w1,…,wi±∆,…,wN)… Slow!

Backprop: compute ∂E/∂wi via the chain rule, going backwards through the network:

net = sum_i(wi*xi) + b
y = activation(net)
E = error(y)

∂E/∂wi = ∂E/∂y * ∂y/∂net * ∂net/∂wi
       = ∂(error(y))/∂y * ∂(activation(net))/∂net * xi
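For a single tanh neuron with squared error, the chain rule above reduces to a few lines (a sketch, not H2O's implementation; it uses d(tanh)/d(net) = 1 - tanh(net)^2):

```scala
object Backprop {
  // Returns ∂E/∂wi for every input weight, with E = (y - target)^2 / 2
  def gradients(x: Array[Double], w: Array[Double], b: Double, target: Double): Array[Double] = {
    val net    = x.indices.map(i => w(i) * x(i)).sum + b   // net = sum_i(wi*xi) + b
    val y      = math.tanh(net)                            // y = activation(net)
    val dEdy   = y - target                                // ∂E/∂y
    val dydnet = 1 - y * y                                 // ∂y/∂net for tanh
    x.map(xi => dEdy * dydnet * xi)                        // ∂E/∂wi = ∂E/∂y * ∂y/∂net * xi
  }
}
```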

Page 18: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O Deep Learning Architecture

Nodes/JVMs communicate synchronously through the H2O atomic in-memory K-V store (HTTPD); threads run asynchronously.

initial weights and biases w

map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads

reduce: average the weights and biases from all nodes (model averaging), e.g. w* = (w1+w2+w3+w4)/4 for 4 nodes

updated weights and biases w*

Keep iterating over the data ("epochs"), score from time to time. Query & display the model via JSON, WWW.

*the user can specify the number of total rows per MapReduce iteration
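The map/reduce step can be sketched as follows (illustration only: a linear model with plain SGD stands in for the real per-node training, and H2O's actual implementation updates asynchronously via Fork/Join threads):

```scala
object ModelAveraging {
  type Weights = Array[Double]

  // map: each node refines its own copy of w with one pass over its local rows
  def map(w: Weights, local: Seq[(Array[Double], Double)], rate: Double): Weights = {
    val wc = w.clone()
    for ((x, target) <- local) {
      val y = x.indices.map(i => wc(i) * x(i)).sum          // linear prediction
      for (i <- wc.indices) wc(i) -= rate * (y - target) * x(i)
    }
    wc
  }

  // reduce: w* = (w1 + w2 + ... + wn) / n  (model averaging)
  def reduce(models: Seq[Weights]): Weights = {
    val n = models.length
    models.map(_.toSeq).transpose.map(_.sum / n).toArray
  }
}
```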

Page 19: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

"Secret" Sauce to Higher Accuracy

Momentum training: keep changing the weights and biases even if there's no error. "Find other local minima, and go faster along valleys."

Learning rate annealing: rate r = r0 / (1 + β*N), N = training samples. "Dig deeper into the local minimum."

Adaptive learning rate, ADADELTA (Google): automatically sets the learning rate for each neuron based on its training history; combines annealing and momentum features.

Grid search and checkpointing: run a grid search over multiple hyper-parameters, then continue training the best model.

L1/L2/Dropout/max_w2 regularization:
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs. "Train exponentially many models at once."
max_w2: scale down all incoming weights of a neuron if their squared sum > max_w2

"Regularization avoids overtraining and improves generalization error."
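Two of these ingredients are simple enough to sketch directly (illustration only; parameter names are hypothetical, and H2O's adaptive-rate variant is more involved):

```scala
object Sauce {
  // Learning rate annealing: r = r0 / (1 + beta * n), n = training samples seen
  def annealedRate(r0: Double, beta: Double, n: Long): Double =
    r0 / (1 + beta * n)

  // Momentum: the velocity v accumulates past gradients and the weights follow v,
  // so training keeps moving even where the local gradient vanishes
  def momentumStep(w: Array[Double], v: Array[Double], grad: Array[Double],
                   rate: Double, momentum: Double): (Array[Double], Array[Double]) = {
    val v2 = v.indices.map(i => momentum * v(i) - rate * grad(i)).toArray
    val w2 = w.indices.map(i => w(i) + v2(i)).toArray
    (w2, v2)
  }
}
```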

Page 20: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

MNIST: Digits Classification

MNIST: digitized handwritten digits database (Yann LeCun)
Data: 28x28 = 784 pixels with values in 0…255 (gray-scale)
One of the most popular multi-class classification problems

Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes

Without distortions or convolutions (which help), the best-ever published error rate on the test set: 0.83% (Microsoft)

Page 21: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O Deep Learning on MNIST: 0.95% test set error (so far)

Test set error on 1 node: 1.5% after 40 epochs, 1.02% after 400 epochs, 0.95% after 4000 epochs

Most frequent mistakes: confusing 4 with 6 and 9, and 7 with 2

Page 22: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14


Prostate Cancer Dataset


Page 23: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Live Demo: Cancer Prediction

Interactive ROC curve with real-time updates

Page 24: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Live Demo: Cancer Prediction

0% training error with only 322 model parameters, in seconds!

Page 25: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O Deep Learning with Scala

Predict CAPSULE: Variable 1

Page 26: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

H2O Deep Learning with Scala

Page 27: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Live Demo: Grid Search Regression

Doing a grid search to find good hyper-parameters to predict AGE from the other 7 features

Then continue training the best model: 5 hidden layers of 50 tanh neurons each, rho = 0.99, epsilon = 1e-10, normal initial weight distribution with scale = 1

Regression: 1 linear output neuron

MSE = 0.5 for test set ages in 44…79

Page 28: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Live Demo: eBay Text Classification

Users enter a description when selling an item. Task: predict the type of item from the words used.

Data prep: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
H2O parses the SVMLight sparse format: label 3:1 9:1 13:1 …

"Small" sample dataset on jewelry and watches:
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

H2O compressed columnar in-memory store: only needs 60MB to store 5 billion entries (never inflated)
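The data prep step might look like this (a sketch; the vocabulary map and label are hypothetical, but the output matches the SVMLight format H2O parses):

```scala
object SvmLightPrep {
  // Turn a bag of words into an SVMLight line: "label 3:1 9:1 13:1 ..."
  def toSvmLight(label: Int, words: Seq[String], vocab: Map[String, Int]): String = {
    val indices = words.flatMap(vocab.get).distinct.sorted   // sparse feature ids
    (label.toString +: indices.map(i => s"$i:1")).mkString(" ")
  }
}

// e.g. toSvmLight(7, Seq("gold", "watch"), Map("gold" -> 3, "watch" -> 9))
// yields "7 3:1 9:1"
```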

Page 29: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Live Demo: eBay Text Classification

Default parameters, no tuning (results for illustration only!)

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Page 30: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Tips for H2O Deep Learning

General:
More layers for more complex functions (exponentially more non-linearity).
More neurons per layer to detect finer structure in the data ("memorizing").
Add some regularization for less overfitting (smaller validation error).
Do a grid search to get a feel for convergence, then continue training the best model (a minimal sketch follows below).
Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed:
More training samples per iteration: faster, but less accuracy?
With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
Try balance_classes = true for imbalanced classes.
Use force_load_balance and replicate_training_data for small datasets.
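A minimal grid-search skeleton over the ADADELTA settings above (illustration only; `trainAndScore` is a hypothetical stand-in for training a model and returning its validation error):

```scala
object GridSearch {
  def search(trainAndScore: (Double, Double) => Double): (Double, Double) = {
    val epsilons = Seq(1e-4, 1e-6, 1e-8, 1e-10)
    val rhos     = Seq(0.9, 0.95, 0.99)
    val results  = for (eps <- epsilons; rho <- rhos)
                   yield ((eps, rho), trainAndScore(eps, rho))
    results.minBy(_._2)._1   // keep the hyper-parameters with the lowest error
  }
}
```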

Page 31: H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14

Summary

H2O is a distributed in-memory math platform that allows fast prototyping in Java, R, Scala and Python.

H2O enables the development of enterprise-quality, blazing-fast machine learning applications.

H2O Deep Learning is distributed, easy to use, and early results compete with the world's best.

Deep Learning makes better predictions!

Try it yourself and join our next meetup!
git clone https://github.com/0xdata/h2o