H2O Distributed Deep Learning by Arno Candel 071614


Post on 27-Aug-2014






Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics. This talk introduces Deep Learning and implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World-record performance on the classic MNIST dataset, best-in-class accuracy for eBay text classification and other results showcase the power of this game-changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.

About the Speaker: Arno Candel. Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree, where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory, where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes. He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of LHC. Arno has authored dozens of scientific papers and was a sought-after academic conference speaker. He holds a PhD and Master's summa cum laude in Physics from ETH Zurich.


<ul><li> Deep Learning with H2O. 0xdata, H2O.ai: Scalable In-Memory Machine Learning. Hadoop User Group, Chicago, 7/16/14. Arno Candel </li>
<li> Who am I? PhD in Computational Physics, 2005, from ETH Zurich, Switzerland. 6 years at SLAC (Accelerator Physics Modeling); 2 years at Skytree, Inc. (Machine Learning); 7 months at 0xdata/H2O (Machine Learning). 15 years in HPC, C++, MPI, Supercomputing. @ArnoCandel </li>
<li> H2O Deep Learning, @ArnoCandel. Outline: Intro &amp; Live Demo (5 mins); Methods &amp; Implementation (20 mins); Results &amp; Live Demos (25 mins): MNIST handwritten digits, text classification, weather prediction; Q &amp; A (10 mins) </li>
<li> H2O: Open Source in-memory Prediction Engine for Big Data. Distributed in-memory math platform: GLM, GBM, RF, K-Means, PCA, Deep Learning. Easy-to-use SDK / API: Java, R, Scala, Python, JSON, browser-based GUI. Businesses can use ALL of their data (with or without Hadoop): modeling without sampling. Big Data + Better Algorithms -&gt; Better Predictions </li>
<li> About H2O (aka 0xdata): Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community! +1 Cyprien Noel for prior work </li>
<li> Customer Demands for Practical Machine Learning. Requirement -&gt; Value: In-Memory -&gt; Fast (Interactive); Distributed -&gt; Big Data (No Sampling); Open Source -&gt; Ownership of Methods; API / SDK -&gt; Extensibility. H2O was developed by 0xdata to meet these requirements </li>
<li> H2O Integration [diagram: H2O on HDFS in three deployment modes (Standalone, Over YARN, On MRv1 / Hadoop MR); interfaces: Java, R, Scala, JSON, Python] </li>
<li> H2O Architecture [diagram labels]: Distributed In-Memory K-V store; Col. compression; Machine Learning Algorithms; R Engine; Nano-fast Scoring Engine; Prediction Engine; Memory manager; e.g.
Deep Learning; MapReduce </li>
<li> H2O - The Killer App on Spark. http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html </li>
<li> John Chambers (creator of the S language, R-core member) names the H2O R API among the top three promising R projects. H2O R CRAN package </li>
<li> H2O + R = Happy Data Scientist. Machine Learning on Big Data with R: Data resides on the H2O cluster! </li>
<li> H2O Deep Learning in Action. MNIST = digitized handwritten digits database (Yann LeCun). Data: 28x28 = 784 pixels with (gray-scale) values in 0-255. Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes. Live Demo: build an H2O Deep Learning model on MNIST train/test data. Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet." </li>
<li> What is Deep Learning? Wikipedia: "Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations." Example: input data (image) -&gt; prediction (who is it?). Facebook's DeepFace (Yann LeCun) recognises faces as well as humans </li>
<li> Deep Learning is Trending [Google Trends chart, 2011 to 2013]. Businesses are using Deep Learning techniques! Google Brain (Andrew Ng, Jeff Dean &amp; Geoffrey Hinton). FBI FACE: $1 billion face recognition project. Chinese Search Giant Baidu Hires Man Behind the Google Brain (Andrew Ng) </li>
<li> Deep Learning History: slides by Yann LeCun (now Facebook). Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?)
smarter </li>
<li> What is NOT Deep: Linear models are not deep (by definition). Neural nets with 1 hidden layer are not deep (no feature hierarchy). SVMs and kernel methods are not deep (2 layers: kernel + linear). Classification trees are not deep (operate on original input space) </li>
<li> Deep Learning in H2O: 1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization) = Top-notch prediction engine! </li>
<li> Example Neural Network: fully connected directed graph of neurons; information flows from the input layer to the output layer. #neurons: Input layer (age, income, employment): 3; Hidden layer 1: 4; Hidden layer 2: 3; Output layer (married, single): 2. #connections: 3x4, 4x3, 3x2 </li>
<li> Prediction: Forward Propagation. Neurons activate each other via weighted sums. Inputs x_i (age, income, employment). Hidden layer 1: y_j = tanh(sum_i(x_i * u_ij) + b_j). Hidden layer 2: z_k = tanh(sum_j(y_j * v_jk) + c_k). Output (married, single): p_l = softmax(sum_k(z_k * w_kl) + d_l), where softmax(x_k) = exp(x_k) / sum_k(exp(x_k)). p_l are per-class probabilities: sum(p_l) = 1. b_j, c_k, d_l: bias values (indep. of inputs). Activation function: tanh; alternative: x -&gt; max(0,x) "rectifier". p_l is a non-linear function of x_i: can approximate ANY function with enough layers! </li>
<li> Automatic standardization of data x_i (age, income, employment): mean = 0, stddev = 1. Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -&gt; {0,1,0} = part-time, {0,0,0} = self-employed. Automatic initialization of weights:
Poor man's initialization: random weights w_kl. Default (better): uniform distribution in +/- sqrt(6/(#units + #units_previous_layer)). Data preparation &amp; Initialization: Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated) </li>
<li> Training: Update Weights &amp; Biases. For each training row, we make a prediction and compare with the actual label (supervised learning). Example: married: predicted 0.8, actual 1; single: predicted 0.2, actual 0. Objective: minimize prediction error (MSE or cross-entropy). Mean Square Error = (0.2^2 + 0.2^2)/2: penalize differences per class. Cross-entropy = -log(0.8): strongly penalize non-1-ness. Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation): w &lt;- w - rate * ∂E/∂w </li>
<li> Backward Propagation. How to compute ∂E/∂w_i for w_i &lt;- w_i - rate * ∂E/∂w_i? Naive: for every i, evaluate E twice at (w_1, ..., w_i ± Δ, ..., w_N). Backprop: compute ∂E/∂w_i via chain rule going backwards. With net = sum_i(w_i * x_i) + b, y = activation(net) and E = error(y): ∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i. The naive approach is slow!
</li>
<li> H2O Deep Learning Architecture. Each node/JVM runs an HTTPD and a share of the H2O atomic in-memory K-V store. Nodes/JVMs: sync; threads: async; communication. Initial model: weights and biases w. Map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads, giving w1, w2, w3, w4. Reduce: model averaging: average weights and biases from all nodes (w1+w3, w2+w4), w* = (w1+w2+w3+w4)/4; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3). Updated model: w*. Keep iterating over the data (epochs), score from time to time. Query &amp; display the model via JSON, WWW. *User can specify the number of total rows per MapReduce iteration </li>
<li> Secret Sauce to Higher Accuracy. Adaptive learning rate - ADADELTA (Google): automatically set learning rate for each neuron based on its training history. Regularization: L1 penalizes non-zero weights; L2 penalizes large weights; Dropout: randomly ignore certain inputs. Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s) </li>
<li> Detail: Adaptive Learning Rate. Compute moving average of Δw_i^2 at time t for window length rho: E[Δw_i^2]_t = rho * E[Δw_i^2]_(t-1) + (1-rho) * Δw_i^2. Compute RMS of Δw_i at time t with smoothing epsilon: RMS[Δw_i]_t = sqrt(E[Δw_i^2]_t + epsilon). Do the same for ∂E/∂w_i, then obtain per-weight learning rate: rate(w_i, t) = RMS[Δw_i]_(t-1) / RMS[∂E/∂w_i]_t. Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents "freezing" (unlike ADAGRAD: no window). Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time. cf.
ADADELTA paper </li>
<li> Detail: Dropout Regularization. Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations. [Diagram: example network with dropped activations marked X.] Testing: Use all activations, but reduce them by a factor p (to simulate the missing activations during training). cf. Geoff Hinton's paper </li>
<li> MNIST: digits classification. Standing world record: without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft). Time to check in on the demo! Let's see how H2O did in the past 20 minutes! </li>
<li> H2O Deep Learning on MNIST: 0.87% test set error (so far). Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours. Frequent errors: confuse 2/7 and 4/9. World-class results! No pre-training, no distortions, no convolutions, no unsupervised training. Running on 4 nodes with 16 cores each </li>
<li> Weather Dataset. Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc. </li>
<li> Live Demo: Weather Prediction. Interactive ROC curve with real-time updates. 3 hidden Rectifier layers, Dropout, L1-penalty. 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models </li>
<li> Live Demo: Grid Search. How did I find those parameters? Grid Search!
(works for multiple hyper-parameters at once). Then continue training the best model </li>
<li> Use Case: Text Classification. Goal: predict the item from the seller's text description. Example: "Vintage 18KT gold Rolex 2 Tone in great condition". Data: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,...,0 (vintage, gold, condition). Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes. Let's see how H2O does on the eBay dataset! </li>
<li> Use Case: Text Classification. Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time! Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes. Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB). Note 2: No tuning was done (results are for illustration only) </li>
<li> Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters). [Charts: speedup vs. number of H2O nodes, and training time in minutes vs. number of H2O nodes, for 1, 2, 4, 8, 16, 32 and 63 nodes (4 cores per node, 1 epoch per node per MapReduce); annotation: 2.7 mins] </li>
<li> Tips for H2O Deep Learning. General: More layers for more complex functions (exp. more non-linearity). More neurons per layer to detect finer structure in data ("memorizing"). Add some regularization for less overfitting (smaller validation error). Do a grid search to get a feel for convergence, then continue training. Try Tanh first, then Rectifier, try max_w2 = 50 and/or L1 = 1e-5. Try Dropout (input: 20%, hidden: 50%) with test/validation set after finding good parameters for convergence on training set. Distributed: More training samples per iteration: faster, but less accuracy?
With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10; rho = 0.9, 0.95, 0.99. Without ADADELTA: Try rate = 1e-4...1e-2, rate_annealing = 1e-5...1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing. Try balance_classes = true for imbalanced classes. Use force_load_balance and replicate_training_data for small datasets. </li>
<li> H2O brings Deep Learning to R. All parameters are available from R. Draft... and more docs coming soon! </li>
<li> POJO Model Export for Production Scoring. Plain old Java code is auto-generated to take your H2O Deep Learning models into production! </li>
<li> Deep Learning Auto-Encoders for Anomaly Detection. Toy example: Find anomaly in ECG heart beat data. First, train a model on what's normal: 20 time-series samples of 210 data points each. Deep Auto-Encoder: Learn low-dimensional non-linear structure of the data that allows to reconstruct the orig. data A...</li></ul>
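The forward-propagation and weight-initialization slides can be illustrated with a small sketch in plain Python. It is not H2O's implementation: the function names, the zero-initialized biases, the random seed and the input values are assumptions made for illustration. Only the 3-4-3-2 layer sizes, the tanh/softmax formulas and the uniform +/- sqrt(6/(#units + #units_previous_layer)) range come from the slides.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Uniform weights in +/- sqrt(6 / (n_in + n_out)); zero biases are an assumption."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def weighted_sums(x, weights, biases):
    """net_j = sum_i(x_i * w_ij) + b_j for every neuron j of the next layer."""
    return [
        sum(x_i * weights[i][j] for i, x_i in enumerate(x)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(values):
    """exp(x_k) / sum_k(exp(x_k)), shifted by the max for numerical stability."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Apply tanh on every hidden layer and softmax on the output layer."""
    activations = x
    for weights, biases in layers[:-1]:
        activations = [math.tanh(n) for n in weighted_sums(activations, weights, biases)]
    out_weights, out_biases = layers[-1]
    return softmax(weighted_sums(activations, out_weights, out_biases))


rng = random.Random(42)  # arbitrary seed, for reproducibility only
sizes = [3, 4, 3, 2]     # input, hidden 1, hidden 2, output (example network slide)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Made-up, already standardized values for age, income, employment
probs = forward([0.5, -1.2, 0.3], layers)
print(probs)  # two per-class probabilities (married, single) that sum to 1
```

Because the weights are random and untrained, the two probabilities carry no meaning yet; the sketch only shows how the inputs flow through the layers.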
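The per-weight learning rate from the "Detail: Adaptive Learning Rate" slide can be sketched for a single weight as follows. This is a minimal illustration based on the ADADELTA update as published, not H2O's code. The toy error function E(w) = (w - 3)^2, the function names and the state dictionary are assumptions, and rho = 0.95 and epsilon = 1e-6 are simply picked from the ranges suggested in the tips slide.

```python
import math


def adadelta_step(w, grad, state, rho=0.95, epsilon=1e-6):
    """One ADADELTA update for a single weight; `state` holds the two moving averages."""
    # Moving average of squared gradients: E[g^2]_t
    state["avg_sq_grad"] = rho * state["avg_sq_grad"] + (1.0 - rho) * grad * grad
    # Per-weight rate: RMS[delta_w]_(t-1) / RMS[g]_t
    rate = math.sqrt(state["avg_sq_delta"] + epsilon) / math.sqrt(state["avg_sq_grad"] + epsilon)
    delta = -rate * grad
    # Moving average of squared updates: E[delta_w^2]_t
    state["avg_sq_delta"] = rho * state["avg_sq_delta"] + (1.0 - rho) * delta * delta
    return w + delta


def grad_toy(w):
    """Gradient of the toy error E(w) = (w - 3)^2 (assumed for illustration)."""
    return 2.0 * (w - 3.0)


state = {"avg_sq_grad": 0.0, "avg_sq_delta": 0.0}
w = 0.0
history = [w]
for _ in range(50):
    w = adadelta_step(w, grad_toy(w), state)
    history.append(w)

print(history[1], history[-1])  # first update and position after 50 updates
```

No global learning rate appears anywhere: the step size comes from the ratio of the two running RMS values, which is why the slides recommend tuning only rho and epsilon when ADADELTA is enabled.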
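The reduce step of the architecture slide, w* = (w1+w2+w3+w4)/4, amounts to an element-wise average of the per-node models. The sketch below assumes each node's weights and biases are flattened into one list of floats; the function name and the numbers are hypothetical and say nothing about how H2O stores or exchanges models.

```python
def average_models(node_weights):
    """Element-wise mean of the weight vectors trained independently on each node."""
    n_nodes = len(node_weights)
    return [sum(column) / n_nodes for column in zip(*node_weights)]


# Hypothetical weights after one map step on four nodes (made-up numbers)
w1 = [1.0, 2.0]
w2 = [3.0, 4.0]
w3 = [5.0, 6.0]
w4 = [7.0, 8.0]

w_star = average_models([w1, w2, w3, w4])
print(w_star)  # averaged model used as the starting point of the next iteration
```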