Distributed TensorFlow: Scaling Google's Deep Learning Library on Spark
TRANSCRIPT
@pentagoniac @arimoinc #SparkSummit #DDF arimo.com
Christopher Nguyen, PhD. | Chris Smith | Ushnish De | Vu Pham | Nanda Kishore
Talk Timeline
★ Nov 2015: Distributed DL on Spark + Tachyon (Strata NYC)
★ Feb 2016: Distributed TensorFlow on Spark (Spark Summit East)
★ June 2016: TensorFlow Ensembles on Spark (Hadoop Summit)
Outline
1. Motivation
2. Architecture
3. Experimental Parameters
4. Results
5. Lessons Learned
6. Summary
Motivation
1. Google TensorFlow
2. Spark Deployments
TensorFlow Review
Computational Graph
Graph of a Neural Network
Why TensorFlow?
Google uses TensorFlow for the following:
★ Search ★ Advertising ★ Speech Recognition ★ Photos
★ Maps ★ StreetView (blurring out house numbers) ★ Translate ★ YouTube
DistBelief
Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
1. Compute Gradients
2. Update Params (Descent)
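The two steps above can be sketched as a minimal parameter-server loop. This is a toy NumPy illustration of the DistBelief-style split (worker computes gradients, server applies the descent update), not the actual DistBelief or TensorSpark code; all names and the least-squares toy model are assumptions.

```python
import numpy as np

class ParamServer:
    """Holds the shared model parameters and applies gradient updates."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def get_params(self):
        return self.params.copy()

    def push_gradient(self, grad):
        # Step 2: Update Params (Descent)
        self.params -= self.lr * grad

def worker_step(server, x_batch, y_batch):
    # Step 1: Compute Gradients on a mini-batch
    # (least-squares gradient for a toy linear model)
    w = server.get_params()
    grad = x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)
    server.push_gradient(grad)

# Toy run: fit y = 2*x with repeated worker steps
rng = np.random.default_rng(0)
server = ParamServer(dim=1)
for _ in range(200):
    x = rng.normal(size=(16, 1))
    worker_step(server, x, 2.0 * x[:, 0])
```

In DistBelief, many such workers run concurrently and push gradients asynchronously; here a single loop stands in for them.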
Arimo’s 3 Pillars
App Development: BIG APPS
★ Predictive Analytics
★ Natural Interfaces
★ Collaboration
Big Data + Big Compute
Architecture
TensorSpark Architecture
Spark & Tachyon Architectural Options
★ Spark-Only: model as a broadcast variable
★ Tachyon-Storage: model stored as a Tachyon file
★ Param-Server: model hosted in an HTTP server
★ Tachyon-CoProc: model stored and updated by Tachyon
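The Param-Server option (model hosted in an HTTP server) can be sketched in a few lines of stdlib Python. This is a hypothetical minimal illustration, not TensorSpark's actual protocol: the JSON wire format, the in-process server, and the update rule are all assumptions.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PARAMS = {"w": [0.0, 0.0], "lr": 0.5}
LOCK = threading.Lock()

class ParamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Workers fetch the current parameters.
        with LOCK:
            body = json.dumps(PARAMS["w"]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Workers push a gradient; the server applies the descent step.
        length = int(self.headers["Content-Length"])
        grad = json.loads(self.rfile.read(length))
        with LOCK:
            PARAMS["w"] = [w - PARAMS["lr"] * g
                           for w, g in zip(PARAMS["w"], grad)]
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ParamHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One "worker": pull params, push a gradient, pull again.
url = f"http://127.0.0.1:{server.server_port}"
before = json.loads(urllib.request.urlopen(url).read())
req = urllib.request.Request(url, data=json.dumps([1.0, -2.0]).encode(),
                             method="POST")
urllib.request.urlopen(req)
after = json.loads(urllib.request.urlopen(url).read())
server.shutdown()
```

In the real system the workers are Spark tasks on other machines, so serialization size and request latency matter far more than in this single-process sketch.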
Experimental Parameters
Datasets
Dataset   | Size                | Parameters | Hidden Layers | Hidden Units
MNIST     | 50K x 784 = 39.2M   | 1.8M       | 2             | 1024
Higgs     | 10M x 28 = 280M     | 8.5M       | 3             | 2048
Molecular | 150K x 2871 = 430M  | 14.2M      | 3             | 2048
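The parameter counts in the table can be sanity-checked with a little arithmetic: a fully connected network with layer sizes [in, h1, ..., out] has sum(a*b + b) weights and biases. The output-layer sizes below are assumptions (10 classes for MNIST, 2 for the binary Higgs and Molecular tasks); they are not stated on the slide.

```python
def param_count(layer_sizes):
    # Weights (a*b) plus biases (b) for each consecutive layer pair.
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

mnist     = param_count([784, 1024, 1024, 10])        # ~1.8M
higgs     = param_count([28, 2048, 2048, 2048, 2])    # ~8.5M
molecular = param_count([2871, 2048, 2048, 2048, 2])  # ~14.2M
```

All three land within rounding distance of the slide's figures, which suggests the hidden-layer configuration in the table is the whole story.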
Experimental Dimensions
1. Dataset Size
2. Model Size
3. Compute Size
4. CPU vs. GPU
Results
[Benchmark result charts for the MNIST, Higgs, and Molecular datasets]
Lessons Learned
GPU vs CPU
★ On a single machine, GPU is 8–10x faster than CPU
★ In distributed mode, GPU is 2–3x faster, with larger gains on larger datasets
★ GPU memory is limited; hitting the limit hurts performance
★ AWS offers older GPUs, so TensorFlow must be built from source
Distributed Gradients
★ An initial warmup phase on the parameter server is needed
★ A “wobble” effect in convergence may still occur due to mini-batches; mitigations:
✦ Drop “obsolete” gradients
✦ Reduce the learning rate and initial parameters
✦ Reduce the mini-batch size
★ Work to maximize network throughput, e.g., via model compression
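Dropping "obsolete" gradients can be implemented by versioning the parameters: each gradient is tagged with the parameter version it was computed against, and the server discards updates that are too stale. A hypothetical sketch (names and staleness policy are assumptions, not the talk's actual code):

```python
import numpy as np

class StalenessAwareServer:
    def __init__(self, dim, lr=0.1, max_staleness=2):
        self.params = np.zeros(dim)
        self.version = 0
        self.lr = lr
        self.max_staleness = max_staleness

    def push_gradient(self, grad, computed_at_version):
        # How many updates happened since this gradient was computed?
        staleness = self.version - computed_at_version
        if staleness > self.max_staleness:
            return False                      # obsolete: drop it
        self.params -= self.lr * np.asarray(grad)
        self.version += 1
        return True

server = StalenessAwareServer(dim=2)
applied = server.push_gradient([1.0, 1.0], computed_at_version=0)  # fresh
server.push_gradient([1.0, 1.0], computed_at_version=1)
server.push_gradient([1.0, 1.0], computed_at_version=2)
dropped = server.push_gradient([1.0, 1.0], computed_at_version=0)  # too stale
```

Tightening `max_staleness` trades wasted worker compute for less "wobble" in convergence.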
Deployment Configuration
★ Run only one TensorFlow instance per machine
★ Compress gradients/parameters (e.g., send them as binaries) to reduce network overhead
★ Mini-batch size is a tradeoff between convergence and communication overhead
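The "send as binaries" point is easy to demonstrate: raw float32 bytes (optionally zlib-compressed) are several times smaller on the wire than a text/JSON encoding of the same gradient. A small sketch under the assumption of a NumPy gradient array:

```python
import json
import zlib
import numpy as np

# A stand-in gradient: 100K float32 values.
grad = np.random.default_rng(0).normal(size=100_000).astype(np.float32)

as_json   = json.dumps(grad.tolist()).encode()  # text encoding
as_binary = grad.tobytes()                      # 4 bytes per float32
as_zlib   = zlib.compress(as_binary)            # optional extra squeeze

# The receiver reconstructs the array losslessly from the raw bytes.
restored = np.frombuffer(as_binary, dtype=np.float32)
```

Compression of near-random gradient values buys little beyond the binary encoding itself, which is why the binary representation is the main win.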
Conclusions
★ This implementation is best suited to large datasets with small models
✦ For large models, we’re looking into a) model parallelism and b) model compression
★ If the data fits on a single machine, a single-machine GPU is best (if you have one)
★ For large datasets, leverage both speed and scale using a Spark cluster with GPUs
★ Future Work: a) Tachyon b) Ensembles (Hadoop Summit, June 2016)
One of the World’s Top 10 Most Innovative Companies in Data Science
We’re OPEN SOURCING our work at https://github.com/adatao/tensorspark
VISIT US AT BOOTH K8