best practices for deep learning on apache...
TRANSCRIPT
![Page 1: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/1.jpg)
Best Practices for Deep Learning on Apache Spark Tim Hunter (speaker)
Joseph K. Bradley May 10th, 2017
GPU Technology Conference
![Page 2: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/2.jpg)
About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Author of TensorFrames and GraphFrames
![Page 3: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/3.jpg)
Founded by the creators of Apache Spark in 2013
to make big data simple
Provides hosted Spark platform in the cloud
![Page 4: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/4.jpg)
Deep Learning and Apache Spark
Deep Learning frameworks w/ Spark
bindings
• Caffe (CaffeOnSpark)
• Keras (Elephas)
• mxnet
• Paddle
• TensorFlow (TensorFlow on Spark,
TensorFrames)
Extensions to Spark for specialized
hardware
• Blaze (UCLA & Falcon Computing Solutions)
• IBM Conductor with Spark
Native Spark
• BigDL
• DeepDist
• DeepLearning4J
• MLlib
• SparkCL
• SparkNet
![Page 5: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/5.jpg)
Deep Learning and Apache Spark
2016: the year of emerging solutions for Spark + Deep
Learning
No consensus
• Many approaches for libraries: integrate existing ones with Spark, build
on top of Spark, modify Spark itself
• Official Spark MLlib support is limited (perceptron-like networks)
![Page 6: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/6.jpg)
One Framework to Rule Them All?
Should we look for The One Deep Learning Framework?
![Page 7: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/7.jpg)
Databricks’ perspective
• Databricks: hosted Spark platform on public cloud
• GPUs for compute-intensive workloads
• Customers use many Deep Learning frameworks: TensorFlow, MXNet,
BigDL, Theano, Caffe, and more
This talk
• Lessons learned from supporting many Deep Learning frameworks
• Multiple ways to integrate Deep Learning & Spark
• Best practices for these integrations
![Page 8: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/8.jpg)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
![Page 9: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/9.jpg)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
![Page 10: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/10.jpg)
ML is a small part of data pipelines.
Hidden technical debt in Machine Learning systems Sculley et al., NIPS 2016
![Page 11: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/11.jpg)
DL in a data pipeline: Training
Data
collection ETL Featurization
Deep
Learning Validation Export,
Serving
compute intensive IO intensive IO intensive
Large cluster
High memory/CPU ratio
Small cluster
Low memory/CPU ratio
![Page 12: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/12.jpg)
DL in a data pipeline: Transformation Specialized data transforms: feature extraction & prediction
Input Output
cat
dog
dog
Saulius Garalevicius - CC BY-SA 3.0
![Page 13: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/13.jpg)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning
integrations
• Developer tips
• Monitoring
![Page 14: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/14.jpg)
Recurring patterns
Spark as a scheduler • Data-parallel tasks
• Data stored outside Spark
Embedded Deep Learning transforms • Data-parallel tasks
• Data stored in DataFrames/RDDs
Cooperative frameworks • Multiple passes over data
• Heavy and/or specialized communication
![Page 15: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/15.jpg)
Streaming data through DL Primary storage choices:
• Cold layer (HDFS/S3/etc.)
• Local storage: files, Spark’s on-disk persistence layer
• In memory: Spark RDDs or Spark DataFrames
Find out if you are I/O constrained or processor-constrained • How big is your dataset? MNIST or ImageNet?
If using PySpark: • All frameworks heavily optimized for disk I/O
• Use Spark’s broadcast for small datasets that fit in memory
• Reading files is fast: use local files when it does not fit
![Page 16: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/16.jpg)
Cooperative frameworks
• Use Spark for data input
• Examples: • IBM GPU efforts
• Skymind’s DeepLearning4J
• DistML and other Parameter Server efforts
RDD
Partition 1
Partition n
RDD
Partition 1
Partition m
Black box
![Page 17: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/17.jpg)
Cooperative frameworks
• Bypass Spark for asynchronous / specific communication
patterns across machines
• Lose benefit of RDDs and DataFrames and
reproducibility/determinism
• But these guarantees are not requested anyway when doing
deep learning (stochastic gradient)
• “reproducibility is worth a factor of 2” (Leon Bottou, quoted
by John Langford)
![Page 18: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/18.jpg)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
![Page 19: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/19.jpg)
The GPU software stack
• Deep Learning commonly used with GPUs
• A lot of work on Spark dependencies:
• Few dependencies on local machine when compiling Spark
• The build process works well in a large number of configurations (just scala + maven)
• GPUs present challenges: CUDA, support libraries, drivers, etc.
• Deep software stack, requires careful construction (hardware + drivers + CUDA + libraries)
• All these are expected by the user
• Turnkey stacks just starting to appear
![Page 20: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/20.jpg)
• Provide a Docker image with all the GPU SDK
• Pre-install GPU drivers on the instance
Container: nvidia-docker,
lxc, etc.
The GPU software stack
GPU hardware
Linux kernel NV Kernel driver
CuBLAS CuDNN
Deep learning libraries (Tensorflow, etc.) JCUDA
Python / JVM clients
CUDA
NV kernel driver (userspace interface)
![Page 21: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/21.jpg)
Using GPUs through PySpark
• Popular choice for many independent tasks
• Many DL packages have Python interfaces: TensorFlow,
Theano, Caffe, MXNet, etc.
• Lifetime for python packages: the process
• Requires some configuration tweaks in Spark
![Page 22: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/22.jpg)
PySpark recommendation
• spark.executor.cores = 1
• Gives the DL framework full access over all the resources
• Important for frameworks that optimize processor pipelines
![Page 23: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/23.jpg)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
![Page 24: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/24.jpg)
Monitoring
?
![Page 25: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/25.jpg)
Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity
• Around tasks
• Inside (long-running) tasks
![Page 26: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/26.jpg)
Monitoring: Accumulators
• Good to check
throughput or failure rate
• Works for Scala
• Limited use for Python
(for now, SPARK-2868)
• No “real-time” update
batchesAcc = sc.accumulator(1)
def processBatch(i):
global acc
acc += 1
# Process image batch here
images = sc.parallelize(…)
images.map(processBatch).collect()
![Page 27: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/27.jpg)
Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite, Prometheus, etc.
• Most flexible, but more complex to deploy
![Page 28: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/28.jpg)
Conclusion
• Distributed deep learning: exciting and fast-moving space
• Most insights are specific to a task, a dataset and an
algorithm: nothing replaces experiments
• Get started with data-parallel jobs
• Move to cooperative frameworks only when your data are too large.
![Page 29: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/29.jpg)
Challenges to address
For Spark developers
• Monitoring long-running tasks
• Presenting and introspecting intermediate results
For DL developers
• What boundary to put between the algorithm and Spark?
• How to integrate with Spark at the low-level?
![Page 30: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/30.jpg)
Resources
Recent blog posts — http://databricks.com/blog
• TensorFrames
• GPU acceleration
• Getting started with Deep Learning
• Intel’s BigDL
Docs for Deep Learning on Databricks — http://docs.databricks.com
• Getting started
• Spark integration
![Page 31: Best Practices for Deep Learning on Apache Sparkon-demand.gputechconf.com/gtc/.../s7510-hunter-apache-spark-gpus... · Best Practices for Deep Learning on Apache Spark Tim Hunter](https://reader031.vdocuments.site/reader031/viewer/2022020204/5b1503e87f8b9a8f548d4743/html5/thumbnails/31.jpg)
Thank you!
http://databricks.com/try