TRANSCRIPT
Slide 1
HUAWEI | EUROPEAN RESEARCH CENTER
DeepSpark
Asynchronous deep learning over Spark
Contributors: Natan Peterfreund, Roman Talyansky, Uri Verner, Zach Melamed, Youliang Yan, Rongfu Zheng
Presenter: Uri Verner
— Huawei Confidential —
Slide 2
DeepSpark is a scalable deep learning framework for Spark-based distributed environments.
Slide 3
Outline
Background
DeepSpark architecture
Data locality optimizations
Initial results
Useful tools
Slide 4
What is Apache Spark?
Spark is an advanced framework for distributed computation
Very fast at iterative algorithms
In-memory data caching between iterations
Provides fault-tolerance and recovery
Efficient data transfer between nodes (“shuffle”)
Easy and expressive APIs
Slide 5
Synchronous vs. Asynchronous Training
[Diagram: workers repeatedly read input data and send updates to the parameter server.]
Workers can get out of sync:
- network delays
- waiting for data
- machine crashes
- etc.
Slide 6
System Architecture
[Diagram: several training worker machines, each running a Spark executor in which a training manager drives Caffe across multiple GPUs; each machine reads its data from HDFS; the MODEL is held in a distributed parameter server backed by a Spark RDD; a Spark Driver coordinates the cluster.]
Slide 7
Data Parallelism with Asynchronous Distributed Stochastic Descent
Each worker operates asynchronously with the other workers. Per-worker update procedure:
1. Download model M from the parameter server
2. Compute update ΔM on local data
3. Upload ΔM to the PS
4. The PS updates the model: M := M + ΔM
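As a concrete (if toy) illustration of these four steps, here is a minimal single-process sketch. ParameterServer and computeUpdate() are illustrative stand-ins: in DeepSpark the parameter server is distributed over Spark and the update comes from Caffe.

// Toy sketch of the worker-side protocol above; names are illustrative.
#include <cstdio>
#include <vector>

using Model = std::vector<float>;

struct ParameterServer {
    Model model = Model(8, 0.0f);                  // toy global model M
    Model download() { return model; }             // step 1: get M
    void upload(const Model& dm) {                 // step 4: M := M + dM
        for (size_t i = 0; i < model.size(); ++i) model[i] += dm[i];
    }
};

// Step 2 stand-in: in reality, Caffe forward/backward passes on the
// worker's local data produce the update dM.
Model computeUpdate(const Model& m) { return Model(m.size(), -0.01f); }

void workerLoop(ParameterServer& ps, int iterations) {
    for (int it = 0; it < iterations; ++it) {      // runs independently per worker
        Model m = ps.download();                   // 1. download model M
        Model dm = computeUpdate(m);               // 2. compute update dM
        ps.upload(dm);                             // 3. upload dM to the PS
    }                                              // 4. done by the PS in upload()
}

int main() {
    ParameterServer ps;
    workerLoop(ps, 100);
    std::printf("M[0] after 100 updates: %f\n", ps.model[0]);  // about -1.0
}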
Slide 8
Distributed Parameter Server
The MODEL is held in a Spark Resilient Distributed Dataset (RDD):
- Cached in memory
- API for distributed processing
Model update procedure:
- Training workers send local updates to PS machines in split form
- PS machines compute a new global model
- Training workers are updated with the new model
[Diagram: workers' local “ready model” updates are merged on the PS into the global merged model.]
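Since the model is split across PS machines, each machine only merges updates for its own split. A toy sketch of that merge follows; names are illustrative, and the Spark RDD plumbing that delivers the split updates is omitted.

// Toy sketch of the PS-side merge for one model split.
#include <mutex>
#include <vector>

struct ModelShard {
    std::vector<float> weights;        // this PS machine's split of MODEL
    long version = 0;                  // bumped on every merged update
    std::mutex mu;

    // Merge one worker's local update for this split into the global
    // model, M := M + dM; concurrent workers are serialized here.
    void merge(const std::vector<float>& delta) {
        std::lock_guard<std::mutex> lock(mu);
        for (size_t i = 0; i < weights.size(); ++i) weights[i] += delta[i];
        ++version;
    }
};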
Slide 9
Workers Don’t Wait For Model Update
[Diagram: Caffe drives several GPUs. Training loop: run a forward/backward pass; add the resulting model update to the accumulated updates; load the global model into the local model if a new one is available. Update loop, running concurrently: get the model from the PS; send the accumulated updates to the PS.]
Slide 10
Preserve Local Updates
[Same training and update loop diagram as the previous slide.]
The worker's local model always reflects its own pending updates: “read-my-writes” [1].
[1] “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server”, Ho et al., NIPS 2013.
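A sketch of how the two loops might fit together, assuming the worker tracks its local model, its accumulated (not-yet-merged) updates, and the version of the last fetched global model. All names are illustrative, and the bookkeeping for clearing accumulated updates once a fetched model already includes them is omitted.

// Illustrative sketch of "preserve local updates" / read-my-writes.
#include <vector>

struct WorkerState {
    std::vector<float> localModel;     // what the GPUs train on
    std::vector<float> accumulated;    // updates the PS has not merged yet
    long globalVersion = 0;            // version of the last fetched model
};

// Training-loop step: fold a fresh GPU update into local state
// without waiting for the PS.
void addUpdate(WorkerState& w, const std::vector<float>& delta) {
    for (size_t i = 0; i < delta.size(); ++i) {
        w.localModel[i]  += delta[i];
        w.accumulated[i] += delta[i];
    }
}

// Update-loop step: "load if new". Re-apply our accumulated updates
// on top of the new global model so they are never lost.
void loadIfNew(WorkerState& w, const std::vector<float>& global, long version) {
    if (version <= w.globalVersion) return;    // nothing new; keep training
    w.globalVersion = version;
    w.localModel = global;
    for (size_t i = 0; i < w.localModel.size(); ++i)
        w.localModel[i] += w.accumulated[i];   // read-my-writes
}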
Slide 11
Limited Staleness
[Diagram: two workers, each loading the global model into its local model if new; the slowest worker is at model version 2 while another worker is at version 10.]
Use a configurable staleness threshold.
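One possible way to enforce such a threshold (a sketch in the spirit of the stale synchronous parallel model [1], not necessarily DeepSpark's exact mechanism): a fast worker blocks before its next iteration whenever it has run more than the threshold ahead of the slowest worker. With a threshold of 4, the version-10 worker above would wait for the version-2 worker to catch up.

// Sketch of a configurable staleness bound; names are illustrative.
#include <condition_variable>
#include <mutex>

class StalenessGate {
public:
    explicit StalenessGate(long threshold) : threshold_(threshold) {}

    // Called whenever the PS learns the slowest worker's model version.
    void publishSlowest(long version) {
        { std::lock_guard<std::mutex> lock(mu_); slowest_ = version; }
        cv_.notify_all();
    }

    // Called by a worker before each iteration: fast workers wait here
    // so they never run more than `threshold` versions ahead.
    void waitIfTooFast(long myVersion) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return myVersion - slowest_ <= threshold_; });
    }

private:
    long threshold_;
    long slowest_ = 0;
    std::mutex mu_;
    std::condition_variable cv_;
};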
Slide 12
Work Assignment with HDFS
[Diagram: HDFS machines holding data blocks, and worker machines reading them.]
The input data is distributed, stored in blocks of 128 MB (by default), and replicated.
Worker machines may also be HDFS machines.
Problem: assign each (unique) data block to a worker.
Requirements (in order of priority):
- Equal work distribution
- Minimize data transfer over the network
Slide 13
The Data Block Assignment Problem
[Diagram: a tripartite graph of data blocks (N), replicas (R), and workers (W); edges mark locality.]
For each data block, choose one replica and assign it to a worker, such that each worker gets N/W blocks (±1) and non-local assignments are minimized.
Slide 14
Solving HDFS Locality Optimization
Represent the assignment as a minimum-cost flow problem, a classical problem with an efficient solution [2].
[Flow network: source → each data block (N): capacity = 1, cost = 0; data block → each of its replicas (R): capacity = 1, cost = 0; replica → worker: capacity = 1, cost = 0 for a local replica and 1 for a remote replica; worker → sink: capacity = N/W, cost = 0. A total flow of N is pushed through the network.]
[2] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.
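For concreteness, here is a self-contained sketch that solves a toy instance of this network with a textbook successive-shortest-path min-cost-flow solver [2]. To keep it short, the replica layer is folded into the block-to-worker edges (cost 0 if the worker stores a replica of the block locally, cost 1 otherwise), and the HDFS layout is made up.

// Sketch: block-to-worker assignment as min-cost flow (toy layout).
#include <algorithm>
#include <climits>
#include <cstdio>
#include <deque>
#include <vector>

struct MinCostFlow {
    struct Edge { int to, cap, cost; };
    std::vector<Edge> e;                        // e[i^1] is the reverse of e[i]
    std::vector<std::vector<int>> g;
    explicit MinCostFlow(int n) : g(n) {}
    void add(int u, int v, int cap, int cost) {
        g[u].push_back((int)e.size()); e.push_back({v, cap, cost});
        g[v].push_back((int)e.size()); e.push_back({u, 0, -cost});
    }
    int solve(int s, int t) {                   // total cost of a max flow
        int totalCost = 0;
        for (;;) {
            std::vector<int> dist(g.size(), INT_MAX), pre(g.size(), -1);
            std::vector<bool> inq(g.size(), false);
            std::deque<int> q{s};
            dist[s] = 0;
            while (!q.empty()) {                // SPFA: cheapest augmenting path
                int u = q.front(); q.pop_front(); inq[u] = false;
                for (int id : g[u]) {
                    const Edge& ed = e[id];
                    if (ed.cap > 0 && dist[u] + ed.cost < dist[ed.to]) {
                        dist[ed.to] = dist[u] + ed.cost; pre[ed.to] = id;
                        if (!inq[ed.to]) { inq[ed.to] = true; q.push_back(ed.to); }
                    }
                }
            }
            if (pre[t] < 0) return totalCost;   // no path left: flow is maximal
            int f = INT_MAX;
            for (int id = pre[t]; id >= 0; id = pre[e[id ^ 1].to])
                f = std::min(f, e[id].cap);
            for (int id = pre[t]; id >= 0; id = pre[e[id ^ 1].to]) {
                e[id].cap -= f; e[id ^ 1].cap += f; totalCost += f * e[id].cost;
            }
        }
    }
};

int main() {
    const int N = 4, W = 2;                     // 4 blocks, 2 workers (toy)
    bool local[N][W] = {{true, false}, {true, false}, {false, true}, {true, true}};
    int S = 0, T = N + W + 1;                   // blocks: 1..N, workers: N+1..N+W
    MinCostFlow mcf(N + W + 2);
    for (int b = 0; b < N; ++b) {
        mcf.add(S, 1 + b, 1, 0);                // source -> block: cap 1, cost 0
        for (int w = 0; w < W; ++w)             // block -> worker: cost 0 local, 1 remote
            mcf.add(1 + b, 1 + N + w, 1, local[b][w] ? 0 : 1);
    }
    for (int w = 0; w < W; ++w)
        mcf.add(1 + N + w, T, (N + W - 1) / W, 0);  // worker -> sink: cap ceil(N/W)
    std::printf("non-local assignments: %d\n", mcf.solve(S, T));
}

On this toy layout every block can be assigned to a worker that holds it locally, so the minimum cost (the number of non-local assignments) comes out as 0.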
Slide 15
Assigning the HDFS data blocks
[Diagram: the resulting min-cost flow of N units through data blocks (N), replicas (R), and workers (W) directly yields the block-to-worker assignment.]
Slide 16
Initial Results
Setup: 4 machines, each with one Titan X GPU; TCP/IP over ConnectX-3 InfiniBand; GoogLeNet model (from Caffe); each machine is used as both a worker and a PS.
[Chart: training loss (0–12) vs. iterations (0K–40K) for Single worker, DeepSpark, and BSP (the ideal).]
[Chart: iteration time [ms] (0–700) for Single worker, DeepSpark, and BSP.]
Slide 17
Useful Optimization & Debugging Tools
Visualize the program’s execution using the NVIDIA Tools Extension (NVTX).
Mark the beginnings and endings of all your important operations.
[Timeline screenshot: NVTX ranges showing Caffe and Spark activity side by side.]
Slide 18
Useful Optimization & Debugging Tools
See the CUDA Pro Tip: “Generate Custom Application Profile Timelines with NVTX”.
Time ranges are marked using push-pop semantics.
C++ trick: define a special class that calls “push” in its constructor and “pop” in its destructor.
Define a macro that creates a “profiling” object carrying information about the enclosing function; to describe the function, use the macros __PRETTY_FUNCTION__, __FILE__, and __LINE__.
Example:
int func() {
    PROFILER_FUNCTION_SCOPE();  // pushes an NVTX range; popped automatically on return
    // ... body of function ...
}
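A minimal sketch of that trick, assuming the NVTX v1 C API (nvtxRangePushA/nvtxRangePop from nvToolsExt.h, linked with -lnvToolsExt). The macro name follows the slide; the rest is illustrative.

// Sketch of the RAII push/pop helper described above.
#include <nvToolsExt.h>

class ProfilerScope {
public:
    explicit ProfilerScope(const char* name) { nvtxRangePushA(name); }  // "push" in constructor
    ~ProfilerScope() { nvtxRangePop(); }                                // "pop" in destructor
    ProfilerScope(const ProfilerScope&) = delete;
    ProfilerScope& operator=(const ProfilerScope&) = delete;
};

// __PRETTY_FUNCTION__ (GCC/Clang) names the enclosing function; a
// fancier version could also format __FILE__ and __LINE__ into the
// range name at runtime.
#define PROFILER_FUNCTION_SCOPE() \
    ProfilerScope profilerFunctionScope_(__PRETTY_FUNCTION__)

Because the destructor runs on every exit path from func(), the range is always popped, even on early returns or exceptions.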
Slide 19
Copyright © 2016 Huawei Technologies. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
EUROPEAN RESEARCH CENTER
DeepSpark
Contact emails: