担任诸多顶级学术会议 - pic.huodongjia.comaug 14, 2017 · 2000h speech lstm model...
![Page 1: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/1.jpg)
![Page 2: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/2.jpg)
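• Published more than 150 papers in top academic journals and conferences, cited more than ten thousand times.
• Served as organizing committee chair or area chair of many top academic conferences (SIGIR/ICML/NIPS/KDD/WWW/AAAI/WINE/ICTIR, etc.) and as associate editor of top academic journals (TOIS/TWEB/Neurocomputing, etc.).
• Released or co-released well-known open-source projects:
  • Microsoft Cognitive Toolkit (CNTK)
  • Microsoft Graph Engine
  • Microsoft Distributed Machine Learning Toolkit (DMTK)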
![Page 3: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/3.jpg)
ImageNet Classification top-5 error (%)
• ILSVRC2010 NEC America: 28.2
• ILSVRC2011 Xerox: 25.8
• ILSVRC2012 AlexNet: 16.4
• ILSVRC2013 Clarifai: 11.7
• ILSVRC2014 VGG: 7.3
• ILSVRC2014 GoogleNet: 6.7
• ILSVRC2015 ResNet: 3.5
All 5 of Microsoft's entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation
![Page 4: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/4.jpg)
![Page 5: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/5.jpg)
![Page 6: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/6.jpg)
https://arxiv.org/abs/1610.05256
![Page 7: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/7.jpg)
vs. Sedol Lee
Atari Games
Machine Learning, as the driving force, is entering a new era of big models and big data.
![Page 8: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/8.jpg)
DNN: Deep Neural Networks
CNN: Convolutional Neural Networks
RNN: Recurrent Neural Networks
![Page 9: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/9.jpg)
| Model | Hardware | Time cost |
| --- | --- | --- |
| ResNet on ImageNet (~1M image samples for 1K classes) | K40 * 8 | ~130 hours |
| GoogleNet on ImageNet | K40 | ~570 hours |
| 2000h Speech LSTM model training | K40 | ~1100 hours |
| Neural Translation model | K40 | ~2000 hours |
![Page 10: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/10.jpg)
How can we make good use of computation resources to speed up the training of big models over big data?
![Page 11: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/11.jpg)
![Page 12: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/12.jpg)
• Mini-tutorial on distributed machine learning
• Details on CNTK’s parallel training
![Page 13: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/13.jpg)
• CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.
• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.
![Page 14: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/14.jpg)
“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”
Example: 2-hidden-layer feed-forward NN
$h_1 = \sigma(W_1 x + b_1)$        h1 = sigmoid (x @ W1 + b1)
$h_2 = \sigma(W_2 h_1 + b_2)$        h2 = sigmoid (h1 @ W2 + b2)
$P = \mathrm{softmax}(W_{out} h_2 + b_{out})$        P = softmax (h2 @ Wout + bout)
with input $x \in \mathbb{R}^M$ and one-hot label $y \in \mathbb{R}^J$
and cross-entropy training criterion
$ce = y^T \log P$        ce = cross_entropy (P, y)
$\sum_{corpus} ce = \max$
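As a concrete reading of the example above, the following NumPy sketch evaluates the same four lines. It is not CNTK code: the layer sizes, the random initialisation and the helper functions `sigmoid`, `softmax` and `cross_entropy` are assumptions made for illustration, and the criterion follows the slide's sign convention (ce = yᵀ log P, a quantity to be maximized over the corpus).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(P, y):
    # Slide convention: ce = y^T log P (a log-likelihood, maximized over the corpus).
    return float(np.sum(y * np.log(P)))

rng = np.random.default_rng(0)
M, H, J = 4, 8, 3                           # hypothetical input, hidden and label sizes
W1, b1 = rng.normal(size=(M, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, H)), np.zeros(H)
Wout, bout = rng.normal(size=(H, J)), np.zeros(J)

x = rng.normal(size=M)                      # input x in R^M
y = np.eye(J)[1]                            # one-hot label y in R^J (class 1)

h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, y)
```

Because `y` is one-hot, `ce` is simply the log-probability the network assigns to the correct class, so it is never positive.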
![Page 20: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/20.jpg)
[Figure: computation graph of the example network. x is multiplied (•) by W1, b1 is added (+) and σ is applied to give h1; the same •, +, σ chain with W2 and b2 gives h2; •, + with Wout and bout followed by softmax gives P; cross_entropy combines P and y into ce.]
LEGO-like composability allows CNTK to support a wide range of networks & applications
![Page 22: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/22.jpg)
• Select the right reader and learner for LEGO-like training
• Describe the network as a Computation Network
• Use Simple Network Builder, BrainScript or Python to build the Computation Network
CNTK (corpus in, model out):
reader
• task-specific deserializer
• automatic randomization
network
• network definition
• CPU/GPU execution engine
learner
• SGD (momentum, AdaGrad, …)
• minibatching, packing, padding
![Page 23: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/23.jpg)
• FCN-5/-8: fully connected networks with different numbers of layers.
• AlexNet/ResNet: typical convolutional networks for image classification / recognition
• LSTM-32/-64: LSTM models (2 LSTM layers) with different batch sizes
• An iteration here means one (mini-batch) update.
• Caffe doesn’t support RNN/LSTM computation.
• Experiments were run on a TitanX GPU
[Figure: Speed Comparison for Single GPU Training, measured by iterations*/second, higher = better. Models: FCN-5, FCN-8, AlexNet, ResNet, LSTM-32, LSTM-64. Toolkits: Caffe, CNTK, MXNet, TensorFlow, Torch. Axis from 0 to 40.]
• CNTK has 1-bit SGD enabled.
• All experiments use the FCN-5 model with the synchronous data-parallel algorithm on K40 GPUs with an InfiniBand connection.
[Figure: Speed Comparison for Parallel Training, measured by iterations / second, higher = better. Configurations: FCN-5; FCN-5, 2 GPUs; FCN-5, 4 GPUs; FCN-5, 8 GPUs (2 nodes). Toolkits: Caffe, CNTK, TensorFlow, Torch. Axis from 0 to 25.]
http://github.com/Microsoft/cntk
• CNTK is fast for single-GPU training, and its speed is especially
outstanding for RNN training. CNTK is also the best among all
DNN tools in terms of scalability.
![Page 24: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/24.jpg)
• Quick revisit on CNTK
• Mini-tutorial on distributed machine learning
• Details on CNTK’s parallel training
![Page 25: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/25.jpg)
1. Partition the training data
2. Parallel training on different machines
3. Synchronize the local updates
4. Refresh the local model with new parameters, go to 2.
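The four steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the workers run one after another in one process, the task is a toy least-squares problem, and the names `local_gradient` and `data_parallel_sgd` are hypothetical, not part of any toolkit.

```python
import numpy as np

def local_gradient(w, X, y):
    # Gradient of 0.5 * mean((X w - y)^2) on one worker's data shard.
    return X.T @ (X @ w - y) / len(y)

def data_parallel_sgd(X, y, num_workers=4, lr=0.1, steps=200):
    # 1. Partition the training data across the workers.
    shards = list(zip(np.array_split(X, num_workers),
                      np.array_split(y, num_workers)))
    w_global = np.zeros(X.shape[1])
    for _ in range(steps):
        # 2. Each worker computes an update on its own shard
        #    (sequentially here; in parallel on a real cluster).
        updates = [-lr * local_gradient(w_global, Xk, yk) for Xk, yk in shards]
        # 3. Synchronize: aggregate the local updates by averaging.
        w_global = w_global + np.mean(updates, axis=0)
        # 4. Every worker continues from the refreshed w_global; go to 2.
    return w_global

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free targets, so w_true is recoverable
w = data_parallel_sgd(X, y)
```

On this noise-free problem the aggregated model `w` ends up very close to `w_true`.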
![Page 26: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/26.jpg)
BSP: Bulk Synchronous Parallel
[Figure: timeline of Workers 1 to 4; at each synchronization point the workers' updates $\Delta\omega_i^t$ are merged into the global model $\omega^t$.]
Individual workers synchronize with each other every $k$ mini-batches:
1) Aggregate the $\Delta\omega_i$'s from all workers to refine the global model $\omega$.
2) Broadcast the global model $\omega$ back to each worker.
3) After receiving the new global model, each worker starts the next step of training.
SSP: Stale Synchronous Parallel
[Figure: timeline of Workers 1 to 4 pushing updates $\Delta\omega_i^t$ to the global model $\omega^t$, with no synchronization until the staleness threshold is hit. One worker has finished iteration #5, another iteration #2. When staleness = 3, worker 3 will wait for worker 1 to catch up.]
Each worker pushes its update $\Delta\omega_i$ to the global model $\omega$ every $k$ mini-batches, until it notices that another worker is $s$ steps behind. SSP is thus a trade-off between BSP and ASP:
1) When $s = 0$, SSP = BSP.
2) When $s = \infty$, SSP = ASP.
• BSP is a well-defined mechanism, which can be equivalent to single-machine SGD under certain conditions.
• BSP has a convergence guarantee, but might be inefficient due to frequent synchronization.
• ASP always runs fast due to its asynchronous nature: no time is wasted on waiting.
• ASP, in theory, might not converge when the differences between workers’ progress are unbounded (a straggler will destroy convergence by pushing a stale $\Delta\omega$ onto the global model).
• SSP trades off efficiency against convergence: (1) it does not require strict synchronization; (2) it does not allow workers’ progress to differ by a large amount.
• SSP is proven to converge for convex loss and bounded staleness.
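One of the "certain conditions" under which BSP matches single-machine SGD can be checked numerically: if every worker receives an equal-sized shard of the same mini-batch and the gradients are averaged at the barrier, the result equals the gradient of the whole mini-batch. The sketch below assumes a toy least-squares loss and 4 hypothetical workers.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))        # one global mini-batch of 64 samples
y = rng.normal(size=64)
w = rng.normal(size=5)

def grad(w, X, y):
    # Gradient of 0.5 * mean((X w - y)^2).
    return X.T @ (X @ w - y) / len(y)

# Single machine: one gradient over the whole mini-batch.
g_single = grad(w, X, y)

# BSP with 4 workers: equal-sized shards, gradients averaged at the barrier.
g_workers = [grad(w, Xk, yk) for Xk, yk in zip(np.split(X, 4), np.split(y, 4))]
g_bsp = np.mean(g_workers, axis=0)

equivalent = np.allclose(g_single, g_bsp)
```

With unequal shard sizes a plain average would no longer match; the shards would have to be weighted by their sample counts.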
Leslie G. Valiant
ASP: Asynchronous Parallel
[Figure: timeline of Workers 1 to 4 pushing updates $\Delta\omega_i^{t_1}$ to the global model and pulling $\omega^{t_2}$ back, with no synchronization.]
Each worker pushes its update $\Delta\omega_i$ to the global model $\omega$ every $k$ mini-batches, without waiting for the others:
1) Push the update $\Delta\omega_i$ to the global model $\omega$.
2) Pull back whatever global model is currently in the parameter server.
3) Proceed with training based on the latest $\omega$ on the local machine.
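The push/pull cycle of ASP can be imitated with a deterministic, single-process simulation. Everything here is an assumption for illustration: the `ParameterServer` class is a toy in-memory stand-in, the worker speeds in `period` are made up, and the task is a noise-free least-squares problem so that stale updates still point towards the same solution.

```python
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for a parameter server holding the global model."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def push(self, delta):
        self.w += delta              # applied immediately, no barrier

    def pull(self):
        return self.w.copy()

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

ps = ParameterServer(dim=3)
local = [ps.pull() for _ in shards]  # each worker's (possibly stale) local model
period = [1, 1, 2, 3]                # hypothetical speeds: worker i finishes every period[i] ticks
lr = 0.02

for tick in range(1, 1001):
    for i, (Xi, yi) in enumerate(shards):
        if tick % period[i] == 0:
            delta = -lr * grad(local[i], Xi, yi)  # computed on the model pulled earlier
            ps.push(delta)                        # 1) push the update
            local[i] = ps.pull()                  # 2) pull whatever is on the server
            # 3) the next gradient of worker i starts from this pulled model
```

Slow workers compute their gradients on models that other workers have already moved past. With the small step size and bounded delays used here the global model still reaches `w_true`; the slide's warning concerns the case where such delays are unbounded.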
![Page 27: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/27.jpg)
A system approach
• The global model is partitioned into K sub-models without overlap.
• The sub-models are distributed over K local workers and serve as their local models.
• In each mini-batch, the local workers compute the gradients of the local weights by back propagation.
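A minimal sketch of this partitioning, assuming the simplest possible model: one linear layer whose weight matrix is split column-wise into K non-overlapping blocks. The sizes, the squared-error loss and the helper `local_gradient` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                  # number of workers (hypothetical)
X = rng.normal(size=(32, 6))           # one mini-batch, visible to every worker
W = rng.normal(size=(6, 8))            # global model: a single linear layer
T = rng.normal(size=(32, 8))           # regression targets

# Partition the global model into K non-overlapping sub-models (column blocks).
W_parts = np.split(W, K, axis=1)
T_parts = np.split(T, K, axis=1)

def local_gradient(X, W_k, T_k):
    # Worker k: forward pass and back propagation for its own columns only.
    Y_k = X @ W_k
    dY_k = (Y_k - T_k) / len(X)        # d(loss)/dY_k for a squared error averaged over the batch
    return X.T @ dY_k                  # gradient w.r.t. the local weights W_k

grads = [local_gradient(X, W_k, T_k) for W_k, T_k in zip(W_parts, T_parts)]

# Reference: the gradient of the unpartitioned model.
full_grad = X.T @ ((X @ W - T) / len(X))
matches = np.allclose(np.concatenate(grads, axis=1), full_grad)
```

For a single layer the workers never need to talk to each other. In a multi-layer network each worker would additionally have to exchange activations and back-propagated errors with the workers holding neighbouring parts of the model.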
![Page 28: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/28.jpg)
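[Figure: landscape of distributed machine learning systems, arranged by type of parallelism (data parallelism, model parallelism, irregular parallelism) and by update mode (synchronous, asynchronous): Iterative MapReduce (LDA, LR); Parameter Server (deep learning, LDA, GBDT, LR); Dataflow (deep learning).]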
![Page 29: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/29.jpg)
Iterative MapReduce
• Use MapReduce / AllReduce to sync parameters among workers
• Only synchronous update
• Example: Spark and other derived systems
[Figure: alternating phases of local computation and synchronous update]
![Page 30: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/30.jpg)
[Diagram: Iterative MapReduce, Parameter Server]
Parameter server (PS) based solutions are proposed to support:
• Asynchronous update
• Different mechanisms for model aggregation, especially in an asynchronous manner
• Model parallelism
Examples:
• Google’s DistBelief; Petuum
• Multiverso PS
+ NIPS’12 DistBelief (Google), NIPS’13 Petuum (Eric Xing), OSDI’14 Parameter server (Mu Li), Multiverso PS… etc.
![Page 31: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/31.jpg)
[Diagram: Iterative MapReduce, Parameter Server, Dataflow]
Dataflow based solutions are proposed to support:
• Irregular parallelism (e.g., hybrid data- and model-parallelism), particularly in deep learning
• Both high-level abstraction and low-level flexibility in implementation
Example:
• Google’s TensorFlow
+ TensorFlow, EuroSys’07 Dryad (Microsoft), NSDI’12 Spark (AMP Lab), CNTK, MXNet… etc.
Task scheduling & execution based on:
1. Data dependency
2. Resource availability
[Figure: dataflow graph mapped onto resources]
![Page 32: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/32.jpg)
• Computation infrastructure: parallel execution engine based on gRPC, message packing, communication, etc.
• Computation allocation method: manual assignment to devices vs. automatic, optimized allocation
• Parameter aggregation logic: inherits all research output from the parameter server side.
![Page 33: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/33.jpg)
The focus is still on data parallelism
+ the data is huge
+ the model is of modest size
+ a cluster of machines works together to speed up training over the data partitions
![Page 34: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/34.jpg)
• Quick revisit on CNTK
• Mini-tutorial on distributed machine learning
![Page 35: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/35.jpg)
[Figure: Speed Comparison for Parallel Training, measured by iterations / second, higher = better. Configurations: FCN-5; FCN-5, 2 GPUs; FCN-5, 4 GPUs; FCN-5, 8 GPUs (2 nodes). Toolkits: Caffe, CNTK, TensorFlow, Torch. Axis from 0 to 25.]
![Page 36: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/36.jpg)
![Page 37: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/37.jpg)
• Communicate less each time
• Communicate less often
• Asynchronous and Pipelined processing
https://github.com/Microsoft/CNTK/wiki/Multiple-GPUs-and-machines
![Page 38: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/38.jpg)
• Quantize gradients to just 1 bit per value, with error feedback
  • Each value is reduced to a decision of whether to add or subtract the same shared value; the remainder is carried over to the next minibatch
• Delay the update procedure
[Figure: transferred gradient (bits/value), smaller is better; bars for 1-bit and float on an axis from 0 to 35]
F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs,” Interspeech 2014
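The error-feedback idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than the cited paper's exact scheme: it assumes one shared magnitude per tensor (the mean absolute value), and the function name `one_bit_quantize` is hypothetical.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per value; return (quantized, new_residual).

    Assumption of this sketch: a single shared magnitude per tensor,
    taken as the mean absolute value of the error-corrected gradient.
    """
    corrected = grad + residual                           # add the error carried over
    scale = np.mean(np.abs(corrected))                    # the shared value
    quantized = np.where(corrected >= 0, scale, -scale)   # 1 bit: plus or minus
    return quantized, corrected - quantized               # leftover goes to the next minibatch

rng = np.random.default_rng(0)
residual = np.zeros(5)
sent_total = np.zeros(5)    # what was actually transferred, summed over minibatches
true_total = np.zeros(5)    # what would have been transferred without quantization
for _ in range(100):
    g = rng.normal(size=5)
    q, residual = one_bit_quantize(g, residual)
    sent_total += q
    true_total += g
```

Because the quantization error is never discarded, `sent_total + residual` equals `true_total` up to floating-point rounding: the transferred values lag behind the true gradients but do not lose them.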
![Page 39: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/39.jpg)
![Page 40: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/40.jpg)
K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016
![Page 41: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/41.jpg)
Block Momentum
K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016
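A rough sketch of a block-level update with block momentum, based on one reading of the cited paper and not on CNTK's implementation: after a block of data, the workers' local models are averaged, the resulting model change is filtered with a momentum term, and the filtered change is applied to the global model. The function `bmuf_step` and its parameter names are hypothetical.

```python
import numpy as np

def bmuf_step(w_global, delta_prev, local_models, block_momentum, block_lr=1.0):
    """One block-level update; returns (new_global_model, new_delta)."""
    w_avg = np.mean(local_models, axis=0)                 # aggregate the workers' models
    g = w_avg - w_global                                  # model change suggested by this block
    delta = block_momentum * delta_prev + block_lr * g    # filter it with block momentum
    return w_global + delta, delta

w0 = np.zeros(3)
delta0 = np.zeros(3)
local_models = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]

# Without momentum the step reduces to plain model averaging.
w_plain, _ = bmuf_step(w0, delta0, local_models, block_momentum=0.0)

# With momentum, part of the previous block's change is re-applied.
w1, delta1 = bmuf_step(w0, delta0, local_models, block_momentum=0.5)
w2, delta2 = bmuf_step(w1, delta1, local_models, block_momentum=0.5)
```

In the second call the averaged model equals `w1`, so the whole step comes from the momentum term: `w2` moves a further `0.5 * delta1` in the same direction.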
![Page 42: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/42.jpg)
1bit/BMUF Speedup Factors in LSTM Training

| | 4 GPUs | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs |
| --- | --- | --- | --- | --- | --- |
| 1bit-average | 2.9 | 5.4 | 8.0 | | |
| 1bit-peak | 3.3 | 6.7 | 10.8 | | |
| BMUF-average | 3.7 | 6.9 | 13.8 | 25.5 | 43.7 |
| BMUF-peak | 4.1 | 8.1 | 14.1 | 27.3 | 54.0 |
K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016
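A speedup factor is easier to judge once it is divided by the number of GPUs, which gives the parallel efficiency. The short calculation below assumes that the BMUF-average series in the chart reads 3.7, 6.9, 13.8, 25.5 and 43.7 for 4 to 64 GPUs.

```python
gpus = [4, 8, 16, 32, 64]
bmuf_average = [3.7, 6.9, 13.8, 25.5, 43.7]   # speedup factors as read from the chart

# Parallel efficiency: achieved speedup divided by the ideal (linear) speedup.
efficiency = [s / n for s, n in zip(bmuf_average, gpus)]
```

Under this reading the efficiency falls gradually as GPUs are added and is about 0.68 at 64 GPUs.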
![Page 43: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/43.jpg)
![Page 44: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/44.jpg)
Asynchronous and Pipelined processing
[Figures: communication barrier in synchronous parallelization; asynchronous parallelization]
![Page 45: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/45.jpg)
Asynchronous and Pipelined processing
[Figure: timeline of the main (training) thread and the communication thread; training and communication steps 1, 2, … overlap and are linked through a GPU buffer. Labelled steps: prepare updates, prepare model, fill model to GPU buffer, switch buffer.]
![Page 46: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/46.jpg)
• Sequential SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_{t+\tau})$
• Async SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_t)$
Delayed communication in asynchronous parallelization
![Page 47: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/47.jpg)
ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_t)$
DC-ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_t) - \lambda \, \phi(g(w_t)) \cdot (w_{t+\tau} - w_t)$
[Figure: training curve for a ResNet DNN model on CIFAR-10]
This work directly targets the delay; it is experimentally effective and comes with a convergence analysis.
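The two update rules can be compared side by side in NumPy. The sketch rests on two assumptions that the slide text does not settle: the delay-compensation term is subtracted outside the η factor, and φ(g) is taken to be the element-wise square g * g. Both function names are hypothetical.

```python
import numpy as np

def asgd_step(w_now, g_stale, lr):
    # w_{t+tau+1} = w_{t+tau} - lr * g(w_t): the gradient was computed on an older model.
    return w_now - lr * g_stale

def dc_asgd_step(w_now, w_stale, g_stale, lr, lam):
    # Adds a correction proportional to how far the model has moved since
    # the gradient was computed. Assumption: phi(g) = g * g, element-wise.
    phi = g_stale * g_stale
    return w_now - lr * g_stale - lam * phi * (w_now - w_stale)

w_t = np.array([1.0, -2.0])       # model the worker pulled
w_now = np.array([0.8, -1.5])     # global model when the stale gradient arrives
g = np.array([0.5, 0.1])          # gradient computed at w_t

plain = asgd_step(w_now, g, lr=0.1)
compensated = dc_asgd_step(w_now, w_t, g, lr=0.1, lam=2.0)
no_delay = dc_asgd_step(w_t, w_t, g, lr=0.1, lam=2.0)
```

When the global model has not moved (`w_now == w_stale`) the correction vanishes and both rules give the same step, which is why `no_delay` equals the plain ASGD update from `w_t`.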
![Page 48: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/48.jpg)
![Page 49: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/49.jpg)
![Page 50: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/50.jpg)
![Page 51: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/51.jpg)
• CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.
  • Linux, Windows, Docker, .NET
• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
  • automatic differentiation, deferred computation, optimized execution and memory use
  • powerful description language, composability
  • implicit time; efficient static and recurrent NN training through batching
  • data parallelization, GPUs & servers: 1-bit SGD, Block Momentum
  • feed-forward DNN, RNN, LSTM, convolution, DSSM; speech, vision, text
• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.
![Page 52: 担任诸多顶级学术会议 - pic.huodongjia.comAug 14, 2017 · 2000h Speech LSTM model training K40 ~ 1100 hours Neural Translation model K40 ~ 2000 hours. How to well utilize](https://reader033.vdocuments.site/reader033/viewer/2022050522/5fa618bb0cfebf7c1938bee5/html5/thumbnails/52.jpg)
• Web site: https://cntk.ai/
• Github: https://github.com/Microsoft/CNTK
• Wiki: https://github.com/Microsoft/CNTK/wiki
• Issues: https://github.com/Microsoft/CNTK/issues
CNTK: democratizing the AI tool chain