![Page 1: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/1.jpg)
![Page 2: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/2.jpg)
With many contributors:
A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, W. Darling, J. Droppo, A. Eversole, B. Guenter, P. He, M. Hillebrand, X. Huang, Z. Huang, R. Hoens, V. Ivanov, A. Kamenev, N. Karampatziakis, P. Kranen, O. Kuchaiev, W. Manousek, C. Marschner, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Radmilac, A. Reznichenko, P. Parthasarathi, S. Pathak, B. Peng, W. Richert, F. Seide, M. Seltzer, M. Slaney, A. Stolcke, T. Will, H. Wang, Z. Wang, W. Xiong, K. Yao, D. Yu, C. Zhang, Y. Zhang, G. Zweig
![Page 3: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/3.jpg)
![Page 4: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/4.jpg)
Microsoft
Cognitive
Toolkit
deep learning at Microsoft
• Microsoft Cognitive Services
• Skype Translator
• Cortana
• Bing
• HoloLens
• Microsoft Research
![Page 5: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/5.jpg)
![Page 6: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/6.jpg)
ImageNet: Microsoft 2015 ResNet
ImageNet Classification top-5 error (%)

| Entry | Top-5 error (%) |
|---|---|
| ILSVRC2010 NEC America | 28.2 |
| ILSVRC2011 Xerox | 25.8 |
| ILSVRC2012 AlexNet | 16.4 |
| ILSVRC2013 Clarifai | 11.7 |
| ILSVRC2014 VGG | 7.3 |
| ILSVRC2014 GoogLeNet | 6.7 |
| ILSVRC2015 ResNet | 3.5 |

All five of Microsoft's entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.
![Page 7: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/7.jpg)
![Page 8: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/8.jpg)
![Page 9: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/9.jpg)
24%
14%
![Page 10: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/10.jpg)
Microsoft’s historic speech breakthrough
• Microsoft 2016 research system for
conversational speech recognition
• 5.9% word-error rate
• enabled by CNTK’s multi-server scalability
[W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke,
D. Yu, G. Zweig: “Achieving Human Parity in Conversational
Speech Recognition,” https://arxiv.org/abs/1610.05256]
![Page 12: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/12.jpg)
Microsoft Customer Support Agent
![Page 13: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/13.jpg)
![Page 14: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/14.jpg)
![Page 15: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/15.jpg)
![Page 16: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/16.jpg)
![Page 17: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/17.jpg)
![Page 18: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/18.jpg)
Benchmarking on a single server by HKBU
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”

| | FCN-8 | AlexNet | ResNet-50 | LSTM-64 |
|---|---|---|---|---|
| CNTK | 0.037 | 0.040 (0.054) | 0.207 (0.245) | 0.122 |
| Caffe | 0.038 | 0.026 (0.033) | 0.307 (-) | - |
| TensorFlow | 0.063 | - (0.058) | - (0.346) | 0.144 |
| Torch | 0.048 | 0.033 (0.038) | 0.188 (0.215) | 0.194 |

G980
![Page 19: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/19.jpg)
Recent update
• With a single GPU platform:
  • Caffe, CNTK and Torch perform better than MXNet and TensorFlow on FCNs.
  • MXNet is outstanding in CNNs, while Caffe and CNTK also achieve good performance.
  • For RNN of LSTM, CNTK obtains excellent time efficiency, which is up to 5-10 times better than other tools.
• CNTK outperforms TensorFlow on all categories, often by a large margin.
https://arxiv.org/pdf/1608.07249v6.pdf
![Page 20: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/20.jpg)
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
Theano only supports 1 GPU
Achieved with 1-bit gradient quantization algorithm
[Figure: bar chart, speed comparison (samples/second), higher = better, for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); note: December 2015]
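The slide only names 1-bit gradient quantization; it does not show how it works. The following is a minimal sketch of one common formulation, assuming sign-based quantization with error feedback (the quantization error is carried over to the next minibatch). The function name, the choice of reconstruction values, and the toy numbers are assumptions for illustration, not CNTK's implementation.

```python
def one_bit_quantize(gradient, residual):
    """Quantize (gradient + carried-over residual) to one bit per value.

    Returns the bits, the values a receiver would reconstruct from them,
    and the new residual to add to the next minibatch's gradient.
    """
    corrected = [g + r for g, r in zip(gradient, residual)]
    positives = [v for v in corrected if v >= 0]
    negatives = [v for v in corrected if v < 0]
    # Assumption: reconstruct each bit as the mean of the values it represents.
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    bits = [v >= 0 for v in corrected]
    decoded = [pos_value if bit else neg_value for bit in bits]
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, decoded, new_residual


gradient = [0.3, -0.1, 0.5, -0.7]
residual = [0.0] * len(gradient)          # nothing carried over yet
bits, decoded, residual = one_bit_quantize(gradient, residual)
```

Only `bits` plus the two reconstruction values would need to be exchanged between workers, which is where the bandwidth saving comes from; the residual keeps the information that was lost so that it is not dropped permanently.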
![Page 21: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/21.jpg)
Superior performance
![Page 22: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/22.jpg)
Superior performance
![Page 23: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/23.jpg)
What is new in CNTK 2.0?
https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/
Microsoft has now released a major upgrade of the software
and rebranded it as part of the Microsoft Cognitive
Toolkit. This release is a major improvement over the initial
release.
There are two major changes from the first release that you
will see when you begin to look at the new release. First is
that CNTK now has a very nice Python API and, second, the
documentation and examples are excellent.
Installing the software from the binary builds is very easy on
both Ubuntu Linux and Windows.
![Page 24: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/24.jpg)
![Page 25: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/25.jpg)
• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.
The Microsoft Cognitive Toolkit (CNTK)
![Page 26: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/26.jpg)
• open-source model inside and outside the company
• created by Microsoft Speech researchers (Dong Yu et al.) in 2012, “Computational Network Toolkit”
• open-sourced (CodePlex) in early 2015
• on GitHub since Jan 2016 under permissive license
• Python support since Oct 2016 (beta), rebranded as “Cognitive Toolkit”
• used by Microsoft product groups; but code development is out in the open
• external contributions e.g. from MIT and Stanford
• Linux, Windows, docker, cudnn5, CUDA 8
• Python and C++ API (beta; C#/.Net on roadmap)
• Keras integration in progress
“CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.”
![Page 27: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/27.jpg)
• CNTK is a library for deep neural networks
  • model definition
  • scalable training
  • efficient I/O
• easy to author, train, and use neural networks
  • think “what” not “how”
  • focus on composability
• Python, C++, C#, Java
• open source since 2015 https://github.com/Microsoft/CNTK
• created by Microsoft Speech researchers (Dong Yu et al.) in 2012, “Computational Network Toolkit”
• contributions from MS product groups and external (e.g. MIT, Stanford), development is visible on GitHub
• Linux, Windows, docker, cudnn5, CUDA 8
Microsoft Cognitive Toolkit, CNTK
![Page 28: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/28.jpg)
MNIST Handwritten Digits (OCR)
• Data set of handwritten digits with
  • 60,000 training images
  • 10,000 test images
• Each image is 28 x 28 pixels
• Performance with different classifiers (error rate):
  • Neural nets (2-layers): 1.6%
  • Deep nets (6-layers): 0.35%
  • Conv nets (different): 0.21% - 0.31%
[Figure: sample handwritten digits with their corresponding labels]
![Page 29: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/29.jpg)
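Logistic Regression
[Figure: a 28 x 28 pix image is flattened into 784 pixels ($\vec{x}$); the model has weights ($W$, 784 x 10) and a bias ($\vec{b}$, 10) and produces one output per digit 0, 1, …, 9 through summation nodes (Σ) followed by an activation function that maps to the (0-1) range]
$\vec{z} = W \vec{x}^T + \vec{b}$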
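As a minimal sketch of the formula on this slide, the following pure-Python snippet evaluates z = W x + b and applies a sigmoid activation. It uses a toy size (3 inputs, 2 outputs) instead of 784 pixels and 10 digits, stores W with one row per output, and the helper names `linear` and `sigmoid` are illustrative assumptions rather than CNTK functions.

```python
import math


def linear(W, x, b):
    """Compute z = W x + b, where each row of W holds the weights of one output."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def sigmoid(z):
    """Activation function: map every entry of z into the (0, 1) range."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


# Toy sizes: 3 "pixels" and 2 classes (MNIST would use 784 and 10).
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -1.0]
x = [1.0, 0.5, 2.0]

z = linear(W, x, b)   # [0.0, 2.5]
a = sigmoid(z)        # a[0] is exactly 0.5 because z[0] is 0
```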
![Page 30: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/30.jpg)
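Single-Layer Perceptron
[Figure: a 28 x 28 pix image is flattened into 784 pixels ($\vec{x}$) and fed to one Dense layer (D) of 400 nodes; each node is a summation (Σ) followed by an activation function]
• Dense Layer: i = 784, O = 400, a = sigmoid
• Model: weights ($W$, 784 x 400) and bias ($\vec{b}$)
$\vec{z} = W \vec{x}^T + \vec{b}$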
![Page 31: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/31.jpg)
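Multi-layer Perceptron (Deep Model)
[Figure: a 28 x 28 pix image is flattened into 784 pixels (x) and passed through three Dense layers (D) and a softmax]
• D, 400 nodes: i = 784, O = 400, a = relu; weights 784 x 400 + 400 bias
• D, 200 nodes: i = 400, O = 200, a = relu; weights 400 x 200 + 200 bias
• D, 10 nodes: i = 200, O = 10, a = None; weights 200 x 10 + 10 bias
• softmax: 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01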
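The layer sizes on this slide fix the number of trainable parameters. As a small worked example (the helper name `dense_parameter_count` is just for illustration), each Dense layer contributes one weight per input/output pair plus one bias per output node:

```python
# (input dimension i, output dimension O) of each Dense layer on the slide
layer_shapes = [(784, 400), (400, 200), (200, 10)]


def dense_parameter_count(n_in, n_out):
    """Weights (n_in x n_out) plus one bias per output node."""
    return n_in * n_out + n_out


per_layer = [dense_parameter_count(i, o) for i, o in layer_shapes]
total = sum(per_layer)
print(per_layer, total)  # [314000, 80200, 2010] 396210
```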
![Page 32: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/32.jpg)
Error or Loss Function
[Figure: a 28 x 28 pix image is fed to Model(w, b), which outputs predicted probabilities (p); the loss function compares them with the one-hot encoded label (Y)]
Label, one-hot encoded (Y): 0 0 0 1 0 0 0 0 0 0
Predicted probabilities (p): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
Squared error: $se = \sum_{j=0}^{9} (y_j - p_j)^2$
Cross entropy error: $ce = -\sum_{j=0}^{9} y_j \log p_j$
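To make the two loss functions concrete, this short sketch evaluates them on the one-hot label and the predicted probabilities shown on the slide. The variable names are chosen for illustration only.

```python
import math

y = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # one-hot label for the digit 3
p = [0.08, 0.08, 0.10, 0.17, 0.11, 0.09, 0.08, 0.08, 0.13, 0.01]

# se = sum_j (y_j - p_j)^2
squared_error = sum((y_j - p_j) ** 2 for y_j, p_j in zip(y, p))

# ce = -sum_j y_j log p_j; with a one-hot label only the true class contributes
cross_entropy = -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, p))
```

With these numbers the squared error is about 0.76 and the cross entropy is -log(0.17), about 1.77. Because the label is one-hot, the cross entropy depends only on the probability assigned to the correct digit.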
![Page 33: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/33.jpg)
Train Workflow
• MNIST Train: 128 samples (mini-batch)
• Input feature (X: 128 x 784)
• One-hot encoded Label (Y: 128 x 10), e.g. for the digits 3, 7, 8, 0, …:
    0 0 0 1 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0 0
    0 0 0 0 0 0 0 0 1 0
    1 0 0 0 0 0 0 0 0 0
    …
• Model:
    z = model(X):
        h1 = Dense(400, act = relu)(X)
        h2 = Dense(200, act = relu)(h1)
        r = Dense(10, act = None)(h2)
        return r
• Model Parameters (weights): 784 x 400 + 400 bias, 400 x 200 + 200 bias, 200 x 10 + 10 bias
• Loss: cross_entropy_with_softmax(p,Y)
• Error (optional): classification_error(p,Y)
• Learner: sgd, adagrad etc. are solvers to estimate W & b
• Trainer(model, (loss, error), learner)
• Trainer.train_minibatch({X, Y})
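The training workflow above (model, loss, error, learner, one update per minibatch) can be sketched without any toolkit. The following is a hedged, pure-Python stand-in that uses a single linear layer with softmax and plain SGD on made-up two-feature data; `train_minibatch` here is a hypothetical function that mimics the role of `Trainer.train_minibatch`, not CNTK's implementation.

```python
import math
import random


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def train_minibatch(W, b, X, Y, lr):
    """One SGD step for z = W x + b with softmax and cross entropy.

    Returns (average loss, classification error) measured before the update.
    """
    n_out, n_in = len(W), len(W[0])
    grad_W = [[0.0] * n_in for _ in range(n_out)]
    grad_b = [0.0] * n_out
    loss, errors = 0.0, 0
    for x, y in zip(X, Y):
        z = [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]
        p = softmax(z)
        target = y.index(1)
        loss -= math.log(p[target])
        errors += p.index(max(p)) != target
        for k in range(n_out):
            d = p[k] - y[k]                 # gradient of the loss w.r.t. z_k
            grad_b[k] += d
            for j in range(n_in):
                grad_W[k][j] += d * x[j]
    n = len(X)
    for k in range(n_out):                  # the "learner": plain SGD
        b[k] -= lr * grad_b[k] / n
        for j in range(n_in):
            W[k][j] -= lr * grad_W[k][j] / n
    return loss / n, errors / n


# Made-up data: 2 features, 2 classes; the class is 1 when x0 > x1.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(64)]
Y = [[0, 1] if x[0] > x[1] else [1, 0] for x in X]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

history = []
for epoch in range(50):
    for start in range(0, len(X), 16):      # minibatches of 16 samples
        history.append(train_minibatch(W, b, X[start:start + 16],
                                       Y[start:start + 16], lr=0.5))
first_loss, last_loss = history[0][0], history[-1][0]
```

With all-zero initial parameters the first minibatch loss equals log 2 (both classes get probability 0.5), and it decreases as the parameters are updated.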
![Page 34: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/34.jpg)
Test Workflow
• MNIST Test: 32 samples (mini-batch)
• Input feature (X*: 32 x 784)
• One-hot encoded Label (Y*: 32 x 10), e.g. for the digits 3, 7, 8, 0, …:
    0 0 0 1 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0 0
    0 0 0 0 0 0 0 0 1 0
    1 0 0 0 0 0 0 0 0 0
• Model:
    z = model(X):
        h1 = Dense(400, act = relu)(X)
        h2 = Dense(200, act = relu)(h1)
        r = Dense(10, act = None)(h2)
        return r
• Model Parameters (weights): 784 x 400 + 400 bias, 400 x 200 + 200 bias, 200 x 10 + 10 bias
• Loss: cross_entropy_with_softmax(z,Y)
• Error: classification_error(z,Y)
• Learner: this is a dummy parameter during test pass
• Trainer.test_minibatch({X, Y})
![Page 35: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/35.jpg)
Prediction Workflow
• Any MNIST image (here a 9): input feature (new X: 1 x 784)
• Model(w, b): Model.eval(new X)
• Predicted Softmax Probabilities (predicted_label): 0.02 0.09 0.03 0.03 0.01 0.02 0.02 0.06 0.02 0.70
• [ numpy.argmax(predicted_label) for predicted_label in predicted_labels ] gives [9]
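The argmax step of the prediction workflow can be reproduced without NumPy. This small sketch uses the probabilities from the slide; the variable names are illustrative.

```python
# One row of softmax probabilities per input image (values from the slide)
predicted_labels = [
    [0.02, 0.09, 0.03, 0.03, 0.01, 0.02, 0.02, 0.06, 0.02, 0.70],
]

# Index of the largest probability in each row = predicted digit
digits = [max(range(len(p)), key=p.__getitem__) for p in predicted_labels]
print(digits)  # [9]
```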
![Page 36: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/36.jpg)
![Page 37: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/37.jpg)
![Page 38: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/38.jpg)
“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”
example: 2-hidden layer feed-forward NN
$h_1 = \sigma(W_1 x + b_1)$    h1 = sigmoid (x @ W1 + b1)
$h_2 = \sigma(W_2 h_1 + b_2)$    h2 = sigmoid (h1 @ W2 + b2)
$P = \mathrm{softmax}(W_{out} h_2 + b_{out})$    P = softmax (h2 @ Wout + bout)
with input $x \in \mathbb{R}^M$ and one-hot label $y \in \mathbb{R}^J$
and cross-entropy training criterion
$ce = y^T \log P$    ce = cross_entropy (P, y)
$\sum_{corpus} ce = \max$
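The four code lines on this slide can be run as written once the operators are defined. Below is a minimal pure-Python sketch with toy dimensions (M = 4 inputs, 3 hidden units, J = 2 classes) and random weights; the helpers `matvec` and `add` stand in for `@` and `+`, and all of them are illustrative assumptions rather than CNTK functions. Here `cross_entropy` returns the negative of the slide's criterion y^T log P, so maximizing the slide's criterion over the corpus is the same as minimizing this value.

```python
import math
import random


def matvec(x, W):
    """Row vector times matrix, i.e. x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def cross_entropy(P, y):
    """Negative log-likelihood of the one-hot label y under P."""
    return -sum(y_j * math.log(P_j) for y_j, P_j in zip(y, P))


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


M, H, J = 4, 3, 2                       # toy sizes
W1, b1 = random_matrix(M, H), [0.0] * H
W2, b2 = random_matrix(H, H), [0.0] * H
Wout, bout = random_matrix(H, J), [0.0] * J

x = [1.0, 0.0, -1.0, 0.5]
y = [0, 1]

h1 = sigmoid(add(matvec(x, W1), b1))
h2 = sigmoid(add(matvec(h1, W2), b2))
P = softmax(add(matvec(h2, Wout), bout))
ce = cross_entropy(P, y)
```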
![Page 39: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/39.jpg)
“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”
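[Figure: computation graph. x, W1, and b1 feed a (•, +, σ) stage producing h1; h1, W2, and b2 feed a (•, +, σ) stage producing h2; h2, Wout, and bout feed a (•, +, softmax) stage producing P; P and y feed cross_entropy, which produces ce]
h1 = sigmoid (x @ W1 + b1)
h2 = sigmoid (h1 @ W2 + b2)
P = softmax (h2 @ Wout + bout)
ce = cross_entropy (P, y)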
![Page 40: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/40.jpg)
authoring networks as functions
• “model function”
  • features → predictions
  • defines the model structure & parameter initialization
  • holds parameters that will be learned by training
• “criterion function”
  • (features, labels) → (training loss, additional metrics)
  • defines training and evaluation criteria on top of the model function
  • provides gradients w.r.t. training criteria
[Figure: computation graph of the 2-hidden layer network, from x through h1, h2, and P (with parameters W1, b1, W2, b2, Wout, bout) to cross_entropy of P and y, which produces ce]
![Page 41: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/41.jpg)
authoring networks as functions
• CNTK model: neural networks are functions
  • pure functions
  • with “special powers”:
    • can compute a gradient w.r.t. any of its nodes
    • external deity can update model parameters
• user specifies network as function objects:
  • formula as a Python function (low level, e.g. LSTM)
  • function composition of smaller sub-networks (layering)
  • higher-order functions (equiv. of scan, fold, unfold)
  • model parameters held by function objects
• “compiled” into the static execution graph under the hood
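The idea of a network as a function object that holds its own parameters can be illustrated in plain Python. The `Dense` class below is a hypothetical stand-in, not CNTK's layer: unlike the slides' `Dense(400, ...)` it takes the input dimension explicitly, and it evaluates eagerly instead of building a static graph.

```python
import math
import random


def relu(z):
    return [max(0.0, v) for v in z]


class Dense:
    """Function object computing activation(x @ W + b); owns W and b."""

    def __init__(self, n_in, n_out, activation=None, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)]
                  for _ in range(n_in)]
        self.b = [0.0] * n_out
        self.activation = activation

    def __call__(self, x):
        z = [sum(x[i] * self.W[i][j] for i in range(len(x))) + self.b[j]
             for j in range(len(self.b))]
        return self.activation(z) if self.activation else z

    def parameter_count(self):
        return len(self.W) * len(self.b) + len(self.b)


# The layers are created once; the model is an ordinary Python function
# that composes them, and the parameters live inside the layer objects.
layers = [Dense(4, 3, relu), Dense(3, 2)]


def model(x):
    for layer in layers:
        x = layer(x)
    return x


z = model([1.0, 0.5, -0.5, 2.0])
n_params = sum(layer.parameter_count() for layer in layers)
```

An external training loop can update `layer.W` and `layer.b` in place, which matches the slide's point that someone other than the function itself changes the parameters.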
![Page 42: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/42.jpg)
Microsoft Cognitive Toolkit, CNTK
Script configures and executes through CNTK Python APIs…
• reader
  • minibatch source
  • task-specific deserializer
  • automatic randomization
  • distributed reading
• network
  • model function
  • criterion function
  • CPU/GPU execution engine
  • packing, padding
• trainer
  • SGD (momentum, Adam, …)
  • minibatching
[Figure: block diagram connecting corpus, reader, network, trainer, and model]
![Page 43: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/43.jpg)
As easy as 1-2-3

    from cntk import *

    # reader
    def create_reader(path, is_training):
        ...

    # network
    def create_model_function():
        ...
    def create_criterion_function(model):
        ...

    # trainer (and evaluator)
    def train(reader, model):
        ...
    def evaluate(reader, model):
        ...

    # main function
    model = create_model_function()

    reader = create_reader(..., is_training=True)
    train(reader, model)

    reader = create_reader(..., is_training=False)
    evaluate(reader, model)
![Page 44: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/44.jpg)
• prepare data
• configure reader, network, learner (Python)
• train:
    mpiexec --np 16 --hosts server1,server2,server3,server4 \
        python my_cntk_script.py
workflow
![Page 45: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/45.jpg)
how to: reader

    from cntk.io import (MinibatchSource, ImageDeserializer, StreamDef, StreamDefs,
                         INFINITELY_REPEAT, FULL_DATA_SWEEP)

    # image_width, image_height, num_channels and num_classes are defined elsewhere
    def create_reader(map_file, mean_file, is_training):
        # image preprocessing pipeline
        transforms = [
            ImageDeserializer.crop(crop_type='Random', ratio=0.8, jitter_type='uniRatio'),
            ImageDeserializer.scale(width=image_width, height=image_height, channels=num_channels,
                                    interpolations='linear'),
            ImageDeserializer.mean(mean_file)
        ]
        # deserializer: maps the 'image' and 'label' fields to the model's input streams
        return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
            features = StreamDef(field='image', transforms=transforms),
            labels   = StreamDef(field='label', shape=num_classes)
        )), randomize=is_training,
            epoch_size = INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)
![Page 46: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/46.jpg)
• automatic on-the-fly randomization important for large data sets
• readers compose, e.g. image → text caption
how to: reader
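The reader's behaviour (shuffle during training, repeat indefinitely for training, stop after one full sweep for evaluation) can be sketched as a plain Python generator. This is a simplified illustration: `minibatch_source` and its parameters are hypothetical names, and it shuffles an in-memory index list, whereas a reader for large data sets would randomize on the fly without loading everything.

```python
import itertools
import random


def minibatch_source(samples, minibatch_size, randomize, repeat, seed=0):
    """Yield minibatches; reshuffle before every sweep when randomize is set.

    repeat=True keeps sweeping over the data forever (training),
    repeat=False stops after one full sweep (evaluation).
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    while True:
        if randomize:
            rng.shuffle(order)
        for start in range(0, len(order), minibatch_size):
            yield [samples[i] for i in order[start:start + minibatch_size]]
        if not repeat:
            return


data = list(range(10))

# Evaluation: fixed order, one full sweep
test_batches = list(minibatch_source(data, 4, randomize=False, repeat=False))

# Training: shuffled, endless; take only the first 6 minibatches here
train_batches = list(itertools.islice(
    minibatch_source(data, 4, randomize=True, repeat=True), 6))
```

The first three training minibatches together cover every sample exactly once, in shuffled order; the next three form a second, differently shuffled sweep.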
![Page 47: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/47.jpg)
• prepare data
• configure reader, network, learner (Python)
• train: distributed!
    mpiexec --np 16 --hosts server1,server2,server3,server4 \
        python my_cntk_script.py
workflow
![Page 48: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/48.jpg)
• prepare data
• configure reader, network, learner (Python)
• train:
    mpiexec --np 16 --hosts server1,server2,server3,server4 \
        python my_cntk_script.py
• deploy
  • offline (Python): apply model file-to-file
  • your code: embed model through C++ API
  • online: web service wrapper through C#/Java API
workflow
![Page 49: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/49.jpg)
CNTK performs!
![Page 50: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/50.jpg)
Layers API
• basic blocks:
  • LSTM(), GRU(), RNNUnit()
  • Stabilizer(), identity
  • ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[]
• layers:
  • Dense(), Embedding()
  • Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution()
  • MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling()
  • BatchNormalization(), LayerNormalization()
  • Dropout(), Activation()
  • Label()
• composition:
  • Sequential(), For(), operator >>, (function tuples)
  • ResNetBlock(), SequentialClique()
• sequences:
  • Delay(), PastValueWindow()
  • Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom()
• models:
  • AttentionModel()
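The composition entries in this list (Sequential(), For(), operator >>) all express "apply one function after another". The sketch below shows the idea with a tiny, hypothetical `Function` wrapper in plain Python. The names mirror the slide, but the implementation is an assumption for illustration and operates on lists of numbers rather than on CNTK layers.

```python
class Function:
    """Wraps a callable so that f >> g means 'apply f, then g'."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __rshift__(self, other):
        return Function(lambda x: other(self(x)))


def Sequential(functions):
    """Chain a list of functions into one function."""
    def apply(x):
        for f in functions:
            x = f(x)
        return x
    return Function(apply)


def For(values, constructor):
    """Build one function per value and chain them (e.g. a stack of similar layers)."""
    return Sequential([constructor(v) for v in values])


def scale(k):
    return Function(lambda x: [k * v for v in x])


shift = Function(lambda x: [v + 1 for v in x])

f = scale(2) >> shift               # x -> 2x + 1
g = Sequential([scale(2), shift])   # the same function, written as a list
h = For([2, 3], scale)              # x -> 3 * (2 * x)

print(f([1, 2]), g([1, 2]), h([1]))  # [3, 5] [3, 5] [6]
```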
![Page 51: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/51.jpg)
Layers lib: full list of layers/blocks
• layers/blocks.py:
  • LSTM(), GRU(), RNNUnit()
  • Stabilizer(), identity
  • ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[]
• layers/layers.py:
  • Dense(), Embedding()
  • Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution()
  • MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling()
  • BatchNormalization(), LayerNormalization()
  • Dropout(), Activation()
  • Label()
• layers/higher_order_layers.py:
  • Sequential(), For(), operator >>, (function tuples)
  • ResNetBlock(), SequentialClique()
• layers/sequence.py:
  • Delay(), PastValueWindow()
  • Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom()
• models/models.py:
  • AttentionModel()
![Page 52: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/52.jpg)
![Page 53: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/53.jpg)
• higher-level features:
  • auto-tuning of learning rate and minibatch size
  • memory sharing
  • implicit handling of time
  • minibatching of variable-length sequences
  • data-parallel training
• you can do all this with other toolkits, but must write it yourself
Differentiating features
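As an illustration of what "minibatching of variable-length sequences" involves when written by hand, the following sketch pads sequences to a common length and records a mask of the real entries. This is a generic padding scheme shown under that assumption; the slides do not describe how CNTK packs sequences internally, and `pad_minibatch` is a hypothetical helper.

```python
def pad_minibatch(sequences, pad_value=0):
    """Pad sequences to the longest length; the mask marks real (1) vs. padded (0) steps."""
    max_len = max(len(s) for s in sequences)
    data = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return data, mask


data, mask = pad_minibatch([[1, 2, 3], [4], [5, 6]])
print(data)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Any loss or recurrence computed on such a batch has to ignore the masked positions, which is the bookkeeping the slide says the toolkit takes over.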
![Page 54: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/54.jpg)
deep dive: handling of time
extend our example to a recurrent network (RNN)
$h_1(t) = \sigma(W_1 x(t) + H_1 h_1(t-1) + b_1)$    h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)
$h_2(t) = \sigma(W_2 h_1(t) + H_2 h_2(t-1) + b_2)$    h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
$P(t) = \mathrm{softmax}(W_{out} h_2(t) + b_{out})$    P = softmax(h2 @ Wout + bout)
$ce(t) = L^T(t) \log P(t)$    ce = cross_entropy(P, L)
$\sum_{corpus} ce(t) = \max$
no explicit notion of time
![Page 55: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/55.jpg)
![Page 56: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/56.jpg)
Microsoft
Cognitive
Toolkit
deep dive: handling of time
extend our example to a recurrent network (RNN)
h1(t) = s(W1 x(t) + H1 h1(t-1) + b1) h1 = sigmoid(x @ W1 + past_value(h1) + b1)
h2(t) = s(W2 h1(t) + H2 h2(t-1) + b2) h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
P(t) = softmax(Wout h2(t) + bout) P = softmax(h2 @ Wout + bout)
ce(t) = LT(t) log P(t) ce = cross_entropy(P, L)
Scorpusce(t) = max
no explicit notion of time
![Page 58: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/58.jpg)
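deep dive: handling of time
[Figure: computation graph of the two-layer recurrent network. The input x feeds two stacked sigmoid layers h1 and h2 (weights W1, b1 and W2, b2); each layer has a recurrent loop through H1 or H2 and a delay element z^-1. A softmax layer (Wout, bout) produces P, and cross_entropy against the labels y produces ce.]
h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)
h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, L)
• CNTK automatically unrolls cycles → deferred computation
• Efficient and composable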
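What the recurrence expands to can be sketched in plain Python. This is not CNTK code: `unroll_rnn` is a hypothetical helper, and scalars stand in for the matrices W1 and H1, purely to show how past_value(h1) turns into a loop over time steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unroll_rnn(xs, w1, h1_weight, b1, initial_state=0.0):
    """Evaluate h1(t) = sigmoid(W1 x(t) + H1 h1(t-1) + b1) step by step.

    Scalars stand in for the matrices on the slide (an illustrative assumption).
    """
    h_prev = initial_state          # plays the role of past_value(h1) at t = 0
    outputs = []
    for x in xs:                    # one loop iteration per time step
        h = sigmoid(x * w1 + h_prev * h1_weight + b1)
        outputs.append(h)
        h_prev = h                  # becomes past_value(h1) for the next step
    return outputs

hs = unroll_rnn([1.0, 0.0, -1.0], w1=0.5, h1_weight=2.0, b1=0.0)
```

The model description itself never mentions t; the loop only appears when the graph is evaluated over a concrete sequence.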
![Page 59: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/59.jpg)
deep dive: variable-length sequences
• minibatches containing sequences of different lengths are automatically packed and padded
• CNTK handles the special cases:
  • past_value operation correctly resets state and gradient at sequence boundaries
  • non-recurrent operations just pretend there is no padding (“garbage-in/garbage-out”)
  • sequence reductions
[Figure: seven sequences of different lengths packed into parallel rows of a minibatch (sequence 1; sequences 2 and 3; sequence 4; sequences 5 and 6; sequence 7), with padding where a row runs short. Time steps of the parallel sequences are computed in parallel.]
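The packing idea can be illustrated with a small first-fit sketch. The sequence lengths below are made up, and `pack_sequences` is a hypothetical helper, not the packing routine CNTK actually uses.

```python
def pack_sequences(lengths):
    """Pack sequences (given by length) into parallel rows, first-fit.

    Row width is the longest sequence. Returns (rows, padding), where each row
    lists the indices of the sequences laid end to end in that row.
    """
    width = max(lengths)
    rows = []                        # each entry: [used_steps, [sequence indices]]
    for idx, n in enumerate(lengths):
        for row in rows:
            if row[0] + n <= width:  # sequence fits behind what is already there
                row[0] += n
                row[1].append(idx)
                break
        else:
            rows.append([n, [idx]])  # open a new parallel row
    padding = sum(width - used for used, _ in rows)
    return [members for _, members in rows], padding

# Hypothetical lengths for seven sequences.
rows, padding = pack_sequences([10, 4, 5, 9, 3, 6, 10])
```

With these lengths, five rows hold all seven sequences and only three time steps are padding. A recurrent operation would additionally have to reset its state wherever one sequence ends and the next begins inside a row.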
![Page 67: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/67.jpg)
deep dive: variable-length sequences
• minibatches containing sequences of different lengths are automatically packed and padded
• speed-up is automatic:
[Figure: seven sequences packed into parallel rows with padding; time steps of the parallel sequences are computed in parallel.]
[Figure: bar chart "Speed comparison on RNNs" (axis 0 to 25). Naïve, single sequence: 1. Optimized, multi-sequence: >20.]
![Page 68: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/68.jpg)
Deep dive: data-parallel training
• data-parallelism: distribute each minibatch over workers, then aggregate
• challenge: communication cost
• example: DNN, MB size 1024, 160M model parameters
  • compute per MB: 1/7 second
  • communication per MB: 1/9 second (640 MB over 6 GB/s)
  • can’t even parallelize to 2 GPUs: communication cost already dominates!
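The numbers in this example can be checked with a few lines of arithmetic. The sketch assumes 32-bit gradient values (which is what turns 160M parameters into 640 MB), one full gradient exchange per minibatch, and no overlap between compute and communication.

```python
params = 160e6               # model parameters (from the slide)
bytes_per_value = 4          # assumption: 32-bit floats
bandwidth = 6e9              # bytes per second (6 GB/s, from the slide)

gradient_bytes = params * bytes_per_value
comm_seconds = gradient_bytes / bandwidth   # about 0.107 s, roughly 1/9 s
compute_seconds = 1.0 / 7.0                 # per minibatch on one GPU (from the slide)

# With 2 workers the compute time halves, but the exchange still has to happen.
two_gpu_seconds = compute_seconds / 2 + comm_seconds
speedup = compute_seconds / two_gpu_seconds
```

Under these assumptions the two-GPU step takes longer than the single-GPU step, so the "speedup" is below 1.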
![Page 69: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/69.jpg)
Deep dive: data-parallel training
how to reduce communication cost:
communicate less each time
• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs”, Interspeech 2014]
  • quantize gradients to 1 bit per value
  • trick: carry over quantization error to next minibatch
communicate less often
• automatic MB sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “On Parallelizability of Stochastic Gradient Descent...”, ICASSP 2014]
• block momentum [K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training…,” ICASSP 2016]
  • very recent, very effective parallelization method
  • combines model averaging with error-residual idea
![Page 70: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/70.jpg)
data-parallel training
how to reduce communication cost:
communicate less each time
• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs”, Interspeech 2014]
  • quantize gradients to 1 bit per value
  • trick: carry over quantization error to next minibatch
[Figure: a minibatch split across GPU 1, GPU 2, and GPU 3; the gradients exchanged between the GPUs are 1-bit quantized with residual.]
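The error-carry-over trick can be sketched as follows. This is a simplified illustration, not the implementation from the cited paper: the threshold at zero and the use of the mean of each half as the reconstruction value are assumptions, and `one_bit_quantize` and `dequantize` are hypothetical names.

```python
def one_bit_quantize(grad, residual):
    """Quantize (grad + residual) to 1 bit per value.

    Returns (bits, levels, new_residual). levels = (lo, hi) are the two
    reconstruction values; new_residual is the quantization error that is
    added to the next minibatch's gradient.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    bits = [1 if v >= 0.0 else 0 for v in corrected]
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    hi = sum(pos) / len(pos) if pos else 0.0
    lo = sum(neg) / len(neg) if neg else 0.0
    decoded = [hi if b else lo for b in bits]
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, (lo, hi), new_residual

def dequantize(bits, levels):
    lo, hi = levels
    return [hi if b else lo for b in bits]

grad = [0.5, -0.25, 1.5, -0.75]
bits, levels, residual = one_bit_quantize(grad, [0.0] * 4)
decoded = dequantize(bits, levels)
```

Only `bits` and the two levels need to be communicated. Because the residual is fed back, whatever was lost in one minibatch is sent in a later one instead of being dropped.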
![Page 77: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/77.jpg)
data-parallel training
• data-parallelism: distribute minibatch over workers, all-reduce partial gradients
[Figure: node 1, node 2, and node 3 exchanging partial gradients.]
• ring algorithm: $O(2\,\frac{K-1}{K}\,M)$, i.e. $O(1)$ w.r.t. $K$
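A ring all-reduce can be simulated in a few lines to see where the 2 (K-1)/K M figure comes from. The sketch assumes K is the number of nodes and M the number of gradient values (the slide does not define them), and that the vector splits evenly into K chunks. `ring_all_reduce` is a hypothetical helper that runs in one process.

```python
def ring_all_reduce(vectors):
    """Sum equal-length vectors across K simulated nodes arranged in a ring.

    Returns (per-node results, number of values each node sent).
    """
    k = len(vectors)
    n = len(vectors[0])
    assert n % k == 0, "sketch assumes the vector splits evenly into K chunks"
    size = n // k
    data = [list(v) for v in vectors]          # each node's working copy
    sent = 0

    def chunk(c):
        return range(c * size, (c + 1) * size)

    def exchange(first_chunk_of, combine):
        # Snapshot all messages first so every send in a step is simultaneous.
        msgs = [(c, [data[i][j] for j in chunk(c)])
                for i, c in ((i, first_chunk_of(i)) for i in range(k))]
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % k                  # right-hand neighbour in the ring
            for j, v in zip(chunk(c), payload):
                data[dst][j] = combine(data[dst][j], v)

    for step in range(k - 1):                  # phase 1: reduce-scatter
        exchange(lambda i: (i - step) % k, lambda old, new: old + new)
        sent += size
    for step in range(k - 1):                  # phase 2: all-gather
        exchange(lambda i: (i + 1 - step) % k, lambda old, new: new)
        sent += size
    return data, sent

results, sent = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each node sends 2 (K-1) chunks of M/K values, so the per-node traffic stays close to 2M no matter how many nodes join the ring.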
![Page 79: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/79.jpg)
Block Momentum
• Incremental Block Training (IBT)
  • Training dataset is processed block-by-block
• Intra-Block Parallel Optimization (IBPO)
  • Master-slave architecture to exploit data parallelism
  • Each worker works independently on a split of the data block
  • Local model updates are aggregated appropriately
  • MPI-like framework to coordinate parallel job scheduling and communication
  • Redundant workers to reduce time wasted on synchronizing multiple workers
• Blockwise Model-Update Filtering (BMUF)
  • Use historic model-update information to guide the learning process
![Page 80: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/80.jpg)
Incremental Block Training (IBT)
Training dataset is processed block-by-block
Data Partition
• Randomly partition the training dataset $\mathcal{D}$ into $S$ mini-batches: $\mathcal{D} = \{\mathcal{B}_i \mid i = 1, 2, \ldots, S\}$
• Group every $\tau$ mini-batches to form a split
• Group every $N$ splits to form a data block
• The training dataset $\mathcal{D}$ consists of $M$ data blocks: $S = M \times N \times \tau$
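The partition can be sketched with mini-batch ids in place of real data. `partition` and its parameters are hypothetical names; the sketch shuffles the ids and assumes S divides exactly into M blocks of N splits of τ mini-batches.

```python
import random

def partition(num_minibatches, tau, n_splits, seed=0):
    """Arrange mini-batch ids into blocks of n_splits splits of tau mini-batches."""
    per_block = tau * n_splits
    assert num_minibatches % per_block == 0, "sketch assumes S = M * N * tau exactly"
    ids = list(range(num_minibatches))
    random.Random(seed).shuffle(ids)           # random assignment to blocks
    blocks = []
    for start in range(0, num_minibatches, per_block):
        block_ids = ids[start:start + per_block]
        splits = [block_ids[i:i + tau] for i in range(0, per_block, tau)]
        blocks.append(splits)                  # one split per parallel worker
    return blocks

# S = 24 mini-batches, tau = 3, N = 4 workers, so M = 2 data blocks.
blocks = partition(num_minibatches=24, tau=3, n_splits=4)
```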
![Page 81: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/81.jpg)
Intra-Block Parallel Optimization (IBPO)
• Randomly select an unprocessed data block, denoted as $\mathcal{D}_t$
• Distribute the $N$ splits of $\mathcal{D}_t$ to $N$ parallel workers
• Starting from an initial model denoted as $W_{init}(t)$, each worker optimizes its local model independently by 1-sweep mini-batch SGD with momentum trick
• Average the $N$ optimized local models to get $W(t)$
![Page 82: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/82.jpg)
Blockwise Model-Update Filtering (BMUF)
Block Momentum
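The formulas of this slide did not survive extraction. As a hedged sketch, the update below follows the classical block-momentum variant as described in the cited Chen & Huo paper: the averaged model gives a block-level update G(t), which is filtered with a momentum term before being applied to the global model. The function and parameter names are hypothetical, and this is not taken from the slide itself.

```python
def bmuf_update(w_prev, delta_prev, local_models, block_momentum, block_lr=1.0):
    """One blockwise model-update filtering step (classical block momentum).

    w_prev        global model W(t-1), also the workers' starting point
    delta_prev    previous filtered update Delta(t-1)
    local_models  the N locally optimized models for this block
    """
    n = len(local_models)
    w_avg = [sum(vals) / n for vals in zip(*local_models)]   # model averaging
    g = [a - w for a, w in zip(w_avg, w_prev)]               # block-level update G(t)
    delta = [block_momentum * d + block_lr * gi              # filtered update Delta(t)
             for d, gi in zip(delta_prev, g)]
    w_new = [w + d for w, d in zip(w_prev, delta)]
    return w_new, delta

# With zero momentum the step reduces to plain model averaging.
w1, d1 = bmuf_update([0.0, 0.0], [0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], block_momentum=0.0)
# With momentum, part of the previous block's update is carried forward.
w2, d2 = bmuf_update(w1, d1, [[2.0, 3.0], [2.0, 3.0]], block_momentum=0.5)
```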
![Page 83: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/83.jpg)
Iteration
• Repeat IBPO and BMUF until all data blocks are processed
  • So-called “one sweep”
• Re-partition the training set for a new sweep, repeat the above step
• Repeat the above step until a stopping criterion is satisfied
  • Obtain the final global model $W_{final}$
![Page 84: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/84.jpg)
Benchmark Result of Parallel Training on CNTK

WER (%) of CE-trained DNN with different # of GPUs

| # of GPUs | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| 1-bit | 13.1 | 13.1 | 13.1 | 13.3 | - | - |
| BMUF | 13.1 | 13.0 | 13.0 | 13.0 | 13.2 | 13.3 |

WER (%) of CE-trained LSTM with different # of GPUs

| # of GPUs | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| 1-bit | 11.1 | 10.8 | 10.6 | 11.0 | - | - |
| BMUF | 11.1 | 10.8 | 10.8 | 10.9 | 10.9 | 11.1 |

1bit/BMUF Speedup Factors in LSTM Training

| | 4 GPUs | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs |
|---|---|---|---|---|---|
| 1bit-average | 2.9 | 5.4 | 8.0 | - | - |
| 1bit-peak | 3.3 | 6.7 | 10.8 | - | - |
| BMUF-average | 3.7 | 6.9 | 13.8 | 25.5 | 43.7 |
| BMUF-peak | 4.1 | 8.1 | 14.1 | 27.3 | 54.0 |

• Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana
• About 16 and 20 days to train DNN and LSTM on 1 GPU, respectively
Credit: Yongqiang Wang, Kai Chen, Qiang Huo
![Page 85: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/85.jpg)
Impact
• Achievement
  • Almost linear speedup without degradation of model quality
  • Verified for training DNN, CNN, LSTM up to 64 GPUs for speech recognition, image classification, OCR, and click prediction tasks
• Released in CNTK as a critical differentiator
• Used for enterprise scale production data loads
• Production tools in other companies such as iFLYTEK and Alibaba
![Page 86: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/86.jpg)
![Page 87: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/87.jpg)
Sequences (one to one)
Problem: Optical character recognition of MNIST data
Input feature (X: 12 x 784)
Sample digits: 1 5 4 3 5 3 5 3 5 9 0 6
Model (hidden parameters):
• D: i = 400, O = 400, a = relu
• D: i = 400, O = 400, a = relu
• D: i = 400, O = 10, a = sigmoid
Output labels (Y: 12 x 10)
![Page 88: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/88.jpg)
Sequences (many to one)
Problem: Time series prediction with IoT data
Input feature (X: n x 14 data points)
Model
Output labels (Y: n x future prediction)
![Page 89: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/89.jpg)
Sequences (many to many)
Problem: Tagging entities in Air Travel Information Services (ATIS) data
![Page 90: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/90.jpg)
Sequences (many to many)
![Page 91: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/91.jpg)
Sequences (one to many)
Vinyals et al (https://arxiv.org/abs/1411.4555)
![Page 92: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/92.jpg)
Recurrence
Problem: Predict the output of a solar panel for a day based on past N days
[Figure: the model unrolled over time steps t = 0, 1, 2, ..., 10. At each step it takes the input $\vec{x}(t)$, emits the output $\vec{y}(t+1)$, and passes the internal state $h(t+1)$ on to the next step.]
$\vec{x}(t)$: input (n-dimensional array) at time t
$h(t)$: internal state (m-dimensional array) at time t
$\vec{y}(t)$: output (c-dimensional array) at time t; c: number of classes
D: i = n, O = m, $h = W \vec{x}^T + b$
![Page 93: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/93.jpg)
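Recurrence
[Figure: one recurrence step. The input $\vec{x}(t)$ (n values) and the previous internal state $h(t-1)$ (m values) enter a dense layer (D: i = n + m, O = m, a = none) that produces $h(t)$; a second dense layer (D: i = m, O = c, a = sigmoid) maps $h(t)$ to the output $\vec{y}(t)$.]
(W, b): parameters are shared and updated across different time steps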
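One step of this recurrence, and the reuse of the same parameters at every time step, can be sketched in plain Python. The helper names and the tiny dimensions (n = 2, m = 1, c = 1) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, inputs, activation=None):
    """Dense layer: activation(W @ inputs + b), with W given as a list of rows."""
    out = [sum(w * v for w, v in zip(row, inputs)) + b
           for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out

def recurrence_step(x, h_prev, params):
    """One time step: h(t) from [x(t), h(t-1)], then y(t) from h(t)."""
    h = dense(params["W_h"], params["b_h"], x + h_prev)    # i = n + m, O = m, a = none
    y = dense(params["W_y"], params["b_y"], h, sigmoid)    # i = m, O = c, a = sigmoid
    return h, y

def run_sequence(xs, params, m):
    h = [0.0] * m                  # initial internal state
    ys = []
    for x in xs:                   # the same (W, b) are reused at every step
        h, y = recurrence_step(x, h, params)
        ys.append(y)
    return ys

# Hypothetical parameters for n = 2, m = 1, c = 1.
params = {"W_h": [[1.0, 0.0, 0.5]], "b_h": [0.0], "W_y": [[0.0]], "b_y": [0.0]}
ys = run_sequence([[1.0, 0.0], [2.0, 0.0]], params, m=1)
```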
![Page 94: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/94.jpg)
Time-series Forecasting
![Page 95: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/95.jpg)
Recurrence
[Figure: the model unrolled over time steps t = 0, 1, 2, ..., 13. At each step it takes $\vec{x}(t)$, emits $\vec{y}(t+1)$, and passes the internal state $h$ on to the next step.]
$\vec{x}(t)$:
• For numeric data: array of numeric values coming from different sensors
• For an image: pixels in an array; map the image pixels to a compact representation (say n values)
• For a word in text: represent words as numeric vectors using embeddings (word2vec or GloVe)
![Page 96: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/96.jpg)
Recurrence (Vanishing Gradients)
Doctor Who is a British science-fiction television programme produced by the BBC since 1963. The programme depicts the adventures of the Doctor, a Time Lord, a space and time-travelling humanoid alien. He explores the universe in his TARDIS, a sentient time-travelling space ship. Accompanied by companions, the Doctor combats a variety of foes, while working to save civilizations and help people in need. This television series produced by the ….. ?
[Figure: the model unrolled over about 75 blocks, one per word of the passage ("Doctor", "Who", "is", "a", ..., "produced", "by", "the"), starting from initial state 0 and predicting the next word $\vec{y}(t)$ from $\vec{x}(t)$ at each step. The answer "BBC" depends on history from near the start of the passage.]
A single set of (W, b) has limited memory
D: i = n, O = m, $h = W \vec{x}^T + b$
![Page 97: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/97.jpg)
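Long Short-Term Memory (LSTM)
[Figure: LSTM cell. The input $\vec{x}(t)$ (n values) and the previous history $h(t-1)$ (m values) form $X$, which feeds the four blocks f, u, i, and r. The previous cell memory $\vec{C}(t-1)$ is multiplied by f and added to u × i to give $\vec{C}(t)$; tanh of $\vec{C}(t)$ multiplied by r gives $h(t)$; a softmax on $h(t)$ produces $\vec{y}(t)$.]
Forget gate (f: i = n + m, O = m, Act = sigmoid):
$\vec{f} = \mathrm{sigmoid}(W_f X^T + b_f)$
Update gate (u: i = n + m, O = m, Act = sigmoid):
$\vec{u} = \mathrm{sigmoid}(W_u X^T + b_u)$
Input (i: i = n + m, O = m, Act = tanh):
$X^* = \tanh(W_i X^T + b_i)$
Result gate (r: i = n + m, O = m, Act = sigmoid):
$\vec{r} = \mathrm{sigmoid}(W_r X^T + b_r)$
New cell memory:
$\vec{C}(t) = \vec{C}(t-1) \times \vec{f} + \vec{u} \times X^*$
New history:
$h(t) = \tanh(\vec{C}(t)) \times \vec{r}$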
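The gate equations translate almost line by line into code. The sketch below is plain Python rather than a CNTK layer; it assumes that X is the concatenation of x(t) and h(t-1) (suggested by the input size n + m) and that × means element-wise multiplication. `lstm_step` and the parameter dictionary are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, inputs):
    """W @ inputs + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; returns the new history h(t) and cell memory C(t)."""
    X = x + h_prev                                            # concatenated input, size n + m
    f = [sigmoid(z) for z in affine(p["Wf"], p["bf"], X)]     # forget gate
    u = [sigmoid(z) for z in affine(p["Wu"], p["bu"], X)]     # update gate
    i = [math.tanh(z) for z in affine(p["Wi"], p["bi"], X)]   # candidate input X*
    r = [sigmoid(z) for z in affine(p["Wr"], p["br"], X)]     # result gate
    c = [cp * fg + ug * ig for cp, fg, ug, ig in zip(c_prev, f, u, i)]
    h = [math.tanh(cv) * rg for cv, rg in zip(c, r)]
    return h, c

# All-zero parameters for n = 1, m = 1: every gate outputs 0.5, the candidate is 0.
zero = {name: [[0.0, 0.0]] for name in ("Wf", "Wu", "Wi", "Wr")}
zero.update({name: [0.0] for name in ("bf", "bu", "bi", "br")})
h, c = lstm_step([1.0], [0.0], [2.0], zero)
```

With zero weights the cell simply halves its memory each step, which shows the role of the forget gate: it decides how much of $\vec{C}(t-1)$ survives.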
![Page 98: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/98.jpg)
![Page 99: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/99.jpg)
Sequences (many to many) - Classification
Problem: Tagging entities in Air Travel Information Services (ATIS) data
![Page 100: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/100.jpg)
ATIS Data
Domain: ATIS contains human-computer queries from the domain of Air Travel Information Services.
Data Summary:
• 943 unique words (a.k.a. vocabulary)
• 129 unique tags (a.k.a. labels)
• 26 intent tags (not used in this tutorial)
![Page 101: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/101.jpg)
| Sequence Id | Input Word (sample) | Word Index (in vocabulary, S0) | Word Label | Label Index (S2) |
|---|---|---|---|---|
| 19 | BOS | 178:1 | O | 128:1 |
| 19 | please | 688:1 | O | 128:1 |
| 19 | give | 449:1 | O | 128:1 |
| 19 | me | 581:1 | O | 128:1 |
| 19 | the | 827:1 | O | 128:1 |
| 19 | flights | 429:1 | O | 128:1 |
| 19 | from | 444:1 | O | 128:1 |
| 19 | boston | 266:1 | B-fromloc.city_name | 48:1 |
| 19 | to | 851:1 | O | 128:1 |
| 19 | pittsburgh | 682:1 | B-toloc.city_name | 78:1 |
| 19 | on | 654:1 | O | 128:1 |
| 19 | thursday | 845:1 | B-depart_date.day_name | 26:1 |
| 19 | of | 646:1 | O | 128:1 |
| 19 | next | 621:1 | B-depart_date.date_relative | 25:1 |
| 19 | week | 910:1 | O | 128:1 |
| 19 | EOS | 179:1 | O | 128:1 |

• Sequence Id: 19 indicates that this sentence is the 19th sentence in the data set
• Word Index: ###:1 indicates the position of the corresponding word in the vocabulary (total 943 words)
• Label Index: ###:1 indicates the position of the corresponding tag in the tag index (total 129 tags)
![Page 102: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/102.jpg)
Sequence Tagging (Input / Label Preo
Vectorize Input Tokens (Step 1):
- Create a numerical representation of the input words
- This step is called Embedding
For MNIST data we had:
1 5 4 3 5 3 5 3 5 9 0 6
Label, one-hot encoded (Y), e.g. for the digit 3:
0 0 0 1 0 0 0 0 0 0
For word data, the one-hot encoding looks like this (vocabulary size 943):
0 ... 0 1 0 ... 0 (1 at the 266th element; the vector ends at the 943rd element)
For the label data, the one-hot representation is a 129-dimensional vector
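One-hot encoding is short enough to write out. `one_hot` is a hypothetical helper using 0-based positions; the sizes 943 and 129 are the vocabulary and tag counts from the ATIS data summary.

```python
def one_hot(index, size):
    """Return a vector of `size` zeros with a single 1 at position `index` (0-based)."""
    vec = [0] * size
    vec[index] = 1
    return vec

digit = one_hot(3, 10)       # MNIST label 3
word = one_hot(266, 943)     # word index 266 ("boston") in a 943-word vocabulary
label = one_hot(48, 129)     # tag index 48 (B-fromloc.city_name) among 129 tags
```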
![Page 103: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/103.jpg)
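Model
[Figure: one time step of the tagging model. The one-hot input $\vec{x}(t)$ (0 0 1 0 ...) passes through an embedding layer E, a recurrent layer L that receives $h(t-1)$ and emits $h(t)$, and a dense layer D that produces $\vec{y}(t)$.]
• E: i = 943, O = 150
• L: i = 150, O = 300
• D: i = 300, O = 129, a = sigmoid
Embedding Layer (E):
- Projects a word in the input into a vector space: $W_e X^T$ (simple linear embedding)
- Here the weight matrix has dimension of 943 x 150
- Alternatively, a more advanced embedding such as GloVe can be used as $W_e$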
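Multiplying a one-hot vector by the embedding weight matrix just selects one row of that matrix, which is why an embedding can be implemented as a table lookup. The sketch uses a randomly initialized 943 x 150 table as a stand-in for learned weights; the function names are hypothetical.

```python
import random

def embed_by_matmul(one_hot_vec, table):
    """Multiply a one-hot row vector by the embedding table (vocab x dim)."""
    dim = len(table[0])
    return [sum(one_hot_vec[v] * table[v][d] for v in range(len(table)))
            for d in range(dim)]

def embed_by_lookup(index, table):
    """Pick the row for `index` directly; same result without the multiplication."""
    return list(table[index])

rng = random.Random(0)
vocab_size, dim = 943, 150
table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

x = [0] * vocab_size
x[266] = 1                         # one-hot vector for word index 266
via_matmul = embed_by_matmul(x, table)
via_lookup = embed_by_lookup(266, table)
```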
![Page 104: With many contributors - cntk.ai€¦ · With many contributors: A. Agarwal, E. Akchurin, ... arbitrary neural networks by composing simple ... MNIST Handwritten Digits](https://reader031.vdocuments.site/reader031/viewer/2022021822/5b2c167e7f8b9a163e8bbabd/html5/thumbnails/104.jpg)
examples: language understanding
Task: Slot tagging with an LSTM
19 |x 178:1 |# BOS |y 128:1 |# O
19 |x 770:1 |# show |y 128:1 |# O
19 |x 429:1 |# flights |y 128:1 |# O
19 |x 444:1 |# from |y 128:1 |# O
19 |x 272:1 |# burbank |y 48:1 |# B-fromloc.city_name
19 |x 851:1 |# to |y 128:1 |# O
19 |x 789:1 |# st. |y 78:1 |# B-toloc.city_name
19 |x 564:1 |# louis |y 125:1 |# I-toloc.city_name
19 |x 654:1 |# on |y 128:1 |# O
19 |x 601:1 |# monday |y 26:1 |# B-depart_date.day_name
19 |x 179:1 |# EOS |y 128:1 |# O
  y         "O"        "O"        "O"        "O"  "B-fromloc.city_name"
            ^          ^          ^          ^          ^
            |          |          |          |          |
        +-------+  +-------+  +-------+  +-------+  +-------+
        | Dense |  | Dense |  | Dense |  | Dense |  | Dense |  ...
        +-------+  +-------+  +-------+  +-------+  +-------+
            ^          ^          ^          ^          ^
            |          |          |          |          |
        +------+   +------+   +------+   +------+   +------+
   0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...
        +------+   +------+   +------+   +------+   +------+
            ^          ^          ^          ^          ^
            |          |          |          |          |
        +-------+  +-------+  +-------+  +-------+  +-------+
        | Embed |  | Embed |  | Embed |  | Embed |  | Embed |  ...
        +-------+  +-------+  +-------+  +-------+  +-------+
            ^          ^          ^          ^          ^
            |          |          |          |          |
  x ------>+--------->+--------->+--------->+--------->+------...
           BOS      "show"    "flights"    "from"   "burbank"
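The sample data lines above can be read with a few lines of code. The sketch below assumes the layout visible in the sample: a sequence id, then `|x` and `|y` fields holding a sparse `index:value` pair, and `|#` fields holding human-readable comments. The function name `parse_ctf_line` is hypothetical and is not part of CNTK.

```python
def parse_ctf_line(line):
    """Parse one line such as
    '19 |x 272:1 |# burbank |y 48:1 |# B-fromloc.city_name'."""
    fields = [f.strip() for f in line.split("|")]
    record = {"sequence_id": int(fields[0]), "comments": []}
    for field in fields[1:]:
        name, _, value = field.partition(" ")
        if name == "#":
            # Human-readable annotation (the word or the slot label).
            record["comments"].append(value)
        else:
            # Sparse one-hot entry: position of the 1 and its value.
            index, _, weight = value.partition(":")
            record[name] = (int(index), float(weight))
    return record

rec = parse_ctf_line("19 |x 272:1 |# burbank |y 48:1 |# B-fromloc.city_name")
print(rec)
```

All lines that share the same sequence id (19 here) belong to one sentence, which is how the reader groups words into a sequence.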
examples: language understanding
Task: Slot tagging with an LSTM
model = Sequential([
    Embedding(150),
    RecurrentLSTM(300),
    Dense(labelDim)
])
  y         "O"          "O"        "O"        "O"  "B-fromloc.city_name"
            ^            ^          ^          ^          ^
            |            |          |          |          |
        +-------+    +-------+  +-------+  +-------+  +-------+
        | Dense |    | Dense |  | Dense |  | Dense |  | Dense |  ...
        +-------+    +-------+  +-------+  +-------+  +-------+
            ^            ^          ^          ^          ^
     +<-----|-----+      |          |          |          |
     |  +------+  |  +------+   +------+   +------+   +------+
     +->| LSTM |->+  | LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...
        +------+     +------+   +------+   +------+   +------+
            ^            ^          ^          ^          ^
            |            |          |          |          |
        +-------+    +-------+  +-------+  +-------+  +-------+
        | Embed |    | Embed |  | Embed |  | Embed |  | Embed |  ...
        +-------+    +-------+  +-------+  +-------+  +-------+
            ^            ^          ^          ^          ^
            |            |          |          |          |
  x ------->+----------->+--------->+--------->+--------->+------...
           BOS        "show"    "flights"    "from"   "burbank"
Error or Loss Function
Loss function: cross entropy error
$ce = -\sum_{j=0}^{128} y_j \log p_j$
Input $\vec{x}(t)$: 943-dim one-hot vector (0 0 1 0 ...)
Model output: predicted probabilities (p), 129 dim
Label, one-hot encoded ($\vec{y}(t)$): 129 dim (0 0 0 0 0 ...)
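As a worked instance of the cross entropy error, the sketch below applies a softmax to made-up raw scores and evaluates the loss against a one-hot label. The scores and the chosen label index (48) are illustrative assumptions, not values from the tutorial.

```python
import math

def softmax(z):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, p):
    """ce = -sum_j y_j * log(p_j); only the entries with y_j > 0 contribute."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p) if yj > 0)

NUM_LABELS = 129
scores = [0.0] * NUM_LABELS
scores[48] = 5.0            # made-up model output favoring label 48

y = [0.0] * NUM_LABELS
y[48] = 1.0                 # one-hot label

p = softmax(scores)
ce = cross_entropy(y, p)
print(round(ce, 4))
```

Because the label is one-hot, the sum collapses to the negative log-probability the model assigns to the correct class.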
Train Workflow
ATIS Train: 128 samples (mini-batch), numbered 1, 2, 3, ..., 128
Input feature (X: 128 x $\vec{x}(t)$)
One-hot encoded label (Y: 128 x 129/sample, or word in sequence)
Model:
def model():
    return Sequential([
        Embedding(150),                             # emb_dim = 150
        Recurrence(LSTM(300), go_backwards=False),  # hidden_dim = 300
        Dense(129)                                  # num_labels = 129
    ])
z = model()
Loss: cross_entropy_with_softmax(z, Y)
Error (optional): classification_error(z, Y)
Learner: sgd, adagrad, etc. are solvers to estimate the model parameters
Trainer(model, (loss, error), learner)
Trainer.train_minibatch({X, Y})
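The optional error metric can be sketched in plain Python. This is only an illustration of the quantity being reported (the fraction of samples whose top-scoring class differs from the label), not CNTK's implementation; the toy batch uses 4 samples and 3 classes instead of 128 samples and 129 labels.

```python
def classification_error(scores_batch, label_batch):
    """Fraction of samples whose highest-scoring class is not the true label."""
    wrong = 0
    for scores, label in zip(scores_batch, label_batch):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        wrong += predicted != label
    return wrong / len(label_batch)

# Toy minibatch: model outputs z and true label indices Y.
z = [[2.0, 0.1, 0.3],
     [0.2, 1.5, 0.1],
     [0.9, 0.8, 0.1],   # predicted class 0, true class 1: an error
     [0.1, 0.2, 3.0]]
Y = [0, 1, 1, 2]

err = classification_error(z, Y)
print(err)
```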
Test Workflow
ATIS Test: 128 samples (mini-batch), numbered 1, 2, 3, ..., 128
Input feature (X: 128 x $\vec{x}(t)$)
One-hot encoded label (Y: 128 x 129/sample, or word in sequence)
Model:
def model():
    return Sequential([
        Embedding(150),                             # emb_dim = 150
        Recurrence(LSTM(300), go_backwards=False),  # hidden_dim = 300
        Dense(129)                                  # num_labels = 129
    ])
z = model()
Loss: cross_entropy_with_softmax(z, Y)
Error (optional): classification_error(z, Y)
Learner: sgd, adagrad, etc. are solvers to estimate the model parameters
Trainer(model, (loss, error), learner)
Trainer.test_minibatch({X, Y})
Test Workflow
ATIS Test: 32 samples (mini-batch), numbered 1, 2, 3, ..., 32
Input feature (X: 32 x $\vec{x}(t)$)
One-hot encoded label (Y: 32 x 129/sample, or word in sequence)
Model:
def model():
    return Sequential([
        Embedding(150),                             # emb_dim = 150
        Recurrence(LSTM(300), go_backwards=False),  # hidden_dim = 300
        Dense(129)                                  # num_labels = 129
    ])
z = model()
Loss: cross_entropy_with_softmax(z, Y)
Error (optional): classification_error(z, Y)
Learner: sgd, adagrad, etc. are solvers to estimate the model parameters
Trainer(model, (loss, error), learner)
Trainer.test_minibatch({X, Y})
Prediction Workflow
Any data string: 'BOS flights from new york to seattle EOS'
Input feature (new X: 1 x 8 x (1x943))
Model.eval(new X)
Predicted softmax probabilities
Output prediction (1 x 8 x (1x129))
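The shapes in the prediction workflow can be traced with a small sketch. Everything model-related here is a stand-in: `fake_probs` replaces the output of `Model.eval`, and the vocabulary indices for "new", "york" and "seattle" are made up (the others are taken from the sample data shown earlier).

```python
VOCAB_SIZE, NUM_LABELS = 943, 129

def to_one_hot_sequence(sentence, word_to_index):
    """Turn a whitespace-tokenized sentence into a list of 943-dim one-hot vectors."""
    sequence = []
    for word in sentence.split():
        vec = [0.0] * VOCAB_SIZE
        vec[word_to_index[word]] = 1.0
        sequence.append(vec)
    return sequence

def decode(probabilities):
    """Pick the most probable label index for every word."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]

sentence = "BOS flights from new york to seattle EOS"
# Hypothetical vocabulary; indices 0, 1, 2 are placeholders.
word_to_index = {"BOS": 178, "flights": 429, "from": 444, "new": 0,
                 "york": 1, "to": 851, "seattle": 2, "EOS": 179}

new_X = to_one_hot_sequence(sentence, word_to_index)   # 8 x 943

# Stand-in for Model.eval(new_X): label 128 ("O") gets probability 0.5,
# the remaining mass is spread evenly over the other 128 labels.
fake_probs = [[0.5 if j == 128 else 0.5 / (NUM_LABELS - 1)
               for j in range(NUM_LABELS)] for _ in new_X]  # 8 x 129

labels = decode(fake_probs)
print(len(new_X), len(new_X[0]), labels)
```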
Neural network paradigms
Background
Sequence-to-sequence models were first described in the context of machine translation: Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” (2014). https://arxiv.org/abs/1406.1078.
They are a natural fit for:
Automatic text summarization:
• Input sequence: full document
• Output sequence: summary document
Word to pronunciation models:
• Input sequence: character [grapheme]
• Output sequence: pronunciation [phoneme]
Question-answering models:
• Input sequences: query and document
• Output sequence: answer
Basic Theory
A sequence-to-sequence model consists of two main pieces, plus an optional third:
(1) an encoder,
(2) a decoder, and
(3) an attention module (optional)
Sequence to Sequence Mechanism:
Encoder
• Processes the input sequence into a fixed representation
• This representation is fed into the decoder as a context, a.k.a. thought vector
Decoder
• Uses some mechanism to decode the processed information into an output sequence
• This is a language model that is augmented with some "strong context"
• Each symbol that it generates is fed back into the decoder for additional context
What is a “thought vector”?
Term popularized by Geoffrey Hinton:
“What I think is going to happen over the next few years is this ability to turn sentences into thought vectors is going to rapidly change the level at which we can understand documents”
What is a thought vector: like an embedding (similar to word2vec and GloVe), but instead of a single word it encodes several words, or ideas, or… a “thought”
In basic sequence to sequence, the thought vector represents:
• the encoded version of the input sequence after running it through the encoder RNN
• the hidden state of the encoder after all of the words in the input sequence have passed through it
• The decoder’s hidden state is then initialized with this thought vector
Example
[Figure: a recurrent unit. The input $\vec{x}(t)$ (dimension n) and the internal state $h(t-1)$ (dimension m) enter a dense layer D (i = n + m, O = m, a = none) with parameters $(W, b)$, giving $h(t)$; a second dense layer D (i = m, O = n, a = none) gives the output $\vec{o}(t)$.]
$h_t = \tanh(U x_t + W h_{t-1})$
$o_t = V h_t$
def step(x):
    h = C.tanh(C.times(U, x) + C.times(W, h))
    o = C.times(V, h)
    return o
For every input:
• the hidden state is updated
• some output is returned
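The two equations can be run step by step in plain Python. The sketch below is independent of CNTK; the weights `U`, `W`, `V` and the toy sizes (n = 2, m = 3) are made up for illustration, and the hidden state is passed explicitly from one step to the next.

```python
import math

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x, h_prev, U, W, V):
    """One step: h_t = tanh(U x_t + W h_{t-1}), o_t = V h_t."""
    pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h_prev))]
    h = [math.tanh(p) for p in pre]
    o = matvec(V, h)
    return h, o

# Toy sizes: input dimension n = 2, hidden dimension m = 3.
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]                   # m x n
W = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]     # m x m
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]                      # n x m

h = [0.0, 0.0, 0.0]          # initial hidden state
outputs = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, o = rnn_step(x, h, U, W, V)   # hidden state is updated, output returned
    outputs.append(o)

print(outputs)
```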
Sequence to Sequence Decoder
In the sequence-to-sequence decoder:
• Output o is projected through a dense layer and softmax function
• The resultant word is directed back into the decoder as the input for the next step
• This is a greedy-decoding approach
Sequence to Sequence Decoder
Steps in decoding:
First step is to initialize the decoder RNN with the thought vector as its hidden state
Use a "sequence start" tag (e.g. <s>) as input to prime the decoder to start generating an output sequence
The decoder keeps generating outputs until it hits the special "end sequence" tag (e.g. </s>)
def model_greedy(input):  # (input*) --> (word_sequence*)
    # Decoding is an unfold() operation starting from sentence_start.
    # We must transform s2smodel (history*, input* -> word_logp*)
    # into a generator (history* -> output*) which holds 'input' in its closure.
    unfold = UnfoldFrom(lambda history: s2smodel(history, input) >> hardmax,
                        # stop once sentence_end_index was max-scoring output
                        until_predicate=lambda w: w[..., sentence_end_index],
                        length_increase=length_increase)
    return unfold(initial_state=sentence_start, dynamic_axes_like=input)
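The decoding steps listed above can be written as an explicit loop. This is a framework-free sketch, not the CNTK `UnfoldFrom` mechanism: `toy_step` is a hypothetical stand-in for the decoder RNN, and the integer state stands in for the thought vector.

```python
def greedy_decode(step, initial_state, start_token, end_token, max_len=20):
    """Feed each predicted token back in until end_token (or max_len) is reached."""
    state, token, output = initial_state, start_token, []
    for _ in range(max_len):
        state, scores = step(state, token)
        token = max(scores, key=scores.get)   # greedy: take the top-scoring token
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical decoder: a fixed transition table instead of a trained RNN.
TABLE = {"<s>": "a", "a": "b", "b": "c", "c": "</s>"}

def toy_step(state, token):
    scores = {t: 0.0 for t in ["a", "b", "c", "</s>"]}
    scores[TABLE[token]] = 1.0
    return state + 1, scores

result = greedy_decode(toy_step, initial_state=0,
                       start_token="<s>", end_token="</s>")
print(result)
```

The `max_len` cap plays a role similar to `length_increase` in the slide code: it bounds the output length in case the end tag is never produced.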
Sequence to Sequence Problems
Squeezing all the input sequence information into a single vector
• At each time step, the hidden state h gets updated with the most recent information, and therefore h gets "diluted" in information as it processes each token
Token position influence
• Even with a relatively short sequence, the last token will always get the last say, and therefore the thought vector is biased/weighted towards that last word
For machine translation
• We run the encoder backwards also to help mitigate this problem
• Need a more systematic approach
Attention Mechanism
psyllium vs. psychology
Helps to solve the “long-sequence” and alignment problem
Replace the single thought vector (used only as an initial context) with:
• Each decoding step directly uses information from the encoder
• All of the hidden states from the encoder are available to us (instead of just the final one); and
• The decoder learns which weighted sum of hidden states, given the current context and input, to use
How is it done:
• Learn which encoder hidden states are important given current context and input; and
• Augment the decoder’s current hidden state with information from those states
Attention Mechanism
Key Idea: Learn which encoder states are important given current context and input
1. Compute similarity between different encoder states w.r.t. a given decoder state:
• Dot product between $h_i$ and $d$
• Cosine similarity between $h_i$ and $d$
• Projected similarity given by $u_i = v^T \tanh(W_1 h_i + W_2 d)$
where $h_i$ is the hidden state for each encoding RNN unit and $d$ is the corresponding decoder state.
Note: $v$ is a learnable vector parameter; $W_1$ and $W_2$ are learnable matrix parameters.
Finally, the attention score for a given comparison can be computed as: $a_i = \mathrm{softmax}(u_i)$
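The three similarity options and the final softmax can be compared on a toy example. The encoder states, decoder state and the parameters `v`, `W1`, `W2` below are made-up values; in a real model the parameters are learned.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def matvec(M, v):
    return [dot(row, v) for row in M]

def projected(h, d, v, W1, W2):
    """u_i = v^T tanh(W1 h_i + W2 d)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), matvec(W2, d))]
    return dot(v, hidden)

def softmax(u):
    """Normalize the scores u_1..u_T jointly so that they sum to 1."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1, h_2, h_3
d = [1.0, 0.5]                                          # decoder state
v = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]

dot_scores = [dot(h, d) for h in encoder_states]
cos_scores = [cosine(h, d) for h in encoder_states]
u = [projected(h, d, v, W1, W2) for h in encoder_states]
a = softmax(u)                                          # attention scores
print(dot_scores, a)
```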
2. Augment the decoder’s current hidden state with information from those states
Create a vector in the same space as the hidden states that consists of a weighted sum of the encoder hidden states:
$d' = \sum_{i=1}^{T_A} a_i h_i$
New hidden state for predicting the current word: $D = d + d'$
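A small numeric instance of this step, with made-up attention scores that sum to 1 (the function name `attend` is hypothetical):

```python
def attend(encoder_states, weights, d):
    """Return D = d + d', where d' = sum_i a_i * h_i."""
    d_prime = [sum(a * h[k] for a, h in zip(weights, encoder_states))
               for k in range(len(d))]
    return [x + y for x, y in zip(d, d_prime)]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1, h_2, h_3
a = [0.5, 0.25, 0.25]                                   # attention scores
d = [1.0, 0.5]                                          # current decoder state

D = attend(encoder_states, a, d)
print(D)
```

Here d' works out to [0.75, 0.5], so the augmented state is [1.75, 1.0].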
Decoding with Attention
Take a greedy approach and output the most probable word at each step
• Does not work well in practice
Consider every single combination at each step
• However, that is generally computationally intractable
Strike a compromise using beam search
• Instead, we use a beam search decoder with a given depth
• The depth parameter controls how many best candidate solutions to keep at each step
• This results in a heuristic for the global optimum that works quite well
• Indeed, a beam search of depth 3 gives very good results in most situations.
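The trade-off between greedy decoding and beam search can be seen on a toy model. The sketch below is a generic beam search, not the CNTK decoder; `toy_model` is a hypothetical next-token distribution constructed so that the locally best first token leads to a worse overall sequence.

```python
import math

def beam_search(next_log_probs, start, end, depth=3, max_len=10):
    """Keep the `depth` best partial sequences at every step."""
    beams = [([start], 0.0)]          # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:depth]:
            # Sequences that produced the end tag leave the beam.
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

def toy_model(tokens):
    """Made-up next-token log-probabilities, conditioned on the last token."""
    last = tokens[-1]
    if last == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if last == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if last == "b":
        return {"z": math.log(0.9), "x": math.log(0.1)}
    return {"</s>": 0.0}

greedy = beam_search(toy_model, "<s>", "</s>", depth=1)
beam = beam_search(toy_model, "<s>", "</s>", depth=3)
print(greedy, beam)
```

With depth 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3; with depth 3 it keeps "b" alive and finds the sequence through "z" with probability 0.36.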
ReasoNet: Learning to Stop Reading in Machine Comprehension
Yelong Shen, Po-Sen Huang, Jianfeng Gao, Weizhu Chen
Microsoft Research
CNTK Tutorial: Pengcheng He, Amit Agarwal, Sayan Pathak
Problem Definition
• Machine Comprehension
• Teach machine to answer questions given an input passage
Query Who is the producer of Doctor Who?
Passage Doctor Who is a British science-fiction television programme produced by the BBC since 1963. The programme depicts the adventures of the Doctor, a Time Lord—a space and time-travelling humanoid alien. He explores the universe in his TARDIS, a sentient time-travelling space ship. Its exterior appears as a blue British police box, which was a common sight in Britain in 1963 when the series first aired. Accompanied by companions, the Doctor combats a variety of foes, while working to save civilisations and help people in need.
Answer BBC
Related Work
Single-step reasoning [Kadlec et al. 2016, Chen et al. 2016]
Multiple-step reasoning [Hill et al. 2016, Trischler et al. 2016, Dhingra et al. 2016, Sordoni et al. 2016, Kumar et al. 2016]
• How many steps?
[Figure: a single-step model, in which the query attends over the passage once to produce Xt, next to a multi-step model with states S1, ..., St, St+1, St+2, in which the attention fatt(θx) over the passage produces Xt, Xt+1 at successive steps.]
Different levels of complexity
Easier
Query Who was the 2015 NFL MVP?
Passage The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP).
Answer Cam Newton
Harder
Query Who was the #2 pick in the 2011 NFL Draft?
Passage Manning was the #1 selection of the 1998 NFL draft, while Newton was picked first in 2011. The matchup also pits the top two picks of the 2011 draft against each other: Newton for Carolina and Von Miller for Denver.
Answer Von Miller
ReasoNet: Learning to Stop Reading
• Dynamic termination based on the complexity of query and passage
• Instance-based RL objectives
ReasoNet Architecture
[Figure, built up over several slides: the query and the passage feed an attention module fatt(θx) that produces Xt, Xt+1; a controller updates the internal states S1, ..., St, St+1, St+2; at each step a termination gate ftg(θtg) outputs Tt; if Tt is True, the answer module fa(θa) emits the answer at, and if it is False, reasoning continues with the next step.]
RL Objectives
• Action: termination, answer
• Reward: 1 if the answer is correct, 0 otherwise (Delayed Reward)
• Expected total reward
• REINFORCE algorithm
• Instance-based baseline
[Figure: the ReasoNet unrolled to its final step T, with states S1, ..., ST-1, ST, attention output XT-1, termination gates TT-1 and TT, and the answer aT emitted when the gate is True.]
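The delayed reward and the expected total reward can be made concrete for a single instance. The sketch below is a simplified illustration under stated assumptions, not the ReasoNet training code: the termination probabilities are made up, the answer at each step is treated as known to be right or wrong, termination is forced at the last step, and the baseline is taken to be the expected reward of this one instance.

```python
def episode_probabilities(stop_probs):
    """P(terminate exactly at step t), forcing termination at the last step."""
    probs, still_running = [], 1.0
    for t, p in enumerate(stop_probs):
        p_stop = 1.0 if t == len(stop_probs) - 1 else p
        probs.append(still_running * p_stop)
        still_running *= 1.0 - p_stop
    return probs

stop_probs = [0.5, 0.5, 0.5]   # made-up termination-gate outputs, T = 3
rewards = [0.0, 1.0, 1.0]      # 1 if the answer emitted at that step is correct

pi = episode_probabilities(stop_probs)
expected_reward = sum(p * r for p, r in zip(pi, rewards))

# REINFORCE weights each episode's log-probability gradient by
# (reward - baseline); here the baseline is this instance's expected reward.
advantages = [r - expected_reward for r in rewards]
print(pi, expected_reward, advantages)
```

Episodes that stop too early (step 1, wrong answer) get a negative weight, so their probability is pushed down, while later, correct episodes are reinforced.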
CNN / Daily Mail Reading Comprehension Task
Query passenger @placeholder , 36 , died at the scene
Passage ( @entity0 ) what was supposed to be a fantasy sports car ride at @entity3 turned deadly when a @entity4 crashed into a guardrail . the crash took place sunday at the @entity8 , which bills itself as a chance to drive your dream car on a racetrack . the @entity4 's passenger , 36 - year - old @entity14 of @entity15 , @entity16 , died at the scene , @entity13 said . the driver of the @entity4 , 24 -year - old @entity18 of @entity19 , @entity16 , lost control of the vehicle , the @entity13 said . he was hospitalized with minor injuries . @entity24 , which operates the @entity8 at @entity3 , released a statement sunday night about the crash . " on behalf of everyone in the organization , it is with a very heavy heart that we extend our deepest sympathies to those involved in today 's tragic accident in @entity36 , " the company said . @entity24 also operates the @entity3 -- a chance to drive or ride in @entity39 race cars named for the winningest driver in the sport 's history . @entity0 's @entity43 and @entity44 contributed to this report .
Answer @entity14
[Hermann et al. 2015]
Termination Step Histogram
[Figure: histogram of ReasoNet termination steps on the CNN dataset; x-axis: step (1 to 5); y-axis ticks from 0 to 2500.]
Results
Accuracy (%)

| Reasoning | Model | CNN | Daily Mail |
|---|---|---|---|
| Single step | Attentive Reader [Hermann et al. 15] | 63.0 | 69.0 |
| Single step | AS Reader [Kadlec et al. 16] | 69.5 | 73.9 |
| Single step | Stanford AR [Chen et al. 16] | 72.4 | 75.8 |
| Multiple steps | Iterative AR [Sordoni et al. 16] | 73.3 | - |
| Multiple steps | EpiReader [Trischler et al. 16] | 74.0 | - |
| Multiple steps | GA Reader [Dhingra et al. 16] | 73.8 | 75.7 |
| Multiple steps | ReasoNet (Sep 17 2016) | 74.7 | 76.6 |
| Multiple steps | BIDAF [Seo et al. 16] (Nov 5 2016) | 77.1 | 78.3 |
CNTK’s approach to the two key questions:
• efficient network authoring
  • networks as function objects, well-matching the nature of DNNs
  • focus on what, not how
  • familiar syntax and flexibility in Python
• efficient execution
  • graph → parallel program through automatic minibatching
  • symbolic loops with dynamic scheduling
  • unique parallel training algorithms (1-bit SGD, Block Momentum)
on our roadmap
• integration with C#/.Net, R, Keras, HDFS, and Spark
  • continued C#/.Net integration; R
  • Keras back-end
  • HDFS
  • Spark
• technology
  • handle models too large for GPU
  • optimized nested recurrence
  • ASGD
  • 16-bit support, ARM, FPGA
Cognitive Toolkit: deep learning like Microsoft product groups
• ease of use
  • what, not how
  • powerful library
  • minibatching is automatic
• fast
  • optimized for NVidia GPUs & libraries
  • easy yet best-in-class multi-GPU/multi-server support
• flexible
  • Python and C++ API, powerful & composable
  • 1st-class on Linux and Windows
• train like MS product groups: internal=external version
Cognitive Toolkit: democratizing the AI tool chain
• Web site: https://cntk.ai/
• Docs: https://cntk.ai/pythondocs
• Github: https://github.com/Microsoft/CNTK
• Wiki: https://github.com/Microsoft/CNTK/wiki
Ask Questions: www.stackoverflow.com with cntk tag