common design of deep learning frameworks
![Page 1: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/1.jpg)
Tutorial: Deep Learning Implementations and Frameworks
Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+
*Preferred Networks, Inc. (PFN) {tokui,oono}@preferred.jp
+National Institute of Advanced Industrial Science and Technology (AIST) [email protected], [email protected]
![Page 2: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/2.jpg)
Overview of this tutorial
• 1st session (KO, 8:30 ‒ 10:00)
  • Introduction
  • Basics of neural networks
  • Common design of neural network implementations
• 2nd session (ST, 10:30 ‒ 12:30)
  • Differences of deep learning frameworks
  • Coding examples of frameworks
  • Conclusion
![Page 3: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/3.jpg)
Common Design of Deep Learning Frameworks
Kenta Oono <[email protected]>
Preferred Networks Inc.
2016/4/19 3DLIF Tutorial @ PAKDD2016
![Page 4: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/4.jpg)
Objective of this part
• How deep learning frameworks represent various neural networks.
• How deep learning frameworks realize the training procedure of neural networks.
• The technology stack that is common to most deep learning frameworks.
![Page 5: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/5.jpg)
Steps for training neural networks
• Prepare the training dataset
• Initialize the Neural Network (NN) parameters
• Repeat until meeting some criterion
  • Prepare for the next (mini) batch
  • Define how to compute the loss of this batch
  • Compute the loss (forward prop)
  • Compute the gradient (backprop)
  • Update the NN parameters
• Save the NN parameters
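The training steps on this slide can be sketched as a plain NumPy loop. This is only an illustration under stated assumptions, not any framework's API: the single affine layer, the synthetic dataset, the hyperparameters, and the file name `params.npz` are all chosen for the example.

```python
import os
import tempfile

import numpy as np


def train(n_epochs=50, batch_size=16, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)

    # Prepare the training dataset (synthetic: y = 2x + 1 plus small noise).
    x = rng.uniform(-1.0, 1.0, size=(256, 1))
    y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=(256, 1))

    # Initialize the NN parameters (a single affine layer here).
    w = np.zeros((1, 1))
    b = np.zeros(1)

    # Repeat until meeting some criterion (a fixed number of epochs here).
    for _ in range(n_epochs):
        order = rng.permutation(len(x))
        for start in range(0, len(x), batch_size):
            # Prepare the next mini-batch.
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]

            # Define and compute the loss of this batch (forward prop).
            pred = xb @ w + b
            loss = np.mean((pred - yb) ** 2)

            # Compute the gradient (backprop), derived by hand for this model.
            grad_pred = 2.0 * (pred - yb) / len(xb)
            grad_w = xb.T @ grad_pred
            grad_b = grad_pred.sum(axis=0)

            # Update the NN parameters (plain SGD).
            w -= lr * grad_w
            b -= lr * grad_b

    return w, b, loss


w, b, final_loss = train()

# Save the NN parameters.
path = os.path.join(tempfile.gettempdir(), "params.npz")
np.savez(path, w=w, b=b)
```

A DL framework automates most of these steps, in particular the hand-derived gradient.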
![Page 6: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/6.jpg)
Technology stack of DL framework
| Name | Functions | Examples |
|---|---|---|
| Graphical interface | | DIGITS, TensorBoard |
| Machine learning workflow management | Dataset management, training loop | Keras, Lasagne, Blocks, TF Learn |
| Computational graph management | Build computational graph, forward prop/backprop | Theano, TensorFlow, Torch.nn |
| Multi-dimensional array library | Linear algebra | NumPy, CuPy, Eigen, torch (core) |
| Numerical computation package | Matrix operation, convolution | BLAS, cuBLAS, cuDNN |
| Hardware | | CPU, GPU |
![Page 7: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/7.jpg)
![Page 8: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/8.jpg)
Neural Network as a Computational Graph
• In its simplest form, an NN is represented as a computational graph (CG): a stack of bipartite DAGs (directed acyclic graphs) consisting of data nodes and operator nodes.
y = x1 * x2
z = y - x3
[Figure: computational graph. Data nodes x1 and x2 feed the operator node mul, which produces data node y; y and x3 feed the operator node sub, which produces data node z.]
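A minimal sketch of such a bipartite graph for y = x1 * x2, z = y - x3. The `DataNode` and `OpNode` classes and the `evaluate` function are hypothetical names made up for this illustration and are not taken from any framework.

```python
import operator


class DataNode:
    """Holds a value; `creator` is the operator node that produced it, if any."""

    def __init__(self, name, value=None, creator=None):
        self.name = name
        self.value = value
        self.creator = creator


class OpNode:
    """Applies a function to its input data nodes and yields one data node."""

    def __init__(self, name, fn, inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs

    def output(self, out_name):
        return DataNode(out_name, creator=self)


def evaluate(node):
    """Forward prop: recursively compute the value of a data node."""
    if node.value is None:
        args = [evaluate(inp) for inp in node.creator.inputs]
        node.value = node.creator.fn(*args)
    return node.value


x1, x2, x3 = DataNode("x1", 2.0), DataNode("x2", 3.0), DataNode("x3", 1.0)
y = OpNode("mul", operator.mul, [x1, x2]).output("y")  # y = x1 * x2
z = OpNode("sub", operator.sub, [y, x3]).output("z")   # z = y - x3
result = evaluate(z)  # 2.0 * 3.0 - 1.0
```

Edges only ever connect a data node to an operator node or the reverse, which is what makes the graph bipartite.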
![Page 9: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/9.jpg)
Example: Multi-layer Perceptron (MLP)
[Figure: MLP as a computational graph. x → Affine (W1, b1) → h1 → ReLU → a1 → Affine (W2, b2) → h2 → ReLU → a2 → Softmax → y → Cross Entropy (with label t) → Loss]
Whether the CG includes the weights and biases is an implementation choice.
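The forward pass of this MLP can be sketched in NumPy, one function per operator node. The layer sizes, the random initialization, and the integer-label form of the cross entropy are assumptions made for the example.

```python
import numpy as np


def affine(x, W, b):
    return x @ W + b


def relu(h):
    return np.maximum(h, 0.0)


def softmax(a):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(y, t):
    # t holds integer class labels; average the negative log-likelihood.
    return -np.mean(np.log(y[np.arange(len(t)), t]))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))   # mini-batch of 4 inputs with 5 features
t = np.array([0, 2, 1, 2])    # labels for 3 classes
W1, b1 = 0.1 * rng.normal(size=(5, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 3)), np.zeros(3)

# Same chain of data nodes as in the figure.
h1 = affine(x, W1, b1)
a1 = relu(h1)
h2 = affine(a1, W2, b2)
a2 = relu(h2)
y = softmax(a2)
loss = cross_entropy(y, t)
```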
![Page 10: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/10.jpg)
Example: Recurrent Neural Network (RNN)
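[Figure: unrolled RNN. h0 → RNN Unit (input x1) → h1 → RNN Unit (input x2) → h2 → ・・・ → RNN Unit (input xT) → hT. Inset labels: x, h, y; a single RNN unit takes xt and ht-1, has parameters W and b, and outputs ht.]
An RNN unit can be:
• Affine + activation function
• LSTM (Long Short-Term Memory)
• GRU (Gated Recurrent Unit)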
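An unrolled RNN with the simplest unit (affine + activation) can be sketched as follows. Concatenating xt with ht-1 before a single affine map, and using tanh as the activation, are assumptions of this example; the sizes are arbitrary.

```python
import numpy as np


def rnn_unit(x_t, h_prev, W, b):
    # Affine map of the concatenated [x_t, h_{t-1}], followed by tanh.
    return np.tanh(np.concatenate([x_t, h_prev]) @ W + b)


rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4
xs = rng.normal(size=(T, n_in))  # x_1 ... x_T
W = 0.1 * rng.normal(size=(n_in + n_hidden, n_hidden))
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)           # h_0
hs = []
for x_t in xs:                   # the same W and b are shared by every step
    h = rnn_unit(x_t, h, W, b)
    hs.append(h)
h_T = hs[-1]
```

Unrolling the loop over T steps yields an ordinary DAG, so the graph machinery used for feed-forward networks applies unchanged.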
![Page 11: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/11.jpg)
Example: Stacked RNN
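[Figure: stacked RNN. First layer: h0 → RNN Unit (input x1) → h1 → RNN Unit (input x2) → h2 → ・・・ → RNN Unit (input xT) → hT. Second layer: z0 → RNN Unit → z1 → RNN Unit → z2 → ・・・ → RNN Unit → zT, taking the first layer's hidden states as inputs. Output: Affine → Softmax → y.]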
![Page 12: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/12.jpg)
Example: RNN with control flow nodes
[Figure: RNN expressed as a loop built from control flow nodes. Node labels: loop enter, predicate, switch, RNN Unit, update, loop end. Data labels: i, pred, s, s', h0, x, W, b, y. Branch labels: pred=True, pred=False.]
• TensorFlow has control flow nodes (e.g. cond, switch, while).
• Because the CG has a loop, some mechanism is needed to resolve the dependencies between nodes and schedule the order of calculation.
![Page 13: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/13.jpg)
Automatic Differentiation
• Computes the gradients of specified data nodes (e.g. the loss) with respect to each data node.
• Each operator node must have a backward operation that calculates the gradients w.r.t. its inputs from the gradients w.r.t. its outputs (a realization of the chain rule).
  • e.g. The Function class of Chainer has a backward method.
  • e.g. Each layer class of Caffe has Backward_cpu and Backward_gpu methods.
  • e.g. Autograd has a thin wrapper that adds gradient methods as closures to most NumPy methods.
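The idea of pairing every operator with a backward operation can be sketched as below. The `Mul` and `Sub` classes are hypothetical and only illustrate the pattern; they do not reproduce the Chainer, Caffe, or Autograd interfaces.

```python
class Mul:
    def forward(self, a, b):
        self.a, self.b = a, b  # keep the inputs for the backward pass
        return a * b

    def backward(self, grad_out):
        # d(a*b)/da = b, d(a*b)/db = a
        return grad_out * self.b, grad_out * self.a


class Sub:
    def forward(self, a, b):
        return a - b

    def backward(self, grad_out):
        # d(a-b)/da = 1, d(a-b)/db = -1
        return grad_out, -grad_out


x1, x2, x3 = 2.0, 3.0, 1.0
mul, sub = Mul(), Sub()

# Forward prop: y = x1 * x2, z = y - x3
y = mul.forward(x1, x2)
z = sub.forward(y, x3)

# Backprop: start from dz/dz = 1 and visit the operators in reverse order.
gy, gx3 = sub.backward(1.0)
gx1, gx2 = mul.backward(gy)
```

Each `backward` only needs local information (its own inputs and the incoming gradient), which is why new operators can be added without touching the rest of the graph.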
![Page 14: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/14.jpg)
Backprop through CG
y = x1 * x2
z = y - x3
[Figure: the computational graph x1, x2 → mul → y; y, x3 → sub → z, annotated with the gradients ∇z z = 1, ∇y z, and ∇x1 z flowing backward from z.]
![Page 15: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/15.jpg)
Backprop as extended graphs
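y = x1 * x2
z = y - x3
[Figure: the forward propagation graph (x1, x2 → mul → y; y, x3 → sub → z) extended with a backward propagation graph: dz → id → dy; dz → neg → dx3; dy, x2 → mul → dx1; dy, x1 → mul → dx2.]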
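When backprop is expressed as an extension of the graph, the gradients are computed by the same kinds of operators as the forward pass (id, neg, mul). The sketch below writes that backward graph out by hand and compares it with central finite differences; the helper names and the evaluation point are chosen for the example.

```python
def forward(x1, x2, x3):
    y = x1 * x2
    z = y - x3
    return z


def backward(x1, x2, dz=1.0):
    # Backward graph built from the same primitives as the forward graph.
    dy = dz        # id
    dx3 = -dz      # neg
    dx1 = dy * x2  # mul
    dx2 = dy * x1  # mul
    return dx1, dx2, dx3


def numerical_grad(f, args, i, eps=1e-6):
    # Central finite difference with respect to the i-th argument.
    upper = list(args)
    lower = list(args)
    upper[i] += eps
    lower[i] -= eps
    return (f(*upper) - f(*lower)) / (2 * eps)


point = (2.0, 3.0, 1.0)
analytic = backward(point[0], point[1])
numeric = tuple(numerical_grad(forward, point, i) for i in range(3))
```

Because the gradients are themselves nodes of a graph, the same graph optimizations can be applied to them.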
![Page 16: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/16.jpg)
Example: Theano
![Page 17: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/17.jpg)
![Page 18: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/18.jpg)
Numerical optimizer
• Many gradient-based optimization algorithms are implemented.
  • Stochastic Gradient Descent (SGD) is implemented in most DL frameworks.
  • Which optimizer works best depends on the concrete task.

w: parameters of neural network
θ: states of optimizer
L: loss function
Γ: optimizer-specific function

initialize w, θ
until the criterion is met:
    get data (x, y)
    calculate ∇w L(x, y; w)
    w, θ ← Γ(w, θ, ∇w L)
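The update rule w, θ ← Γ(w, θ, ∇w L) can be sketched with two choices of Γ. The function names, the toy quadratic loss, and the hyperparameters are assumptions for this example; the stopping criterion is simply a fixed number of steps.

```python
import numpy as np


def sgd(w, theta, grad, lr=0.1):
    # Plain SGD keeps no state, so theta is passed through unchanged.
    return w - lr * grad, theta


def momentum_sgd(w, theta, grad, lr=0.1, momentum=0.9):
    # theta is the velocity, updated together with w.
    theta = momentum * theta - lr * grad
    return w + theta, theta


def minimize(gamma, w, theta, grad_fn, n_steps=200):
    for _ in range(n_steps):              # "until the criterion is met"
        grad = grad_fn(w)                 # calculate the gradient of L w.r.t. w
        w, theta = gamma(w, theta, grad)  # w, theta <- Gamma(w, theta, grad)
    return w


# Toy loss L(w) = ||w - target||^2 with gradient 2 (w - target).
target = np.array([1.0, -2.0])


def grad_fn(w):
    return 2.0 * (w - target)


w_sgd = minimize(sgd, np.zeros(2), None, grad_fn)
w_mom = minimize(momentum_sgd, np.zeros(2), np.zeros(2), grad_fn)
```

Keeping the optimizer state θ separate from the parameters w lets the training loop stay the same when the optimizer is swapped.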
![Page 19: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/19.jpg)
Serialization
• Save/load a snapshot of the training process in a specified format (e.g. hdf5, npz, protobuf)
  • Models to be trained (= architectures and parameters of NNs)
  • States of the training procedure (e.g. epoch, learning rate, momentum)
• Serialization enhances the portability of models.
  • Publish pre-trained models (e.g. Model Zoo (Caffe), MXNet, TensorFlow)
  • Import pre-trained models of other DL frameworks
    • e.g. Chainer supports BVLC-official reference models of Caffe.
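A minimal sketch of saving and restoring a snapshot in the npz format with NumPy. The parameter names, the state values, the `state_` key prefix, and the file name are all made up for the example; real frameworks define their own snapshot layouts.

```python
import os
import tempfile

import numpy as np

# Model parameters and training state to snapshot (values are illustrative).
params = {"W1": np.ones((2, 3)), "b1": np.zeros(3)}
state = {"epoch": 5, "learning_rate": 0.01, "momentum": 0.9}

path = os.path.join(tempfile.mkdtemp(), "snapshot.npz")

# Save: every entry becomes a named array inside a single .npz file.
np.savez(path, **params, **{f"state_{k}": v for k, v in state.items()})

# Load: split the arrays back into parameters and training state.
prefix = "state_"
with np.load(path) as snapshot:
    loaded_params = {
        k: snapshot[k] for k in snapshot.files if not k.startswith(prefix)
    }
    loaded_state = {
        k[len(prefix):]: snapshot[k].item()
        for k in snapshot.files
        if k.startswith(prefix)
    }
```

Storing the training state next to the parameters is what allows an interrupted run to resume from the snapshot rather than restart.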
![Page 20: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/20.jpg)
Computational optimizer
• Transforms CGs to make them simpler and more efficient.
e.g. Theano
y = x1 * x2
z = y - x3
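Graph simplification can be illustrated with a toy rewriter over expression trees. This is a generic sketch, not Theano's optimizer: the tuple encoding `(op, left, right)` and the three rewrite rules (constant folding, x * 1 → x, x - 0 → x) are assumptions chosen for the example.

```python
import operator

OPS = {"mul": operator.mul, "sub": operator.sub}


def simplify(expr):
    """Rewrite an expression tree bottom-up into a cheaper equivalent."""
    if not isinstance(expr, tuple):
        return expr                   # leaf: variable name or constant
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if isinstance(a, float) and isinstance(b, float):
        return OPS[op](a, b)          # constant folding
    if op == "mul" and 1.0 in (a, b):
        return b if a == 1.0 else a   # x * 1 -> x
    if op == "sub" and b == 0.0:
        return a                      # x - 0 -> x
    return (op, a, b)


# (x1 * 1.0) * x2 - (2.0 - 2.0) simplifies to x1 * x2.
expr = ("sub", ("mul", ("mul", "x1", 1.0), "x2"), ("sub", 2.0, 2.0))
simplified = simplify(expr)

# A graph with nothing to simplify is returned unchanged.
unchanged = simplify(("sub", ("mul", "x1", "x2"), "x3"))
```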
![Page 21: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/21.jpg)
Abstraction of ML workflow
• Offers typical training/validation/evaluation procedures as APIs.
• Users call a single API and do not have to write the procedure manually.
  • e.g. fit and evaluate methods of the Model class in Keras.
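What such an abstraction hides can be sketched with a hypothetical `LinearModel` class that exposes `fit` and `evaluate`. It borrows only the method names from the Keras example; the class itself, its signature, and the synthetic data are assumptions for this illustration.

```python
import numpy as np


class LinearModel:
    """Hypothetical model exposing fit/evaluate; not the Keras API."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.b = 0.0

    def _loss_and_grads(self, x, y):
        err = x @ self.w + self.b - y
        return np.mean(err ** 2), 2.0 * x.T @ err / len(x), 2.0 * err.mean()

    def fit(self, x, y, epochs=100, batch_size=32, lr=0.1):
        # The whole training loop is hidden behind one call.
        for _ in range(epochs):
            for start in range(0, len(x), batch_size):
                xb = x[start:start + batch_size]
                yb = y[start:start + batch_size]
                _, grad_w, grad_b = self._loss_and_grads(xb, yb)
                self.w -= lr * grad_w
                self.b -= lr * grad_b
        return self

    def evaluate(self, x, y):
        return self._loss_and_grads(x, y)[0]


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
y = x @ np.array([1.5, -0.5]) + 0.3  # noiseless synthetic targets

model = LinearModel(n_features=2).fit(x[:160], y[:160])
val_loss = model.evaluate(x[160:], y[160:])
```

The user code shrinks to two calls, while batching, gradient computation, and parameter updates stay inside the library.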
![Page 22: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/22.jpg)
Graphical interface
• Computational graph management
  • Editor, Visualizer
• Visualization of training procedure
  • Visualization of feature maps, output of NNs etc.
  • Transition of error and accuracy
• Performance monitor
  • e.g. Throughput, latency, memory usage
![Page 23: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/23.jpg)
![Page 24: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/24.jpg)
GPU support
• CUDA: Computing platform for GPGPU on NVIDIA GPUs
  • language extension, compiler, library etc.
• DL frameworks prepare wrappers for CUDA.
  • GPU-array library that utilizes cuBLAS, cuRAND etc.
  • Layer implementation with cuDNN (e.g. Convolution, sigmoid, LSTM)
• Designed to switch between CPU and GPU easily.
  • e.g. Users can write CPU-GPU agnostic code.
  • e.g. Switch CPU/GPU with environment variables.
• Some frameworks support OpenCL as a GPU environment, but CUDA is more popular for now.
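CPU-GPU agnostic code can be sketched by writing functions against an array module that is chosen at run time. The `USE_GPU` environment variable and the `get_array_backend` helper are hypothetical; the sketch relies only on CuPy offering a NumPy-compatible interface and falls back to NumPy when CuPy is not installed.

```python
import os

import numpy as np


def get_array_backend():
    """Return CuPy if requested via USE_GPU=1 and importable, else NumPy."""
    if os.environ.get("USE_GPU") == "1":
        try:
            import cupy
            return cupy
        except ImportError:
            pass
    return np


def softmax(xp, a):
    # Uses only the shared NumPy-style API, so it runs on either backend.
    e = xp.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


xp = get_array_backend()
a = xp.asarray([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
probs = softmax(xp, a)
```

The model code never names the device; only the backend selection does.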
![Page 25: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/25.jpg)
Multi-dimensional array library (CPU / GPU)
• In charge of the concrete calculations on data nodes.
• Heavily depends on BLAS (CPU) or CUDA / CUDA Toolkit (GPU)
• CPU
  • Third-party library: Eigen::Tensor, NumPy
  • Written from scratch: ND4J (DL4J), mshadow (MXNet)
• GPU
  • Third-party library: Eigen::Tensor, PyCUDA, gpuarray
  • Written from scratch: ND4J (DL4J), mshadow (MXNet), CuPy (Chainer)
![Page 26: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/26.jpg)
Which device to use?
• GPU is (by far) faster than CPU in most cases.
  • Most tensor calculations consist of element-wise calculations, matrix multiplications and convolutions.
• Exceptional cases
  • It is difficult to apply the mini-batch technique.
    • e.g. variable-length training dataset
    • e.g. The architecture of the NN depends on the training data.
  • GPU calculation cannot hide the transfer of data to the GPU.
    • e.g. Minibatch size is too small.
![Page 27: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/27.jpg)
Technology stack of Chainer
[Figure: Chainer's components mapped onto the layers of the common technology stack]
• Computational graph management: Chainer
• Multi-dimensional array library: NumPy (CPU), CuPy (GPU)
• Numerical computation package: BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
• Hardware: CPU, GPU
![Page 28: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/28.jpg)
Technology stack of TensorFlow
[Figure: TensorFlow's components mapped onto the layers of the common technology stack]
• Graphical interface: TensorBoard
• Machine learning workflow management: TF Learn
• Computational graph management: TensorFlow
• Multi-dimensional array library: Eigen::Tensor
• Numerical computation package: BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
• Hardware: CPU, GPU
![Page 29: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/29.jpg)
Technology stack of Theano
[Figure: Theano's components mapped onto the layers of the common technology stack]
• Computational graph management: Theano
• Multi-dimensional array library: NumPy (CPU), libgpuarray (GPU)
• Numerical computation package: BLAS (CPU); CUDA, OpenCL, CUDA Toolkit (GPU)
• Hardware: CPU, GPU
![Page 30: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/30.jpg)
Technology stack of Keras
[Figure: Keras's components mapped onto the layers of the common technology stack]
• Machine learning workflow management: Keras
• Computational graph management: TensorFlow, Theano
• Lower layers: the technology stack of Theano or the technology stack of TF
![Page 31: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/31.jpg)
Summary
• Most DL frameworks have many components in common and can be organized as a similar technology stack.
  • At the upper layers of the stack, frameworks are designed to help users follow typical ML workflows.
  • At the middle layers, manipulations of computational graphs are automated.
  • At the lower layers, optimized tensor calculations are implemented.
• The realization of these components differs between frameworks, as we will see in the following part.
![Page 32: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/32.jpg)
memorandum
![Page 33: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/33.jpg)
Training of Neural Networks
$$\operatorname*{argmin}_{w} \sum_{(x, y)} L(x, y; w)$$
w: parameters
x: feature vector
y: training label
L: loss function

• L is designed so that its value gets smaller as the prediction becomes more “accurate”.
• In the deep learning context
  • L: represented by neural networks
  • w: parameters of neural networks
e.g.: Classification problem
![Page 34: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/34.jpg)
Layer = function + data nodes
• Layers (e.g. fully connected layer, convolutional layer) can be considered as functions with parameters to be optimized.
• In most modern frameworks, the parameters of layers can be considered as data nodes in a computational graph.
• Frameworks need to distinguish which data nodes are parameters to be optimized and which are data points.
![Page 35: Common Design of Deep Learning Frameworks](https://reader031.vdocuments.site/reader031/viewer/2022030223/588219e11a28ab3f4c8b62f3/html5/thumbnails/35.jpg)
Execution Engine
• It resolves the dependencies between data nodes and schedules the execution of parts of the computational graph (especially in multi-node or multi-GPU settings).
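Dependency resolution and scheduling can be sketched with a topological sort. The example uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) on the graph of y = x1 * x2, z = y - x3; the node names are illustrative, and a real execution engine would also decide on which device each node runs.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on.
dependencies = {
    "mul": {"x1", "x2"},
    "y": {"mul"},
    "sub": {"y", "x3"},
    "z": {"sub"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

schedule = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # nodes whose dependencies are all done
    schedule.append(ready)              # nodes in one group could run concurrently
    sorter.done(*ready)
```

Each group in `schedule` contains nodes with no unmet dependencies, which is the information needed to dispatch work to several devices in parallel.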