common design of deep learning frameworks

Tutorial: Deep Learning Implementations and Frameworks

Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+

*Preferred Networks, Inc. (PFN)
{tokui,oono}@preferred.jp

+National Institute of Advanced Industrial Science and Technology (AIST)
[email protected], [email protected]


Overview of this tutorial

• 1st session (KO, 8:30 - 10:00)
  • Introduction
  • Basics of neural networks
  • Common design of neural network implementations

• 2nd session (ST, 10:30 - 12:30)
  • Differences of deep learning frameworks
  • Coding examples of frameworks
  • Conclusion

Common Design of Deep Learning Frameworks

Kenta Oono <[email protected]>
Preferred Networks Inc.

2016/4/19 DLIF Tutorial @ PAKDD2016

Objective of this part

• How deep learning frameworks represent various neural networks.

• How deep learning frameworks realize the training procedure of neural networks.

• Technology stack that is common to most deep learning frameworks.


Steps for training neural networks

1. Prepare the training dataset
2. Initialize the Neural Network (NN) parameters
3. Repeat until meeting some criterion:
   a. Prepare for the next (mini) batch
   b. Define how to compute the loss of this batch
   c. Compute the loss (forward prop)
   d. Compute the gradient (backprop)
   e. Update the NN parameters
4. Save the NN parameters
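The training steps can be sketched end to end with NumPy alone. This is a minimal illustration, not the procedure of any particular framework: the synthetic dataset, the single affine layer, the hand-written gradient in the hypothetical helper `loss_and_grads`, and the fixed epoch count used as the stopping criterion are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prepare the training dataset (synthetic: y = 3x - 1 plus small noise).
x_train = rng.uniform(-1.0, 1.0, size=(256, 1))
y_train = 3.0 * x_train - 1.0 + 0.01 * rng.standard_normal((256, 1))

# Initialize the NN parameters (a single affine layer here).
params = {"W": np.zeros((1, 1)), "b": np.zeros(1)}


def loss_and_grads(params, x, y):
    """Define the loss of a batch: forward prop (MSE) and manual backprop."""
    pred = x @ params["W"] + params["b"]      # forward prop
    diff = pred - y
    loss = float(np.mean(diff ** 2))
    grad_pred = 2.0 * diff / len(x)           # dL/dpred
    grads = {"W": x.T @ grad_pred, "b": grad_pred.sum(axis=0)}
    return loss, grads


# Repeat until meeting some criterion (here: a fixed number of epochs).
learning_rate, batch_size = 0.1, 32
for epoch in range(200):
    order = rng.permutation(len(x_train))
    for start in range(0, len(x_train), batch_size):
        idx = order[start:start + batch_size]  # prepare the next mini-batch
        loss, grads = loss_and_grads(params, x_train[idx], y_train[idx])
        for name in params:                    # update the NN parameters
            params[name] -= learning_rate * grads[name]

# Save the NN parameters (an in-memory copy stands in for a file here).
final_loss, _ = loss_and_grads(params, x_train, y_train)
saved = {name: value.copy() for name, value in params.items()}
```

A framework replaces the hand-written gradient with automatic differentiation and the in-memory copy with a serializer, but the overall loop keeps this shape.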


Technology stack of DL framework

name | functions | example
Graphical interface | | DIGITS, TensorBoard
Machine learning workflow management | Dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
Computational graph management | Build computational graph, forward prop/backprop | Theano, TensorFlow, Torch.nn
Multi-dimensional array library | Linear algebra | NumPy, CuPy, Eigen, torch (core)
Numerical computation package | Matrix operation, convolution | BLAS, cuBLAS, cuDNN
Hardware | | CPU, GPU

Neural Network as a Computational Graph

• In its simplest form, an NN is represented as a computational graph (CG): a stack of bipartite DAGs (directed acyclic graphs) consisting of data nodes and operator nodes.

y = x1 * x2
z = y - x3

[Figure: computational graph for the two equations above. The data nodes x1 and x2 feed the operator node mul, which outputs the data node y; y and x3 feed the operator node sub, which outputs the data node z.]
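The bipartite structure can be made concrete with two small classes. This is a sketch only: `DataNode`, `OperatorNode` and `trace` are hypothetical names chosen for the illustration and do not correspond to the API of any framework.

```python
class DataNode:
    """Holds a value; `creator` is the operator node that produced it."""

    def __init__(self, name, value=None, creator=None):
        self.name = name
        self.value = value
        self.creator = creator


class OperatorNode:
    """Applies `func` to its input data nodes and yields one output data node."""

    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func
        self.inputs = inputs

    def apply(self, out_name):
        value = self.func(*(node.value for node in self.inputs))
        return DataNode(out_name, value, creator=self)


def trace(node):
    """List the edges of the graph by walking back from `node`."""
    if node.creator is None:
        return []
    edges = []
    for parent in node.creator.inputs:
        edges += trace(parent)
        edges.append((parent.name, node.creator.name))
    edges.append((node.creator.name, node.name))
    return edges


# y = x1 * x2, z = y - x3
x1, x2, x3 = DataNode("x1", 2.0), DataNode("x2", 3.0), DataNode("x3", 1.0)
y = OperatorNode("mul", lambda a, b: a * b, [x1, x2]).apply("y")
z = OperatorNode("sub", lambda a, b: a - b, [y, x3]).apply("z")
```

Every edge returned by `trace(z)` connects a data node to an operator node or the reverse, which is what makes the graph bipartite.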

Example: Multi-layer Perceptron (MLP)

[Figure: computational graph of an MLP: x → Affine (W1, b1) → h1 → ReLU → a1 → Affine (W2, b2) → h2 → ReLU → a2 → Softmax → y → Cross Entropy Loss (with label t)]

Whether the CG includes weights and biases as nodes is an implementation choice.
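The forward pass of this MLP can be written directly with NumPy, one function per operator node in the figure. This is a sketch under assumptions: the function names are hypothetical, labels `t` are taken to be integer class indices, and the loss is averaged over the batch.

```python
import numpy as np


def affine(x, W, b):
    return x @ W + b


def relu(h):
    return np.maximum(h, 0.0)


def softmax(a):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(y, t):
    """Mean negative log-likelihood of integer labels t under y."""
    return float(-np.mean(np.log(y[np.arange(len(t)), t])))


def mlp_loss(x, t, W1, b1, W2, b2):
    """Follow the data nodes of the figure: h1, a1, h2, a2, y, loss."""
    h1 = affine(x, W1, b1)
    a1 = relu(h1)
    h2 = affine(a1, W2, b2)
    a2 = relu(h2)
    y = softmax(a2)
    return y, cross_entropy(y, t)
```

Here the weights and biases are passed in as plain arguments, which corresponds to treating them as data nodes of the graph.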


Example: Recurrent Neural Network (RNN)

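[Figure: an RNN unrolled over time. Starting from h0, the same RNN unit maps (x1, h0) to h1, (x2, h1) to h2, ..., and (xT, hT-1) to hT. Insets: the compact form with nodes x, h, y, and a single unit that takes xt and ht-1, has parameters W and b, and outputs ht.]

An RNN unit can be:
• Affine + activation function
• LSTM (Long Short-Term Memory)
• GRU (Gated Recurrent Unit)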
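The unrolled graph amounts to applying one unit with shared parameters in a loop. The sketch below uses the "affine + activation" variant; the choice of tanh as the activation, the concatenation of xt and ht-1 before a single weight matrix W, and the names `rnn_unit` and `run_rnn` are assumptions made for the example.

```python
import numpy as np


def rnn_unit(x_t, h_prev, W, b):
    """Affine + activation: h_t = tanh([x_t, h_{t-1}] W + b)."""
    return np.tanh(np.concatenate([x_t, h_prev], axis=1) @ W + b)


def run_rnn(xs, h0, W, b):
    """Apply the same unit (shared W, b) to x_1..x_T and return h_1..h_T."""
    h, hs = h0, []
    for x_t in xs:
        h = rnn_unit(x_t, h, W, b)
        hs.append(h)
    return hs
```

For inputs of size 3 and hidden states of size 4, `W` has shape (7, 4) and `b` has shape (4,). Because the loop length follows `len(xs)`, the graph that results has a different depth for each sequence length.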


Example: Stacked RNN

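[Figure: two stacked RNNs unrolled over time. The first layer maps x1, ..., xT and the initial state h0 to h1, ..., hT; the second layer takes h1, ..., hT and the initial state z0 and produces z1, ..., zT; Affine and Softmax are then applied to obtain y.]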


Example: RNN with control flow nodes

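[Figure: an RNN expressed with control flow nodes. Nodes: loop enter, predicate (input i, output pred), switch (branches pred=True and pred=False), RNN unit (inputs x and state s, parameters W and b), update (s to s'), and loop end (output y); h0 is the initial state.]

• TensorFlow has control flow nodes (e.g. cond, switch, while).

• As the CG has a loop, some mechanism is necessary to resolve the dependencies between nodes and schedule the order of calculation.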


Automatic Differentiation

• Computes the gradient of some specified data node (e.g. loss) with respect to each data node.

• Each operator node must have a backward operation that calculates gradients w.r.t. its inputs from gradients w.r.t. its outputs (a realization of the chain rule).
  • e.g. The Function class of Chainer has a backward method.
  • e.g. Each layer class of Caffe has Backward_cpu and Backward_gpu methods.
  • e.g. Autograd has a thin wrapper that adds gradient methods as closures to most NumPy functions.
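The forward/backward contract of an operator node can be sketched as follows. The classes `Mul` and `Sub` and the helper `numerical_grad` are hypothetical and only mimic the idea; they are not the interfaces of Chainer, Caffe or Autograd.

```python
class Mul:
    def forward(self, a, b):
        self.a, self.b = a, b          # remember inputs for the backward pass
        return a * b

    def backward(self, grad_out):
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient.
        return grad_out * self.b, grad_out * self.a


class Sub:
    def forward(self, a, b):
        return a - b

    def backward(self, grad_out):
        # d(a-b)/da = 1 and d(a-b)/db = -1.
        return grad_out, -grad_out


def numerical_grad(f, args, i, eps=1e-6):
    """Central difference of f with respect to args[i], for checking backward."""
    hi, lo = list(args), list(args)
    hi[i] += eps
    lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)
```

Comparing `backward` against `numerical_grad` is a common way to test a newly written operator.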


Backprop through CG

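y = x1 * x2
z = y - x3

[Figure: the computational graph (x1, x2 → mul → y; y, x3 → sub → z) with gradients propagated backward from the output: ∇z z = 1, then ∇y z, then ∇x1 z.]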
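Backprop through this graph can be sketched as a reverse traversal that starts from ∇z z = 1 and pushes gradients to the inputs. The `Var` class, the `mul` and `sub` helpers, and the `backward` function are hypothetical names for this illustration; real frameworks store richer information on each node.

```python
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (input Var, local gradient)
        self.grad = 0.0


def mul(a, b):
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))


def sub(a, b):
    return Var(a.value - b.value, ((a, 1.0), (b, -1.0)))


def backward(output):
    """Propagate gradients from `output` to every node that feeds it."""
    # Depth-first search gives a topological order, so a node's gradient is
    # complete before it is passed on to the node's inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0                                  # ∇z z = 1
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad      # chain rule


x1, x2, x3 = Var(2.0), Var(3.0), Var(1.0)
y = mul(x1, x2)
z = sub(y, x3)
backward(z)
```

After the call, `x1.grad` holds x2's value and `x2.grad` holds x1's value, as expected for a product.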


Backprop as extended graphs

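y = x1 * x2
z = y - x3

[Figure: the forward graph (x1, x2 → mul → y; y, x3 → sub → z) extended with a backward graph built from ordinary operator nodes: dz → id → dy; dz → neg → dx3; (dy, x2) → mul → dx1; (dy, x1) → mul → dx2. The two halves are labeled forward propagation and backward propagation.]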


Example: Theano

[Figure: Theano example]

Numerical optimizer

• Many gradient-based optimization algorithms are implemented.
  • Stochastic Gradient Descent (SGD) is implemented in most DL frameworks.
  • Which optimizer works best depends on the concrete task.

w: parameters of neural network
θ: states of optimizer
L: loss function
Γ: optimizer-specific function

initialize w, θ
until the criterion is met:
    get data (x, y)
    calculate ∇w L(x, y; w)
    w, θ ← Γ(w, θ, ∇w L)
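As one concrete instance of Γ, momentum SGD keeps a velocity per parameter as its state θ. The sketch below is an assumption-laden illustration: the function names, the hyperparameter values, and the toy loss (which has no data term, so the "get data" step is omitted) are all chosen for the example.

```python
def momentum_sgd(w, theta, grad, lr=0.1, momentum=0.9):
    """Γ for momentum SGD: theta is the velocity, one entry per parameter."""
    new_theta = [momentum * v - lr * g for v, g in zip(theta, grad)]
    new_w = [wi + v for wi, v in zip(w, new_theta)]
    return new_w, new_theta


def grad_loss(w):
    """Gradient of the toy loss L(w) = (w0 - 1)^2 + (w1 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


w, theta = [0.0, 0.0], [0.0, 0.0]           # initialize w, θ
for step in range(300):                      # until the criterion is met
    g = grad_loss(w)                         # calculate ∇w L
    w, theta = momentum_sgd(w, theta, g)     # w, θ ← Γ(w, θ, ∇w L)
```

Plain SGD is the special case with no state: Γ returns w - lr * ∇w L and leaves θ empty.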


Serialization

• Save/load a snapshot of the training process in a specified format (e.g. hdf5, npz, protobuf)
  • Models to be trained (= architectures and parameters of NNs)
  • States of the training procedure (e.g. epoch, learning rate, momentum)

• Serialization enhances the portability of models.
  • Publish pre-trained models (e.g. Model Zoo (Caffe), MXNet, TensorFlow)
  • Import pre-trained models of other DL frameworks
    • e.g. Chainer supports BVLC-official reference models of Caffe.
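A snapshot in npz format can be sketched with `np.savez` and `np.load`. The helpers `save_snapshot` and `load_snapshot` and the `param_` / `state_` key prefixes are hypothetical conventions for this example, not the layout used by any framework; only parameters and scalar training state are stored, not the architecture.

```python
import io

import numpy as np


def save_snapshot(fileobj, params, state):
    """Write parameters and scalar training state into one npz archive."""
    arrays = {"param_" + name: value for name, value in params.items()}
    arrays.update({"state_" + name: np.asarray(value) for name, value in state.items()})
    np.savez(fileobj, **arrays)


def load_snapshot(fileobj):
    """Read back the (params, state) pair written by save_snapshot."""
    params, state = {}, {}
    with np.load(fileobj) as archive:
        for key in archive.files:
            kind, name = key.split("_", 1)
            if kind == "param":
                params[name] = archive[key]
            else:
                state[name] = archive[key].item()   # 0-d array to Python scalar
    return params, state


# Round trip through an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
save_snapshot(
    buffer,
    params={"W": np.arange(6.0).reshape(2, 3), "b": np.zeros(3)},
    state={"epoch": 3, "learning_rate": 0.01},
)
buffer.seek(0)
restored_params, restored_state = load_snapshot(buffer)
```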


Computational optimizer

• Converts CGs to make them simpler and more efficient.

e.g. Theano

y = x1 * x2
z = y - x3
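The idea of rewriting a graph before running it can be illustrated with a toy simplifier over nested tuples. This is not Theano's optimizer: the tuple encoding `("op", lhs, rhs)`, the function name `simplify` and the handful of rules are assumptions for the example, and the rule x * 0 → 0 ignores NaN and infinity.

```python
def simplify(expr):
    """Recursively simplify a nested ("op", lhs, rhs) expression."""
    if not isinstance(expr, tuple):
        return expr                        # a variable name or a constant
    op, lhs, rhs = expr[0], simplify(expr[1]), simplify(expr[2])
    both_const = isinstance(lhs, (int, float)) and isinstance(rhs, (int, float))
    if op == "mul":
        if both_const:
            return lhs * rhs               # constant folding
        if lhs == 1:
            return rhs                     # 1 * x -> x
        if rhs == 1:
            return lhs                     # x * 1 -> x
        if lhs == 0 or rhs == 0:
            return 0                       # x * 0 -> 0
    if op == "sub":
        if both_const:
            return lhs - rhs               # constant folding
        if rhs == 0:
            return lhs                     # x - 0 -> x
        if lhs == rhs:
            return 0                       # x - x -> 0
    return (op, lhs, rhs)


# z = (x1 * (1 * x2)) - (x3 - x3) collapses to x1 * x2.
redundant = ("sub", ("mul", "x1", ("mul", 1, "x2")), ("sub", "x3", "x3"))
simplified = simplify(redundant)
```

The graph y = x1 * x2, z = y - x3 is already minimal under these rules, so `simplify` returns it unchanged.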


Abstraction of ML workflow

• Offers typical training/validation/evaluation procedures as APIs.
  • Users call a single API and do not have to write the procedure manually.
  • e.g. fit and evaluate methods of the Model class in Keras.



Graphical interface

• Computational graph management
  • Editor, Visualizer

• Visualization of training procedure
  • Visualization of feature maps, output of NNs etc.
  • Transition of error and accuracy

• Performance monitor
  • e.g. Throughput, latency, memory usage

GPU support

• CUDA: computing platform for GPGPU on NVIDIA GPUs
  • language extension, compiler, library etc.

• DL frameworks prepare wrappers for CUDA.
  • GPU-array library that utilizes cuBLAS, cuRAND etc.
  • Layer implementation with cuDNN (e.g. convolution, sigmoid, LSTM)

• Designed to switch between CPU and GPU easily.
  • e.g. Users can write CPU-GPU agnostic code.
  • e.g. Switch CPU/GPU with environment variables.

• Some frameworks support OpenCL as a GPU environment, but CUDA is more popular for now.
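CPU-GPU agnostic code usually works by passing around an array module with a NumPy-compatible interface. The sketch below assumes CuPy as the GPU array library and a hypothetical environment variable `DL_DEVICE`; the function names are also made up for the example, and the code falls back to NumPy when CuPy is not installed.

```python
import os

import numpy as np


def get_array_module(env_var="DL_DEVICE"):
    """Return cupy if the (hypothetical) variable asks for GPU, else numpy."""
    if os.environ.get(env_var, "cpu").lower() == "gpu":
        try:
            import cupy
            return cupy
        except ImportError:
            pass    # fall back to the CPU when CuPy is not installed
    return np


def affine_relu(xp, x, W, b):
    """Device-agnostic layer: works for any NumPy-compatible module `xp`."""
    return xp.maximum(x @ W + b, 0.0)


xp = get_array_module()
out = affine_relu(xp, xp.ones((2, 3)), xp.ones((3, 4)), xp.zeros(4))
```

The layer code never names a device; only the choice of `xp` and the arrays passed in decide where the computation runs.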


Multi-dimensional array library (CPU / GPU)

• In charge of concrete calculation of data nodes.
  • Heavily depends on BLAS (CPU) or CUDA / CUDA Toolkit (GPU)

• CPU
  • Third-party library: Eigen::Tensor, NumPy
  • Written from scratch: ND4J (DL4J), mshadow (MXNet)

• GPU
  • Third-party library: Eigen::Tensor, PyCUDA, gpuarray
  • Written from scratch: ND4J (DL4J), mshadow (MXNet), CuPy (Chainer)


Which device to use?

• GPU is (by far) faster than CPU in most cases.
  • Most tensor calculation consists of element-wise calculations, matrix multiplications and convolutions.

• Exceptional cases
  • It is difficult to apply the mini-batch technique.
    • e.g. variable-length training dataset
    • e.g. The architecture of the NN depends on the training data.
  • GPU calculation cannot hide the transfer of data to the GPU.
    • e.g. Mini-batch size is too small.


Technology stack of Chainer

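[Figure: technology stack of Chainer, top to bottom, drawn beside the layer names of the common stack (Graphical interface, Machine learning workflow management, Computational graph management, Multi-dimensional array library, Numerical computation package, Hardware)]

Chainer
NumPy (CPU) | CuPy (GPU)
BLAS (CPU) | cuBLAS, cuRAND, cuDNN (GPU)
CPU | GPU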

Technology stack of TensorFlow

[Figure: technology stack of TensorFlow, top to bottom, drawn beside the layer names of the common stack]

TensorBoard
TF Learn
TensorFlow
Eigen::Tensor
BLAS (CPU) | cuBLAS, cuRAND, cuDNN (GPU)
CPU | GPU

Technology stack of Theano

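[Figure: technology stack of Theano, top to bottom, drawn beside the layer names of the common stack]

Theano
NumPy (CPU) | libgpuarray (GPU)
BLAS (CPU) | CUDA, OpenCL, CUDA Toolkit (GPU)
CPU | GPU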

Technology stack of Keras


[Figure: technology stack of Keras, top to bottom, drawn beside the layer names of the common stack]

Keras
TensorFlow | Theano
Technology stack of TF | Technology stack of Theano

Summary

• Most DL frameworks have many components in common and can be organized as a similar technology stack.
  • At the upper layers of the stack, frameworks are designed to help users follow typical ML workflows.
  • At the middle layers, manipulations of computational graphs are automated.
  • At the lower layers, optimized tensor calculations are implemented.

• The realization of these components differs between frameworks, as we will see in the following part.


memorandum


Training of Neural Networks

$$\operatorname*{argmin}_{w} \sum_{(x, y)} L(x, y; w)$$

w: parameters
x: feature vector
y: training label
L: loss function

• L is designed so that its value gets smaller as the prediction becomes more accurate.
• In the deep learning context
  • L: represented by neural networks
  • w: parameters of neural networks

e.g.: Classification problem

Layer = function + data nodes

• Layers (e.g. fully connected layer, convolutional layer) can be considered as functions with parameters to be optimized.
  • In most modern frameworks, parameters of layers can be considered as data nodes in a computational graph.
  • Frameworks need to distinguish which data nodes are parameters to be optimized and which are data points.


Execution Engine

• It calculates the dependencies between data nodes and schedules the execution of parts of the computational graph (especially in a multi-node or multi-GPU setting).
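On a single device, the scheduling part can be sketched as a topological sort of the graph (Kahn's algorithm here). The function name `schedule` and the dictionary encoding of dependencies are assumptions for the example; every dependency must itself appear as a key, and a real execution engine would additionally assign nodes to devices and overlap independent work.

```python
from collections import deque


def schedule(dependencies):
    """Return an execution order given {node: [nodes it depends on]}."""
    remaining = {node: len(deps) for node, deps in dependencies.items()}
    users = {node: [] for node in dependencies}
    for node, deps in dependencies.items():
        for dep in deps:
            users[dep].append(node)

    # Nodes with no unmet dependencies are ready to run.
    ready = deque(node for node, count in remaining.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for user in users[node]:
            remaining[user] -= 1
            if remaining[user] == 0:
                ready.append(user)

    if len(order) != len(dependencies):
        raise ValueError("cycle detected: no valid schedule")
    return order


# y = x1 * x2, z = y - x3 as a bipartite graph of data and operator nodes.
graph = {
    "x1": [], "x2": [], "x3": [],
    "mul": ["x1", "x2"], "y": ["mul"],
    "sub": ["y", "x3"], "z": ["sub"],
}
order = schedule(graph)
```

Nodes that sit in the `ready` queue at the same time do not depend on each other, so they are the candidates for parallel execution.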