TRANSCRIPT
Introduction to Chainer: A Flexible Framework for Deep Learning
2015-06-18 PFI/PFN Weekly Seminar
Seiya Tokui (Preferred Networks)
Self-Introduction
• Seiya Tokui @beam2d (Twitter, GitHub)
• Researcher at Preferred Networks
• Main focus: machine learning
– Learning to Hash (master's degree)
– Deep Learning, Representation Learning (current focus)
A Powerful, Flexible, and Intuitive Framework for Neural Networks
Today I will introduce:
• The features of Chainer
• How to use Chainer
• Some planned features
• (Slides in English, talk in Japanese)
Chainer: The Concept
Chainer is a framework for neural networks
• Official site: http://chainer.org
• Repository: https://github.com/pfnet/chainer
• Provided as a Python library (PyPI: chainer)
• Main features
– Powerful: Supports CUDA and multi-GPU computation
– Flexible: Supports almost arbitrary architectures
– Intuitive: Forward prop can be written as regular Python code
Elements of a neural network framework
• Multi-dimensional array implementations
• Layer implementations
– Called by various names (layers, modules, blocks, primitives, etc.)
– The smallest units of automatic differentiation
– Contain forward and backward implementations
• Optimizer implementations
• Other components (data loading schemes, training loops, etc.)
– These are also very important, though Chainer currently does not provide abstractions for them (future work)
Forward prop / Backprop
• Forward prop defines how we want to process the input data
• Backprop computes the gradient of the loss with respect to the learnable parameters
• Given the backward procedures of all layers, backprop can be written as their combination (a.k.a. reverse-mode automatic differentiation); a toy sketch follows the diagram below
[Diagram: input → hidden → hidden → output, compared with the ground truth by a loss function; gradients flow back through each layer]
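To make the combination concrete, here is a toy sketch of reverse-mode automatic differentiation in plain Python (not Chainer code; the two layer classes are invented for illustration):

# Toy reverse-mode AD (not Chainer code): each layer stores what it needs
# for its backward pass; backprop chains the backward calls in reverse.
class Square(object):
    def forward(self, x):
        self.x = x               # remember the input for the backward pass
        return x * x
    def backward(self, gy):      # chain rule: dL/dx = dL/dy * dy/dx
        return gy * 2 * self.x

class Scale(object):
    def __init__(self, a):
        self.a = a
    def forward(self, x):
        return self.a * x
    def backward(self, gy):
        return gy * self.a

layers = [Square(), Scale(3.0)]
h = 2.0
for layer in layers:             # forward prop: h = 3 * 2**2 = 12
    h = layer.forward(h)
g = 1.0                          # dL/d(output) for L = output
for layer in reversed(layers):   # backprop: backward calls in reverse order
    g = layer.backward(g)        # g = dL/d(input) = 3 * 2 * 2 = 12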
Backprop Implementation Paradigm (1)
Define-and-Run
• First, a computational graph is constructed. Then, it is repeatedly fed minibatches to run the forward/backward computation
• The computational graph can be seen as a program, and the forward/backward computation is done by its interpreter
◆ Caffe: the program is written in Prototxt
◆ Torch: the program is constructed by Lua scripts
◆ Theano-based frameworks: the program is constructed by Python scripts
Backprop Implementation Paradigm (2)
Define-and-Run (cont.)
• Pros
– (Almost) no need for memory management
– The computational graph can be implicitly optimized (cf. Theano)
• Cons
– The program is fixed within the training loop
– The interpreter must be able to express various forward computations, including control-flow statements like if and for
◆ Theano has dedicated functions for these (ifelse and scan), which are unintuitive and not Pythonic
– Network definitions are hard to debug, since an error occurs during the forward computation, far from the network definition itself
Backprop Implementation Paradigm (3)
Define-by-Run
• The forward computation is written as regular program code with special variables and operators; executing it simultaneously performs the forward computation and constructs the graph (just by recording the order of operations)
• The graph is then used for the backward computation
• This paradigm lets us use arbitrary control-flow statements in the forward computation (see the sketch below)
– No need for a mini-language and its interpreter
• It also makes the forward computation intuitive and easy to debug
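As a minimal sketch (the layer sizes and the variable-depth loop are made up for this example; FunctionSet and F.Linear are introduced later in this talk), ordinary Python control flow decides the architecture on every call, because the graph is recorded while the code runs:

import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable

model = FunctionSet(
    l1=F.Linear(10, 10),
    l2=F.Linear(10, 10))

def forward(x, n_steps):
    h = F.relu(model.l1(x))
    for _ in xrange(n_steps):    # an ordinary Python loop; its length
        h = F.relu(model.l2(h))  # may differ on every call
    return h

x = Variable(np.zeros((1, 10), dtype=np.float32))
y = forward(x, 3)                # the recorded graph applies l2 three times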
Backprop Implementation Paradigm (4)
Define-by-Run (cont.)
• The computational graph can be modified within each iteration
• Example: truncated BPTT (backprop through time)
– BPTT: backprop on a recurrent net
– Truncated BPTT: truncate the backprop at some time step
– Truncation is one type of modification of the computational graph
[Diagram: an RNN unrolled over time; the backprop path is truncated at a chosen time step]
Features of Chainer
• Define-by-Run scheme
– The forward computation can contain any Python code
◆ if-else, for-else, break, continue, try-except-finally, list, dict, class, etc.
– Users can modify the graph within the loop
◆ E.g. truncation can be done by unchain_backward (which unchains the graph backward from some variable)
◆ See the tutorial on recurrent nets: http://docs.chainer.org/en/latest/tutorial/recurrentnet.html
• Many predefined functions
• GPU support via PyCUDA
Example: Training a multi-layer perceptron in one page
Full code is in the tutorial and the examples directory.
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)
        t = Variable(...)
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
Example: Recurrent net language model in one page
Full code is in the tutorial and the examples directory.
# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100,  50),
    h2h=F.Linear( 50,  50),
    h2y=F.Linear( 50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # initial state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        w = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, w, t)
        loss += new_loss
    return loss
Chainer: How to Use It
Install Chainer
• Prepare a Python 2.7 environment with pip
– (pyenv +) Anaconda is recommended
• Install Chainer simply by pip install chainer
• If you want to use GPU(s):
– Install CUDA and the corresponding NVIDIA driver
– Install the dependent packages by pip install chainer-cuda-deps
– You may have to update the six package: pip install -U six
Run the MNIST example (quick start)
• Requires scikit-learn: pip install scikit-learn
• Clone the Chainer repository: git clone https://github.com/pfnet/chainer
• Go to the example directory examples/mnist
• Then run python train_mnist.py
– Run it on a GPU by passing --gpu=0
• Other examples can be run similarly (some need manual preparation of datasets)
Read the documentation
• Read the documentation at http://docs.chainer.org
• It includes:
– A tutorial
– A reference manual
• All the features shown in this talk are covered in the tutorial, so please try it if you want to know the details.
Basic concepts (1)
• The essential parts of Chainer: Variable and Function
• Variable is a wrapper of n-dimensional arrays (ndarray and GPUArray)
• Function is an operation on Variables
– Each Function application is memorized by the returned Variable(s)
– All operations you want to backprop through must be done by Functions on Variables
• Making a Variable object is simple: just pass an array:
x = chainer.Variable(numpy.ndarray(...))
– The array is stored in the data attribute (x.data)
Basic concepts (2)
• Example of computational graph construction:
x = chainer.Variable(...)
y = chainer.Variable(...)
z = x**2 + 2*x*y + y
• The gradient of z(x, y) can be computed by z.backward()
• The results are stored in x.grad and y.grad (a runnable sketch follows the diagram below)
[Diagram: the computational graph of z; x and y feed the nodes _ ** 2, 2 * _, _ * _, and _ + _, which produce z]
Note: Split nodes are actually inserted automatically (they accumulate the gradients on backprop)
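A runnable version of the snippet above (a minimal sketch; one-element arrays are used so that, to my understanding, backward() can initialize the output gradient to 1 by itself):

import numpy as np
import chainer

x = chainer.Variable(np.array([3.0], dtype=np.float32))
y = chainer.Variable(np.array([5.0], dtype=np.float32))
z = x**2 + 2*x*y + y
z.backward()
print(x.grad)  # dz/dx = 2x + 2y -> [ 16.]
print(y.grad)  # dz/dy = 2x + 1  -> [ 7.]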
Basic concepts (3)
• Chainer provides many functions in the chainer.functions subpackage
– This package is often abbreviated to F
• Parameterized functions are provided as classes
– Linear, Convolution2D, EmbedID, PReLU, BatchNormalization, etc.
– Their instances should be shared across all iterations
• Non-parameterized functions are provided as plain Python functions
– Activation functions, pooling, array manipulation, etc.
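For example (a minimal sketch with made-up shapes): a parameterized function is instantiated once and reused so that its parameters persist, while a non-parameterized one is simply called:

import numpy as np
from chainer import Variable
import chainer.functions as F

linear = F.Linear(4, 3)    # parameterized: holds its own W and b
x = Variable(np.zeros((1, 4), dtype=np.float32))
h = F.relu(linear(x))      # relu is a plain, stateless function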
Basic concepts (4)
• Use FunctionSet to manage parameterized functions
– It is an object with Function attributes
– Easy to migrate functions onto GPU devices
– Easy to collect parameters and gradients (collect_parameters)
• Use Optimizer for numerical optimization
– Major algorithms are provided: SGD, MomentumSGD, AdaGrad, RMSprop, ADADELTA, Adam
– Some parameter/gradient manipulations are done via this class: weight decay, gradient clipping, etc. (see the sketch below)
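A minimal sketch of the pattern (hyperparameters are illustrative; weight_decay is, to my understanding, the Optimizer method behind the weight-decay manipulation mentioned above):

import chainer.functions as F
from chainer import FunctionSet, optimizers

model = FunctionSet(l1=F.Linear(4, 3), l2=F.Linear(3, 2))
opt = optimizers.SGD(lr=0.01)
opt.setup(model.collect_parameters())

opt.zero_grads()
# ... forward computation and loss.backward() go here ...
opt.weight_decay(0.0001)   # add an L2 penalty to all gradients
opt.update()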
Easy to debug!
• If the forward computation has a bug, an error occurs immediately at the corresponding line of the forward definition
• Example
– This code has an inconsistency in array sizes:
x = Variable(np.ndarray((3, 4), dtype=np.float32))
y = Variable(np.ndarray((3, 3), dtype=np.float32))
a = x ** 2 + x
b = a + y * 2   # ← an exception is raised at this line
c = b + x * 2
– Since the exception is raised at the offending line, we can easily find the cause of the bug (this is one big difference from Define-and-Run frameworks)
Graph manipulation (1)
• Backward unchaining: y.unchain_backward()
– It purges the nodes backward from y
– It is useful for implementing truncated BPTT (see the PTB example and the sketch below)
[Diagram: the graph x → f → y → g → z; after y.unchain_backward(), only y → g → z remains]
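A sketch of truncated BPTT in this style, loosely following the PTB example (it reuses fwd1step, h, and opt from the RNN example above; bprop_len is an illustrative truncation interval):

loss = 0
for i, (curw, nextw) in enumerate(zip(seq, seq[1:])):
    h, new_loss = fwd1step(h, Variable(curw), Variable(nextw))
    loss += new_loss
    if (i + 1) % bprop_len == 0:  # every bprop_len steps...
        opt.zero_grads()
        loss.backward()           # backprop through the recent subgraph
        loss.unchain_backward()   # ...then cut the graph history here
        opt.update()
        loss = 0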
Graph manipulation (2)
• Volatile variables: x = Variable(..., volatile=True)
– A volatile variable does not build a graph
– Volatility can be accessed directly via x.volatile
x = Variable(..., volatile=True)
y = f(x)
y.volatile = False
z = h(y)
[Diagram: the application of f to the volatile x is not recorded; the graph from y through h to z is recorded once y.volatile is set to False]
Example: Training a multi-layer perceptron in one page
Note: F = chainer.functions
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)
        t = Variable(...)
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
Example: Recurrent net language model in one page
# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100,  50),
    h2h=F.Linear( 50,  50),
    h2y=F.Linear( 50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # initial state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        w = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, w, t)
        loss += new_loss
    return loss
CUDA support (1)
• Chainer supports CUDA computation
• Installation:
– Install CUDA 6.5+
– Install the CUDA-related packages by pip install chainer-cuda-deps
◆ The build of PyCUDA may fail if you installed CUDA into a non-standard path. In that case, you have to install PyCUDA from source with the appropriate configuration.
CUDA support (2)
• Call cuda.init() before any CUDA-related operation
• Convert a numpy.ndarray into a GPUArray by chainer.cuda.to_gpu:
data_gpu = chainer.cuda.to_gpu(data_cpu)
• A GPUArray object can be passed to the Variable constructor:
x = Variable(data_gpu)
• Most functions support GPU Variables
– Parameterized functions must be sent to the GPU beforehand by Function.to_gpu or FunctionSet.to_gpu
• Extract the results to host memory by chainer.cuda.to_cpu
• All examples support CUDA (pass --gpu=N, where N is the GPU ID)
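Putting those calls together (a minimal sketch, assuming a working CUDA setup):

import numpy as np
from chainer import cuda, Variable

cuda.init()                       # must precede any CUDA-related operation
data_cpu = np.zeros((3, 4), dtype=np.float32)
data_gpu = cuda.to_gpu(data_cpu)  # numpy.ndarray -> GPUArray
x = Variable(data_gpu)            # a GPU-backed Variable
back = cuda.to_cpu(x.data)        # extract the result to host memory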
MLP example for CUDA
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10)).to_gpu()
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(to_gpu(...))
        t = Variable(to_gpu(...))
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
CUDA support (3)
• Chainer also supports computation on multiple GPUs (easily!)
• Model parallelism
– Send each FunctionSet to the appropriate device (to_gpu accepts a GPU ID):
model_0 = FunctionSet(...).to_gpu(0)
model_1 = FunctionSet(...).to_gpu(1)
– Copy Variable objects across GPUs by the copy function:
x_1 = F.copy(x_0, 1)
◆ This copy is tracked by the computational graph, so you don't need to deal with it on backprop
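For instance, a hypothetical two-GPU pipeline (shapes and names are made up; cuda.init() is assumed to have been called):

import chainer.functions as F
from chainer import FunctionSet

model_0 = FunctionSet(l=F.Linear(100, 100)).to_gpu(0)
model_1 = FunctionSet(l=F.Linear(100, 100)).to_gpu(1)

def forward(x_0):                  # x_0 lives on GPU 0
    h_0 = F.relu(model_0.l(x_0))   # computed on GPU 0
    h_1 = F.copy(h_0, 1)           # moved to GPU 1, tracked by the graph
    return F.relu(model_1.l(h_1))  # computed on GPU 1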
CUDA support (4)
• Chainer also supports computation on multiple GPUs
• Data parallelism
– A FunctionSet can be copied by copy.copy:
model = FunctionSet(...)
model_0 = copy.copy(model).to_gpu(0)
model_1 = model.to_gpu(1)
– Set up the optimizer only for the master model:
opt.setup(model_0.collect_parameters())
– After the data-parallel gradient computation, gather the gradients:
opt.accumulate_grads(model_1.gradients)
– After the update, share the parameters across the model copies:
model_1.copy_parameters_from(model_0.parameters)
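A hypothetical single iteration in this scheme (forward_loss is an illustrative helper that runs the forward pass with a given model copy; the minibatch halves are assumed to already be on the right GPUs):

loss_0 = forward_loss(model_0, x_first_half, t_first_half)    # on GPU 0
loss_1 = forward_loss(model_1, x_second_half, t_second_half)  # on GPU 1
opt.zero_grads()
loss_0.backward()
loss_1.backward()
opt.accumulate_grads(model_1.gradients)  # add the copy's grads to the master
opt.update()                             # updates the master model only
model_1.copy_parameters_from(model_0.parameters)  # re-sync the copy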
Model Zoo support (in the near future)
• Model Zoo is a place where pretrained models are registered
– Provided by the BVLC Caffe team
– It contains the Caffe reference models
• We are planning to support the Caffe reference models within three weeks (in the next minor release)
– Current design (it may change):
f = CaffeFunction('path/to/model.caffemodel')
x, t = Variable(...), Variable(...)
y = f(inputs={'data': x, 'label': t}, outputs=['loss'])
– It emulates Caffe networks with Chainer's functions
Note: development process
• Schedule
– We are planning to release updates biweekly
– Updates are classified into three groups
◆ Revision: bug fixes and updates that do not add/modify interfaces
◆ Minor: updates that add/modify interfaces without breaking backward compatibility
◆ Major: updates that are not backward-compatible
• We are using the GitHub Flow process
• We welcome your PRs!
– Please send them to the master branch
Wrap up
• Chainer is a powerful, flexible, and intuitive framework for neural networks in Python
• It is based on the Define-by-Run scheme, which makes it intuitive and flexible
• Chainer is a very young and immature project
– Its development started in mid-April (just two months ago)
– We will add many functionalities (especially more functions)
– We may add some abstraction of the whole learning process