TRANSCRIPT
Introduction to Chainer: A Flexible Framework for Deep Learning
2015-06-18 PFI/PFN Weekly Seminar
Seiya Tokui (Preferred Networks)
Self-Introduction
• Seiya Tokui @beam2d (Twitter, GitHub)
• Researcher at Preferred Networks
• Main focus: machine learning
– Learning to Hash (master's degree)
– Deep Learning, Representation Learning (current focus)
A Powerful, Flexible, and Intuitive Framework for Neural Networks
Today I will introduce:
• The features of Chainer
• How to use Chainer
• Some planned features
• (Slides in English, talk in Japanese)
Chainer: The Concept
Chainer is a framework for neural networks
• Official site: http://chainer.org
• Repository: https://github.com/pfnet/chainer
• Provided as a Python library (PyPI: chainer)
• Main features
– Powerful: Supports CUDA and multi-GPU computation
– Flexible: Supports almost arbitrary architectures
– Intuitive: Forward prop can be written as regular Python code
Elements of a neural network framework
• Multi-dimensional array implementations
• Layer implementations
– Called by various names (layers, modules, blocks, primitives, etc.)
– The smallest units of automatic differentiation
– Contain forward and backward implementations
• Optimizer implementations
• Other components (data loading schemes, training loops, etc.)
– These are also very important, though Chainer currently does not provide abstractions for them (future work)
Forward prop / Backprop
• Forward prop defines how we want to process the input data
• Backprop computes the gradient of the loss with respect to the learnable parameters
• Given the backward procedures of all layers, backprop can be written as their combination (a.k.a. reverse-mode automatic differentiation); a toy sketch follows the diagram below
[Diagram: input → hidden → hidden → output, compared with the ground truth by a loss function; gradients flow back through each layer]
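To make the combination concrete, here is a toy sketch of reverse-mode automatic differentiation in plain Python (not Chainer code; the two layer classes are invented for illustration):

# Toy reverse-mode AD (not Chainer code): each layer stores what it needs
# for its backward pass; backprop chains the backward calls in reverse.
class Square(object):
    def forward(self, x):
        self.x = x               # remember the input for the backward pass
        return x * x
    def backward(self, gy):      # chain rule: dL/dx = dL/dy * dy/dx
        return gy * 2 * self.x

class Scale(object):
    def __init__(self, a):
        self.a = a
    def forward(self, x):
        return self.a * x
    def backward(self, gy):
        return gy * self.a

layers = [Square(), Scale(3.0)]
h = 2.0
for layer in layers:             # forward prop: h = 3 * 2**2 = 12
    h = layer.forward(h)
g = 1.0                          # dL/d(output) for L = output
for layer in reversed(layers):   # backprop: backward calls in reverse order
    g = layer.backward(g)        # g = dL/d(input) = 3 * 2 * 2 = 12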
Backprop Implementation Paradigm (1)
Define-and-Run
• First, a computational graph is constructed. Then, it is repeatedly fed minibatches to run the forward/backward computation
• The computational graph can be seen as a program, and the forward/backward computation is done by its interpreter
◆ Caffe: the program is written in Prototxt
◆ Torch: the program is constructed by Lua scripts
◆ Theano-based frameworks: the program is constructed by Python scripts
Backprop Implementation Paradigm (2)
Define-and-Run (cont.)
• Pros
– (Almost) no need for memory management
– The computational graph can be implicitly optimized (cf. Theano)
• Cons
– The program is fixed within the training loop
– The interpreter must be able to express various forward computations, including control-flow statements like if and for
◆ Theano has dedicated functions for these (ifelse and scan), which are unintuitive and not Pythonic
– Network definitions are hard to debug, since an error occurs during the forward computation, far from the network definition itself
Backprop Implementation Paradigm (3)
Define-by-Run
• The forward computation is written as regular program code with special variables and operators; executing it simultaneously performs the forward computation and constructs the graph (just by recording the order of operations)
• The graph is then used for the backward computation
• This paradigm lets us use arbitrary control-flow statements in the forward computation (see the sketch below)
– No need for a mini-language and its interpreter
• It also makes the forward computation intuitive and easy to debug
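As a minimal sketch (the layer sizes and the variable-depth loop are made up for this example; FunctionSet and F.Linear are introduced later in this talk), ordinary Python control flow decides the architecture on every call, because the graph is recorded while the code runs:

import numpy as np
import chainer.functions as F
from chainer import FunctionSet, Variable

model = FunctionSet(
    l1=F.Linear(10, 10),
    l2=F.Linear(10, 10))

def forward(x, n_steps):
    h = F.relu(model.l1(x))
    for _ in xrange(n_steps):    # an ordinary Python loop; its length
        h = F.relu(model.l2(h))  # may differ on every call
    return h

x = Variable(np.zeros((1, 10), dtype=np.float32))
y = forward(x, 3)                # the recorded graph applies l2 three times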
Backprop Implementation Paradigm (4)
Define-by-Run (cont.)
• The computational graph can be modified within each iteration
• Example: truncated BPTT (backprop through time)
– BPTT: backprop on a recurrent net
– Truncated BPTT: truncate the backprop at some time step
– Truncation is one type of modification of the computational graph
[Diagram: an RNN unrolled over time; the backprop path is truncated at a chosen time step]
Features of Chainer
• Define-by-Run scheme
– The forward computation can contain any Python code
◆ if-else, for-else, break, continue, try-except-finally, list, dict, class, etc.
– Users can modify the graph within the loop
◆ E.g. truncation can be done by unchain_backward (which unchains the graph backward from some variable)
◆ See the tutorial on recurrent nets: http://docs.chainer.org/en/latest/tutorial/recurrentnet.html
• Many predefined functions
• GPU support via PyCUDA
Example: Training a multi-layer perceptron in one page
Full code is in the tutorial and the examples directory.
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)
        t = Variable(...)
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
Example: Recurrent net language model in one page
Full code is in the tutorial and the examples directory.
# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100,  50),
    h2h=F.Linear( 50,  50),
    h2y=F.Linear( 50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # initial state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        w = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, w, t)
        loss += new_loss
    return loss
Chainer: How to Use It
Install Chainer
• Prepare a Python 2.7 environment with pip
– (pyenv +) Anaconda is recommended
• Install Chainer simply by pip install chainer
• If you want to use GPU(s):
– Install CUDA and the corresponding NVIDIA driver
– Install the dependent packages by pip install chainer-cuda-deps
– You may have to update the six package: pip install -U six
Run the MNIST example (quick start)
• Requires scikit-learn: pip install scikit-learn
• Clone the Chainer repository: git clone https://github.com/pfnet/chainer
• Go to the example directory examples/mnist
• Then run python train_mnist.py
– Run it on a GPU by passing --gpu=0
• Other examples can be run similarly (some need manual preparation of datasets)
Read the documentation
• Read the documentation at http://docs.chainer.org
• It includes:
– A tutorial
– A reference manual
• All the features shown in this talk are covered in the tutorial, so please try it if you want to know the details.
Basic concepts (1)
• The essential parts of Chainer: Variable and Function
• Variable is a wrapper of n-dimensional arrays (ndarray and GPUArray)
• Function is an operation on Variables
– Each Function application is memorized by the returned Variable(s)
– All operations you want to backprop through must be done by Functions on Variables
• Making a Variable object is simple: just pass an array:
x = chainer.Variable(numpy.ndarray(...))
– The array is stored in the data attribute (x.data)
Basic concepts (2)
• Example of computational graph construction:
x = chainer.Variable(...)
y = chainer.Variable(...)
z = x**2 + 2*x*y + y
• The gradient of z(x, y) can be computed by z.backward()
• The results are stored in x.grad and y.grad (a runnable sketch follows the diagram below)
[Diagram: the computational graph of z; x and y feed the nodes _ ** 2, 2 * _, _ * _, and _ + _, which produce z]
Note: Split nodes are actually inserted automatically (they accumulate the gradients on backprop)
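A runnable version of the snippet above (a minimal sketch; one-element arrays are used so that, to my understanding, backward() can initialize the output gradient to 1 by itself):

import numpy as np
import chainer

x = chainer.Variable(np.array([3.0], dtype=np.float32))
y = chainer.Variable(np.array([5.0], dtype=np.float32))
z = x**2 + 2*x*y + y
z.backward()
print(x.grad)  # dz/dx = 2x + 2y -> [ 16.]
print(y.grad)  # dz/dy = 2x + 1  -> [ 7.]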
Basic concepts (3)
• Chainer provides many functions in the chainer.functions subpackage
– This package is often abbreviated to F
• Parameterized functions are provided as classes
– Linear, Convolution2D, EmbedID, PReLU, BatchNormalization, etc.
– Their instances should be shared across all iterations
• Non-parameterized functions are provided as plain Python functions
– Activation functions, pooling, array manipulation, etc.
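For example (a minimal sketch with made-up shapes): a parameterized function is instantiated once and reused so that its parameters persist, while a non-parameterized one is simply called:

import numpy as np
from chainer import Variable
import chainer.functions as F

linear = F.Linear(4, 3)    # parameterized: holds its own W and b
x = Variable(np.zeros((1, 4), dtype=np.float32))
h = F.relu(linear(x))      # relu is a plain, stateless function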
Basic concepts (4)
• Use FunctionSet to manage parameterized functions
– It is an object with Function attributes
– Easy to migrate functions onto GPU devices
– Easy to collect parameters and gradients (collect_parameters)
• Use Optimizer for numerical optimization
– Major algorithms are provided: SGD, MomentumSGD, AdaGrad, RMSprop, ADADELTA, Adam
– Some parameter/gradient manipulations are done via this class: weight decay, gradient clipping, etc. (see the sketch below)
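A minimal sketch of the pattern (hyperparameters are illustrative; weight_decay is, to my understanding, the Optimizer method behind the weight-decay manipulation mentioned above):

import chainer.functions as F
from chainer import FunctionSet, optimizers

model = FunctionSet(l1=F.Linear(4, 3), l2=F.Linear(3, 2))
opt = optimizers.SGD(lr=0.01)
opt.setup(model.collect_parameters())

opt.zero_grads()
# ... forward computation and loss.backward() go here ...
opt.weight_decay(0.0001)   # add an L2 penalty to all gradients
opt.update()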
Easy to debug!
• If the forward computation has a bug, an error occurs immediately at the corresponding line of the forward definition
• Example
– This code has an inconsistency in array sizes:
x = Variable(np.ndarray((3, 4), dtype=np.float32))
y = Variable(np.ndarray((3, 3), dtype=np.float32))
a = x ** 2 + x
b = a + y * 2   # ← an exception is raised at this line
c = b + x * 2
– Since the exception is raised at the offending line, we can easily find the cause of the bug (this is one big difference from Define-and-Run frameworks)
Graph manipulation (1)
• Backward unchaining: y.unchain_backward()
– It purges the nodes backward from y
– It is useful for implementing truncated BPTT (see the PTB example and the sketch below)
[Diagram: the graph x → f → y → g → z; after y.unchain_backward(), only y → g → z remains]
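A sketch of truncated BPTT in this style, loosely following the PTB example (it reuses fwd1step, h, and opt from the RNN example above; bprop_len is an illustrative truncation interval):

loss = 0
for i, (curw, nextw) in enumerate(zip(seq, seq[1:])):
    h, new_loss = fwd1step(h, Variable(curw), Variable(nextw))
    loss += new_loss
    if (i + 1) % bprop_len == 0:  # every bprop_len steps...
        opt.zero_grads()
        loss.backward()           # backprop through the recent subgraph
        loss.unchain_backward()   # ...then cut the graph history here
        opt.update()
        loss = 0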
Graph manipulation (2)
• Volatile variables: x = Variable(..., volatile=True)
– A volatile variable does not build a graph
– Volatility can be accessed directly via x.volatile
x = Variable(..., volatile=True)
y = f(x)
y.volatile = False
z = h(y)
[Diagram: the application of f to the volatile x is not recorded; the graph from y through h to z is recorded once y.volatile is set to False]
Example: Training a multi-layer perceptron in one page
Note: F = chainer.functions
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)
        t = Variable(...)
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
Example: Recurrent net language model in one page
# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100,  50),
    h2h=F.Linear( 50,  50),
    h2y=F.Linear( 50, 1000))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # initial state
    loss = 0
    for curw, nextw in zip(seq, seq[1:]):
        w = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, w, t)
        loss += new_loss
    return loss
CUDA support (1)
• Chainer supports CUDA computation
• Installation:
– Install CUDA 6.5+
– Install the CUDA-related packages by pip install chainer-cuda-deps
◆ The build of PyCUDA may fail if you installed CUDA into a non-standard path. In that case, you have to install PyCUDA from source with the appropriate configuration.
CUDA support (2)
• Call cuda.init() before any CUDA-related operation
• Convert a numpy.ndarray into a GPUArray by chainer.cuda.to_gpu:
data_gpu = chainer.cuda.to_gpu(data_cpu)
• A GPUArray object can be passed to the Variable constructor:
x = Variable(data_gpu)
• Most functions support GPU Variables
– Parameterized functions must be sent to the GPU beforehand by Function.to_gpu or FunctionSet.to_gpu
• Extract the results to host memory by chainer.cuda.to_cpu
• All examples support CUDA (pass --gpu=N, where N is the GPU ID)
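Putting those calls together (a minimal sketch, assuming a working CUDA setup):

import numpy as np
from chainer import cuda, Variable

cuda.init()                       # must precede any CUDA-related operation
data_cpu = np.zeros((3, 4), dtype=np.float32)
data_gpu = cuda.to_gpu(data_cpu)  # numpy.ndarray -> GPUArray
x = Variable(data_gpu)            # a GPU-backed Variable
back = cuda.to_cpu(x.data)        # extract the result to host memory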
MLP example for CUDA
# Model definition
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10)).to_gpu()
opt = optimizers.SGD()
opt.setup(model.collect_parameters())

# Forward computation
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(to_gpu(...))
        t = Variable(to_gpu(...))
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()
CUDA support (3)
• Chainer also supports computation on multiple GPUs (easily!)
• Model parallelism
– Send each FunctionSet to the appropriate device (to_gpu accepts a GPU ID):
model_0 = FunctionSet(...).to_gpu(0)
model_1 = FunctionSet(...).to_gpu(1)
– Copy Variable objects across GPUs by the copy function:
x_1 = F.copy(x_0, 1)
◆ This copy is tracked by the computational graph, so you don't need to deal with it on backprop
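For instance, a hypothetical two-GPU pipeline (shapes and names are made up; cuda.init() is assumed to have been called):

import chainer.functions as F
from chainer import FunctionSet

model_0 = FunctionSet(l=F.Linear(100, 100)).to_gpu(0)
model_1 = FunctionSet(l=F.Linear(100, 100)).to_gpu(1)

def forward(x_0):                  # x_0 lives on GPU 0
    h_0 = F.relu(model_0.l(x_0))   # computed on GPU 0
    h_1 = F.copy(h_0, 1)           # moved to GPU 1, tracked by the graph
    return F.relu(model_1.l(h_1))  # computed on GPU 1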
CUDA support (4)
• Chainer also supports computation on multiple GPUs
• Data parallelism
– A FunctionSet can be copied by copy.copy:
model = FunctionSet(...)
model_0 = copy.copy(model).to_gpu(0)
model_1 = model.to_gpu(1)
– Set up the optimizer only for the master model:
opt.setup(model_0.collect_parameters())
– After the data-parallel gradient computation, gather the gradients:
opt.accumulate_grads(model_1.gradients)
– After the update, share the parameters across the model copies:
model_1.copy_parameters_from(model_0.parameters)
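A hypothetical single iteration in this scheme (forward_loss is an illustrative helper that runs the forward pass with a given model copy; the minibatch halves are assumed to already be on the right GPUs):

loss_0 = forward_loss(model_0, x_first_half, t_first_half)    # on GPU 0
loss_1 = forward_loss(model_1, x_second_half, t_second_half)  # on GPU 1
opt.zero_grads()
loss_0.backward()
loss_1.backward()
opt.accumulate_grads(model_1.gradients)  # add the copy's grads to the master
opt.update()                             # updates the master model only
model_1.copy_parameters_from(model_0.parameters)  # re-sync the copy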
Model Zoo support (in the near future)
• Model Zoo is a place where pretrained models are registered
– Provided by the BVLC Caffe team
– It contains the Caffe reference models
• We are planning to support the Caffe reference models within three weeks (in the next minor release)
– Current design (it may change):
f = CaffeFunction('path/to/model.caffemodel')
x, t = Variable(...), Variable(...)
y = f(inputs={'data': x, 'label': t}, outputs=['loss'])
– It emulates Caffe networks with Chainer's functions
Note: development process
• Schedule
– We are planning to release updates biweekly
– Updates are classified into three groups
◆ Revision: bug fixes and updates that do not add/modify interfaces
◆ Minor: updates that add/modify interfaces without breaking backward compatibility
◆ Major: updates that are not backward-compatible
• We are using the GitHub Flow process
• We welcome your PRs!
– Please send them to the master branch
Wrap up
• Chainer is a powerful, flexible, and intuitive framework for neural networks in Python
• It is based on the Define-by-Run scheme, which makes it intuitive and flexible
• Chainer is a very young and immature project
– Its development started in mid-April (just two months ago)
– We will add many functionalities (especially more functions)
– We may add some abstraction of the whole learning process