Deep Learning Frameworks and the Implementation of Chainer

Chainer

2016/03/07 PPL2016@

Preferred Networks

-2014

2014

2014-

ChainerCuPy

v1.7(2016/3/1)

2/50

Define and Run and Define by Run

CuPyChainer

Chainer

3/50

[Figure: layered neural network with nodes labelled x1 ... xN, h1 ... hH, k1 ... kM, and y1 ... yM; arrows labelled Forward and Backward]

50%

5/50

DL (Deep Learning)

2012

201420151500*

DeepMind AlphaGo 40Lee Sedol

2014: 22-layer NN [Google]

*http://memkite.com/deep-learning-bibliography/

2015: 152-layer NN [MSRA]

6/50

[Figure: bar chart of ILSVRC error rates (%), in slide order: 28.2, 25.8, 16.4, 11.7, 6.7, 5.98, 5.1, 4.94, 4.82, 3.56; y-axis from 0 to 30; annotation: Deep Learning]

7/50

chainer-DCGAN 300NN
https://github.com/mattya/chainer-DCGAN

8/50

2

9/50

1

10/50

DAG =

13/50

z = x ** 2 + 2 * x * y + y

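[Figure: computational graph of the expression. Inputs x and y feed the operation nodes _ ** 2, 2 * _, and _ * _; two _ + _ nodes combine the results into z]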
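The graph for z = x ** 2 + 2 * x * y + y can be built and differentiated with a few lines of plain Python. The following is only a minimal sketch of the idea, not Chainer's implementation: the `Var` class and `as_var` helper are hypothetical names introduced here, and only scalars are handled.

```python
class Var:
    """Scalar graph node that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = as_var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = as_var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__

    def __pow__(self, k):
        return Var(self.value ** k, ((self, k * self.value ** (k - 1)),))

    def backward(self):
        # Collect nodes in topological order (inputs first) with a depth-first walk.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


def as_var(x):
    return x if isinstance(x, Var) else Var(x)


x, y = Var(3.0), Var(4.0)
z = x ** 2 + 2 * x * y + y   # the graph is recorded while this line runs
z.backward()
print(z.value, x.grad, y.grad)  # 37.0 14.0 7.0
```

The gradients match the hand-derived values dz/dx = 2x + 2y and dz/dy = 2x + 1 at x = 3, y = 4.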

14/50

1.

2.

3.

16/50

Linear

L2

Linear

L1

MNIST & Chainer

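import chainer.functions as F
import chainer.links as L

# 3-layer perceptron: 784 inputs, n_units hidden units, 10 outputs
L1 = L.Linear(784, n_units)
L2 = L.Linear(n_units, 10)

def forward(x):
    h1 = F.relu(L1(x))
    return L2(h1)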
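To make the arithmetic behind the two Linear layers and the ReLU explicit, here is a dependency-free sketch of the same forward pass. It is an illustration only: the helper names (`linear`, `relu`, `init_linear`), the hidden size of 16, and the random initialization are assumptions, not what Chainer does internally.

```python
import random


def linear(x, W, b):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_linear(n_in, n_out, rng):
    """Return a weight matrix W (n_out x n_in) and a zero bias vector."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
n_units = 16  # assumed hidden size, chosen only for this sketch
W1, b1 = init_linear(784, n_units, rng)
W2, b2 = init_linear(n_units, 10, rng)


def forward(x):
    h1 = relu(linear(x, W1, b1))   # hidden layer: Linear followed by ReLU
    return linear(h1, W2, b2)      # output layer: one score per digit class


y = forward([0.5] * 784)
print(len(y))  # 10
```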

h1

W bias

W bias

ReLU

17/50

Recurrent Net

t=T t=T-1 t=T

T

T-1

T

18/50

Recurrent Net

DAG

DAG Backprop

Through Time

t=1

t=2

t=3

t=4

19/50

SGD, Adam

21/50

2015

C/C++, Python, R, Matlab, Julia

GPU: CUDA, OpenCL

GPU/

22/50

DL frameworks

Framework    Main language   Main developer
Chainer      Python          Preferred Networks/Infrastructure
Caffe        C++             BVLC
Torch        Lua             Idiap Research Institute, DeepMind
Theano       Python          Univ. of Montreal
TensorFlow   C++/Python      Google

()

RNN/LSTM

DSL (prototxt)

DSL (YAML)

DSL (Python)

LuaJIT

GPU, gRPC

23/50

Symbolic: Theano, TensorFlow; Caffe, cxxnet

Imperative: Chainer; Torch, Minerva

# Symbolic style (pseudocode)
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# compiles the function
f = compile(D)
d = f(A=numpy.ones(10),
      B=numpy.ones(10) * 2)

# Imperative style
import numpy
a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = c + 1
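The symbolic pseudocode can be made runnable with a very small expression library. This is a sketch under stated assumptions, not the API of Theano or TensorFlow: `Expr`, `Op`, `wrap`, and `compile_expr` are hypothetical names, and scalars stand in for arrays. Building `D` performs no arithmetic; values are only computed when the compiled function is called.

```python
class Expr:
    """Symbolic node: combining nodes only records the operation."""

    def __mul__(self, other):
        return Op('mul', self, wrap(other))

    def __add__(self, other):
        return Op('add', self, wrap(other))


class Variable(Expr):
    def __init__(self, name):
        self.name = name


class Constant(Expr):
    def __init__(self, value):
        self.value = value


class Op(Expr):
    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right


def wrap(x):
    return x if isinstance(x, Expr) else Constant(x)


def compile_expr(expr):
    """Turn a graph into a callable that takes values by variable name."""
    def run(node, env):
        if isinstance(node, Variable):
            return env[node.name]
        if isinstance(node, Constant):
            return node.value
        left, right = run(node.left, env), run(node.right, env)
        return left * right if node.kind == 'mul' else left + right

    return lambda **env: run(expr, env)


A = Variable('A')
B = Variable('B')
C = B * A            # no multiplication happens here
D = C + Constant(1)  # D is still just a graph
f = compile_expr(D)
print(f(A=1.0, B=2.0))  # 3.0
```

Because the whole graph exists before any value is computed, a real framework can inspect and optimize it at the compile step; the price is that control flow has to be expressed inside the graph language.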

24/50

1

GPU

TensorFlow

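A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)

[Figure: two computational graphs for D. One fuses both operations into a single node (A, B -> B * A + 1 -> D); the other keeps the intermediate node C (A, B -> B * A -> C -> C + 1 -> D)]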

25/50

a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = 0
for i in range(len(c)):
    d += c[i] + i

26/50

Q. A. Recursive Neural Network

Recurrent Net, Recursive Net

ChainerExample

[Figure: binary tree in which x1 and x2 are combined into p1, and p1 and x3 are combined into p2]

p1 = f(x1, x2)

p2 = f(p1, x3)
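A recursive net applies the same function f along a tree whose shape differs per input, which is why an ordinary host-language recursion is a natural way to write it. The sketch below assumes a toy f (an element-wise tanh of a fixed weighted sum) in place of a learned layer; `encode` and the weight values are hypothetical.

```python
import math


def f(left, right, w_left=0.5, w_right=0.5):
    """Combine two child vectors into a parent vector (toy stand-in for a learned layer)."""
    return [math.tanh(w_left * a + w_right * b) for a, b in zip(left, right)]


def encode(tree):
    """Leaves are vectors (lists of floats); internal nodes are (left, right) tuples."""
    if isinstance(tree, tuple):
        left, right = tree
        return f(encode(left), encode(right))
    return tree


x1, x2, x3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Written out by hand, as on the slide
p1 = f(x1, x2)
p2 = f(p1, x3)

# The same computation driven by the tree structure
assert encode(((x1, x2), x3)) == p2

# A differently shaped tree needs no change to the code
other = encode((x1, (x2, x3)))
print(len(other))  # 2
```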

27/50

DSL

or

MXNet

28/50

Chainer

Define by Run

backward

Define and Run

TensorFlow, Theano, Torch nn

a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = c + 1

[Figure: computational graph a, b -> b * a -> c -> c + 1 -> d]

Define by Run

29/50

Define by Run

NN

NN

OK

NN

30/50

NN

GPU: Tesla K80 @ 24GB

150

GPU

31/50

NN

1001 Nvidia

t=1

t=2

t=3

t=4

t=1

t=2

A4 B 2

32/50

Define and Run and Define by Run

GPU

33/50

CuPyChainer

CuPy

CUDANumPy

Chainer v1.5.0 174

cuBLAS

reshape

elementwise, reduction

PythonGPU

PythonNumPy

PCGPUCUDA

CUDA NumPy CuPy

35/50

CuPy

Torch(), Eigen::Tensor

BLAS(OpenBLAS, MKL)

CUDA(cuBLAS, cuDNN)

DL

36/50

CuPy

numpy cupy

CPU/GPU NumPy CuPy logsumexp

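from chainer import cuda

def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # numpy or cupy, depending on the type of x
    x_max = x.max(axis=axis, keepdims=True)  # keepdims lets x - x_max broadcast for any axis
    exp_sum = xp.exp(x - x_max).sum(axis=axis, keepdims=True)
    return (x_max + xp.log(exp_sum)).squeeze(axis=axis)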
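The reason logsumexp subtracts the maximum before exponentiating is numerical: exp overflows for large inputs, while the shifted form stays finite. The following pure-Python sketch shows this on a flat list of floats; it illustrates the formula only and leaves out the NumPy/CuPy dispatch.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    v_max = max(values)
    return v_max + math.log(sum(math.exp(v - v_max) for v in values))


# The naive formula overflows for large inputs ...
naive_overflows = False
try:
    math.log(sum(math.exp(v) for v in [1000.0, 1000.0]))
except OverflowError:
    naive_overflows = True

# ... while the shifted version returns 1000 + log(2).
stable = logsumexp([1000.0, 1000.0])
print(naive_overflows, stable)
```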

37/50

CuPy

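import datetime
import numpy
import cupy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)  # run once before timing
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)  # run once before timing
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)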

                     Time [ms]   Speedup
NumPy                     2929       1.0
CuPy                       585       5.0
CuPy + Memory Pool         123      23.8

Environment: Intel Core i7-4790 @3.60GHz, 32GB, GeForce GTX 970
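The gain in the "Memory Pool" row comes from avoiding repeated device allocations. The general idea can be sketched as follows; this is not CuPy's actual implementation, and `MemoryPool` and `slow_allocator` are hypothetical names, with `bytearray` standing in for a block of GPU memory.

```python
class MemoryPool:
    """Caches freed blocks by size so that repeated allocations skip the allocator."""

    def __init__(self, allocator):
        self._allocator = allocator   # stand-in for a costly call such as a device malloc
        self._free_blocks = {}        # size -> list of reusable blocks

    def malloc(self, size):
        cached = self._free_blocks.get(size)
        if cached:
            return cached.pop()       # reuse a previously freed block
        return self._allocator(size)  # fall back to the real allocator

    def free(self, block):
        # Keep the block for later reuse instead of returning it to the system.
        self._free_blocks.setdefault(len(block), []).append(block)


calls = []


def slow_allocator(size):
    calls.append(size)
    return bytearray(size)


pool = MemoryPool(slow_allocator)
for _ in range(1000):
    buf = pool.malloc(4096)
    pool.free(buf)

print(len(calls))  # 1
```

In a loop like the benchmark, where every iteration needs temporary arrays of the same size, only the first iteration pays for the allocation.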

38/50

CuPy (1/2)

Cython

PythonC200

NumPy

NumPy

NumPy

83

NumPy

39/50

CuPy (2/2)

CUDA

nvcc

A+B=C

0~25 27

8311^3=1331

Chainer11595

40/50

CuPy

3

cupy.core cupy.cuda Cython

CUDA(cuBLAS, cuRAND, cuDNN)

ndarray

ufunc, elementwise, reduction

CUDA Python wrapper cupy.cuda

cupy.core

cupy

41/50

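Python ndarray, C++ CArray

template <typename T, int ndim>
class CArray {
private:
  T* data_;            // pointer to the data on the GPU
  int size_;           // total number of elements in the CArray
  int shape_[ndim];    // size of each dimension
  int strides_[ndim];  // stride of each dimension
};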
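The shape and strides fields are what let a kernel locate an element from a multi-dimensional index. The sketch below assumes NumPy-style strides measured in bytes and a C-contiguous layout; `offset` and `c_contiguous_strides` are hypothetical helper names used only for illustration.

```python
def offset(index, strides):
    """Byte offset of a multi-dimensional index, given per-dimension strides in bytes."""
    return sum(i * s for i, s in zip(index, strides))


def c_contiguous_strides(shape, itemsize):
    """Strides of a row-major array: the last dimension moves by one item."""
    strides, step = [], itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


shape, itemsize = (2, 3), 4                      # a 2x3 array of 4-byte floats
strides = c_contiguous_strides(shape, itemsize)  # (12, 4)

# A transpose only swaps shape and strides; the data itself does not move.
transposed_strides = strides[::-1]

print(strides, offset((1, 2), strides), offset((2, 1), transposed_strides))
# (12, 4) 20 20
```

Element (1, 2) of the array and element (2, 1) of its transpose resolve to the same byte offset, which is why operations such as reshape and transpose can often be done without copying.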

42/50

GPU

PythonC++

ElementwiseKernel

ReductionKernel

Reduce

ufunc

NumPyElementwise

43/50

ElementwiseKernel

2

squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',   # input arguments
    'float32 z',              # output arguments
    'z = (x - y) * (x - y)',  # element-wise operation
    'squared_diff')           # kernel name

squared_diff(cupy.arange(10, dtype=cupy.float32), 10)

44/50

1

squared_diff = cupy.ElementwiseKernel(
    'T x, T y',               # input arguments with a generic type T
    'T z',                    # output arguments
    'z = (x - y) * (x - y)',  # element-wise operation
    'squared_diff')           # kernel name

45/50

Elementwise

Python

${preamble}
extern "C" __global__ void ${name}(${params})
{
  ${loop_prep};
  CUPY_FOR(i, _ind.size()) {  // loop over all elements
    _ind.set(i);              // set the current index
    ${operation};             // user-supplied operation
  }
  ${after_loop};
}
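Filling such a template from Python is plain string substitution. The sketch below uses the standard library's `string.Template`, which happens to share the `${name}` placeholder syntax; whether CuPy uses this exact mechanism is an assumption, `render_kernel` is a hypothetical helper, and the parameter list passed in is a placeholder rather than what CuPy really generates.

```python
from string import Template

KERNEL_TEMPLATE = Template('''${preamble}
extern "C" __global__ void ${name}(${params})
{
  ${loop_prep};
  CUPY_FOR(i, _ind.size()) {
    _ind.set(i);
    ${operation};
  }
  ${after_loop};
}
''')


def render_kernel(name, params, operation,
                  preamble='', loop_prep='', after_loop=''):
    """Fill the template; the resulting source would then be handed to a compiler."""
    return KERNEL_TEMPLATE.substitute(
        preamble=preamble, name=name, params=params,
        loop_prep=loop_prep, operation=operation, after_loop=after_loop)


source = render_kernel(
    name='squared_diff',
    # Placeholder parameter list, for illustration only
    params='CArray<float, 1> x, CArray<float, 1> y, CArray<float, 1> z',
    operation='z = (x - y) * (x - y)')
print(source)
```

Generating the source at run time is what allows one Python-level kernel definition to be specialized for each combination of argument types.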

46/50

cupy.add(x, y)

NumPy

args, kwdargs

device

add

CUDA

CUDA

CUDA

47/50

CuPy

GPU

Malloc, Free

NumPy

Chainer

48/50

CuPy

Chainer

NumPy

49/50

MXNet

Programming Models for Deep Learning

http://mxnet.readthedocs.org/en/latest/program_model.html

GPU

http://www.slideshare.net/NVIDIAJapan/gpu-51812528

CuPy

http://www.slideshare.net/ryokuta/cupy

NumPy

http://www.slideshare.net/ryokuta/numpy-57587130

We are Hiring!

https://www.preferred-networks.jp/job_ja

50/50
