Deep Learning Frameworks and the Implementation of Chainer
-
Chainer
2016/03/07 PPL2016@
Preferred Networks
-
-2014
2014
2014-
Chainer, CuPy
v1.7 (2016/3/1)
2/50
-
Define and Run / Define by Run
CuPy, Chainer
Chainer
3/50
-
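[Figure: a neural network with inputs x1 … xN, hidden units h1 … hH and outputs y1 … yM (plus k1 … kM), with the Forward and Backward directions marked]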
5/50
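The Forward/Backward picture can be made concrete with a minimal NumPy sketch. It assumes a single tanh hidden layer, no biases, and a squared-error loss against a target t (all choices of this example, not of the slide): the forward pass computes y from x, and the backward pass sends dL/dy back through the layers to get the weight gradients.

```python
import numpy as np

rng = np.random.RandomState(0)
N, H, M = 4, 5, 3                  # sizes of x, h and y (illustrative)
W1 = rng.randn(H, N) * 0.1         # input -> hidden weights
W2 = rng.randn(M, H) * 0.1         # hidden -> output weights

def forward(x):
    h = np.tanh(W1.dot(x))         # hidden units h1 ... hH
    y = W2.dot(h)                  # outputs y1 ... yM
    return h, y

def loss(x, t):
    _, y = forward(x)
    return 0.5 * np.sum((y - t) ** 2)

def backward(x, h, y, t):
    # Propagate dL/dy back through the layers in reverse order
    gy = y - t                     # dL/dy for the squared loss above
    gW2 = np.outer(gy, h)          # dL/dW2
    gh = W2.T.dot(gy)              # dL/dh
    ga = gh * (1 - h ** 2)         # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    gW1 = np.outer(ga, x)          # dL/dW1
    return gW1, gW2

x, t = rng.randn(N), rng.randn(M)  # one input and its target
h, y = forward(x)
gW1, gW2 = backward(x, h, y, t)
```

A finite-difference check on any weight entry should agree with the backward pass, which is a quick way to validate hand-written gradients.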
-
DL (Deep Learning)
2012
2014–2015: 1500 papers*
DeepMind AlphaGo 40 Lee Sedol
2014: 22-layer NN [Google]
2015: 152-layer NN [MSRA]
*http://memkite.com/deep-learning-bibliography/
6/50
-
ILSVRC
[Figure: ILSVRC error rate (%) over successive results: 28.2, 25.8, 16.4, 11.7, 6.7, 5.98, 5.1, 4.94, 4.82, 3.56, annotated "Deep Learning"]
7/50
-
chainer-DCGAN 300NN: https://github.com/mattya/chainer-DCGAN
8/50
-
2
9/50
-
1
10/50
-
11/50
-
12/50
-
DAG =
13/50
-
z = x ** 2 + 2 * x * y + y
[Figure: computational graph for z, with inputs x and y flowing through the nodes _ ** 2, 2 * _, _ * _, _ + _ and _ + _ to z]
14/50
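A minimal reverse-mode differentiation sketch for this z, with hypothetical names (Node, backward): each operation records its inputs together with the local derivative, and backward visits the DAG in reverse topological order so that a node's gradient is complete before it is passed on. Since x feeds two different nodes, its gradient is the sum over both paths. For x = 3 and y = 2 this gives dz/dx = 2x + 2y = 10 and dz/dy = 2x + 1 = 7.

```python
class Node:
    """A value in the computational graph, with links to the nodes it came from."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # sequence of (parent_node, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])
    __rmul__ = __mul__

    def __pow__(self, k):
        return Node(self.value ** k, [(self, k * self.value ** (k - 1))])

def backward(output):
    # Order the DAG so every node comes after all nodes it depends on,
    # then propagate gradients from the output back towards the inputs.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Node(3.0), Node(2.0)
z = x ** 2 + 2 * x * y + y
backward(z)
```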
-
15/50
-
1.
2.
3.
16/50
-
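MNIST & Chainer
[Figure: x → Linear L1 (W, bias) → ReLU → h1 → Linear L2 (W, bias) → output]
A 3-layer network:
import chainer.functions as F
import chainer.links as L
n_units = 100                 # hidden layer size (example value)
L1 = L.Linear(784, n_units)   # 28x28 = 784 input pixels
L2 = L.Linear(n_units, 10)    # 10 digit classes
def forward(x):
    h1 = F.relu(L1(x))
    return L2(h1)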
17/50
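For reference, the same forward pass in plain NumPy, assuming n_units = 100 and small random weights: a Linear link holds W with shape (out_size, in_size) and a bias b, and computes x W^T + b on a mini-batch x; ReLU is max(x, 0). The helper names below are illustrative.

```python
import numpy as np

n_units = 100                                   # hidden layer size (assumed)
rng = np.random.RandomState(0)
# Parameters laid out like a Linear link: W has shape (out_size, in_size)
W1, b1 = rng.randn(n_units, 784) * 0.01, np.zeros(n_units)
W2, b2 = rng.randn(10, n_units) * 0.01, np.zeros(10)

def linear(x, W, b):
    return x.dot(W.T) + b                       # y = x W^T + b for each row of x

def relu(x):
    return np.maximum(x, 0)

def forward(x):
    h1 = relu(linear(x, W1, b1))
    return linear(h1, W2, b2)                   # one score per digit class

x = rng.rand(32, 784).astype(np.float32)        # a fake mini-batch of 32 flattened 28x28 images
y = forward(x)
```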
-
Recurrent Net
[Figure: a recurrent net in which the state at t=T is computed from the state at t=T-1]
18/50
-
Recurrent Net
DAG
DAG Backprop Through Time
[Figure: the recurrent net unrolled over t=1, t=2, t=3, t=4]
19/50
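A minimal sketch of Backprop Through Time on a scalar recurrent net, assuming h_t = tanh(w h_{t-1} + u x_t) and a squared loss on the last state (both assumptions of this example). The loop over t is unrolled into a DAG; the backward pass walks it from t=T down to t=1, and because w and u are shared by every time step, their gradients are summed over t.

```python
import numpy as np

def forward(w, u, xs, h0=0.0):
    hs = [h0]
    for x in xs:                                   # unroll t = 1 .. T
        hs.append(np.tanh(w * hs[-1] + u * x))     # h_t = tanh(w h_{t-1} + u x_t)
    return hs

def bptt(w, u, xs, target):
    hs = forward(w, u, xs)
    gh = hs[-1] - target                           # dL/dh_T for L = 0.5 (h_T - target)^2
    gw = gu = 0.0
    for t in range(len(xs), 0, -1):                # t = T .. 1: backwards through the unrolled DAG
        ga = gh * (1.0 - hs[t] ** 2)               # through tanh
        gw += ga * hs[t - 1]                       # shared weights: accumulate over time steps
        gu += ga * xs[t - 1]
        gh = ga * w                                # gradient flowing into h_{t-1}
    return gw, gu

xs = [1.0, -0.5, 0.3, 0.8]                         # inputs at t = 1 .. 4
w, u, target = 0.7, 0.4, 0.2
gw, gu = bptt(w, u, xs, target)
```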
-
SGD, Adam
21/50
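The two update rules can be sketched in NumPy as follows (sgd_update and Adam are illustrative names; the Adam defaults shown are the commonly used hyperparameters). SGD steps against the raw gradient; Adam keeps running averages of the gradient and its square, corrects their bias from the zero initialization, and scales each step by their ratio. The usage lines minimize f(p) = (p - 3)^2 with both.

```python
import numpy as np

def sgd_update(param, grad, lr=0.01):
    param -= lr * grad                                   # step against the gradient (in place)

class Adam:
    def __init__(self, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = self.v = None
        self.t = 0

    def update(self, param, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(param), np.zeros_like(param)
        self.t += 1
        self.m += (1 - self.beta1) * (grad - self.m)          # running mean of gradients
        self.v += (1 - self.beta2) * (grad * grad - self.v)   # running mean of squared gradients
        m_hat = self.m / (1 - self.beta1 ** self.t)           # bias correction for the zero init
        v_hat = self.v / (1 - self.beta2 ** self.t)
        param -= self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)

# Usage: minimize f(p) = (p - 3)^2, whose gradient is 2 * (p - 3)
p_sgd, p_adam = np.zeros(1), np.zeros(1)
adam = Adam(alpha=0.1)
for _ in range(500):
    sgd_update(p_sgd, 2 * (p_sgd - 3), lr=0.1)
    adam.update(p_adam, 2 * (p_adam - 3))
```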
-
2015
C/C++, Python, R, Matlab, Julia
GPU: CUDA, OpenCL
GPU/
22/50
-
DL frameworks (main language; developer):
Chainer: Python; Preferred Networks/Infrastructure
Caffe: C++; BVLC
Torch: Lua; Idiap Research Institute, DeepMind
Theano: Python; Univ. of Montreal
TensorFlow: C++/Python; Google
RNN/LSTM; DSL (prototxt); DSL (YAML); DSL, Python; LuaJIT; GPU, gRPC
23/50
-
Symbolic vs. Imperative
Symbolic: Theano, TensorFlow; Caffe, cxxnet
Imperative: Chainer; Torch, Minerva
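# Symbolic style (pseudocode: Variable, Constant and compile are illustrative names)
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# compiles the function
f = compile(D)
d = f(A=numpy.ones(10),
      B=numpy.ones(10) * 2)

# Imperative style: each line is computed immediately
import numpy
a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = c + 1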
24/50
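To make the symbolic style concrete, here is a tiny runnable version of the symbolic pseudocode, under hypothetical names (Expr, Variable, Constant, Op, compile_expr; compile_expr avoids shadowing Python's built-in compile). Writing C = B * A only records a graph node; arrays are bound and computed only when the compiled function is called.

```python
import numpy as np

class Expr:
    # Arithmetic on expressions builds graph nodes instead of computing values
    def __mul__(self, other):
        return Op(np.multiply, self, other)

    def __add__(self, other):
        return Op(np.add, self, other)

class Variable(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]          # look up the array bound to this name

class Constant(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

class Op(Expr):
    def __init__(self, fn, lhs, rhs):
        self.fn, self.lhs, self.rhs = fn, lhs, rhs

    def evaluate(self, env):
        return self.fn(self.lhs.evaluate(env), self.rhs.evaluate(env))

def compile_expr(output):
    # A symbolic framework can analyze and optimize the whole graph at this
    # point; this sketch just returns a function that evaluates it.
    def f(**inputs):
        return output.evaluate(inputs)
    return f

A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
f = compile_expr(D)
d = f(A=np.ones(10), B=np.ones(10) * 2)
```

The split between building D and calling f is what lets a symbolic framework see the whole computation before running it.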
-
1
GPU
TensorFlow
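# Symbolic style (pseudocode)
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
[Figure: two computational graphs for D. In one, A and B feed a single node B * A + 1 that produces D; in the other, A and B feed B * A, which produces C, and C + 1 produces D.]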
25/50
-
import numpy
a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = 0
for i in range(len(c)):
    d += c[i] + i
26/50
-
Q. A. Recursive Neural Network
Recurrent Net / Recursive Net
Chainer Example
[Figure: a tree in which x1 and x2 combine into p1, and p1 and x3 combine into p2]
p1 = f(x1, x2)
p2 = f(p1, x3)
27/50
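A minimal NumPy sketch of this recursive net, assuming f is a tanh layer with one weight matrix W shared by every tree node (an assumption of this example; f and encode are hypothetical names). The recursion follows the shape of each input tree, so the computation graph can differ from one sample to the next.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 3                                          # vector size (illustrative)
W = rng.randn(dim, 2 * dim) * 0.5                # composition weights shared by all nodes

def f(left, right):
    # Combine two child vectors into one parent vector
    return np.tanh(W.dot(np.concatenate([left, right])))

def encode(tree):
    # A tree is either a leaf vector or a (left, right) pair
    if isinstance(tree, tuple):
        return f(encode(tree[0]), encode(tree[1]))
    return tree

x1, x2, x3 = rng.randn(dim), rng.randn(dim), rng.randn(dim)
p2 = encode(((x1, x2), x3))                      # p1 = f(x1, x2), p2 = f(p1, x3)
```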
-
DSL
or
MXNet
28/50
-
Chainer
Define by Run
backward
Define and Run
TensorFlow, Theano, Torch nn
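import numpy
a = numpy.ones(10)
b = numpy.ones(10) * 2
c = b * a
d = c + 1
[Figure: the graph recorded while this code runs: a and b feed b * a, which produces c; c + 1 produces d]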
Define by Run
29/50
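A minimal Define-by-Run sketch, assuming a simplified design in which every operation appends itself to a global tape as it runs (Chainer itself instead links each variable to the function that created it; Var, mul, add, tape and backward are hypothetical names). Because the graph is recorded during execution, an ordinary Python loop with an if decides its shape, and backward simply replays the tape in reverse.

```python
import numpy as np

tape = []   # (output, [(input, local_gradient), ...]) in the order the ops actually ran

class Var:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)

def mul(a, b):
    out = Var(a.data * b.data)
    tape.append((out, [(a, b.data), (b, a.data)]))    # d(a*b)/da = b, d(a*b)/db = a
    return out

def add(a, b):
    out = Var(a.data + b.data)
    tape.append((out, [(a, 1.0), (b, 1.0)]))
    return out

def backward(y):
    y.grad = np.ones_like(y.data)
    for out, inputs in reversed(tape):                # replay the recorded run backwards
        for inp, local in inputs:
            inp.grad = inp.grad + local * out.grad

a = Var(np.ones(3))
b = Var(np.ones(3) * 2)
c = mul(b, a)
d = c
for i in range(3):                  # plain Python control flow decides the graph
    if d.data.sum() < 20:
        d = add(d, c)
backward(d)
```

Here all three additions run, so d = 4 * b * a, and the recorded graph gives dd/da = 4b = 8 and dd/db = 4a = 4 elementwise.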
-
Define by Run
NN
NN
OK
NN
30/50
-
NN
GPU: Tesla K80 @ 24GB
150
GPU
31/50
-
NN
1001 Nvidia
[Figure: two sequences of different lengths, A with 4 time steps (t=1 … t=4) and B with 2 (t=1, t=2)]
32/50
-
Define and Run / Define by Run
GPU
33/50
-
CuPy, Chainer
-
CuPy
NumPy-compatible array library on CUDA
Chainer v1.5.0 174
cuBLAS
reshape
elementwise, reduction
Python GPU
Python NumPy
PC GPU CUDA
CUDA NumPy CuPy
35/50
-
CuPy
Torch, Eigen::Tensor
BLAS (OpenBLAS, MKL)
CUDA (cuBLAS, cuDNN)
DL
36/50
-
CuPy
numpy cupy
CPU/GPU NumPy CuPy logsumexp
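from chainer import cuda

def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # numpy for CPU arrays, cupy for GPU arrays
    x_max = x.max(axis=axis, keepdims=True)  # keepdims so x - x_max broadcasts for any axis
    exp_sum = xp.exp(x - x_max).sum(axis=axis, keepdims=True)
    return xp.squeeze(x_max + xp.log(exp_sum), axis=axis)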
37/50
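Subtracting x_max is what keeps the function numerically safe: exp(1000) overflows float64, while every exp(x - x_max) is at most 1. A NumPy-only sketch that runs without a GPU; its get_array_module is a hypothetical stand-in that always returns numpy.

```python
import numpy as np

def get_array_module(x):
    # Stand-in for chainer.cuda.get_array_module: this sketch handles only
    # NumPy arrays, so it always returns numpy.
    return np

def logsumexp(x, axis=None):
    xp = get_array_module(x)
    x_max = x.max(axis=axis, keepdims=True)        # the largest term becomes exp(0) = 1
    exp_sum = xp.exp(x - x_max).sum(axis=axis, keepdims=True)
    return xp.squeeze(x_max + xp.log(exp_sum), axis=axis)

# Row 0 would overflow with the naive np.log(np.exp(x).sum(axis=1))
x = np.array([[1000.0, 1000.0], [0.0, np.log(3.0)]])
result = logsumexp(x, axis=1)
```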
-
CuPy
import datetime
import numpy
import cupy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)  # warm-up call
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)  # warm-up call (the first call also compiles the CUDA kernels)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)
Time [ms] (speedup relative to NumPy):
NumPy: 2929 (1.0)
CuPy: 585 (5.0)
CuPy + Memory Pool: 123 (23.8)
Environment: Intel Core i7-4790 @3.60GHz, 32GB, GeForce GTX 970
38/50
-
CuPy (1/2)
Cython
Python C 200
NumPy
NumPy
NumPy
83
NumPy
39/50
-
CuPy (2/2)
CUDA
nvcc
A+B=C
0~25 27
8311^3=1331
Chainer11595
40/50
-
CuPy
3
cupy.core cupy.cuda Cython
CUDA (cuBLAS, cuRAND, cuDNN)
ndarray
ufunc, elementwise, reduction
CUDA Python wrapper cupy.cuda
cupy.core
cupy
41/50
-
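Python ndarray, C++ CArray
template <typename T, int ndim>
class CArray {
private:
  T* data_;            // pointer to the data in GPU memory
  int size_;           // total number of elements
  int shape_[ndim];    // length of each dimension
  int strides_[ndim];  // step between elements along each dimension, in bytes
};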
42/50
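Keeping shape and strides, rather than assuming a contiguous layout, is what lets views such as a transpose share one buffer: only the strides change. A Python sketch of the address arithmetic, checked against NumPy's own strides (element_offset is a hypothetical helper):

```python
import numpy as np

def element_offset(index, strides):
    # Byte offset of an element from the start of the buffer:
    # sum over dimensions of index[d] * strides[d]
    return sum(i * s for i, s in zip(index, strides))

a = np.arange(12, dtype=np.float32).reshape(3, 4)   # C order: strides (16, 4) bytes
t = a.T                                             # a view on the same buffer: strides (4, 16)
off = element_offset((2, 1), t.strides)             # 2 * 4 + 1 * 16 = 24 bytes
flat = np.frombuffer(a.tobytes(), dtype=np.float32) # the underlying buffer as a flat array
value = flat[off // a.itemsize]                     # same element as t[2, 1]
```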
-
GPU
Python C++
ElementwiseKernel
ReductionKernel
Reduce
ufunc
NumPy Elementwise
43/50
-
ElementwiseKernel
2
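import cupy
squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',    # input parameters
    'float32 z',               # output parameter
    'z = (x - y) * (x - y)',   # operation applied to each element
    'squared_diff')            # kernel name
squared_diff(cupy.arange(10, dtype=cupy.float32), 10)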
44/50
-
1
squared_diff = cupy.ElementwiseKernel(
    'T x, T y',                # T: type placeholder, fixed by the argument dtypes at call time
    'T z',
    'z = (x - y) * (x - y)',
    'squared_diff')
45/50
-
Elementwise
Python
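${preamble}
extern "C" __global__ void ${name}(${params})
{
  ${loop_prep};
  CUPY_FOR(i, _ind.size()) {  // grid-stride loop over all elements
    _ind.set(i);              // turn the flat index i into an N-d index
    ${operation};             // the user's elementwise operation
  }
  ${after_loop};
}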
46/50
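The template above is filled in by plain text substitution before the source is handed to the CUDA compiler. A sketch with Python's string.Template; the params string and the indexed operation are hypothetical simplifications, not the exact strings CuPy generates.

```python
from string import Template

# Simplified copy of the kernel template above
KERNEL_TEMPLATE = Template('''${preamble}
extern "C" __global__ void ${name}(${params})
{
  ${loop_prep};
  CUPY_FOR(i, _ind.size()) {
    _ind.set(i);
    ${operation};
  }
  ${after_loop};
}''')

source = KERNEL_TEMPLATE.substitute(
    preamble='',
    name='squared_diff',
    params='const float* x, const float* y, float* z, CIndexer<1> _ind',  # hypothetical
    loop_prep='',
    operation='z[i] = (x[i] - y[i]) * (x[i] - y[i])',                     # hypothetical
    after_loop='',
)
print(source)
```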
-
cupy.add(x, y)
NumPy
args, kwargs
device
add
CUDA
CUDA
CUDA
47/50
-
CuPy
GPU
Malloc, Free
NumPy
Chainer
48/50
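A minimal free-list memory pool sketch, assuming bytearray as a stand-in for a device allocator and 512-byte size rounding (both choices of this example, not necessarily CuPy's). Freed blocks are kept and handed out again, so a loop that repeatedly allocates and frees the same size reaches the underlying allocator only once.

```python
from collections import defaultdict

class MemoryPool:
    """Keeps freed blocks in per-size free lists instead of releasing them."""
    def __init__(self, allocate):
        self._allocate = allocate               # underlying allocator (stand-in for device malloc)
        self._free = defaultdict(list)          # rounded size -> list of free blocks
        self.n_device_allocs = 0

    @staticmethod
    def _round(size, unit=512):
        return (size + unit - 1) // unit * unit # round up so similar sizes share a free list

    def malloc(self, size):
        size = self._round(size)
        if self._free[size]:
            return self._free[size].pop()       # reuse a cached block
        self.n_device_allocs += 1
        return (size, self._allocate(size))

    def free(self, block):
        size, _ = block
        self._free[size].append(block)          # keep it for later reuse

pool = MemoryPool(bytearray)
for _ in range(1000):
    block = pool.malloc(4000)                   # rounded up to 4096 bytes
    pool.free(block)
```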
-
CuPy
Chainer
NumPy
49/50
-
MXNet, Programming Models for Deep Learning: http://mxnet.readthedocs.org/en/latest/program_model.html
GPU: http://www.slideshare.net/NVIDIAJapan/gpu-51812528
CuPy: http://www.slideshare.net/ryokuta/cupy
NumPy: http://www.slideshare.net/ryokuta/numpy-57587130
We are Hiring! https://www.preferred-networks.jp/job_ja
50/50