Deep Learning with Theano (with a case study)
Liangliang “Lyon” Cao
Yahoo! Labs
Outline
• An abstract view of deep networks
• Implementing deep networks with Theano
• A case study: synonym extraction with neural network
What is deep learning?
An abstract view of deep network
• Estimate the output:
  o1 = L1(x),  o2 = L2( L1(x) ),  ...,  o5 = L5( L4( L3( L2( L1(x) ) ) ) )
• Compute the loss function: C = Loss(o5, y)
• Compute the gradient
[Figure: a five-layer network, input x feeding layers L1 through L5]
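As a toy illustration (my own sketch, not from the slides, assuming five tanh layers of width 4), the whole forward pass is just function composition:

import numpy as np

def make_layer(W, b):
    # each layer Lk is just a function of the previous layer's output
    return lambda h: np.tanh(W @ h + b)

rng = np.random.default_rng(0)
layers = [make_layer(rng.standard_normal((4, 4)), rng.standard_normal(4))
          for _ in range(5)]

o = rng.standard_normal(4)   # the input x
for layer in layers:         # forward propagation: o1, o2, ..., o5
    o = layer(o)
print(o)                     # o5, which feeds the loss C = Loss(o5, y)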
An abstract view of deep network (2)
• Estimate the output (forward propagation): o5 = L5( L4( L3( L2( L1(x) ) ) ) )
• Compute the gradient (backward propagation)
An abstract view of deep network (3)
• Suppose a layer is in the form of o_k = f(W_k o_{k-1} + b_k), an activation f applied to an affine transform (as in the HiddenLayer code later)
• We can compute the gradients with respect to the parameters: dC/dW_k and dC/db_k
• Update the parameters by gradient descent: W_k <- W_k - eta * dC/dW_k, b_k <- b_k - eta * dC/db_k
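A worked numeric example (my own, for a single linear layer with squared loss, not from the slides) of one such update:

import numpy as np

# one gradient-descent step for o = W x + b with loss C = ||o - y||^2
W = np.array([[0.5, -0.2], [0.1, 0.3]])
b = np.zeros(2)
x = np.array([1.0, 2.0])
y = np.array([1.0, 0.0])

o = W @ x + b                 # forward pass
grad_o = 2 * (o - y)          # dC/do
grad_W = np.outer(grad_o, x)  # chain rule: dC/dW_ij = (dC/do_i) * x_j
grad_b = grad_o               # dC/db = dC/do

eta = 0.1                     # learning rate
W -= eta * grad_W             # the gradient-descent update
b -= eta * grad_b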
An abstract view of deep network (Summary)
• There are many ways to define layers and cost functions
• Layer definitions may differ from field to field:
  – Computer vision
  – NLP
  – Speech
  – …
• But there are only three key steps in a deep network
An abstract view of deep network (Summary)
1. Forward propagation: o5 = L5( L4( L3( L2( L1(x) ) ) ) )
2. Backward propagation
3. Updating
Gradient descent is hard for large-scale learning
Very often, a machine learning model with parameters w aims to minimize

  C(w) = (1/N) * sum_{i=1..N} Loss(f(x_i; w), y_i)

over N training pairs (x_i, y_i). When N is big, we can see that:
• The full gradient (1/N) * sum_i dLoss_i/dw becomes very expensive to compute.
• Even worse, we may not be able to load all (x_i, y_i) into memory!
Stochastic Gradient Descent (SGD)
Idea: estimate the gradient on a randomly picked sample.
• Gradient descent: w <- w - eta * (1/N) * sum_i dLoss_i/dw
• Stochastic gradient descent: w <- w - eta_t * dLoss_i/dw for one randomly picked sample i
Theoretical requirement for convergence: sum_t eta_t = infinity while sum_t eta_t^2 < infinity.
In deep learning practice, we just choose a small rate and then decrease it over time.
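A plain-Python sketch of this recipe (grad_loss is a hypothetical per-sample gradient function, not from the slides):

import numpy as np

def sgd(w, data, grad_loss, eta0=0.01, decay=1e-4, n_epochs=10):
    t = 0
    for epoch in range(n_epochs):
        np.random.shuffle(data)              # visit samples in random order
        for xi, yi in data:
            eta = eta0 / (1.0 + decay * t)   # small rate, slowly decreased
            w = w - eta * grad_loss(w, xi, yi)   # update on ONE sample
            t += 1
    return w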
SGD as a typical deep learning solver
For every layer, compute the gradient and update.
SGD and GPUs
For every layer, compute the gradient and update.
• Within every batch, SGD is mainly matrix multiplication: perfect task for GPU!
• See a demo on how much a GPU can help
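For reference (not on the slide), Theano of this era moved the same script to the GPU purely through environment flags; a quick way to check what you are running on:

import theano

# select the device when launching, e.g.
#   THEANO_FLAGS=device=gpu,floatX=float32 python train_model.py
# (floatX=float32 matters: GPUs of that era were much faster in single precision)
print(theano.config.device)   # 'cpu' or 'gpu'
print(theano.config.floatX)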
How to implement deep models?
Theano
• Developed at Univ. of Montreal (Yoshua Bengio's group)
• Users write code in a language similar to NumPy; Theano compiles it into C/CUDA code
• Used to be relatively slow, but has now caught up
• Easy to use:
  – Easiest toolkit for computing gradients
  – No need to touch the details of GPU programming
• Limited to single machines
Theano for Symbolic Math
Example: Sigmoid function
import theano.tensor as T

x = T.scalar()
y = T.scalar()
z = 3*x + 4*y + 5
s = 1 / (1 + T.exp(-z))   # the sigmoid of z
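A usage sketch: the graph above is purely symbolic and only runs once compiled with theano.function.

import theano

f = theano.function([x, y], s)   # compile the symbolic graph into a callable
print(f(1.0, 2.0))               # sigmoid(3*1 + 4*2 + 5) = sigmoid(16) ~ 1.0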
Theano for Computing Gradients
Example:
import theano.tensor as T

x = T.scalar()
gx = T.grad(x**2, x)        # symbolic derivative: 2*x
gx2 = T.grad(T.log(x), x)   # 1/x
gx3 = T.grad(1/x, x)        # -1/x**2
T.grad() is the most amazing function in Theano.
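A quick sanity check (my own usage sketch) that the symbolic derivatives match 2x, 1/x and -1/x^2:

import theano

f = theano.function([x], [gx, gx2, gx3])
print(f(2.0))   # expect roughly [4.0, 0.5, -0.25]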
Theano for SGD
import numpy as np
import theano
import theano.tensor as T

W = theano.shared(value=np.zeros((n_in, n_out)), name='W')
b = theano.shared(value=np.zeros((n_out,)), name='b')
cost = hinge_loss(T.dot(x, W) + b, y)
g_W = T.grad(cost, W)
g_b = T.grad(cost, b)
updates_FP = [(W, W - learning_rate * g_W),
              (b, b - learning_rate * g_b)]
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates_FP)

for epoch in range(n_epochs):
    for xi, yi in minibatches:   # one SGD step per mini-batch
        train_model(xi, yi)
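hinge_loss is left undefined on the slide; a minimal sketch, assuming labels y in {-1, +1} and a one-column score:

import theano.tensor as T

def hinge_loss(score, y):
    # max(0, 1 - y * score), averaged over the batch; one plausible definition
    return T.mean(T.maximum(0.0, 1.0 - y * score.flatten()))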
We can extend this simple model to multi-layer nets
Layer Definitions
class AbstractLayer(object):
    def __init__(self):
        self.input_layer = []
        self.params = []

    def set_params_values(self, param_values):
        for (p, v) in zip(self.params, param_values):
            p.set_value(v)

    def get_params_values(self):
        param_values = []
        for p in self.params:
            param_values.append(p.get_value())
        return param_values

    def output(self, *args, **kwargs):   # child classes must override this!
        return []

    def get_output_shape(self):          # child classes must override this!
        return []
From https://github.com/llcao/babyl
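The InputLayer used in the case study below isn't shown in the slides; a minimal version consistent with this interface might look like:

class InputLayer(AbstractLayer):
    def __init__(self, dim, input_var):
        super(InputLayer, self).__init__()
        self.dim = dim              # feature dimensionality
        self.input_var = input_var  # the symbolic input, e.g. T.matrix()

    def output(self, *args, **kwargs):
        return self.input_var       # no parameters: just pass the input on

    def get_output_shape(self):
        return [self.dim]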
Layer Definitions
class HiddenLayer(AbstractLayer):
    def output(self, *args, **kwargs):
        input = self.input_layer.output(*args, **kwargs)
        lin_output = T.dot(input, self.W) + self.b
        return self.activation(lin_output)

class Conv2DLayer(AbstractLayer):
    def output(self, *args, **kwargs):
        conv_out = conv.conv2d(input=self.input_layer.output(), filters=self.W)
        return self.activation(conv_out + self.b.dimshuffle('x', 0, 'x', 'x'))
From https://github.com/llcao/babyl
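The constructors aren't shown either; a sketch of what HiddenLayer.__init__ has to set up so that output() above works (the repo's actual code may differ):

import numpy as np
import theano

class HiddenLayer(AbstractLayer):
    def __init__(self, input_layer, n_out, activation=None):
        super(HiddenLayer, self).__init__()
        self.input_layer = input_layer
        self.n_out = n_out
        n_in = input_layer.get_output_shape()[-1]
        W0 = 0.01 * np.random.randn(n_in, n_out)
        self.W = theano.shared(W0.astype(theano.config.floatX), name='W')
        self.b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX),
                               name='b')
        # activation=None means a purely linear layer
        self.activation = activation if activation is not None else (lambda v: v)
        self.params = [self.W, self.b]   # so T.grad and the updates can find them

    def get_output_shape(self):
        return [self.n_out]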
Other Deep Learning Packages
• For speech:
  – Kaldi
• For computer vision:
  – Caffe
  – Torch7
  – cuda-convnet (1, 2)
• Others:
  – John Canny's BIDMach
  – deeplearning4j
  – word2vec
  – RNNLM
A Case Study: Synonym Extraction
Problem of Synonym Extraction
• Synonym: a word that has the same or nearly the same meaning as another word
Previous Works
• Previous studies on synonym extraction mostly used small datasets:
  – [Henriksson 2014]: 340 medical synonym pairs
  – [Wang & Hirst 2009]: 80 TOEFL synonym questions
  – [Collobert & Weston 2008]: thousands of synonym pairs
• Our IJCAI'15 work:
  – word2vec + feature expansion + linear SVM
  – F1 = 0.71 on a medical synonym dataset with 2.4M pairs
Network-1
Theano Implementation for Network-1
input = T.matrix('input')
target = T.ivector('target')

layers = []
layers += [InputLayer(dim, input)]
layers += [HiddenLayer(layers[-1], 100, activation=T.tanh)]
layers += [HiddenLayer(layers[-1], 1, activation=None)]
output = layers[-1].output().flatten()

cost = T.mean(T.switch((output - target) * target > 0.0, 0.0, output - target) ** 2)

all_para = get_all_parameters()
updates = gen_updates_sgd(cost, all_para, learning_rate)
train_model = theano.function([input, target], cost, updates=updates)
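get_all_parameters and gen_updates_sgd are helpers from the same repo; plausible minimal versions (assumptions, not the repo's exact code):

import theano.tensor as T

def get_all_parameters():
    params = []
    for layer in layers:   # collect every layer's shared variables
        params += layer.params
    return params

def gen_updates_sgd(cost, params, learning_rate):
    grads = T.grad(cost, params)   # one gradient per parameter
    return [(p, p - learning_rate * g) for p, g in zip(params, grads)]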
General Feature Expansion
• Hand-assigned feature expansion
• Machine-learned feature expansion
Network-2
Implementation of Network-2
input = T.matrix('input')
target = T.ivector('target')

layers = []
layers += [InputTensor3Layer(inputshape=[nbatch, nfeature, 3])]
layers += [TensorHiddenLayer(layers[-1], outdim=10, activation=T.tanh)]
layers += [FlattenLayer(layers[-1], flattendim=2)]
layers += [HiddenLayer(layers[-1], outdim=100, activation=T.tanh)]
layers += [HiddenLayer(layers[-1], outdim=1, activation=None)]
output = layers[-1].output().flatten()

cost = T.mean(T.switch((output - target) * target > 0.0, 0.0, output - target) ** 2)

all_para = get_all_parameters()
updates = gen_updates_sgd(cost, all_para, learning_rate)
train_model = theano.function([input, target], cost, updates=updates)
Performance of the Deep Model for Synonym Extraction
Experiments on Medical Synonym Dataset
Experiments on WordNet Synonym Dataset
Summarization
• We quickly went through how to implement deep learning in Theano:
  – Gradients
  – Stochastic gradient descent
  – Layers
  – … and a case study
• Hope this experience can help you learn Theano or other deep learning toolkits.
• Let’s learn deep learning together!
Deep learning reading group:
https://yahoo.jiveon.com/groups/deep-learning-reading-group-nyc-labs
Thank you!
Questions and comments?
Backup Slides