Deep Learning with Theano (with a case study)
Liangliang “Lyon” Cao
Yahoo! Labs
Outline
• An abstract view of deep networks
• Implementing deep networks with Theano
• A case study: synonym extraction with neural network
What is deep learning?
An abstract view of deep network
• Estimate the output:
  o1 = L1(x),  o2 = L2( L1(x) ),  ...,  o5 = L5( L4( L3( L2( L1(x) ) ) ) )
• Compute the loss function: C = Loss(o5, y)
• Compute the gradient
[Figure: a five-layer network, input x feeding layers L1 through L5]
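As a toy illustration (my own sketch, not from the slides, assuming five tanh layers of width 4), the whole forward pass is just function composition:

import numpy as np

def make_layer(W, b):
    # each layer Lk is just a function of the previous layer's output
    return lambda h: np.tanh(W @ h + b)

rng = np.random.default_rng(0)
layers = [make_layer(rng.standard_normal((4, 4)), rng.standard_normal(4))
          for _ in range(5)]

o = rng.standard_normal(4)   # the input x
for layer in layers:         # forward propagation: o1, o2, ..., o5
    o = layer(o)
print(o)                     # o5, which feeds the loss C = Loss(o5, y)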
An abstract view of deep network (2)
• Estimate the output (forward propagation): o5 = L5( L4( L3( L2( L1(x) ) ) ) )
• Compute the gradient (backward propagation)
An abstract view of deep network (3)
• Suppose a layer is in the form of o_k = f(W_k o_{k-1} + b_k), an activation f applied to an affine transform (as in the HiddenLayer code later)
• We can compute the gradients with respect to the parameters: dC/dW_k and dC/db_k
• Update the parameters by gradient descent: W_k <- W_k - eta * dC/dW_k, b_k <- b_k - eta * dC/db_k
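A worked numeric example (my own, for a single linear layer with squared loss, not from the slides) of one such update:

import numpy as np

# one gradient-descent step for o = W x + b with loss C = ||o - y||^2
W = np.array([[0.5, -0.2], [0.1, 0.3]])
b = np.zeros(2)
x = np.array([1.0, 2.0])
y = np.array([1.0, 0.0])

o = W @ x + b                 # forward pass
grad_o = 2 * (o - y)          # dC/do
grad_W = np.outer(grad_o, x)  # chain rule: dC/dW_ij = (dC/do_i) * x_j
grad_b = grad_o               # dC/db = dC/do

eta = 0.1                     # learning rate
W -= eta * grad_W             # the gradient-descent update
b -= eta * grad_b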
An abstract view of deep network (Summary)
• There are many ways to define layers and cost functions
• Layer definitions may differ from field to field:
  – Computer vision
  – NLP
  – Speech
  – …
• But there are only three key steps in a deep network
An abstract view of deep network (Summary)
1. Forward propagation: o5 = L5( L4( L3( L2( L1(x) ) ) ) )
2. Backward propagation
3. Updating
Gradient descent is hard for large-scale learning
Very often, a machine learning model with parameters w aims to minimize

  C(w) = (1/N) * sum_{i=1..N} Loss(f(x_i; w), y_i)

over N training pairs (x_i, y_i). When N is big, we can see that:
• The full gradient (1/N) * sum_i dLoss_i/dw becomes very expensive to compute.
• Even worse, we may not be able to load all (x_i, y_i) into memory!
Stochastic Gradient Descent (SGD)
Idea: estimate the gradient on a randomly picked sample.
• Gradient descent: w <- w - eta * (1/N) * sum_i dLoss_i/dw
• Stochastic gradient descent: w <- w - eta_t * dLoss_i/dw for one randomly picked sample i
Theoretical requirement for convergence: sum_t eta_t = infinity while sum_t eta_t^2 < infinity.
In deep learning practice, we just choose a small rate and then decrease it over time.
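A plain-Python sketch of this recipe (grad_loss is a hypothetical per-sample gradient function, not from the slides):

import numpy as np

def sgd(w, data, grad_loss, eta0=0.01, decay=1e-4, n_epochs=10):
    t = 0
    for epoch in range(n_epochs):
        np.random.shuffle(data)              # visit samples in random order
        for xi, yi in data:
            eta = eta0 / (1.0 + decay * t)   # small rate, slowly decreased
            w = w - eta * grad_loss(w, xi, yi)   # update on ONE sample
            t += 1
    return w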
SGD as a typical deep learning solver
For every layer, compute the gradient and update.
SGD and GPUs
For every layer, compute the gradient and update.
• Within every batch, SGD is mainly matrix multiplication: perfect task for GPU!
• See a demo on how much a GPU can help
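For reference (not on the slide), Theano of this era moved the same script to the GPU purely through environment flags; a quick way to check what you are running on:

import theano

# select the device when launching, e.g.
#   THEANO_FLAGS=device=gpu,floatX=float32 python train_model.py
# (floatX=float32 matters: GPUs of that era were much faster in single precision)
print(theano.config.device)   # 'cpu' or 'gpu'
print(theano.config.floatX)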
How to implement deep models?
Theano
• Developed at Univ. of Montreal (Yoshua Bengio's group)
• Users write code in a language similar to NumPy; Theano compiles it into C/CUDA code
• Used to be relatively slow, but has now caught up
• Easy to use:
  – Easiest toolkit for computing gradients
  – No need to touch the details of GPU programming
• Limited to single machines
Theano for Symbolic Math
Example: Sigmoid function
import theano.tensor as T

x = T.scalar()
y = T.scalar()
z = 3*x + 4*y + 5
s = 1 / (1 + T.exp(-z))   # the sigmoid of z
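A usage sketch: the graph above is purely symbolic and only runs once compiled with theano.function.

import theano

f = theano.function([x, y], s)   # compile the symbolic graph into a callable
print(f(1.0, 2.0))               # sigmoid(3*1 + 4*2 + 5) = sigmoid(16) ~ 1.0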
Theano for Computing Gradients
Example:
import theano.tensor as T

x = T.scalar()
gx = T.grad(x**2, x)        # symbolic derivative: 2*x
gx2 = T.grad(T.log(x), x)   # 1/x
gx3 = T.grad(1/x, x)        # -1/x**2
T.grad() is the most amazing function in Theano.
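A quick sanity check (my own usage sketch) that the symbolic derivatives match 2x, 1/x and -1/x^2:

import theano

f = theano.function([x], [gx, gx2, gx3])
print(f(2.0))   # expect roughly [4.0, 0.5, -0.25]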
Theano for SGD
import numpy as np
import theano
import theano.tensor as T

W = theano.shared(value=np.zeros((n_in, n_out)), name='W')
b = theano.shared(value=np.zeros((n_out,)), name='b')
cost = hinge_loss(T.dot(x, W) + b, y)
g_W = T.grad(cost, W)
g_b = T.grad(cost, b)
updates_FP = [(W, W - learning_rate * g_W),
              (b, b - learning_rate * g_b)]
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates_FP)

for epoch in range(n_epochs):
    for xi, yi in minibatches:   # one SGD step per mini-batch
        train_model(xi, yi)
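hinge_loss is left undefined on the slide; a minimal sketch, assuming labels y in {-1, +1} and a one-column score:

import theano.tensor as T

def hinge_loss(score, y):
    # max(0, 1 - y * score), averaged over the batch; one plausible definition
    return T.mean(T.maximum(0.0, 1.0 - y * score.flatten()))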
We can extend this simple model to multi-layer nets
Layer Definitions
class AbstractLayer(object):
    def __init__(self):
        self.input_layer = []
        self.params = []

    def set_params_values(self, param_values):
        for (p, v) in zip(self.params, param_values):
            p.set_value(v)

    def get_params_values(self):
        param_values = []
        for p in self.params:
            param_values.append(p.get_value())
        return param_values

    def output(self, *args, **kwargs):   # child classes must override this!
        return []

    def get_output_shape(self):          # child classes must override this!
        return []
From https://github.com/llcao/babyl
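The InputLayer used in the case study below isn't shown in the slides; a minimal version consistent with this interface might look like:

class InputLayer(AbstractLayer):
    def __init__(self, dim, input_var):
        super(InputLayer, self).__init__()
        self.dim = dim              # feature dimensionality
        self.input_var = input_var  # the symbolic input, e.g. T.matrix()

    def output(self, *args, **kwargs):
        return self.input_var       # no parameters: just pass the input on

    def get_output_shape(self):
        return [self.dim]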
Layer Definitions
class HiddenLayer(AbstractLayer):
    def output(self, *args, **kwargs):
        input = self.input_layer.output(*args, **kwargs)
        lin_output = T.dot(input, self.W) + self.b
        return self.activation(lin_output)

class Conv2DLayer(AbstractLayer):
    def output(self, *args, **kwargs):
        conv_out = conv.conv2d(input=self.input_layer.output(), filters=self.W)
        return self.activation(conv_out + self.b.dimshuffle('x', 0, 'x', 'x'))
From https://github.com/llcao/babyl
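The constructors aren't shown either; a sketch of what HiddenLayer.__init__ has to set up so that output() above works (the repo's actual code may differ):

import numpy as np
import theano

class HiddenLayer(AbstractLayer):
    def __init__(self, input_layer, n_out, activation=None):
        super(HiddenLayer, self).__init__()
        self.input_layer = input_layer
        self.n_out = n_out
        n_in = input_layer.get_output_shape()[-1]
        W0 = 0.01 * np.random.randn(n_in, n_out)
        self.W = theano.shared(W0.astype(theano.config.floatX), name='W')
        self.b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX),
                               name='b')
        # activation=None means a purely linear layer
        self.activation = activation if activation is not None else (lambda v: v)
        self.params = [self.W, self.b]   # so T.grad and the updates can find them

    def get_output_shape(self):
        return [self.n_out]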
Other Deep Learning Packages
• For speech:
  – Kaldi
• For computer vision:
  – Caffe
  – Torch7
  – cuda-convnet (1, 2)
• Others:
  – John Canny's BIDMach
  – deeplearning4j
  – word2vec
  – RNNLM
A Case Study: Synonym Extraction
Problem of Synonym Extraction
• Synonym: a word that has the same or nearly the same meaning as another word
Previous Works
• Previous studies on synonym extraction mostly used small datasets:
  – [Henriksson 2014]: 340 medical synonym pairs
  – [Wang & Hirst 2009]: 80 TOEFL synonym questions
  – [Collobert & Weston 2008]: thousands of synonym pairs
• Our IJCAI'15 work:
  – word2vec + feature expansion + linear SVM
  – F1 = 0.71 on a medical synonym dataset with 2.4M pairs
Network-1
Theano Implementation for Network-1
input = T.matrix('input')
target = T.ivector('target')

layers = []
layers += [InputLayer(dim, input)]
layers += [HiddenLayer(layers[-1], 100, activation=T.tanh)]
layers += [HiddenLayer(layers[-1], 1, activation=None)]
output = layers[-1].output().flatten()

cost = T.mean(T.switch((output - target) * target > 0.0, 0.0, output - target) ** 2)

all_para = get_all_parameters()
updates = gen_updates_sgd(cost, all_para, learning_rate)
train_model = theano.function([input, target], cost, updates=updates)
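get_all_parameters and gen_updates_sgd are helpers from the same repo; plausible minimal versions (assumptions, not the repo's exact code):

import theano.tensor as T

def get_all_parameters():
    params = []
    for layer in layers:   # collect every layer's shared variables
        params += layer.params
    return params

def gen_updates_sgd(cost, params, learning_rate):
    grads = T.grad(cost, params)   # one gradient per parameter
    return [(p, p - learning_rate * g) for p, g in zip(params, grads)]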
General Feature Expansion
• Hand-assigned feature expansion
• Machine-learned feature expansion
Network-2
Implementation of Network-2
input = T.matrix('input')
target = T.ivector('target')

layers = []
layers += [InputTensor3Layer(inputshape=[nbatch, nfeature, 3])]
layers += [TensorHiddenLayer(layers[-1], outdim=10, activation=T.tanh)]
layers += [FlattenLayer(layers[-1], flattendim=2)]
layers += [HiddenLayer(layers[-1], outdim=100, activation=T.tanh)]
layers += [HiddenLayer(layers[-1], outdim=1, activation=None)]
output = layers[-1].output().flatten()

cost = T.mean(T.switch((output - target) * target > 0.0, 0.0, output - target) ** 2)

all_para = get_all_parameters()
updates = gen_updates_sgd(cost, all_para, learning_rate)
train_model = theano.function([input, target], cost, updates=updates)
Performance of the Deep Model for Synonym Extraction
Experiments on Medical Synonym Dataset
Experiments on WordNet Synonym Dataset
Summarization
• We quickly went through how to implement deep learning in Theano:
  – Gradients
  – Stochastic gradient descent
  – Layers
  – … and a case study
• Hope this experience can help you learn Theano or other deep learning toolkits.
• Let’s learn deep learning together!
Deep learning reading group:
https://yahoo.jiveon.com/groups/deep-learning-reading-group-nyc-labs
Thank you!
Questions and comments?
Backup Slides