Deep Learning tutorial
Nicolas Le Roux
Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng
DL/UFL workshop
16/12/11
Disclaimer
A lot of the slides in this tutorial have been taken from presentations by:
Honglak Lee
Yann LeCun
Geoff Hinton
Andrew Ng
Yoshua Bengio
Outline
1 Introduction
What is deep?
Training problems
2 Unsupervised pretraining
Pretraining
The denoising autoencoder
Predictive sparse coding
The deep belief network
3 Conclusion
What does deep mean?
Properties of deep models
Many non-linearities ≠ one non-linearity
More constrained space of transformations
Deep model = prior on the type of transformations
May need fewer computational units for the same function
Features for face recognition
Image courtesy of Honglak Lee
“First-level” features are not task-dependent
“First-level” features are easy to learn
“Higher-level” features are easier to learn given the low-level features
A deep neural network
H1 = g(W1X)
H2 = g(W2H1)
Y = W3H2
g = sigmoid or tanh
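As a concrete illustration (not from the slides), here is a minimal numpy sketch of this forward pass; the layer sizes are arbitrary choices for the example.

```python
import numpy as np

def g(a):
    """Sigmoid non-linearity; tanh would work just as well here."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical layer sizes, chosen only for the example.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(256, 784))
W2 = rng.normal(scale=0.01, size=(64, 256))
W3 = rng.normal(scale=0.01, size=(10, 64))

X = rng.normal(size=784)   # one input vector
H1 = g(W1 @ X)             # H1 = g(W1 X)
H2 = g(W2 @ H1)            # H2 = g(W2 H1)
Y = W3 @ H2                # Y = W3 H2 (linear output layer)
```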
Training issues
W3 is a lot easier to learn than W2 and W1
Compare
Given set of features, find combination
Given combination, find set of features
Practical results
Parameters randomly initialized
They do not change much (bad local minima)
A random transformation is not very good
Bengio et al., 2007
Practical results - TIMIT
Mohamed et al., Deep Belief Networks for phone recognition
The gist of pretraining
1 We want (potentially) deep networks
2 Training the lower layers is hard.
Solution: optimize each layer without relying on the layers above.
The gist of pretraining
1 Learn W1 first
2 Given W1, learn W2
3 Given W1 and W2, learn W3
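A minimal sketch of this greedy scheme; `train_layer` is a hypothetical placeholder for whatever single-layer building block is used (a DAE, PSD, or RBM, as described next).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_greedily(X, layer_sizes, train_layer):
    """Learn W1 on X, then W2 on the features produced by W1, and so on.
    `train_layer(H, n_out)` is a hypothetical stand-in for any single-layer
    learner; it returns an (n_out x n_in) weight matrix."""
    weights, H = [], X
    for n_out in layer_sizes:
        W = train_layer(H, n_out)   # trained without touching the layers above
        weights.append(W)
        H = sigmoid(H @ W.T)        # these features are the next layer's data
    return weights
```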
Greedy learning
1 Train each layer in a supervised way
2 Use an alternate cost, hoping that it will be useful.
Alternate cost: modeling the data well.
Denoising autoencoders
Vincent et al, 2008
Corrupt the input (e.g. set 25% of inputs to 0)
Reconstruct the uncorrupted input
Use the encoding of the uncorrupted input as input to the next level
A representation is good if it works well with noisy data
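A minimal numpy sketch of one DAE update, under assumptions not in the slides (tied weights, squared-error loss, full-batch gradient, illustrative learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(X, W, b, c, lr=0.1, corruption=0.25):
    """One gradient step of a tied-weight denoising autoencoder on a batch X
    (one example per row)."""
    X_tilde = X * (rng.random(X.shape) > corruption)  # zero out ~25% of inputs
    H = sigmoid(X_tilde @ W + b)                      # encode the corrupted input
    X_hat = H @ W.T + c                               # linear decoder, tied weights
    err = X_hat - X                                   # compare to the CLEAN input
    dA = (err @ W) * H * (1.0 - H)                    # backprop through the encoder
    W -= lr * (X_tilde.T @ dA + err.T @ H) / len(X)   # both uses of the tied W
    b -= lr * dA.mean(axis=0)
    c -= lr * err.mean(axis=0)
    return W, b, c

# The features passed to the next level are sigmoid(X @ W + b), computed on
# the UNCORRUPTED input.
```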
DAE - A graphic explanation
Projects a “corrupted point” onto its position on the manifold
Two corrupted versions of the same point have the same representation
Stacking DAEs
1 Start with the lowest level and stack upwards
2 Train each layer on the intermediate code (features) from the layer below
3 Top layer can have a different output (e.g., a softmax non-linearity) to provide an output for classification
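Continuing the DAE sketch above, stacking is then a short loop (epoch counts and initialization scales are illustrative):

```python
def stack_daes(X, layer_sizes, epochs=10):
    """Greedy stacking of DAEs, reusing dae_step, sigmoid, and rng from the
    previous sketch; a softmax classifier can then be put on top of H."""
    params, H = [], X
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(H.shape[1], n_hid))
        b, c = np.zeros(n_hid), np.zeros(H.shape[1])
        for _ in range(epochs):
            W, b, c = dae_step(H, W, b, c)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # clean encoding becomes the next layer's input
    return params, H
```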
Predictive sparse coding
Objective function for sparse coding:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁
Predictive sparse coding adds a prediction penalty:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F(X, θ)‖₂²
F(X, θ) = G tanh(W_e X)
Z must:
Be sparse
Reconstruct X well
Be well predicted by a feedforward model
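One way to carry out the minimization over Z is ISTA (iterative soft-thresholding). This is my own choice for the sketch, not necessarily the optimizer used in the original work, and the step size and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psd_infer(x, D, G, We, lam=0.1, alpha=1.0, eta=0.01, n_steps=100):
    """Minimize (1/2)||x - Dz||^2 + lam*||z||_1 + alpha*||z - F(x)||^2 over z."""
    f = G @ np.tanh(We @ x)       # the feedforward prediction F(X, theta)
    z = f.copy()                  # warm-start at the prediction
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x) + 2.0 * alpha * (z - f)  # smooth terms
        z = soft_threshold(z - eta * grad, eta * lam)     # L1 term via prox
    return z
```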
Stacking PSDs
First layer:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F1(X, θ1)‖₂²
F1(X, θ1) = G tanh(W_e¹ X)
Second layer:
Z(H1) = argmin_Z ½‖H1 − DZ‖₂² + λ‖Z‖₁ + α‖Z − F2(H1, θ2)‖₂²
F2(H1, θ2) = G tanh(W_e² H1)
1 Train the first layer using PSD on X
2 Use H1 = |F1(X, θ1)| as features
3 Train the second layer using PSD on H1
4 Use H2 = |F2(H1, θ2)| as features
5 Keep going...
6 Use H37 as input to your final classifier
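The stacking recipe itself, as a sketch; `psd_fit` is a hypothetical placeholder that trains one PSD layer (dictionary and predictor) on its input and returns the trained predictor F:

```python
import numpy as np

def stack_psds(X, n_layers, psd_fit):
    """Greedy PSD stacking; `psd_fit(H)` is hypothetical and returns the
    trained feedforward predictor F for one layer."""
    H = X
    for _ in range(n_layers):
        F = psd_fit(H)        # learns D, G, W_e for this layer
        H = np.abs(F(H))      # H_k = |F_k(H_{k-1})|, as in the recipe above
    return H                  # top-level features for the final classifier
```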
The mothership: the DBN
Unsupervised pretraining works by reconstructing the data
What if we tried to have a full generative model of the data?
Restricted Boltzmann Machine (RBM)
E(x, h) = −∑_ij W_ij x_i h_j = −xᵀWh
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
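In code, the energy and the two conditionals look as follows (binary units, no biases, matching the slide; the factorized conditionals are the standard consequence of the bipartite structure):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(x, h, W):
    """E(x, h) = -x^T W h for binary vectors x and h."""
    return -x @ W @ h

def p_h_given_x(x, W):
    """P(h_j = 1 | x): the hidden units are conditionally independent."""
    return sigmoid(W.T @ x)

def p_x_given_h(h, W):
    """P(x_i = 1 | h): same factorization on the visible side."""
    return sigmoid(W @ h)
```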
Type of variables in an RBM
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
This formulation is correct when x and h are binary variables
You have no choice over the type of x (this is your data)
You can choose what type h is
There are other formulations when x is not binary
They typically do not work that well (with notable exceptions)
Training an RBM
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
We have a joint probability of data and “features”
Maximizing P(x) will give us meaningful features
There are efficient ways to do that (check my webpage)
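One standard efficient method is contrastive divergence (CD-1); the slide does not name its method, so this is only one plausible instantiation, with an illustrative learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, lr=0.05):
    """One CD-1 update on a batch of binary rows X; an approximation to
    the gradient of log P(x)."""
    ph = sigmoid(X @ W)                     # P(h = 1 | data)
    h = (rng.random(ph.shape) < ph) * 1.0   # sample the hidden units
    px = sigmoid(h @ W.T)                   # reconstruction of the visibles
    ph_rec = sigmoid(px @ W)                # hidden probabilities, reconstruction
    # positive statistics minus negative statistics, averaged over the batch
    W += lr * (X.T @ ph - px.T @ ph_rec) / len(X)
    return W
```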
Stacking RBMs
Learn the first layer by maximizing P(x) = ∑_{h1} P(x, h1)
Once trained, the features of x are E_P[h1 | x]
Learn the second layer by maximizing P(h1) = ∑_{h2} P(h1, h2)
Keep going...
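Stacking then mirrors the DAE case; a sketch reusing cd1_step, sigmoid, and rng from the previous block (epoch count and initialization are illustrative):

```python
def stack_rbms(X, layer_sizes, epochs=10):
    """Greedy DBN-style pretraining: once a layer is trained, the features
    E_P[h | x] = sigmoid(X @ W) become the data for the next layer."""
    weights, H = [], X
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(H.shape[1], n_hid))
        for _ in range(epochs):
            W = cd1_step(H, W)
        weights.append(W)
        H = sigmoid(H @ W)   # E_P[h1 | x], the features of x
    return weights
```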
Results - 1
Results - 2
Take-home messages - Deep learning
Building a hierarchy of features can be beneficial
Training all layers at once is very hard
Pretraining alleviates the problem of bad local minima
Regularizes the model
Take-home messages - Pretraining
Pretraining requires building blocks
ALL pretraining methods share the same idea: model the input while having somewhat invariant features
By no means is this necessarily the best method
And now for some applications...
This is the end...
Questions?