Deep Learning Tutorial, Part 2 (by Ruslan Salakhutdinov)
TRANSCRIPT
Deep Learning II
Ruslan Salakhutdinov
Department of Computer Science, University of Toronto / CMU
Canadian Institute for Advanced Research
Microsoft Machine Learning and Intelligence School
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors for NLP
Face Recognition
Yale B Extended Face Dataset: 4 subsets of increasing illumination variation.
Due to extreme illumination variations, deep models perform quite poorly on this dataset.
Consider More Structured Models: undirected + directed models.
Deep Lambertian Model
Deep undirected + directed model.
Combines the elegant properties of the Lambertian model with the Gaussian DBM model.
Observed Image
Inferred
(Tang et al., ICML 2012; Tang et al., CVPR 2012)
Lambertian Reflectance Model
• A simple model of the image formation process.
• Albedo: diffuse reflectivity of a surface; material dependent, illumination independent.
• Surface normal: perpendicular to the tangent plane at a point on the surface.
• Images with different illumination can be generated by varying light directions.
Image, albedo, surface normal, light source.
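The image formation the slide describes multiplies the per-pixel albedo by the (clamped) dot product of the surface normal and the light direction. A minimal rendering sketch, assuming unit-length normals and light vector (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def lambertian_image(albedo, normals, light):
    """Render a Lambertian image: I(p) = albedo(p) * max(0, n(p) . l).

    albedo:  (H, W) diffuse reflectivity per pixel
    normals: (H, W, 3) unit surface normals
    light:   (3,) unit light-direction vector
    """
    shading = np.clip(normals @ light, 0.0, None)  # max(0, n . l)
    return albedo * shading

# Toy example: a flat surface whose normals all point along +z,
# lit head-on versus lit from behind.
albedo = np.full((2, 2), 0.8)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
head_on = lambertian_image(albedo, normals, np.array([0.0, 0.0, 1.0]))
from_behind = lambertian_image(albedo, normals, np.array([0.0, 0.0, -1.0]))
```

Varying only `light` while holding albedo and normals fixed generates the different-illumination images the slide mentions.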
Deep Lambertian Model
Observed Image
Image albedo
Surface normals
Light source
Gaussian Deep Boltzmann Machine (Albedo DBM):
Pretrained using the Toronto Face Database
Transfer Learning
Inference: variational inference. Learning: stochastic approximation.
Inferred
Yale B Extended Face Dataset
• 38 subjects, ~45 images of varying illumination per subject, divided into 4 subsets of increasing illumination variation.
• 28 subjects for training and 10 for testing.
Face Relighting
One Test Image
Face Relighting: observed image and inferred albedo.
Recognition Results: recognition as a function of the number of training images for 10 test subjects.
One-Shot Recognition
What about dealing with occlusions or structured noise?
Robust Boltzmann Machines
• Build more structured models that can deal with occlusions or structured noise.
Observed Image
Inferred / Truth
Inferred Binary Mask
(Tang et al., ICML 2012; Tang et al., CVPR 2012)
Robust Boltzmann Machines: components
• Gaussian RBM, modeling clean faces.
• Binary RBM, modeling occlusions.
• Binary pixel-wise mask.
• Gaussian noise model: is a heavy-tailed distribution.
Observed Image
Inference: variational inference. Learning: stochastic approximation.
Inference on the test subjects
Internal states of the RoBM during learning, shown as a function of the number of iterations (1, 3, 5, 7, 10, 20, 30, 40, 50).
Inferred
Recognition Results on the AR Face Database
Learning Algorithm   Sunglasses   Scarf
Robust BM            84.5%        80.7%
RBM                  61.7%        32.9%
Eigenfaces           66.9%        38.6%
LDA                  56.1%        27.0%
Pixel                51.3%        17.5%
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Transfer Learning “zarc” “segway”
How can we learn a novel concept, a high-dimensional statistical object, from few examples?
(Lake et al., 2010)
Supervised Learning
Segway Motorcycle
Test:
Learning to Learn
Test:
Millions of unlabeled images
Some labeled images
Bicycle
Elephant
Dolphin
Tractor
Background Knowledge
Learn novel concept from one example
Learn to Transfer Knowledge
One shot learning of simple visual concepts
Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Abstract: People can learn visual concepts from just one example, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, and we evaluate this idea in the domain of handwritten characters, using a massive new dataset. These simple visual concepts have a rich internal part structure, yet they are particularly tractable for computational models. We introduce a generative model of how characters are composed from strokes, where knowledge from previous characters helps to infer the latent strokes in novel characters. The stroke model outperforms a competing state-of-the-art character model on a challenging one shot learning task, and it provides a good fit to human perceptual data.
Keywords: category learning; transfer learning; Bayesian modeling; neural networks

A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible?

New concepts are almost never learned in a vacuum. Past experience with other concepts in a domain can support the rapid learning of novel concepts, by showing the learner what matters for generalization. Many authors have suggested this as a route to one shot learning: transfer of abstract knowledge from old to new concepts, often called transfer learning, representation learning, or learning to learn. But what is the nature of the learned abstract knowledge that lets humans acquire new object concepts so quickly?

The most straightforward proposals invoke attentional learning (Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002) or overhypotheses (Kemp, Perfors, & Tenenbaum, 2007; Dewar & Xu, in press), like the shape bias in word learning. Prior experience with concepts that are clearly organized along one dimension (e.g., shape, as opposed to color or material) draws a learner's attention to that same dimension (Smith et al., 2002), or increases the prior probability of new concepts concentrating on that same dimension (Kemp et al., 2007). But this approach is limited since it requires that the relevant dimensions of similarity be defined in advance.

For many real-world concepts, the relevant dimensions of similarity may be constructed in the course of learning to learn. For instance, when we first see a Segway, we may parse it into a structure of familiar parts arranged in a novel configuration: it has two wheels, connected by a platform, supporting a motor and a central post at the top of which are two handlebars. These parts and their relations comprise a

Figure 1: Test yourself on one shot learning. From the example boxed in red, can you find the others in the array? On the left is a Segway and on the right is the first character of the Bengali alphabet. (Answer for the Bengali character: Row 2, Column 3; Row 4, Column 2.)

Figure 2: Examples from a new 1600 character database.

useful representational basis for many different vehicle and artifact concepts, a representation that is likely learned in the course of learning the concepts that they support. Several papers from the recent machine learning and computer vision literature argue for such an approach: joint learning of many concepts and a high-level part vocabulary that underlies those concepts (e.g., Torralba, Murphy, & Freeman, 2007; Fei-Fei, Fergus, & Perona, 2006). Another recently popular machine learning approach is based on deep learning (Salakhutdinov & Hinton, 2009): unsupervised learning of hierarchies of distributed feature representations in neural-network-style probabilistic generative models. These models do not specify explicit parts and structural relations, but they can still construct meaningful representations of what makes two objects deeply similar that go substantially beyond low-level image features.

These approaches from machine learning may be compelling ways to understand how humans learn so quickly, but there is little experimental evidence that directly supports them. Models that construct parts or features from sensory data (pixels) while learning object concepts have been tested in elegant behavioral experiments with very simple stimuli and a very small number of concepts (Austerweil & Griffiths, 2009; Schyns, Goldstone, & Thibaut, 1998). But there have been few systematic comparisons of multiple state-of-the-art computational approaches to representation learning with hu-
Hierarchical-Deep Model
HD Models: integrate hierarchical Bayesian models with deep models.
Hierarchical Bayes:
• Learn hierarchies of categories for sharing abstract knowledge.
Super-category
One-Shot Learning
Deep Models:
• Learn hierarchies of features.
• Unsupervised feature learning: no need to rely on human-crafted input features.
Shared higher-level features
Shared low-level features
(Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
Lower-level generic features: edges, combinations of edges.
DBM Model
Images
horse cow car van truck
Higher-level class-sensitive features: capture the distinctive perceptual structure of a specific concept.
Hierarchical Dirichlet Process Model
Hierarchical-Deep Model
Hierarchical Organization of Categories:
• modular data-parameter relations
• express priors on the features that are typical of different kinds of concepts
"animal" "vehicle"
horse cow car van truck
Hierarchical Dirichlet Process Model
Hierarchical-Deep Model
"animal" "vehicle"
horse cow car van truck
Tree hierarchy of classes is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts.
Deep Boltzmann Machine: enforce (approximate) global consistency through many local constraints.
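The nested CRP prior over tree structures applies the basic Chinese Restaurant Process seating rule at each level of the tree. A toy sketch of that rule, where existing branches are chosen in proportion to their popularity and a new branch is opened with probability proportional to a concentration parameter (`crp_assign` and the parameter values are illustrative, not from the talk):

```python
import random

def crp_assign(counts, alpha, rng):
    """Chinese Restaurant Process: seat one new customer.

    counts: list of table occupancy counts (modified in place)
    alpha:  concentration parameter; larger alpha -> more new tables
    Returns the index of the chosen table.
    """
    total = sum(counts) + alpha
    r = rng.random() * total
    for i, c in enumerate(counts):
        r -= c
        if r < 0:          # sit at an existing table, prob. c / total
            counts[i] += 1
            return i
    counts.append(1)       # open a new table, prob. alpha / total
    return len(counts) - 1

rng = random.Random(0)
counts = []
for _ in range(100):
    crp_assign(counts, alpha=1.0, rng=rng)
```

In the nested CRP, each data point walks down the tree running this rule once per level, so popular super-categories ("animal", "vehicle") attract more classes while new branches remain possible.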
CIFAR Object Recognition (Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
• 50,000 images of 100 classes.
• 4 million unlabeled images.
• 32 x 32 pixels x 3 RGB.
"animal" "vehicle"
horse cow car van truck
Lower-level generic features
Higher-level class-sensitive features
Tree hierarchy of classes is learned
Inference: Markov chain Monte Carlo.
Each image is made up of learned higher-level features. Each higher-level feature is made up of lower-level features.
Learning the Hierarchy
woman
The model learns how to share the knowledge across many visual categories.
"human" "fruit" "aquatic animal"
Learned super-‐class hierarchy
Basic level classes: shark, ray, turtle, dolphin, baby, man, girl, sunflower, orange, apple.
Learned low-‐level generic features
“global”
…
…
Learned higher-level class-sensitive features
Sharing Features: Shape and Color
Sunflower
orange apple sunflower
Learning to Learn
“fruit”
[Sunflower ROC curve: detection rate vs. false alarm rate for 1, 3, and 5 examples, compared against pixel-space distance.]
Learning a hierarchy for sharing parameters – rapid learning of a novel concept.
Dolphin
Apple
Real / Reconstructions
Orange
Learning from 3 Examples
Given only 3 Examples
Generated Samples
Willow Tree Rocket
Learned higher-level features
Learned lower-level features
char 1 char 2 char 3 char 4 char 5
“alphabet 1”
“alphabet 2”
Edges
Strokes
Handwritten Character Recognition
25,000 characters
Simulating New Characters
Global
Super class 1
Super class 2
Real data within super class
Class 1 Class 2 New class
Simulated new characters
The same model can be applied to speech, text, video, or any other high-dimensional data.
Discriminative Transfer Learning with Deep Nets
Transfer Learning with Shared Priors
Srivastava and Salakhutdinov, NIPS 2013
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Data: Collection of Modalities
• Multimedia content on the web: image + text + audio.
• Product recommendation systems.
• Robotics applications.
Audio Vision
Touch sensors Motor control
sunset, pacificocean, bakerbeach, seashore, ocean
car, automobile
Shared Concept: "modality-free" representation
"Modality-full" representation
“Concept”
sunset, pacific ocean, baker beach, seashore, ocean
• Improve classification
Multi-Modal Input
pentax, k10d, kangarooisland southaustralia, sa australia australiansealion 300mm
SEA / NOT SEA
• Retrieve data from one modality when queried using data from another modality
beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
• Fill in missing modalities: beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
Challenges I
Very different input representations:
• Images: real-valued, dense.
• Text: discrete, sparse.
Example tags: sunset, pacific ocean, baker beach, seashore, ocean.
Difficult to learn cross-modal features from low-level representations.
Challenges II
Noisy and missing data
Image Text pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion
mickikrimmel, mickipedia, headshot
unseulpixel, naturey, crap
<no text>
Challenges II: Image / Text / Text generated by the model
beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model
night, none, traffic, light, lights, parking, darkness, lowlight, nacht, glow
fall, autumn, trees, leaves, foliage, forest, woods, branches, path
pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion
mickikrimmel, mickipedia, headshot
unseulpixel, naturey, crap
<no text>
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: inputs have very different statistical properties (1-of-K sparse text vs. dense, real-valued image features).
• Difficult to learn cross-modal features.
Gaussian model; Replicated Softmax.
Multimodal DBM (Srivastava & Salakhutdinov, NIPS 2012)
• Word counts modeled with a Replicated Softmax.
• Dense, real-valued image features modeled with a Gaussian RBM.
• Inference combines bottom-up and top-down signals.
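A minimal numpy sketch of the first-layer inference on each pathway, assuming unit-variance Gaussian visibles and the standard Replicated Softmax hidden activation, in which the hidden bias is scaled by document length. All weights below are random placeholders, not trained model parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_rbm_hidden(v, W, b, sigma=1.0):
    """p(h=1 | v) for a Gaussian-visible RBM with unit-variance visibles."""
    return sigmoid((v / sigma) @ W + b)

def replicated_softmax_hidden(counts, W, b):
    """p(h=1 | counts) for a Replicated Softmax RBM; the hidden bias
    is scaled by the document length N = total word count."""
    N = counts.sum()
    return sigmoid(counts @ W + N * b)

rng = np.random.default_rng(0)
img = rng.normal(size=8)                          # dense image features
txt = rng.integers(0, 3, size=12).astype(float)   # sparse word counts
h_img = gaussian_rbm_hidden(img, rng.normal(size=(8, 4)), np.zeros(4))
h_txt = replicated_softmax_hidden(txt, rng.normal(size=(12, 4)), np.zeros(4))
joint_input = np.concatenate([h_img, h_txt])      # feeds the shared top layer
```

In the full DBM these layer-wise activations are only the starting point; mean-field inference then mixes bottom-up and top-down signals across both pathways.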
Text Generated from Images
canada, nature, sunrise, ontario, fog, mist, bc, morning
insect, butterfly, insects, bug, butterflies, lepidoptera
graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
sea, france, boat, mer, beach, river, bretagne, plage, brittany
Given Generated Given Generated
Text Generated from Images Given Generated
water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
portrait, women, army, soldier, mother, postcard, soldiers
obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
Images from Text
water, red, sunset
nature, flower, red, green
blue, green, yellow, colors
chocolate, cake
Given Retrieved
MIR-Flickr Dataset (Huiskes et al.)
• 1 million images along with user-assigned tags.
sculpture, beauty, stone
nikon, green, light, photoshop, apple, d70
white, yellow, abstract, lines, bus, graphic
sky, geotagged, reflection, cielo, bilbao, reflejo
food, cupcake, vegan
d80
anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography
nikon, abigfave, goldstaraward, d80, nikond80
Data and Architecture: 12 million parameters.
Layer sizes: 2048, 1024, 1024, 2000, 1024, 1024, 3857.
• 200 most frequent tags.
• 25K labeled subset (15K training, 10K testing).
• 38 classes: sky, tree, baby, car, cloud, ...
• Additional 1 million unlabeled data points.
Results
• Logistic regression on the top-level representation.
• Multimodal inputs; labeled 25K examples. MAP = mean average precision.

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791
Results (labeled 25K examples + 1 million unlabelled)
• Logistic regression on the top-level representation; multimodal inputs. MAP = mean average precision.

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791
Deep Belief Net        0.638   0.867
Autoencoder            0.638   0.875
DBM                    0.641   0.873
Generating Sentences
Input
A man skiing down the snow covered mountain with a dark sky in the background.
Output
• A more challenging problem: how can we generate complete descriptions of images?
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Encode-Decode Framework
• Encoder: CNN and recurrent neural net for a joint image-sentence embedding.
• Decoder: a neural language model that combines structure and content vectors for generating a sequence of words.
Representation of Words
• Key idea: each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.
Semantic Space
table
chair
dolphin whale
November
(Bengio et al., 2003; Mnih et al., 2008; Mikolov et al., 2009; Kiros et al., 2014)
An Image-Text Encoder: Joint Feature Space
Encoder: ConvNet
ship water
(Socher et al., 2013; Frome et al., 2013; Kiros et al., 2014)
• Learn a joint embedding space of images and text:
  - Can condition on anything (images, words, phrases, etc.)
  - Natural definition of a scoring function (inner products in the joint space).
An Image-Text Encoder
Recurrent Neural Network (LSTM)
1-of-V encoding of words
w1 w2 w3
Convolutional Neural Network
Sentence Representation
Image Representation
See Skip-Thought Vectors (Kiros et al., arXiv 2015)
An Image-Text Encoder: Joint Feature Space
A castle and reflecting water
A ship sailing in the ocean
Images:
Text:
A plane flying in the sky
Minimize the following objective:
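The objective itself is not reproduced on the slide; in Kiros et al.'s formulation it is a pairwise ranking (hinge) loss that pushes matched image-sentence scores above mismatched ones by a margin. A hedged numpy sketch (margin value and names are illustrative):

```python
import numpy as np

def ranking_loss(img, sen, margin=0.2):
    """Pairwise ranking loss for a joint image-sentence embedding.

    img, sen: (N, D) embeddings; row i of each is a matched pair.
    The score is the inner product in the joint space.
    """
    scores = img @ sen.T                  # (N, N) all pair scores
    pos = np.diag(scores)                 # matched-pair scores
    # image -> wrong sentence, and sentence -> wrong image, hinge terms
    cost_s = np.clip(margin - pos[:, None] + scores, 0, None)
    cost_i = np.clip(margin - pos[None, :] + scores, 0, None)
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_s.sum() + cost_i.sum()

# Perfectly aligned embeddings incur zero loss.
e = np.eye(3)
zero = ranking_loss(e, e)
```

Minimizing this loss over the CNN and LSTM encoder parameters is what shapes the shared space.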
Retrieving Sentences for Images
Tagging and Retrieval
mosque, tower, building, cathedral, dome, castle
kitchen, stove, oven, refrigerator, microwave
ski, skiing, skiers, skiiers, snowmobile
bowl, cup, soup, cups, coffee
beach
snow
Retrieval with Adjectives
fluffy
delicious
Multimodal Linguistic Regularities: Nearest Images
(Kiros, Salakhutdinov, Zemel, TACL 2015)
Can you solve the Zero-Shot Problem?
Canada Warbler, Yellow Warbler, Sharp-tailed Sparrow
Model Architecture
• Wikipedia article -> TF-IDF -> MLP (g).
• Image -> CNN (f).
• Class score: dot product of the two feature vectors.
Example article: "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus..."
TF-IDF terms: family, north, genus, birds, south, america, ...
Dimensions shown: C x k, 1 x k, 1 x C.
Alternative View: Joint Semantic Feature Space
• Minimize the cross-entropy or hinge-loss objective.
(L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, arXiv 2015)
The Model
• Consider a binary one-vs-all classifier for class c.
• Assume that we are given an additional text feature for each class.
• Simple idea: instead of learning a static weight vector w, the text feature can be used to predict the weights, via a mapping that transforms the text features to the visual image feature space.
• We can use this idea to predict the output weights of a classifier (both the fully connected and convolutional layers of a CNN).
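A toy sketch of the weight-prediction idea, with a single linear map `G` standing in for the learned text-to-visual mapping g (shapes, names, and the TF-IDF stand-ins are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def zero_shot_scores(x, text_feats, G):
    """Score classes whose weights are *predicted* from text.

    x:          (D,) image feature vector (e.g. from a CNN)
    text_feats: (C, T) one TF-IDF vector per class article
    G:          (T, D) learned map from text space to image-feature space
    Class score: w_c . x  with predicted weights  w_c = t_c @ G.
    """
    W = text_feats @ G        # (C, D) predicted weight vectors
    return W @ x              # (C,) one-vs-all class scores

rng = np.random.default_rng(1)
D, T, C = 5, 7, 3
x = rng.normal(size=D)           # image features
t = rng.normal(size=(C, T))      # per-class article features
G = rng.normal(size=(T, D))
scores = zero_shot_scores(x, t, G)
```

Because the weights come from the article rather than from labeled images, a brand-new class can be scored by simply appending its article's row to `text_feats`.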
Problem Setup
• The training set contains N images and their associated class labels. There are C distinct class labels.
• At test time, we are given an additional number of previously unseen classes.
• Our goal is to predict previously unseen classes and perform well on the previously seen classes.
• Interpretation: learning a good similarity kernel between images and encyclopedia articles.
Caltech UCSD Bird and Oxford Flower Datasets
• The CUB200-2010 dataset contains 6,033 images from 200 bird species (about 30 images per class).
• There is one Wikipedia article associated with each bird class (200 articles in total). The average number of words per article is 400.
• Out of 200 classes, 40 classes are defined as unseen and the remaining 160 classes are defined as seen.
• The Oxford Flower-102 dataset contains 102 classes with a total of 8,189 images.
• Each class contains 40 to 260 images. 82 flower classes are used for training and 20 classes are used as unseen during testing.
Results: Area Under ROC (the CUB200-2010 dataset)

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al., 2013]   0.59     -       -
DA (VGG features)             0.62     -       -
Ours (fc)                     0.82     0.96    0.934
Ours (conv)                   0.73     0.96    0.91
Ours (fc+conv)                0.80     0.987   0.95

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV, 2013.
Results: Area Under ROC (the Oxford Flower dataset)

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al.]         0.62     -       -
GPR+DA [Elhoseiny et al.]     0.68     -       -
Ours (fc)                     0.70     0.987   0.90
Ours (conv)                   0.65     0.97    0.85
Ours (fc+conv)                0.71     0.989   0.93

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV, 2013.
Tanagers, Thraupidae, Scarlet, Cardinalidae, Tanagers
Attribute Discovery: word sensitivities of unseen classes
Nearest Neighbors
• The Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown.
rhizome, depth, freezing, compost, iris
Bearded Iris
How About Generating Sentences?
Input
A man skiing down the snow covered mountain with a dark sky in the background.
Output
We want to model:
Log-bilinear Neural Language Model
• Each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.
• Feedforward neural network with a single linear hidden layer.
• Let R denote the V × K matrix of word representation vectors, where V is the vocabulary size.
• (w_1, ..., w_{n-1}) is a tuple of n-1 words, where n-1 is the context size. The predicted representation of the next word becomes r̂ = Σ_i C_i r_{w_i}, where the C_i are K × K context parameter matrices.
1-of-V encoding of words
r_w1, r_w2, r_w3 -> r̂_w4
w1 w2 w3
Log-bilinear Neural Language Model
• The conditional probability of the next word is given by P(w_n = w | w_{1:n-1}) ∝ exp(r̂^T r_w + b_w), where r̂ is the predicted representation of r_{w_n}.
1-of-V encoding of words
r_w1, r_w2, r_w3 -> r̂_w4
w1 w2 w3
The softmax normalization over the vocabulary can be expensive to compute. (Bengio et al., 2003)
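A minimal numpy sketch of the model above, with random placeholder parameters (the function name and toy sizes are illustrative, not from the talk):

```python
import numpy as np

def lbl_next_word_probs(context_ids, R, C, b):
    """Log-bilinear language model next-word distribution.

    context_ids: indices of the previous n-1 words
    R: (V, K) word representation matrix
    C: (n-1, K, K) context parameter matrices
    b: (V,) per-word bias
    """
    # Predicted next-word representation: r_hat = sum_i C_i r_{w_i}
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    logits = R @ r_hat + b            # score exp(r_hat . r_w + b_w)
    e = np.exp(logits - logits.max())
    return e / e.sum()                # the expensive V-way softmax

rng = np.random.default_rng(0)
V, K = 10, 4
probs = lbl_next_word_probs([1, 2, 3],
                            rng.normal(size=(V, K)),
                            rng.normal(size=(3, K, K)),
                            np.zeros(V))
```

The final normalization is the cost the slide flags: it touches every word in the vocabulary at every prediction step.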
Multiplicative Model
• We represent words as a tensor, where G is the number of tensor slices.
• Re-represent the tensor in terms of 3 lower-rank matrices (where F is the number of pre-chosen factors).
• Given an attribute vector u ∈ R^G (e.g. image features), we can compute attribute-gated word representations.
• Let E denote a folded K × V matrix of word embeddings.
(Kiros, Zemel, Salakhutdinov, NIPS 2014)
Multiplicative Log-bilinear Model
• The predicted next-word representation is computed as before, but with the attribute-gated embeddings.
• Given the next word representation r, the factor outputs involve a component-wise product with the gating attributes.
1-of-V encoding of words
E_w1, E_w2, E_w3 (low rank) -> r̂_w4; gating attributes u
w1 w2 w3
(Kiros, Zemel, Salakhutdinov, NIPS 2014)
• The conditional probability of the next word is given by the same softmax as before, now using the attribute-gated representations.
Decoding: Neural Language Model
• Image features gate the hidden-to-output connections when predicting the next word.
• We can also condition on POS tags when generating a sentence.
(Kiros, Salakhutdinov, Zemel, ICML 2014)
Decoding: Structured NLM
Caption Generation
Filling in the Blanks
Caption Generation
Model Samples
• Two men in a room talking on a table.
• Two men are sitting next to each other.
• Two men are having a conversation at a table.
• Two men sitting at a desk next to each other.
TAGS: colleagues, waiters, waiter, entrepreneurs, busboy
Results
• R@K is Recall@K (high is good).
• Med r is the median rank (low is good).
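These retrieval metrics are easy to state in code. A small sketch, assuming we already have the 1-based rank of the ground-truth item in each query's result list:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall@K and median rank from 1-based ranks of the
    ground-truth item for each query."""
    ranks = np.asarray(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    med_r = float(np.median(ranks))
    return recall, med_r

# Five queries whose correct items landed at these ranks.
recall, med_r = retrieval_metrics([1, 3, 7, 12, 2])
# recall: {1: 0.2, 5: 0.6, 10: 0.8}; med_r: 3.0
```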
Caption Generation with Visual Attention
A woman is throwing a frisbee in a park.
(Xu et al., ICML 2015)
• The Montreal/Toronto team took 3rd place in the Microsoft COCO caption generation competition, finishing slightly behind Google and Microsoft, based on the human evaluation results.
Recurrent Attention Model
Coarse Image
Classification or Caption Generation
Recurrent Neural Network
Sample action:
• Hard attention: sample the action (latent gaze location).
• Soft attention: take the expectation instead of sampling.
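The soft-attention expectation can be sketched in a few lines. This is a generic sketch, not the exact ICML 2015 model (there the relevance scores come from an MLP over the LSTM state and the annotation vectors):

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention: expected feature under softmax weights.

    features: (L, D) annotation vectors, one per image location
    scores:   (L,) relevance score e_i for each location
    Returns the context vector z = sum_i alpha_i a_i with
    alpha = softmax(scores). Fully differentiable, so it trains
    with plain backprop; hard attention instead samples one
    location and needs REINFORCE-style gradients.
    """
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    return alpha @ features

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
z_uniform = soft_attention(feats, np.array([0.0, 0.0]))    # equal weights
z_peaked = soft_attention(feats, np.array([100.0, 0.0]))   # nearly one-hot
```

With very peaked scores the expectation approaches the hard-attention behaviour of picking a single location.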
Relationship to Helmholtz Machines
Coarse image
Stochastic Units
Deterministic Units
• Goal: maximize the probability of the correct class (or sequence of words) by marginalizing over the actions (latent gaze locations).
• A neural network with stochastic and deterministic units.
(Ba et al., 2015)
Relationship to Helmholtz Machines
• View this model as a conditional Helmholtz Machine with stochastic and deterministic units.
• Can use Wake-Sleep, variational autoencoders, and their variants to learn a good attention policy.
Helmholtz Machine: input data v, hidden layers h1, h2, h3 with weights W1, W2, W3.
(Ba et al., 2015)
Wake-Sleep Recurrent Attention Models (Ba et al., 2015)
Improving Action Recognition with Visual Attention
Cycling, horseback riding, basketball shooting, soccer juggling.
(Sharma et al., 2015)
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Motivation
• Goal: unsupervised learning of high-quality sentence representations.
• Previous approaches: learning composition operators that map word vectors to sentence vectors, including recursive networks, recurrent networks, etc.
• Instead of using a word to predict its surrounding context, we encode a sentence to predict the sentences around it.
• We develop an objective that abstracts the skip-gram model to the sentence level.
Sequence to Sequence Learning
• RNN Encoder-Decoders for Machine Translation (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2013; Srivastava et al., 2015).
Input Sequence -> Encoder -> Learned Representation -> Decoder -> Output Sequence
Skip-Thought Model
• Given a tuple of contiguous sentences:
  - the middle sentence is encoded using an LSTM.
  - the encoding attempts to reconstruct the previous sentence and the next sentence.
• The input is the sentence triplet:
  - I got back home.
  - I could see the cat on the steps.
  - This was strange.
(Kiros et al., arXiv 2015)
Encoder: sentence -> generate forward sentence; generate previous sentence.
Learning Objective
• We are given a tuple of contiguous sentences (s_{i-1}, s_i, s_{i+1}).
• Objective: the sum of the log-probabilities for the next and previous sentences conditioned on the encoder representation:
  Σ_t log P(w_{i+1,t} | w_{i+1,<t}, h_i) + Σ_t log P(w_{i-1,t} | w_{i-1,<t}, h_i)
where h_i is the representation of the encoder.
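Once the two decoders have assigned a probability to each token, the objective is just a sum of logs. A toy sketch with made-up token probabilities (the real model produces them with LSTM decoders conditioned on h_i):

```python
import math

def skip_thought_objective(next_probs, prev_probs):
    """Sum of token log-probabilities from the forward decoder
    (next sentence) and the backward decoder (previous sentence),
    both conditioned on the encoder representation of the middle
    sentence."""
    return (sum(math.log(p) for p in next_probs) +
            sum(math.log(p) for p in prev_probs))

# Illustrative per-token probabilities, not from a trained model.
obj = skip_thought_objective(next_probs=[0.5, 0.25],
                             prev_probs=[0.5])
```

Training maximizes this quantity summed over all triplets in the corpus, so the encoder is pushed to retain whatever about a sentence predicts its neighbours.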
Book corpus (11K books)
• Query sentence along with its nearest neighbor from 500K sentences, using cosine similarity:
  - He ran his hand inside his coat, double-checking that the unopened letter was still there.
  - He slipped his hand between his coat and his shirt, where the folded copies lay in a brown envelope.
Semantic Relatedness
• SemEval 2014 Task 1: semantic relatedness on the SICK dataset. Given two sentences, produce a score of how semantically related the sentences are, based on human-generated scores (1 to 5).
• The dataset comes with a predefined split of 4,500 training pairs, 500 development pairs, and 4,927 testing pairs.
• Using skip-thought vectors for each sentence, train a linear regression to predict semantic relatedness.
  - For a pair of sentences, compute component-wise features between the pair (e.g. |u-v|).
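A sketch of the pair-feature construction. The slide names only |u-v|; including the element-wise product u * v alongside it is an assumption here, following common practice with skip-thought vectors:

```python
import numpy as np

def pair_features(u, v):
    """Component-wise features for a sentence pair: |u - v| and u * v.
    These become the input to a simple linear regressor that predicts
    the human relatedness score."""
    return np.concatenate([np.abs(u - v), u * v])

u = np.array([1.0, 2.0])   # skip-thought vector of sentence 1 (toy)
v = np.array([3.0, 1.0])   # skip-thought vector of sentence 2 (toy)
f = pair_features(u, v)    # -> [2. 1. 3. 2.]
```

Both features are symmetric in the two sentences, which matches the symmetry of the relatedness judgment itself.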
Semantic Relatedness
• Our models outperform all previous systems from the SemEval 2014 competition. This is remarkable, given the simplicity of our approach and the lack of feature engineering.
SemEval 2014 submissions
Results reported by Tai et al.
Ours
Semantic Relatedness
• Example predictions from the SICK test set. GT is the ground-truth relatedness, scored between 1 and 5.
• The last few results: slight changes in sentences result in large changes in relatedness that we are unable to score correctly.
Paraphrase Detection
• Microsoft Research Paraphrase Corpus: for two sentences, one must predict whether or not they are paraphrases.
• The training set contains 4,076 sentence pairs (2,753 are positive).
• The test set contains 1,725 pairs (1,147 are positive).
Recursive Auto-‐encoders
Best published results
Ours
Classification Benchmarks • 5 datasets: movie review sentiment (MR), customer product
reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA) and question-type classification (TREC).
Bag-‐of-‐words
Supervised
Ours
Summary • We believe our model for learning skip-thought vectors only
scratches the surface of possible objectives.
• Many variations have yet to be explored, including: - deep encoders and decoders - larger context windows - encoding and decoding paragraphs - other encoders
• It is likely that more exploration of this space will result in even higher-quality sentence representations.
• Code and data are available online: http://www.cs.toronto.edu/~mbweb/
Multi-Modal Models
Laser scans
Images
Video
Text & Language
Time series data
Speech & Audio
Develop learning systems that come closer to displaying human-like intelligence.
Summary • Efficient learning algorithms for hierarchical generative models.
Learning more adaptive, robust, and structured representations.
• Deep models can improve the current state of the art in many application domains: Ø Object recognition and detection, text and image retrieval, handwritten
character and speech recognition, and others.
Text & image retrieval / Object recognition
Learning a Category Hierarchy
Dealing with missing/occluded data
HMM decoder
Speech Recognition
sunset, pacific ocean, beach, seashore
Multimodal Data: Caption Generation
Thank you
Code is available at: http://deeplearning.cs.toronto.edu/
References
• J. Ba, V. Mnih, K. Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015.
• L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions, arXiv 2015.
• Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A Neural Probabilistic Language Model, JMLR 2003.
• Y. Bengio, Y. LeCun, Scaling Learning Algorithms towards AI, Large-Scale Kernel Machines, MIT Press, 2007.
• Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
• Y. Burda, R. Grosse, R. Salakhutdinov, Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing, AI and Statistics (AISTATS), 2015.
• K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014.
• A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013.
• G. Hinton, R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, 2006.
• G. Hinton, S. Osindero, Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 18, 2006.
• G. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002.
References
• H. Lee, A. Battle, R. Raina, A. Ng, Efficient Sparse Coding Algorithms, NIPS 2007.
• H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009.
• N. Kalchbrenner, P. Blunsom, Recurrent Continuous Translation Models, EMNLP 2013.
• R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal Neural Language Models, ICML 2014.
• R. Kiros, R. Salakhutdinov, R. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, TACL 2015.
• R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, Skip-Thought Vectors, arXiv 2015.
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 1998.
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013.
• R. Salakhutdinov, J. Tenenbaum, A. Torralba, Learning with Hierarchical-Deep Models, IEEE PAMI, vol. 35, no. 8, Aug. 2013.
• R. Salakhutdinov, Learning Deep Generative Models, Annual Review of Statistics and Its Application, Vol. 2, 2015.
• R. Salakhutdinov, G. Hinton, An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 2012, Vol. 24, No. 8.
• R. Salakhutdinov, H. Larochelle, Efficient Learning of Deep Boltzmann Machines, AI and Statistics, 2010.
References
• R. Salakhutdinov, G. Hinton, Replicated Softmax: an Undirected Topic Model, NIPS 2010.
• R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007.
• N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs, ICML 2015.
• N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014.
• I. Sutskever, O. Vinyals, Q. Le, Sequence to Sequence Learning with Neural Networks, NIPS 2014.
• R. Socher, M. Ganjoo, C. Manning, A. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013.
• Y. Tang, R. Salakhutdinov, G. Hinton, Deep Lambertian Networks, ICML 2012.
• Y. Tang, R. Salakhutdinov, G. Hinton, Robust Boltzmann Machines for Recognition and Denoising, CVPR 2012.
• K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.