Deep Learning Tutorial, Part 2 (by Ruslan Salakhutdinov)
TRANSCRIPT
Deep Learning II
Ruslan Salakhutdinov
Department of Computer Science, University of Toronto / CMU
Canadian Institute for Advanced Research
Microsoft Machine Learning and Intelligence School
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors for NLP
Face Recognition
Yale B Extended Face Dataset: 4 subsets of increasing illumination variation.
Due to extreme illumination variations, deep models perform quite poorly on this dataset.
Consider More Structured Models: undirected + directed models.
Deep Lambertian Model
Deep undirected + directed model.
Combines the elegant properties of the Lambertian model with the Gaussian DBM model.
Observed Image
Inferred
(Tang et al., ICML 2012; Tang et al., CVPR 2012)
Lambertian Reflectance Model
• A simple model of the image formation process.
• Albedo: diffuse reflectivity of a surface; material dependent, illumination independent.
• Surface normal: perpendicular to the tangent plane at a point on the surface.
• Images with different illumination can be generated by varying light directions.
Image, albedo, surface normal, light source.
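The image formation the slide describes multiplies the per-pixel albedo by the (clamped) dot product of the surface normal and the light direction. A minimal rendering sketch, assuming unit-length normals and light vector (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def lambertian_image(albedo, normals, light):
    """Render a Lambertian image: I(p) = albedo(p) * max(0, n(p) . l).

    albedo:  (H, W) diffuse reflectivity per pixel
    normals: (H, W, 3) unit surface normals
    light:   (3,) unit light-direction vector
    """
    shading = np.clip(normals @ light, 0.0, None)  # max(0, n . l)
    return albedo * shading

# Toy example: a flat surface whose normals all point along +z,
# lit head-on versus lit from behind.
albedo = np.full((2, 2), 0.8)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
head_on = lambertian_image(albedo, normals, np.array([0.0, 0.0, 1.0]))
from_behind = lambertian_image(albedo, normals, np.array([0.0, 0.0, -1.0]))
```

Varying only `light` while holding albedo and normals fixed generates the different-illumination images the slide mentions.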
Deep Lambertian Model
Observed Image
Image albedo
Surface normals
Light source
Gaussian Deep Boltzmann Machine (Albedo DBM):
Pretrained using the Toronto Face Database
Transfer Learning
Inference: variational inference. Learning: stochastic approximation.
Inferred
Yale B Extended Face Dataset
• 38 subjects, ~45 images of varying illumination per subject, divided into 4 subsets of increasing illumination variation.
• 28 subjects for training and 10 for testing.
Face Relighting
One Test Image
Face Relighting: observed image and inferred albedo.
Recognition Results: recognition as a function of the number of training images for 10 test subjects.
One-Shot Recognition
What about dealing with occlusions or structured noise?
Robust Boltzmann Machines
• Build more structured models that can deal with occlusions or structured noise.
Observed Image
Inferred / Truth
Inferred Binary Mask
(Tang et al., ICML 2012; Tang et al., CVPR 2012)
Robust Boltzmann Machines: components
• Gaussian RBM, modeling clean faces.
• Binary RBM, modeling occlusions.
• Binary pixel-wise mask.
• Gaussian noise model: is a heavy-tailed distribution.
Observed Image
Inference: variational inference. Learning: stochastic approximation.
Inference on the test subjects
Internal states of the RoBM during learning, shown as a function of the number of iterations (1, 3, 5, 7, 10, 20, 30, 40, 50).
Inferred
Recognition Results on the AR Face Database
Learning Algorithm   Sunglasses   Scarf
Robust BM            84.5%        80.7%
RBM                  61.7%        32.9%
Eigenfaces           66.9%        38.6%
LDA                  56.1%        27.0%
Pixel                51.3%        17.5%
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Transfer Learning “zarc” “segway”
How can we learn a novel concept, a high-dimensional statistical object, from few examples?
(Lake et al., 2010)
Supervised Learning
Segway Motorcycle
Test:
Learning to Learn
Test:
Millions of unlabeled images
Some labeled images
Bicycle
Elephant
Dolphin
Tractor
Background Knowledge
Learn novel concept from one example
Learn to Transfer Knowledge
One shot learning of simple visual concepts
Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Abstract: People can learn visual concepts from just one example, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, and we evaluate this idea in the domain of handwritten characters, using a massive new dataset. These simple visual concepts have a rich internal part structure, yet they are particularly tractable for computational models. We introduce a generative model of how characters are composed from strokes, where knowledge from previous characters helps to infer the latent strokes in novel characters. The stroke model outperforms a competing state-of-the-art character model on a challenging one shot learning task, and it provides a good fit to human perceptual data.
Keywords: category learning; transfer learning; Bayesian modeling; neural networks

A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible?

New concepts are almost never learned in a vacuum. Past experience with other concepts in a domain can support the rapid learning of novel concepts, by showing the learner what matters for generalization. Many authors have suggested this as a route to one shot learning: transfer of abstract knowledge from old to new concepts, often called transfer learning, representation learning, or learning to learn. But what is the nature of the learned abstract knowledge that lets humans acquire new object concepts so quickly?

The most straightforward proposals invoke attentional learning (Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002) or overhypotheses (Kemp, Perfors, & Tenenbaum, 2007; Dewar & Xu, in press), like the shape bias in word learning. Prior experience with concepts that are clearly organized along one dimension (e.g., shape, as opposed to color or material) draws a learner's attention to that same dimension (Smith et al., 2002), or increases the prior probability of new concepts concentrating on that same dimension (Kemp et al., 2007). But this approach is limited since it requires that the relevant dimensions of similarity be defined in advance.

For many real-world concepts, the relevant dimensions of similarity may be constructed in the course of learning to learn. For instance, when we first see a Segway, we may parse it into a structure of familiar parts arranged in a novel configuration: it has two wheels, connected by a platform, supporting a motor and a central post at the top of which are two handlebars. These parts and their relations comprise a

Figure 1: Test yourself on one shot learning. From the example boxed in red, can you find the others in the array? On the left is a Segway and on the right is the first character of the Bengali alphabet. (Answer for the Bengali character: Row 2, Column 3; Row 4, Column 2.)

Figure 2: Examples from a new 1600 character database.

useful representational basis for many different vehicle and artifact concepts, a representation that is likely learned in the course of learning the concepts that they support. Several papers from the recent machine learning and computer vision literature argue for such an approach: joint learning of many concepts and a high-level part vocabulary that underlies those concepts (e.g., Torralba, Murphy, & Freeman, 2007; Fei-Fei, Fergus, & Perona, 2006). Another recently popular machine learning approach is based on deep learning (Salakhutdinov & Hinton, 2009): unsupervised learning of hierarchies of distributed feature representations in neural-network-style probabilistic generative models. These models do not specify explicit parts and structural relations, but they can still construct meaningful representations of what makes two objects deeply similar that go substantially beyond low-level image features.

These approaches from machine learning may be compelling ways to understand how humans learn so quickly, but there is little experimental evidence that directly supports them. Models that construct parts or features from sensory data (pixels) while learning object concepts have been tested in elegant behavioral experiments with very simple stimuli and a very small number of concepts (Austerweil & Griffiths, 2009; Schyns, Goldstone, & Thibaut, 1998). But there have been few systematic comparisons of multiple state-of-the-art computational approaches to representation learning with hu-
Hierarchical-Deep Model
HD Models: integrate hierarchical Bayesian models with deep models.
Hierarchical Bayes:
• Learn hierarchies of categories for sharing abstract knowledge.
Super-category
One-Shot Learning
Deep Models:
• Learn hierarchies of features.
• Unsupervised feature learning: no need to rely on human-crafted input features.
Shared higher-level features
Shared low-level features
(Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
Lower-level generic features: edges, combinations of edges.
DBM Model
Images
horse cow car van truck
Higher-level class-sensitive features: capture the distinctive perceptual structure of a specific concept.
Hierarchical Dirichlet Process Model
Hierarchical-Deep Model
Hierarchical Organization of Categories:
• modular data-parameter relations
• express priors on the features that are typical of different kinds of concepts
"animal" "vehicle"
horse cow car van truck
Hierarchical Dirichlet Process Model
Hierarchical-Deep Model
"animal" "vehicle"
horse cow car van truck
Tree hierarchy of classes is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts.
Deep Boltzmann Machine: enforce (approximate) global consistency through many local constraints.
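The nested CRP prior over tree structures applies the basic Chinese Restaurant Process seating rule at each level of the tree. A toy sketch of that rule, where existing branches are chosen in proportion to their popularity and a new branch is opened with probability proportional to a concentration parameter (`crp_assign` and the parameter values are illustrative, not from the talk):

```python
import random

def crp_assign(counts, alpha, rng):
    """Chinese Restaurant Process: seat one new customer.

    counts: list of table occupancy counts (modified in place)
    alpha:  concentration parameter; larger alpha -> more new tables
    Returns the index of the chosen table.
    """
    total = sum(counts) + alpha
    r = rng.random() * total
    for i, c in enumerate(counts):
        r -= c
        if r < 0:          # sit at an existing table, prob. c / total
            counts[i] += 1
            return i
    counts.append(1)       # open a new table, prob. alpha / total
    return len(counts) - 1

rng = random.Random(0)
counts = []
for _ in range(100):
    crp_assign(counts, alpha=1.0, rng=rng)
```

In the nested CRP, each data point walks down the tree running this rule once per level, so popular super-categories ("animal", "vehicle") attract more classes while new branches remain possible.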
CIFAR Object Recognition (Salakhutdinov, Tenenbaum, Torralba, NIPS 2011; PAMI 2013)
• 50,000 images of 100 classes.
• 4 million unlabeled images.
• 32 x 32 pixels x 3 RGB.
"animal" "vehicle"
horse cow car van truck
Lower-level generic features
Higher-level class-sensitive features
Tree hierarchy of classes is learned
Inference: Markov chain Monte Carlo.
Each image is made up of learned higher-level features. Each higher-level feature is made up of lower-level features.
Learning the Hierarchy
woman
The model learns how to share the knowledge across many visual categories.
"human" "fruit" "aquatic animal"
Learned super-‐class hierarchy
Basic level classes: shark, ray, turtle, dolphin, baby, man, girl, sunflower, orange, apple.
Learned low-‐level generic features
“global”
…
…
Learned higher-level class-sensitive features
Sharing Features: Shape and Color
Sunflower
orange apple sunflower
Learning to Learn
“fruit”
[Sunflower ROC curve: detection rate vs. false alarm rate for 1, 3, and 5 examples, compared against pixel-space distance.]
Learning a hierarchy for sharing parameters – rapid learning of a novel concept.
Dolphin
Apple
Real / Reconstructions
Orange
Learning from 3 Examples
Given only 3 Examples
Generated Samples
Willow Tree Rocket
Learned higher-level features
Learned lower-level features
char 1 char 2 char 3 char 4 char 5
“alphabet 1”
“alphabet 2”
Edges
Strokes
Handwritten Character Recognition
25,000 characters
Simulating New Characters
Global
Super class 1
Super class 2
Real data within super class
Class 1 Class 2 New class
Simulated new characters
The same model can be applied to speech, text, video, or any other high-dimensional data.
Discriminative Transfer Learning with Deep Nets
Transfer Learning with Shared Priors
Srivastava and Salakhutdinov, NIPS 2013
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Data: Collection of Modalities
• Multimedia content on the web: image + text + audio.
• Product recommendation systems.
• Robotics applications.
Audio Vision
Touch sensors Motor control
sunset, pacificocean, bakerbeach, seashore, ocean
car, automobile
Shared Concept: "modality-free" representation
"Modality-full" representation
“Concept”
sunset, pacific ocean, baker beach, seashore, ocean
• Improve classification
Multi-Modal Input
pentax, k10d, kangarooisland southaustralia, sa australia australiansealion 300mm
SEA / NOT SEA
• Retrieve data from one modality when queried using data from another modality
beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
• Fill in missing modalities: beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
Challenges I
Very different input representations:
• Images: real-valued, dense.
• Text: discrete, sparse.
Example tags: sunset, pacific ocean, baker beach, seashore, ocean.
Difficult to learn cross-modal features from low-level representations.
Challenges II
Noisy and missing data
Image Text pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion
mickikrimmel, mickipedia, headshot
unseulpixel, naturey, crap
<no text>
Challenges II: Image / Text / Text generated by the model
beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves
portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model
night, none, traffic, light, lights, parking, darkness, lowlight, nacht, glow
fall, autumn, trees, leaves, foliage, forest, woods, branches, path
pentax, k10d, pentaxda50200, kangarooisland, sa, australiansealion
mickikrimmel, mickipedia, headshot
unseulpixel, naturey, crap
<no text>
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: inputs have very different statistical properties (1-of-K sparse text vs. dense, real-valued image features).
• Difficult to learn cross-modal features.
Gaussian model; Replicated Softmax.
Multimodal DBM (Srivastava & Salakhutdinov, NIPS 2012)
• Word counts modeled with a Replicated Softmax.
• Dense, real-valued image features modeled with a Gaussian RBM.
• Inference combines bottom-up and top-down signals.
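A minimal numpy sketch of the first-layer inference on each pathway, assuming unit-variance Gaussian visibles and the standard Replicated Softmax hidden activation, in which the hidden bias is scaled by document length. All weights below are random placeholders, not trained model parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_rbm_hidden(v, W, b, sigma=1.0):
    """p(h=1 | v) for a Gaussian-visible RBM with unit-variance visibles."""
    return sigmoid((v / sigma) @ W + b)

def replicated_softmax_hidden(counts, W, b):
    """p(h=1 | counts) for a Replicated Softmax RBM; the hidden bias
    is scaled by the document length N = total word count."""
    N = counts.sum()
    return sigmoid(counts @ W + N * b)

rng = np.random.default_rng(0)
img = rng.normal(size=8)                          # dense image features
txt = rng.integers(0, 3, size=12).astype(float)   # sparse word counts
h_img = gaussian_rbm_hidden(img, rng.normal(size=(8, 4)), np.zeros(4))
h_txt = replicated_softmax_hidden(txt, rng.normal(size=(12, 4)), np.zeros(4))
joint_input = np.concatenate([h_img, h_txt])      # feeds the shared top layer
```

In the full DBM these layer-wise activations are only the starting point; mean-field inference then mixes bottom-up and top-down signals across both pathways.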
Text Generated from Images
canada, nature, sunrise, ontario, fog, mist, bc, morning
insect, butterfly, insects, bug, butterflies, lepidoptera
graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
sea, france, boat, mer, beach, river, bretagne, plage, brittany
Given Generated Given Generated
Text Generated from Images Given Generated
water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
portrait, women, army, soldier, mother, postcard, soldiers
obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
Images from Text
water, red, sunset
nature, flower, red, green
blue, green, yellow, colors
chocolate, cake
Given Retrieved
MIR-Flickr Dataset (Huiskes et al.)
• 1 million images along with user-assigned tags.
sculpture, beauty, stone
nikon, green, light, photoshop, apple, d70
white, yellow, abstract, lines, bus, graphic
sky, geotagged, reflection, cielo, bilbao, reflejo
food, cupcake, vegan
d80
anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography
nikon, abigfave, goldstaraward, d80, nikond80
Data and Architecture: 12 million parameters.
Layer sizes: 2048, 1024, 1024, 2000, 1024, 1024, 3857.
• 200 most frequent tags.
• 25K labeled subset (15K training, 10K testing).
• 38 classes: sky, tree, baby, car, cloud, ...
• Additional 1 million unlabeled data points.
Results
• Logistic regression on the top-level representation.
• Multimodal inputs; labeled 25K examples. MAP = mean average precision.

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791
Results (labeled 25K examples + 1 million unlabelled)
• Logistic regression on the top-level representation; multimodal inputs. MAP = mean average precision.

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791
Deep Belief Net        0.638   0.867
Autoencoder            0.638   0.875
DBM                    0.641   0.873
Generating Sentences
Input
A man skiing down the snow covered mountain with a dark sky in the background.
Output
• A more challenging problem: how can we generate complete descriptions of images?
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Encode-Decode Framework
• Encoder: CNN and recurrent neural net for a joint image-sentence embedding.
• Decoder: a neural language model that combines structure and content vectors for generating a sequence of words.
Representation of Words
• Key idea: each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.
Semantic Space
table
chair
dolphin whale
November
(Bengio et al., 2003; Mnih et al., 2008; Mikolov et al., 2009; Kiros et al., 2014)
An Image-Text Encoder: Joint Feature Space
Encoder: ConvNet
ship water
(Socher et al., 2013; Frome et al., 2013; Kiros et al., 2014)
• Learn a joint embedding space of images and text:
  - Can condition on anything (images, words, phrases, etc.)
  - Natural definition of a scoring function (inner products in the joint space).
An Image-Text Encoder
Recurrent Neural Network (LSTM)
1-of-V encoding of words
w1 w2 w3
Convolutional Neural Network
Sentence Representation
Image Representation
See Skip-Thought Vectors (Kiros et al., arXiv 2015)
An Image-Text Encoder: Joint Feature Space
A castle and reflecting water
A ship sailing in the ocean
Images:
Text:
A plane flying in the sky
Minimize the following objective:
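The objective itself is not reproduced on the slide; in Kiros et al.'s formulation it is a pairwise ranking (hinge) loss that pushes matched image-sentence scores above mismatched ones by a margin. A hedged numpy sketch (margin value and names are illustrative):

```python
import numpy as np

def ranking_loss(img, sen, margin=0.2):
    """Pairwise ranking loss for a joint image-sentence embedding.

    img, sen: (N, D) embeddings; row i of each is a matched pair.
    The score is the inner product in the joint space.
    """
    scores = img @ sen.T                  # (N, N) all pair scores
    pos = np.diag(scores)                 # matched-pair scores
    # image -> wrong sentence, and sentence -> wrong image, hinge terms
    cost_s = np.clip(margin - pos[:, None] + scores, 0, None)
    cost_i = np.clip(margin - pos[None, :] + scores, 0, None)
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_s.sum() + cost_i.sum()

# Perfectly aligned embeddings incur zero loss.
e = np.eye(3)
zero = ranking_loss(e, e)
```

Minimizing this loss over the CNN and LSTM encoder parameters is what shapes the shared space.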
Retrieving Sentences for Images
Tagging and Retrieval
mosque, tower, building, cathedral, dome, castle
kitchen, stove, oven, refrigerator, microwave
ski, skiing, skiers, skiiers, snowmobile
bowl, cup, soup, cups, coffee
beach
snow
Retrieval with Adjectives
fluffy
delicious
Multimodal Linguistic Regularities: Nearest Images
(Kiros, Salakhutdinov, Zemel, TACL 2015)
Can you solve the Zero-Shot Problem?
Canada Warbler, Yellow Warbler, Sharp-tailed Sparrow
Model Architecture
• Wikipedia article -> TF-IDF -> MLP (g).
• Image -> CNN (f).
• Class score: dot product of the two feature vectors.
Example article: "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus..."
TF-IDF terms: family, north, genus, birds, south, america, ...
Dimensions shown: C x k, 1 x k, 1 x C.
Alternative View: Joint Semantic Feature Space
• Minimize the cross-entropy or hinge-loss objective.
(L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, arXiv 2015)
The Model
• Consider a binary one-vs-all classifier for class c.
• Assume that we are given an additional text feature for each class.
• Simple idea: instead of learning a static weight vector w, the text feature can be used to predict the weights, via a mapping that transforms the text features to the visual image feature space.
• We can use this idea to predict the output weights of a classifier (both the fully connected and convolutional layers of a CNN).
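A toy sketch of the weight-prediction idea, with a single linear map `G` standing in for the learned text-to-visual mapping g (shapes, names, and the TF-IDF stand-ins are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def zero_shot_scores(x, text_feats, G):
    """Score classes whose weights are *predicted* from text.

    x:          (D,) image feature vector (e.g. from a CNN)
    text_feats: (C, T) one TF-IDF vector per class article
    G:          (T, D) learned map from text space to image-feature space
    Class score: w_c . x  with predicted weights  w_c = t_c @ G.
    """
    W = text_feats @ G        # (C, D) predicted weight vectors
    return W @ x              # (C,) one-vs-all class scores

rng = np.random.default_rng(1)
D, T, C = 5, 7, 3
x = rng.normal(size=D)           # image features
t = rng.normal(size=(C, T))      # per-class article features
G = rng.normal(size=(T, D))
scores = zero_shot_scores(x, t, G)
```

Because the weights come from the article rather than from labeled images, a brand-new class can be scored by simply appending its article's row to `text_feats`.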
Problem Setup
• The training set contains N images and their associated class labels. There are C distinct class labels.
• At test time, we are given an additional number of previously unseen classes.
• Our goal is to predict previously unseen classes and perform well on the previously seen classes.
• Interpretation: learning a good similarity kernel between images and encyclopedia articles.
Caltech UCSD Bird and Oxford Flower Datasets
• The CUB200-2010 dataset contains 6,033 images from 200 bird species (about 30 images per class).
• There is one Wikipedia article associated with each bird class (200 articles in total). The average number of words per article is 400.
• Out of 200 classes, 40 classes are defined as unseen and the remaining 160 classes are defined as seen.
• The Oxford Flower-102 dataset contains 102 classes with a total of 8,189 images.
• Each class contains 40 to 260 images. 82 flower classes are used for training and 20 classes are used as unseen during testing.
Results: Area Under ROC (the CUB200-2010 dataset)

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al., 2013]   0.59     -       -
DA (VGG features)             0.62     -       -
Ours (fc)                     0.82     0.96    0.934
Ours (conv)                   0.73     0.96    0.91
Ours (fc+conv)                0.80     0.987   0.95

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV, 2013.
Results: Area Under ROC (the Oxford Flower dataset)

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al.]         0.62     -       -
GPR+DA [Elhoseiny et al.]     0.68     -       -
Ours (fc)                     0.70     0.987   0.90
Ours (conv)                   0.65     0.97    0.85
Ours (fc+conv)                0.71     0.989   0.93

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV, 2013.
Tanagers, Thraupidae, Scarlet, Cardinalidae, Tanagers
Attribute Discovery: word sensitivities of unseen classes
Nearest Neighbors
• The Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown.
rhizome, depth, freezing, compost, iris
Bearded Iris
How About Generating Sentences?
Input
A man skiing down the snow covered mountain with a dark sky in the background.
Output
We want to model:
Log-bilinear Neural Language Model
• Each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.
• Feedforward neural network with a single linear hidden layer.
• Let R denote the V × K matrix of word representation vectors, where V is the vocabulary size.
• (w_1, ..., w_{n-1}) is a tuple of n-1 words, where n-1 is the context size. The predicted representation of the next word becomes r̂ = Σ_i C_i r_{w_i}, where the C_i are K × K context parameter matrices.
1-of-V encoding of words
r_w1, r_w2, r_w3 -> r̂_w4
w1 w2 w3
Log-bilinear Neural Language Model
• The conditional probability of the next word is given by P(w_n = w | w_{1:n-1}) ∝ exp(r̂^T r_w + b_w), where r̂ is the predicted representation of r_{w_n}.
1-of-V encoding of words
r_w1, r_w2, r_w3 -> r̂_w4
w1 w2 w3
The softmax normalization over the vocabulary can be expensive to compute. (Bengio et al., 2003)
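A minimal numpy sketch of the model above, with random placeholder parameters (the function name and toy sizes are illustrative, not from the talk):

```python
import numpy as np

def lbl_next_word_probs(context_ids, R, C, b):
    """Log-bilinear language model next-word distribution.

    context_ids: indices of the previous n-1 words
    R: (V, K) word representation matrix
    C: (n-1, K, K) context parameter matrices
    b: (V,) per-word bias
    """
    # Predicted next-word representation: r_hat = sum_i C_i r_{w_i}
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    logits = R @ r_hat + b            # score exp(r_hat . r_w + b_w)
    e = np.exp(logits - logits.max())
    return e / e.sum()                # the expensive V-way softmax

rng = np.random.default_rng(0)
V, K = 10, 4
probs = lbl_next_word_probs([1, 2, 3],
                            rng.normal(size=(V, K)),
                            rng.normal(size=(3, K, K)),
                            np.zeros(V))
```

The final normalization is the cost the slide flags: it touches every word in the vocabulary at every prediction step.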
Multiplicative Model
• We represent words as a tensor, where G is the number of tensor slices.
• Re-represent the tensor in terms of 3 lower-rank matrices (where F is the number of pre-chosen factors).
• Given an attribute vector u ∈ R^G (e.g. image features), we can compute attribute-gated word representations.
• Let E denote a folded K × V matrix of word embeddings.
(Kiros, Zemel, Salakhutdinov, NIPS 2014)
Multiplicative Log-bilinear Model
• The predicted next-word representation is computed as before, but with the attribute-gated embeddings.
• Given the next word representation r, the factor outputs involve a component-wise product with the gating attributes.
1-of-V encoding of words
E_w1, E_w2, E_w3 (low rank) -> r̂_w4; gating attributes u
w1 w2 w3
(Kiros, Zemel, Salakhutdinov, NIPS 2014)
• The conditional probability of the next word is given by the same softmax as before, now using the attribute-gated representations.
Decoding: Neural Language Model
• Image features gate the hidden-to-output connections when predicting the next word.
• We can also condition on POS tags when generating a sentence.
(Kiros, Salakhutdinov, Zemel, ICML 2014)
Decoding: Structured NLM
Caption Generation
Filling in the Blanks
Caption Generation
Model Samples
• Two men in a room talking on a table.
• Two men are sitting next to each other.
• Two men are having a conversation at a table.
• Two men sitting at a desk next to each other.
TAGS: colleagues, waiters, waiter, entrepreneurs, busboy
Results
• R@K is Recall@K (high is good).
• Med r is the median rank (low is good).
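These retrieval metrics are easy to state in code. A small sketch, assuming we already have the 1-based rank of the ground-truth item in each query's result list:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall@K and median rank from 1-based ranks of the
    ground-truth item for each query."""
    ranks = np.asarray(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    med_r = float(np.median(ranks))
    return recall, med_r

# Five queries whose correct items landed at these ranks.
recall, med_r = retrieval_metrics([1, 3, 7, 12, 2])
# recall: {1: 0.2, 5: 0.6, 10: 0.8}; med_r: 3.0
```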
Caption Generation with Visual Attention
A woman is throwing a frisbee in a park.
(Xu et al., ICML 2015)
• The Montreal/Toronto team took 3rd place in the Microsoft COCO caption generation competition, finishing slightly behind Google and Microsoft, based on the human evaluation results.
Recurrent Attention Model
Coarse Image
Classification or Caption Generation
Recurrent Neural Network
Sample action:
• Hard attention: sample the action (latent gaze location).
• Soft attention: take the expectation instead of sampling.
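The soft-attention expectation can be sketched in a few lines. This is a generic sketch, not the exact ICML 2015 model (there the relevance scores come from an MLP over the LSTM state and the annotation vectors):

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention: expected feature under softmax weights.

    features: (L, D) annotation vectors, one per image location
    scores:   (L,) relevance score e_i for each location
    Returns the context vector z = sum_i alpha_i a_i with
    alpha = softmax(scores). Fully differentiable, so it trains
    with plain backprop; hard attention instead samples one
    location and needs REINFORCE-style gradients.
    """
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    return alpha @ features

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
z_uniform = soft_attention(feats, np.array([0.0, 0.0]))    # equal weights
z_peaked = soft_attention(feats, np.array([100.0, 0.0]))   # nearly one-hot
```

With very peaked scores the expectation approaches the hard-attention behaviour of picking a single location.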
Relationship to Helmholtz Machines
Coarse image
Stochastic Units
Deterministic Units
• Goal: maximize the probability of the correct class (or sequence of words) by marginalizing over the actions (latent gaze locations).
• A neural network with stochastic and deterministic units.
(Ba et al., 2015)
Relationship to Helmholtz Machines
• View this model as a conditional Helmholtz Machine with stochastic and deterministic units.
• Can use Wake-Sleep, variational autoencoders, and their variants to learn a good attention policy.
Helmholtz Machine: input data v, hidden layers h1, h2, h3 with weights W1, W2, W3.
(Ba et al., 2015)
Wake-Sleep Recurrent Attention Models (Ba et al., 2015)
Improving Action Recognition with Visual Attention
Cycling, horseback riding, basketball shooting, soccer juggling.
(Sharma et al., 2015)
Talk Roadmap
Part 1: Advanced Deep Learning Models
• Learning structured and robust models
• Transfer and one-shot learning
Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption generation with recurrent RNNs
• Skip-Thought vectors
Motivation
• Goal: unsupervised learning of high-quality sentence representations.
• Previous approaches: learning composition operators that map word vectors to sentence vectors, including recursive networks, recurrent networks, etc.
• Instead of using a word to predict its surrounding context, we encode a sentence to predict the sentences around it.
• We develop an objective that abstracts the skip-gram model to the sentence level.
Sequence to Sequence Learning
• RNN Encoder-Decoders for Machine Translation (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2013; Srivastava et al., 2015).
Input Sequence -> Encoder -> Learned Representation -> Decoder -> Output Sequence
Skip-Thought Model
• Given a tuple of contiguous sentences:
  - the middle sentence is encoded using an LSTM.
  - the encoding attempts to reconstruct the previous sentence and the next sentence.
• The input is the sentence triplet:
  - I got back home.
  - I could see the cat on the steps.
  - This was strange.
(Kiros et al., arXiv 2015)
Encoder: sentence -> generate forward sentence; generate previous sentence.
Learning Objective
• We are given a tuple of contiguous sentences (s_{i-1}, s_i, s_{i+1}).
• Objective: the sum of the log-probabilities for the next and previous sentences conditioned on the encoder representation:
  Σ_t log P(w_{i+1,t} | w_{i+1,<t}, h_i) + Σ_t log P(w_{i-1,t} | w_{i-1,<t}, h_i)
where h_i is the representation of the encoder.
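Once the two decoders have assigned a probability to each token, the objective is just a sum of logs. A toy sketch with made-up token probabilities (the real model produces them with LSTM decoders conditioned on h_i):

```python
import math

def skip_thought_objective(next_probs, prev_probs):
    """Sum of token log-probabilities from the forward decoder
    (next sentence) and the backward decoder (previous sentence),
    both conditioned on the encoder representation of the middle
    sentence."""
    return (sum(math.log(p) for p in next_probs) +
            sum(math.log(p) for p in prev_probs))

# Illustrative per-token probabilities, not from a trained model.
obj = skip_thought_objective(next_probs=[0.5, 0.25],
                             prev_probs=[0.5])
```

Training maximizes this quantity summed over all triplets in the corpus, so the encoder is pushed to retain whatever about a sentence predicts its neighbours.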
Book corpus (11K books)
• Query sentence along with its nearest neighbor from 500K sentences, using cosine similarity:
  - He ran his hand inside his coat, double-checking that the unopened letter was still there.
  - He slipped his hand between his coat and his shirt, where the folded copies lay in a brown envelope.
Semantic Relatedness
• SemEval 2014 Task 1: semantic relatedness on the SICK dataset. Given two sentences, produce a score of how semantically related the sentences are, based on human-generated scores (1 to 5).
• The dataset comes with a predefined split of 4,500 training pairs, 500 development pairs, and 4,927 testing pairs.
• Using skip-thought vectors for each sentence, train a linear regression to predict semantic relatedness.
  - For a pair of sentences, compute component-wise features between the pair (e.g. |u-v|).
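A sketch of the pair-feature construction. The slide names only |u-v|; including the element-wise product u * v alongside it is an assumption here, following common practice with skip-thought vectors:

```python
import numpy as np

def pair_features(u, v):
    """Component-wise features for a sentence pair: |u - v| and u * v.
    These become the input to a simple linear regressor that predicts
    the human relatedness score."""
    return np.concatenate([np.abs(u - v), u * v])

u = np.array([1.0, 2.0])   # skip-thought vector of sentence 1 (toy)
v = np.array([3.0, 1.0])   # skip-thought vector of sentence 2 (toy)
f = pair_features(u, v)    # -> [2. 1. 3. 2.]
```

Both features are symmetric in the two sentences, which matches the symmetry of the relatedness judgment itself.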
Semantic Relatedness
• Our models outperform all previous systems from the SemEval 2014 competition. This is remarkable, given the simplicity of our approach and the lack of feature engineering.
SemEval 2014 submissions
Results reported by Tai et al.
Ours
Semantic Relatedness
• Example predictions from the SICK test set. GT is the ground-truth relatedness, scored between 1 and 5.
• The last few results: slight changes in sentences result in large changes in relatedness that we are unable to score correctly.
Paraphrase Detection
• Microsoft Research Paraphrase Corpus: for two sentences, one must predict whether or not they are paraphrases.
• The training set contains 4,076 sentence pairs (2,753 are positive).
• The test set contains 1,725 pairs (1,147 are positive).
Recursive Auto-‐encoders
Best published results
Ours
Classification Benchmarks • 5 datasets: movie review sentiment (MR), customer product
reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA) and question-type classification (TREC).
Bag-‐of-‐words
Supervised
Ours
Summary • We believe our model for learning skip-thought vectors only
scratches the surface of possible objectives.
• Many variations have yet to be explored, including: - deep encoders and decoders - larger context windows - encoding and decoding paragraphs - other encoders
• It is likely that more exploration of this space will result in even higher-quality sentence representations.
• Code and data are available online: http://www.cs.toronto.edu/~mbweb/
Multi-Modal Models
Laser scans
Images
Video
Text & Language
Time series data
Speech & Audio
Develop learning systems that come closer to displaying human-like intelligence.
Summary • Efficient learning algorithms for hierarchical generative models.
Learning more adaptive, robust, and structured representations.
• Deep models can improve the current state of the art in many application domains: Ø Object recognition and detection, text and image retrieval, handwritten
character and speech recognition, and others.
Text & image retrieval / Object recognition
Learning a Category Hierarchy
Dealing with missing/occluded data
HMM decoder
Speech Recognition
sunset, pacific ocean, beach, seashore
Multimodal Data: Caption Generation
Thank you
Code is available at: http://deeplearning.cs.toronto.edu/
References
• J. Ba, V. Mnih, K. Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015.
• L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions, arXiv 2015.
• Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A Neural Probabilistic Language Model, JMLR 2003.
• Y. Bengio, Y. LeCun, Scaling Learning Algorithms towards AI, Large-Scale Kernel Machines, MIT Press, 2007.
• Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
• Y. Burda, R. Grosse, R. Salakhutdinov, Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing, AI and Statistics (AISTATS), 2015.
• K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014.
• A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013.
• G. Hinton, R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, 2006.
• G. Hinton, S. Osindero, Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 18, 2006.
• G. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002.
References
• H. Lee, A. Battle, R. Raina, A. Ng, Efficient Sparse Coding Algorithms, NIPS 2007.
• H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009.
• N. Kalchbrenner, P. Blunsom, Recurrent Continuous Translation Models, EMNLP 2013.
• R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal Neural Language Models, ICML 2014.
• R. Kiros, R. Salakhutdinov, R. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, TACL 2015.
• R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, Skip-Thought Vectors, arXiv 2015.
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 1998.
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013.
• R. Salakhutdinov, J. Tenenbaum, A. Torralba, Learning with Hierarchical-Deep Models, IEEE PAMI, vol. 35, no. 8, Aug. 2013.
• R. Salakhutdinov, Learning Deep Generative Models, Annual Review of Statistics and Its Application, Vol. 2, 2015.
• R. Salakhutdinov, G. Hinton, An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 2012, Vol. 24, No. 8.
• R. Salakhutdinov, H. Larochelle, Efficient Learning of Deep Boltzmann Machines, AI and Statistics, 2010.
References
• R. Salakhutdinov, G. Hinton, Replicated Softmax: an Undirected Topic Model, NIPS 2010.
• R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007.
• N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs, ICML 2015.
• N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR 2014.
• I. Sutskever, O. Vinyals, Q. Le, Sequence to Sequence Learning with Neural Networks, NIPS 2014.
• R. Socher, M. Ganjoo, C. Manning, A. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013.
• Y. Tang, R. Salakhutdinov, G. Hinton, Deep Lambertian Networks, ICML 2012.
• Y. Tang, R. Salakhutdinov, G. Hinton, Robust Boltzmann Machines for Recognition and Denoising, CVPR 2012.
• K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.