Deep Learning Tutorial, Part 2 (by Ruslan Salakhutdinov)


Page 1: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Deep Learning II

Ruslan Salakhutdinov

Department of Computer Science, University of Toronto → CMU

Canadian Institute for Advanced Research

Microsoft Machine Learning and Intelligence School

Page 2: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Talk Roadmap

Part 1: Advanced Deep Learning Models
• Learning Structured and Robust Models
• Transfer and One-Shot Learning

Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption Generation with Recurrent Neural Networks (RNNs)
• Skip-Thought Vectors for NLP

Page 3: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Face Recognition

Due to extreme illumination variations, deep models perform quite poorly on this dataset.

Yale B Extended Face Dataset: 4 subsets of increasing illumination variations

Page 4: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Consider More Structured Models: undirected + directed models.

Deep Lambertian Model

Deep Undirected

Directed

Combines the elegant properties of the Lambertian model with the Gaussian DBM model.

Observed Image

Inferred

(Tang et al., ICML 2012; Tang et al., CVPR 2012)

Page 5: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Lambertian Reflectance Model
• A simple model of the image formation process.

• Albedo -- diffuse reflectivity of a surface; material dependent, illumination independent.

Image albedo

Surface normal

Light source

• Images with different illumination can be generated by varying light directions.

• Surface normal -- perpendicular to the tangent plane at a point on the surface.
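To make the albedo / surface-normal / light-source decomposition above concrete, here is a minimal numpy sketch of Lambertian image formation. The function and variable names are mine, not from the slides:

```python
import numpy as np

def lambertian_render(albedo, normals, light):
    """Render an image under the Lambertian reflectance model.

    albedo:  (H, W) diffuse reflectivity per pixel
    normals: (H, W, 3) unit surface normals per pixel
    light:   (3,) direction (and strength) of a distant light source
    """
    shading = np.maximum(normals @ light, 0.0)   # n_i . l, clamped at zero
    return albedo * shading                      # I_i = a_i * max(n_i . l, 0)

# Varying the light direction generates differently illuminated images
# of the same face, while albedo and normals stay fixed.
H, W = 32, 32
albedo = np.random.rand(H, W)
normals = np.dstack([np.zeros((H, W)), np.zeros((H, W)), np.ones((H, W))])  # flat surface
image = lambertian_render(albedo, normals, np.array([0.3, 0.2, 0.9]))
```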

Page 6: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Deep Lambertian Model

Observed  Image  

Image  albedo  

Surface    normals  

Light    source  

Page 7: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Deep Lambertian Model

Observed  Image  

Image  albedo  

Surface    normals  

Light    source  

Gaussian  Deep    Boltzmann  Machine   Albedo  DBM:  

Pretrained  using  Toronto  Face  Database  

Transfer  Learning  

Inference: Variational Inference. Learning: Stochastic Approximation

Inferred  

Page 8: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Yale  B  Extended  Face  Dataset  

• 38 subjects, ~45 images of varying illumination per subject, divided into 4 subsets of increasing illumination variations.

• 28 subjects for training, and 10 for testing.

Page 9: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Face Relighting

One Test Image

Face Relighting: Observed / Inferred albedo

Page 10: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Recognition Results: recognition as a function of the number of training images for 10 test subjects.

One-Shot Recognition

Page 11: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Recognition Results: recognition as a function of the number of training images for 10 test subjects.

One-Shot Recognition

What about dealing with occlusions or structured noise?

Page 12: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Robust Boltzmann Machines
• Build more structured models that can deal with occlusions or structured noise.

Observed Image

Inferred Truth

Inferred Binary Mask

(Tang et al., ICML 2012; Tang et al., CVPR 2012)

Page 13: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Robust Boltzmann Machines
• Build more structured models that can deal with occlusions or structured noise.

Gaussian RBM, modeling clean faces

Binary RBM, modeling occlusions

Binary pixel-wise mask

Gaussian noise model

Observed Image

Page 14: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Robust Boltzmann Machines
• Build more structured models that can deal with occlusions or structured noise.

Gaussian RBM, modeling clean faces

Binary RBM, modeling occlusions

Binary pixel-wise mask

Gaussian noise model

… is a heavy-tailed distribution

Inference: Variational Inference. Learning: Stochastic Approximation
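As a rough illustration of how these pieces fit together, here is a minimal generative sketch in which the binary mask switches each pixel between the clean-face model and the noise model. The variable names and distributions are illustrative placeholders; the actual RoBM of Tang et al. couples these components through a joint energy function:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32 * 32  # number of pixels

# Placeholder samples standing in for the model's components:
clean = rng.normal(loc=0.5, scale=0.1, size=D)     # pixel means from the Gaussian RBM (clean face)
mask = rng.random(D) < 0.8                          # binary pixel-wise mask from the binary RBM
noise = rng.standard_t(df=2.0, size=D) * 0.5 + 0.5  # heavy-tailed structured noise (e.g. occluder)

# Observed image: clean face where the mask is on, noise elsewhere.
observed = np.where(mask, clean, noise)
```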

Page 15: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

# of iterations: 1, 3, 5, 7, 10, 20, 30, 40, 50

Inference on the test subjects

# of iterations

Recognition Results on the AR Face Database

Internal states of the RoBM during learning.

Inferred

Page 16: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

# of iterations: 1, 3, 5, 7, 10, 20, 30, 40, 50

Inference on the test subjects

# of iterations

Recognition Results on the AR Face Database

Internal states of the RoBM during learning.

Inferred

Page 17: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

# of iterations: 1, 3, 5, 7, 10, 20, 30, 40, 50

Inference on the test subjects

# of iterations

Recognition Results on the AR Face Database

Internal states of the RoBM during learning.

Inferred

Learning Algorithm   Sunglasses   Scarf
Robust BM            84.5%        80.7%
RBM                  61.7%        32.9%
Eigenfaces           66.9%        38.6%
LDA                  56.1%        27.0%
Pixel                51.3%        17.5%

Page 18: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Talk Roadmap

Part 1: Advanced Deep Learning Models
• Learning Structured and Robust Models
• Transfer and One-Shot Learning

Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption Generation with Recurrent Neural Networks (RNNs)
• Skip-Thought Vectors

Page 19: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Transfer Learning: “zarc”, “segway”

How can we learn a novel concept – a high dimensional statistical object – from few examples?

(Lake et al., 2010)

Page 20: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Supervised  Learning  

Segway   Motorcycle  

Test:    

Page 21: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Learning  to  Learn  

Test:    

Millions  of  unlabeled  images    

Some  labeled  images  

Bicycle  

Elephant  

Dolphin  

Tractor  

Background  Knowledge  

Learn  novel  concept  from  one  example  

Learn  to  Transfer  Knowledge  

Page 22: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)
Page 23: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

One shot learning of simple visual concepts
Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Abstract: People can learn visual concepts from just one example, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, and we evaluate this idea in the domain of handwritten characters, using a massive new dataset. These simple visual concepts have a rich internal part structure, yet they are particularly tractable for computational models. We introduce a generative model of how characters are composed from strokes, where knowledge from previous characters helps to infer the latent strokes in novel characters. The stroke model outperforms a competing state-of-the-art character model on a challenging one shot learning task, and it provides a good fit to human perceptual data.

Keywords: category learning; transfer learning; Bayesian modeling; neural networks

Figure 1: Test yourself on one shot learning. From the example boxed in red, can you find the others in the array? On the left is a Segway and on the right is the first character of the Bengali alphabet. (Answer for the Bengali character: Row 2, Column 3; Row 4, Column 2.)

Figure 2: Examples from a new 1600 character database.

Hierarchical-Deep Model

HD Models: Integrate hierarchical Bayesian models with deep models.

Hierarchical Bayes:
• Learn hierarchies of categories for sharing abstract knowledge.

Super-category

One-Shot Learning

Deep Models:
• Learn hierarchies of features.
• Unsupervised feature learning – no need to rely on human-crafted input features.

Shared higher-level features

Shared low-level features

(Salakhutdinov, Tenenbaum, Torralba, NIPS 2011, PAMI 2013)

Page 24: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Lower-level generic features:

DBM Model

Images

horse, cow, car, van, truck

Higher-level class-sensitive features:

Hierarchical Dirichlet Process Model

• capture the distinctive perceptual structure of a specific concept

• edges, combinations of edges

Hierarchical-Deep Model

Page 25: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Lower-level generic features:

DBM Model

Images

Higher-level class-sensitive features:

Hierarchical Organization of Categories:

• modular data-parameter relations

• capture the distinctive perceptual structure of a specific concept

• edges, combinations of edges

• express priors on the features that are typical of different kinds of concepts

horse, cow, car, van, truck

“animal”  “vehicle”

Hierarchical Dirichlet Process Model

Hierarchical-Deep Model

Page 26: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

horse, cow, car, van, truck

Hierarchical-Deep Model

Page 27: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

horse, cow, car, van, truck

“animal”  “vehicle”

(Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures

Tree hierarchy of classes is learned

Hierarchical-Deep Model

Page 28: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

horse, cow, car, van, truck

“animal”  “vehicle”

Tree hierarchy of classes is learned

(Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures

(Hierarchical Dirichlet Process) prior: a nonparametric prior allowing categories to share higher-level features, or parts.

Hierarchical-Deep Model

Page 29: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

“animal”  “vehicle”

Tree hierarchy of classes is learned

(Nested Chinese Restaurant Process) prior: a nonparametric prior over tree structures

Deep Boltzmann Machine

Enforce (approximate) global consistency through many local constraints.

horse, cow, car, van, truck

(Hierarchical Dirichlet Process) prior: a nonparametric prior allowing categories to share higher-level features, or parts.

Hierarchical-Deep Model

Page 30: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

horse, cow, car, van, truck

CIFAR Object Recognition

“animal”  “vehicle”

50,000 images of 100 classes

4 million unlabeled images

32 × 32 pixels × 3 RGB

Lower-level generic features

Higher-level class-sensitive features

Tree hierarchy of classes is learned

Inference: Markov chain Monte Carlo

Page 31: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

horse, cow, car, van, truck

“animal”  “vehicle”

Lower-level generic features

Higher-level features

Tree hierarchy of classes is learned

4 million images

Each image is made up of learned high-level features.

Each higher-level feature is made up of lower-level features.

CIFAR Object Recognition (Salakhutdinov, Tenenbaum, Torralba, NIPS 2011, PAMI 2013)

Page 32: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Learning the Hierarchy

woman

The model learns how to share knowledge across many visual categories.

“human”  “fruit”  “aquatic animal”

Learned super-class hierarchy

Basic-level classes: shark, ray, turtle, dolphin, baby, man, girl, sunflower, orange, apple

Learned low-level generic features

“global”

Learned higher-level class-sensitive features

Page 33: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Sharing Features: Shape and Color

Sunflower

orange, apple, sunflower

Learning to Learn

“fruit”

Sunflower ROC curve (detection rate vs. false alarm rate) for 5, 3, and 1 training examples

Pixel-space distance

Learning a hierarchy for sharing parameters – rapid learning of a novel concept.

Dolphin

Apple

Real / Reconstructions

Orange

Page 34: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Learning  from  3  Examples  

Given  only  3  Examples  

Generated  Samples  

Willow  Tree   Rocket  

Page 35: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Learned higher-level features

Learned lower-level features

 char  1    char  2    char  3      char  4    char  5  

“alphabet  1”  

“alphabet  2”  

Edges  

Strokes  

Handwritten Character Recognition

25,000    characters  

Page 36: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters

Global  

Super  class  1  

Super  class  2  

Real  data  within  super  class  

Class  1   Class  2   New  class  

Simulated  new  characters  

Page 37: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

Page 38: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

Page 39: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

Page 40: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

Page 41: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

Page 42: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Simulating New Characters: Real data within super class

Class  1   Class  2  

Super  class  1  

Super  class  2  

Global  

New  class  

Simulated  new  characters  

The same model can be applied to speech, text, video, or any other high-dimensional data.

Page 43: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Discriminative Transfer Learning with Deep Nets

Page 44: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Transfer  Learning  with  Shared  Priors  

Srivastava and Salakhutdinov, NIPS 2013  

Page 45: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)
Page 46: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)
Page 47: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Talk Roadmap

Part 1: Advanced Deep Learning Models
• Learning Structured and Robust Models
• Transfer and One-Shot Learning

Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption Generation with Recurrent Neural Networks (RNNs)
• Skip-Thought Vectors

Page 48: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Data – A Collection of Modalities
• Multimedia content on the web: image + text + audio.

• Product recommendation systems.

• Robotics applications.

Audio  Vision  

Touch  sensors  Motor  control  

sunset,  pacificocean,  bakerbeach,  seashore,  ocean  

car,  automobile  

Page 49: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Shared Concept

“Modality-free” representation

“Modality-full” representation

“Concept”  

sunset,  pacific  ocean,  baker  beach,  seashore,  

ocean  

Page 50: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

• Improve Classification

Multi-Modal Input

pentax,  k10d,  kangarooisland  southaustralia,  sa  australia  australiansealion  300mm  

SEA  /  NOT  SEA  

•  Retrieve  data  from  one  modality  when  queried  using  data  from  another  modality  

beach,  sea,  surf,  strand,  shore,  wave,  seascape,  sand,  ocean,  waves  

• Fill in Missing Modalities: beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves

Page 51: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Challenges - I

Very different input representations

Image / Text

sunset, pacific ocean, baker beach, seashore, ocean

• Images – real-valued, dense

Difficult to learn cross-modal features from low-level representations.

Dense

• Text – discrete, sparse

Sparse

Page 52: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Challenges  -­‐  II    

Noisy  and  missing  data  

Image   Text  pentax,  k10d,  pentaxda50200,  kangarooisland,  sa,  australiansealion  

mickikrimmel,  mickipedia,  headshot  

unseulpixel,  naturey,  crap  

<  no  text>  

Page 53: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Challenges  -­‐  II    Image   Text   Text  generated  by  the  model  

beach,  sea,  surf,  strand,  shore,  wave,  seascape,  sand,  ocean,  waves  

portrait, girl, woman, lady, blonde, pretty, gorgeous, expression, model

night,  none,  traffic,  light,  lights,  parking,  darkness,  lowlight,  nacht,  glow  

fall,  autumn,  trees,  leaves,  foliage,  forest,  woods,  branches,  path  

pentax,  k10d,  pentaxda50200,  kangarooisland,  sa,  australiansealion  

mickikrimmel,  mickipedia,  headshot  

unseulpixel,  naturey,  crap  

<  no  text>  

Page 54: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: inputs have very different statistical properties.

• Difficult to learn cross-modal features.

0 0 1 0 0

Real-valued

1-of-K

Page 55: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

0 0 1 0 0

Dense, real-valued image features

Gaussian model / Replicated Softmax

Multimodal DBM

Word  counts  

(Srivastava & Salakhutdinov, NIPS 2012)  

Page 56: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multimodal DBM

0 0 1 0 0

Dense, real-valued image features

Gaussian model / Replicated Softmax

Word  counts  

(Srivastava & Salakhutdinov, NIPS 2012)  

Page 57: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Gaussian model / Replicated Softmax

0 0 1 0 0

Multimodal DBM

Word  counts  

Dense,  real-­‐valued  image  features  

(Srivastava & Salakhutdinov, NIPS 2012)  

Page 58: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

0 0 1 0 0

Dense, real-valued image features

Word counts

Gaussian model / Replicated Softmax

Multimodal DBM

Bottom-up + Top-down

Page 59: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

0 0 1 0 0

Dense, real-valued image features

Word counts

Gaussian RBM / Replicated Softmax

Multimodal DBM

Bottom-up + Top-down
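As a rough numpy sketch of the bottom-up pass suggested by this architecture, the joint layer pools inputs from an image pathway (dense, real-valued features) and a text pathway (word counts). All dimensions, weight names, and random inputs below are illustrative placeholders; the real multimodal DBM of Srivastava & Salakhutdinov runs mean-field inference with both bottom-up and top-down messages:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative dimensions (roughly matching the architecture on the slides).
D_img, D_txt = 3857, 2000           # image feature dim, vocabulary size
H1, H2, H_joint = 1024, 1024, 2048  # pathway layers and joint layer

# Random weights standing in for learned parameters.
W = {name: rng.normal(scale=0.01, size=shape) for name, shape in {
    "img1": (D_img, H1), "img2": (H1, H2),
    "txt1": (D_txt, H1), "txt2": (H1, H2),
    "joint_img": (H2, H_joint), "joint_txt": (H2, H_joint)}.items()}

image = rng.normal(size=D_img)               # dense, real-valued image features
word_counts = rng.poisson(0.01, size=D_txt)  # sparse word-count vector

# Bottom-up pass through each modality-specific pathway.
h_img = sigmoid(sigmoid(image @ W["img1"]) @ W["img2"])
h_txt = sigmoid(sigmoid(word_counts @ W["txt1"]) @ W["txt2"])

# Joint ("modality-free") layer combines both pathways.
h_joint = sigmoid(h_img @ W["joint_img"] + h_txt @ W["joint_txt"])
```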

Page 60: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Text  Generated  from  Images  

canada,  nature,  sunrise,  ontario,  fog,  mist,  bc,  morning  

insect, butterfly, insects, bug, butterflies, lepidoptera

graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco

portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy

dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry

sea, france, boat, mer, beach, river, bretagne, plage, brittany

Given Generated       Given Generated      

Page 61: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Text  Generated  from  Images  Given Generated      

water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop

portrait, women, army, soldier, mother, postcard, soldiers

obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally

Page 62: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Images  from  Text  

water,  red,  sunset  

nature,  flower,  red,  green  

blue,  green,  yellow,  colors  

chocolate,  cake  

Given Retrieved  

Page 63: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

MIR-Flickr Dataset

Huiskes et al.

• 1 million images along with user-assigned tags.

sculpture,  beauty,  stone  

nikon,  green,  light,  photoshop,  apple,  d70  

white,  yellow,  abstract,  lines,  bus,  graphic  

sky, geotagged, reflection, cielo, bilbao, reflejo

food,  cupcake,  vegan  

d80  

anawesomeshot,  theperfectphotographer,  flash,  damniwishidtakenthat,  spiritofphotography  

nikon,  abigfave,  goldstaraward,  d80,  nikond80  

Page 64: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Data and Architecture: 12 million parameters

Layer sizes (from the diagram): an image pathway 3857 – 1024 – 1024 and a text pathway 2000 – 1024 – 1024, joined by a 2048-unit layer.

• 200 most frequent tags.

• 25K labeled subset (15K training, 10K testing).

• 38 classes: sky, tree, baby, car, cloud, …

• An additional 1 million unlabeled images.

Page 65: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Results
• Logistic regression on top-level representation.
• Multimodal Inputs

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791

MAP: Mean Average Precision

Labeled 25K examples

Page 66: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Results
• Logistic regression on top-level representation.
• Multimodal Inputs

Learning Algorithm     MAP     Precision@50
Random                 0.124   0.124
LDA [Huiskes et al.]   0.492   0.754
SVM [Huiskes et al.]   0.475   0.758
DBM-Labelled           0.526   0.791
Deep Belief Net        0.638   0.867
Autoencoder            0.638   0.875
DBM                    0.641   0.873

MAP: Mean Average Precision

Labeled 25K examples + 1 million unlabelled

Page 67: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Generating Sentences

Input

A man skiing down the snow covered mountain with a dark sky in the background.

Output

• More challenging problem.
• How can we generate complete descriptions of images?

Page 68: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Talk Roadmap

Part 1: Advanced Deep Learning Models
• Learning Structured and Robust Models
• Transfer and One-Shot Learning

Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption Generation with Recurrent Neural Networks (RNNs)
• Skip-Thought Vectors

Page 69: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Encode-Decode Framework

• Decoder: A neural language model that combines structure and content vectors for generating a sequence of words.

• Encoder: CNN and Recurrent Neural Net for a joint image-sentence embedding.

Page 70: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Representation of Words
• Key Idea: Each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.

Semantic space (2-D illustration): table, chair, dolphin, whale, November

(Bengio et al., 2003; Mnih et al., 2008; Mikolov et al., 2009; Kiros et al., 2014)

Page 71: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

An Image-Text Encoder: Joint Feature Space

Encoder: ConvNet

ship / water

(Socher et al., 2013; Frome et al., 2013; Kiros et al., 2014)

• Learn a joint embedding space of images and text:
- Can condition on anything (images, words, phrases, etc.)
- Natural definition of a scoring function (inner products in the joint space).

Page 72: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

An Image-Text Encoder

Recurrent Neural Network (LSTM)

1-of-V encoding of words

w1  w2  w3

Convolutional Neural Network

Sentence Representation / Image Representation

See Skip-Thought Vectors (Kiros et al., arXiv 2015)

Page 73: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

An Image-Text Encoder: Joint Feature Space

A castle and reflecting water

A ship sailing in the ocean

Images:

Text:

A plane flying in the sky

Minimize the following objective:
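The objective shown on the slide is an image and is not legible in this transcript; the pairwise ranking loss used in Kiros et al.'s visual-semantic embedding work has the following general form, where s(x, v) is the score (inner product in the joint space) between a sentence embedding x and an image embedding v, v_k and x_k are mismatched (contrastive) examples, and α is a margin. The notation here is reconstructed, not copied from the slide:

```latex
\min_{\theta} \;
\sum_{x} \sum_{k} \max\{0,\; \alpha - s(x, v) + s(x, v_k)\}
\; + \;
\sum_{v} \sum_{k} \max\{0,\; \alpha - s(v, x) + s(v, x_k)\}
```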

Page 74: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Retrieving  Sentences  for  Images  

Page 75: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Tagging  and  Retrieval  mosque,  tower,  building,  cathedral,  dome,  castle  

kitchen,  stove,  oven,  refrigerator,  microwave  

ski,  skiing,  skiers,  skiiers,  snowmobile  

bowl,  cup,  soup,  cups,  coffee  

beach  

snow  

Page 76: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Retrieval with Adjectives

fluffy  

delicious  

Page 77: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multimodal Linguistic Regularities: Nearest Images

(Kiros, Salakhutdinov, Zemel, TACL 2015)  

Page 78: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multimodal Linguistic Regularities: Nearest Images

(Kiros, Salakhutdinov, Zemel, TACL 2015)  

Page 79: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Can You Solve the Zero-Shot Problem?

Page 80: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Can You Solve the Zero-Shot Problem?

Canada Warbler, Yellow Warbler, Sharp-tailed Sparrow

Page 81: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Model  Architecture    

Diagram: the image is passed through a CNN (giving features f), while the class's Wikipedia article is passed through TF-IDF and an MLP (giving features g); the dot product of the two gives the class score. Example article snippet: “The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus…” TF-IDF terms: family, north, genus, birds, south, america. Dimensions: C×k, 1×k, 1×C.

Page 82: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Alternative View: Joint Semantic Feature Space

• Minimize the cross-entropy or hinge-loss objective.

(L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, arXiv 2015)

Page 83: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

The Model
• Consider a binary one-vs-all classifier for class c:

• Assume that we are given an additional text feature ….

• We can use this idea to predict the output weights of a classifier (both the fully connected and convolutional layers of a CNN).

• Simple idea: Instead of learning a static weight vector w, the text feature can be used to predict the weights:

where … is a mapping that transforms the text features to the visual image feature space.
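The formulas on this slide are images and are not recoverable from the transcript; the idea can be sketched as follows, with t_c denoting the text feature of class c, x an image feature, σ the logistic function, and g_θ the learned mapping (this notation is mine, not the slide's):

```latex
\text{standard one-vs-all classifier:}\quad
p(y_c = 1 \mid x) = \sigma\!\left(w_c^{\top} x\right),
\qquad
\text{zero-shot version:}\quad
w_c = g_{\theta}(t_c),\;\;
p(y_c = 1 \mid x) = \sigma\!\left(g_{\theta}(t_c)^{\top} x\right)
```

In this view, a previously unseen class only needs its encyclopedia article t_c to instantiate a classifier at test time.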

Page 84: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Problem Setup
• The training set contains N images … and their associated class labels …. There are C distinct class labels.

• At test time, we are given an additional … number of previously unseen classes, such that ….

• Our goal is to predict previously unseen classes and perform well on the previously seen classes.

• Interpretation: learning a good similarity kernel between images and encyclopedia articles.

Page 85: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caltech-UCSD Birds and Oxford Flowers Datasets

• CUB200-2010 contains 6,033 images from 200 bird species (about 30 images per class).

• There is one Wikipedia article associated with each bird class (200 articles in total). The average number of words per article is 400.

• Out of 200 classes, 40 classes are defined as unseen and the remaining 160 classes are defined as seen.

• The Oxford Flower-102 dataset contains 102 classes with a total of 8189 images.

• Each class contains 40 to 260 images. 82 flower classes are used for training and 20 classes are used as unseen during testing.

Page 86: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Results: Area Under ROC

The CUB200-2010 Dataset

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al., 2013]   0.59     -       -
DA (VGG features)             0.62     -       -
Ours (fc)                     0.82     0.96    0.934
Ours (conv)                   0.73     0.96    0.91
Ours (fc+conv)                0.80     0.987   0.95

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV 2013.

Page 87: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Results: Area Under ROC

The Oxford Flower Dataset

Learning Algorithm            Unseen   Seen    Mean
DA [Elhoseiny et al.]         0.62     -       -
GPR+DA [Elhoseiny et al.]     0.68     -       -
Ours (fc)                     0.70     0.987   0.90
Ours (conv)                   0.65     0.97    0.85
Ours (fc+conv)                0.71     0.989   0.93

M. Elhoseiny, B. Saleh, and A. Elgammal, ICCV 2013.

Page 88: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

- Tanagers - Thraupidae - Scarlet - Cardinalidae - Tanagers

Attribute Discovery: Word sensitivities of unseen classes

Nearest Neighbors

• The Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown.

Page 89: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

- rhizome - depth - freezing - compost - iris

Attribute Discovery: Word sensitivities of unseen classes

Nearest Neighbors

• The Wikipedia article for each class is projected onto its feature space, and the nearest image neighbors from the test set are shown.

Bearded Iris

Page 90: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

How About Generating Sentences!

Input

A man skiing down the snow covered mountain with a dark sky in the background.

Output

We want to model:

Page 91: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Log-bilinear Neural Language Model

• Each word w is represented as a K-dimensional real-valued vector r_w ∈ R^K.

• Feedforward neural network with a single linear hidden layer.

• Let R denote the V × K matrix of word representation vectors, where V is the vocabulary size.

• (w_1, …, w_{n-1}) is a tuple of n-1 words, where n-1 is the context size. The next word representation becomes:

K × K context parameter matrices

1-of-V encoding of words

r_w1  r_w2  r_w3

r_w4

w1  w2  w3

Page 92: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Log-bilinear Neural Language Model

• The conditional probability of the next word is given by:

Predicted representation of r_wn.

1-of-V encoding of words

r_w1  r_w2  r_w3

r_w4

w1  w2  w3

Can be expensive to compute. (Bengio et al., 2003)
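The equations on these two slides are not legible in the transcript; in the standard log-bilinear model the predicted next-word representation and the next-word distribution are (reconstructed notation, consistent with the text above):

```latex
\hat{r} = \sum_{i=1}^{n-1} C^{(i)} r_{w_i},
\qquad
P(w_n = w \mid w_{1:n-1})
  = \frac{\exp\!\left(\hat{r}^{\top} r_w + b_w\right)}
         {\sum_{w'=1}^{V} \exp\!\left(\hat{r}^{\top} r_{w'} + b_{w'}\right)}
```

The normalization over all V words in the vocabulary is what makes the distribution expensive to compute.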

Page 93: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multiplicative Model
• We represent words as a tensor:

where G is the number of tensor slices.

• Re-represent the tensor in terms of 3 lower-rank matrices (where F is the number of pre-chosen factors):

• Given an attribute vector u ∈ R^G (e.g. image features), we can compute attribute-gated word representations as:

(Kiros, Zemel, Salakhutdinov, NIPS 2014)  
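The tensor equations are again missing from the transcript; a common way to write this kind of three-way factorization (the specific factor symbols W_a, W_b, W_c below are mine) is to take the word tensor T with G slices of size V × K and define the attribute-gated word-representation matrix for attributes u as:

```latex
T = \{T^{(1)}, \dots, T^{(G)}\},\; T^{(g)} \in \mathbb{R}^{V \times K},
\qquad
E^{u} \;=\; \sum_{g=1}^{G} u_g\, T^{(g)}
\;\approx\; W_a \,\mathrm{diag}(W_b\, u)\, W_c
```

with W_a ∈ R^{V×F}, W_b ∈ R^{F×G}, W_c ∈ R^{F×K}. Each attribute configuration u then induces its own word-embedding matrix E^u without ever storing the full V × K × G tensor.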

Page 94: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multiplicative Log-bilinear Model

• Let … denote a folded K × V matrix of word embeddings.

• Then the predicted next-word representation is:

• Given the next-word representation r, the factor outputs are:

Component-wise product

1-of-V encoding of words

E_w1  E_w2  E_w3

Low rank

r_w4

f  u

Gating attributes

Low rank

w1  w2  w3

(Kiros, Zemel, Salakhutdinov, NIPS 2014)  

Page 95: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multiplicative Log-bilinear Model

• The conditional probability of the next word is given by:

1-of-V encoding of words

E_w1  E_w2  E_w3

Low rank

r_w4

f  u

Gating attributes

Low rank

w1  w2  w3

Page 96: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding: Neural Language Model

• Image features gate the hidden-to-output connections when predicting the next word!

• We can also condition on POS tags when generating a sentence.

(Kiros, Salakhutdinov, Zemel, ICML 2014)  

Page 97: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding:  Structured  NLM  

Page 98: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding:  Structured  NLM  

Page 99: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding:  Structured  NLM  

Page 100: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding:  Structured  NLM  

Page 101: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoding:  Structured  NLM  

Page 102: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation

Page 103: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation

Page 104: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation

Page 105: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Filling  in  the  Blanks  

Page 106: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation

Model Samples

• Two men in a room talking on a table.
• Two men are sitting next to each other.
• Two men are having a conversation at a table.
• Two men sitting at a desk next to each other.

colleagues    waiters    waiter    entrepreneurs    busboy  

TAGS:  

Page 107: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Results  

•   R@K  is  Recall@K  (high  is  good).    •   Med  r  is  the  median  rank  (low  is  good).  

Page 108: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation with Visual Attention

A woman is throwing a frisbee in a park.

(Xu et al., ICML 2015)

Page 109: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation with Visual Attention

A woman is throwing a frisbee in a park.

(Xu et al., ICML 2015)

Page 110: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation with Visual Attention


Page 111: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Caption Generation with Visual Attention

(Xu et al., ICML 2015)

• The Montreal/Toronto team took 3rd place in the Microsoft COCO caption generation competition, finishing slightly behind Google and Microsoft. This is based on the human evaluation results.

Page 112: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Recurrent Attention Model

Coarse Image

Classification or Caption Generation

Recurrent Neural Network

Sample action:

• Hard Attention: Sample the action (latent gaze location)

• Soft Attention: Take an expectation instead of sampling
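As a compact way to state the distinction, with location/action ℓ, per-location features a_ℓ, and attention weights α_ℓ = p(ℓ | h) produced by the recurrent network (notation reconstructed, not from the slide):

```latex
\text{hard attention:}\quad \ell \sim p(\ell \mid h), \;\; z = a_{\ell}
\qquad\qquad
\text{soft attention:}\quad z = \mathbb{E}_{p(\ell \mid h)}[a_{\ell}] = \sum_{\ell} \alpha_{\ell}\, a_{\ell}
```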

Page 113: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Relationship to Helmholtz Machines

Coarse image

Stochastic Units

Deterministic Units

• Goal: Maximize the probability of the correct class (or sequence of words) by marginalizing over the actions (or latent gaze locations):

Stochastic Units

• A neural network with stochastic and deterministic units.

(Ba et al., 2015)
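The marginalization referred to above can be written as follows, with y the target class or word sequence, I the image, a the sequence of gaze actions, and q(a) any variational distribution; the lower bound follows from Jensen's inequality (my notation, not the slide's):

```latex
\log p(y \mid I)
  \;=\; \log \sum_{a} p(a \mid I)\, p(y \mid a, I)
  \;\ge\; \sum_{a} q(a)\,\log \frac{p(a \mid I)\, p(y \mid a, I)}{q(a)}
```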

Page 114: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Relationship to Helmholtz Machines

Coarse image

Deterministic Units

Stochastic Units

• View this model as a conditional Helmholtz Machine with stochastic and deterministic units.

• Can use Wake-Sleep, variational autoencoders, and their variants to learn a good attention policy!

Input data

Layers v, h1, h2, h3 with weights W1, W2, W3

Helmholtz Machine

(Ba et al., 2015)

Page 115: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Relationship to Helmholtz Machines

Coarse image

Deterministic Units

Stochastic Units

• View this model as a conditional Helmholtz Machine with stochastic and deterministic units.

• Can use Wake-Sleep, variational autoencoders, and their variants to learn a good attention policy!

Input data

Layers v, h1, h2, h3 with weights W1, W2, W3

Helmholtz Machine

Wake-Sleep Recurrent Attention Models

(Ba et al., 2015)

Page 116: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Improving Action Recognition with Visual Attention

Cycling, Horseback riding, Basketball shooting, Soccer juggling

(Sharma et al., 2015)

Page 117: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Talk Roadmap

Part 1: Advanced Deep Learning Models
• Learning Structured and Robust Models
• Transfer and One-Shot Learning

Part 2: Multimodal Deep Learning
• Multimodal DBMs for images and text
• Caption Generation with Recurrent Neural Networks (RNNs)
• Skip-Thought Vectors

Page 118: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Motivation
• Goal: Unsupervised learning of high-quality sentence representations.

• Previous approaches: learning composition operators that map word vectors to sentence vectors, including recursive networks, recurrent networks, etc.

• Instead of using a word to predict its surrounding context, we encode a sentence to predict the sentences around it.

• We develop an objective that abstracts the skip-gram model to the sentence level.

Page 119: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Decoder

Sequence to Sequence Learning

• RNN Encoder-Decoders for Machine Translation (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2013; Srivastava et al., 2015)

Input Sequence

Encoder

Learned Representation

Output  Sequence  

Page 120: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Skip-Thought Model

• Given a tuple of contiguous sentences:
- the middle sentence is encoded using an LSTM.
- the encoding attempts to reconstruct the previous sentence and the next sentence.

• The input is a sentence triplet:
- I got back home.
- I could see the cat on the steps.
- This was strange.

(Kiros et al., arXiv 2015)

Page 121: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Encoder  

Sentence / Generate Forward Sentence

Generate Previous Sentence

Skip-Thought Model

Page 122: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Learning Objective
• We are given a tuple of contiguous sentences.

• Objective: the sum of the log-probabilities for the next and previous sentences, conditioned on the encoder representation:

representation of encoder

Forward sentence / Previous sentence
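The objective on the slide is an image; written out using the notation of the skip-thoughts paper, where h_i is the encoder representation of sentence s_i and w^t_{i+1}, w^t_{i-1} are the words of the next and previous sentences, it is:

```latex
\sum_{t} \log P\!\left(w_{i+1}^{t} \,\middle|\, w_{i+1}^{<t},\, h_{i}\right)
\;+\;
\sum_{t} \log P\!\left(w_{i-1}^{t} \,\middle|\, w_{i-1}^{<t},\, h_{i}\right)
```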

Page 123: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Book 11K Corpus

• A query sentence along with its nearest neighbor from 500K sentences, using cosine similarity:

- He ran his hand inside his coat, double-checking that the unopened letter was still there.

- He slipped his hand between his coat and his shirt, where the folded copies lay in a brown envelope.

Page 124: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Book 11K Corpus

• A query sentence along with its nearest neighbor from 500K sentences, using cosine similarity:

Page 125: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Semantic Relatedness
• SemEval 2014 Task 1: semantic relatedness, SICK dataset:

Given two sentences, produce a score of how semantically related the sentences are, based on human-generated scores (1 to 5).

• The dataset comes with a predefined split of 4500 training pairs, 500 development pairs, and 4927 testing pairs.

• Using skip-thought vectors for each sentence, train a linear regression to predict semantic relatedness.
- For a pair of sentences, compute component-wise features between the pairs (e.g. |u−v|).
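A minimal sketch of this pipeline, assuming skip-thought vectors u and v have already been computed for each sentence pair. The feature choice (|u−v| plus the elementwise product u*v) follows the description above; plain least-squares regression and random placeholder data are used here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pair_features(u, v):
    """Component-wise features for a pair of skip-thought vectors."""
    return np.concatenate([np.abs(u - v), u * v])

# Placeholder data standing in for encoded SICK sentence pairs and gold scores (1-5).
# Dimensionality is reduced here; skip-thought vectors are typically much larger.
rng = np.random.default_rng(0)
n_pairs, dim = 4500, 600
U = rng.normal(size=(n_pairs, dim))
V = rng.normal(size=(n_pairs, dim))
scores = rng.uniform(1, 5, size=n_pairs)

X = np.stack([pair_features(u, v) for u, v in zip(U, V)])
model = LinearRegression().fit(X, scores)
predicted = model.predict(X)   # relatedness predictions for the (training) pairs
```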

Page 126: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Semantic Relatedness

• Our models outperform all previous systems from the SemEval 2014 competition. This is remarkable, given the simplicity of our approach and the lack of feature engineering.

SemEval 2014 submissions

Results reported by Tai et al.

Ours  

Page 127: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Semantic Relatedness

• Example predictions from the SICK test set. GT is the ground-truth relatedness, scored between 1 and 5.

• The last few results: slight changes in sentences result in large changes in relatedness that we are unable to score correctly.

Page 128: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Paraphrase Detection
• Microsoft Research Paraphrase Corpus: for two sentences, one must predict whether or not they are paraphrases.

• The training set contains 4076 sentence pairs (2753 are positive).

• The test set contains 1725 pairs (1147 are positive).

Recursive Auto-encoders

Best  published  results  

Ours  

Page 129: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Classification Benchmarks
• 5 datasets: movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA), and question-type classification (TREC).

Bag-of-words

Supervised

Ours  

Page 130: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Summary
• We believe our model for learning skip-thought vectors only scratches the surface of possible objectives.

• Many variations have yet to be explored, including:
- deep encoders and decoders
- larger context windows
- encoding and decoding paragraphs
- other encoders

• It is likely that more exploration of this space will result in even higher-quality sentence representations.

• Code and data are available online: http://www.cs.toronto.edu/~mbweb/

Page 131: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Multi-Modal Models

Laser  scans  

Images  

Video  

Text  &  Language    

Time  series    data  

Speech  &    Audio  

Develop learning systems that come closer to displaying human-like intelligence.

Page 132: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Summary
• Efficient learning algorithms for Hierarchical Generative Models. Learning more adaptive, robust, and structured representations.

• Deep models can improve the current state of the art in many application domains:
Ø Object recognition and detection, text and image retrieval, handwritten character and speech recognition, and others.

Text & image retrieval / Object recognition

Learning a Category Hierarchy

Dealing with missing/occluded data

HMM decoder

Speech Recognition

sunset, pacific ocean, beach, seashore

Multimodal Data / Caption Generation

Page 133: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

Thank  you  

Code is available at: http://deeplearning.cs.toronto.edu/

Page 134: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

References
• J. Ba, V. Mnih, K. Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015.
• L. Ba, K. Swersky, S. Fidler, R. Salakhutdinov, Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions, arXiv 2015.
• Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A Neural Probabilistic Language Model, JMLR 2003.
• Y. Bengio, Y. LeCun, Scaling Learning Algorithms towards AI, Large-Scale Kernel Machines, MIT Press, 2007.
• Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
• Y. Burda, R. Grosse, R. Salakhutdinov, Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing, AI and Statistics (AISTATS), 2015.
• K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014.
• A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013.
• G. Hinton, R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, 2006.
• G. Hinton, S. Osindero, Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 18, 2006.
• G. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002.

Page 135: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

References
• H. Lee, A. Battle, R. Raina, A. Ng, Efficient Sparse Coding Algorithms, NIPS 2007.
• H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009.
• N. Kalchbrenner, P. Blunsom, Recurrent Continuous Translation Models, EMNLP 2013.
• R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal Neural Language Models, ICML 2014.
• R. Kiros, R. Salakhutdinov, R. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, TACL 2015.
• R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, Skip-Thought Vectors, arXiv 2015.
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 1998.
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013.
• R. Salakhutdinov, J. Tenenbaum, A. Torralba, Learning with Hierarchical-Deep Models, IEEE PAMI, vol. 35, no. 8, Aug. 2013.
• R. Salakhutdinov, Learning Deep Generative Models, Annual Review of Statistics and Its Application, Vol. 2, 2015.
• R. Salakhutdinov, G. Hinton, An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 2012, Vol. 24, No. 8.
• R. Salakhutdinov, H. Larochelle, Efficient Learning of Deep Boltzmann Machines, AI and Statistics, 2010.

Page 136: Deep Learning Tutorial, part 2 (by Ruslan Salakhutdinov)

References
• R. Salakhutdinov, G. Hinton, Replicated Softmax: an Undirected Topic Model, NIPS 2010.
• R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007.
• N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs, ICML 2015.
• N. Srivastava, R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, JMLR, 2014.
• I. Sutskever, O. Vinyals, Q. Le, Sequence to Sequence Learning with Neural Networks, NIPS 2014.
• R. Socher, M. Ganjoo, C. Manning, A. Ng, Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013.
• Y. Tang, R. Salakhutdinov, G. Hinton, Deep Lambertian Networks, ICML 2012.
• Y. Tang, R. Salakhutdinov, G. Hinton, Robust Boltzmann Machines for Recognition and Denoising, CVPR 2012.
• K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015.