energy-based approaches to representation learning
TRANSCRIPT
![Page 1: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/1.jpg)
Energy-Based ApproachesTo Representation Learning
IAS2019-10-16
Yann LeCunNew York University
Facebook AI Researchhttp://yann.lecun.com
![Page 2: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/2.jpg)
Y. LeCun
Supervised Learning works but requires many labeled samples
Training a machine by showing examples instead of programming itWhen the output is wrong, tweak the parameters of the machine
PLANE
CAR
Works well for:Speech→words
Image→categories
Portrait→ name
Photo→caption
Text→topic
….
![Page 3: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/3.jpg)
Y. LeCun
Reinforcement Learning: works great for games and simulations.
57 Atari games: takes 83 hours equivalent real-time (18 million frames) to reach a performance that humans reach in 15 minutes of play.[Hessel ArXiv:1710.02298]
Elf OpenGo v2: 20 million self-play games. (2000 GPU for 14 days)[Tian arXiv:1902.04522]
StarCraft: AlphaStar 200 years of equivalent real-time play[Vinyals blog post 2019]
OpenAI single-handed Rubik’s cube10,000 years of simulation
![Page 4: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/4.jpg)
Y. LeCun
But RL Requires too many trials in the real world
Pure RL requires too many trials to learn anythingit’s OK in a game
it’s not OK in the real world
RL works in simple virtual world that you can run faster than real-time on many machines in parallel.
Anything you do in the real world can kill you
You can’t run the real world faster than real time
![Page 5: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/5.jpg)
How do Humans and Animal Learn?
So quickly
![Page 6: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/6.jpg)
Y. LeCun
Babies learn how the world works by observation
Largely by observation, with remarkably little interaction.
Photos courtesy of Emmanuel Dupoux
![Page 7: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/7.jpg)
Y. LeCun
Early Conceptual Acquisition in Infants [from Emmanuel Dupoux]
Perc
epti
on
Pro
duct
ion
Physics
Actions
Objects
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Time (months)
stability, support
gravity, inertiaconservation ofmomentum
Object permanence
solidity, rigidity
shape constancy
crawling walkingemotional contagion
Social-communicative
rational, goal-directed actions
face tracking
proto-imitation
pointing
biological motion
false perceptual beliefs
helping vshindering
natural kind categories
![Page 8: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/8.jpg)
Y. LeCun
Prediction is the essence of Intelligence
We learn models of the world by predicting
![Page 9: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/9.jpg)
Y. LeCun
The Next AI Revolution
THE REVOLUTION THE REVOLUTION WILL NOT BE SUPERVISEDWILL NOT BE SUPERVISED (nor purely reinforced)(nor purely reinforced)
With thanks to Alyosha Efros and Gil Scott Heron
Get the T-shirt!
![Page 10: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/10.jpg)
The Salvation?Self-Supervised Learning
Training very large networks to Understand the world through prediction
![Page 11: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/11.jpg)
Y. LeCun
Self-Supervised Learning: Prediction & Reconstruction
Predict any part of the input from any other part.Predict the future from the past.
Predict the future from the recent past.
Predict the past from the present.
Predict the top from the bottom.
Predict the occluded from the visiblePretend there is a part of the input you don’t know and predict that.
← PastPresent
Time →
Future →
![Page 12: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/12.jpg)
Y. LeCun
Three Types of Learning
Reinforcement LearningThe machine predicts a scalar reward given once in a while.
weak feedback
Supervised LearningThe machine predicts a category or a few numbers for each input
medium feedback
Self-supervised LearningThe machine predicts any part of its input for any observed part.
Predicts future frames in videos
A lot of feedback
PLANE
CAR
![Page 13: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/13.jpg)
Y. LeCun
How Much Information is the Machine Given during Learning?
“Pure” Reinforcement Learning (cherry)The machine predicts a scalar reward given once in a while.
A few bits for some samples
Supervised Learning (icing)The machine predicts a category or a few numbers for each input
Predicting human-supplied data
10→10,000 bits per sample
Self-Supervised Learning (cake génoise)The machine predicts any part of its input for any observed part.
Predicts future frames in videos
Millions of bits per sample
![Page 14: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/14.jpg)
Y. LeCun
Self-Supervised Learning: filling in the bl_nks
Natural Language Processing: works great!
INPUT: This is a […...] of text extracted […..] a large set of [……] articles
OUTPUT: This is a piece of text extracted from a large set of news articles
Encoder
Decoder
CODE
Image Recognition / Understanding: works so-so [Pathak et al 2014]
Encoder DecoderCODE
![Page 15: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/15.jpg)
Y. LeCun
Self-Supervised Learning works well for text
Word2vec[Mikolov 2013]
FastText[Joulin 2016] (FAIR)
BERTBidirectional Encoder Representations from Transformers
[Devlin 2018]
Cloze-Driven Auto-Encoder[Baevski 2019] (FAIR)
RoBERTa [Ott 2019] (FAIR)
Figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/
![Page 16: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/16.jpg)
Y. LeCun
Self-Supervised Learning in vision: Filling in the Blanks
![Page 17: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/17.jpg)
Y. LeCun
PhysNet: tell me what happens next
[Lerer, Gross, Fergus ICML 2016, arxiv:1603.01312]ConvNet produces object masks that predict the trajectories of falling blocks. Blurry predictions when uncertain
![Page 18: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/18.jpg)
Y. LeCun
The world is not entirely predictable / stochastic
Video prediction:Multiple futures are possible.
Training a system to make a single prediction results in “blurry” results
the average of all the possible futures
![Page 19: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/19.jpg)
Y. LeCun
For multiple predictions: latent variable model
As the latent variable z varies, the prediction yp varies over the region of high data density
![Page 20: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/20.jpg)
Energy-Based Learning
Energy ShapingRegularized Auto-Encoders
![Page 21: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/21.jpg)
Y. LeCun
Energy-Based Unsupervised Learning
Learning an energy function (or contrast function) F(y), that takesLow values on the data manifold
Higher values everywhere else
Y1
Y2
![Page 22: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/22.jpg)
Y. LeCunCapturing Dependencies Between Variables with an Energy Function
The energy surface is a “contrast function” that takes low values on the data manifold, and higher values everywhere else
Special case: energy = negative log density
Example: the samples live in the manifold
Y1Y2
Y 2=(Y 1)2
![Page 23: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/23.jpg)
Y. LeCun
Energy-Based Self-Supervised Learning
Energy Function: Takes low value on data manifold, higher values everywhere else
Push down on the energy of desired outputs. Push up on everything else. But how do we choose where to push up?
Observed Data (low energy)
Unobserveddata(high energy)
![Page 24: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/24.jpg)
Y. LeCun
Conditional and Unconditional Energy-Based Model
UnconditionalF(y)
ConditionalF(x,y)
Inference:
y = argminy F(x,y)
F(x,y)
x y
F(y)
y
x
y
y2
y1
![Page 25: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/25.jpg)
Y. LeCun
Learning the Energy Function
parameterized energy function Fw(x,y)
Make the energy low on the samples
Make the energy higher everywhere else
Making the energy low on the samples is easy
But how do we make it higher everywhere else?
![Page 26: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/26.jpg)
Y. LeCun
Conditional and Unconditional Latent Variable EBM
Unconditional
ConditionalE(x,y,z)
x
z
y
F (x , y )=minz E(x , y , z )
E(y,z)
z
y
F ( y )=minz E ( y , z)
F ( y )=−1βlog [∫
z
exp(−βE ( y , z))]
F (x , y )=−1βlog [∫
z
exp (−β E(x , y , z))]
![Page 27: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/27.jpg)
Y. LeCun
Energy-Based Latent Variable Model for Prediction
As the latent variable z varies, the prediction varies over the region of high data density
Inference, free energy:
How to make sure that there is no z for y outside the region of high data density?
Pred(x)
y
x
z
y
Dec(h,z)
h
Cost(y,y)
E( x , y , z)=Cost ( y , Dec(Pred (x) , z ))
F (x , y )=minz E(x , y , z )
F (x , y )=−1βlog [∫
z
exp (−β E(x , y , z))]
![Page 28: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/28.jpg)
Y. LeCun
(Conditional) Auto-Encoder
Auto-encoders learns to reconstruct.
Reconstruction error = Energy function
If it reconstructs everything it’s uselessIdentity function == flat energy surface
How to make sure that the reconstruction error is high outside the region of high data density?
C(y,y)
Pred(x)
y
x
z
y
Dec(h,z)
Enc(h,y)
h
![Page 29: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/29.jpg)
Y. LeCun
Simple examples: PCA and K-means
Limit the capacity of z so that the volume of low energy stuff is boundedPCA, K-means, GMM, square ICA...
F (Y )=‖W TWY−Y‖2
PCA: z is low dimensionalK-Means, Z constrained to 1-of-K codeF (Y )=minz∑i
‖Y−W i Z i‖2
![Page 30: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/30.jpg)
Y. LeCun
Familiar Example: Maximum Likelihood Learning
Y
P(y)
Y
F(y)
The energy can be interpreted as an unnormalized negative log densityGibbs distribution: Probability proportional to exp(-energy)
Beta parameter is akin to an inverse temperature
Don't compute probabilities unless you absolutely have toBecause the denominator is often intractable
P( y )=−exp [−β F ( y )]
∫y'
exp [−β F ( y ')]
P( y∣x )=−exp[−β F( x , y )]
∫y '
exp[−β F( x , y ')]
![Page 31: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/31.jpg)
push down of the energy of data points, push up everywhere else
Max likelihood (requires a tractable partition function)
Y
P(Y)
Y
E(Y)
Maximizing P(Y|W) on training samplesmake this big
make this bigmake this small
Minimizing -log P(Y,W) on training samples
make this small
![Page 32: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/32.jpg)
Gradient of the negative log-likelihood loss for one sample Y:
Pushes down on theenergy of the samples
Pulls up on theenergy of low-energy Y's
Y
Y
E(Y)Gradient descent:
push down of the energy of data points, push up everywhere else
![Page 33: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/33.jpg)
Y. LeCun
Seven Strategies to Shape the Energy Function
1. build the machine so that the volume of low energy stuff is constantPCA, K-means, GMM, square ICA
2. push down of the energy of data points, push up everywhere elseMax likelihood (needs tractable partition function or variational approximation)
3. push down of the energy of data points, push up on chosen locations Contrastive divergence, Ratio Matching, Noise Contrastive Estimation, Min Probability Flow, adversarial generator/GANs
4. minimize the gradient and maximize the curvature around data points score matching
5. if F(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.Contracting auto-encoder, saturating auto-encoder
6. train a dynamical system so that the dynamics goes to the data manifolddenoising auto-encoder, masked auto-encoder (e.g. BERT)
7. use a regularizer that limits the volume of space that has low energySparse coding, sparse auto-encoder, LISTA & PSD, Variational auto-encoders
MO
ST IMPO
RTANT SLIDE
![Page 34: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/34.jpg)
Regularized Latent Variable EBM
Sparse Modeling, Regularized Auto-Encoders
![Page 35: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/35.jpg)
Y. LeCun
Regularized Latent-Variable EBM
Restrict the information capacity of z through regularization
C(y,y)
y
z
y
R(z)
Dec(z)
E( y , z )=C( y ,Dec (z ))+λ R (z)
![Page 36: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/36.jpg)
Y. LeCun
Familiar Example: Sparse Coding / Sparse Modeling
Regularized latent variable through L1 normInduces sparsity
|| y-y ||2
y
z
y
|z|
Wz
E( y , z )=‖y−Wz‖2+λ|z|
![Page 37: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/37.jpg)
Sparse Auto-Encoders
Computing the sparse code efficiently
![Page 38: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/38.jpg)
Y. LeCun
Sparse Auto-Encoder
Encoder learns to compute the latent variable efficiently
No explicit inference step
C(y,y)
Pred(x)
y
x
z
y
R(z)
Dec(h,z)
Enc(h,y)
h
![Page 39: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/39.jpg)
Y. LeCun
Sparse (conditional) Auto-Encoder
Encoder learns to compute the latent variable efficiently
Simple inference step
C(y,y)
Pred(x)
y
x
z
y
R(z)
Dec(h,z)
Enc(h,y)
h D(z,z)
z
E( x , y , z)=C ( y , Dec(h , z ))+D(z , Enc (h , y ))+R (z )
![Page 40: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/40.jpg)
Y. LeCun
Sparse Modeling on handwritten digits (MNIST)
Basis functions (columns of decoder matrix) are digit partsAll digits are a linear combination of a small number of these
![Page 41: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/41.jpg)
Y. LeCun
Training on natural images patches.
12X12256 basis functions
Predictive Sparse Decomposition (PSD): Training
![Page 42: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/42.jpg)
Y. LeCunLearned Features on natural patches: V1-like receptive fields
![Page 43: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/43.jpg)
Y. LeCun
ISTA/FISTA: iterative algorithm that converges to optimal sparse code
INPUT Y ZW e sh( )
S
+
[Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012], [Rolfe & LeCun ICLR 2013]
Lateral Inhibition
Giving the “right” structure to the encoder
ISTA/FastISTA reparameterized:
LISTA (Learned ISTA): learn the We and S matrices to get fast solutions
![Page 44: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/44.jpg)
Y. LeCun
Think of the FISTA flow graph as a recurrent neural net where We and S are trainable parameters
INPUT Y ZW e sh()
S
+
Time-Unfold the flow graph for K iterations
Learn the We and S matrices with “backprop-through-time”
Get the best approximate solution within K iterations
Y
Z
W e
sh()+ S sh()+ S
LISTA: Train We and S matrices to give a good approximation quickly
![Page 45: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/45.jpg)
Y. LeCun
Learning ISTA (LISTA) vs ISTA/FISTA
Number of LISTA or FISTA iterations
Reco
nst
ruct
ion
Err
or
![Page 46: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/46.jpg)
Y. LeCun
LISTA with partial mutual inhibition matrix
Proportion of S matrix elements that are non zero
Reco
nst
ruct
ion
Err
or
Smallest elementsremoved
![Page 47: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/47.jpg)
Y. LeCun
Learning Coordinate Descent (LcoD): faster than LISTA
Number of LISTA or FISTA iterations
Reco
nst
ruct
ion
Err
or
![Page 48: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/48.jpg)
Y. LeCun
Replace the dot products with dictionary element by convolutions.Input Y is a full imageEach code component Zk is a feature map (an image)Each dictionary element is a convolution kernel
Regular sparse coding
Convolutional S.C.
∑k
. * ZkWk
Y =
Also used in “deconvolutional networks” [Zeiler, Taylor, Fergus CVPR 2010]
Convolutional Sparse Coding
![Page 49: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/49.jpg)
Y. LeCun
Convolutional FormulationExtend sparse coding from PATCH to IMAGE
PATCH based learning CONVOLUTIONAL learning
Convolutional PSD: Encoder with a soft sh() Function
![Page 50: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/50.jpg)
Y. LeCun
Convolutional Sparse Auto-Encoder on Natural Images
Filters and Basis Functions obtained
– with 1, 2, 4, 8, 16, 32, and 64 filters.
Encoder Filters Decoder Filters Encoder Filters Decoder Filters
![Page 51: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/51.jpg)
Y. LeCun
Learning Invariant Features [Kavukcuoglu et al. CVPR 2009]
Unsupervised PSD ignores the spatial pooling step.Could we devise a similar method that learns the pooling layer as well?Idea [Hyvarinen & Hoyer 2001]: group sparsity on pools of featuresMinimum number of pools must be non-zero
Number of features that are on within a pool doesn't matter
Polls tend to regroup similar features
INPUT Y Z
∥Y i− Y∥2 W d Z
FEATURES
∑ j.
∥Z− Z∥2ge W e ,Yi
∑k∈P jZk2
![Page 52: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/52.jpg)
Y. LeCun
Why just pool over space? Why not over orientation?Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with suitable non-linearity)
2. Arrange filter outputs on a 2D plane
3. square filter outputs
4. minimize sqrt of sum of blocks of squared filter outputs
![Page 53: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/53.jpg)
Y. LeCun
Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.The pooling units can be seen as complex cellsThey are invariant to local transformations of the inputFor some it's translations, for others rotations, or other transformations.
![Page 54: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/54.jpg)
Y. LeCun
Pinwheels!
Does that look pinwheely to you?
![Page 55: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/55.jpg)
Variational Auto-Encoder
Limiting the information content of the code by adding noise to it
![Page 56: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/56.jpg)
Y. LeCun
Variational (conditional) AE
Use the D() energy term as a prior to sample z from
C(y,y)
Pred(x)
y
x
z
y
R(z)
Dec(h,z)
Enc(h,y)
h D(z,z)
z
![Page 57: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/57.jpg)
Y. LeCun
Variational Auto-Encoder
Code vectors for training samples
Z1
Z2
![Page 58: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/58.jpg)
Y. LeCun
Variational Auto-Encoder
Code vectors for training sample with Gaussian noiseSome fuzzy balls overlap, causing bad reconstructions
Z1
Z2
![Page 59: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/59.jpg)
Y. LeCun
Variational Auto-Encoder
The code vectors want to move away from each other to minimize reconstruction errorBut that does nothing for us
Z1
Z2
![Page 60: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/60.jpg)
Y. LeCun
Variational Auto-Encoder
Attach the balls to the center with a sping, so they don’t fly awayMinimize the square distances of the balls to the origin
Center the balls around the originMake the center of mass zero
Make the sizes of the balls close to 1 in each dimensionThrough a so-called KL term Z1
Z2
![Page 61: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/61.jpg)
Denoising Auto-Encoders
Masked AE, BERT...
![Page 62: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/62.jpg)
Y. LeCun
Denoising Auto-Encoder
Pink points: training samples
Orange points: corrupted training samples
Additive Gaussian noise
Figures credit: Alfredo Canziani
![Page 63: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/63.jpg)
Y. LeCun
Data
Pink points: training samples
Orange points: corrupted training samples
Additive Gaussian noise
Blue points:Network outputs
Denoised samples
![Page 64: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/64.jpg)
Y. LeCun
Data
Orange points: Test samples on a grid
Blue points:Network outputs
![Page 65: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/65.jpg)
Y. LeCun
Data
Pink points:Training samples
Color:Recunstruction energy
Vector field:Displacement from network input to network output (scaled down)
![Page 66: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/66.jpg)
Adversarial Training &Video Prediction
![Page 67: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/67.jpg)
Y. LeCun
The Hard Part: Prediction Under Uncertainty
Invariant prediction: The training samples are merely representatives of a whole set of possible outputs (e.g. a manifold of outputs).
Percepts
Hidden StateOf the World
![Page 68: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/68.jpg)
Y. LeCun
Y
F(X,Y)
Adversarial Training: the key to prediction under uncertainty?
GeneratorG(X,Z)
DiscriminatorF(X,Y)
X
Past: X
YZ
F: minimizeDataset
T(X)Y
X
DiscriminatorF(X,Y)
Past: X
F: maximize
Actual future
Predicted future
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
![Page 69: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/69.jpg)
Y. LeCun
Y
F(X,Y)
GeneratorG(X,Z)
DiscriminatorF(X,Y)
X
Past: X
YZ
F: minimizeDataset
T(X)Y
X
DiscriminatorF(X,Y)
Past: X
F: maximize
Actual future
Predicted future
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
F(X,Y)
Adversarial Training: the key to prediction under uncertainty?
![Page 70: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/70.jpg)
Y. LeCun
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
GeneratorG(X,Z)
DiscriminatorF(X,Y)
X
Past: X
YZ
F: minimizeDataset
T(X)Y
X
DiscriminatorF(X,Y)
Past: X
F: maximize
Actual future
Predicted future
Y
F(X,Y)
Adversarial Training: the key to prediction under uncertainty?
![Page 71: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/71.jpg)
Y. LeCun
Faces “invented” by a GAN (Generative Adversarial Network)
Random vector → Generator Network → output image [Goodfellow NIPS 2014]
[Karras et al. ICLR 2018] (from NVIDIA)
![Page 72: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/72.jpg)
Y. LeCun
Generative Adversarial Networks for Creation
[Sbai 2017]
![Page 73: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/73.jpg)
Y. LeCun
Self-supervised Adversarial Learning for Video PredictionOur brains are “prediction machines”Can we train machines to predict the future?Some success with “adversarial training” [Mathieu, Couprie, LeCun arXiv:1511:05440]
But we are far from a complete solution.
![Page 74: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/74.jpg)
Learning Models of the World
Learning motor skills with no interaction with the real world[Henaff, Canziani, LeCun ICLR 2019]
[Henaff, Zhao, LeCun ArXiv:1711.04994][Henaff, Whitney, LeCun Arxiv:1705.07177]
![Page 75: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/75.jpg)
Y. LeCun
Planning Requires Prediction
To plan ahead, we simulate the world
World
Agent
Percepts
ObjectiveCost
Agent State
Actions/Outputs
AgentWorld
Simulator
Actor
PredictedPercepts
Critic Predicted Cost
ActionProposals
InferredWorld State
Actor State
![Page 76: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/76.jpg)
Y. LeCun
Training the Actor with Optimized Action Sequences
1. Find action sequence through optimization2. Use sequence as target to train the actorOver time we get a compact policy that requires no run-time optimization
AgentWorld
Simulator
Actor
Critic
World
Simulator
Actor
Critic
World
Simulator
Actor
Critic
World
Simulator
Actor
Critic
Perception
![Page 77: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/77.jpg)
Y. LeCun
Planning/learning using a self-supervised predictive world model
Feed initial stateRun the forward model Backpropagate gradient of cost Act (model-predictive control)
or
Use the gradient to train a policy network.Iterate
![Page 78: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/78.jpg)
Y. LeCun
Using Forward Models to Plan (and to learn to drive)
Overhead camera on highway.Vehicles are tracked
A “state” is a pixel representation of a rectangular window centered around each car.Forward model is trained to predict how every car moves relative to the central car. steering and acceleration are computed
![Page 79: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/79.jpg)
Y. LeCun
Forward Model Architecture
Architecture: Encoder Decoder
Latent variable predictor
expander
+
![Page 80: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/80.jpg)
Y. LeCun
Stochastic Forward Modeling: regularized latent variable model
Latent variable is predicted from the target.The latent variable is set to zero half the time during training (drop out) and corrupted with noiseThe model predicts as much as it can without the latent var. The latent var corrects the residual error.
+
Noise
predictor
fpred
fenc
fdec
![Page 81: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/81.jpg)
Y. LeCun
Actual, Deterministic, VAE+Dropout Predictor/encoder
![Page 82: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/82.jpg)
Y. LeCun
Cost optimized for Planning & Policy Learning
Differentiable cost functionIncreases as car deviates from lane
Increases as car gets too close to other cars nearby in a speed-dependent way
Uncertainty cost:Increases when the costs from multiple predictions (obtained through sampling of drop-out) have high variance.
Prevents the system from exploring unknown/unpredictable configurations that may have low cost.
![Page 83: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/83.jpg)
Y. LeCun
Learning to Drive by Simulating it in your Head
Feed initial stateSample latent variable sequences of length 20Run the forward model with these sequencesBackpropagate gradient of cost to train a policy network.Iterate
No need for planning at run time.
![Page 84: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/84.jpg)
Y. LeCun
Adding an Uncertainty Cost (doesn’t work without it)
Estimates epistemic uncertaintySamples multiple drop-puts in forward modelComputes variance of predictions (differentiably)Train the policy network to minimize the lane&proximity cost plus the uncertainty cost.Avoids unpredictable outcomes
![Page 85: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/85.jpg)
Y. LeCun
Driving an Invisible Car in “Real” Traffic
![Page 86: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/86.jpg)
Y. LeCun
Driving!
Yellow: real carBlue: bot-driven car
![Page 87: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/87.jpg)
Driving!
Yellow: real carBlue: bot-driven car
![Page 88: Energy-Based Approaches To Representation Learning](https://reader033.vdocuments.site/reader033/viewer/2022052501/628b31576312e5647f1df6f5/html5/thumbnails/88.jpg)
Thank you