Deep Learning
NIPS'2015 Tutorial
Geoff Hinton, Yoshua Bengio & Yann LeCun

Breakthrough: Deep Learning = machine learning algorithms based on learning multiple levels of representation/abstraction.
Amazing improvements in error rate in object recognition, object detection, speech recognition, and more recently in natural language processing/understanding.
Machine Learning, AI & No Free Lunch
• Four key ingredients for ML towards AI:
1. Lots & lots of data
2. Very flexible models
3. Enough computing power
4. Powerful priors that can defeat the curse of dimensionality
Bypassing the curse of dimensionality
We need to build compositionality into our ML models
Just as human languages exploit compositionality to give representations and meanings to complex ideas
Exploiting compositionality gives an exponential gain in representational power
(1) Distributed representations / embeddings: feature learning
(2) Deep architecture: multiple levels of feature learning
Additional prior: compositionality is useful to describe the world around us efficiently
Classical Symbolic AI vs Learning Distributed Representations
• Two symbols are equally far from each other
• Concepts are not represented by symbols in our brain, but by patterns of activation (Connectionism, 1980's)
[Diagram: input units, hidden units, output units; distributed patterns of activation for "cat", "dog", "person"]
Geoffrey Hinton, David Rumelhart
Exponential advantage of distributed representations
Learning a set of parametric features that are not mutually exclusive can be exponentially more statistically efficient than having nearest-neighbor-like or clustering-like models
Figure 9: (a) Segmentations from pool5 in Places-CNN. Many classes are encoded by several units covering different object appearances. Each row shows the 3 top most confident images for each unit. (b) Object frequency in SUN (only top 50 objects shown), (c) Counts of objects discovered by pool5 in Places-CNN. (d) Frequency of most informative objects for scene classification.
4 EMERGENCE OF OBJECTS AS THE INTERNAL REPRESENTATION
As shown before, a large number of units in pool5 are devoted to detecting objects and scene-regions (Fig. 8). But what categories are found? Is each category mapped to a single unit or are there multiple units for each object class? Can we actually use this information to segment a scene?
4.1 WHAT OBJECT CLASSES EMERGE?
Fig. 9(a) shows some units from the Places-CNN grouped by the object class they seem to be detecting. Each row shows the top three images for a particular unit that produce the strongest activations. The segmentation shows the region of the image for which the unit is above a threshold. Each unit seems to be selective to a particular appearance of the object. For instance, there are 6 units that detect lamps, each unit detecting a particular type of lamp providing finer-grained discrimination; there are 9 units selective to people, each one tuned to different scales or people doing different tasks. ImageNet has an abundance of animals among the categories present: in the ImageNet-CNN, out of the 256 units in pool5, there are 23 units devoted to detecting dogs or parts of dogs. The categories found in pool5 tend to follow the target categories in ImageNet.
To answer the question of why certain objects emerge from pool5, we tested the Places-CNN on fully annotated images from the SUN database (Xiao et al., 2014). The SUN database contains 8220 fully annotated images from the same 205 place categories used to train Places-CNN. There are no duplicate images between SUN and Places. We use SUN instead of COCO (Lin et al., 2014) as we need dense object annotations to study what the most informative object classes for scene categorization are, and what the natural object frequencies in scene images are. For this study, we manually mapped the tags given by AMT workers to the SUN categories. Fig. 9(b) shows the sorted distribution of object counts in the SUN database, which follows Zipf's law.
One possibility is that the objects that emerge in pool5 correspond to the most frequent ones in the database. Fig. 9(c) shows the counts of units found in pool5 for each object class (same sorting as in Fig. 9(b)). The correlation between object frequency in the database and object frequency discovered by the units in pool5 is 0.54. Another possibility is that the objects that emerge are the objects that allow discriminating among scene categories. To measure the set of discriminant objects we used the ground truth in the SUN database to measure the classification performance achieved by each object class for scene classification. Then we count how many times each object class appears as the most informative one. This measures the number of scene categories a particular object class is the most useful for. The counts are shown in Fig. 9(d). Note the similarity between Fig. 9(c) and Fig. 9(d). The correlation is 0.84, indicating that the network is automatically identifying the most discriminative object categories to a large extent.
Hidden Units Discover Semantically Meaningful Concepts
• Zhou et al & Torralba, arXiv 1412.6856, submitted to ICLR 2015
• Network trained to recognize places, not objects
Figure 10: Interpretation of a picture by different layers of the Places-CNN using the tags provided by AMT workers. The first shows the final layer output of Places-CNN. The other three show detection results along with the confidence based on the units' activation and the semantic tags.

[Figure 11 panels: (a) per-class segmentation examples with Jaccard index and average precision: Fireplace (J=5.3%, AP=22.9%), Wardrobe (J=4.2%, AP=12.7%), Billiard table (J=3.2%, AP=42.6%), Bed (J=24.6%, AP=81.1%), Mountain (J=11.3%, AP=47.6%), Sofa (J=10.8%, AP=36.2%), Building (J=14.6%, AP=47.2%), Washing machine (J=3.2%, AP=34.4%); (b) precision-recall curves for sofa, desk lamp, swimming pool, bed, car; (c) histogram of average precision (AP) over discovered object classes]
Figure 11: (a) Segmentation of images from the SUN database using pool5 of Places-CNN (J = Jaccard segmentation index, AP = average precision-recall). (b) Precision-recall curves for some discovered objects. (c) Histogram of AP for all discovered object classes.
Note that there are 115 units in pool5 of Places-CNN not detecting objects. This could be due to incomplete learning or a complementary texture-based or part-based representation of the scenes.
4.2 OBJECT LOCALIZATION WITHIN THE INNER LAYERS
Places-CNN is trained to do scene classification using the output of the final layer of logistic regression and achieves state-of-the-art performance. From our analysis above, many of the units in the inner layers could perform interpretable object localization. Thus we could use this single Places-CNN with the annotation of units to do both scene recognition and object localization in a single forward pass. Fig. 10 shows an example of the output of different layers of the Places-CNN using the tags provided by AMT workers. Bounding boxes are shown around the areas where each unit is activated within its RF above a threshold.
In Fig. 11 we evaluate the segmentation performance of the objects discovered in pool5 using the SUN database. The performance of many units is very high, which provides strong evidence that they are indeed detecting those object classes despite being trained for scene classification.
5 CONCLUSION
We find that object detectors emerge as a result of learning to classify scene categories, showing that a single network can support recognition at several levels of abstraction (e.g., edges, textures, objects, and scenes) without needing multiple outputs or networks. While it is common to train a network to do several tasks and to use the final layer as the output, here we show that reliable outputs can be extracted at each layer. As objects are the parts that compose a scene, detectors tuned to the objects that are discriminant between scenes are learned in the inner layers of the network. Note that only informative objects for specific scene recognition tasks will emerge. Future work should explore which other tasks would allow for other object classes to be learned without the explicit supervision of object labels.
Each feature can be discovered without the need for seeing the exponentially large number of configurations of the other features
• Consider a network whose hidden units discover the following features:
  • Person wears glasses
  • Person is female
  • Person is a child
  • Etc.
If each of the n features requires O(k) parameters, we need O(nk) examples.
Non-parametric methods would require O(n^d) examples.
Exponential advantage of distributed representations
• Bengio 2009 (Learning Deep Architectures for AI, F&T in ML)
• Montufar & Morton 2014 (When does a mixture of products contain a product of mixtures? SIAM J. Discr. Math)
• Longer discussion and relations to the notion of priors: Deep Learning, to appear, MIT Press.
• Prop. 2 of Pascanu, Montufar & Bengio ICLR'2014: the number of pieces distinguished by a 1-hidden-layer rectifier net with n units and d inputs (i.e. O(nd) parameters) is Σ_{j=0}^{d} C(n, j), which grows like O(n^d).
Deep Learning: Automating Feature Discovery (Fig: I. Goodfellow)
• Rule-based systems: Input → Hand-designed program → Output
• Classic machine learning: Input → Hand-designed features → Mapping from features → Output
• Representation learning: Input → Features → Mapping from features → Output
• Deep learning: Input → Simplest features → Most complex features → Mapping from features → Output
Exponential advantage of depth
Theoretical arguments:
• 2 layers of logic gates, formal neurons, or RBF units = universal approximator (but a shallow net may need on the order of 2^n units where a depth-n net needs only n per layer)
• Theorems on the advantage of depth: (Håstad et al 1986 & 1991, Bengio et al 2007, Bengio & Delalleau 2011, Martens et al 2013, Pascanu et al 2014, Montufar et al NIPS 2014)
• Some functions compactly represented with k layers may require exponential size with 2 layers
• RBMs & auto-encoders = universal approximators
Why does it work? No Free Lunch
• It only works because we are making some assumptions about the data generating distribution
• Worst-case distributions still require exponential data
• But the world has structure and we can get an exponential gain by exploiting some of it

• Expressiveness of deep networks with piecewise linear activation functions: exponential advantage for depth (Montufar et al, NIPS 2014)
• The number of linear pieces distinguished by a network with depth L and n_i units per layer is at least
  (Π_{i=1}^{L-1} ⌊n_i/n_0⌋^{n_0}) Σ_{j=0}^{n_0} C(n_L, j)
• or, if the hidden layers have width n and the input has size n_0,
  Ω((n/n_0)^{(L-1)n_0} n^{n_0})
Exponential advantage of depth
Y LeCun
Backprop (modular approach)
Y LeCun Typical Multilayer Neural Net Architecture
• Complex learning machines can be built by assembling modules into networks
• Linear Module: Out = W.In + B
• ReLU Module (Rectified Linear Unit): Out_i = 0 if In_i < 0, Out_i = In_i otherwise
• Cost Module: Squared Distance: C = ||In1 - In2||^2
• Objective Function: L(Θ) = 1/p Σ_k C(X^k, Y^k, Θ), with Θ = (W1, B1, W2, B2, W3, B3)
[Diagram: X (input) → Linear(W1,B1) → ReLU → Linear(W2,B2) → ReLU → Linear(W3,B3) → Squared Distance with Y (desired output) → C(X,Y,Θ)]
Y LeCun Building a Network by Assembling Modules
• All major deep learning frameworks use modules (inspired by SN/Lush, 1991): Torch7, Theano, TensorFlow, ...
[Diagram: X input → Linear(W1,B1) → ReLU → Linear(W2,B2) → LogSoftMax → NegativeLogLikelihood with Y label → C(X,Y,Θ)]
Y LeCun Computing Gradients by Back-Propagation
• A practical application of the chain rule
• Backprop for the state gradients:
  dC/dX_{i-1} = dC/dX_i . dX_i/dX_{i-1}
  dC/dX_{i-1} = dC/dX_i . dF_i(X_{i-1}, W_i)/dX_{i-1}
• Backprop for the weight gradients:
  dC/dW_i = dC/dX_i . dX_i/dW_i
  dC/dW_i = dC/dX_i . dF_i(X_{i-1}, W_i)/dW_i
[Diagram: stack of modules F_1(X_0,W_1) ... F_i(X_{i-1},W_i) ... F_n(X_{n-1},W_n) → Cost C(X,Y,Θ); forward states X_i and backward gradients dC/dX_i, dC/dW_i flow through the stack]
Y LeCun Running Backprop
• Torch7 example
• gradtheta contains the gradient
[Diagram: X input → Linear(W1,B1) → ReLU → Linear(W2,B2) → LogSoftMax → NegativeLogLikelihood with Y label → C(X,Y,Θ); parameters Θ]
Y LeCun Module Classes
• Linear: Y = W.X ; dC/dX = W^T . dC/dY ; dC/dW = dC/dY . X^T
• ReLU: y = ReLU(x) ; if (x < 0) dC/dx = 0 else dC/dx = dC/dy
• Duplicate: Y1 = X, Y2 = X ; dC/dX = dC/dY1 + dC/dY2
• Add: Y = X1 + X2 ; dC/dX1 = dC/dY ; dC/dX2 = dC/dY
• Max: y = max(x1, x2) ; if (x1 > x2) dC/dx1 = dC/dy else dC/dx1 = 0
• LogSoftMax: Y_i = X_i - log[Σ_j exp(X_j)] ; ...
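A minimal numpy sketch (not from the slides; names are illustrative) of two such modules, with forward and backward methods matching the rules above:

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.B = np.zeros(n_out)

    def forward(self, x):
        self.x = x                              # cache the input for backward
        return self.W @ x + self.B

    def backward(self, dC_dy):
        self.dW = np.outer(dC_dy, self.x)       # dC/dW = dC/dY . X^T
        self.dB = dC_dy
        return self.W.T @ dC_dy                 # dC/dX = W^T . dC/dY

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, dC_dy):
        return dC_dy * self.mask                # gradient passes only where x > 0
```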
Y LeCun Module Classes
• Many more basic module classes
• Cost functions: squared error, hinge loss, ranking loss
• Non-linearities and operators: ReLU, "leaky" ReLU, abs, tanh, logistic, and just about any simple function (log, exp, add, mul, ...)
• Specialized modules: multiple convolutions (1D, 2D, 3D); pooling/subsampling: max, average, Lp, log(sum(exp())), maxout; Long Short-Term Memory, attention, 3-way multiplicative interactions; switches; normalizations: batch norm, contrast norm, feature norm; inception
Y LeCun Any Architecture works
" Any connection graph is permissible " Directed acyclic graphs (DAG) " Networks with loops must be
“unfolded in time”.
" Any module is permissible " As long as it is continuous and
differentiable almost everywhere with respect to the parameters, and with respect to non-terminal inputs.
" Most frameworks provide automatic differentiation " Theano, Torch7+autograd,… " Programs are turned into
computation DAGs and automatically differentiated.
Y LeCun Backprop in Practice
" Use ReLU non-linearities
" Use cross-entropy loss for classification
" Use Stochastic Gradient Descent on minibatches
" Shuffle the training samples (← very important)
" Normalize the input variables (zero mean, unit variance)
" Schedule to decrease the learning rate
" Use a bit of L1 or L2 regularization on the weights (or a combination) " But it's best to turn it on after a couple of epochs
" Use “dropout” for regularization
" Lots more in [LeCun et al. “Efficient Backprop” 1998]
" Lots, lots more in “Neural Networks, Tricks of the Trade” (2012 edition) edited by G. Montavon, G. B. Orr, and K-R Müller (Springer)
" More recent: Deep Learning (MIT Press book in preparation)
Y LeCun
Convolutional Networks
Y LeCun Deep Learning = Training Multistage Machines
" Traditional Pattern Recognition: Fixed/Handcrafted Feature Extractor
Trainable Classifier
Feature Extractor
" Mainstream Pattern Recognition 9until recently)
Trainable Classifier
Feature Extractor
Mid-Level Features
" Deep Learning: Multiple stages/layers trained end to end
Trainable Classifier
Low-Level Features
Mid-Level Features
High-Level Features
Y LeCun
Overall Architecture: multiple stages of Normalization → Filter Bank → Non-Linearity → Pooling
" Normalization: variation on whitening (optional)
– Subtractive: average removal, high pass filtering – Divisive: local contrast normalization, variance normalization
" Filter Bank: dimension expansion, projection on overcomplete basis " Non-Linearity: sparsification, saturation, lateral inhibition....
– Rectification (ReLU), Component-wise shrinkage, tanh,..
" Pooling: aggregation over space or feature type
– Max, Lp norm, log prob.
Classifier feature Pooling
Non- Linear
Filter Bank
Norm feature
Pooling Non-
Linear Filter Bank
Norm
Y LeCun ConvNet Architecture
" LeNet1 [LeCun et al. NIPS 1989]
Filter Bank +non-linearity
Filter Bank +non-linearity
Pooling
Pooling
Filter Bank +non-linearity
Y LeCun Multiple Convolutions
Animation: Andrej Karpathy http://cs231n.github.io/convolutional-networks/
Y LeCun Convolutional Networks (vintage 1990)
" filters → tanh → average-tanh → filters → tanh → average-tanh → filters → tanh
Y LeCun Example: 1D (Temporal) convolutional net
" 1D (Temporal) ConvNet, aka Timed-Delay Neural Nets " Groups of units are replicated at each time step. " Replicas have identical (shared) weights.
Y LeCun LeNet5
" Simple ConvNet " for MNIST " [LeCun 1998]
input 1@32x32
Layer 1 6@28x28
Layer 2 6@14x14
Layer 3 12@10x10
Layer 4 12@5x5
Layer 5 100@1x1
10
5x5 convolution
5x5 convolution
5x5 convolution
2x2 pooling/ subsampling
2x2 pooling/ subsampling
Layer 6: 10
Y LeCun Applying a ConvNet with a Sliding Window
" Every layer is a convolution " Sometimes called “fully convolutional nets” " There is no such thing as a “fully connected layer”
Y LeCun Sliding Window ConvNet + Weighted FSM (Fixed Post-Proc)
[Matan, Burges, LeCun, Denker NIPS 1991] [LeCun, Bottou, Bengio, Haffner, Proc IEEE 1998]
Y LeCun Sliding Window ConvNet + Weighted FSM
Y LeCun Why Multiple Layers? The World is Compositional
" Hierarchy of representations with increasing level of abstraction
" Each stage is a kind of trainable feature transform
" Image recognition: Pixel → edge → texton → motif → part → object
" Text: Character → word → word group → clause → sentence → story
" Speech: Sample → spectral band → sound → … → phone → phoneme → word
Trainable Classifier
Low-Level Feature
Mid-Level Feature
High-Level Feature
Y LeCun Yes, ConvNets are somewhat inspired by the Visual Cortex
[picture from Simon Thorpe]
[Gallant & Van Essen]
" The ventral (recognition) pathway in the visual cortex has multiple stages " Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Y LeCun What are ConvNets Good For
" Signals that comes to you in the form of (multidimensional) arrays. " Signals that have strong local correlations " Signals where features can appear anywhere " Signals in which objects are invariant to translations and distortions.
" 1D ConvNets: sequential signals, text
– Text Classification – Musical Genre Recognition – Acoustic Modeling for Speech Recognition – Time-Series Prediction
" 2D ConvNets: images, time-frequency representations (speech and audio) – Object detection, localization, recognition
" 3D ConvNets: video, volumetric images, tomography images – Video recognition / understanding – Biomedical image analysis – Hyperspectral image analysis
Recurrent Neural Networks
Recurrent Neural Networks
• Selectively summarize an input sequence in a fixed-size state vector via a recursive update
[Diagram: s_t = F_θ(s_{t−1}, x_t), drawn as a loop and unfolded in time: ... → s_{t−1} → s_t → s_{t+1} → ..., with inputs x_{t−1}, x_t, x_{t+1}]
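A minimal numpy sketch (assumed names, not from the tutorial) of this recursive update, unfolded over a sequence:

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b, s0):
    """Unfold s_t = tanh(W_h s_{t-1} + W_x x_t + b) over an input sequence."""
    s = s0
    states = []
    for x_t in x_seq:                           # one update per time step
        s = np.tanh(W_h @ s + W_x @ x_t + b)
        states.append(s)
    return states                               # fixed-size state summarizing each prefix
```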
Recurrent Neural Networks
• Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.
[Diagram: unfolded RNN with shared weights U (input-to-state), W (state-to-state), V (state-to-output); states s_{t−1}, s_t, s_{t+1} and outputs o_{t−1}, o_t, o_{t+1}]
Generative RNNs
[Diagram: unfolded generative RNN; each output o_t parameterizes a loss/likelihood L_t, and the predicted/sampled x_{t+1} is fed back as the next input]
• An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.
Maximum Likelihood = Teacher Forcing
• During training, the past y in the input comes from the training data
• At generation time, the past y in the input is generated
• Mismatch can cause "compounding error"
[Diagram: the model outputs P(y_t | h_t); at training time the next input/output pair (x_t, y_t) comes from the data, at generation time y_t ~ P(y_t | h_t) is fed back]
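A small illustrative sketch (plain Python, assumed `step(h, x)` returning the next state and a distribution over outputs) contrasting teacher-forced training with free-running generation:

```python
import numpy as np

def teacher_forced_nll(step, xs, ys, h0):
    """Training: the true previous output (from the data) is fed as the next input."""
    h, nll = h0, 0.0
    for x_t, y_t in zip(xs, ys):
        h, p = step(h, x_t)                     # p is a distribution over output symbols
        nll -= np.log(p[y_t])                   # maximum-likelihood objective
    return nll

def free_running_generate(step, h0, x0, steps):
    """Generation: the model's own sample is fed back, so errors can compound."""
    h, x, out = h0, x0, []
    for _ in range(steps):
        h, p = step(h, x)
        y = np.random.choice(len(p), p=p)       # sample y_t ~ P(y_t | h_t)
        out.append(y)
        x = y                                   # feed back the generated symbol
    return out
```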
Increasing the Expressive Power of RNNs with more Depth
• ICLR 2014, How to construct deep recurrent neural networks
[Diagram: ordinary RNNs; + deep hid-to-out, + deep hid-to-hid, + deep in-to-hid; + skip connections for creating shorter paths; + stacking]
Long-Term Dependencies
• The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
• Problems:
  • sing. values of Jacobians > 1 → gradients explode
  • or sing. values < 1 → gradients shrink & vanish
  • or random → variance grows exponentially
• Storing bits robustly requires sing. values < 1 (Hochreiter 1991)
• Gradient clipping

Gradient Norm Clipping
(Mikolov thesis 2012; Pascanu, Mikolov, Bengio, ICML 2013)
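A minimal numpy sketch of gradient norm clipping in the spirit of Pascanu et al. (2013); the threshold value is illustrative:

```python
import numpy as np

def clip_gradient_norm(grads, threshold=5.0):
    """Rescale the whole gradient if its global norm exceeds a threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads
```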
RNN Tricks (Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
• Clipping gradients (avoid exploding gradients)
• Leaky integration (propagate long-term dependencies)
• Momentum (cheap 2nd order)
• Initialization (starting in the right ballpark avoids exploding/vanishing)
• Sparse gradients (symmetry breaking)
• Gradient propagation regularizer (avoid vanishing gradient)
• LSTM self-loops (avoid vanishing gradient)
[Diagram: LSTM cell with input, input gate, forget gate, output gate, state with self-loop, output]

Gated Recurrent Units & LSTM
• Create a path where gradients can flow for longer, with a self-loop
• Corresponds to an eigenvalue of the Jacobian slightly less than 1
• LSTM is heavily used (Hochreiter & Schmidhuber 1997)
• GRU: light-weight version (Cho et al 2014)
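A compact numpy sketch of one LSTM step with a self-loop on the cell state, following the standard gating equations (an illustration, not the tutorial's code; names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step: gates control what is written to and read from the cell c."""
    i = sigmoid(W['hi'] @ h_prev + W['xi'] @ x)                    # input gate
    f = sigmoid(W['hf'] @ h_prev + W['xf'] @ x)                    # forget gate
    o = sigmoid(W['ho'] @ h_prev + W['xo'] @ x)                    # output gate
    c = f * c_prev + i * np.tanh(W['hc'] @ h_prev + W['xc'] @ x)   # self-loop on the cell
    h = o * np.tanh(c)
    return h, c
```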
RNN Tricks
• Delays and multiple time scales, El Hihi & Bengio NIPS 1996
[Diagram: unfolded RNN with the one-step recurrence W_1 plus longer-delay recurrent connections W_3 skipping several time steps]
Backprop in Practice
Other tricks: see the Deep Learning book (in preparation, online)
Y LeCun
The Convergence of Gradient Descent
" Batch Gradient " There is an optimal learning rate " Equal to inverse 2
nd derivative
Y LeCun
Let's Look at a single linear unit
" Single unit, 2 inputs
" Quadratic loss " E(W) = 1/p ∑p (Y – W●Xp)2
" Dataset: classification: Y=-1 for blue, +1 for red.
" Hessian is covariance matrix of input vectors " H = 1/p ∑ Xp Xp
T " To avoid ill conditioning: normalize the inputs
" Zero mean " Unit variance for all variable
X1 X2
W1 W2
W0
Y LeCun
Convergence is Slow When Hessian has Different Eigenvalues
" Batch Gradient, small learning rate Batch Gradient, large learning rate
Y LeCun
Convergence is Slow When Hessian has Different Eigenvalues
" Batch Gradient, small learning rate " Stochastic Gradient: Much Faster " But fluctuates near the minimum
" Batch Gradient, small learning rate " Batch Gradient, small learning rate
Y LeCun
Multilayer Nets Have Non-Convex Objective Functions
" 1-1-1 network " Y = W1*W2*X
" trained to compute the identity function with quadratic loss " Single sample X=1, Y=1 L(W) = (1-W1*W2)^2
" Solution: W2 = 1/W2 hyperbola.
Solution Saddle point Solution
X
Z
Y
W2
W1
Y LeCun
Deep Nets with ReLUs and Max Pooling
" Stack of linear transforms interspersed with Max operators " Point-wise ReLUs:
" Max Pooling " “switches” from one layer to the next
" Input-output function " Sum over active paths " Product of all weights along the path " Solutions are hyperbolas
" Objective function is full of saddle points
14
22
3
31
W14,3
W22,14
W31,22
Z3
A Myth Has Been Debunked: Local Minima in Neural Nets; Convexity is not needed
• (Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014): On the saddle point problem for non-convex optimization
• (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS'2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
• (Choromanska, Henaff, Mathieu, Ben Arous & LeCun, AISTATS'2015): The Loss Surface of Multilayer Nets
Saddle Points
• Local minima dominate in low-D, but saddle points dominate in high-D
• Most local minima are close to the bottom (global minimum error)
Saddle Points During Training
• Oscillating between two behaviors:
  • Slowly approaching a saddle point
  • Escaping it
Low Index Critical Points
Choromanska et al & LeCun 2014, 'The Loss Surface of Multilayer Nets':
Shows that deep rectifier nets are analogous to spherical spin-glass models.
The low-index critical points of large models concentrate in a band just above the global minimum.
Piecewise Linear Nonlinearity
• Jarrett, Kavukcuoglu, Ranzato & LeCun ICCV 2009: absolute value rectification works better than tanh in the lower layers of convnets
• Nair & Hinton ICML 2010: duplicating sigmoid units with the same weights but different biases in an RBM approximates a rectified linear unit (ReLU)
• Glorot, Bordes and Bengio AISTATS 2011: using a rectifier non-linearity (ReLU) instead of tanh or softplus allows, for the first time, training very deep supervised networks without the need for unsupervised pre-training; was biologically motivated
• Krizhevsky, Sutskever & Hinton NIPS 2012: rectifiers are one of the crucial ingredients in the ImageNet breakthrough
• ReLU: f(x) = max(0, x); softplus: f(x) = log(1 + exp(x)); neuroscience motivation: leaky integrate-and-fire model
Stochastic Neurons as Regularizer: Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al 2012, arXiv)
• Dropout trick: during training multiply each neuron output by a random bit (p=0.5), during test by 0.5
• Used in deep supervised networks
• Similar to denoising auto-encoder, but corrupting every layer
• Works better with some non-linearities (rectifiers, maxout) (Goodfellow et al. ICML 2013)
• Equivalent to averaging over exponentially many architectures
• Used by Krizhevsky et al to break through ImageNet SOTA
• Also improves SOTA on CIFAR-10 (18 → 16% err)
• Knowledge-free MNIST with DBMs (.95 → .79% err)
• TIMIT phoneme classification (22.7 → 19.7% err)
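A small numpy sketch of the dropout trick as described above (multiply by random bits with p=0.5 during training, by 0.5 at test time); names are illustrative:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """Dropout applied to a layer's activations h."""
    if train:
        mask = (np.random.rand(*h.shape) < p)   # keep each unit with probability p
        return h * mask
    return h * p                                # at test time, scale by p instead
```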
Dropout Regularizer: Super-Efficient Bagging
Batch Normalization
• Standardize activations (before the nonlinearity) across the minibatch
• Backprop through this operation
• Regularizes & helps to train
x̄_k = (1/m) Σ_{i=1}^{m} x_{i,k},    (1)

σ_k² = (1/m) Σ_{i=1}^{m} (x_{i,k} − x̄_k)²,    (2)

where m is the size of the mini-batch. Using these statistics, we can standardize each feature as follows

x̂_k = (x_k − x̄_k) / √(σ_k² + ε),    (3)

where ε is a small positive constant to improve numerical stability.

However, standardizing the intermediate activations reduces the representational power of the layer. To account for this, batch normalization introduces additional learnable parameters γ and β, which respectively scale and shift the data, leading to a layer of the form

BN(x_k) = γ_k x̂_k + β_k.    (4)

By setting γ_k to σ_k and β_k to x̄_k, the network can recover the original layer representation. So, for a standard feedforward layer in a neural network

y = φ(Wx + b),    (5)

where W is the weights matrix, b is the bias vector, x is the input of the layer and φ is an arbitrary activation function, batch normalization is applied as follows

y = φ(BN(Wx)).    (6)

Note that the bias vector has been removed, since its effect is cancelled by the standardization. Since the normalization is now part of the network, the back-propagation procedure needs to be adapted to propagate gradients through the mean and variance computations as well.

At test time, we can't use the statistics of the mini-batch. Instead, we can estimate them by either forwarding several training mini-batches through the network and averaging their statistics, or by maintaining a running average calculated over each mini-batch seen during training.

3. RECURRENT NEURAL NETWORKS

Recurrent Neural Networks (RNNs) extend Neural Networks to sequential data. Given an input sequence of vectors (x_1, ..., x_T), they produce a sequence of hidden states (h_1, ..., h_T), which are computed at time step t as follows

h_t = φ(W_h h_{t−1} + W_x x_t),    (7)

where W_h is the recurrent weight matrix, W_x is the input-to-hidden weight matrix, and φ is an arbitrary activation function.

If we have access to the whole input sequence, we can use information not only from the past time steps, but also from the future ones, allowing for bidirectional RNNs [12]

→h_t = φ(→W_h →h_{t−1} + →W_x x_t),    (8)
←h_t = φ(←W_h ←h_{t+1} + ←W_x x_t),    (9)
h_t = [→h_t : ←h_t],    (10)

where [x : y] denotes the concatenation of x and y. Finally, we can stack RNNs by using h as the input to another RNN, creating deeper architectures [13]

h_t^l = φ(W_h h_{t−1}^l + W_x h_t^{l−1}).    (11)

In vanilla RNNs, the activation function φ is usually a sigmoidal function, such as the hyperbolic tangent. Training such networks is known to be particularly difficult, because of vanishing and exploding gradients [14].

3.1. Long Short-Term Memory

A commonly used recurrent structure is the Long Short-Term Memory (LSTM). It addresses the vanishing gradient problem commonly found in vanilla RNNs by incorporating gating functions into its state dynamics [6]. At each time step, an LSTM maintains a hidden vector h and a cell vector c responsible for controlling state updates and outputs. More concretely, we define the computation at time step t as follows [15]:

i_t = sigmoid(W_hi h_{t−1} + W_xi x_t)    (12)
f_t = sigmoid(W_hf h_{t−1} + W_xf x_t)    (13)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_hc h_{t−1} + W_xc x_t)    (14)
o_t = sigmoid(W_ho h_{t−1} + W_xo x_t + W_co c_t)    (15)
h_t = o_t ⊙ tanh(c_t)    (16)

where sigmoid(·) is the logistic sigmoid function, tanh is the hyperbolic tangent function, the W_h· are the recurrent weight matrices and the W_x· are the input-to-hidden weight matrices. i_t, f_t and o_t are respectively the input, forget and output gates, and c_t is the cell.

4. BATCH NORMALIZATION FOR RNNS

From equation 6, an analogous way to apply batch normalization to an RNN would be as follows:

h_t = φ(BN(W_h h_{t−1} + W_x x_t)).    (17)

However, in our experiments, when batch normalization was applied in this fashion, the model failed to learn. In un-normalized RNNs, the tied nature of the recurrent weight matrix W_h makes optimization difficult since small changes
(Ioffe&SzegedyICML2015)
Early Stopping
• Beautiful FREE LUNCH (no need to launch many different training runs for each value of the hyper-parameter "# iterations")
• Monitor validation error during training (after visiting a number of training examples that is a multiple of the validation set size)
• Keep track of the parameters with the best validation error and report them at the end
• If the error does not improve enough (with some patience), stop.
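A sketch of early stopping with patience as described above; `train_some` and `validation_error` are hypothetical helpers, and `model.params` is an assumed attribute:

```python
import copy

def early_stopping(model, patience=10):
    best_err, best_params, bad_rounds = float('inf'), None, 0
    while bad_rounds < patience:
        train_some(model)                     # e.g. one pass over a chunk of training data
        err = validation_error(model)         # monitor validation error during training
        if err < best_err:
            best_err, bad_rounds = err, 0
            best_params = copy.deepcopy(model.params)   # keep the best parameters so far
        else:
            bad_rounds += 1                   # no sufficient improvement
    model.params = best_params                # report the best parameters at the end
    return model
```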
Random Sampling of Hyperparameters (Bergstra & Bengio 2012)
• Common approach: manual + grid search
• Grid search over hyperparameters: simple & wasteful
• Random search: simple & efficient
• Independently sample each HP, e.g. learning rate ~ exp(U[log(.1), log(.0001)])
• Each training trial is iid
• If an HP is irrelevant, grid search is wasteful
• More convenient: ok to early-stop, continue further, etc.
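A small sketch of that sampling scheme; `run_trial` is a hypothetical function returning a validation error for a configuration:

```python
import numpy as np

def random_search(run_trial, n_trials=50):
    best = (float('inf'), None)
    for _ in range(n_trials):
        # Independently sample each hyperparameter, e.g. lr ~ exp(U[log(1e-4), log(1e-1)])
        config = {
            'lr': float(np.exp(np.random.uniform(np.log(1e-4), np.log(1e-1)))),
            'n_hidden': int(np.random.choice([128, 256, 512, 1024])),
            'dropout': float(np.random.uniform(0.0, 0.7)),
        }
        err = run_trial(config)               # each training trial is iid
        if err < best[0]:
            best = (err, config)
    return best
```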
Sequential Model-Based Optimization of Hyper-Parameters
• (Hutter et al JAIR 2009; Bergstra et al NIPS 2011; Thornton et al arXiv 2012; Snoek et al NIPS 2012)
• Iterate:
  • Estimate P(valid. err | hyper-params config x, D)
  • Choose an optimistic x, e.g. argmax_x P(valid. err < current min. err | x)
  • Train with config x, observe valid. err. v, D ← D ∪ {(x, v)}
Distributed Training
• Minibatches
• Large minibatches + 2nd order & natural gradient methods
• Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012)
• Data parallelism vs model parallelism
• Bottleneck: sharing weights/updates among nodes, to keep node-models from moving too far from each other
• EASGD (Zhang et al NIPS 2015) works well in practice
• Efficiently exploiting more than a few GPUs remains a challenge
Vision
Speech Recognition
[Figure: word error rate on Switchboard (log scale) vs. year (1990-2010), with the most recent points labeled "Using DL"]
The dramatic impact of Deep Learning on Speech Recognition (according to Microsoft)
Y LeCun Speech Recognition with Convolutional Nets (NYU/IBM)
" Multilingual recognizer " Multiscale input
" Large context window
Y LeCun Speech Recognition with Convolutional Nets (NYU/IBM)
" Acoustic Model: ConvNet with 7 layers. 54.4 million parameters. " Classifies acoustic signal into 3000 context-dependent subphones categories " ReLU units + dropout for last layers " Trained on GPU. 4 days of training
Y LeCun Speech Recognition with Convolutional Nets (NYU/IBM)
" Training samples. " 40 MEL-frequency Cepstral Coefficients " Window: 40 frames, 10ms each
Y LeCun Speech Recognition with Convolutional Nets (NYU/IBM)
" Convolution Kernels at Layer 1: " 64 kernels of size 9x9
• Hybrid systems, neural nets + HMMs (Bengio 1991, Bottou 1991)
• Neural net outputs scores for each arc, recognized output = labels along the best path; trained discriminatively (LeCun et al 1998)
• Connectionist Temporal Classification (Graves 2006)
• Deep Speech and attention-based end-to-end RNNs (Hannun et al 2014; Graves & Jaitly 2014; Chorowski et al NIPS 2015)
"o"
"c"
"d"
"x"
"a"
"u"
"p"
"t"
0.4
1.0
1.8
0.1
0.2
0.8
0.2
0.8
RecognitionGraph
"b"
"c"
"a"
"u"
"u"
"a"
"r" "n"
"t"
"t"
"r"
"e"
"e""p"
"r""t" "d"
"c"
"u"
"a"
"t"
"p"
"t"
0.4 0.2
0.8
0.8
0.2
0.8
Gra
ph C
ompo
sitio
n
interpretation graph
match& add
match& add
match& add
interpretations:cut (2.0)cap (0.8)cat (1.4)
grammar graph
End-to-End Training with Search
Natural Language Representations
Neural Language Models: fighting one exponential by another one!
• (Bengio et al NIPS'2000)
[Figure: neural language model architecture; the input word indices w(t−n+1), ..., w(t−2), w(t−1) are mapped through a shared look-up table C to embeddings C(w(t−n+1)), ..., C(w(t−1)), followed by a tanh hidden layer (most computation here) and a softmax output layer; the i-th output = P(w(t) = i | context)]
• Exponentially large set of generalizations: semantically close sequences
• Exponentially large set of possible contexts
Neural word embeddings: visualization directions = Learned Attributes
Analogical Representations for Free (Mikolov et al, ICLR 2013)
• Semantic relations appear as linear relationships in the space of learned representations
• King – Queen ≈ Man – Woman
• Paris – France + Italy ≈ Rome
[Figure: embedding space with Paris − France ≈ Rome − Italy]
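A minimal numpy sketch of exploiting that linear structure to answer analogy queries (illustrative; `E` is assumed to be a dict from word to embedding vector):

```python
import numpy as np

def analogy(E, a, b, c, topk=1):
    """Return the word(s) d such that a - b + c ≈ d, e.g. Paris - France + Italy ≈ Rome."""
    query = E[a] - E[b] + E[c]
    scores = {}
    for word, vec in E.items():
        if word in (a, b, c):
            continue
        # cosine similarity between the query direction and each candidate embedding
        scores[word] = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:topk]
```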
Handling Large Output Spaces
[Diagram: two-level hierarchical output: categories, then words within each category]
• Sampling "negative" examples: increase the score of the correct word and stochastically decrease all the others (see the sketch after this list)
  • Uniform sampling (Collobert & Weston, ICML 2008)
  • Importance sampling (Bengio & Senecal AISTATS 2003; Dauphin et al ICML 2011); GPU-friendly implementation (Jean et al ACL 2015)
• Decompose output probabilities hierarchically (Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007, 2009; Mikolov et al 2011)
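A rough sketch of the negative-sampling idea (one common instantiation, with a uniform negative sampler and a logistic loss); `E_out` is an assumed (vocab_size, d) array of output word vectors and `h` the hidden state:

```python
import numpy as np

def negative_sampling_step(h, target, E_out, k=10, lr=0.1):
    """Raise the score of the correct word, lower it for k sampled 'negative' words."""
    def sgd_word(w, label):
        s = 1.0 / (1.0 + np.exp(-np.dot(h, E_out[w])))   # sigmoid of the score
        E_out[w] -= lr * (s - label) * h                  # push score up (label=1) or down (label=0)
    sgd_word(target, 1.0)
    for w in np.random.randint(len(E_out), size=k):       # uniform negative sampling
        if w != target:
            sgd_word(w, 0.0)
```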
Encoder-Decoder Framework
• Intermediate representation of meaning = 'universal representation'
• Encoder: from word sequence to sentence representation
• Decoder: from representation to word sequence distribution
[Diagram: for bitext data, French encoder → English decoder (French sentence → English sentence); for unilingual data, English encoder → English decoder (English sentence → English sentence)]
(Cho et al EMNLP 2014; Sutskever et al NIPS 2014)
Attention Mechanism for Deep Learning
• Consider an input (or intermediate) sequence or image
• Consider an upper-level representation, which can choose «where to look» by assigning a weight or probability to each input position, as produced by an MLP applied at each position
[Diagram: softmax over lower-level locations, conditioned on context at lower and higher locations]
(Bahdanau, Cho & Bengio, arXiv Sept. 2014), following up on (Graves 2013) and (Larochelle & Hinton NIPS 2010)
• Soft attention (backprop) vs stochastic hard attention (RL)
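A minimal numpy sketch of soft attention as just described: a small network (here a one-hidden-layer scorer for brevity) scores each input position given the higher-level context, a softmax turns scores into weights, and the output is the weighted sum; all names are illustrative:

```python
import numpy as np

def soft_attention(annotations, context, W_a, W_c, v):
    """annotations: (T, d) lower-level vectors; context: higher-level state vector."""
    scores = np.array([v @ np.tanh(W_a @ a + W_c @ context) for a in annotations])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over input positions
    return weights @ annotations, weights     # weighted sum + attention weights
```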
End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
• Reached the state of the art in one year, from scratch
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)
IWSLT 2015 – Luong & Manning (2015) TED talk MT, English-German
System       BLEU (cased)   HTER (HE set)
Stanford     30.85          16.16
Karlsruhe    26.18          22.67
Edinburgh    26.02          21.84
Heidelberg   24.96          23.42
PJAIT        22.51          28.18
Baseline     20.08          n/a
(annotation on the HTER chart: -26%)
Image-to-Text: Caption Generation with Attention
[Diagram: a convolutional neural network produces annotation vectors h_j; an attention mechanism with weights a_j (Σ_j a_j = 1) feeds a recurrent state z_i from which words are sampled, e.g. f = (a, man, is, jumping, into, a, lake, .)]
(Xu et al, ICML 2015)
Following many papers on caption generation, including (Kiros et al 2014; Mao et al 2014; Vinyals et al 2014; Donahue et al 2014; Karpathy & Li 2014; Fang et al 2014)
Paying Attention to Selected Parts of the Image While Uttering Words

The Good

And the Bad
Y LeCun
But How can Neural Nets Remember Things?
" Recurrent networks cannot remember things for very long " The cortex only remember things for 20 seconds
" We need a “hippocampus” (a separate memory module) " LSTM [Hochreiter 1997], registers
" Memory networks [Weston et 2014] (FAIR), associative memory
" NTM [Graves et al. 2014], “tape”.
Recurrent net memory
Attention mechanism
Y LeCun Memory Networks Enable REASONING
" Add a short-term memory to a network
Results on Question Answering Task
http://arxiv.org/abs/1410.3916
(Weston, Chopra, Bordes 2014)
Y LeCun
End-to-End Memory Network
" [Sukhbataar, Szlam, Weston, Fergus NIPS 2015, ArXiv:1503.08895] " Weakly-supervised MemNN: no need to tell which memory location to use.
Y LeCun
Stack-Augmented RNN: learning “algorithmic” sequences
" [Joulin & Mikolov, ArXiv:1503.01007]
Sparse Access Memory for Long-Term Dependencies
• A mental state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
• Forgetting = vanishing gradient
• Memory = larger state, reducing the need for forgetting/vanishing
[Diagram: passive copy vs. access of memory locations over time]
How do humans generalize from very few examples?
• They transfer knowledge from previous learning:
  • Representations
  • Explanatory factors
• Previous learning from: unlabeled data + labels for other tasks
• Prior: shared underlying explanatory factors, in particular between P(x) and P(Y|x)
[Figure: 2-D visualizations of raw data and of representations after 1, 2, 3 and 4 layers]
ICML'2011 workshop on Unsup. & Transfer Learning
NIPS'2011 Transfer Learning Challenge; paper: ICML'2012
Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: won by Unsupervised Deep Learning
Multi-Task Learning
• Generalizing better to new tasks (tens of thousands!) is crucial to approach AI
• Example: speech recognition, sharing across multiple languages
• Deep architectures learn good intermediate representations that can be shared across tasks (Collobert & Weston ICML 2008, Bengio et al AISTATS 2011)
• Good representations that disentangle underlying factors of variation make sense for many tasks because each task concerns a subset of the factors
[Diagram: raw input x feeds shared lower layers; task-specific upper layers produce outputs y1, y2, y3 for Task A, Task B, Task C]
• Prior: shared underlying explanatory factors between tasks
• E.g. a dictionary, with intermediate concepts re-used across many definitions
Google Image Search
Joint Embedding: different object types represented in the same space
Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS'2010, JMLR 2010, MLJ 2010)
WSABIE objective function:
Combining Multiple Sources of Evidence with Shared Representations
• Traditional ML: data = matrix
• Relational learning: multiple sources, different tuples of variables
• Share representations of the same types across data sources
• Shared learned representations help propagate information among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, ImageNet... (Bordes et al AISTATS 2012, MLJ 2013)
• FACTS = DATA
• Deduction = Generalization
[Diagram: relational tuples such as (person, url, event) and (url, words, history) modeled as P(person, url, event) and P(url, words, history), with shared representations h1, h2, h3 of the variables X1, X2, X3 and a selection switch to the output Y]
Multi-Task / Multimodal Learning with Different Inputs for Different Tasks
E.g. speaker adaptation, multimodal input...
Unsupervised multimodal case: (Srivastava & Salakhutdinov NIPS 2012)
x and y represent different modalities, e.g., image, text, sound... Can provide 0-shot generalization to new categories (values of y)
[Diagram: (x, y) pairs in the training set; x-representation (encoder) function f_x and y-representation (encoder) function f_y map x-space and y-space into representation spaces h_x = f_x(x), h_y = f_y(y); relationships between embedded points within one domain, and maps between representation spaces, relate a test pair (x_test, y_test)]
Maps Between Representations
Unsupervised Representation Learning
Why Unsupervised Learning?
• Recent progress mostly in supervised DL
• Real challenges for unsupervised DL
• Potential benefits:
  • Exploit tons of unlabeled data
  • Answer new questions about the variables observed
  • Regularizer – transfer learning – domain adaptation
  • Easier optimization (divide and conquer)
  • Joint (structured) outputs
Why Latent Factors & Unsupervised Representation Learning? Because of Causality.
• If the Ys of interest are among the causal factors of X, then
  P(Y|X) = P(X|Y) P(Y) / P(X)
  is tied to P(X) and P(X|Y), and P(X) is defined in terms of P(X|Y), i.e.
• The best possible model of X (unsupervised learning) MUST involve Y as a latent factor, implicitly or explicitly.
• Representation learning SEEKS the latent variables H that explain the variations of X, making it likely to also uncover Y.
On causal and anticausal learning, (Janzing et al ICML 2012)
[Figure: a multi-modal density p(x); each mode corresponds to a class y]
If Y is a Cause of X, Semi-Supervised Learning Works
• Just observing the x-density reveals the causes y (cluster ID)
• After learning p(x) as a mixture, a single labeled example per class suffices to learn p(y|x)
Invariance & Disentangling Underlying Factors
• Invariant features
• Which invariances?
• Alternative: learning to disentangle factors, i.e. keep all the explanatory factors in the representation
• Good disentangling → avoid the curse of dimensionality
• Emerges from representation learning (Goodfellow et al. 2009, Glorot et al. 2011)
Boltzmann Machines / Undirected Graphical Models
• Boltzmann machines: (Hinton 84)
• Iterative sampling scheme = stochastic relaxation, Monte-Carlo Markov chain
• Training requires sampling: might take a lot of time to converge if there are well-separated modes
Restricted Boltzmann Machine (RBM)
• A building block (single-layer) for deep architectures
• Bipartite undirected graphical model (observed units x, hidden units h)
(Smolensky 1986, Hinton et al 2006)
[Diagram: block Gibbs sampling: h ~ P(h|x), x' ~ P(x|h), h' ~ P(h|x'), ...]
Capturing the Shape of the Distribution: Positive & Negative Samples
• Observed (+) examples push the energy down
• Generated / dream / fantasy (−) samples / particles push the energy up
[Diagram: energy landscape with X+ (data) pushed down and X− (samples) pushed up]
Pr(x) = e^{-Energy(x)} / Z
Boltzmann machines, undirected graphical models, RBMs, energy-based models
Yann LeCun
Eight Strategies to Shape the Energy Function
• 1. Build the machine so that the volume of low energy stuff is constant: PCA, K-means, GMM, square ICA
• 2. Push down the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function)
• 3. Push down the energy of data points, push up on chosen locations: contrastive divergence, ratio matching, noise contrastive estimation, minimum probability flow
• 4. Minimize the gradient and maximize the curvature around data points: score matching
• 5. Train a dynamical system so that the dynamics goes to the manifold: denoising auto-encoder, diffusion inversion (nonequilibrium dynamics)
• 6. Use a regularizer that limits the volume of space that has low energy: sparse coding, sparse auto-encoder, PSD
• 7. If E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible: contracting auto-encoder, saturating auto-encoder
• 8. Adversarial training: the generator tries to fool a real/synthetic classifier.
Auto-Encoders
[Diagram: input x → Encoder f → code h → Decoder g → reconstruction r]
Denoising auto-encoder: during training, the input is corrupted stochastically, and the auto-encoder must learn to guess the distribution of the missing information.
Probabilistic reconstruction criterion: reconstruction log-likelihood = -log P(x|h)
[Diagram: directed model with prior P(h), decoder P(x|h) and approximate inference Q(h|x)]
• Iterative sampling / undirected models: RBM, denoising auto-encoder
• Ancestral sampling / directed models: Helmholtz machine, VAE, etc. (Hinton et al 1995)
Yann LeCun
Predictive Sparse Decomposition (PSD)
• Train a "simple" feed-forward function to predict the result of a complex optimization on the data points of interest
[Diagram: generative model (Factor A): input Y → Decoder from latent variable Z, with a distance term; fast feed-forward model (Factor A'): Encoder predicting Z from Y, with a distance term]
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
1. Find the optimal Zi for all Yi; 2. Train the Encoder to predict Zi from Yi
Energy = reconstruction_error + code_prediction_error + code_sparsity
Probabilistic interpretation of auto-encoders
• Manifold&probabilis4cinterpreta4onsofauto-encoders• DenoisingScoreMatchingasinduc4veprinciple
• Es4ma4ngthegradientoftheenergyfunc4on
• SamplingviaMarkovchain
• Varia4onalauto-encoders
111
(Alain&BengioICLR2013)
(BengioetalNIPS2013;Sohl-DicksteinetalICML2015)
(GregoretalarXiv2015)
(Vincent2011)
(Kingma&WellingICLR2014)
Denoising Auto-Encoder
• Learns a vector field pointing towards higher-probability directions (Alain & Bengio 2013)
• Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noise → 0]
• Prior: examples concentrate near a lower-dimensional "manifold"
[Diagram: corrupted inputs are mapped back towards the manifold]
reconstruction(x) - x ∝ σ² ∂log p(x)/∂x
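A minimal numpy sketch of one denoising auto-encoder training step: corrupt the input, reconstruct the clean input, and take a gradient step on the squared reconstruction error (a tied-weight, one-hidden-layer illustration; names are assumptions):

```python
import numpy as np

def dae_step(x, W, b, c, sigma=0.1, lr=0.01):
    """One SGD step of a tied-weight denoising auto-encoder on a single example x."""
    x_tilde = x + sigma * np.random.randn(*x.shape)    # corrupt the input
    h = np.tanh(W @ x_tilde + b)                       # encoder
    r = W.T @ h + c                                    # decoder (tied weights): reconstruction
    err = r - x                                        # compare with the *clean* input
    # Backprop of 0.5*||r - x||^2 through the decoder and encoder
    dh = (W @ err) * (1.0 - h ** 2)
    W -= lr * (np.outer(h, err) + np.outer(dh, x_tilde))
    b -= lr * dh
    c -= lr * err
    return 0.5 * np.sum(err ** 2)
```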
Regularized Auto-Encoders Learn a Vector Field that Estimates a Gradient Field (Alain & Bengio ICLR 2013)

Denoising Auto-Encoder Markov Chain
[Diagram: alternate corrupt (X_t → X̃_t) and denoise/sample (X̃_t → X_{t+1}) steps]
The corrupt-encode-decode-sample Markov chain associated with a DAE samples from a consistent estimator of the data generating distribution
[Diagram: reconstruction function r(x) versus x near training points x1, x2, x3]
Preference for Locally Constant Features
• Denoising or contractive auto-encoder on 1-D input:
E[||r(x + σz) - x||²] ≈ E[||r(x) - x||²] + σ² ||∂r(x)/∂x||²_F
reconstruction(x) ≈ x - ∂E(x)/∂x
[Diagram: energy E(x) with local minima at the training points]
[Diagram: deep directed generative model with prior P(h3), conditionals P(h2|h3), P(h1|h2), P(x|h1), approximate inference Q(h1|x), Q(h2|h1), Q(h3|h2), and data distribution Q(x)]
Helmholtz Machines (Hinton et al 1995) and Variational Auto-Encoders (VAEs)
• Parametric approximate inference
• Successors of the Helmholtz machine (Hinton et al '95)
• Maximize the variational lower bound on the log-likelihood:
  min KL(Q(x, h) || P(x, h)), where Q(x) = data distribution,
  or equivalently
  max Σ_x Q(h|x) log [P(x, h) / Q(h|x)] = max Σ_x Q(h|x) log P(x|h) - KL(Q(h|x) || P(h))
[Diagram: encoder = inference network Q(h|x), decoder = generator P(x|h)]
(Kingma & Welling 2013, ICLR 2014) (Gregor et al ICML 2014; Rezende et al ICML 2014) (Mnih & Gregor ICML 2014; Kingma et al NIPS 2014)
Geometric Interpretation
• Encoder: map the input to a new space where the data has a simpler distribution
• Add noise between the encoder output and the decoder input: train the decoder to be robust to the mismatch between the encoder output and the prior output.
[Diagram: contractive encoder f maps x to f(x) in a space with prior P(h); decoder g maps back; Q(h|x) is centered around f(x)]
DRAW: Sequential Variational Auto-Encoder with Attention
• Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer, and use an attention mechanism to choose where to do so.
(Gregor et al of Google DeepMind, arXiv 1502.04623, 2015)
DRAW: A Recurrent Neural Network For Image Generation
Karol Gregor, Ivo Danihelka, Alex Graves, Daan Wierstra
Google DeepMind
AbstractThis paper introduces the Deep Recurrent Atten-
tive Writer (DRAW) neural network architecturefor image generation. DRAW networks combinea novel spatial attention mechanism that mimicsthe foveation of the human eye, with a sequentialvariational auto-encoding framework that allowsfor the iterative construction of complex images.The system substantially improves on the stateof the art for generative models on MNIST, and,when trained on the Street View House Numbersdataset, it generates images that cannot be distin-guished from real data with the naked eye.
1. Introduction
A person asked to draw, paint or otherwise recreate a visual scene will naturally do so in a sequential, iterative fashion, reassessing their handiwork after each modification. Rough outlines are gradually replaced by precise forms, lines are sharpened, darkened or erased, shapes are altered, and the final picture emerges. Most approaches to automatic image generation, however, aim to generate entire scenes at once. In the context of generative neural networks, this typically means that all the pixels are conditioned on a single latent distribution (Dayan et al., 1995; Hinton & Salakhutdinov, 2006; Larochelle & Murray, 2011). As well as precluding the possibility of iterative self-correction, the “one shot” approach is fundamentally difficult to scale to large images. The Deep Recurrent Attentive Writer (DRAW) architecture represents a shift towards a more natural form of image construction, in which parts of a scene are created independently from others, and approximate sketches are successively refined.
The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function is a variational upper bound on the log-likelihood of the data. It therefore belongs to the family of variational auto-encoders, a recently emerged hybrid of deep learning and variational inference that has led to significant advances in generative modelling (Gregor et al., 2014; Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014; Salimans et al., 2014). Where DRAW differs from its siblings is that, rather than generating images in a single pass, it iteratively constructs scenes through an accumulation of modifications emitted by the decoder, each of which is observed by the encoder.

Figure 1. A trained DRAW network generating MNIST digits. Each row shows successive stages in the generation of a single digit. Note how the lines composing the digits appear to be “drawn” by the network. The red rectangle delimits the area attended to by the network at each time-step, with the focal precision indicated by the width of the rectangle border.

An obvious correlate of generating images step by step is the ability to selectively attend to parts of the scene while ignoring others. A wealth of results in the past few years suggest that visual structure can be better captured by a
sequence of partial glimpses, or foveations, than by a single sweep through the entire image (Larochelle & Hinton, 2010; Denil et al., 2012; Tang et al., 2013; Ranzato, 2014; Zheng et al., 2014; Mnih et al., 2014; Ba et al., 2014; Sermanet et al., 2014). The main challenge faced by sequential attention models is learning where to look, which can be addressed with reinforcement learning techniques such as policy gradients (Mnih et al., 2014). The attention model in DRAW, however, is fully differentiable, making it possible to train with standard backpropagation. In this sense it resembles the selective read and write operations developed for the Neural Turing Machine (Graves et al., 2014).

The following section defines the DRAW architecture, along with the loss function used for training and the procedure for image generation. Section 3 presents the selective attention model and shows how it is applied to reading and modifying images. Section 4 provides experimental results on the MNIST, Street View House Numbers and CIFAR-10 datasets, with examples of generated images; and concluding remarks are given in Section 5. Lastly, we would like to direct the reader to the video accompanying this paper (https://www.youtube.com/watch?v=Zt-7MI9eKEo) which contains examples of DRAW networks reading and generating images.
2. The DRAW Network
The basic structure of a DRAW network is similar to that of other variational auto-encoders: an encoder network determines a distribution over latent codes that capture salient information about the input data; a decoder network receives samples from the code distribution and uses them to condition its own distribution over images. However there are three key differences. Firstly, both the encoder and decoder are recurrent networks in DRAW, so that a sequence of code samples is exchanged between them; moreover the encoder is privy to the decoder's previous outputs, allowing it to tailor the codes it sends according to the decoder's behaviour so far. Secondly, the decoder's outputs are successively added to the distribution that will ultimately generate the data, as opposed to emitting this distribution in a single step. And thirdly, a dynamically updated attention mechanism is used to restrict both the input region observed by the encoder, and the output region modified by the decoder. In simple terms, the network decides at each timestep “where to read” and “where to write” as well as “what to write”. The architecture is sketched in Fig. 2, alongside a conventional, feedforward variational auto-encoder.
2.1. Network Architecture
Let RNN^enc be the function enacted by the encoder network at a single time-step.

[Figure 2 diagram. Left: conventional VAE, x → encoder FNN → sample z ~ Q(z|x) → decoder FNN → P(x|z). Right: DRAW, at each step a read from x feeds the encoder RNN (state h^enc_{t−1}), a latent z_t ~ Q(z_t | x, z_{1:t−1}) is sampled and passed to the decoder RNN (state h^dec_{t−1}), whose write updates the canvas c_{t−1} → c_t → … → c_T, from which P(x | z_{1:T}) is computed; the two columns are labelled encoding (inference) and decoding (generative model).]

Figure 2. Left: Conventional Variational Auto-Encoder. During generation, a sample z is drawn from a prior P(z) and passed through the feedforward decoder network to compute the probability of the input P(x|z) given the sample. During inference the input x is passed to the encoder network, producing an approximate posterior Q(z|x) over latent variables. During training, z is sampled from Q(z|x) and then used to compute the total description length KL(Q(Z|x) ‖ P(Z)) − log(P(x|z)), which is minimised with stochastic gradient descent. Right: DRAW Network. At each time-step a sample z_t from the prior P(z_t) is passed to the recurrent decoder network, which then modifies part of the canvas matrix. The final canvas matrix c_T is used to compute P(x | z_{1:T}). During inference the input is read at every time-step and the result is passed to the encoder RNN. The RNNs at the previous time-step specify where to read. The output of the encoder RNN is used to compute the approximate posterior over the latent variables at that time-step.

The output of RNN^enc at time t is the encoder hidden vector h^enc_t. Similarly the output of the decoder RNN^dec at t is the hidden vector h^dec_t. In general the encoder and decoder may be implemented by any recurrent neural network. In our experiments we use the Long Short-Term Memory architecture (LSTM; Hochreiter & Schmidhuber (1997)) for both, in the extended form with forget gates (Gers et al., 2000). We favour LSTM due to its proven track record for handling long-range dependencies in real sequential data (Graves, 2013; Sutskever et al., 2014). Throughout the paper, we use the notation b = L(a) to denote a linear weight matrix from the vector a to the vector b.

At each time-step t, the encoder receives input from both the image x and from the previous decoder hidden vector h^dec_{t−1}. The precise form of the encoder input depends on a read operation, which will be defined in the next section. The output h^enc_t of the encoder is used to parameterise a distribution Q(Z_t | h^enc_t) over the latent vector z_t. In our experiments the latent distribution is a diagonal Gaussian N(Z_t | μ_t, σ_t):

    μ_t = L(h^enc_t)            (1)
    σ_t = exp( L(h^enc_t) )     (2)
Bernoulli distributions are more common than Gaussians
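A schematic Python sketch of the sequential generation loop described above; decoder_rnn, write and prior_sample are hypothetical placeholders, and only the canvas-accumulation structure is meant to be faithful to the DRAW description.

    import numpy as np

    def draw_generate(decoder_rnn, write, prior_sample, T, canvas_shape, rng):
        """Schematic DRAW generation: at each step, sample z_t from the prior,
        update the decoder RNN state, and add its 'write' output to the canvas."""
        canvas = np.zeros(canvas_shape)
        h_dec = None                             # decoder hidden state
        for t in range(T):
            z_t = prior_sample(rng)              # z_t ~ P(z_t)
            h_dec = decoder_rnn(z_t, h_dec)      # recurrent update
            canvas = canvas + write(h_dec)       # successive additions to the canvas
        return 1.0 / (1.0 + np.exp(-canvas))     # sigmoid(c_T) gives pixel Bernoulli means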
DRAW Samples of SVHN Images: generated samples vs training nearest neighbor
119
Figure 8. Generated MNIST images with two digits.
with attention it constructs the digit by tracing the lines—much like a person with a pen.
4.3. MNIST Generation with Two Digits
The main motivation for using an attention-based generative model is that large images can be built up iteratively, by adding to a small part of the image at a time. To test this capability in a controlled fashion, we trained DRAW to generate images with two 28 × 28 MNIST images chosen at random and placed at random locations in a 60 × 60 black background. In cases where the two digits overlap, the pixel intensities were added together at each point and clipped to be no greater than one. Examples of generated data are shown in Fig. 8. The network typically generates one digit and then the other, suggesting an ability to recreate composite scenes from simple pieces.
4.4. Street View House Number Generation
MNIST digits are very simplistic in terms of visual structure, and we were keen to see how well DRAW performed on natural images. Our first natural image generation experiment used the multi-digit Street View House Numbers dataset (Netzer et al., 2011). We used the same preprocessing as (Goodfellow et al., 2013), yielding a 64 × 64 house number image for each training example. The network was then trained using 54 × 54 patches extracted at random locations from the preprocessed images. The SVHN training set contains 231,053 images, and the validation set contains 4,701 images.

Figure 9. Generated SVHN images. The rightmost column shows the training images closest (in L2 distance) to the generated images beside them. Note that the two columns are visually similar, but the numbers are generally different.
A major challenge with natural image generation is how to model the pixel colours. In this work we applied a simple approximation where the normalised intensity of each of the RGB channels was treated as an independent Bernoulli probability. This approach has the advantage of being easy to implement and train; however it does mean that the loss function used for training does not match the true compression cost of the data.

The house number images generated by the network are highly realistic, as shown in Figs. 9 and 10. Fig. 11 reveals that, despite the long training time, the DRAW network underfit the SVHN training data.
4.5. Generating CIFAR Images
The most challenging dataset we applied DRAW to was the CIFAR-10 collection of natural images (Krizhevsky, 2009). CIFAR-10 is very diverse, and with only 50,000 training examples it is very difficult to generate realistic-looking objects without overfitting (in other words, without copying from the training set). Nonetheless the images in Fig. 12 demonstrate that DRAW is able to capture much of the shape, colour and composition of real photographs.
Nearest training example for the last column of samples
Adversarial nets framework
120
GAN: Generative Adversarial Networks
[Diagram: a Random Vector feeds the Generator Network, which produces a Fake Image; a Random Index selects a Real Image from the Training Set; the Discriminator Network is trained to tell the two apart.]
(Goodfellow et al NIPS 2014)
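A schematic Python sketch of one alternating update of the adversarial game in the diagram; G, D, update_D and update_G are hypothetical placeholders (in a real implementation the updates would be gradient steps on the respective parameters).

    import numpy as np

    def gan_training_step(G, D, real_batch, z_dim, rng, update_D, update_G):
        """One alternating update of the GAN minimax game (Goodfellow et al 2014)."""
        z = rng.normal(size=(len(real_batch), z_dim))    # random vectors
        fake_batch = G(z)                                # generator network -> fake images

        # Discriminator loss: push D(real) -> 1 and D(fake) -> 0
        d_loss = -np.mean(np.log(D(real_batch)) + np.log(1.0 - D(fake_batch)))
        update_D(d_loss)

        # Generator loss (non-saturating form): push D(G(z)) -> 1
        g_loss = -np.mean(np.log(D(G(z))))
        update_G(g_loss)
        return d_loss, g_loss

The non-saturating generator loss is the practical variant suggested in the original GAN paper; the theoretical analysis uses the symmetric minimax objective.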
Laplacian Pyramid
121
(Denton, Chintala, et al 2015)
LAPGAN: Laplacian Pyramid of Generative Adversarial Networks
http://soumith.ch/eyescream/
LAPGAN results
• 40% of samples mistaken by humans for real photos
• Sharper images than max. lik. proxies (which min. KL(data‖model)):
  • GAN objective = compromise between KL(data‖model) and KL(model‖data)
122
(Denton + Chintala, et al 2015)
LAPGAN: Visual Turing Test
Convolutional GANs
Strided convolutions, batch normalization, only convolutional layers, ReLU and leaky ReLU
123
(Radford et al, arXiv 1511.06434)
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate in only one epoch.

Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated textures across multiple samples.

4.3 IMAGENET-1K

We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.
Space-Filling in Representation-Space
Deeper representations → abstractions → disentangling. Manifolds are expanded and flattened.
[Figure: MNIST 3's manifold and 9's manifold; linear interpolation in pixel space (X-space) compared with linear interpolation at layer 1 and at layer 2 of the representation (H-space).]
(Bengio et al ICML 2013)
GAN: Interpolating in Latent Space
If the model is good (unfolds the manifold), interpolating between latent values yields plausible images.
125
Figure 4: Top rows: Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.
scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig. 5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.
6.3 MANIPULATING THE GENERATOR REPRESENTATION
6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS
In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggest that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.
Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors creating a new vector Y. The center sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.
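A small numpy sketch of the two operations illustrated in these figures, assuming a trained generator G that maps latent vectors to images (all names hypothetical).

    import numpy as np

    def interpolate_latent(G, z_a, z_b, steps=9):
        """Images along the straight line between two latent points."""
        alphas = np.linspace(0.0, 1.0, steps)
        return [G((1.0 - a) * z_a + a * z_b) for a in alphas]

    def vector_arithmetic(G, z_pos, z_neg, z_base, rng, n_samples=8, scale=0.25):
        """e.g. mean(smiling woman) - mean(neutral woman) + mean(neutral man)."""
        y = z_pos.mean(axis=0) - z_neg.mean(axis=0) + z_base.mean(axis=0)
        noise = rng.uniform(-scale, scale, size=(n_samples,) + y.shape)
        return [G(y)] + [G(y + n) for n in noise]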
Supervised and Unsupervised in One Learning Rule?
" Boltzmann Machines have all the right properties [Hinton 1831] [OK, OK 1983 ;-]" Sup & unsup, generative & discriminative in one simple/local learning rule" Feedback circuit reconstructs and propagates virtual hidden targets" But they don't really work (or at least they don't scale).
" Problem: the feedforward path eliminates information" If the feedforward path is invariant, then" the reconstruction path is a one-to-many mapping
" Usual solution: sampling. But I'm allergic.
[Diagram: a many-to-one feedforward path from the input to a predicted "what"; a one-to-many reconstruction path back to the input; cost terms attached to both the prediction and the reconstruction.]
Deep Semi-Supervised Learning
• Unlike unsupervised pre-training, modern approaches optimize jointly the supervised and unsupervised objectives
• Discriminative RBMs (Larochelle & Bengio, ICML 2008)
• Semi-Supervised VAE (Kingma et al, NIPS 2014)
• Ladder Network (Rasmus et al, NIPS 2015)
127
Semisupervised Learning with Ladder Network
• Jointly trained stack of denoising auto-encoders with gated lateral connections and a semi-supervised objective
128
(Rasmus et al, NIPS 2015)
[Figure: Ladder network with L = 2. A corrupted encoder path x̃ → z̃(1) → z̃(2) → ỹ with Gaussian noise N(0, σ²) injected at each layer, a clean encoder path x → z(1) → z(2) → y sharing the mappings f(1), f(2), and a decoder ẑ(2) → ẑ(1) → x̂ built from denoising functions g(2), g(1), g(0) with layer-wise denoising costs C_d(2), C_d(1), C_d(0).]

Figure 2: A conceptual illustration of the Ladder network when L = 2. The feedforward path (x → z(1) → z(2) → y) shares the mappings f(l) with the corrupted feedforward path, or encoder (x̃ → z̃(1) → z̃(2) → ỹ). The decoder (z̃(l) → ẑ(l) → x̂) consists of denoising functions g(l) and has cost functions C_d(l) on each layer trying to minimize the difference between ẑ(l) and z(l). The output y of the encoder can also be trained using supervised learning.

Algorithm 1 Calculation of the output and cost function of the Ladder network
Require: x(n)
  # Corrupted encoder and classifier
  h̃(0) ← z̃(0) ← x(n) + noise
  for l = 1 to L do
    z̃_pre(l) ← W(l) h̃(l−1)
    μ̃(l) ← batchmean(z̃_pre(l))
    σ̃(l) ← batchstd(z̃_pre(l))
    z̃(l) ← batchnorm(z̃_pre(l)) + noise
    h̃(l) ← activation(γ(l) ⊙ (z̃(l) + β(l)))
  end for
  P(ỹ | x) ← h̃(L)
  # Clean encoder (for denoising targets)
  h(0) ← z(0) ← x(n)
  for l = 1 to L do
    z(l) ← batchnorm(W(l) h(l−1))
    h(l) ← activation(γ(l) ⊙ (z(l) + β(l)))
  end for
  # Final classification:
  P(y | x) ← h(L)
  # Decoder and denoising
  for l = L to 0 do
    if l = L then
      u(L) ← batchnorm(h̃(L))
    else
      u(l) ← batchnorm(V(l) ẑ(l+1))
    end if
    ∀i: ẑ_i(l) ← g(z̃_i(l), u_i(l))            # Eq. (1)
    ∀i: ẑ_i,BN(l) ← (ẑ_i(l) − μ_i(l)) / σ_i(l)
  end for
  # Cost function C for training:
  C ← 0
  if t(n) then
    C ← −log P(ỹ = t(n) | x)
  end if
  C ← C + Σ_{l=1}^{L} λ_l ‖z(l) − ẑ_BN(l)‖²    # Eq. (2)
Semi-supervised objective: C = −log P(ỹ = t(n) | x) + Σ_{l=1}^{L} λ_l ‖z(l) − ẑ_BN(l)‖²
They also use Batch Normalization
1% error on PI-MNIST with 100 labeled examples (Pezeshki et al arXiv 1511.06430)
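A compact Python sketch of the semi-supervised cost assembled by Algorithm 1 above (array names hypothetical): a supervised cross-entropy term on labeled examples plus layer-wise denoising costs weighted by λ_l.

    import numpy as np

    def ladder_cost(log_p_y_given_x, target, z_clean, z_hat_bn, lambdas):
        """C = -log P(y~ = t | x)  +  sum_l lambda_l * ||z(l) - z_hat_BN(l)||^2"""
        C = 0.0
        if target is not None:                        # labeled example
            C += -log_p_y_given_x[target]             # supervised cross-entropy
        for z_l, zhat_l, lam in zip(z_clean, z_hat_bn, lambdas):
            C += lam * np.sum((z_l - zhat_l) ** 2)    # per-layer denoising cost
        return C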
Yann LeCun
Stacked What-Where Auto-Encoder (SWWAE)
[Zhao, Mathieu, LeCun arXiv:1506.02351]
[Diagram: input → encoder → predicted output, compared with the desired output via a loss; a decoder path produces a reconstruction of the input with its own reconstruction loss.]
A bit like a ConvNet paired with a DeConvNet
Conclusions & Challenges
130
Learning « How the world ticks »
• So long as our machine learning models «cheat» by relying only on surface statistical regularities, they remain vulnerable to out-of-distribution examples
• Humans generalize better than other animals by implicitly having a more accurate internal model of the underlying causal relationships
• This allows one to predict future situations (e.g., the effect of planned actions) that are far from anything seen before, an essential component of reasoning, intelligence and science
131
Learning Multiple Levels of Abstraction
• The big payoff of deep learning is to allow learning higher levels of abstraction
• Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer
132
Challenges & Open Problems
• Unsupervised learning
  • How to evaluate?
• Long-term dependencies
• Natural language understanding & reasoning
• More robust optimization (or easier to train architectures)
• Distributed training (that scales) & specialized hardware
• Bridging the gap to biology
• Deep reinforcement learning
133
A More Scientific Approach is Needed, not Just Building Better Systems