CENG 783
Special topics in
Deep Learning
Week 13
Sinan Kalkan
© AlchemyAPI
Why do we embed words?
• 1-of-n encoding is not suitable to learn from:
It is sparse
Similar words have different representations
Compare the pixel-based representation of images: Similar images/objects have similar pixels
• Embedding words in a map allows
Encoding them with fixed-length vectors
“Similar” words having similar representations
Allows complex reasoning between words:
king - man + woman = queen
Table: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
Two different ways to train
1. Using context to predict a target word (~ continuous bag-of-words)
2. Using a word to predict a target context (skip-gram)
Skip-gram produces more accurate results on large datasets
• If the vector for a word cannot predict the context, the mapping to the vector space is adjusted
• Since similar words should predict the same or similar contexts, their vector representations should end up being similar
http://deeplearning4j.org/word2vec
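To make the two training modes concrete, here is a minimal skip-gram sketch in PyTorch (full softmax, no negative sampling). The vocabulary size, embedding dimension and the random (center, context) pairs are illustrative placeholders, not part of word2vec itself.

```python
# Minimal skip-gram sketch: use the center word's embedding to predict its context word.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64                 # illustrative sizes
pairs = torch.randint(0, vocab_size, (512, 2))   # placeholder (center, context) index pairs

embed = nn.Embedding(vocab_size, embed_dim)      # the word vectors we want to learn
out = nn.Linear(embed_dim, vocab_size)           # predicts the context word from the embedding
opt = torch.optim.Adam(list(embed.parameters()) + list(out.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    logits = out(embed(pairs[:, 0]))             # center word -> scores over the vocabulary
    loss = loss_fn(logits, pairs[:, 1])          # vectors are adjusted when the context is mispredicted
    opt.zero_grad(); loss.backward(); opt.step()
```

The CBOW variant simply swaps the roles: it averages the embeddings of the context words and predicts the center word.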
Example: Image Captioning
Fig: https://github.com/karpathy/neuraltalk2
Overview
Pre-trained word embedding is also used
Pre-trained CNN (e.g., on ImageNet)
Image: Karpathy
Example: Neural Machine Translation
Neural Machine Translation
• Model:
Sutskever et al., 2014 (slide: Haitham Elmarakeby)
Each box is an LSTM or GRU cell.
Today
• Finalize RNNs:
Echo State Networks
Time Delay Networks
• Neural Turing Machines
• Autoencoders
• NOTES:
Final Exam date: 16 January, 17:00
Lecture on 31st of December (Monday)
Project deadline (w/o Incomplete period): 26th of January
Project deadline (w Incomplete period): 2nd of February
On deep learning this week
• https://arxiv.org/pdf/1812.08775.pdf
On deep learning this week
• https://openreview.net/pdf?id=B1l6qiR5F7
On deep learning this week
• https://openreview.net/pdf?id=S1xq3oR5tQ
Echo State Networks
Reservoir Computing
Motivation
• “Schiller and Steil (2005) also showed that in traditional training methods for RNNs, where all weights (not only the output weights) are adapted, the dominant changes are in the output weights. In cognitive neuroscience, a related mechanism has been investigated by Peter F. Dominey in the context of modelling sequence processing in mammalian brains, especially speech recognition in humans (e.g., Dominey 1995, Dominey, Hoen and Inui 2006). Dominey was the first to explicitly state the principle of reading out target information from a randomly connected RNN. The basic idea also informed a model of temporal input discrimination in biological neural networks (Buonomano and Merzenich 1995).”
http://www.scholarpedia.org/article/Echo_state_network
Echo State Networks (ESN)
• Reservoir of a set of neurons
Randomly initialized and fixed
Run input sequence through the network and keep the activations of the reservoir neurons
Calculate the “readout” weights using linear regression.
• Has the benefits of recurrent connections/networks
• No vanishing-gradient problem (Li et al., 2015).
The reservoir
• Provides non-linear expansion
This provides a “kernel” trick.
• Acts as a memory
• Parameters:
𝑊𝑖𝑛, 𝑊 and 𝛼 (leaking rate).
• Global parameters:
Number of neurons: The more the better.
Sparsity: Connect a neuron to a fixed but small number of neurons.
Distribution of the non-zero elements: Uniform or Gaussian distribution. 𝑊𝑖𝑛 is denser than 𝑊.
Spectral radius of W: Maximum absolute eigenvalue of 𝑊, or the width of the distribution of its non-zero elements.
Should be less than 1. Otherwise, chaotic, periodic or multiple fixed-point behavior may be observed.
For problems with large memory requirements, it should be bigger than 1.
Scale of the input weights.
Training ESN
• Collect the reservoir activations over the training sequence and fit the readout weights by linear regression.
• Against overfitting: use regularization (e.g., ridge regression) when fitting the readout.
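A minimal sketch of this training procedure, assuming a leaky-tanh reservoir and a ridge-regression readout; all sizes, the spectral radius and the ridge coefficient are illustrative choices.

```python
# Minimal echo state network: random fixed reservoir, leaky updates, ridge-regression readout.
import numpy as np

n_in, n_res, leak, rho, ridge = 1, 200, 0.3, 0.9, 1e-6
rng = np.random.default_rng(0)

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= rho / max(abs(np.linalg.eigvals(W)))        # rescale to the desired spectral radius

def run_reservoir(U):                             # U: (T, n_in) input sequence
    x = np.zeros(n_res)
    states = []
    for u in U:
        x_new = np.tanh(W_in @ u + W @ x)
        x = (1 - leak) * x + leak * x_new         # leaky integration with rate `leak`
        states.append(x.copy())
    return np.array(states)                       # (T, n_res) reservoir activations

U = rng.standard_normal((500, n_in))
Y = np.roll(U, 1, axis=0)                         # toy target: reproduce the previous input
X = run_reservoir(U)
# Readout by ridge regression: W_out = Y^T X (X^T X + ridge * I)^{-1}
W_out = Y.T @ X @ np.linalg.inv(X.T @ X + ridge * np.eye(n_res))
Y_pred = X @ W_out.T                              # readout predictions
```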
Beyond echo state networks
• Good aspects of ESNs:
They can be trained very fast because they just fit a linear model.
They demonstrate that it is very important to initialize weights sensibly.
They can do impressive modeling of one-dimensional time series, but they cannot compete seriously for high-dimensional data.
• Bad aspects of ESNs:
They need many more hidden units for a given task than an RNN that learns the hidden-to-hidden weights.
Slide: Hinton
Similar models
• Liquid State Machines (Maass et al., 2002)
A spiking version of Echo-state networks
• Extreme Learning Machines
Feed-forward network with a hidden layer.
Input-to-hidden weights are randomly initialized and never updated
Time Delay Neural Networks
Fig: https://www.willamette.edu/~gorr/classes/cs449/Temporal/tappedDelay.htm
Skipped points
Skipping
• Stability
• Continuous-time recurrent networks
• Attractor networks
Neural Turing Machines
![Page 27: Special topics in Deep Learning - kovan.ceng.metu.edu.trkovan.ceng.metu.edu.tr/~sinan/DL/week_13.pdf · Neural Turing Machines •If we make every component differentiable, we can](https://reader030.vdocuments.site/reader030/viewer/2022040500/5e1ef9545ee7da6b5b6d420b/html5/thumbnails/27.jpg)
Why do we need other mechanisms?
• We mentioned before that RNNs are Turing Complete, right?
• The issues are:
Vanishing/exploding gradients (LSTM and other tricks address these issues)
However, the number of parameters in LSTMs increases with the number of layers
Despite their advantages, LSTMs still fail to generalize to sequences longer than the training sequences
The answer to getting bigger networks with fewer parameters is a better abstraction of the computational components, e.g., in a form similar to Turing machines
Weston et al., 2015
Turing Machine
Fig: Ucoluk & Kalkan, 2012
(Definition: Wikipedia)
Neural Turing Machines
• If we make every component differentiable, we can train such a complex machine
• Accessing only a part of the memory is problematic: unlike a computer (a TM), we need a "blurry" access mechanism
Graves et al., 2014
Neural Turing Machines: Reading
• Let memory 𝐌 be an 𝑁 ×𝑀 matrix
𝑁: the number of “rows”
𝑀: the size of each row (vector)
• Let 𝐌𝑡 be the memory state at time 𝑡
• w_t: a vector of weightings over the N locations, emitted by the read head at time t. The weights are normalized:
∑_i w_t(i) = 1, with 0 ≤ w_t(i) ≤ 1 for all i
• r_t: the read vector of length M:
r_t ← ∑_i w_t(i) M_t(i)
• which is differentiable, and therefore, trainable.
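A minimal sketch of the read operation with illustrative sizes; the weighting is produced here with a softmax over random scores just to have something normalized.

```python
# NTM read: the read vector is a convex combination of the N memory rows.
import torch

N, M = 128, 20                                 # illustrative memory size (N rows of length M)
M_t = torch.randn(N, M)                        # memory state at time t
w_t = torch.softmax(torch.randn(N), dim=0)     # normalized weighting emitted by the read head

r_t = w_t @ M_t                                # r_t(j) = sum_i w_t(i) * M_t(i, j); fully differentiable
```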
Neural Turing Machines: Writing
• Writing = erasing content + adding new content (inspired by the LSTM's forget and input gates)
• Erasing: multiply with an erase vector e_t ∈ [0, 1]^M:
M_t(i) ← M_{t−1}(i) [1 − w_t(i) e_t]
1: vector of ones; the multiplication here is pointwise.
• Adding: add an add vector a_t ∈ [0, 1]^M:
M_t(i) ← M_t(i) + w_t(i) a_t
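A minimal sketch of the erase-then-add write. The weighting, erase and add vectors are random placeholders, squashed into (0, 1) as on the slide.

```python
# NTM write: erase a fraction of each row, then add new content, weighted by w_t.
import torch

N, M = 128, 20
M_prev = torch.randn(N, M)                     # M_{t-1}
w_t = torch.softmax(torch.randn(N), dim=0)     # write weighting over the N rows
e_t = torch.sigmoid(torch.randn(M))            # erase vector in (0, 1)^M
a_t = torch.sigmoid(torch.randn(M))            # add vector in (0, 1)^M

M_t = M_prev * (1 - w_t.unsqueeze(1) * e_t)    # erase: pointwise per row
M_t = M_t + w_t.unsqueeze(1) * a_t             # add new content
```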
Neural Turing Machines: Addressing
• Content-based addressing
• Location-based addressing
In a sense, use variable “names” to access content
Neural Turing Machines: Content-based Addressing
• Each head (read or write head) produces a key vector k_t of length M.
• k_t is compared to each row M_t(i) using a similarity measure K(·, ·), e.g., cosine similarity:
K(u, v) = (u · v) / (||u|| ||v||)
• From these similarities, we obtain an "addressing" vector:
w_t^c(i) ← exp(β_t K(k_t, M_t(i))) / ∑_j exp(β_t K(k_t, M_t(j)))
𝛽𝑡: amplifies or attenuates the precision of the focus
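A minimal sketch of content-based addressing; the key, memory and β are placeholders.

```python
# Content-based addressing: cosine similarity of the key to each row, sharpened by beta, softmax-normalized.
import torch
import torch.nn.functional as F

N, M = 128, 20
M_t = torch.randn(N, M)
k_t = torch.randn(M)                                            # key emitted by the head
beta_t = 5.0                                                    # focus sharpness

sim = F.cosine_similarity(M_t, k_t.unsqueeze(0).expand_as(M_t), dim=1)  # K(k_t, M_t(i)) per row
w_c = torch.softmax(beta_t * sim, dim=0)                        # content-based weighting over rows
```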
Neural Turing Machines: Location-based Addressing
• Important for e.g. iteration over memory locations, or jumping to an arbitrary memory location
• First: interpolation between addressing schemes using the "interpolation gate" g_t:
w_t^g ← g_t w_t^c + (1 − g_t) w_{t−1}
If 𝑔𝑡 = 1: weight from content-addressable component is used
If 𝑔𝑡 = 0: weight from previous step is used
• Second: rotationally shift the weighting to achieve location-based addressing, using a circular convolution:
ŵ_t(i) ← ∑_{j=0}^{N−1} w_t^g(j) s_t(i − j)
s_t: shift distribution, e.g., three elements for how "much" to shift left, shift right, or stay in place.
The weighting needs to stay "sharp". To keep it sharp, each head emits a scalar γ_t ≥ 1:
w_t(i) ← ŵ_t(i)^{γ_t} / ∑_j ŵ_t(j)^{γ_t}
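A minimal sketch of the three location-based steps (interpolation, circular shift, sharpening), assuming shifts restricted to {−1, 0, +1}; the weightings, gate, shift distribution and γ are placeholders.

```python
# Location-based addressing: gated interpolation, circular convolution with a shift kernel, sharpening.
import torch

N = 128
w_c = torch.softmax(torch.randn(N), dim=0)       # content-based weighting (from the previous step)
w_prev = torch.softmax(torch.randn(N), dim=0)    # weighting from the previous time step
g_t = 0.8                                        # interpolation gate in [0, 1]
s_t = torch.tensor([0.1, 0.8, 0.1])              # shift distribution over {-1, 0, +1}
gamma_t = 2.0                                    # sharpening exponent (>= 1)

w_g = g_t * w_c + (1 - g_t) * w_prev             # gated interpolation

# circular convolution: w_hat(i) = sum_j w_g(j) s_t(i - j), restricted to the 3 allowed shifts
w_hat = torch.zeros(N)
for offset, s in zip((-1, 0, 1), s_t):
    w_hat += s * torch.roll(w_g, offset)

w_t = w_hat ** gamma_t
w_t = w_t / w_t.sum()                            # sharpened, renormalized final weighting
```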
Neural Turing Machines: Controller Network
• Free parameters
The size of the memory
Number of read-write heads
Range of allowed rotation shifts
Type of the neural network for controller
• Alternatives:
A recurrent network such as LSTM with its own memory
These memory units might be considered like “registers” on the CPU
A feed-forward network
Can use the memory to achieve recurrence
More transparent
Neural Turing Machines: Training
• Binary targets: logistic sigmoid output layers
Cross-entropy loss
• Other schemes possible
• Tasks: Copy from input to output
Repeat Copy: Make n copies of the input
Associative recall: Present a part of a sequence to recall the remaining part
N-gram: Learn distribution of 6-grams and make predictions for the next bit based on this distribution
Priority sort: a priority is attached to each input vector; the target is the sequence re-ordered according to these priorities
Neural Turing Machines: Training
Reinforced version
Other variants/attempts
Neural Programmer
Pointer networks
(Vinyals et al., 2015)
Memory Networks
Universal Turing Machine
2015
2017
Newer studies
• https://deepmind.com/blog/differentiable-neural-computers/
• Differentiable Neural Computers
Unsupervised pre-training with Auto-encoders
Now
• Manifold Learning
Principal Component Analysis
Independent Component Analysis
• Autoencoders
• Sparse autoencoders
• K-sparse autoencoders
• Denoising autoencoders
• Contractive autoencoders
Manifold Learning
Manifold Learning
• Discovering the "hidden" structure in the high-dimensional space
• Manifold: the "hidden" structure.
• Non-linear dimensionality reduction
http://www.convexoptimization.com/dattorro/manifold_learning.html
Manifold Learning
• Many approaches:
Self-Organizing Map (Kohonen map/network)
Auto-encoders
Principal curves & manifolds: extension of PCA
Kernel PCA, Nonlinear PCA
Curvilinear Component Analysis
Isomap: Floyd-Warshall + multidimensional scaling
Data-driven high-dimensional scaling
Locally-linear embedding
…
Manifold learning
• Autoencoders learn lower-dimensional manifolds embedded in higher-dimensional manifolds
• Assumption: “Natural data in high dimensional spaces concentrates close to lower dimensional manifolds”
Natural images occupy a very small fraction in a space of possible images
(Pascal Vincent)
Manifold Learning
• Many approaches:
Self-Organizing Map (Kohonen map/network)
https://en.wikipedia.org/wiki/Self-organizing_map
(Figures: Pascal Vincent)
Principal Component Analysis (PCA)
• Principal components: orthogonal directions with the most variance
Eigenvectors of the covariance matrix
• Mathematical background: Orthogonality:
Two vectors u and v are orthogonal iff u · v = 0
Variance:
σ_X² = Var(X) = E[(X − μ)²] = ∑_i p(x_i) (x_i − μ)²
where the (weighted) mean is μ = E[X] = ∑_i p(x_i) x_i.
If p(x_i) = 1/N:
Var(X) = (1/N) ∑_i (x_i − μ)², with μ = (1/N) ∑_i x_i
(Ole Winther)
[ Pearson 1901 ] [ Hotelling 1933 ]
Mathematical background for PCA: Covariance
• Covariance:
Measures how two random variables change with respect to each other:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = (1/N) ∑_i (x_i − E[X])(y_i − E[Y])
If big values of X & big values of Y "co-occur" and small values of X & small values of Y "co-occur", the covariance is high.
Otherwise, the covariance is small.
Mathematical background for PCA: Covariance Matrix
• Co-variance Matrix:
Denoted usually by Σ
For an 𝑛-dimensional space:
Σ_ij = Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]
Mathematical background for PCA: Covariance Matrix
• Covariance matrix: Σ_ij = Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]
• Properties
(Wikipedia)
Mathematical background for PCA: Eigenvectors & Eigenvalues
• Eigenvectors and eigenvalues:
v is an eigenvector of a square matrix A if
A v = λ v
where λ is the eigenvalue (scalar) associated with v.
• Interpretation:
The "transformation" A does not change the direction of the vector.
It only changes the vector's scale; that scale factor is the eigenvalue.
• Solution:
(A − λI) v = 0 has a non-trivial solution when the determinant |A − λI| is zero.
Find the eigenvalues, then plug in those values to get the eigenvectors.
Mathematical background for PCA: Eigenvectors & Eigenvalues Example
• Setting the determinant |A − λI| to zero for the matrix A = [[2, 1], [1, 2]] gives the characteristic polynomial.
• The roots are λ = 1 and λ = 3.
• Plugging in those eigenvalues: for λ = 1, (A − I)v = 0 gives v₁ = (1, −1); for λ = 3, (A − 3I)v = 0 gives v₂ = (1, 1).
Example from Wikipedia.
PCA allows also dimensionality reduction
• Discard components whose eigenvalue is negligible.
See the following tutorial for more on PCA:
http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
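A minimal NumPy sketch of PCA by eigen-decomposition of the covariance matrix, keeping only the top-k components; the data and k are illustrative.

```python
# PCA: center the data, eigen-decompose the covariance matrix, keep the largest-variance directions.
import numpy as np

X = np.random.randn(500, 10)               # toy data: 500 samples, 10 dimensions
Xc = X - X.mean(axis=0)                    # center the data
Sigma = np.cov(Xc, rowvar=False)           # covariance matrix (10 x 10)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric matrix -> real eigenvalues, ascending order
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue (variance)
k = 3                                      # keep the top-k principal components
W = eigvecs[:, order[:k]]

Z = Xc @ W                                 # projection onto the k-dimensional subspace
X_rec = Z @ W.T + X.mean(axis=0)           # approximate reconstruction from the reduced code
```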
Independent Component Analysis (ICA)
• PCA assumes Gaussianity:
Data along a component should be explainable by a mean and a variance.
This may be violated by real signals in nature.
• ICA:
Blind-source separation of non-Gaussian and mutually-independent signals.
• Mutual independence:
https://cnx.org/contents/-[email protected]:gFEtO206@1/Independent-Component-Analysis
Autoencoders
Autoencoders
• Universal approximators (so are Restricted Boltzmann Machines)
• Unsupervised learning
• Dimensionality reduction
• 𝐱 ∈ ℝ𝐷 ⇒ 𝐡 ∈ ℝ𝑀 s.t. 𝑀 < 𝐷
(Figures: Pascal Vincent)
Stacking autoencoders: learn the first layer
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
Stacking autoencoders: learn the second layer
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
Stacking autoencoders: add, e.g., a softmax layer for mapping to the output
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
Stacking autoencoders: Overall
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
(Figures: Pascal Vincent)
Making auto-encoders learn over-complete representations that are not one-to-one mappings
Wait, what do we mean by over-complete?
• Remember distributed representations?
Figure credit: Moontae Lee (panels: distributed vs. not distributed)
Distributed vs. undercomplete vs. overcomplete representations
• Four categories could also be represented by two neurons:
(Figure panels: distributed / undercomplete, not distributed / overcomplete, distributed / overcomplete)
Over-complete = sparse (in distributed representations)
• Why sparsity?
1. Because our brain relies on sparse coding. Why does it do so?
a. Because it is adapted to an environment which is composed of, and can be sensed through, the combination of primitive items/entities.
b. "Sparse coding may be a general strategy of neural systems to augment memory capacity. To adapt to their environments, animals must learn which stimuli are associated with rewards or punishments and distinguish these reinforced stimuli from similar but irrelevant ones. Such task requires implementing stimulus-specific associative memories in which only a few neurons out of a population respond to any given stimulus and each neuron responds to only a few stimuli out of all possible stimuli."
– Wikipedia
c. Theoretically, it has been shown that sparsity increases memory capacity.
Over-complete = sparse (in distributed representations)
• Why sparsity?
2. Because of information theoretical aspects:
Sparse codes have lower entropy compared to non-sparse ones.
3. It is easier for the consecutive layers to learn from sparse codes, compared to non-sparse ones.
Olshausen & Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?", 1997
Mechanisms for enforcing over-completeness
• Use stochastic gradient descent
• Add a sparsity constraint:
Into the loss function (sparse autoencoder)
Or, in a hard manner (k-sparse autoencoder)
• Add stochasticity / randomness:
Add noise: denoising autoencoders, contractive autoencoders
Restricted Boltzmann Machines
Auto-encoders with SGD
Simple neural network
• Input: 𝐱 ∈ 𝑹𝒏
• Hidden layer: 𝐡 ∈ 𝑹𝒎
𝐡 = 𝒇𝟏(𝑾𝟏𝐱)
• Output layer: 𝐲 ∈ 𝑹𝒏
y = f₂(W₂ f₁(W₁ x))
• Squared-error loss:
L = (1/2) ∑_{d∈D} ||x_d − y_d||²
• For training, use SGD.
• You may try different activation functions for 𝑓1 and 𝑓2.
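A minimal sketch of this autoencoder in PyTorch, with sigmoid chosen for both f₁ and f₂ and with illustrative sizes.

```python
# Simple autoencoder: h = f1(W1 x), y = f2(W2 h), squared-error reconstruction loss, SGD.
import torch
import torch.nn as nn

n, m = 784, 64                              # input dimension and hidden (code) dimension
model = nn.Sequential(
    nn.Linear(n, m), nn.Sigmoid(),          # encoder: f1(W1 x)
    nn.Linear(m, n), nn.Sigmoid(),          # decoder: f2(W2 h)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.rand(256, n)                      # toy batch of inputs
for step in range(100):
    y = model(X)
    loss = 0.5 * ((X - y) ** 2).sum(dim=1).mean()   # squared-error loss per sample, averaged
    opt.zero_grad(); loss.backward(); opt.step()
```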
Sparse Autoencoders
Sparse autoencoders
• Input: 𝐱 ∈ 𝑹𝒏
• Hidden layer: 𝐡 ∈ 𝑹𝒎
𝐡 = 𝒇𝟏(𝑾𝟏𝐱)
• Output layer: 𝐲 ∈ 𝑹𝒏
𝐲 = 𝒇𝟐 𝑾𝟐𝒇𝟏 𝑾𝟏𝐱
Over-completeness and sparsity:
• Require 𝑚 > 𝑛, and
Hidden neurons should produce only little activation for any input, i.e., sparsity.
• How to enforce sparsity?
Enforcing sparsity: alternatives
• How?
• Solution 1: an L1 penalty λ|w|. We have seen before that this enforces sparsity.
However, this is not strong enough.
• Solution 2: Limit the average total activation of a neuron throughout training!
• Solution 3:
Kurtosis: μ₄/σ⁴ = E[(X − μ)⁴] / (E[(X − μ)²])²
Calculated over the activations of the whole network.
High kurtosis ⇒ sparse activations.
“Kurtosis has only been studied for response distributions of model neurons where negative responses are allowed. It is unclear whether kurtosis is actually a sensible measure for realistic, non-negative response distributions.” -http://www.scholarpedia.org/article/Sparse_coding
• And many, many other ways…
Enforcing sparsity: a popular choice
• Limit the amount of total activation for a neuron throughout training!
• Use ρ_i(x_d) to denote the activation of hidden neuron i on training input x_d. The average activation of the neuron over the m training examples:
ρ̂_i = (1/m) ∑_{d=1}^{m} ρ_i(x_d)
• Now, to enforce sparsity, we constrain ρ̂_i to a target value ρ_0.
• 𝜌0: A small value. Yet another hyperparameter which may be tuned.
typical value: 0.05.
• The neuron must be inactive most of the time to keep its activations under the limit.
Enforcing sparsity
ρ̂_i = (1/m) ∑_{d=1}^{m} ρ_i(x_d)
• How to enforce ρ̂_i = ρ_0? How do we integrate this as a penalty term into the loss function?
ρ_0 is called the sparsity parameter.
• Use the Kullback-Leibler divergence:
∑_i KL(ρ_0 || ρ̂_i)
or, equivalently (since this is the KL divergence between two Bernoulli variables with means ρ_0 and ρ̂_i):
∑_i [ ρ_0 log(ρ_0 / ρ̂_i) + (1 − ρ_0) log((1 − ρ_0) / (1 − ρ̂_i)) ]
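A minimal sketch of computing this KL penalty over a batch, assuming sigmoid hidden activations so that ρ̂_i ∈ (0, 1); ρ_0 and β are illustrative values.

```python
# KL sparsity penalty on the average hidden activations of a sparse autoencoder.
import torch

rho0, beta = 0.05, 3.0                       # sparsity target and penalty weight (illustrative)
H = torch.sigmoid(torch.randn(256, 64))      # hidden activations for a batch (batch x hidden units)

rho_hat = H.mean(dim=0)                      # average activation of each hidden unit over the batch
kl = rho0 * torch.log(rho0 / rho_hat) + (1 - rho0) * torch.log((1 - rho0) / (1 - rho_hat))
sparsity_penalty = beta * kl.sum()           # added to the reconstruction loss before backprop
```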
Backpropagation and training
Reminder:
- For each hidden unit h, calculate its error term δ_h:
δ_h = o_h (1 − o_h) ∑_{k∈outputs} w_kh δ_k
- Update every weight w_ji: w_ji = w_ji + η δ_j x_ji
The sparsity penalty:
S = β ∑_i [ ρ_0 log(ρ_0 / ρ̂_i) + (1 − ρ_0) log((1 − ρ_0) / (1 − ρ̂_i)) ]
Its derivative (with log base 10):
dS/dρ̂_i = β [ −ρ_0 / (ρ̂_i ln 10) + (1 − ρ_0) / ((1 − ρ̂_i) ln 10) ]
• If you use ln in the KL term:
dS/dρ̂_i = β [ −ρ_0 / ρ̂_i + (1 − ρ_0) / (1 − ρ̂_i) ]
• So, if we integrate this into the original error term:
δ_h = o_h (1 − o_h) [ ∑_k w_kh δ_k + β ( −ρ_0 / ρ̂_h + (1 − ρ_0) / (1 − ρ̂_h) ) ]
• Need to change 𝑜ℎ(1 − 𝑜ℎ) if you use a different activation function.
Backpropagation and training
S = β ∑_i [ ρ_0 log(ρ_0 / ρ̂_i) + (1 − ρ_0) log((1 − ρ_0) / (1 − ρ̂_i)) ]
• Do you see a problem here?
• ො𝜌𝑖 should be calculated over the training set.
• In other words, we need to go through the whole dataset (or batch) once to calculate ො𝜌𝑖.
Loss & decoders & encoders
• Be careful about the range of your activations and the range of the output
• Real-valued input:
Encoder: use sigmoid
Decoder: no need for non-linearity.
Loss: Squared-error Loss
• Binary-valued input:
Encoder: use sigmoid.
Decoder: use sigmoid.
Loss: use cross-entropy loss:
Loss & decoders & encoders
• Kullback-Leibler divergence assumes that the variables are in the range [0, 1]. I.e., you are bound to use sigmoid for the hidden layer if you use KL to limit the activations of hidden units.
k-Sparse Autoencoder
• Note that the hidden layer doesn't have an activation function!
• Non-linearity comes from k-selection.
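A minimal sketch of that k-selection step: keep the k largest linear pre-activations per sample and zero out the rest; sizes and k are illustrative.

```python
# k-sparse code: top-k selection over the linear hidden pre-activations.
import torch

k = 10
Z = torch.randn(256, 64)                                          # linear pre-activations W x (batch x hidden)
topk = torch.topk(Z, k, dim=1)                                    # k largest values per sample
H = torch.zeros_like(Z).scatter_(1, topk.indices, topk.values)    # everything else is set to zero
```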
http://www.ericlwilkinson.com/blog/2014/11/19/deep-learning-sparse-autoencoders
Denoising Auto-encoders (DAE)
Denoising Auto-encoders
• Simple idea:
randomly corrupt some of the inputs (as many as half of them) – e.g., set them to zero.
Train the autoencoder to reconstruct the input from a corrupted version of it.
The auto-encoder must predict the corrupted (i.e., missing) values from the uncorrupted values.
This requires capturing the joint distribution between a set of variables
• A stochastic version of the auto-encoder.
(Figures: Pascal Vincent)
Loss in DAE
• You may give extra emphasis to the "corrupted" dimensions, both in the squared-error loss and in the cross-entropy-based loss.
Denoising Auto-encoders
• To undo the effect of a corruption induced by the noise, the network needs to capture the statistical dependencies between the inputs.
• This can be interpreted from many perspectives (see Vincent et al., 2008): the manifold learning perspective, the stochastic operator perspective.
(Figures: Pascal Vincent)
Types of corruption
• Gaussian Noise (additive, isotropic)
• Masking Noise:
Set a randomly selected subset of the input to zero for each sample (the fraction is a constant parameter)
• Salt-and-pepper Noise:
Set a randomly selected subset of the input to the maximum or minimum value for each sample (the fraction is a constant parameter)
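A minimal sketch of masking-noise corruption; the corruption fraction and the data are illustrative, and the loss is always computed against the clean input.

```python
# Masking-noise corruption for a denoising autoencoder.
import torch

def mask_corrupt(X, frac=0.5):
    keep = (torch.rand_like(X) > frac).float()    # zero roughly `frac` of the entries of each sample
    return X * keep

X = torch.rand(256, 784)                          # clean inputs
X_tilde = mask_corrupt(X)                         # corrupted version fed to the encoder
# loss = reconstruction_error(model(X_tilde), X)  # the target is the *clean* input X
```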
(Vincent et al., 2010) Weight decay: L2 regularization.
(Vincent et al., 2010)
Training DAE
• Training algorithm does not change
However, you may give different emphasis on the error of reconstruction of the corrupted input.
• SGD is a popular choice
• Sigmoid is a suitable choice unless you know what you are doing.
Contractive Auto-encoder
(Figures: Pascal Vincent)
Convolutional AE
• Encoder:
Standard convolutional layer
You may use pooling (e.g., max-pooling)
Pooling is shown to regularize the features in the encoder (Masci et al., 2011)
• Decoder:
Deconvolution
• Loss is MSE.
(Masci et al., 2011)
https://github.com/vdumoulin/conv_arithmetic
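A minimal convolutional autoencoder sketch in PyTorch (conv + max-pooling encoder, transposed-convolution decoder, MSE loss); the channel counts and the 28×28 input are illustrative.

```python
# Convolutional autoencoder: conv/pool encoder, transposed-conv ("deconvolution") decoder, MSE loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),    # standard convolutional layer
    nn.MaxPool2d(2),                              # pooling (28x28 -> 14x14), regularizes the features
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 1, 2, stride=2),       # upsample back to 28x28
    nn.Sigmoid(),
)

x = torch.rand(8, 1, 28, 28)                      # toy image batch
x_rec = decoder(encoder(x))
loss = nn.functional.mse_loss(x_rec, x)           # MSE reconstruction loss
```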
Principles other than ‘sparsity’?
Slowness
http://www.scholarpedia.org/article/Slow_feature_analysis
Slow Feature Analysis (SFA), from Wiskott et al.
http://www.scholarpedia.org/article/Slow_feature_analysis
Slow Feature Analysis (SFA)
http://www.scholarpedia.org/article/Slow_feature_analysis
Optimal stimuli for the slowest components extracted from natural image sequences.
Visualizing the layers
Visualizing the layers
• Question: What is the input that activates a hidden unit ℎ𝑖 most?
i.e., we are after x* = arg max_{x s.t. ||x|| = ρ} h_i(W, x)
• For the first layer, the maximizer has a closed form:
x_j = w_ij / √(∑_k w_ik²)
where we assume that ∑_i x_i² ≤ 1, and hence normalize the weights to match the range of the input values.
• How about the following layers? Gradient ascent (not descent): find the gradient of h_i(W, x) w.r.t. x and move x in the direction of the gradient, since we want to maximize h_i(W, x).
Visualizing the layers
• Activation maximization:
Gradient ascent to maximize h_i(W, x).
Start with randomly generated input and move towards the gradient.
Luckily, different random initializations yield very similar filter-like responses.
• Applicable to any network for which we can calculate the gradient 𝜕ℎ𝑖 𝑊, 𝐱 /𝜕𝐱
• Need to tune parameters:
Learning rate
Stopping criteria
• Has the same problems as gradient descent
The space is non-convex
Local maxima etc.
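A minimal sketch of activation maximization by gradient ascent on the input; the model, unit index, step size and the clamp used as a stand-in for the norm constraint are all illustrative.

```python
# Activation maximization: optimize the input (not the weights) to maximize one unit's activation.
import torch

def activation_maximization(model, unit, n_steps=200, lr=0.1, input_dim=784):
    x = torch.randn(1, input_dim, requires_grad=True)   # start from a randomly generated input
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(n_steps):
        h = model(x)[0, unit]           # activation of the chosen hidden unit
        opt.zero_grad()
        (-h).backward()                 # ascend: minimize the negative activation
        opt.step()
        with torch.no_grad():
            x.clamp_(-1.0, 1.0)         # keep x bounded (a crude proxy for the norm constraint)
    return x.detach()
```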
Activation Maximization results
Erhan et al., "Understanding Representations Learned in Deep Architectures", 2010.
New studies
• Unsupervised pretraining by setting noise as the target:
• https://arxiv.org/abs/1704.05310
A recent study
• https://openreview.net/pdf?id=HkNDsiC9KQ