CENG 783
Special topics in
Deep Learning
Week 12
Sinan Kalkan
© AlchemyAPI
Why do we need them?
A. Graves, “Supervised Sequence Labelling with Recurrent Neural Networks”, 2012.
Recurrent Neural Networks (RNNs)
• RNNs are very powerful because:
– Distributed hidden state that allows them to store a lot of information about the past efficiently.
– Non-linear dynamics that allow them to update their hidden state in complicated ways.
• With enough neurons and time, RNNs can compute anything that can be computed by your computer.
• More formally, RNNs are Turing complete.
[Figure: a feed-forward network (input → hidden → output) compared with a recurrent network, whose hidden layer also feeds back into itself. Adapted from Hinton]
Unfolding
[Figure: the recurrent network unfolded in time into a feed-forward network: one copy of the input, hidden, and output layers per time step t=0, t=1, t=2, …, with the hidden state passed from each copy to the next]
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Reminder: Backpropagation with weight constraints
• It is easy to modify the backprop algorithm to incorporate linear constraints between the weights.
• We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
– So if the weights started off satisfying the constraints, they will continue to satisfy them.

To constrain: $w_1 = w_2$
we need: $\Delta w_1 = \Delta w_2$
compute: $\dfrac{\partial E}{\partial w_1}$ and $\dfrac{\partial E}{\partial w_2}$
use $\dfrac{\partial E}{\partial w_1} + \dfrac{\partial E}{\partial w_2}$ for $w_1$ and $w_2$
Slide: Hinton
Backpropagation through time
• We can think of the recurrent net as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints.
• We can also think of this training algorithm in the time domain:
– The forward pass builds up a stack of the activities of all the units at each time step.
– The backward pass peels activities off the stack to compute the error derivatives at each time step.
– After the backward pass we add together the derivatives at all the different times for each weight.
Slide: Hinton
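To make this concrete, here is a minimal sketch (not from the slides) of BPTT for a vanilla RNN in NumPy; the sizes and parameter names (Wxh, Whh, Why) are illustrative assumptions.

```python
import numpy as np

# A minimal vanilla-RNN BPTT sketch (illustrative names, toy sizes).
np.random.seed(0)
D, H, V, T = 4, 8, 4, 5          # input dim, hidden dim, output dim, time steps
Wxh = np.random.randn(H, D) * 0.1
Whh = np.random.randn(H, H) * 0.1
Why = np.random.randn(V, H) * 0.1

xs = [np.random.randn(D) for _ in range(T)]      # toy input sequence
ts = [np.random.randint(V) for _ in range(T)]    # toy target indices

# Forward pass: build a "stack" of the activities at every time step.
hs, ps = {-1: np.zeros(H)}, {}
loss = 0.0
for t in range(T):
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
    y = Why @ hs[t]
    ps[t] = np.exp(y - y.max()); ps[t] /= ps[t].sum()   # softmax
    loss += -np.log(ps[t][ts[t]])

# Backward pass: peel activities off the stack, accumulating the
# derivatives for each (shared) weight over all time steps.
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dy = ps[t].copy(); dy[ts[t]] -= 1          # d loss / d logits
    dWhy += np.outer(dy, hs[t])
    dh = Why.T @ dy + dh_next
    draw = (1 - hs[t] ** 2) * dh               # back through tanh
    dWxh += np.outer(draw, xs[t])
    dWhh += np.outer(draw, hs[t - 1])
    dh_next = Whh.T @ draw

print(loss, np.linalg.norm(dWhh))
```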
The problem of exploding or vanishing gradients
• What happens to the magnitude of the gradients as we backpropagate through many layers?
– If the weights are small, the gradients shrink exponentially.
– If the weights are big the gradients grow exponentially.
• Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers.
• In an RNN trained on long sequences (e.g., 100 time steps) the gradients can easily explode or vanish.
– We can avoid this by initializing the weights very carefully.
• Even with good initial weights, it's very hard to detect that the current target output depends on an input from many time-steps ago.
– So RNNs have difficulty dealing with long-range dependencies.
Slide: Hinton
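As a toy illustration of the effect (not from the slides): backpropagating through many steps repeatedly multiplies the error signal by the recurrent weight matrix, so its norm shrinks or grows roughly geometrically with the scale of the weights.

```python
import numpy as np

np.random.seed(0)
H, T = 32, 100
grad = np.random.randn(H)

for scale in (0.5, 1.5):                              # "small" vs "big" recurrent weights
    W = scale * np.linalg.qr(np.random.randn(H, H))[0]  # scaled orthogonal matrix
    g = grad.copy()
    for _ in range(T):
        g = W.T @ g                                   # ignoring the nonlinearity for clarity
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")
# with scale=0.5 the norm collapses toward zero; with scale=1.5 it blows up
```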
Exploding and vanishing gradients problem
• Solution 1: Gradient clipping for exploding gradients:
For vanishing gradients: Regularization term that penalizes changes in the magnitudes of back-propagated gradients
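A minimal sketch of one common clipping variant, rescaling by the global gradient norm (the threshold value and function name are assumptions, not from the slides):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads

# usage: dWxh, dWhh, dWhy = clip_by_global_norm([dWxh, dWhh, dWhy], max_norm=5.0)
```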
Exploding and vanishing gradients problem
• Solution 2:
Use methods like LSTM
Meet LSTMs
• How about we explicitly encode memory?
(C) Dhruv Batra
Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
LSTM in detail
• We first compute an activation vector, $a$:
$a = W_x x_t + W_h h_{t-1} + b$
• Split this into four vectors of the same size:
$a_i, a_f, a_o, a_g \leftarrow a$
• We then compute the values of the gates:
$i = \sigma(a_i), \quad f = \sigma(a_f), \quad o = \sigma(a_o), \quad g = \tanh(a_g)$
where $\sigma$ is the sigmoid.
• The next cell state $c_t$ and the hidden state $h_t$:
$c_t = f \odot c_{t-1} + i \odot g, \qquad h_t = o \odot \tanh(c_t)$
where $\odot$ is the element-wise product of vectors.
[Figure: LSTM cell mapping $(c_{t-1}, h_{t-1}, x_t)$ to $(c_t, h_t)$. Image: C. Olah]
Eqs: Karpathy
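A minimal NumPy sketch of a single step of the equations above; the toy sizes and variable names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step following the formulation above."""
    a = Wx @ x_t + Wh @ h_prev + b          # activation vector
    a_i, a_f, a_o, a_g = np.split(a, 4)     # four equal-sized pieces
    i, f, o = sigmoid(a_i), sigmoid(a_f), sigmoid(a_o)   # gates
    g = np.tanh(a_g)                        # candidate cell update
    c_t = f * c_prev + i * g                # next cell state
    h_t = o * np.tanh(c_t)                  # next hidden state
    return h_t, c_t

# toy usage
D, H = 3, 5
rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, Wx, Wh, b)
print(h.shape, c.shape)
```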
CNN vs RNN
Example: Character-level Text Modeling
A simple scenario
• Alphabet: h, e, l, o
• Text to train to predict: “hello”
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
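A small sketch of how the training data for this scenario is typically set up (one-hot inputs, next-character targets); the code is an illustrative assumption, not from the slides:

```python
import numpy as np

alphabet = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(alphabet)}
text = "hello"

def one_hot(ch):
    v = np.zeros(len(alphabet))
    v[char_to_ix[ch]] = 1.0
    return v

# Inputs are the characters "h e l l", targets are the next characters "e l l o".
inputs  = [one_hot(ch) for ch in text[:-1]]
targets = [char_to_ix[ch] for ch in text[1:]]
for x, t in zip(text[:-1], text[1:]):
    print(f"input {x!r} -> target {t!r}")
```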
Sample predictions (when trained on the works of Shakespeare):
• 3-layer RNN with 512 hidden nodes in each layer
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sample predictions (when trained on Wikipedia):
• Using LSTM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sample predictions (when trained on Latex documents):
• Using multi-layer LSTM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
He was elected President during the Revolutionary
War and forgave Opus Paul at Rome. The regime
of his crew of England, is now Arab women's icons
in and the demons that use something between
the characters‘ sisters in lower coil trains were
always operated on the line of the ephemerable
street, respectively, the graphic or other facility for
deformation of a given proportion of large
segments at RTUS). The B every chord was a
"strongly cold internal palette pour even the white
blade.”
Slide: Hinton
From Ilya Sutskever (using a variant of character-level RNN)
Some completions produced by the model
• Sheila thrunges (most frequent)
• People thrunge (most frequent next character is space)
• Shiela, Thrungelini del Rey (first try)
• The meaning of life is literary recognition. (6th try)
• The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. (one of the first 10 tries for a model trained for longer).
Slide: Hinton
RNNs for predicting the next word
• Tomas Mikolov and his collaborators have trained quite large RNNs on quite large training sets using BPTT. They do better than feed-forward neural nets.
They do better than the best other models.
They do even better when averaged with other models.
• RNNs require much less training data to reach the same level of performance as other models.
• RNNs improve faster than other methods as the dataset gets bigger. This is going to make them very hard to beat.
Slide: Hinton
Word-level RNN for news title generation
http://clickotron.com/
https://github.com/larspars/word-rnn
Today
• Finalize RNNs
Language modeling with words
Word embedding
Example Application: Image Captioning
Example Application: Language Translation
• NOTES: Final Exam date: 3 January, 17:40 – BMB1
Lecture on 31st of December (Monday)
Project deadline (w/o Incomplete period): 26th of January
Project deadline (w Incomplete period): 2nd of February
ICLR 2019 Reproducibility challenge
Word Embedding (word2vec)
Fig: http://www.languagejones.com/blog-1/2015/11/1/word-embedding
Why do we embed words?
• 1-of-n encoding is not suitable to learn from:
It is sparse
Similar words have different representations
Compare the pixel-based representation of images: Similar images/objects have similar pixels
• Embedding words in a map allows
Encoding them with fixed-length vectors
“Similar” words having similar representations
Allows complex reasoning between words:
king - man + woman = queen
Table: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
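A toy sketch of the "king - man + woman = queen" vector arithmetic with made-up 3-d embeddings (real embeddings are learned and have hundreds of dimensions; the vectors below are purely illustrative):

```python
import numpy as np

# Hand-made toy embeddings just to illustrate the arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"), key=lambda w: cosine(query, emb[w]))
print(best)   # with these toy vectors: "queen"
```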
More examples
http://deeplearning4j.org/word2vec
More examples
More examples
• Geopolitics: Iraq - Violence = Jordan
• Distinction: Human - Animal = Ethics
• President - Power = Prime Minister
• Library - Books = Hall
http://deeplearning4j.org/word2vec
Using word embeddings
• E.g., for language modeling
• Given “I am eating …”, a language model can predict what can come next.
This requires both syntax and semantics (context).
• Before deep learning, n-gram (2-gram, 3-gram) models were state of the art.
Fig: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
Using word embeddings
Fig: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
Using word embeddings
Fig: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
word2vec
• “Similarity” to Sweden (cosine distance between their vector representations)
http://deeplearning4j.org/word2vec
Two different ways to train
1. Using context to predict a target word (~ continuous bag-of-words)
2. Using word to predict a target context (skip-gram)
Produces more accurate results on large datasets
• If the vector for a word cannot predict the context, the mapping to the vector space is adjusted
• Since similar words should predict the same or similar contexts, their vector representations should end up being similar
http://deeplearning4j.org/word2vec
Two different ways to train
1. Using context to predict a target word (~ continuous bag-of-words)
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Two different ways to train
2. Using word to predict a target context (skip-gram)
Produces more accurate results on large datasets
• Given a sentence:
the quick brown fox jumped over the lazy dog
• For each word, take context to be
(N-words to the left, N-words to the right)
• If N = 1 (context, word):
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
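A small sketch of how such (word, context) training pairs can be generated; the function name and window handling are illustrative assumptions:

```python
def skipgram_pairs(sentence, n=1):
    """Yield (center_word, context_word) pairs within a +/- n word window."""
    words = sentence.split()
    for i, center in enumerate(words):
        for j in range(max(0, i - n), min(len(words), i + n + 1)):
            if j != i:
                yield center, words[j]

sentence = "the quick brown fox jumped over the lazy dog"
print(list(skipgram_pairs(sentence, n=1))[:6])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```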
Some details
• CBOW is called continuous BOW since the context is treated as a bag of words but is represented with continuous-valued vectors.
• In both approaches, the networks are composed of linear units
• The output units are usually normalized with the softmax
• According to Mikolov:
“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words”
An extra issue: Sampling
• Greedy sampling: Take the most likely word at each step
vs
• Beam search (or alternatives): Consider k most likely words at each step, and expand search.
Figure: http://mttalks.ufal.ms.mff.cuni.cz/index.php
Figure: https://geekyisawesome.blogspot.com.tr/2016/10/using-beam-search-to-generate-most.html
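A minimal beam-search sketch over an arbitrary next-token distribution; the stand-in model function and vocabulary are assumptions, not from the slides:

```python
import numpy as np

def beam_search(step_fn, start, k=3, max_len=10, end_token=None):
    """step_fn(prefix) -> array of next-token log-probabilities.
    Keeps the k most probable prefixes at every step."""
    beams = [(0.0, [start])]                       # (log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if end_token is not None and seq[-1] == end_token:
                candidates.append((logp, seq))     # finished hypothesis
                continue
            log_probs = step_fn(seq)
            for tok, lp in enumerate(log_probs):
                candidates.append((logp + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams

# Toy usage: a fake "model" with a fixed 4-token vocabulary.
rng = np.random.default_rng(0)
def fake_step(seq):
    return np.log(rng.dirichlet(np.ones(4)))

for logp, seq in beam_search(fake_step, start=0, k=2, max_len=3):
    print(round(logp, 2), seq)
```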
Example: Image Captioning
Fig: https://github.com/karpathy/neuraltalk2
Demo video
https://vimeo.com/146492001
Overview
[Figure: overview of the captioning model: a pre-trained CNN (e.g., on ImageNet) encodes the image, and a pre-trained word embedding is also used. Image: Karpathy]
Training
Slide: Karpathy
• http://aiweirdness.com/post/171451900302/do-neural-nets-dream-of-electric-sheep
Example: Neural Machine Translation
Neural Machine Translation
• Model
[Figure: encoder-decoder sequence-to-sequence model mapping an input sequence "A B C" to an output sequence "W X Y Z"]
Haitham Elmarakeby
Neural Machine Translation
• Model
Sutskever et al. 2014. Haitham Elmarakeby
Each box is an LSTM or GRU cell.
Neural Machine Translation
• Model- encoder
Cho: From Sequence Modeling to Translation Haitham Elmarakeby
Neural Machine Translation
• Model- encoder
Cho: From Sequence Modeling to Translation Haitham Elmarakeby
Neural Machine Translation
• Model-decoder
Cho: From Sequence Modeling to Translation Haitham Elmarakeby
Decoder in more detail
Given
(i) the "summary" ($\mathbf{c}$) of the input sequence,
(ii) the previous output word ($y_{t-1}$),
(iii) the previous state ($h_{t-1}$),
the hidden state of the decoder is:
$h_t = f(h_{t-1}, y_{t-1}, \mathbf{c})$
Then, we can find the most likely next word:
$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, \mathbf{c}) = g(h_t, y_{t-1}, \mathbf{c})$
$f, g$: activation functions of our choice. For $g$, we need a function that maps to probabilities, e.g., softmax.
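A minimal sketch of one decoder step under these definitions, taking $f$ to be a simple tanh layer and $g$ a softmax layer (the parameter names and sizes are assumptions, not the slides' formulation of $f$ and $g$):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev, c, params):
    """h_t = f(h_{t-1}, y_{t-1}, c);  P(y_t | ...) = g(h_t, y_{t-1}, c)."""
    Wh, Wy, Wc, U, Uy, Uc = params
    h_t = np.tanh(Wh @ h_prev + Wy @ y_prev + Wc @ c)        # f
    p_t = softmax(U @ h_t + Uy @ y_prev + Uc @ c)            # g
    return h_t, p_t

# toy sizes: hidden 8, word embedding 6, summary 8, vocabulary 10
rng = np.random.default_rng(0)
H, E, C, V = 8, 6, 8, 10
params = (rng.normal(size=(H, H)), rng.normal(size=(H, E)), rng.normal(size=(H, C)),
          rng.normal(size=(V, H)), rng.normal(size=(V, E)), rng.normal(size=(V, C)))
h, p = decoder_step(np.zeros(H), np.zeros(E), rng.normal(size=C), params)
print(p.argmax(), p.sum())   # most likely next word index; probabilities sum to 1
```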
Encoder-decoder
• Jointly trained to maximize the conditional log-likelihood of the correct output sequences:
$\max_{\theta} \dfrac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$
Check the following tutorial
• http://smerity.com/articles/2016/google_nmt_arch.html
More on attention
https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/
More on attention
https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/
Attention mechanism: A two-layer neural network.
Input: $z_i$ and $h_j$. Output: $e_j$, a scalar for the importance of word $j$.
The scores of the words are normalized: $a_j = \mathrm{softmax}(e_j)$
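A small sketch of this scoring step, with a two-layer MLP score, softmax normalization, and a weighted sum producing the context vector; all names and sizes below are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(z_i, hs, W1, W2, v):
    """Score each encoder state h_j against decoder state z_i with a 2-layer MLP,
    normalize the scores with softmax, and return the weighted context vector."""
    e = np.array([v @ np.tanh(W1 @ z_i + W2 @ h_j) for h_j in hs])  # scores e_j
    a = softmax(e)                                                  # weights a_j
    context = (a[:, None] * hs).sum(axis=0)                         # weighted sum
    return a, context

# toy usage: 5 source words; all dimensions are made up
rng = np.random.default_rng(0)
Dz, Dh, Da, T = 6, 4, 8, 5
W1, W2, v = rng.normal(size=(Da, Dz)), rng.normal(size=(Da, Dh)), rng.normal(size=Da)
hs = rng.normal(size=(T, Dh))
a, ctx = attention(rng.normal(size=Dz), hs, W1, W2, v)
print(a.round(2), ctx.shape)   # attention weights over the 5 words, context vector
```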
More on attention
2017
NMT can be done at char-level too
• http://arxiv.org/abs/1603.06147
This can be done with CNNs
2017