![Page 1: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/1.jpg)
Deep Learning for Natural Language Processing
Yoshimasa Tsuruoka
The University of Tokyo
![Page 2: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/2.jpg)
Outline
• Introduction• Word embeddings
– Word2vec, Skip-gram, fastText
• Recurrent Neural Networks– LSTM, GRU
• Neural Machine Translation– Encoder-decoder model– Transformer– Unusupervised NMT
• Pretraining methods– ELMo, GPT, BERT, UNI-LM– Sentiment Analysis, Textual Entailment, Question Answering,
Summarization
![Page 3: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/3.jpg)
Story generation
• GPT-2 [Radford et al., 2019]
https://blog.openai.com/better-language-models/
![Page 4: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/4.jpg)
![Page 5: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/5.jpg)
Question Answering
https://rajpurkar.github.io/SQuAD-explorer/
![Page 6: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/6.jpg)
Now machines are better than human!?
![Page 7: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/7.jpg)
Word representation
• Vector representation
– A word is represented as a real-valued vector
• Dimensionality: 50 ~ 1000
– Similar words are treated similarly
• Alleviates the problem of data sparsity
dog
cattea
coffee
![Page 8: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/8.jpg)
Word2Vec [Mikolov et al., 2013]
• How do you learn good word vectors?
• Optimize the vectors so that they can predict well the occurrence of the words in a document
• Two approaches:– Skip-grams
– Continuous Bag of Words (CBOW)
![Page 9: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/9.jpg)
Continuous Bag of Words (CBOW)
… to prevent another financial crisis such as …
… to prevent another financial crisis such as …
… to prevent another financial crisis such as …
• Predict the center word
![Page 10: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/10.jpg)
Skip-gram
• Predict each context word
… to prevent another financial crisis such as …
… to prevent another financial crisis such as …
… to prevent another financial crisis such as …
![Page 11: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/11.jpg)
Skip-gram
• Probability of word o given the center word c
• Each word in the vocabulary has two vectors– u: vector used when the word appears as a context
(outside) word– v: vector used when the word appears as the center
word
V
w
c
T
w
c
T
ocop
1
exp
exp
vu
vu
![Page 12: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/12.jpg)
Skip-gram
• Objective for training
– Maximize the likelihood
– Minimize the negative log likelihood
T
t jmjm
tjt wwpJ1 0,
;
T
t jmjm
tjt wwpT
J1 0,
;log1
![Page 13: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/13.jpg)
Skip gram with negative sampling
• Skip gram
– Needs to compute the summation for all words in the vocabulary
– Computational cost is large
• Skip gram with negative sampling
– Logistic regression problems
– Generate negative examples by random sampling
![Page 14: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/14.jpg)
1000-dimensional vectors learned with Skip-gram are converted to two-dimensional vectors by PCA
Mikolov, et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013
![Page 15: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/15.jpg)
Analogical reasoning
• What is the female equivalent of a king?
• Analogical reasoning by arithmetic operation of word vectors
vec(king) – vec(man) + vec(woman) ≈ vec(queen)
(Mikolov et al., 2013)
![Page 16: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/16.jpg)
fastText [Bojanowski et al., 2017]
• Address the rare word problem– A word is represented as a bag of character n-grams
– Each character n-gram has a vector representation
– Each word is represented as the sum of character n-gram vectors
– Example) The word “interlink”• 3-gram: <in, int, nte, ter, erl, rli, lin, ink, nk>
• 4-gram: <int, inte, nter, terl, erli, rlin, link, ink>
• 5-gram: <inte, inter, nterl, terli, erlin, rlink, link>
• 6-gram: <inter, interl, nterli, terlin, erlink, rlink>
• Whole word: <interlink>
Bojanowski et al., Enriching Word Vectors with Subword Information, TACL 2017
![Page 17: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/17.jpg)
• Feed-forward Neural Network
Neural Network
Sigmoid function
Dx
1x
1z
Mz
Ky
1y
zWy
xWz
yz
zx
The sizes of the input and output vectors are fixed
xe
x
1
1
Input Output
![Page 18: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/18.jpg)
Recurrent Neural Network (RNN)
• Can process a sequence of any length
tt
ttt
hWy
hWxWh
yh
1
hhhx
tx
ty
th
…
equivalent
Share the weight parameters W
Input Vec
State Vec
Output Vec
18
1y 2y 3y 4y
1h 2h 3h 4h
1x2x 3x
4x
![Page 19: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/19.jpg)
1ty
1th th 1th
1tx tx 1tx
hxW
hhW
hhW
hhW
ty
yhW
yhW
yhW
1ty
hxW
hxW
![Page 20: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/20.jpg)
RNN and NLP
• In NLP, we process sequences of words and characters– Language modeling, part-of-speech tagging, machine translation, etc.
• E.g., Language modeling– Predict the next word
…
Baseball is popular in
is popular in ?Contextinformation
20
1h 2h 3h 4h
![Page 21: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/21.jpg)
RNN language model
• Words: w1,w2,…,wt-1,wt,wt+1,…,wT
• Word vectors: x1,x2,…,xt-1,xt,xt+1,…,xT
• Hidden states
• Word prediction
jttjt
tt
ywwvwP ,11
S
ˆ,...,
softmaxˆ
hWy
1
hhhx
ttt hWxWh
ℝSW hDV
ℝth hD
ℝtxd
ℝhxW
dDh
ℝhhW hh DD
![Page 22: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/22.jpg)
Softmax function
• Softmax function
– Convert real-valued scores to a probability distribution
• Example
n
i
i
n
xZ
Z
x
Z
x
Z
x
1
21
exp
exp,...,
exp,
expsoftmax
x
xxxx
0.2
2.1
5.3
181.0
007.0
812.01
0.2
2.1
5.3
0.22.15.3
e
e
e
eee
softmax( )
Add up to 1
![Page 23: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/23.jpg)
Training
• Objective– Cross Entropy Loss
– Equivalent to negative log-likelihood• Correct word should be assigned a high probability
T
t
V
j
jtjt yyT
J1 1
,,ˆlog
1
Takes on 1 for the correct word, 0 otherwise
![Page 24: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/24.jpg)
1ˆ
ty
1th
1tx
SW
Softmax
Cross Entropy1ty
+
Sigmoidhh
W
hxW
+
×
×
×
ty
th
tx
SW
Softmax
Cross Entropyty
+
Sigmoidhh
W
hxW
+
×
×
×
![Page 25: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/25.jpg)
Problems with vanilla RNNs
• Vanishing/exploding gradients
• Fail to capture long-distance dependencies
• Solutions
– Directly connect ht and ht-1 depending on context
– Gated Recurrent Units (GRUs)
– Long Short-Term Memory (LSTM)
![Page 26: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/26.jpg)
Gated Recurrent Unit (GRU) [Cho et al., 2014]
• Update gate:
• Reset gate:
• State update
– Ignore the previous state if the reset gate is 0
– Copy the previous state if the update gate is 1
1 t
z
t
z
t hUxWz 1 t
r
t
r
t hUxWr
1tanh~
tttt UhrWxh
ttttt hzhzh~
11
![Page 27: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/27.jpg)
Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]
• Input gate:
• Forget gate:
• Output gate:
• Memory cell
• State update
o
t
o
t
o
t bhUxWo 1
i
t
i
t
i
t bhUxWi 1
f
t
f
t
f
t bhUxWf 1
1
~
1
~~
~tanh~
ttttt
c
t
c
t
c
t
cfcic
bhUxWc
ttt coh tanh
![Page 28: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/28.jpg)
LSTM (Long Short-Term Memory)
28
o
t
o
t
o
t bhUxWo 1
i
t
i
t
i
t bhUxWi 1
f
t
f
t
f
t bhUxWf 1
1
~
1
~~
~tanh~
ttttt
c
t
c
t
c
t
cfcic
bhUxWc
ttt coh tanh
![Page 29: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/29.jpg)
Performance of RNN language models
Kim et al., Character-Aware Neural Language Models, 2015
![Page 30: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/30.jpg)
Perplexity
• Perplexity (PP or PPL)
N
N
i ii
N
N
NN
wwwP
wwwP
wwwPW
1 11
21
1
21
...
1
...
1
...PP ・ Average branching factor of word prediction・ The smaller, the better
log PP 𝑊 = −1
𝑁
𝑖=1
𝑁
log 𝑃 𝑤𝑖|𝑤1, … , 𝑤𝑖−1
→ Cross-entropy loss per word
![Page 31: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/31.jpg)
Examples generated by LSTM
• Trained with LaTeX source (word-level)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
![Page 32: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/32.jpg)
Machine Translation
• Translate one language into another
• Train a translation model with a parallel corpus– E.g., WMT’14 English-to-French dataset
• 12 million sentences from Europarl, News Commentary, etc.• 300 million words (English)• 350 million words (Frence)
I'm here on vacation
Je suis là pour les vacances
32
![Page 33: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/33.jpg)
Japanese-English Subtitle Corpus[Pryzant et al., 2017]
![Page 34: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/34.jpg)
ASPEC corpus [Nakazawa et al., 2016]
![Page 35: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/35.jpg)
Neural Machine Translation
• Encoder-decoder model (Sutskever et al., 2014)
– Encoder RNN• Convert the source sentence into a real-valued vector
– Decoder RNN• Generate a sentence in the target language from the vector
35
LSTM LSTM LSTM LSTM LSTM LSTM LSTM LSTM
He likes cheese <EOS> 彼 は チーズ が 好き
彼 は チーズ が 好き <EOS>
LSTM
![Page 36: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/36.jpg)
Example
Output of the system
Reference translation
36
Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014
![Page 37: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/37.jpg)
Results
The BLUE score of the state-of-the-art system was 37.0
• WMT’14 English to French
Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014
![Page 38: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/38.jpg)
BLEU score
• BLEU score [Papineni et al., 2002]
– Most widely used evaluation measure for MT
4
1
log4
1expBPBLUE
n
np
otherwise1exp
if1
BP
c
r
rc
Modified n-gram precision
Brevity Penalty
c: length of the generated sentencer: length of the reference sentence
![Page 39: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/39.jpg)
Vector representation of source sentences
Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014 39
![Page 40: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/40.jpg)
Problems• Represents the content of the source sentence with
a single vector
– Hard to represent a long sentence
• Sentence length vs translation accuracy
Encoder-decoder model Statistical Machine Translation
Cho et al., On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014
![Page 41: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/41.jpg)
Attention (Bahdanau et al., 2015)
• Look at the hidden state of each word in the source sentence when updating the states of the decoder
xT
j
jiji
1
hc
xT
k ik
ij
ij
e
e
1exp
exp
jiij dNNFeedForware hs ,1
Weighted average of hidden states in the decor
Weights (which add up to 1)
tc
![Page 42: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/42.jpg)
Bidirectional LSTM (BiLSTM)
• Stack two RNNs (forward and backward)
– Capture left and right context information
1h 2h 3h4h
1h 2h 3h4h
1x
1y
2x
2y
3x
3y
4x
4y
. . .
. . .
![Page 43: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/43.jpg)
(Bahdanau et al., 2015)
Attention Example(English to French)
![Page 44: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/44.jpg)
Sentence length and Translation Accuracy
(Bahdanau et al., 2015)
![Page 45: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/45.jpg)
Attention
• Improvements by Luong et al. (2015)
![Page 46: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/46.jpg)
Translation accuracy
• WMT’14 English-German results
– 4.5M sentence pairs
Luong et al., (2015)
![Page 47: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/47.jpg)
Examples
Luong et al., (2015)
![Page 48: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/48.jpg)
Google Neural Machine Translation system (GNMT) (Wu et al., 2016)
• Model– Encoder: 8-layer LSTM (the lowest layer is bidirectional)– Decoder: 8-layer LSTM– Attention: from the bottom layer of the decoder to the top
layer of the encoder– Unit: Wordpiece model
• Training data– For research
• WMT corpus: En-Fr (36M), En-De (5M), etc.
– For production• Two or three orders of magnitude larger than the WMT corpus
![Page 49: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/49.jpg)
GNMT system
(Wu et al., 2016)
![Page 50: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/50.jpg)
Transformer (Vaswani et al., 2017)
• Published in June of 2017
• No use of RNN
``Attention is all you need’’
• Outperformed GNMT with much less computational cost
– Training• 36M English-French sentence pairs
• 8 GPUs
• Completed in 3.5 day
Vaswani et al., Attention Is All You Need, NIPS 2017
![Page 51: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/51.jpg)
Transformer
He likes cheese 彼 は チーズ
Self-attention
Self-attention
Self-attention
Self-attention
Self-attention
Self-attention Self-attention + attention
が
Self-attention + attention
Self-attention + attention
Self-attention + attention
Self-attention + attention
Self-attention + attention
Encoder Decoder
![Page 52: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/52.jpg)
• Encoder
– Compute a contextual embedding for each word
He likes cheese
𝐱1 𝐱2 𝐱3
Self-attention & Feed Forward NN
𝐫𝟏(1)
𝐫𝟐(1)
𝐫𝟑(1)
Self-attention & Feed Forward NN
𝐫𝟏(6)
𝐫𝟐(6)
𝐫𝟑(6)
𝐫𝟏(5)
𝐫𝟐(5)
𝐫𝟑(5)
: : :
∈ ℝ512
∈ ℝ512
∈ ℝ512
∈ ℝ512
Sixlayers
Word embedding
![Page 53: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/53.jpg)
• Encoder self-attention (1st layer)
He likes cheese
𝐯1 𝐯2 𝐯3𝐤1 𝐤2 𝐤3
𝐪1 ∙ 𝐤1 𝐪1 ∙ 𝐤2 𝐪1 ∙ 𝐤3
0.3 0.5 0.2
𝐱1 𝐱2 𝐱3
𝑊𝐾 𝑊𝑉 𝑊𝐾 𝑊𝑉 𝑊𝐾 𝑊𝑉
He likes cheese𝐱1 𝐱2 𝐱3
𝒒1 𝒒2 𝒒3
𝑊𝑄 𝑊𝑄 𝑊𝑄
Softmax
𝐳1 = 0.3𝐯1 + 0.5𝐯2 + 0.2𝐯3𝐫1 = FFN 𝒛1
Input
← Output(Input to Next Layer)
Dot product of Query & Key
![Page 54: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/54.jpg)
Position-wise Feed-Forward Networks
• Compute the output of each attention layer
– Two-layer feed-forward neural network
– Linear transformation -> ReLU -> Linear Transformation
FFN 𝐱 = max 0, 𝐱𝑊1 + 𝐛1 𝑊2 + 𝐛2
𝑊1 ∈ ℝ512×2048 𝐛1 ∈ ℝ2048
𝑊2 ∈ ℝ2048×512 𝐛2 ∈ ℝ512
𝐱
ReLU
𝑊1
𝑊2
![Page 55: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/55.jpg)
Scaled dot product attention
• Compute the attention for all words at once
𝐪1
𝐪2
𝐪𝟑
𝐤1 𝐤2 𝐤3 =
𝐪1 ∙ 𝐤1 𝐪1 ∙ 𝐤2 𝐪1 ∙ 𝐤3𝐪2 ∙ 𝐤1 𝐪2 ∙ 𝐤2 𝐪2 ∙ 𝐤3𝐪𝟑 ∙ 𝐤1 𝐪3 ∙ 𝐤2 𝐪3 ∙ 𝐤3
𝑄 ∈ ℝ𝑛×𝑑𝑘
𝐾 ∈ ℝ𝑛×𝑑𝑘
𝑉 ∈ ℝ𝑛×𝑑𝑣
Attention 𝑄, 𝐾, 𝑉 = softmax𝑄𝐾𝑇
𝑑𝑘𝑉
0.1 0.2 0.70.4 0.3 0.30.8 0.1 0.1
𝐯1
𝐯2
𝐯𝟑
=
0.1𝐯1 + 0.2𝐯2 + 0.7𝐯3
0.4𝐯1 + 0.3𝐯2 + 0.3𝐯3
0.8𝐯1 + 0.1𝐯𝟐 + 0.1𝐯3
𝑸 𝑲𝑻
𝐬𝐨𝐟𝐭𝐦𝐚𝐱(∙) 𝑽
scaling
![Page 56: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/56.jpg)
Multi-Head Attention
• Use 8 heads with different parameters
Multihead 𝑄,𝐾, 𝑉 = Concat head1, … , head8 𝑊𝑂
head𝑖 = Attention 𝑄𝑊𝑖𝑄, 𝐾𝑊𝑖
𝐾, 𝑉𝑊𝑖𝑉
𝑊𝑖𝑄∈ ℝ512×64
𝑊𝑖𝐾 ∈ ℝ512×64
𝑊𝑖𝑉 ∈ ℝ512×64
𝑊𝑂 ∈ ℝ512×512
Each head collects a differentkinds of information
head𝑖 ∈ ℝ𝑛×64
Multihead ∙ ∈ ℝ𝑛×512
DimensionalityReduction
![Page 57: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/57.jpg)
Decoder
• Six layers of attention– Each layer has two types of attention
mechanism
• Self-attention– Collects information on the generated words
• Attend to the output of the layer below
– Mask the information about the future
• Encoder-decoder attention– Collects information on the source sentence
• Attend to the output of the encoder
![Page 58: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/58.jpg)
Add & Norm
• Add (Residual Connection)
– Learn the difference between input and output
– Implementation
• Simply add input to output
– Reduces train/test error
• Norm (Layer Normalization)
– Normalize the output of each layer
• Mean = 0, Variance = 1
– Faster convergence
![Page 59: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/59.jpg)
Positional Encoding
• Encode position information
– 𝑃𝐸 𝑝𝑜𝑠,2𝑖 = sin 𝑝𝑜𝑠
10000 Τ2𝑖 512
– 𝑃𝐸 𝑝𝑜𝑠,2𝑖+1 = cos 𝑝𝑜𝑠
10000 Τ2𝑖 512
– 0 ≤ 𝑖 < 256
– Wavelength: 2𝜋~10000 ∙ 2𝜋
• Represents each position with soft binary notation
![Page 60: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/60.jpg)
http://jalammar.github.io/illustrated-transformer/
𝑖𝑛𝑑𝑒𝑥
𝑝𝑜𝑠
![Page 61: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/61.jpg)
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Transformer
![Page 62: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/62.jpg)
Visualizing attention
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The animal didn’t cross the street because it was too tired.
The animal didn’t cross the street because it was too wide.
![Page 63: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/63.jpg)
Performance
• Translation accuracy and training cost
![Page 64: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/64.jpg)
Results with different hyper-parameter settings
![Page 65: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/65.jpg)
Unsupervised NMT
• Unsupervised NMT [Artetxe et al., 2018; Lample et al., 2018]
– Build translation models by using only monolingual corpora
• Method1. Learn initial translation models
2. Repeat• Generate source and target sentences using the current
translation models
• Train new translation models using the generated sentences
![Page 66: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/66.jpg)
Example
Lample et al., Unsupervised Machine Translation Using Monolingual Corpora Only, ICLR 2018
![Page 67: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/67.jpg)
Initialization
• Approaches– Use a bilingual dictionary (Klementiev et al. 2012)
– Use a dictionary inferred in an unsupervised way (Conneau et al., 2018; Artetxe et al., 2017)
– Use a shared sub-word vocabulary between two languages (Lample et al., 2018)
1. Join the monolingual corpora
2. Apply BPE tokenization (Sennrich et al., 2016) on the resulting corpus
3. Learn token embeddings by fastText
![Page 68: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/68.jpg)
Language modeling (denoising auto-encoding)
EncoderCorpus(Source)
Corpus (Target)
Decoder(→ Source)
EncoderDecoder
(→ Target)
𝑥𝑠𝑟𝑐
Noise (some words dropped and swapped)
ො𝑥𝑠𝑟𝑐
𝑥𝑡𝑔𝑡
Noise
ො𝑥𝑡𝑔𝑡
The first token of the decoder specifies the output language
ℒ𝑙𝑚 = 𝐸𝑥~𝑆 − log𝑃𝑠→𝑠 𝑥|𝐶 𝑥 + 𝐸𝑥~𝑇 − log𝑃𝑡→𝑡 𝑥|𝐶 𝑥
![Page 69: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/69.jpg)
Back-translation
EncoderCorpus(Source)
Corpus (Target)
Decoder(→ Source)
𝑥𝑠𝑟𝑐 ො𝑥𝑠𝑟𝑠
𝑥𝑡𝑔𝑡
Model at Prev. Iter.(→ Target)
EncoderDecoder
(→ Target)ො𝑥𝑡𝑔𝑡
Model at Prev. Iter.
(→ Source)
ℒ𝑏𝑎𝑐𝑘 = 𝐸𝑥~𝑆 − log𝑃𝑡→𝑠 𝑥|𝑣∗ 𝑥 + 𝐸𝑥~𝑇 − log𝑃𝑠→𝑡 𝑥|𝑢∗ 𝑥
![Page 70: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/70.jpg)
Comparison to supervised MT
• WMT’14 En-Fr
Lample et al., Phrase-Based & Neural Unsupervised Machine Translation, EMNLP 2018
![Page 71: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/71.jpg)
Pretraining methods
• Deep learning models require a large amount of labeled data for training
• How can we build accurate models without using a large amount of labeled data?
• Pretraining methods– Train a language model with a large amount of raw
(unlabeled) text and then adapt it to various NLP tasks• Feature-based
– ELMo (Peters et al., 2018)
• Fine-tuning– GPT (Radford et al., 2018)– BERT (Devlin et al., 2019)
![Page 72: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/72.jpg)
ELMo (Peters et al., 2018)• Train two language models (left-to-right and right-to-left) on a
large raw corpus
• Use the hidden vectors of LSTMs as features for NLP models
Peters et al., Deep contextualized word representations, NAACL 2018
LSTM LSTM LSTM
𝑡1 𝑡2 𝑡3
LSTM LSTM LSTM
LSTM LSTM LSTM LSTM LSTM LSTM
𝑡2 𝑡3 END START 𝑡1 𝑡2
![Page 73: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/73.jpg)
ELMo (Peters et al., 2018)
Peters et al., Deep contextualized word representations, NAACL 2018
• Accuracy of existing NLP models can be improved by simply adding features produced by ELMo
![Page 74: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/74.jpg)
GPT (Radford et al., 2018)
• GPT (Generative Pre-trained Transformer)
• Method– Train a Transformer language model with a large
amount of raw text• 12-layer decoder-only Transformer
– sequences of up to 512 tokens.
• Took one month with 8 GPUs
• BooksCorpus: 7000 unpublished books (~5GB of text).
– Add a task-specific layer to the Transformer model and fine-tune it with labeled data• 3 epochs of training was sufficient for most cases
Radford et al., Improving language understanding by generative pre-training, 2018
![Page 75: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/75.jpg)
GPT (Radford et al., 2018)
Radford et al., Improving language understanding by generative pre-training, 2018
![Page 76: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/76.jpg)
MNLI (Multi-Genre Natural Language Inference, MultiNLI) [Williams et al., 2018]
Williams et al., A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, NAACL-HLT 2018
• Entailment, Neural, or Contradiction
![Page 77: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/77.jpg)
RACE test (Lai et al., 2017)
https://openai.com/blog/language-unsupervised/
![Page 78: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/78.jpg)
• Results on natural language inference tasks
• Results on QA and common sense reasoning tasks
![Page 79: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/79.jpg)
GPT (Radford et al., 2018)
Number of layers and accuracy Zero-shot performance
![Page 80: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/80.jpg)
BERT (Devlin et al., 2019)
• BERT (Bidirectional Encoder Representations from Transformers)
• Method1. Pretraining with unlabeled data
• Learn deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context
2. Fine-tuning with labeled data• Add a task-specific layer to the model
• Fine-tune all the parameters
![Page 81: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/81.jpg)
Pretraining
• Masked Language Model (MLM)– Some of the tokens are replaced with a special token [MASK]
– The model is trained to predict them
• Next Sentence Prediction (NSP)– Predict whether the given two sentences are consecutive
sentences in a document
• Data– BooksCorpus (800M words)
– English Wikipedia (2,500M words)
The man went to the [MASK] to buy a [MASK] of milk
store gallon
![Page 82: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/82.jpg)
Input representation
• First token is always [CLS] (special token for classification)
• Segments are separated by [SEP]
• Add a segment-specific embedding to each token
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 83: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/83.jpg)
Figure adapted from Devlin et al. (2019)
NSP Masked LM Masked LM
Masked Sentence A Masked Sentence B
Pretraining
![Page 84: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/84.jpg)
Fine-tuning for Single Sentence Tagging Tasks
• CoNLL-2003 NER
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 85: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/85.jpg)
CoNLL-2003 NER[Tjong Kim Sang, 2003]
• Named Entity Recognition
B-ORG O B-PER O O B-LOC
U.N. official Ekeus heads for Baghdad
![Page 86: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/86.jpg)
Fine-turning for Single Sentence Classification tasks
• SST-2, CoLA
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 87: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/87.jpg)
Stanford Sentiment Treebank (SST)[Socher et al., 2013]
https://nlp.stanford.edu/sentiment/treebank.html
A
terrific
B movie
-- in
fact ,
the best in
recent memory
.
![Page 88: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/88.jpg)
CoLA (The Corpus of Linguistic Acceptability) [Warstadtet al., 2018]
• Grammatical or ungrammatical
![Page 89: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/89.jpg)
Fine-tuning for Sentence Pair Classification tasks
• MNLI, QQP, QNLI, STS-B, MPRC, RTE, SWAG
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 90: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/90.jpg)
Textual Entailment datasets
https://openai.com/blog/language-unsupervised/
![Page 91: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/91.jpg)
Evaluation
• GLUE test results– BERT outperformed GPT on all tasks
– BERTBASE• L=12, H=768, A=12, Total Param-eters=110M• Roughly the same size as OpenAI GPT
– BERTLARGE• L=24, H=1024, A=16, Total Parameters=340M
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 92: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/92.jpg)
GLUE (General Language Understanding Evaluation) benchmark [Wang et al., 2019]
• Nine tasks on natural language understanding– Single-Sentence Tasks
• CoLA (The Corpus of Linguistic Acceptability) [Warstadtet al., 2018]• SST-2 (The Stanford Sentiment Treebank) [Socheret al., 2013]
– Similarity and Paraphrase Tasks• MRPC (Microsoft Research Paraphrase Corpus) [Dolan and Brockett, 2005]• STS-B (The Semantic Textual Similarity Benchmark) [Cer et al., 2017]• QQP (Quora Question Pairs) [Chen et al., 2018]
– Inference Tasks• MNLI (Multi-Genre Natural Language Inference) [Williams et al., 2018]• QNLI (Question Natural Language Inference) [Wanget al., 2018]• RTE (Recognizing Textual Entailment) [Bentivogli et al., 2009]• WNLI (Winograd NLI) [Levesque et al., 2011]
Wang et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR 2019
![Page 93: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/93.jpg)
Fine-tuning for extractive QA
• SQuAD
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 94: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/94.jpg)
SQuAD [Rajpurkar et al., 2016]
![Page 95: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/95.jpg)
Evaluation
• Results on SQuAD 1.1
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding, 2019
![Page 96: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/96.jpg)
UNI-LM (Dong et al., 2019)
• UNI-LM (Unified Pre-trained Language Model)– Model that can be fine-tuned for both NLU and NLG
• Pre-training– Three kinds of language models
• Unidirectional LM (left-to-right / right-to-left)• Bidirectional LM• Sequence-to-sequence LM
– Data• English Wikipedia & BooksCorpus
• Fine-tuning
Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019
![Page 97: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/97.jpg)
Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019
![Page 98: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/98.jpg)
Examples from Gigaword corpus
Nallapati et al., Abstractive Text Summarization using Sequence-to-sequence RNNs andBeyond, 2016
![Page 99: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/99.jpg)
Example from CNN/DailyMail corpus
See et al., Get To The Point: Summarization with Pointer-Generator Networks, 2016
![Page 100: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/100.jpg)
Evaluation
• CNN/DailyMail • Gigaword
Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019
![Page 101: Deep Learning for Natural Language Processingtsuruoka/tmp/tutorialNLPver2.pdf• Introduction • Word embeddings –Word2vec, Skip-gram, fastText • Recurrent Neural Networks –LSTM,](https://reader033.vdocuments.site/reader033/viewer/2022053018/5f1d51f05ef267394434aadc/html5/thumbnails/101.jpg)
Summary
• Word embeddings– Word2vec, Skip-gram, fastText
• Recurrent Neural Networks– LSTM, GRU
• Neural Machine Translation– Encoder-decoder model– Transformer– Unusupervised NMT
• Pretraining methods– ELMo, GPT, BERT, UNI-LM– Sentiment Analysis, Textual Entailment, Question
Answering, Summarization