
  • NPFL114, Lecture 9

    Recurrent Neural Networks III

    Milan Straka

    April 29, 2019

    Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics


  • Recurrent Neural Networks

    Diagram: a single RNN cell with an input, an output, and a state.

    Diagram: the same RNN cell unrolled over a sequence; inputs 1-4 produce outputs 1-4, with the state passed from each step to the next.


  • Basic RNN Cell

    Diagram: a single cell taking an input and the previous state, producing an output which is the new state.

    Given an input $x^{(t)}$ and previous state $s^{(t-1)}$, the new state is computed as

    $$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta).$$

    One of the simplest possibilities is

    $$s^{(t)} = \tanh(U s^{(t-1)} + V x^{(t)} + b).$$
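    A minimal NumPy sketch of this recurrence (the sizes, initialization, and random inputs below are illustrative, not from the slides):

```python
import numpy as np

def rnn_cell(s_prev, x, U, V, b):
    """One step of the basic RNN cell: s_t = tanh(U s_{t-1} + V x_t + b)."""
    return np.tanh(U @ s_prev + V @ x + b)

# Illustrative sizes: state dimension 4, input dimension 3.
rng = np.random.default_rng(0)
U, V, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)

s = np.zeros(4)                      # initial state s_0
for x in rng.normal(size=(5, 3)):    # a sequence of 5 inputs
    s = rnn_cell(s, x, U, V, b)      # the same cell applied at every step
print(s)
```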


  • Basic RNN Cell

    Basic RNN cells suffer a lot from vanishing/exploding gradients (the challenge of long-term dependencies).

    If we simplify the recurrence of states to

    $$s^{(t)} = U s^{(t-1)},$$

    we get

    $$s^{(t)} = U^t s^{(0)}.$$

    If $U$ has eigenvalue decomposition $U = Q \Lambda Q^{-1}$, we get

    $$s^{(t)} = Q \Lambda^t Q^{-1} s^{(0)}.$$

    The main problem is that the same function is iteratively applied many times.

    Several more complex RNN cell variants have been proposed, which alleviate this issue to some degree, namely LSTM and GRU.
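    A small numeric illustration of why this matters (values chosen only for the demonstration): with a diagonal $U$, the state is scaled by powers of the eigenvalues, so its norm vanishes or explodes exponentially with the sequence length.

```python
import numpy as np

s0 = np.array([1.0, 1.0])
for lam in (0.9, 1.1):                  # eigenvalue below / above one
    U = np.diag([lam, lam])             # simplest U = Q Lambda Q^{-1} with Q = I
    s = s0
    for _ in range(50):                 # s_t = U^t s_0
        s = U @ s
    print(lam, np.linalg.norm(s))       # the norm vanishes for 0.9, explodes for 1.1
```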


  • Long Short-Term Memory

    Diagram: the LSTM cell; $x_t$ and $h_{t-1}$ feed three $\sigma$ gates and a $\tanh$ candidate, which update the memory cell $c_t$ and produce the output $h_t$.

    Later, in Gers, Schmidhuber & Cummins (1999), a possibility to forget information from the memory cell $c_t$ was added.

    $$i_t \leftarrow \sigma(W^i x_t + V^i h_{t-1} + b^i)$$
    $$f_t \leftarrow \sigma(W^f x_t + V^f h_{t-1} + b^f)$$
    $$o_t \leftarrow \sigma(W^o x_t + V^o h_{t-1} + b^o)$$
    $$c_t \leftarrow f_t \cdot c_{t-1} + i_t \cdot \tanh(W^y x_t + V^y h_{t-1} + b^y)$$
    $$h_t \leftarrow o_t \cdot \tanh(c_t)$$
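    A NumPy sketch of a single LSTM step following these equations (parameter shapes and initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p holds the weight matrices W^*, V^* and biases b^*."""
    i = sigmoid(p["Wi"] @ x + p["Vi"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x + p["Vf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x + p["Vo"] @ h_prev + p["bo"])   # output gate
    c = f * c_prev + i * np.tanh(p["Wy"] @ x + p["Vy"] @ h_prev + p["by"])
    h = o * np.tanh(c)
    return h, c

# Illustrative sizes: input dimension 3, hidden dimension 4.
rng = np.random.default_rng(0)
p = {f"W{g}": rng.normal(size=(4, 3)) for g in "ifoy"}
p |= {f"V{g}": rng.normal(size=(4, 4)) for g in "ifoy"}
p |= {f"b{g}": np.zeros(4) for g in "ifoy"}

h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h, c = lstm_step(x, h, c, p)
print(h)
```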


  • Long Short-Term Memory

    http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png


  • Gated Recurrent Unit

    Diagram: the GRU cell; $x_t$ and $h_{t-1}$ feed the reset gate $r_t$ and the update gate $u_t$, and the candidate $\hat{h}_t$ is interpolated with $h_{t-1}$ to produce $h_t$.

    $$r_t \leftarrow \sigma(W^r x_t + V^r h_{t-1} + b^r)$$
    $$u_t \leftarrow \sigma(W^u x_t + V^u h_{t-1} + b^u)$$
    $$\hat{h}_t \leftarrow \tanh(W^h x_t + V^h (r_t \cdot h_{t-1}) + b^h)$$
    $$h_t \leftarrow u_t \cdot h_{t-1} + (1 - u_t) \cdot \hat{h}_t$$
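    Analogously, a NumPy sketch of a single GRU step (shapes and initialization again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step following the reset/update gate equations above."""
    r = sigmoid(p["Wr"] @ x + p["Vr"] @ h_prev + p["br"])          # reset gate
    u = sigmoid(p["Wu"] @ x + p["Vu"] @ h_prev + p["bu"])          # update gate
    h_hat = np.tanh(p["Wh"] @ x + p["Vh"] @ (r * h_prev) + p["bh"])
    return u * h_prev + (1 - u) * h_hat

rng = np.random.default_rng(0)
p = {f"W{g}": rng.normal(size=(4, 3)) for g in "ruh"}
p |= {f"V{g}": rng.normal(size=(4, 4)) for g in "ruh"}
p |= {f"b{g}": np.zeros(4) for g in "ruh"}

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = gru_step(x, h, p)
print(h)
```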

  • Gated Recurrent Unit

    http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png


  • Word Embeddings

    One-hot encoding considers all words to be independent of each other.

    However, words are not independent – some are more similar than others.

    Ideally, we would like some kind of similarity in the space of the word representations.

    Distributed Representation

    The idea behind distributed representation is that objects can be represented using a set of common underlying factors.

    We therefore represent words as fixed-size embeddings into an $\mathbb{R}^d$ space, with the vector elements playing the role of the common underlying factors.


  • Word Embeddings

    The word embedding layer is in fact just a fully connected layer on top of one-hot encoding. However, it is important that this layer is shared across the whole network.

    Diagram: on the left, a word in one-hot encoding (size $V$) is fed directly into separate layers of sizes $D_1$, $D_2$, $D_3$; on the right, the one-hot word first passes through a shared $V \times D$ embedding layer, followed by $D \times D_1$, $D \times D_2$, $D \times D_3$ layers.
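    The equivalence of an embedding lookup and a fully connected layer over a one-hot vector can be checked directly (sizes illustrative):

```python
import numpy as np

V, D = 10, 4                         # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))          # the shared embedding matrix

word_id = 7
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# A fully connected layer applied to the one-hot input selects one row of E,
# which is exactly what an embedding lookup does.
assert np.allclose(one_hot @ E, E[word_id])
```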


  • Word Embeddings for Unknown Words

    Recurrent Character-level WEs

    Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.


  • Word Embeddings for Unknown Words

    Convolutional Character-level WEs

    Figure 1 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.


  • Basic RNN Applications

    Sequence Element Classification

    Use outputs for individual elements.

    Diagram: the unrolled RNN; inputs 1-4 produce outputs 1-4, with the state passed between steps.

    Sequence Representation

    Use the state after processing the whole sequence (alternatively, take the output of the last element).


  • Structured Prediction

    Consider generating a sequence $y_1, \ldots, y_N \in Y^N$ given input $x_1, \ldots, x_N$.

    Predicting each sequence element independently models the distribution $P(y_i \mid X)$.

    However, there may be dependencies among the $y_i$ themselves, which is difficult to capture by independent element classification.


  • Linear-Chain Conditional Random Fields (CRF)

    Linear-chain Conditional Random Fields, usually abbreviated only to CRF, act as an output layer. They can be considered an extension of a softmax – instead of a sequence of independent softmaxes, CRF is a sentence-level softmax, with additional weights for neighboring sequence elements.

    $$s(X, y; \theta, A) = \sum_{i=1}^{N} \big(A_{y_{i-1}, y_i} + f_\theta(y_i \mid X)\big)$$

    $$p(y \mid X) = \operatorname{softmax}_{z \in Y^N}\big(s(X, z)\big)_y$$

    $$\log p(y \mid X) = s(X, y) - \operatorname{logadd}_{z \in Y^N}\big(s(X, z)\big)$$


  • Linear-Chain Conditional Random Fields (CRF)

    Computation

    We can compute $p(y \mid X)$ efficiently using dynamic programming. We denote by $\alpha_t(k)$ the probability of all sentences with $t$ elements, the last being $k$.

    The core idea is the following:

    $$\alpha_t(k) = f_\theta(y_t = k \mid X) + \operatorname{logadd}_{j \in Y}\big(\alpha_{t-1}(j) + A_{j,k}\big).$$

    For efficient implementation, we use the fact that

    $$\ln(a + b) = \ln a + \ln(1 + e^{\ln b - \ln a}).$$
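    A NumPy/SciPy sketch of this dynamic program in log space, using logsumexp in place of the $\ln(a+b)$ identity; the emission scores $f_\theta$ and the transition matrix $A$ are random placeholders:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_likelihood(emissions, A, tags):
    """log p(y|X) of a linear-chain CRF.

    emissions: [N, K] scores f_theta(y_t = k | X); A: [K, K] transition weights."""
    N, K = emissions.shape
    # Score s(X, y) of the gold tag sequence.
    score = emissions[0, tags[0]] + sum(
        A[tags[t - 1], tags[t]] + emissions[t, tags[t]] for t in range(1, N))
    # Forward pass: alpha_t(k) = f(y_t = k | X) + logadd_j(alpha_{t-1}(j) + A[j, k]).
    alpha = emissions[0]
    for t in range(1, N):
        alpha = emissions[t] + logsumexp(alpha[:, None] + A, axis=0)
    return score - logsumexp(alpha)          # s(X, y) - logadd_z s(X, z)

rng = np.random.default_rng(0)
emissions, A = rng.normal(size=(6, 3)), rng.normal(size=(3, 3))
print(crf_log_likelihood(emissions, A, tags=[0, 1, 1, 2, 0, 1]))
```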


  • Conditional Random Fields (CRF)

    Decoding

    We can perform optimal decoding by using the same algorithm, only replacing $\operatorname{logadd}$ with $\max$ and tracking where the maximum was attained.

    Applications

    CRF output layers are useful for span labeling tasks, like

    named entity recognition

    dialog slot filling
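    The corresponding decoding sketch: the same recursion with logadd replaced by max, plus backpointers recording where the maximum was attained (random placeholder scores again):

```python
import numpy as np

def crf_viterbi(emissions, A):
    """Most probable tag sequence of a linear-chain CRF."""
    N, K = emissions.shape
    alpha, backpointers = emissions[0], []
    for t in range(1, N):
        scores = alpha[:, None] + A                  # [previous tag, current tag]
        backpointers.append(scores.argmax(axis=0))   # where the maximum was attained
        alpha = emissions[t] + scores.max(axis=0)
    best = [int(alpha.argmax())]
    for bp in reversed(backpointers):                # follow the backpointers
        best.append(int(bp[best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
emissions, A = rng.normal(size=(6, 3)), rng.normal(size=(3, 3))
print(crf_viterbi(emissions, A))
```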


  • Connectionist Temporal Classification

    Let us again consider generating a sequence $y_1, \ldots, y_M$ given input $x_1, \ldots, x_N$, but this time $M \leq N$ and there is no explicit alignment of $x$ and $y$ in the gold data.

    Figure 7.1 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.


  • Connectionist Temporal Classification

    We enlarge the set of output labels by a – (blank) and perform a classification for every input element to produce an extended labeling. We then post-process it by the following rules (denoted $\mathcal{B}$):

    1. We remove neighboring duplicate symbols.
    2. We remove the –.

    Because the explicit alignment of inputs and labels is not known, we consider all possible alignments.

    Denoting the probability of label $l$ at time $t$ as $p_l^t$, we define

    $$\alpha_t(s) \stackrel{\text{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal{B}(\pi_{1:t}) = y_{1:s}} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}.$$
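    A sketch of the post-processing $\mathcal{B}$, using "-" as the blank symbol:

```python
def ctc_collapse(extended, blank="-"):
    """The post-processing B: remove neighboring duplicates, then remove blanks."""
    deduplicated = [c for i, c in enumerate(extended)
                    if i == 0 or c != extended[i - 1]]
    return [c for c in deduplicated if c != blank]

assert ctc_collapse(list("aa-ab-b")) == list("aabb")
```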


  • CRF and CTC Comparison

    In CRF, we normalize the whole sentences, therefore we need to compute unnormalized probabilities for all the (exponentially many) sentences. Decoding can be performed optimally.

    In CTC, we normalize per each label. However, because we do not have an explicit alignment, we compute the probability of a labeling by summing the probabilities of the (generally exponentially many) extended labelings.


  • Connectionist Temporal Classification

    Computation

    When aligning an extended labeling to a regular one, we need to consider whether the extended labeling ends with a blank or not. We therefore define

    $$\alpha_t^-(s) \stackrel{\text{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal{B}(\pi_{1:t}) = y_{1:s},\, \pi_t = -} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$

    $$\alpha_t^*(s) \stackrel{\text{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal{B}(\pi_{1:t}) = y_{1:s},\, \pi_t \neq -} \; \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$

    and compute $\alpha_t(s)$ as $\alpha_t^-(s) + \alpha_t^*(s)$.


  • Connectionist Temporal Classification

    Figure 7.3 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

    Computation

    We initialize the $\alpha$s as follows:

    $$\alpha_1^-(0) \leftarrow p_-^1$$
    $$\alpha_1^*(1) \leftarrow p_{y_1}^1$$

    We then proceed recurrently according to:

    $$\alpha_t^-(s) \leftarrow p_-^t \big(\alpha_{t-1}^-(s) + \alpha_{t-1}^*(s)\big)$$

    $$\alpha_t^*(s) \leftarrow \begin{cases} p_{y_s}^t \big(\alpha_{t-1}^*(s) + \alpha_{t-1}^*(s-1) + \alpha_{t-1}^-(s-1)\big) & \text{if } y_s \neq y_{s-1}, \\ p_{y_s}^t \big(\alpha_{t-1}^*(s) + \alpha_{t-1}^-(s-1)\big) & \text{if } y_s = y_{s-1}. \end{cases}$$
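    A NumPy sketch of this recursion in probability space (a practical implementation would work in log space); probs is a placeholder [T, L] matrix of per-frame label probabilities with the blank at index 0:

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """Total probability of `labels` under CTC via the alpha recursion."""
    T, S = probs.shape[0], len(labels)
    alpha_blank = np.zeros((T, S + 1))     # alpha^-_t(s): extended labeling ends with blank
    alpha_label = np.zeros((T, S + 1))     # alpha^*_t(s): extended labeling ends with y_s
    alpha_blank[0, 0] = probs[0, blank]
    alpha_label[0, 1] = probs[0, labels[0]]
    for t in range(1, T):
        for s in range(S + 1):
            alpha_blank[t, s] = probs[t, blank] * (alpha_blank[t - 1, s] +
                                                   alpha_label[t - 1, s])
            if s == 0:
                continue
            total = alpha_label[t - 1, s] + alpha_blank[t - 1, s - 1]
            if s == 1 or labels[s - 1] != labels[s - 2]:
                total += alpha_label[t - 1, s - 1]   # only allowed when y_s != y_{s-1}
            alpha_label[t, s] = probs[t, labels[s - 1]] * total
    return alpha_blank[T - 1, S] + alpha_label[T - 1, S]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=6)            # 6 frames, 4 labels, label 0 = blank
print(ctc_forward(probs, labels=[1, 1, 2]))
```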


  • CTC Decoding

    Unlike CRF, we cannot perform the decoding optimally. The key observation is that while an optimal extended labeling can be extended into an optimal labeling of a larger length, the same does not apply to regular (non-extended) labelings. The problem is that a regular labeling corresponds to many extended labelings, which are each modified in a different way during an extension of the regular labeling.

    Figure 7.5 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.


  • CTC Decoding

    Beam Search

    To perform beam search, we keep the $k$ best regular labelings for each prefix of the extended labelings. For each regular labeling we keep both $\alpha^-$ and $\alpha^*$, and by best we mean such regular labelings with maximum $\alpha^- + \alpha^*$.

    To compute the best regular labelings for a longer prefix of extended labelings, for each regular labeling in the beam we consider the following cases:

    adding a blank symbol, i.e., updating both $\alpha^-$ and $\alpha^*$;

    adding any non-blank symbol, i.e., updating $\alpha^*$.

    Finally, we merge the resulting candidates according to their regular labeling and keep only the $k$ best.
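    A simplified sketch of such a beam search over per-frame label probabilities, kept in probability space and without any language model; the probs matrix and the blank index are placeholders:

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, beam_width=8, blank=0):
    """probs: [T, L] per-frame label distributions; returns the best regular labeling."""
    # Each beam entry maps a regular labeling (tuple of labels) to [alpha^-, alpha^*]:
    # the probabilities of extended labelings collapsing to it and ending
    # with a blank / with a non-blank.
    beam = {(): [1.0, 0.0]}
    for t in range(probs.shape[0]):
        new_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            # Adding a blank keeps the regular labeling unchanged.
            new_beam[prefix][0] += (p_b + p_nb) * probs[t, blank]
            # Repeating the last label without a blank in between also keeps it.
            if prefix:
                new_beam[prefix][1] += p_nb * probs[t, prefix[-1]]
            # Adding a non-blank label, possibly extending the regular labeling.
            for c in range(probs.shape[1]):
                if c == blank:
                    continue
                extended = prefix + (c,)
                if prefix and c == prefix[-1]:
                    # Same label as the last one: the extension must go through a blank.
                    new_beam[extended][1] += p_b * probs[t, c]
                else:
                    new_beam[extended][1] += (p_b + p_nb) * probs[t, c]
        # Candidates with equal regular labelings were merged in new_beam; keep the best.
        beam = dict(sorted(new_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1], reverse=True)[:beam_width])
    return list(max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])[0])

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=6)
print(ctc_beam_search(probs))
```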


  • Unsupervised Word Embeddings

    The embeddings can be trained for each task separately.

    However, a method of precomputing word embeddings has been proposed, based on the distributional hypothesis:

    Words that are used in the same contexts tend to have similar meanings.

    The distributional hypothesis is usually attributed to Firth (1957).


  • Word2Vec

    Diagram: CBOW (Continuous Bag Of Words) sums the projections of the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ to predict $w_t$; Skip-gram uses $w_t$ to predict each of the context words.

    Mikolov et al. (2013) proposed two very simple architectures for precomputing word embeddings, together with a C multi-threaded implementation word2vec.


  • Word2Vec

    Table 8 of paper "Efficient Estimation of Word Representations in Vector Space", https://arxiv.org/abs/1301.3781.


  • Word2Vec – SkipGram Model

    Diagram: the CBOW and Skip-gram architectures, repeated from the previous slide.

    Considering input word $w_i$ and output $w_o$, the Skip-gram model defines

    $$p(w_o \mid w_i) \stackrel{\text{def}}{=} \frac{e^{W_{w_o}^\top V_{w_i}}}{\sum_{w} e^{W_w^\top V_{w_i}}}.$$
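    A NumPy sketch of this softmax with separate input and output embedding matrices V and W (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64
V = rng.normal(scale=0.1, size=(vocab, dim))   # input ("projection") embeddings
W = rng.normal(scale=0.1, size=(vocab, dim))   # output embeddings

def skipgram_prob(o, i):
    """p(w_o | w_i): a softmax over the whole vocabulary of W_w^T V_{w_i}."""
    scores = W @ V[i]                          # one score per vocabulary word
    scores -= scores.max()                     # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(skipgram_prob(o=42, i=7))
```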


  • Word2Vec – Hierarchical Softmax

    Instead of a large softmax, we construct a binary tree over the words, with a sigmoid classifier for each node.

    If the word $w$ corresponds to a path $n_1, n_2, \ldots, n_L$, we define

    $$p_{\text{HS}}(w \mid w_i) \stackrel{\text{def}}{=} \prod_{j=1}^{L-1} \sigma\big([\text{+1 if } n_{j+1} \text{ is right child else -1}] \cdot W_{n_j}^\top V_{w_i}\big).$$
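    A sketch of evaluating $p_{\text{HS}}$ given the path of a word in the tree; representing the path as (node id, is-right-child) pairs is an assumption made only for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(path, i, W_nodes, V):
    """path: (node id, is right child) pairs from the root to the word w."""
    p = 1.0
    for node, is_right in path:
        sign = 1.0 if is_right else -1.0
        p *= sigmoid(sign * W_nodes[node] @ V[i])   # one sigmoid per inner node
    return p

rng = np.random.default_rng(0)
W_nodes = rng.normal(size=(15, 8))                  # one weight vector per inner node
V = rng.normal(size=(16, 8))                        # input word embeddings
print(hierarchical_softmax_prob([(0, True), (2, False), (5, True)], i=3,
                                W_nodes=W_nodes, V=V))
```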


  • Word2Vec – Negative Sampling

    Instead of a large softmax, we could train individual sigmoids for all words.

    We could also only sample the negative examples instead of training all of them.

    This gives rise to the following negative sampling objective:

    $$l_{\text{NEG}}(w_o, w_i) \stackrel{\text{def}}{=} \log \sigma\big(W_{w_o}^\top V_{w_i}\big) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P(w)} \log\big(1 - \sigma(W_{w_j}^\top V_{w_i})\big).$$

    For $P(w)$, both the uniform and the unigram distribution $U(w)$ work, but $U(w)^{3/4}$ outperforms them significantly (this fact has been reported in several papers by different authors).
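    A NumPy sketch of the negative sampling objective with the $U(w)^{3/4}$ noise distribution; the counts, sizes, and embeddings are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
vocab, dim, k = 1000, 64, 5
V = rng.normal(scale=0.1, size=(vocab, dim))         # input embeddings
W = rng.normal(scale=0.1, size=(vocab, dim))         # output embeddings

counts = rng.integers(1, 1000, size=vocab)           # placeholder unigram counts
noise = counts ** 0.75
noise = noise / noise.sum()                          # P(w) proportional to U(w)^{3/4}

def l_neg(o, i):
    """Negative sampling objective for one (input, output) word pair."""
    negatives = rng.choice(vocab, size=k, p=noise)   # k sampled negative examples
    loss = np.log(sigmoid(W[o] @ V[i]))
    loss += np.sum(np.log(1.0 - sigmoid(W[negatives] @ V[i])))
    return loss

print(l_neg(o=42, i=7))
```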


  • Recurrent Character-level WEs

    Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.


  • Convolutional Character-level WEs

    Table 6 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.


  • Character N-grams

    Another simple idea appeared simultaneously in three nearly simultaneous publications as Charagram, Subword Information or SubGram.

    A word embedding is a sum of the word embedding plus embeddings of its character n-grams. Such embeddings can be pretrained using the same algorithms as word2vec.

    The implementation can be

    dictionary based: only some number of frequent character n-grams is kept;

    hash-based: character n-grams are hashed into $K$ buckets (usually $K \sim 10^6$ is used); see the sketch after the reference links below.


    https://arxiv.org/abs/1607.02789
    https://arxiv.org/abs/1607.04606
    http://link.springer.com/chapter/10.1007/978-3-319-45510-5_21
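    A sketch of the hash-based variant: the vector of a word is its word embedding plus the bucketed embeddings of its character n-grams. The bucket count, n-gram range, and the use of Python's hash are assumptions for illustration only:

```python
import numpy as np

K, dim = 2 ** 20, 64                     # number of hash buckets (K ~ 10^6) and size
rng = np.random.default_rng(0)
word_vectors = {"where": rng.normal(scale=0.1, size=dim)}
ngram_buckets = rng.normal(scale=0.1, size=(K, dim))

def char_ngrams(word, n_min=3, n_max=6):
    marked = "<" + word + ">"            # mark the word boundaries
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def embed(word):
    vector = word_vectors.get(word, 0.0) # unknown words still get n-gram vectors
    for gram in char_ngrams(word):
        vector = vector + ngram_buckets[hash(gram) % K]
    return vector

print(embed("where")[:4], embed("wherever")[:4])
```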

  • Charagram WEs

    Table 7 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.


  • Charagram WEs

    Figure 2 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.


  • Sequence-to-Sequence Architecture

    Sequence-to-Sequence Architecture


  • Sequence-to-Sequence Architecture

    Figure 1 of paper "Sequence to Sequence Learning with Neural Networks", https://arxiv.org/abs/1409.3215.


  • Sequence-to-Sequence Architecture

    Figure 1 of paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", https://arxiv.org/abs/1406.1078.


  • Sequence-to-Sequence Architecture

    Diagram: the decoder unrolled over time. It starts from BOS, produces a prediction x̂(t) at every step, and terminates with EOS; during training the gold words x(0), ..., x(3) are fed as the next inputs, while during inference the predictions x̂(0), ..., x̂(3) themselves are fed back.

    Training

    The so-called teacher forcing is used during training – the gold outputs are used as inputs during training.

    Inference

    During inference, the network processes its own predictions.

    Usually, the generated logits are processed by an $\arg\max$, the chosen word is embedded and used as the next input.
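    A toy sketch contrasting the two regimes with a single-layer tanh decoder; the cell, the vocabulary, and all parameters are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 8                                   # toy sizes; word 0 = BOS, word 1 = EOS
E = rng.normal(scale=0.1, size=(vocab, dim))         # target word embeddings
U = rng.normal(scale=0.1, size=(dim, dim))
V = rng.normal(scale=0.1, size=(dim, dim))
W = rng.normal(scale=0.1, size=(vocab, dim))

def decoder_step(prev_word, state):
    """A toy recurrent decoder cell returning logits and the new state."""
    state = np.tanh(U @ state + V @ E[prev_word])
    return W @ state, state

def train_step(gold, state):
    """Teacher forcing: the gold outputs are fed back as the next inputs."""
    logits_per_step = []
    for word in [0] + gold[:-1]:                     # BOS, then the gold prefix
        logits, state = decoder_step(word, state)
        logits_per_step.append(logits)               # scored against `gold` by the loss
    return logits_per_step

def greedy_decode(state, max_len=20):
    """Inference: the argmax of the logits is embedded and fed back."""
    word, output = 0, []                             # start from BOS
    for _ in range(max_len):
        logits, state = decoder_step(word, state)
        word = int(np.argmax(logits))
        if word == 1:                                # EOS terminates the sequence
            break
        output.append(word)
    return output

state = rng.normal(size=dim)                         # e.g., the final encoder state
print(len(train_step(gold=[4, 2, 7, 1], state=state)), greedy_decode(state))
```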


  • Tying Word Embeddings

    Diagram: the target word id is embedded by a $V \times D$ matrix into the target word embedding, processed by the RNN, and projected to the target word logits by a $D \times V$ output layer.

    Tying the word embeddings means sharing a single matrix between the embedding lookup and the output layer, one being used as the transpose of the other.
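    A minimal sketch of the tying itself: a single $V \times D$ matrix serves both as the embedding lookup and, transposed, as the output projection producing the logits (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64
E = rng.normal(scale=0.1, size=(vocab, dim))    # the single shared V x D matrix

target_ids = np.array([3, 17, 42])
embedded = E[target_ids]                        # input side: embedding lookup
h = rng.normal(size=(3, dim))                   # decoder RNN outputs (placeholder)
logits = h @ E.T                                # output side: the same matrix, transposed
print(embedded.shape, logits.shape)             # (3, 64) and (3, 1000)
```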


  • Attention

    Figure 1 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.

    As another input during decoding, we add the context vector $c_i$:

    $$s_i = f(s_{i-1}, y_{i-1}, c_i).$$

    We compute the context vector as a weighted combination of the source sentence encoded outputs:

    $$c_i = \sum_j \alpha_{ij} h_j.$$

    The weights $\alpha_{ij}$ are the softmax of $e_{ij}$ over $j$,

    $$\alpha_i = \operatorname{softmax}(e_i),$$

    with $e_{ij}$ being

    $$e_{ij} = v^\top \tanh(V h_j + W s_{i-1} + b).$$
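    A NumPy sketch of computing the attention weights and the context vector for one decoder step; the encoder outputs $h_j$, the decoder state $s_{i-1}$, and the attention parameters are random placeholders:

```python
import numpy as np

def attention_context(H, s_prev, V, W, v, b):
    """H: [N, dim] encoder outputs; returns the context vector and the weights."""
    e = np.tanh(H @ V.T + s_prev @ W.T + b) @ v   # e_ij = v^T tanh(V h_j + W s_{i-1} + b)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                          # softmax over the source positions
    return alpha @ H, alpha                       # c_i = sum_j alpha_ij h_j

rng = np.random.default_rng(0)
N, dim, att = 7, 16, 8                            # source length, hidden size, attention size
H, s_prev = rng.normal(size=(N, dim)), rng.normal(size=dim)
V, W = rng.normal(size=(att, dim)), rng.normal(size=(att, dim))
v, b = rng.normal(size=att), np.zeros(att)
context, alpha = attention_context(H, s_prev, V, W, v, b)
print(alpha.round(2), context.shape)
```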

  • Attention

    Figure 3 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.


  • Subword Units

    Translate subword units instead of words. The subword units can be generated in several ways, the most commonly used being:

    BPE – Using the byte pair encoding algorithm. Start with characters plus a special end-of-word symbol ⋅. Then, merge the most occurring symbol pair $A, B$ into a new symbol $AB$, with the symbol pair never crossing a word boundary.

    Considering a dictionary with the words low, lowest, newer, wider, the first merges are:

    r ⋅ → r⋅
    l o → lo
    lo w → low
    e r⋅ → er⋅

    Wordpieces – Joining neighboring symbols to maximize unigram language model likelihood.

    Usually a quite small number of subword units is used (32k–64k), often generated on the union of the two vocabularies (the so-called joint BPE or shared wordpieces).
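    A sketch of learning BPE merges on such a dictionary; the word counts are made up for the illustration (the real algorithm weights pairs by word frequency in the same way):

```python
from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab maps space-separated symbol sequences (one per word) to counts."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):      # pairs never cross word boundary
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                # the most occurring symbol pair
        merges.append(best)
        new_vocab = {}
        for word, count in vocab.items():               # replace A B by the new symbol AB
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[" ".join(out)] = count
        vocab = new_vocab
    return merges

# The dictionary from the slide with made-up counts; "·" marks the end of a word.
vocab = {"l o w ·": 5, "l o w e s t ·": 2, "n e w e r ·": 6, "w i d e r ·": 3}
print(learn_bpe(vocab, num_merges=4))
```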


  • Google NMT

    Figure 1 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.


  • Google NMT

    Figure 5 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.


  • Google NMT

    Figure 6 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.


  • Beyond one Language Pair

    Figure 5 of "Show and Tell: Lessons learned from the 2015 MSCOCO...", https://arxiv.org/abs/1609.06647.


  • Beyond one Language Pair

    Figure 6 of "Multimodal Compact Bilinear Pooling for VQA and Visual Grounding", https://arxiv.org/abs/1606.01847.


  • Multilingual Translation

    Many attempts at multilingual translation.

    Individual encoders and decoders, shared attention.

    Shared encoders and decoders.
