Recurrent Neural Network (lecture slides: stat.snu.ac.kr/mcp/Lecture_8_RNN.pdf)


Page 1: Recurrent Neural Network

Seoul National University, Deep Learning, September-December 2019

Page 2: Recurrent Neural Network

Recurrent Neural Network (RNN)

Sequence modeling: Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data.

p_t = softmax(c + V h_t)

h_t = tanh(b + w h_{t−1} + U x_t)

The hidden unit at time t is a function of the hidden unit at time t − 1 and the input at time t.

The unknown parameters are shared across time: they do not depend on t.

Loss function: L({x_1, …, x_t}, {y_1, …, y_t})

Page 3: Recurrent Neural Network

Building units of RNN

Input: x_t

Hidden unit: h_t = tanh(b + w h_{t−1} + U x_t)

Output unit: o_t = c + V h_t

Predicted probability: p_t = softmax(o_t)

Page 4: Recurrent Neural Network

Building units of RNN

Input: x_t

Hidden unit: h_t = tanh(b + w h_{t−1} + U x_t)

Output unit: o_t = c + V h_t

Predicted probability: p_t = softmax(o_t)

Unknown parameters: (w, U, b, c, V)
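As a concrete illustration of these building units, here is a minimal NumPy sketch of the forward pass; the array shapes, the initialization, and the helper name rnn_forward are assumptions made for the example, not part of the lecture.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over the last axis."""
    e = np.exp(o - o.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(xs, h0, w, U, V, b, c):
    """Run the simple RNN over a sequence.

    xs: (T, d) inputs x_1..x_T;  h0: (m,) initial hidden state
    w: (m, m), U: (m, d), V: (K, m), b: (m,), c: (K,)
    Returns hidden states h_t, outputs o_t, and probabilities p_t.
    """
    hs, os_, ps = [], [], []
    h = h0
    for x in xs:
        h = np.tanh(b + w @ h + U @ x)   # h_t = tanh(b + w h_{t-1} + U x_t)
        o = c + V @ h                    # o_t = c + V h_t
        hs.append(h); os_.append(o); ps.append(softmax(o))
    return np.array(hs), np.array(os_), np.array(ps)

# Tiny usage example with random parameters (d=4 inputs, m=3 hidden units, K=4 classes).
rng = np.random.default_rng(0)
T, d, m, K = 5, 4, 3, 4
params = dict(w=rng.normal(size=(m, m)) * 0.1, U=rng.normal(size=(m, d)) * 0.1,
              V=rng.normal(size=(K, m)) * 0.1, b=np.zeros(m), c=np.zeros(K))
hs, os_, ps = rnn_forward(rng.normal(size=(T, d)), np.zeros(m), **params)
print(ps.shape)  # (5, 4): one probability vector per time step
```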

Page 5: Recurrent Neural Network

Applications of RNN

source: Andrej Karpathy blog

one to one: the typical non-sequential network (fixed-size input to fixed-size output)

one to many: image captioning (image to sequence of words)

many to one: sentiment analysis (sequence of words to sentiment)

many to many: with lag, machine translation (sequence of words to sequence of words); without lag, video classification at the frame level

Page 6: Recurrent Neural Network

Applications of RNN: One-to-many

source: Stanford CS231 lecture note

Page 7: Recurrent Neural Network

Applications of RNN: Many-to-one

source: Stanford CS231 lecture note

Page 8: Recurrent Neural Network

Applications of RNN: Sequence-to-sequence

source: Stanford CS231 lecture note

Page 9: Recurrent Neural Network

Applications of RNN: Sequence-to-sequence (Google Translate)

source: Stanford CS231 lecture note

Page 10: Recurrent Neural Network

Applications of RNN: Character-level language models

source: Andrej Karpathy blog

x_1 = (1, 0, 0, 0), y_1 = (0, 1, 0, 0), h_1 = (0.3, −0.1, 0.9), o_1 = (1.0, 2.2, −3.0, 4.2).

Page 11: Recurrent Neural Network

Applications of RNN: Character-level language models: test time

source: Andrej Karpathy blog

x_1 = (1, 0, 0, 0), y_1 = (0, 1, 0, 0), h_1 = (0.3, −0.1, 0.9), o_1 = (1.0, 2.2, −3.0, 4.2).
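At test time, the character-level model is run autoregressively: feed one character, take the softmax over o_t, sample the next character, and feed it back in. A minimal sketch under the same 4-character setup (the weights here are random and untrained; names are illustrative):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']                    # 4-character vocabulary from the example
rng = np.random.default_rng(1)
m, K = 3, len(vocab)                            # hidden size 3, as in h_1 = (0.3, -0.1, 0.9)
w, U, V = rng.normal(size=(m, m)) * 0.1, rng.normal(size=(m, K)) * 0.1, rng.normal(size=(K, m)) * 0.1
b, c = np.zeros(m), np.zeros(K)

def one_hot(i, K):
    v = np.zeros(K); v[i] = 1.0
    return v

def sample_text(first_char, length):
    """Autoregressively sample `length` characters starting from `first_char`."""
    h = np.zeros(m)
    idx = vocab.index(first_char)
    out = [first_char]
    for _ in range(length):
        x = one_hot(idx, K)
        h = np.tanh(b + w @ h + U @ x)          # hidden update h_t
        o = c + V @ h                           # unnormalized scores o_t
        p = np.exp(o - o.max()); p /= p.sum()   # softmax probabilities p_t
        idx = rng.choice(K, p=p)                # sample the next character
        out.append(vocab[idx])
    return ''.join(out)

print(sample_text('h', 10))   # prints an 11-character string sampled from the untrained model
```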

Page 12: Recurrent Neural Network

Applications of RNN: Image Captioning

source: Andrej Karpathy blog

Dataset: Microsoft COCO (Tsung-Yi Lin et al., 2014), mscoco.org: 120K images, 5 sentences per image.

Image captioning uses a word-based model in which the input data are vectors in R^d representing each word.

Page 13: Recurrent Neural Network

Image captioning

Figure by A. Karpathy

Page 14: Recurrent Neural Network

Back propagation in RNN

a_t = b + w h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
p_t = softmax(o_t)

Loss function: L = −∑_{t=1}^T y_t^T log(p_t) = ∑_{t=1}^T L_t

∂L_t/∂p_t = −y_t / (y_t^T p_t)

∂L_t/∂o_t = (∂L_t/∂p_t)(∂p_t/∂o_t) = −(diag(p_t) − p_t p_t^T) y_t / (y_t^T p_t)

∂L/∂V = ∑_{t=1}^T (∂L_t/∂o_t)(∂o_t/∂V) = ∑_{t=1}^T (∂L_t/∂o_t) h_t
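For a one-hot y_t, the chain ∂L_t/∂o_t above simplifies to p_t − y_t, which is what implementations usually code directly. A small finite-difference check of that identity (a sketch; the helper names are mine):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    """L_t = -y^T log softmax(o), with y one-hot."""
    return -float(y @ np.log(softmax(o)))

rng = np.random.default_rng(2)
o = rng.normal(size=4)                 # scores o_t
y = np.eye(4)[1]                       # one-hot target y_t

analytic = softmax(o) - y              # claimed gradient p_t - y_t

# Finite-difference gradient of L_t with respect to o_t.
eps = 1e-6
numeric = np.array([(loss(o + eps * np.eye(4)[i], y) -
                     loss(o - eps * np.eye(4)[i], y)) / (2 * eps) for i in range(4)])

print(np.max(np.abs(analytic - numeric)))   # tiny value: analytic and numeric gradients agree
```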

Page 15: Recurrent Neural Network

Back propagation in RNN

a_t = b + w h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
p_t = softmax(o_t)

∂L/∂w = ∑_{t=1}^T ∂L_t/∂w

Contribution from each time point is cumulative in time:

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

Page 16: Recurrent Neural Network

Back propagation in RNN

The loss function is a sum over the entire sequence.

As with CNNs, SGD with minibatches is used. That is, the loss function, forward pass, and backward pass are calculated for the sequences in the minibatch.

Page 17: Recurrent Neural Network

Back propagation in RNN

a_t = b + w h_{t−1} + U x_t;  h_t = tanh(a_t);  o_t = c + V h_t;  p_t = softmax(o_t)

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

∂L_t/∂o_t = (∂L_t/∂p_t)(∂p_t/∂o_t) = −(diag(p_t) − p_t p_t^T) y_t / (y_t^T p_t)

∂o_t/∂h_t = V

∂h_k/∂w = (∂h_k/∂a_k)(∂a_k/∂w) = diag(1 − tanh²(a_k)) h_{k−1}
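Put together, these pieces give backpropagation through time for ∂L/∂w. A minimal NumPy sketch, written in the equivalent recursive form rather than the explicit sum over k (the names and sizes are illustrative), with a finite-difference check:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, h0, w, U, V, b, c):
    """Forward pass; returns the total loss and the cached states needed for BPTT."""
    hs, ps, L = [h0], [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(b + w @ hs[-1] + U @ x)
        p = softmax(c + V @ h)
        hs.append(h); ps.append(p)
        L -= float(y @ np.log(p))
    return L, hs, ps

def grad_w(xs, ys, hs, ps, w, V):
    """dL/dw by backpropagation through time (recursive form of the sum over k)."""
    dw = np.zeros_like(w)
    da_next = np.zeros(w.shape[0])           # D_{t+1} * dL/dh_{t+1}, initially zero
    for t in reversed(range(len(xs))):
        do = ps[t] - ys[t]                   # dL_t/do_t = p_t - y_t (one-hot y_t)
        dh = V.T @ do + w.T @ da_next        # dL/dh_t: from o_t and from h_{t+1}
        da = (1.0 - hs[t + 1] ** 2) * dh     # multiply by D_t = diag(1 - tanh^2(a_t))
        dw += np.outer(da, hs[t])            # contribution of step t (hs[t] is h_{t-1})
        da_next = da
    return dw

# Finite-difference check on a tiny random problem (sizes are arbitrary).
rng = np.random.default_rng(3)
T, d, m, K = 6, 3, 4, 3
xs = rng.normal(size=(T, d)); ys = np.eye(K)[rng.integers(K, size=T)]
w, U, V = rng.normal(size=(m, m)) * 0.5, rng.normal(size=(m, d)), rng.normal(size=(K, m))
b, c, h0 = np.zeros(m), np.zeros(K), np.zeros(m)

L, hs, ps = forward(xs, ys, h0, w, U, V, b, c)
analytic = grad_w(xs, ys, hs, ps, w, V)

eps, i, j = 1e-5, 1, 2
wp, wm = w.copy(), w.copy(); wp[i, j] += eps; wm[i, j] -= eps
numeric = (forward(xs, ys, h0, wp, U, V, b, c)[0] -
           forward(xs, ys, h0, wm, U, V, b, c)[0]) / (2 * eps)
print(analytic[i, j], numeric)               # the two values should agree closely
```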

Page 18: Recurrent Neural Network

Back propagation in RNN

a_t = b + w h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
p_t = softmax(o_t)

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

Problematic part:

∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j

∂h_t/∂h_{t−1} = (∂h_t/∂a_t)(∂a_t/∂h_{t−1})

Page 19: Recurrent Neural Network

Back propagation in RNN

a_t = b + w h_{t−1} + U x_t;  h_t = tanh(a_t);  o_t = c + V h_t;  p_t = softmax(o_t)

Denoting diag(1 − tanh²(a_t)) by D_t:

∂h_t/∂a_t = diag(1 − tanh²(a_t)) = D_t

∂h_t/∂h_{t−1} = (∂h_t/∂a_t)(∂a_t/∂h_{t−1}) = D_t w

∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j = ∏_{j=k}^{t−1} (D_{j+1} w)
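The behaviour of the product ∏_{j=k}^{t−1}(D_{j+1} w) can be checked numerically. In the sketch below the pre-activations a_j are simply drawn at random (an assumption made only for illustration); the spectral norm of the product then shrinks or grows roughly geometrically in the number of factors, depending on the scale of w.

```python
import numpy as np

def jacobian_product_norm(w, steps, rng):
    """Spectral norm of prod_j (D_{j+1} w) for random tanh pre-activations a_j."""
    m = w.shape[0]
    P = np.eye(m)
    for _ in range(steps):
        a = rng.normal(size=m)                  # stand-in pre-activation a_{j+1}
        D = np.diag(1.0 - np.tanh(a) ** 2)      # D_{j+1}
        P = D @ w @ P
    return np.linalg.norm(P, 2)

rng = np.random.default_rng(4)
m = 10
base = rng.normal(size=(m, m)) / np.sqrt(m)     # roughly unit spectral radius
for scale in (0.5, 1.0, 2.0):
    w = scale * base
    norms = [jacobian_product_norm(w, k, rng) for k in (5, 20, 50)]
    print(scale, [f"{n:.2e}" for n in norms])
# Typically: the norm shrinks geometrically for small w (vanishing gradients)
# and can grow geometrically for large w (exploding gradients).
```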

Page 20: Recurrent Neural Network

Back propagation in RNN

The contribution of each time point is cumulative up to time t, but ∂h_t/∂h_k is very small if k is far from t. That is, long-range dependence cannot be incorporated when updating the weights.

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂w)

       = ∑_{k=0}^t (∂L_t/∂o_t)(∂o_t/∂h_t) [∏_{j=k}^{t−1} (D_{j+1} w)] (∂h_k/∂w)

Page 21: Recurrent Neural Network

Back propagation in RNN and long-range dependence

The contribution of each time point is cumulative up to time t, but ∂h_t/∂h_k is very small if k is far from t. That is, long-range dependence cannot be incorporated when updating the weights.

Sentiment analysis example: 'The game became interesting as the players warmed up although it was boring for the first half.' When the early part is not incorporated, the sentence may be classified as negative.

Page 22: Recurrent Neural Network

Two sources of exploding or vanishing gradients in RNN

Two sources of problems: Nonlinearity and explosion

∂h_t/∂h_k = ∏_{j=k}^{t−1} ∂h_{j+1}/∂h_j = ∏_{j=k}^{t−1} (D_{j+1} w)

Nonlinearity: the factor ∏_{j=k}^{t−1} D_{j+1} appears because of the tanh(·) relationship.

Explosion: w^{t−k} could vanish if |w| < 1 or explode if |w| > 1.

Attempts have been made to replace tanh(·) with ReLU, or to use gradient clipping, to alleviate explosion.

Long Short-term Memory (LSTM) and Gated Recurrent Unit (GRU) are designed to solve the nonlinearity and explosion problems.
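Gradient clipping, mentioned above, simply rescales the gradient whenever its norm exceeds a threshold. A minimal sketch (the function name and threshold are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / (total + 1e-12)) for g in grads]
    return grads, total

# A modest gradient is left alone; an exploded one is rescaled onto the threshold.
g_small = [np.full((2, 2), 0.1)]
g_large = [np.full((2, 2), 100.0)]
for g in (g_small, g_large):
    clipped, before = clip_by_global_norm(g, max_norm=5.0)
    after = np.sqrt(sum(float(np.sum(x ** 2)) for x in clipped))
    print(f"norm before: {before:.1f}, after clipping: {after:.1f}")
```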

Page 23: Recurrent Neural Network

Long Short-term Memory (LSTM)

Hochreiter and Schmidhuber, 1997

LSTM allows long-term dependence in time.

LSTM is designed to avoid the problem due to the nonlinear relationship and the problem of gradient explosion.

Successful in unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015), and parsing (Vinyals et al., 2014a).

Page 24: Recurrent Neural Network

Long Short-term Memory (LSTM)

input gate: i_t = σ(b + U x_t + w h_{t−1})

forget gate: f_t = σ(b^f + U^f x_t + w^f h_{t−1})

new memory cell: c̃_t = tanh(b^g + U^g x_t + w^g h_{t−1})

updated memory: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

output gate: o_t = σ(b^o + U^o x_t + w^o h_{t−1})

h_t = tanh(c_t) ⊙ o_t

p_t = softmax(c + V h_t)

Existing and new key elements
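As a concrete reading of the equations above, here is a minimal NumPy sketch of one LSTM step; the parameter packaging and names are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step; the superscripts f, g, o of the slides become key suffixes here."""
    i = sigmoid(P['b']  + P['U']  @ x + P['w']  @ h_prev)   # input gate i_t
    f = sigmoid(P['bf'] + P['Uf'] @ x + P['wf'] @ h_prev)   # forget gate f_t
    g = np.tanh(P['bg'] + P['Ug'] @ x + P['wg'] @ h_prev)   # new memory cell ~c_t
    o = sigmoid(P['bo'] + P['Uo'] @ x + P['wo'] @ h_prev)   # output gate o_t
    c = f * c_prev + i * g                                  # updated memory c_t
    h = np.tanh(c) * o                                      # hidden state h_t
    return h, c

# Tiny usage example with random parameters (input dim 4, hidden dim 3).
rng = np.random.default_rng(5)
d, m = 4, 3
P = {}
for s in ('', 'f', 'g', 'o'):
    P['b' + s] = np.zeros(m)
    P['U' + s] = rng.normal(size=(m, d)) * 0.1
    P['w' + s] = rng.normal(size=(m, m)) * 0.1
h, c = np.zeros(m), np.zeros(m)
for x in rng.normal(size=(6, d)):                           # run 6 time steps
    h, c = lstm_step(x, h, c, P)
print(h, c)
```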

Page 25: Recurrent Neural Network

Long Short-term Memory (LSTM)

LSTM introduces a memory cell state c_t.

There is no direct recurrence from h_{t−1} to h_t of the kind that caused the vanishing/exploding gradient problem and the lack of long-term dependence in the simple RNN.

The update of c_t controls the time dependence and the information flow.

Page 26: Recurrent Neural Network

RNN vs. LSTM

Figure: Simple RNN

Figure: LSTM

Page 27: Recurrent Neural Network

Long Short-term Memory (LSTM)

Consider the parameter groups (c, V), (b^o, U^o, w^o), and (b^g, U^g, w^g, b^f, U^f, w^f, b, U, w).

For (c, V):  ∂L/∂V = ∑_{t=1}^T (∂L_t/∂p_t)(∂p_t/∂V)

For (b^o, U^o, w^o):  ∂L/∂w^o = ∑_{t=1}^T (∂L_t/∂o_t)(∂o_t/∂w^o)

For (b^g, U^g, w^g, b^f, U^f, w^f, b, U, w), the derivatives ∂L/∂w, ∂L/∂w^g, ∂L/∂w^f involve ∂L/∂c_t, since the time dependence is mediated by c_t.

For w:

∂L/∂w = ∑_{t=1}^T ∂L_t/∂w

Contribution from time t:

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂c_t)(∂c_t/∂c_k)(∂c_k/∂w)

Page 28: Recurrent Neural Network

Long Short-term Memory (LSTM)

Contribution from time t:

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂c_t)(∂c_t/∂c_k)(∂c_k/∂w)

Problematic part:

∂c_t/∂c_k = ∏_{j=k}^{t−1} ∂c_{j+1}/∂c_j

But ∂c_t/∂c_{t−1} = f_t, so

∂L_t/∂w = ∑_{k=0}^t (∂L_t/∂c_t) (∏_{j=k}^{t−1} f_{j+1}) (∂c_k/∂w)

The additive relationship and the forget gate help alleviate the vanishing or exploding gradient problem.
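The contrast between the two gradient paths can be illustrated numerically: the simple RNN multiplies tanh-Jacobian factors that are typically well below 1, whereas the LSTM cell path multiplies forget-gate values that can stay close to 1. The per-step factors below are assumed values chosen only for illustration.

```python
import numpy as np

# Compare the gradient-carrying products of the simple RNN and the LSTM cell path.
rng = np.random.default_rng(6)
steps = 50

rnn_factors = rng.uniform(0.1, 0.7, size=steps)     # typical |D_{j+1} w| contraction factors
lstm_forget = rng.uniform(0.9, 1.0, size=steps)     # forget gates f_j close to 1

print("simple RNN  prod(D w) ~", np.prod(rnn_factors))   # collapses toward 0 over 50 steps
print("LSTM        prod(f_j) ~", np.prod(lstm_forget))   # stays at a usable magnitude
```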

Page 29: Recurrent Neural Network

Multi-layered RNN and LSTM

• In an RNN or LSTM, the output h_t becomes an input at time t + 1.
• In a multi-layered RNN or LSTM, the hidden unit at the (l+1)-th layer at time t + 1, h_{t+1}^{(l+1)}, is a function of h_t^{(l+1)} and h_{t+1}^{(l)}. For example, the LSTM input gates for the 1st and m-th layers are

i_t^{(1)} = σ(b + U x_t + w h_{t−1}^{(1)})

i_t^{(m)} = σ(b + U h_t^{(m−1)} + w h_{t−1}^{(m)})
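A minimal sketch of the layer stacking, using the simple RNN cell instead of the LSTM for brevity (the parameter packaging is an assumption of the example): layer 1 reads x_t, and each higher layer reads the new hidden state of the layer below together with its own previous hidden state.

```python
import numpy as np

def stacked_rnn_step(x, h_prev, Ws, Us, bs):
    """One time step of an L-layer simple RNN.

    h_prev: list of hidden states h_t^{(1..L)} from the previous time step.
    Layer 1 reads the input x_t; layer l+1 reads the new hidden state of layer l.
    """
    h_new, inp = [], x
    for W, U, b, h in zip(Ws, Us, bs, h_prev):
        h = np.tanh(b + W @ h + U @ inp)   # h_{t+1}^{(l)} from h_t^{(l)} and the layer below
        h_new.append(h)
        inp = h                            # becomes the input of the next layer
    return h_new

# Two layers, input dim 4, hidden dim 3 in each layer.
rng = np.random.default_rng(7)
d, m = 4, 3
Ws = [rng.normal(size=(m, m)) * 0.1 for _ in range(2)]
Us = [rng.normal(size=(m, d)) * 0.1, rng.normal(size=(m, m)) * 0.1]
bs = [np.zeros(m), np.zeros(m)]
h = [np.zeros(m), np.zeros(m)]
for x in rng.normal(size=(5, d)):
    h = stacked_rnn_step(x, h, Ws, Us, bs)
print([hi.round(3) for hi in h])
```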

Page 30: Recurrent Neural Network

Gated Recurrent Unit (GRU)

Proposed by Cho et al. (2014). Further studied by Chung et al. (2014, 2015a), Jozefowicz et al. (2015), and Chrupala et al. (2015).

GRU addresses the two sources of problems of the RNN with a simpler structure than LSTM.

LSTM and GRU are the two popular architectures of RNN.

GRUs are reportedly easier to train.

LSTM in principle can carry a longer memory.

Page 31: Recurrent Neural Network

Gated Recurrent Unit (GRU)

update gate: u_t = σ(b^u + U^u x_t + w^u h_{t−1})

reset gate: r_t = σ(b^r + U^r x_t + w^r h_{t−1})

h̃_t = tanh(U^h x_t + r_t ⊙ (w^h h_{t−1}) + b^h)

h_t = u_{t−1} ⊙ h_{t−1} + (1 − u_{t−1}) ⊙ h̃_t

p_t = softmax(c + V h_t)

When u_{t−1} = 0 and r_t = 1, the GRU reduces to the simple RNN.

The GRU handles both the nonlinearity and the explosion problems.
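As with the LSTM, here is a minimal NumPy sketch of one GRU step following the equations above; the parameter packaging is illustrative, and the update gate computed at the current step is used where the slides write u_{t−1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, P):
    """One GRU step (parameter packaging and gate timing are implementation choices)."""
    u = sigmoid(P['bu'] + P['Uu'] @ x + P['wu'] @ h_prev)              # update gate
    r = sigmoid(P['br'] + P['Ur'] @ x + P['wr'] @ h_prev)              # reset gate
    h_tilde = np.tanh(P['bh'] + P['Uh'] @ x + r * (P['wh'] @ h_prev))  # candidate ~h_t
    return u * h_prev + (1.0 - u) * h_tilde                            # new hidden state h_t

# Tiny usage example (input dim 4, hidden dim 3).
rng = np.random.default_rng(8)
d, m = 4, 3
P = {f'b{s}': np.zeros(m) for s in 'urh'}
P.update({f'U{s}': rng.normal(size=(m, d)) * 0.1 for s in 'urh'})
P.update({f'w{s}': rng.normal(size=(m, m)) * 0.1 for s in 'urh'})
h = np.zeros(m)
for x in rng.normal(size=(6, d)):
    h = gru_step(x, h, P)
print(h)
```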

Page 32: Recurrent Neural Network

Summary

RNNs can model sequence data.

The simple version of the RNN has problems of vanishing or exploding gradients.

LSTM and GRU are designed to alleviate these problems.

When to use LSTM or GRU is not well understood.
