Recurrent neural networks (cse.iitk.ac.in)
Feedforward networks have fixed size input, fixed size output, and a fixed number of layers. RNNs can have variable-length input/output.
Figure: RNN schematic. Input: red, Output: blue, State: green. 1-1: standard feedforward, 1-M: image captioning, M-1: sentiment identification, M-M: translation, M-Ms: video frame labelling. (Source: Karpathy blog)
RNN applications
RNNs (and their variants) have been used to model a wide range of tasks in NLP:
- Machine translation. Here we need a sentence-aligned data corpus between the two languages.
- Various sequence labelling tasks, e.g. PoS tagging, NER, etc.
- Language models. A language model predicts the next word in a sequence given the earlier words.
- Text generation.
- Speech recognition.
- Image annotation.
Recurrent neural network - basic
- Recurrent neural networks have feedback links.
Figure: An RNN node. (Source: WildML blog)
Recurrent neural network schematic
Figure: RNN unfolded. (Source: WildML blog)
RNN forward computation
- An RNN computation happens over an input sequence and produces an output sequence, using the t-th element of the input and the neuron state at step (t − 1), ψt−1. The assumption is that ψt−1 encodes the earlier history.
- Let ψt−1, ψt be the states of the neuron at steps t − 1 and t respectively, wt the t-th input element and yt the t-th output element (the notation differs from the figure to be consistent with our earlier notation). Assuming U, V, W are known, ψt and yt are calculated by:

ψt = fa(Uwt + Wψt−1)
yt = fo(Vψt)

Here fa is the activation function, typically tanh, sigmoid or relu, which introduces non-linearity into the network. Similarly, fo is the output function, for example softmax, which gives a probability distribution over the vocabulary.
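The two equations above can be sketched as a single time step in numpy. This is a minimal illustration, not a full implementation: the function name `rnn_step` and the tiny sizes are hypothetical, with fa = tanh and fo = softmax.

```python
import numpy as np

def rnn_step(w_t, psi_prev, U, W, V):
    """One RNN time step following the slide's equations."""
    psi_t = np.tanh(U @ w_t + W @ psi_prev)   # psi_t = f_a(U w_t + W psi_{t-1})
    scores = V @ psi_t                        # pre-softmax scores V psi_t
    y_t = np.exp(scores - scores.max())
    y_t = y_t / y_t.sum()                     # y_t = f_o(V psi_t), a distribution
    return psi_t, y_t

# Tiny illustrative sizes: vocabulary 5, hidden state 3.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))
W = rng.normal(size=(3, 3))
V = rng.normal(size=(5, 3))
w = np.eye(5)[2]                              # 1-hot input for word index 2
psi, y = rnn_step(w, np.zeros(3), U, W, V)
```

Note that the same U, W, V would be reused at every time step of a sequence.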
- Often the inputs, or more commonly the outputs, at some time steps may not be useful. This depends on the nature of the task.
- Note that the weights represented by U, V, W are the same for all elements in the sequence; they are shared across time steps.
The above is a standard RNN. There are many variants in use that are better suited to specific NLP tasks, but in principle they are similar to the standard RNN.
Training RNNs
- RNNs use a variant of the standard backpropagation algorithm called backpropagation through time, or BPTT. To calculate the gradient at step t we need the gradients at earlier time steps.
- The need for gradients at earlier time steps leads to two complications:
a) Computations are expensive (see later).
b) The vanishing/exploding gradients (or unstable gradients) problem seen in deep neural nets.
Coding of input/output and dimensions of vectors and matrices
It is instructive to see how the input and output are represented and to calculate the dimensions of the vectors and matrices. Assume |V| is the vocabulary size and h is the hidden layer size, and that we are training a recurrent network language model. Then:
- Let m be the number of words in the current sentence. The input sentence will look like: [START-SENTENCE, w1, . . . , wi, wi+1, . . . , wm, END-SENTENCE].
- Each word wi in the sentence will be input as a 1-hot vector of size |V| with 1 at the index of wi in V and 0 elsewhere.
- Each desired output yi is a 1-hot vector of size |V| with 1 at the index of wi+1 in V and 0 elsewhere. The actual output ŷ is a softmax output vector of size |V|.
- The dimensions are: Sentence: m × |V|, U: h × |V|, V: |V| × h, W: h × h.
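These dimensions can be checked mechanically in numpy. The sizes below (|V| = 10, h = 4, m = 6) and the word indices are made up for illustration; the point is that every matrix-vector product conforms.

```python
import numpy as np

# Hypothetical sizes: |V| = 10, h = 4, sentence length m = 6.
V_size, h, m = 10, 4, 6

sentence = np.eye(V_size)[[0, 3, 7, 2, 5, 1]]   # m x |V|: one 1-hot row per word
U = np.zeros((h, V_size))                        # h x |V|
W = np.zeros((h, h))                             # h x h
V = np.zeros((V_size, h))                        # |V| x h

psi = np.zeros(h)
for w_t in sentence:                             # one step per word
    psi = np.tanh(U @ w_t + W @ psi)             # (h,|V|)@(|V|,) + (h,h)@(h,) -> (h,)
    y = V @ psi                                  # (|V|,h)@(h,) -> (|V|,)
```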
Update equations
At step t let yt be the true output value and ŷt the actual output of the RNN for input wt. Using the cross-entropy error (often used for such networks instead of the squared error) we get:

Et(yt, ŷt) = −yt log(ŷt)

For the sequence from 0 to t, which is one training sequence, we get:

E(y, ŷ) = −Σt yt log(ŷt)

BPTT will use backprop from 0 to t and use (stochastic) gradient descent to learn appropriate values for U, V, W. Assume t = 3; then the gradient for V is ∂E3/∂V.
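The sequence loss is just the sum of the per-step cross-entropy terms. A minimal sketch, with made-up 1-hot targets and softmax outputs over a 3-word vocabulary:

```python
import numpy as np

def seq_cross_entropy(targets, outputs):
    """E(y, yhat) = -sum_t y_t . log(yhat_t), for 1-hot target rows."""
    return float(-np.sum(targets * np.log(outputs)))

# Hypothetical 2-step example over a 3-word vocabulary.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])   # 1-hot next-word targets
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])    # softmax outputs
loss = seq_cross_entropy(y_true, y_hat)   # -(log 0.7 + log 0.8)
```

Because the targets are 1-hot, each step contributes only the negative log-probability assigned to the correct next word.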
Using the chain rule:

∂E3/∂V = (∂E3/∂ŷ3)(∂ŷ3/∂V)
       = (∂E3/∂ŷ3)(∂ŷ3/∂z3)(∂z3/∂V)
       = (ŷ3 − y3) ⊗ ψ3

where z3 = Vψ3 and u ⊗ v = uvT (the outer/tensor product).

Observation: ∂E3/∂V depends only on values at t = 3.
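With a softmax output and cross-entropy error this gradient reduces to an outer product, which can be sketched directly. The sizes and values below are illustrative only.

```python
import numpy as np

def grad_V(y_hat, y_true, psi):
    """dE3/dV = (yhat_3 - y_3) outer psi_3 for softmax + cross-entropy."""
    return np.outer(y_hat - y_true, psi)

# Hypothetical sizes |V| = 3, h = 2, values at t = 3.
y_hat = np.array([0.2, 0.5, 0.3])    # softmax output at t = 3
y_true = np.array([0.0, 1.0, 0.0])   # 1-hot target at t = 3
psi3 = np.array([0.4, -0.6])         # hidden state at t = 3
g = grad_V(y_hat, y_true, psi3)      # shape |V| x h, same as V
```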
For W (and similarly for U):

∂E3/∂W = (∂E3/∂ŷ3)(∂ŷ3/∂ψ3)(∂ψ3/∂W)

ψ3 = tanh(Uw3 + Wψ2), and ψ2 in turn depends on W, ψ1 etc. So we get:

∂E3/∂W = Σ_{t=0}^{3} (∂E3/∂ŷ3)(∂ŷ3/∂ψ3)(∂ψ3/∂ψt)(∂ψt/∂W)
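The sum above is usually computed as a single backward loop over time. A minimal numpy sketch, assuming a tanh recurrence and a softmax + cross-entropy output layer (so that ∂E3/∂ψ3 collapses to Vᵀ(ŷ3 − y3)); the function name and argument layout are hypothetical.

```python
import numpy as np

def grad_W_bptt(psis, psi0, y_hat, y_true, V, W):
    """dE_T/dW via BPTT, for tanh states and a softmax + cross-entropy output.

    psis: states [psi_1, ..., psi_T]; psi0: initial state;
    y_hat, y_true: softmax output and 1-hot target at the final step."""
    delta = V.T @ (y_hat - y_true)             # dE/dpsi at the last step
    dW = np.zeros_like(W)
    states = [psi0] + list(psis)
    for t in range(len(psis), 0, -1):          # t = T, T-1, ..., 1
        d_pre = delta * (1 - states[t] ** 2)   # back through tanh
        dW += np.outer(d_pre, states[t - 1])   # the dpsi_t/dW term for this t
        delta = W.T @ d_pre                    # back through dpsi_t/dpsi_{t-1}
    return dW
```

Each loop iteration contributes one term of the sum; the running `delta` carries the chained Jacobians ∂ψ3/∂ψt back through time.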
Figure: Backprop for W. (Source: WildML blog)
Vanishing gradients
- Neural networks with a large number of layers using sigmoid or tanh activation functions face the unstable gradients problem (typically vanishing gradients, since explosions can be blocked by a threshold).
- ∂ψ3/∂ψt in the equation for ∂E3/∂W is itself a product of gradients due to the chain rule:

∂ψ3/∂ψt = Π_{i=t+1}^{3} ∂ψi/∂ψi−1

So ∂ψ3/∂ψt is a product of Jacobian matrices, and we are doing O(m) matrix multiplications (m is the sentence length).
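A small numerical illustration of why this product is problematic: repeatedly multiplying a Jacobian whose norm is below 1 shrinks the product geometrically. The fixed matrix below is a hypothetical stand-in for the per-step tanh Jacobian (whose entries are bounded and typically small when units saturate).

```python
import numpy as np

J = np.array([[0.5, 0.2],
              [0.1, 0.4]])            # illustrative Jacobian, norm below 1
prod = np.eye(2)
norms = []
for _ in range(20):                   # 20 steps back in time
    prod = J @ prod
    norms.append(np.linalg.norm(prod))
# norms decay roughly geometrically: the gradient contribution
# from distant time steps becomes negligible
```

With a norm above 1 the same loop would blow up instead, which is the exploding-gradient side of the problem.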
How to address unstable gradients?
- Do not use gradient-based weight updates.
- Use relu, which has a gradient of 0 or 1 (relu(x) = max(0, x)).
- Use a different architecture: LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit). We go forward with this option, the most popular in the current literature.
Schematic of LSTM cell
Figure: A basic LSTM memory cell. (From deeplearning4j.org).
LSTM: step by step
- The key data entity in an LSTM memory cell is the hidden internal state Ct at time step t, which is a vector. This state is not visible outside the cell.
- The main computational step calculates the internal state Ct, which depends on the input xt and the previous visible state ht−1. But this dependence is not direct; it is controlled, or modulated, via two gates: the forget gate gf and the input gate gi. The output state ht that is visible outside the cell is obtained by modulating the hidden internal state Ct by the output gate go.
The gates
- Each gate gets as input the input vector xt and the previous visible state ht−1, each linearly transformed by the corresponding weight matrix.
- Intuitively, the gates control the following:
gf: what part of the previous cell state Ct−1 will be carried forward to step t.
gi: what part of xt and the visible state ht−1 will be used for calculating Ct.
go: what part of the current internal state Ct will be made visible outside, that is, ht.
- Ct does depend on the earlier state Ct−1 and the current input xt, but not directly as in an RNN. The gates gf and gi modulate the two parts (see later).
Difference between RNN and LSTM
The differences with respect to a standard RNN are:
1. There is no hidden state in an RNN. The state ht is directly computed from the previous state ht−1 and the current input xt, each linearly transformed via the respective weight matrices W and U.
2. There is no modulation or control: there are no gates.
The gate equations
Each gate takes as input the current input vector xt and the previous visible state vector ht−1:

gf = σ(Wf xt + Whf ht−1 + bf)
gi = σ(Wi xt + Whi ht−1 + bi)
go = σ(Wo xt + Who ht−1 + bo)

The gate non-linearity is always a sigmoid, which gives a value between 0 and 1 for each gate output. The intuition is: a value of 0 completely forgets the past, a value of 1 fully remembers the past, and an in-between value partially remembers it. This is what controls long-term dependencies.
Calculation of Ct, ht
- There are two vectors that must be calculated: the hidden internal state Ct and the visible cell state ht, which will be propagated to step (t + 1).
- The Ct calculation is done in two steps. First a candidate update C̃t is calculated from the two inputs to the memory cell, xt and ht−1:

C̃t = tanh(Wc xt + Whc ht−1 + bc)

Ct is then calculated by modulating the previous state with gf and adding to it the update C̃t modulated by gi:

Ct = (gf ⊗ Ct−1) ⊕ (gi ⊗ C̃t)

- Similarly, the new visible state or output ht is calculated by:

ht = go ⊗ tanh(Ct)
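The gate and state equations combine into one cell update, sketched below in numpy. The parameter-dictionary keys mirror the weight names in the equations and are otherwise hypothetical; ⊗ is the elementwise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the equations above."""
    gf = sigmoid(p["Wf"] @ x_t + p["Whf"] @ h_prev + p["bf"])      # forget gate
    gi = sigmoid(p["Wi"] @ x_t + p["Whi"] @ h_prev + p["bi"])      # input gate
    go = sigmoid(p["Wo"] @ x_t + p["Who"] @ h_prev + p["bo"])      # output gate
    C_cand = np.tanh(p["Wc"] @ x_t + p["Whc"] @ h_prev + p["bc"])  # candidate update
    C_t = gf * C_prev + gi * C_cand   # forget part of the past, add new content
    h_t = go * np.tanh(C_t)           # expose part of the internal state
    return h_t, C_t

# With all-zero weights the gates are 0.5 and the candidate is 0,
# so C_t is exactly half the previous cell state.
h, d = 2, 3
p = {k: np.zeros((h, d)) for k in ("Wf", "Wi", "Wo", "Wc")}
p.update({k: np.zeros((h, h)) for k in ("Whf", "Whi", "Who", "Whc")})
p.update({k: np.zeros(h) for k in ("bf", "bi", "bo", "bc")})
h_t, C_t = lstm_step(np.zeros(d), np.zeros(h), np.array([1.0, -2.0]), p)
```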
Detailed schematic of LSTM cell
Figure: Details of LSTM memory cell. (From deeplearning4j.org).
Peepholes
- A variant of the standard LSTM has peepholes. This means that the gates also have access to the hidden states Ct−1 and Ct. The gate equations become:

gf = σ(Wf xt + Whf ht−1 + Wcf Ct−1 + bf)
gi = σ(Wi xt + Whi ht−1 + Wci Ct−1 + bi)
go = σ(Wo xt + Who ht−1 + Wco Ct + bo)

This is shown by the blue dashed lines and the non-dashed line in the earlier diagram.
Gated Recurrent Units (GRU)
- A GRU is a simpler version of an LSTM. The differences from an LSTM are:
- A GRU does not have an internal hidden cell state (like Ct). Instead it has only the state ht.
- A GRU has only two gates: gr (the reset gate) and gu (the update gate). Together these perform the role of the earlier gf and gi LSTM gates. There is no output gate.
- Since there is no hidden cell state or output gate, the output is ht without a tanh non-linearity.
GRU equations
- The gate equations are:

gr = σ(Wri xt + Wrh ht−1 + br)
gu = σ(Wui xt + Wuh ht−1 + bu)

- The candidate state update is:

h̃t = tanh(Whi xt + Whh(gr ⊗ ht−1) + bh)

The gate gr controls what part of the previous state will be used in the update.
- The final state is:

ht = ((1 − gu) ⊗ h̃t) ⊕ (gu ⊗ ht−1)

The update gate gu controls the relative contribution of the update h̃t and the previous state ht−1.
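The three equations combine into one step, sketched below in numpy with hypothetical parameter names matching the text; ⊗ is the elementwise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above."""
    gr = sigmoid(p["Wri"] @ x_t + p["Wrh"] @ h_prev + p["br"])  # reset gate
    gu = sigmoid(p["Wui"] @ x_t + p["Wuh"] @ h_prev + p["bu"])  # update gate
    h_cand = np.tanh(p["Whi"] @ x_t + p["Whh"] @ (gr * h_prev) + p["bh"])
    return (1 - gu) * h_cand + gu * h_prev  # blend candidate and previous state

# With all-zero weights the gates are 0.5 and the candidate is 0,
# so h_t is exactly half the previous state.
h, d = 2, 3
p = {"Wri": np.zeros((h, d)), "Wui": np.zeros((h, d)), "Whi": np.zeros((h, d)),
     "Wrh": np.zeros((h, h)), "Wuh": np.zeros((h, h)), "Whh": np.zeros((h, h)),
     "br": np.zeros(h), "bu": np.zeros(h), "bh": np.zeros(h)}
h_t = gru_step(np.zeros(d), np.array([0.8, -0.4]), p)
```

Note there is no separate cell state and no output gate: the blended ht is both the memory and the output.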
GRU schematic
Figure: Details of GRU memory cell. (From wikimedia.org).