Deep Learning: Neural Network with Memory
speech.ee.ntu.edu.tw › ~tlkagk › courses › mlds_2015_2...
TRANSCRIPT
Neural Network with Memory
Hung-yi Lee
Memory is important
[Figure: a network with 2-dimensional input and 1-dimensional output; the input sequence x1, x2, x3 produces outputs y1, y2, y3. The same input vector appears at different time steps but must map to different outputs, which a memory-less network cannot do.]
Network needs memory to achieve this.
Memory is important
[Figure, shown over three steps: a toy network with inputs x1, x2, one output y, and a memory cell c1 initialized to 0. At each time step the memory is updated from the hidden values, so feeding the sequence x1, x2, x3 lets the network produce different outputs (e.g. 4, then 7) even when later inputs repeat earlier ones.]
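Why memory is needed can be sketched in a few lines: a stateless network maps identical inputs to identical outputs, while a network with a memory cell can respond differently to the same input at different time steps. The weights and update rule below are illustrative, not the slide's exact figure:

```python
def stateless(x):
    """A feed-forward net: the output depends only on the current input."""
    return sum(x)

def with_memory(xs):
    """A net with a memory cell c (initialized to 0) added into each output."""
    c, ys = 0, []
    for x in xs:
        y = sum(x) + c   # output uses the current input AND the memory
        c = y            # write the result back into the memory cell
        ys.append(y)
    return ys

seq = [(1, 1), (1, 1), (1, 1)]       # the same input at every step
print([stateless(x) for x in seq])   # [2, 2, 2] -- identical outputs
print(with_memory(seq))              # [2, 4, 6] -- memory makes them differ
```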
Outline
• Vanilla Recurrent Neural Network (RNN)
• Variants of RNN
• Long Short-term Memory (LSTM)
Application
• (Simplified) Speech Recognition
[Figure: two utterances; each acoustic frame x1, x2, x3, x4 …… gets a label y1, y2, y3, y4 …… — e.g. TSI TSI TSI I I N N N for Utterance 1 and S S @ @ @ @ for Utterance 2.]
Previously we used a DNN, so all the frames were considered independently.
RNN input: 𝑥1 𝑥2 𝑥3 …… 𝑥𝑁
Input of the RNN is one utterance; the order cannot change.
Step 1: a1 = σ(Wi x1 + Wh · 0) — the memory is initialized to 0, and a1 is copied into it.
y1 = softmax(Wo a1)
Step 2: a2 = σ(Wi x2 + Wh a1) — the memory now holds a1, and a2 is copied into it.
y2 = softmax(Wo a2)
RNN input: 𝑥1 𝑥2 𝑥3 …… 𝑥𝑁
RNN
x3
y3=softmax(Woa3)
WiWh
Wo
a3=σ(Wix3+Wha2)memory
copy
y1 y2
a10 a2
y3
a3
Input of RNN is one utterance
The order cannot change.
RNN
The same network — the same Wi, Wh, Wo — is used again and again: starting from the initial memory value 0, a1, a2, a3 … are passed forward step by step.
Output yi depends on x1, x2, …… xi.
Input data: 𝑥1 𝑥2 𝑥3 …… 𝑥𝑁 (one utterance).
(The same unrolling continues over the whole utterance.)
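The unrolled computation can be written as a short forward pass. A minimal numpy sketch with randomly initialized Wi, Wh, Wo (the names mirror the slides; sizes are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, Wi, Wh, Wo):
    """a_t = sigmoid(Wi x_t + Wh a_{t-1}), y_t = softmax(Wo a_t); a_0 = 0."""
    a = np.zeros(Wh.shape[0])
    ys = []
    for x in xs:                        # the order of the inputs cannot change
        a = sigmoid(Wi @ x + Wh @ a)    # update the memory
        ys.append(softmax(Wo @ a))
    return ys

rng = np.random.default_rng(0)
Wi = rng.normal(size=(4, 3))
Wh = rng.normal(size=(4, 4))
Wo = rng.normal(size=(2, 4))
xs = [rng.normal(size=3) for _ in range(5)]
ys = rnn_forward(xs, Wi, Wh, Wo)        # 5 outputs, each a distribution over 2 classes
```

Because the memory carries earlier inputs forward, reversing the input order changes the outputs; a frame-by-frame DNN would not notice the difference.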
Cost
RNN input: 𝑥1 𝑥2 𝑥3 …… 𝑥𝑁; RNN output: 𝑦1 𝑦2 𝑦3 …… 𝑦𝑁; target: 𝑦̂1 𝑦̂2 𝑦̂3 …… 𝑦̂𝑁
Squared error: C = (1/2) Σ_{n=1}^{N} ‖𝑦ⁿ − 𝑦̂ⁿ‖²
Cross entropy: C = Σ_{n=1}^{N} −log 𝑦ⁿ_{rⁿ}, where rⁿ is the target class at step n
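Both costs can be checked numerically. A small sketch (the target ŷⁿ is one-hot; rⁿ is its class index):

```python
import numpy as np

def squared_error(ys, targets):
    """C = 1/2 * sum_n ||y^n - yhat^n||^2"""
    return 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, targets))

def cross_entropy(ys, labels):
    """C = sum_n -log y^n[r^n], where r^n is the target class at step n."""
    return sum(-np.log(y[r]) for y, r in zip(ys, labels))

ys      = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]   # network outputs
targets = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]   # one-hot targets
print(squared_error(ys, targets))   # 0.5*(0.25+0.25) + 0.5*(0.01+0.01) = 0.26
print(cross_entropy(ys, [0, 0]))    # -log(0.5) - log(0.9) ≈ 0.7985
```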
Training
RNN training is very difficult in practice.
Backpropagation through time (BPTT): unroll the network over the sequence and backpropagate through the unrolled copies.
w ← w − η ∂C⁄∂w, where w is an element of Wh, Wi or Wo.
More Applications
• Input and output are vector sequences with the same length
POS Tagging: John saw the saw. → PN V D N (each word xi gets a tag yi)
More Applications
• Named entity recognition: identifying names of people, places, organizations, etc. in a sentence
  • Harry Potter is a student of Hogwarts and lived on Privet Drive.
  • Labels: person, organization, place, or not a named entity
• Information extraction: extract the pieces of information relevant to a specific application, e.g. flight booking
  • I would like to leave Boston on November 2nd and arrive in Taipei before 2 p.m.
  • Labels: place of departure, destination, time of departure, time of arrival, other
Outline
• Vanilla Recurrent Neural Network (RNN)
• Variants of RNN
• Long Short-term Memory (LSTM)
Elman Network & Jordan Network
[Figure: both unroll over xt, xt+1, …… with weights Wi, Wh, Wo. The Elman network feeds the hidden-layer values back as the memory; the Jordan network feeds the outputs yt back instead.]
Deep RNN
[Figure: several recurrent layers stacked; at each time step t every layer receives the layer below it and its own value from t−1, producing yt, yt+1, yt+2 ……]
Bidirectional RNN
[Figure: a forward RNN reads xt, xt+1, xt+2 left to right, a backward RNN reads them right to left, and each output yt is produced from both hidden states.]
Many to one
• Input is a vector sequence, but output is only one vector
Sentiment Analysis
[Figure: the characters of 我覺得太糟了 ("I think it's terrible") are fed in one by one; only the final output is used. Rating classes, PTT movie-board style: 超好雷 / 好雷 / 普雷 / 負雷 / 超負雷 (from very positive to very negative).]
看了這部電影覺得很高興…… ("Watching this movie made me happy") → Positive (正雷)
這部電影太糟了…… ("This movie is terrible") → Negative (負雷)
這部電影很棒…… ("This movie is great") → Positive (正雷)
Many to Many (Output is shorter)
• Both input and output are vector sequences, but the output is shorter.
Speech Recognition
[Figure: frames x1 x2 x3 x4 …… are labelled 好 好 好 棒 棒 棒 棒 棒; trimming the repeats gives "好棒" ("great").]
Problem: you can never recognize "好棒棒"! (the repeated 棒 is trimmed away)
Many to Many (Output is shorter)
• Both input and output are vector sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC)
• Add an extra "blank" symbol φ
好 φ φ 棒 φ φ φ φ → "好棒"
好 φ φ 棒 φ 棒 φ φ → "好棒棒"
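The decoding rule above — merge repeated labels, then delete φ — fits in a few lines. This sketches only the collapse step, not CTC training:

```python
from itertools import groupby

def ctc_collapse(frames, blank="φ"):
    """Merge consecutive duplicate labels, then drop the blank symbol."""
    merged = [label for label, _ in groupby(frames)]
    return "".join(label for label in merged if label != blank)

print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "φ", "φ", "φ"]))   # 好棒
print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "棒", "φ", "φ"]))  # 好棒棒
# Plain trimming has no blank, so a genuinely repeated character is lost:
print(ctc_collapse(["好", "好", "好", "棒", "棒", "棒"]))          # 好棒
```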
Many to Many (No Limitation)
• Both input and output are vector sequences with different lengths → sequence-to-sequence learning
Machine Translation: input "machine learning" → output 機 器 學 習 ("machine learning")
Problem: generation is never-ending — without a way to stop, the network keeps producing characters (機 器 學 習 慣 性 ……).
Many to Many (No Limitation)
• 推文接龍 (a PTT comment-chaining game)
• Ref: http://pttpedia.pixnet.net/blog/post/168133002-%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
推xxx: ptt萬歲  推dd: 歲平安  噓dddf: 全  推zzzzzzzzzzz: 家就是你家 ……
推tlkagk: =========斷========== (a "break" line ends the chain)
Many to Many (No Limitation)
• Both input and output are vector sequences with different lengths → sequence-to-sequence learning
• Add a stop symbol "===" (斷, "break")
Input "machine learning" → output 機 器 學 習 ===
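The stop symbol turns generation into a loop that runs until "===" is produced. A toy sketch — `next_token` is a hypothetical scripted stand-in for the decoder (a real model would sample from the RNN's softmax):

```python
def next_token(prefix):
    """Hypothetical stand-in for the decoder: returns the next output token."""
    script = ["機", "器", "學", "習", "==="]
    return script[len(prefix)]

def generate(stop="==="):
    out = []
    while True:
        tok = next_token(out)
        if tok == stop:        # the "===" (break) symbol ends generation
            return "".join(out)
        out.append(tok)

print(generate())   # 機器學習
```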
One to Many
• Input is one vector, but output is a vector sequence
Caption generation: input one image → output "a woman is throwing a ……", ending with === (break)
Outline
• Vanilla Recurrent Neural Network (RNN)
• Variants of RNN
• Long Short-term Memory (LSTM)
Long Short-term Memory (LSTM)
[Figure: a memory cell guarded by three gates — an input gate, an output gate and a forget gate — each opened or closed by a control signal from the other part of the network; the cell's input and output also connect to the rest of the network.]
LSTM is a special neuron: 4 inputs, 1 output.
The four inputs are 𝑧 (the cell input) and the gate signals 𝑧𝑖, 𝑧𝑓, 𝑧𝑜; 𝑐 is the current memory value.
The activation function f is usually a sigmoid: its value lies between 0 and 1, mimicking an open/close gate. The cell input passes through g, and the stored value through h, each multiplied by its gate:
𝑐′ = 𝑔(𝑧) 𝑓(𝑧𝑖) + 𝑐 𝑓(𝑧𝑓)
𝑎 = ℎ(𝑐′) 𝑓(𝑧𝑜)
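The four-input cell can be written directly from these formulas. A sketch assuming g and h are tanh and f is the sigmoid (the slides leave g and h unspecified):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(z, zi, zf, zo, c):
    """c' = g(z) f(zi) + c f(zf);  a = h(c') f(zo).  f = sigmoid, g = h = tanh."""
    c_new = math.tanh(z) * sigmoid(zi) + c * sigmoid(zf)
    a = math.tanh(c_new) * sigmoid(zo)
    return a, c_new

# Gate signals near +100 / -100 saturate the sigmoid to ~1 / ~0 (open / closed):
a, c_new = lstm_cell(z=0.0, zi=-100, zf=100, zo=-100, c=0.5)
# input gate closed, forget gate open -> memory kept; output gate closed -> a ~ 0
```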
Original network: inputs x1, x2; each neuron computes 𝑧1, 𝑧2 …… and outputs 𝑎1, 𝑎2 ……
Simply replace the neurons with LSTM cells.
[Figure: the same network with each neuron replaced by an LSTM cell; every cell now receives four weighted sums of the input instead of one.]
This needs 4 times the parameters.
LSTM - Example
Input sequence (x1, x2, x3) and desired output y:
x1:  1   3   2   4   2   1   3   6   1
x2:  0   1   0   1   0   0  −1   1   0
x3:  0   0   0   0   0   1   0   0   1
y:   0   0   0   0   0   7   0   0   6
memory: 0   0   3   3   7   7   7   0   6
When x2 = 1, add the number in x1 into the memory.
When x2 = −1, reset the memory.
When x3 = 1, output the number in the memory.
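The three rules can be run directly to verify the target outputs — a plain-Python sketch, before any actual gates are wired up:

```python
def run(seq):
    memory, ys = 0, []
    for x1, x2, x3 in seq:
        if x2 == 1:            # write: add x1 into the memory
            memory += x1
        elif x2 == -1:         # forget: reset the memory
            memory = 0
        ys.append(memory if x3 == 1 else 0)   # read only when x3 = 1
    return ys

seq = [(1, 0, 0), (3, 1, 0), (2, 0, 0), (4, 1, 0), (2, 0, 0),
       (1, 0, 1), (3, -1, 0), (6, 1, 0), (1, 0, 1)]
print(run(seq))   # [0, 0, 0, 0, 0, 7, 0, 0, 6]
```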
[Figure: one LSTM cell wired for the task. Reconstructed weights: cell input z = 1·x1; input gate z_i = 100·x2 − 10 (open only when x2 = 1); forget gate z_f = 100·x2 + 10 (closed only when x2 = −1); output gate z_o = 100·x3 − 10 (open only when x3 = 1).]
Input (1, 0, 0): input gate ≈ 0 and output gate ≈ 0, so the memory stays 0 and y ≈ 0.
Input (3, 1, 0): the input gate opens (f(90) ≈ 1), so 3 enters; the forget gate stays open, so the memory becomes 0 + 3 = 3; the output gate is closed, so y ≈ 0.
Input (4, 1, 0): 4 enters through the open input gate; the memory becomes 3 + 4 = 7; the output gate is closed, so y ≈ 0.
Input (2, 0, 0): the input gate is closed (f(−10) ≈ 0), so nothing enters; the memory stays 7; y ≈ 0.
Input (1, 0, 1): the input gate is closed; the memory stays 7; the output gate opens (f(90) ≈ 1), so y ≈ 7.
Input (3, −1, 0): the input gate is closed and the forget gate also closes (f(−90) ≈ 0), so the memory is reset to 0; y ≈ 0.
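The whole walkthrough can be replayed in code. A sketch using reconstructed weights (cell input z = x1; z_i = 100·x2 − 10; z_f = 100·x2 + 10; z_o = 100·x3 − 10) and, as the stored values 3 and 7 suggest, identity g and h:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x1, x2, x3, c):
    z  = x1                # cell input: weight 1 on x1
    zi = 100 * x2 - 10     # input gate: open only when x2 = 1
    zf = 100 * x2 + 10     # forget gate: closed only when x2 = -1
    zo = 100 * x3 - 10     # output gate: open only when x3 = 1
    c_new = z * sigmoid(zi) + c * sigmoid(zf)   # g, h taken as identity
    return c_new * sigmoid(zo), c_new

seq = [(1, 0, 0), (3, 1, 0), (2, 0, 0), (4, 1, 0), (2, 0, 0),
       (1, 0, 1), (3, -1, 0), (6, 1, 0), (1, 0, 1)]
c, ys = 0.0, []
for x1, x2, x3 in seq:
    y, c = lstm_step(x1, x2, x3, c)
    ys.append(round(y, 2))
print(ys)   # [0.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 6.0]
```

The large gate weights saturate the sigmoids, so the cell behaves almost exactly like the hand-specified add/reset/output rules.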
What is the next wave?
• Attention-based Model
[Figure: a DNN maps input x to output y; a reading-head controller positions a reading head, and a writing-head controller positions a writing head, over an external memory.]
Recommended Reading List
• The Unreasonable Effectiveness of Recurrent Neural Networks
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Understanding LSTM Networks
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Acknowledgement
• Thanks to classmate 葉軒銘 for spotting errors on the slides during class.