
Page 1: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

[course site]

Day 3 Lecture 4

Neural Machine Translation

Marta R. Costa-jussà

Page 2: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

2

Acknowledgments

Kyunghyun Cho, NVIDIA blog: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/

Page 3: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

3

Previous concepts from this course

● Recurrent neural networks (LSTM and GRU), which handle variable-length sequences

● Word embeddings

● Language Modeling (assign a probability to a sentence)

Page 4: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

4

Machine Translation background

Machine Translation is the task of automatically translating text from a source language (S) into a target language (T).

Page 5: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

5

Rule-based approach

The main approaches have been either rule-based or statistical.

Page 6: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

6

Statistical-based approach

The main approaches have been either rule-based or statistical.

Page 7: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

7

Why a new approach?

Developing a good rule-based system takes years of manual effort.

Regarding statistical systems:
(1) Word alignment and translation are optimized separately.
(2) Translation is done at the level of words, which causes difficulties for languages with rich morphology (e.g. English-to-Finnish translation).
(3) Systems are built per language pair:
    (a) it is difficult to devise an automatic interlingua
    (b) performance is poor for low-resourced languages

Page 8: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

8

Why Neural Machine Translation?

● Integrated MT paradigm

● Trainable at the subword/character level

● Multilingual advantages

Page 9: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

9

What do we need?

● Parallel corpus

The same requirement as for phrase-based systems

Page 10: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

10

Sources of parallel corpus

● European Parliament Plenary Sessions (EPPS) transcriptions
● Canadian Hansards
● United Nations
● CommonCrawl
● ...

International evaluation campaigns: the Conference on Machine Translation (WMT) and the International Workshop on Spoken Language Translation (IWSLT)

Page 11: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

11

What else do we need?

The same requirements as for phrase-based systems

An automatic evaluation measure

Page 12: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

12

Towards Neural Machine Translation

Page 13: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

13

Encoder-Decoder

[Figure: front view and side view of the encoder-decoder architecture; the vector between the two is the representation of the sentence]

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.” arXiv preprint arXiv:1406.1078 (2014).

Page 14: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

Encoder

Page 15: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

15

Encoder in three steps

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

(1) One-hot encoding
(2) Continuous space representation
(3) Sequence summarization

Page 16: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

16

Step 1: One-hot encoding

Natural language words can also be one-hot encoded as vectors of dimensionality equal to the size of the dictionary (K).

Word One-hot encoding

economic 000010...

growth 001000...

has 100000...

slowed 000001...

From previous lecture on language modeling
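As a toy illustration (not from the slides): a minimal sketch of one-hot encoding in Python, assuming a hypothetical six-word vocabulary whose word-to-index mapping is arbitrary.

    import numpy as np

    # Hypothetical toy vocabulary; the word-to-index mapping is arbitrary.
    vocab = {"economic": 0, "growth": 1, "has": 2, "slowed": 3, "down": 4, "<eos>": 5}
    K = len(vocab)  # size of the dictionary

    def one_hot(word):
        # A vector of dimensionality K with a single 1 at the word's index.
        v = np.zeros(K)
        v[vocab[word]] = 1.0
        return v

    print(one_hot("growth"))  # [0. 1. 0. 0. 0. 0.]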

Page 17: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

17

Step 2: Projection to continuous space

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

The one-hot vector w_i (of dimension K) is linearly projected to a continuous space of lower dimension (typically 100-500) with a matrix E of learned weights:

s_i = E w_i
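Continuing the toy sketch from the previous slide (the embedding size d = 300 and the random initialization of E are illustrative assumptions; in a real system E is learned):

    d = 300  # hypothetical embedding size (the slide suggests 100-500)
    E = np.random.randn(d, K) * 0.01  # stands in for the learned projection matrix

    s_i = E @ one_hot("growth")  # s_i = E w_i
    # For a one-hot w_i this is just a column lookup, which is how it is done in practice:
    assert np.allclose(s_i, E[:, vocab["growth"]])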

Page 18: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

18

Step 3: Recurrence

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Page 19: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

19

Step 3: Recurrence

Figure: Christopher Olah, “Understanding LSTM Networks” (2015)

The recurrent activation should be either an LSTM or a GRU cell.
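Continuing the toy sketch: a minimal encoder recurrence h_t = f(h_{t-1}, s_t). A plain tanh cell is used here only for brevity; as the slide says, the cell should be an LSTM or a GRU, and all weights below are hypothetical.

    h_dim = 4  # hypothetical hidden-state size
    W = np.random.randn(h_dim, d) * 0.01      # input-to-hidden weights
    U = np.random.randn(h_dim, h_dim) * 0.01  # hidden-to-hidden weights
    b = np.zeros(h_dim)

    def encode(words):
        # Read the source sentence word by word; the final hidden state
        # is the sequence summary (step 3).
        h = np.zeros(h_dim)
        for w in words:
            s = E[:, vocab[w]]              # steps 1-2: one-hot + projection
            h = np.tanh(W @ s + U @ h + b)  # step 3: recurrence
        return h

    h_T = encode(["economic", "growth", "has", "slowed"])  # summary vector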

Page 20: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

Decoder

Page 21: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

21

Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

The decoder RNN’s internal state z_i (the new internal state) depends on: the summary vector h_T, the previous output word u_{i-1} and the previous internal state z_{i-1}.
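Continuing the toy sketch, a minimal decoder state update z_i = f(h_T, u_{i-1}, z_{i-1}); again a plain tanh cell with hypothetical weights stands in for the GRU/LSTM used in practice.

    Wd = np.random.randn(h_dim, d) * 0.01      # weights for the previous output word
    Ud = np.random.randn(h_dim, h_dim) * 0.01  # weights for the previous internal state
    Cd = np.random.randn(h_dim, h_dim) * 0.01  # weights for the summary vector
    bd = np.zeros(h_dim)

    def decoder_step(h_T, u_prev, z_prev):
        # New internal state from the sentence summary, the embedding of the
        # previously emitted word, and the previous internal state.
        return np.tanh(Cd @ h_T + Wd @ u_prev + Ud @ z_prev + bd)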

Page 22: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

22

Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights for word k...
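Continuing the toy sketch, one score per target word via a dot product (the output weight matrix W_out and bias b_out are hypothetical; the toy vocabulary is reused as the target vocabulary):

    W_out = np.random.randn(K, h_dim) * 0.01  # one weight row w_k per target word k
    b_out = np.zeros(K)

    def score(z_i):
        # e(k) = w_k . z_i + b_k for every word k in the target vocabulary
        return W_out @ z_i + b_out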

Page 23: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

23

Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

A score is higher if the weight vector w_k for word k and the decoder’s internal state z_i are similar to each other.

Remember: a dot product gives the length of the projection of one vector onto another. For similar (nearly parallel) vectors this projection is longer than for very different (nearly perpendicular) ones.
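A toy numeric illustration of that geometric picture (hypothetical 2-D vectors): for nearly parallel vectors, (1, 1) · (0.9, 1) = 1.9, whereas for perpendicular vectors, (1, 1) · (1, -1) = 0.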

Page 24: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

24

Decoder

Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989

...we can finally normalize the scores to word probabilities with a softmax: the probability that the i-th output word is word k, given the previous words and the hidden state, is computed from the score for word k.
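Written out explicitly (a standard formulation consistent with the slide’s annotations, using the notation of the previous slides), the probability of the i-th output word is:

    p(u_i = k \mid u_{<i}, h_T) = \frac{\exp(e(k))}{\sum_{j=1}^{K} \exp(e(j))},
    \qquad e(k) = w_k^{\top} z_i + b_k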

Page 25: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

25

Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

go back to the 1st step…

(1) compute the decoder’s internal state
(2) score and normalize the target words
(3) select the next word
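Continuing the toy sketch, the whole loop as greedy decoding (picking the single most probable word at each step; real systems typically refine this with beam search):

    def greedy_decode(source_words, max_len=20):
        h_T = encode(source_words)        # encoder summary vector
        z = np.zeros(h_dim)               # initial decoder state
        u_prev = E[:, vocab["<eos>"]]     # conventional start symbol (assumption)
        output = []
        for _ in range(max_len):
            z = decoder_step(h_T, u_prev, z)       # (1) new internal state
            e = score(z)                           # (2) score target words...
            p = np.exp(e - e.max()); p /= p.sum()  #     ...and normalize (softmax)
            k = int(np.argmax(p))                  # (3) select the next word
            word = next(w for w, i in vocab.items() if i == k)
            if word == "<eos>":                    # stop at the end-of-sentence token
                break
            output.append(word)
            u_prev = E[:, k]                       # feed the chosen word back in
        return output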

Page 26: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

26

Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

More words for the decoded sentence are generated until an <EOS> (End Of Sentence) token is predicted.

Page 27: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

Training

Page 28: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

28

Training: Maximum Likelihood Estimation

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
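The slide only carries the title and attribution; a standard way to write the maximum-likelihood objective for a parallel corpus of N sentence pairs (X^n, Y^n) is (notation mine, not the slide’s):

    \max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{|Y^n|} \log p_{\theta}\left(y_i^n \mid y_{<i}^n, X^n\right)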

Page 29: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

29

Computational Complexity

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Page 30: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

Why might this not work?

Page 31: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

Why might this not work? We are encoding the entire source sentence into a single context vector.

Page 32: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

How to solve this? With the attention-based mechanism…

more details tomorrow

Page 33: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

33

Summary

● Machine Translation is addressed as a sequence-to-sequence problem
● The source sentence is encoded into a fixed-length vector, and this fixed-length vector is decoded into the most probable target sentence
● Only a parallel corpus and automatic evaluation measures are required to train a neural machine translation system

Page 34: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

34

Learn more

Natural Language Understanding with Distributed Representation, Kyunghyun Cho, Chapter 6, 2015 (available on GitHub)

Page 37: Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

37

Another useful image for encoding-decoding

Kyunghyun Cho, “Natural Language Understanding with Distributed Representations” (2015)

[Figure: the ENCODER reads the input words and the DECODER generates the output words]