
Machine Translation, WiSe 2016/2017

Neural Machine Translation

Dr. Mariana Neves, January 30th, 2017

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

2

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

3

Neural MT

● „Neural MT went from a fringe research activity in 2014 to the widely-adopted leading way to do MT in 2016.“ (NMT ACL‘16)

● Google Scholar

● Since 2012: 28,600

● Since 2015: 22,500

● Since 2016: 16,100

4

Neural MT

5 [Picture from NMT ACL16 slides]

Neural MT

● „Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network“ (NMT ACL‘16)

6

[Picture from NMT ACL16 slides]

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

7

Artificial neuron

● Inputs correspond to the dendrites; outputs correspond to the axons

● Activation occurs when the sum of the weighted inputs exceeds a threshold, and the message is passed on (see the sketch below)

8 (http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
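As an illustration, a minimal sketch of a single artificial neuron with a hard threshold activation; the inputs, weights, and threshold are made-up values for the example:

```python
import numpy as np

def neuron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of inputs exceeds the threshold."""
    weighted_sum = np.dot(inputs, weights)
    return 1 if weighted_sum > threshold else 0

# Made-up example: three "dendrite" inputs and their weights.
x = np.array([0.5, 1.0, 0.2])
w = np.array([0.4, 0.3, 0.9])

print(neuron(x, w, threshold=0.6))  # weighted sum = 0.68 > 0.6, so the neuron fires: 1
```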

Artificial neural networks (ANN)

● Statistical models inspired by biological neural networks

● They model and process nonlinear relationships between input and output

● They are based on adaptive weights and a cost function

● Training relies on optimization techniques, e.g., gradient descent and stochastic gradient descent

9

Basic architecture of ANNs

● Layers of artificial neurons

● Input layer, hidden layer, output layer (a minimal forward pass is sketched below)

● Overfitting can occur with increasing model complexity

10 (http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
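A minimal sketch of the layered architecture above: a forward pass through one hidden layer and an output layer, with randomly initialized, made-up weights and sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up layer sizes: 4 inputs, 3 hidden units, 2 outputs.
W1 = rng.normal(size=(4, 3))   # input layer -> hidden layer
W2 = rng.normal(size=(3, 2))   # hidden layer -> output layer

def forward(x):
    hidden = np.tanh(x @ W1)   # nonlinear activation in the hidden layer
    return hidden @ W2         # linear output layer

x = np.array([1.0, 0.5, -0.3, 0.8])
print(forward(x))
```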

Deep learning

● Certain types of NN that consume raw input data directly

● Data is processed through many layers of nonlinear transformations

11

Deep learning – feature learning

● Deep learning excels at unsupervised feature extraction, i.e., the automatic derivation of meaningful features from the input data

● They learn which features are important for a task

● As opposed to feature selection and feature engineering, which are common tasks in classical machine learning approaches

12

Deep learning - architectures

● Feed-forward neural networks

● Recurrent neural network

● Multi-layer perceptrons

● Convolutional neural networks

● Recursive neural networks

● Deep belief networks

● Convolutional deep belief networks

● Self-Organizing Maps

● Deep Boltzmann machines

● Stacked de-noising auto-encoders

13

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

14

Language Models (LM) for MT

15

LM for MT

16

N-gram Neural LM with feed-forward NN

17 (https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)

Input: one-hot representations of the words in the context u_1, ..., u_(n-1), where n is the order of the language model

N-gram Neural LM with feed-forward NN

● Input: context of n-1 previous words

● Output: probability distribution for next word

● Size of input/output: vocabulary size

● One or many hidden layers

● The embedding layer is lower-dimensional and dense (a forward-pass sketch follows this slide)

– Smaller weight matrices

– Learns to map similar words to similar points in the vector space

18 (https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)
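A minimal sketch of this forward pass, with made-up sizes (vocabulary of 10, 8-dimensional embeddings, one hidden layer of 16 units, trigram order) and random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h, n = 10, 8, 16, 3        # made-up vocabulary size, embedding dim, hidden dim, n-gram order
E = rng.normal(size=(V, d))      # embedding matrix (dense, low-dimensional)
W_h = rng.normal(size=((n - 1) * d, h))
W_out = rng.normal(size=(h, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """context_ids: the n-1 previous words, given as vocabulary indices."""
    embedded = E[context_ids].reshape(-1)   # look up and concatenate the context embeddings
    hidden = np.tanh(embedded @ W_h)        # one hidden layer
    return softmax(hidden @ W_out)          # probability distribution over the vocabulary

p = next_word_distribution([2, 0])          # p(next word | two previous words)
print(p.shape, p.sum())                     # (10,) 1.0
```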

One-hot representation

● Corpus: „the man runs.“

● Vocabulary = {man,runs,the,.}

● Input/output for p(runs | the man), see the sketch below:

x_0 = [0, 0, 1, 0]^T   ("the")

x_1 = [1, 0, 0, 0]^T   ("man")

y_true = [0, 1, 0, 0]^T   ("runs")

19
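A minimal sketch that builds these one-hot vectors from the vocabulary above (the helper name is made up):

```python
import numpy as np

vocabulary = ["man", "runs", "the", "."]

def one_hot(word):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(word)] = 1.0
    return vec

x0 = one_hot("the")       # one-hot for "the"  (index 2)
x1 = one_hot("man")       # one-hot for "man"  (index 0)
y_true = one_hot("runs")  # one-hot for "runs" (index 1)
print(x0, x1, y_true)
```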

Softmax function

● It normalizes the output vector into a probability distribution (the entries sum to 1)

● Its computational cost is linear in the vocabulary size

● When combined with stochastic gradient descent, it minimizes cross-entropy (perplexity)

20

p(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^{K} e^{x^T w_k}}

x^T w_j is the inner product of x (sample vector) and w_j (weight vector for class j)

Softmax function

● Example:

– input = [1,2,3,4,1,2,3]

– softmax = [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

● The output has most of its weight where the '4' was in the original input.

● The function highlights the largest values and suppresses values that are significantly below the maximum value (see the numpy sketch below).

21 (https://en.wikipedia.org/wiki/Softmax_function)
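A minimal numpy sketch of the softmax function that reproduces the example above (values rounded to three decimals):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the maximum for numerical stability
    return e / e.sum()

scores = np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)
print(np.round(softmax(scores), 3))
# [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```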

Classical neural language model (Bengio et al. 2003)

22

(http://sebastianruder.com/word-embeddings-1/)

Feed-forward neural language model (FFNLM) in SMT

● One more feature in the log-linear phrase-based model

23

Recurrent neural networks language model (RNNLM)

● Recurrent neural networks (RNN) are a class of NN in which connections between the units form a directed cycle

● They make use of sequential information

● They do not assume that inputs and outputs are independent of each other

24

(http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)

RNNLM

● Conditions on arbitrarily long contexts

● No Markov assumption

● It reads one word at a time and updates the network state incrementally (a minimal step sketch follows this slide)

25

(http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
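A minimal sketch of one recurrent step for the RNNLM: the hidden state is updated with each new word, so the context is not limited to a fixed n-gram window (all sizes and weights are made up and untrained):

```python
import numpy as np

rng = np.random.default_rng(0)
V, h = 10, 16                      # made-up vocabulary size and hidden size
W_xh = rng.normal(size=(V, h))     # input (one-hot word) -> hidden
W_hh = rng.normal(size=(h, h))     # previous hidden -> hidden (the recurrent connection)
W_hy = rng.normal(size=(h, V))     # hidden -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, h_prev):
    x = np.zeros(V); x[word_id] = 1.0
    h_new = np.tanh(x @ W_xh + h_prev @ W_hh)   # incremental state update
    p_next = softmax(h_new @ W_hy)              # distribution over the next word
    return h_new, p_next

state = np.zeros(h)
for word_id in [2, 0, 1]:                       # read one word at a time
    state, p_next = rnn_step(word_id, state)
```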

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

26

Translation modelling

● Source sentence S of length m: x1, . . . , xm

● Target sentence T of length n: y1, . . . , yn

27

T *=argmax

tP (T∣S )

P (T∣S)=P ( y1 , ... , yn∣x1 , ... , xm)

P (T∣S)=∏i=1

n

P( yi∣y0 , ... , yi−1 , x1 , ... , xm)

Encoder-Decoder

● Two RNNs (usually LSTM):

– encoder reads input and produces hidden state representations

– decoder produces the output, based on the last encoder hidden state (a minimal sketch follows this slide)

28 [Picture from NMT ACL16 slides]
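A minimal sketch of the encoder-decoder idea, with plain RNN cells instead of LSTMs, greedy decoding, and made-up weights and vocabulary ids (including a hypothetical end-of-sentence id); the decoder is seeded with the encoder's last hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, h = 10, 12, 16                 # made-up vocabulary and hidden sizes
E_src = rng.normal(size=(V_src, h)); W_enc = rng.normal(size=(h, h))
E_tgt = rng.normal(size=(V_tgt, h)); W_dec = rng.normal(size=(h, h))
W_out = rng.normal(size=(h, V_tgt))
EOS = 0                                      # hypothetical end-of-sentence id

def encode(source_ids):
    state = np.zeros(h)
    for i in source_ids:                     # read the source, word by word
        state = np.tanh(E_src[i] + state @ W_enc)
    return state                             # summary vector of the source sentence

def decode(summary, max_len=10):
    state, word, output = summary, EOS, []
    for _ in range(max_len):
        state = np.tanh(E_tgt[word] + state @ W_dec)
        word = int(np.argmax(state @ W_out))  # greedy choice of the next target word
        if word == EOS:
            break
        output.append(word)
    return output

print(decode(encode([3, 1, 4])))
```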

Long short-term memory (LSTM)

● It is a special kind of RNN

● It connects previous information to the present task

● It is capable of learning long-term dependencies

29 (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Long short-term memory (LSTM)

● LSTMs have four interacting layers (a minimal cell sketch follows this slide)

● But there are many variations of the architecture

30 (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
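A minimal sketch of one LSTM cell step, showing the four interacting layers (forget gate, input gate, candidate values, output gate); sizes and weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                                            # made-up input and hidden sizes
W = {g: rng.normal(size=(d + h, h)) for g in "fico"}    # one weight matrix per gate/layer
b = {g: np.zeros(h) for g in "fico"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(z @ W["f"] + b["f"])        # forget gate: what to drop from the cell state
    i = sigmoid(z @ W["i"] + b["i"])        # input gate: what to write
    c_tilde = np.tanh(z @ W["c"] + b["c"])  # candidate values
    o = sigmoid(z @ W["o"] + b["o"])        # output gate
    c = f * c_prev + i * c_tilde            # new cell state (long-term memory)
    h_new = o * np.tanh(c)                  # new hidden state
    return h_new, c

h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h))
```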

Encoder-Decoder

● Encoder and decoder are learned jointly

● The supervision signal from the parallel corpus is backpropagated through both

31 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Encoder (continuous-space representation)

● The encoder linearly projects the 1-of-K coded vector w_i with a matrix E, which has as many columns as there are words in the source vocabulary and as many rows as desired (typically 100–500)

32

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Encoder (summary vector)

● The last encoder hidden state summarizes the source sentence

● But translation quality decreases for long sentences (it is a single fixed-size vector)

33 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Encoder (summary vector)

● Projection to 2D using Principal Component Analysis (PCA); a minimal sketch follows this slide

34 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
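A minimal sketch of such a 2D visualization, assuming a matrix of encoder summary vectors is already available (replaced here by random, made-up data); it uses scikit-learn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
summary_vectors = rng.normal(size=(50, 512))   # made-up stand-in: 50 sentences, 512-dim summaries

points_2d = PCA(n_components=2).fit_transform(summary_vectors)
print(points_2d.shape)                         # (50, 2): one 2D point per sentence
```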

The Decoder

● The inverse of the encoder

● Based on the softmax function

35

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

Problem with simple E-D architectures

● A fixed-size vector from which the decoder needs to generate a full translation

● The context vector must contain every single detail of the source sentence

● The dimensionality of the context vector must be large enough that a sentence of any length can be compressed into it

36 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

Problem with simple E-D architectures

● Large models are necessary to cope with long sentences

37 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

(experiments with small models)

Bidirectional recurrent neural network (BRNN)

● Use a memory with as many banks as source words, instead of a fixed-size context vector

● BRNN = forward RNN + backward RNN

38

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

Bidirectional recurrent neural network (BRNN)

● At any point, the forward and backward vectors together summarize the whole input sentence

39 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

Bidirectional recurrent neural network (BRNN)

● This mechanism allows storing the source sentence as a variable-length representation (a minimal sketch follows this slide)

40 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
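A minimal sketch of the bidirectional encoder: a forward and a backward RNN pass over the source, with their states concatenated per word into annotation vectors (made-up weights, plain RNN cells for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
V, h = 10, 16                                    # made-up vocabulary size and hidden size
E = rng.normal(size=(V, h))
W_fwd = rng.normal(size=(h, h))
W_bwd = rng.normal(size=(h, h))

def run_rnn(ids, W):
    states, state = [], np.zeros(h)
    for i in ids:
        state = np.tanh(E[i] + state @ W)
        states.append(state)
    return states

def bidirectional_encode(source_ids):
    forward = run_rnn(source_ids, W_fwd)                # left-to-right pass
    backward = run_rnn(source_ids[::-1], W_bwd)[::-1]   # right-to-left pass, re-aligned
    # One annotation vector per source word: a variable-length memory, not a single fixed vector.
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

annotations = bidirectional_encode([3, 1, 4, 1])
print(len(annotations), annotations[0].shape)           # 4 (32,)
```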

Soft Attention mechanism

● It is a small NN that takes as input the previous decoder hidden state (what has been translated so far) and each source annotation vector

41 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

Soft Attention mechanism

● It contains one hidden layer and outputs a scalar score per source word

● Normalization (so that the scores sum up to 1) is done with the softmax function

42 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

Soft Attention mechanism

● The model learns attention (a soft alignment) between the two languages

43 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)

Soft Attention mechanism

● With this mechanism, the quality of the translation does not drop as the sentence length increases (a minimal sketch of the attention step follows this slide)

44 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
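A minimal sketch of the soft attention step described above: a small network scores each source annotation against the previous decoder state, the scores are normalized with softmax, and the context vector is the weighted sum. All weights are made up, and the annotations are random stand-ins for the BRNN outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
h_dec, h_ann, h_att = 16, 32, 8                  # made-up decoder, annotation and attention sizes
W_dec = rng.normal(size=(h_dec, h_att))
W_ann = rng.normal(size=(h_ann, h_att))
v = rng.normal(size=h_att)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, annotations):
    # One hidden layer producing a scalar score per source position.
    scores = np.array([v @ np.tanh(decoder_state @ W_dec + a @ W_ann) for a in annotations])
    weights = softmax(scores)                    # normalized attention (soft alignment)
    context = sum(w * a for w, a in zip(weights, annotations))
    return context, weights

annotations = [rng.normal(size=h_ann) for _ in range(4)]   # stand-in for the BRNN annotations
context, weights = attend(rng.normal(size=h_dec), annotations)
print(np.round(weights, 3), weights.sum())                 # attention weights sum to 1
```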

Overview

● Introduction

● Neural networks

● Neural language models

● Attentional encoder-decoder

● Google NMT

45

Google Multilingual NMT system (Nov/16)

● Simplicity:

– Single NMT model to translate between multiple languages, instead of many models (100²)

● Low-resource language improvement:

– Improve low-resource language pairs by mixing them with high-resource language pairs in a single model

● Zero-shot translation:

– It learns to perform implicit bridging between language pairs never seen explicitly during training

46 (https://arxiv.org/abs/1611.04558)

Google NMT system (Sep-Oct/16)

● Deep LSTM network with 8 encoder and 8 decoder layers

47

(https://arxiv.org/abs/1609.08144)

Google NMT system (Sep-Oct/16)

● Normal LSTM (left) vs. stacked LSTM (right) with residual connections

48 (https://arxiv.org/abs/1609.08144)

Google NMT system (Sep-Oct/16)

● Outputs from the forward LSTM (LSTMf) and the backward LSTM (LSTMb) are first concatenated and then fed to the next LSTM layer (LSTM1)

49 (https://arxiv.org/abs/1609.08144)

Google NMT system (Sep-Oct/16)

● Wordpiece model (WPM): an implementation initially developed to solve a Japanese/Korean segmentation problem

● Data-driven approach that maximizes the language-model likelihood of the training data (a toy segmentation sketch follows this slide)

50 (https://arxiv.org/abs/1609.08144)

(“_” is a special character added to mark the beginning of a word.)
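A toy sketch of wordpiece-style segmentation, only to illustrate the "_" word-boundary marker: the vocabulary below is hand-picked for the example, and greedy longest-match is used instead of the actual likelihood-driven WPM training described in the paper:

```python
# Toy illustration of wordpiece-style segmentation with the "_" word-boundary marker.
# The vocabulary below is hand-picked for this example; the real WPM vocabulary is
# learned from data by maximizing language-model likelihood, which is not shown here.
wordpieces = {"_J", "et", "_makers", "_fe", "ud", "_over", "_seat", "_width"}

def segment(word):
    """Greedy longest-match split of a single word into known wordpieces."""
    token, pieces = "_" + word, []
    while token:
        for end in range(len(token), 0, -1):
            if token[:end] in wordpieces:
                pieces.append(token[:end])
                token = token[end:]
                break
        else:                      # no known piece matches: give up on this word
            return ["<unk>"]
    return pieces

sentence = "Jet makers feud over seat width"
print([p for w in sentence.split() for p in segment(w)])
# ['_J', 'et', '_makers', '_fe', 'ud', '_over', '_seat', '_width']
```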

Google Multilingual NMT system (Nov/16)


51 (https://arxiv.org/abs/1611.04558)

Google Multilingual NMT system (Nov/16)

● Introduction of an artificial token at the beginning of the input sentence to indicate the target language the model should translate to (a minimal sketch follows this slide)

52 (https://arxiv.org/abs/1611.04558)
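A minimal sketch of that idea: the only change to the input is prepending a token such as "<2es>" (translate to Spanish) to the source sentence; the helper function itself is made up for illustration:

```python
def add_target_token(source_sentence: str, target_language: str) -> str:
    """Prepend an artificial token telling the multilingual model which language to produce."""
    return f"<2{target_language}> {source_sentence}"

print(add_target_token("How are you?", "es"))
# <2es> How are you?
```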

Google Multilingual NMT system (Nov/16)

● Experiments: Many to one

53 (https://arxiv.org/abs/1611.04558)

Google Multilingual NMT system (Nov/16)

● Experiments: One to many

54 (https://arxiv.org/abs/1611.04558)

Google Multilingual NMT system (Nov/16)

● Experiments: Many to many

55 (https://arxiv.org/abs/1611.04558)

Google Multilingual NMT system (Nov/16)

● Experiments: Zero-Shot translation

56 (https://arxiv.org/abs/1611.04558)

Summary

● Very brief introduction to neural networks

● Neural language models

– One-hot representations (1-of-K coded vector)

– Softmax function

● Neural machine translation

– Recurrent NN; LSTM

– Encoder and Decoder

– Soft attention mechanism (BRNN)

● Google MT

– Architecture and multilingual experiments

57

Suggested reading

● Artificial Intelligence, Deep Learning, and Neural Networks Explained: http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/

● Introduction to Neural Machine Translation with GPUs: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/

● Neural Machine Translation slides, ACL‘2016: https://sites.google.com/site/acl16nmt/

● Neural Machine Translation slides (Univ. Edinburgh) http://statmt.org/mtma16/uploads/mtma16-neural.pdf

58
