
EE-559 – Deep learning

11. Recurrent Neural Networks and Natural Language Processing

François Fleuret

https://fleuret.org/dlc/

June 17, 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Inference from sequences


Many real-world problems require processing a signal with a sequence structure. Although they may often be cast as standard “fixed size input” tasks, they may involve a (very) variable “time horizon”.

Sequence classification:

• sentiment analysis,

• activity/action recognition,

• DNA sequence classification,

• action selection.

Sequence synthesis:

• text synthesis,

• music synthesis,

• motion synthesis.

Sequence-to-sequence translation:

• speech recognition,

• text translation,

• part-of-speech tagging.


Given a set X, let S(X) be the set of sequences of elements from X:

S(X) = ⋃_{t=1}^{∞} X^t.

We can define formally:

Sequence classification: f : S(X) → {1, ..., C}

Sequence synthesis: f : R^D → S(X)

Sequence-to-sequence translation: f : S(X) → S(Y)

In the rest of the slides we consider only time sequences, although it generalizes to arbitrary sequences of variable size.


RNN and backprop through time


A recurrent model maintains a recurrent state updated at each time step.

With X = R^D, given an input sequence x ∈ S(R^D) and an initial recurrent state h_0 ∈ R^Q, the model would compute the sequence of recurrent states iteratively

∀t = 1, ..., T(x),   h_t = Φ(x_t, h_{t−1}),

where Φ_w : R^D × R^Q → R^Q.

A prediction can be computed at any time step from the recurrent state

y_t = Ψ(h_t)

with Ψ_w : R^Q → R^C.


[Figure: the model unrolled through time. Starting from h_0, each input x_t is combined with h_{t−1} by Φ to produce h_t, every step sharing the same weights w, and Ψ can compute a prediction y_t from any of the recurrent states.]

Even though the number of steps T depends on x, this is a standard graph of tensor operations, and autograd can deal with it as usual. This is referred to as “backpropagation through time” (Werbos, 1988).


We consider the following simple binary sequence classification problem:

• Class 1: the sequence is the concatenation of two identical halves,

• Class 0: otherwise.

E.g.

x                          y
(1, 2, 3, 4, 5, 6)         0
(3, 9, 9, 3)               0
(7, 4, 5, 7, 5, 4)         0
(7, 7)                     1
(1, 2, 3, 1, 2, 3)         1
(5, 1, 1, 2, 5, 1, 1, 2)   1
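A minimal sketch of a generator for this task (the SequenceGenerator used later in these slides is not shown, so the names, arguments and encoding here are assumptions) could produce a one-hot encoded sequence together with its label:

import random
import torch

# Hypothetical generator: returns a one-hot encoded sequence of symbol IDs and
# its class (1 if the sequence is the concatenation of two identical halves).
def generate_sequence(nb_symbols = 10, pattern_length_max = 10):
    pattern = [random.randrange(nb_symbols)
               for _ in range(random.randint(1, pattern_length_max))]
    if random.random() < 0.5:
        seq, label = pattern + pattern, 1
    else:
        other = [random.randrange(nb_symbols) for _ in range(len(pattern))]
        seq, label = pattern + other, (1 if other == pattern else 0)
    x = torch.zeros(len(seq), nb_symbols)
    for t, s in enumerate(seq):
        x[t, s] = 1.0
    return x, torch.LongTensor([label])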


In what follows we use the three standard activation functions:

• The rectified linear unit:

ReLU(x) = max(x, 0)

• The hyperbolic tangent:

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

• The sigmoid:

sigm(x) = 1 / (1 + e^{−x})


We can build an “Elman network” (Elman, 1990), with h_0 = 0, the update

h_t = ReLU(W^(x h) x_t + W^(h h) h_{t−1} + b^(h))    (recurrent state)

and the final prediction

y_T = W^(h y) h_T + b^(y).

import torch
from torch import nn
from torch.nn import functional as F
from torch.autograd import Variable

class RecNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super(RecNet, self).__init__()
        self.fc_x2h = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2h = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Initial recurrent state h_0 = 0, of size 1 x dim_recurrent
        h = Variable(input.data.new(1, self.fc_h2y.weight.size(1)).zero_())
        for t in range(input.size(0)):
            h = F.relu(self.fc_x2h(input.narrow(0, t, 1)) + self.fc_h2h(h))
        return self.fc_h2y(h)

⚠ To simplify the processing of variable-length sequences, we are processing samples (sequences) one at a time here.


We encode the symbol at time t as a one-hot vector x_t, and thanks to autograd, the training can be implemented as

sequence_generator = SequenceGenerator(nb_symbols = 10,
                                       pattern_length_min = 1,
                                       pattern_length_max = 10,
                                       one_hot = True, variable = True)

model = RecNet(dim_input = 10, dim_recurrent = 50, dim_output = 2)

cross_entropy = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = lr)

for k in range(args.nb_train_samples):
    input, target = sequence_generator.generate()
    output = model(input)
    loss = cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


[Plots: classification error of the baseline RecNet as a function of the number of training sequences seen, and as a function of the sequence length.]


Gating


When unfolded through time, the model is deep, and training it involves in particular dealing with vanishing gradients.

An important idea in the RNN models used in practice is to add, in one form or another, a pass-through, so that the recurrent state does not go repeatedly through a squashing non-linearity.


For instance, the recurrent state update can be a per-component weighted average of its previous value h_{t−1} and a full update h̄_t, with the weighting z_t depending on the input and the recurrent state, and playing the role of a “forget gate”.

So the model has an additional “gating” output

f : R^D × R^Q → [0, 1]^Q,

and the update rule takes the form

z_t = f(x_t, h_{t−1})
h̄_t = Φ(x_t, h_{t−1})
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̄_t,

where ⊙ stands for the usual component-wise Hadamard product.


We can improve our minimal example with such a mechanism, from our simple

h_t = ReLU(W^(x h) x_t + W^(h h) h_{t−1} + b^(h))    (recurrent state)

to

h̄_t = ReLU(W^(x h) x_t + W^(h h) h_{t−1} + b^(h))    (full update)

z_t = sigm(W^(x z) x_t + W^(h z) h_{t−1} + b^(z))    (forget gate)

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̄_t    (recurrent state)


class RecNetWithGating(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super(RecNetWithGating, self).__init__()
        self.fc_x2h = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2h = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2z = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2z = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        h = Variable(input.data.new(1, self.fc_h2y.weight.size(1)).zero_())
        for t in range(input.size(0)):
            z = F.sigmoid(self.fc_x2z(input.narrow(0, t, 1)) + self.fc_h2z(h))
            hb = F.relu(self.fc_x2h(input.narrow(0, t, 1)) + self.fc_h2h(h))
            h = z * h + (1 - z) * hb
        return self.fc_h2y(h)


[Plots: classification error of the baseline and of the gated variant as a function of the number of training sequences seen, and as a function of the sequence length.]


LSTM and GRU


The Long-Short Term Memory unit (LSTM) by Hochreiter and Schmidhuber (1997) has an update with gating of the form

c_t = c_{t−1} + i_t ⊙ g_t

where c_t is a recurrent state, i_t is a gating function and g_t is a full update. This ensures that the derivatives of the loss with respect to c_t do not vanish.
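One way to see this: with such an additive update, the Jacobian of c_t with respect to c_{t−1} contains an identity term, so there is a path along the cell states on which the gradient is never multiplied by the derivative of a squashing non-linearity:

\[
c_t = c_{t-1} + i_t \odot g_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = I + \frac{\partial\,(i_t \odot g_t)}{\partial c_{t-1}} .
\]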


It is noteworthy that this model implemented, 20 years before the residual networks of He et al. (2015), the exact same strategy to deal with depth.

This original architecture was improved with a forget gate (Gers et al., 2000), resulting in the standard LSTM in use.

In what follows we consider the notation and variant from Jozefowicz et al. (2015).


The recurrent state is composed of a “cell state” c_t and an “output state” h_t. Gate f_t modulates whether the cell state should be forgotten, i_t whether the new update should be taken into account, and o_t whether the output state should be reset.

f_t = sigm(W^(x f) x_t + W^(h f) h_{t−1} + b^(f))    (forget gate)

i_t = sigm(W^(x i) x_t + W^(h i) h_{t−1} + b^(i))    (input gate)

g_t = tanh(W^(x c) x_t + W^(h c) h_{t−1} + b^(c))    (full cell state update)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t    (cell state)

o_t = sigm(W^(x o) x_t + W^(h o) h_{t−1} + b^(o))    (output gate)

h_t = o_t ⊙ tanh(c_t)    (output state)

As pointed out by Gers et al. (2000), the forget bias b^(f) should be initialized with large values so that initially f_t ≃ 1 and the gating has no effect.

This model was extended by Gers et al. (2003) with “peephole connections” that allow gates to depend on c_{t−1}.
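As an illustration, one step of this cell can be written directly from the equations above (a sketch, not PyTorch's implementation; the two weight matrices of each gate are fused into a single nn.Linear over the concatenation of x_t and h_{t−1}, and peephole connections are omitted):

import torch
from torch import nn

class LSTMCellScratch(nn.Module):
    def __init__(self, dim_input, dim_recurrent):
        super(LSTMCellScratch, self).__init__()
        self.fc_f = nn.Linear(dim_input + dim_recurrent, dim_recurrent)  # forget gate
        self.fc_i = nn.Linear(dim_input + dim_recurrent, dim_recurrent)  # input gate
        self.fc_g = nn.Linear(dim_input + dim_recurrent, dim_recurrent)  # full cell state update
        self.fc_o = nn.Linear(dim_input + dim_recurrent, dim_recurrent)  # output gate
        # Large forget bias so that initially f_t is close to 1 (Gers et al., 2000)
        self.fc_f.bias.data.fill_(1.0)

    def forward(self, x, h, c):
        # x is of size B x dim_input, h and c of size B x dim_recurrent
        u = torch.cat((x, h), 1)
        f = torch.sigmoid(self.fc_f(u))
        i = torch.sigmoid(self.fc_i(u))
        g = torch.tanh(self.fc_g(u))
        c = f * c + i * g
        o = torch.sigmoid(self.fc_o(u))
        h = o * torch.tanh(c)
        return h, c

In practice one uses PyTorch's torch.nn.LSTM, shown below.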


[Figure: an LSTM cell unrolled through time. At each step it receives x_t and the pair (c_{t−1}, h_{t−1}), produces (c_t, h_t), and Ψ computes the prediction y_t from h_t.]

⚠ Prediction is done from the h_t state, hence it is called the output state.


Several such “cells” can be combined to create a multi-layer LSTM.

[Figure: “Two-layer LSTM”. Two LSTM cells are stacked: the output state h^1_t of the first cell plays the role of the input of the second, and Ψ computes the predictions y_t from the output state h^2_t of the top cell.]


PyTorch’s torch.nn.LSTM implements this model.

It processes several sequences, and returns two tensors, with D the number of layers and T the sequence length:

• the outputs of the last layer at each time step: h^D_1, ..., h^D_T, and

• the outputs for all the layers at the last time step: h^1_T, ..., h^D_T.

The initial recurrent states h^1_0, ..., h^D_0 and c^1_0, ..., c^D_0 can also be specified.
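For instance, a small shape check (sizes are arbitrary; this sketch assumes a recent PyTorch in which plain Tensors can be fed to the module, and the second returned value also contains the cell states c^1_T, ..., c^D_T):

import torch
from torch import nn

lstm = nn.LSTM(input_size = 10, hidden_size = 50, num_layers = 2)

# A batch of 3 sequences of 7 time steps each, of dimension T x B x input_size
x = torch.randn(7, 3, 10)
output, (hn, cn) = lstm(x)

print(output.size())  # torch.Size([7, 3, 50]): last layer, every time step
print(hn.size())      # torch.Size([2, 3, 50]): every layer, last time step
print(cn.size())      # torch.Size([2, 3, 50]): the corresponding cell states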


PyTorch’s RNNs can process batches of sequences of the same length, which can be encoded in a regular tensor, or batches of sequences of various lengths using the type nn.utils.rnn.PackedSequence.

Such an object can be created with nn.utils.rnn.pack_padded_sequence:

>>> from torch.nn.utils.rnn import pack_padded_sequence
>>> pack_padded_sequence(Variable(Tensor([[[ 1 ], [ 2 ]],
...                                       [[ 3 ], [ 4 ]],
...                                       [[ 5 ], [ 0 ]]])),
...                      [3, 2])
PackedSequence(data=Variable containing:
 1
 2
 3
 4
 5
[torch.FloatTensor of size 5x1]
, batch_sizes=[2, 2, 1])

⚠ The sequences must be sorted by decreasing lengths.

nn.utils.rnn.pad_packed_sequence converts back to a padded tensor.


class LSTMNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(LSTMNet, self).__init__()
        self.lstm = nn.LSTM(input_size = dim_input,
                            hidden_size = dim_recurrent,
                            num_layers = num_layers)
        self.fc_o2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Make this a batch of size 1
        # The first index is the time, the sequence number is the second
        input = input.unsqueeze(1)
        # Get the activations of the last layer at every time step
        output, _ = self.lstm(input)
        # Drop the batch index
        output = output.squeeze(1)
        # Keep the output state at the last time step alone
        output = output.narrow(0, output.size(0) - 1, 1)
        return self.fc_o2y(F.relu(output))


[Plots: classification error of the baseline, the gated variant, and the LSTM as a function of the number of training sequences seen, and as a function of the sequence length.]


The LSTM was simplified into the Gated Recurrent Unit (GRU) by Cho et al. (2014), with a gating for the recurrent state, and a reset gate.

r_t = sigm(W^(x r) x_t + W^(h r) h_{t−1} + b^(r))    (reset gate)

z_t = sigm(W^(x z) x_t + W^(h z) h_{t−1} + b^(z))    (forget gate)

h̄_t = tanh(W^(x h) x_t + W^(h h) (r_t ⊙ h_{t−1}) + b^(h))    (full update)

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̄_t    (hidden update)


class GRUNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(GRUNet, self).__init__()
        self.gru = nn.GRU(input_size = dim_input,
                          hidden_size = dim_recurrent,
                          num_layers = num_layers)
        self.fc_y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Makes this a batch of size 1
        input = input.unsqueeze(1)
        # Get the activations of all layers at the last time step
        _, output = self.gru(input)
        # Drop the batch index
        output = output.squeeze(1)
        output = output.narrow(0, output.size(0) - 1, 1)
        return self.fc_y(F.relu(output))


[Plots: classification error of the baseline, the gated variant, the LSTM, and the GRU as a function of the number of training sequences seen, and as a function of the sequence length.]


The specific form of these units prevents the gradient from vanishing, but it may still be excessively large on certain mini-batches.

The standard strategy to solve this issue is gradient norm clipping (Pascanu et al., 2013), which consists of re-scaling the [norm of the] gradient to a fixed threshold δ when it is above it:

∇f = (∇f / ‖∇f‖) min(‖∇f‖, δ).
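Written out over a model's parameters, this amounts to a few lines (a sketch in the spirit of the built-in function shown next, not its actual implementation, and assuming a recent PyTorch):

import torch

# Re-scale the gradients of all parameters so that their global norm is at most delta
def clip_gradient_norm(parameters, delta):
    parameters = [p for p in parameters if p.grad is not None]
    total_norm = sum((p.grad.data ** 2).sum() for p in parameters) ** 0.5
    if total_norm > delta:
        for p in parameters:
            p.grad.data.mul_(delta / total_norm)
    return total_norm

It would be called between loss.backward() and optimizer.step(), e.g. clip_gradient_norm(model.parameters(), 1.0).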


The function torch.nn.utils.clip_grad_norm applies this operation to the gradient of a model, as defined by an iterator through its parameters:

>>> x = Variable(Tensor(10))
>>> x.grad = Variable(x.data.new(x.data.size()).normal_())
>>> y = Variable(Tensor(5))
>>> y.grad = Variable(y.data.new(y.data.size()).normal_())
>>> torch.cat((x.grad.data, y.grad.data)).norm()
4.656265393931142
>>> torch.nn.utils.clip_grad_norm((x, y), 5.0)
4.656265393931143
>>> torch.cat((x.grad.data, y.grad.data)).norm()
4.656265393931142
>>> torch.nn.utils.clip_grad_norm((x, y), 1.25)
4.656265393931143
>>> torch.cat((x.grad.data, y.grad.data)).norm()
1.249999658884575


Jozefowicz et al. (2015) conducted an extensive exploration of different recurrent architectures through meta-optimization, and even though some units simpler than LSTM or GRU perform well, they wrote:

“We have evaluated a variety of recurrent neural network architectures in order to find an architecture that reliably out-performs the LSTM. Though there were architectures that outperformed the LSTM on some problems, we were unable to find an architecture that consistently beat the LSTM and the GRU in all experimental conditions.”

(Jozefowicz et al., 2015)


Temporal Convolutions


In spite of the often surprisingly good performance of Recurrent Neural Networks, the trend is to use Temporal Convolutional Networks (Waibel et al., 1989; Bai et al., 2018) for sequences.

These models are standard 1d convolutional networks, in which a long time horizon is achieved through dilated convolutions.


[Figure: a temporal convolutional network with an input layer, two hidden layers, and an output layer, whose dilated convolutions make the output depend on a time window of size T.]

Increasing exponentially the filter sizes makes the required number of layers grow in log of the time window T taken into account.

Thanks to the dilated convolutions, the model size is O(log T). The memory footprint and computation are O(T log T).
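A minimal sketch of such a stack (channel sizes and dilations are arbitrary, and the causal padding and residual blocks of actual TCNs are omitted):

import torch
from torch import nn

# 1d convolutions with exponentially increasing dilation: with kernel size 3 and
# dilations 1, 2, 4, 8, the receptive field is 1 + 2 * (1 + 2 + 4 + 8) = 31 time
# steps, while the number of layers grows only logarithmically with it.
layers = []
for d in [1, 2, 4, 8]:
    layers += [nn.Conv1d(32, 32, kernel_size = 3, dilation = d, padding = d),
               nn.ReLU()]
tcn = nn.Sequential(*layers)

x = torch.randn(1, 32, 100)   # batch x channels x time
print(tcn(x).size())          # torch.Size([1, 32, 100])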


[The slide reproduces Table 1 of “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”: evaluation of TCNs and recurrent architectures on synthetic stress tests, polyphonic music modeling, character-level language modeling, and word-level language modeling. Higher is better for accuracy; lower is better for loss, perplexity, and bpc.]

Sequence Modeling Task             Model Size (≈)   LSTM     GRU      RNN      TCN
Seq. MNIST (accuracy)              70K              87.2     96.2     21.5     99.0
Permuted MNIST (accuracy)          70K              85.7     87.3     25.3     97.2
Adding problem T=600 (loss)        70K              0.164    5.3e-5   0.177    5.8e-5
Copy memory T=1000 (loss)          16K              0.0204   0.0197   0.0202   3.5e-5
Music JSB Chorales (loss)          300K             8.45     8.43     8.91     8.10
Music Nottingham (loss)            1M               3.29     3.46     4.05     3.07
Word-level PTB (perplexity)        13M              78.93    92.48    114.50   89.21
Word-level Wiki-103 (perplexity)   -                48.4     -        -        45.19
Word-level LAMBADA (perplexity)    -                4186     -        14725    1279
Char-level PTB (bpc)               3M               1.41     1.42     1.52     1.35
Char-level text8 (bpc)             5M               1.52     1.56     1.69     1.45

(Bai et al., 2018)


Word embeddings and CBOW


An important application domain for machine intelligence is Natural Language Processing (NLP).

• Speech and (hand)writing recognition,

• auto-captioning,

• part-of-speech tagging,

• sentiment prediction,

• translation,

• question answering.

While language modeling was historically addressed with formal methods, in particular generative grammars, state-of-the-art and deployed methods are now heavily based on statistical learning and deep learning.


A core difficulty of Natural Language Processing is to devise a proper density model for sequences of words.

However, since a vocabulary is usually of the order of 10^4–10^6 words, empirical distributions cannot be estimated for more than triplets of words.


The standard strategy to mitigate this problem is to embed words into a geometrical space to take advantage of data regularities for further [statistical] modeling.

The geometry after embedding should account for synonymy, but also for identical word classes, etc. E.g. we would like such an embedding to make “cat” and “tiger” close, but also “red” and “blue”, or “eat” and “work”, etc.

Even though they are not “deep”, classical word embedding models are key elements of NLP with deep learning.


Let

k_t ∈ {1, ..., W}, t = 1, ..., T,

be a training sequence of T words, encoded as IDs through a vocabulary of W words.

Given an embedding dimension D, the objective is to learn vectors

E_k ∈ R^D, k ∈ {1, ..., W},

so that “similar” words are embedded with “similar” vectors.


A common word embedding is the Continuous Bag of Words (CBOW) version of word2vec (Mikolov et al., 2013a).

In this model, the embedding vectors are chosen so that a word can be predicted from [a linear function of] the sum of the embeddings of the words around it.


More formally, let C ∈ ℕ* be a “context size”, and

C_t = (k_{t−C}, ..., k_{t−1}, k_{t+1}, ..., k_{t+C})

be the “context” around k_t, that is the indexes of the words around it.

[Diagram: in the sequence k_1, ..., k_T, the context C_t is the window of C word indexes on each side of k_t.]


The embedding vectors

E_k ∈ R^D, k = 1, ..., W,

are optimized jointly with an array

M ∈ R^{W×D}

so that the predicted vector of W scores

ψ(t) = M ∑_{k ∈ C_t} E_k

is a good predictor of the value of k_t.


Ideally we would minimize the cross-entropy between the vector of scores ψ(t) ∈ R^W and the class k_t

∑_t −log( exp(ψ(t)_{k_t}) / ∑_{k=1}^{W} exp(ψ(t)_k) ).

However, given the vocabulary size, doing so is numerically unstable and computationally demanding.


The “negative sampling” approach uses a loss estimated on the prediction for the correct class k_t and only Q ≪ W incorrect classes κ_{t,1}, ..., κ_{t,Q} sampled at random.

In our implementation we take the latter uniformly in {1, ..., W} and use the same loss as Mikolov et al. (2013b):

∑_t ( log(1 + e^{−ψ(t)_{k_t}}) + ∑_{q=1}^{Q} log(1 + e^{ψ(t)_{κ_{t,q}}}) ).

We want ψ(t)_{k_t} to be large and all the ψ(t)_{κ_{t,q}} to be small.


Although the operation

x ↦ E_x

could be implemented as the product between a one-hot vector and a matrix, it is far more efficient to use an actual lookup table.


The PyTorch module nn.Embedding does precisely that. It is parametrized with a number N of words to embed, and an embedding dimension D.

It gets as input a LongTensor of arbitrary dimension A_1 × ··· × A_U, containing values in {0, ..., N − 1}, and it returns a float tensor of dimension A_1 × ··· × A_U × D.

If w are the embedding vectors, x the input tensor, y the result, we have

y[a_1, ..., a_U, d] = w[x[a_1, ..., a_U]][d].


>>> e = nn.Embedding(10, 3)
>>> x = Variable(torch.LongTensor([[1, 1, 2, 2], [0, 1, 9, 9]]))
>>> e(x)
Variable containing:
(0 ,.,.) =
 -0.1815 -1.3016 -0.8052
 -0.1815 -1.3016 -0.8052
  0.6340  1.7662  0.4010
  0.6340  1.7662  0.4010

(1 ,.,.) =
 -0.3555  0.0739  0.4875
 -0.1815 -1.3016 -0.8052
 -0.0667  0.0147  0.7217
 -0.0667  0.0147  0.7217
[torch.FloatTensor of size 2x4x3]


Our CBOW model has as parameters two embeddings

E ∈ R^{W×D} and M ∈ R^{W×D}.

Its forward gets as input a pair of torch.LongTensors corresponding to a batch of size B:

• c of size B × 2C contains the IDs of the words in a context, and

• d of size B × R contains the IDs, for each of the B contexts, of the R words for which we want the prediction score (that will be the correct one and Q negative ones).

It returns a tensor y of size B × R containing the dot products

y[n, j] = (1/D) M_{d[n,j]} · ( ∑_i E_{c[n,i]} ).


class CBOW(nn.Module):
    def __init__(self, voc_size = 0, embed_dim = 0):
        super(CBOW, self).__init__()
        self.embed_dim = embed_dim
        self.embed_E = nn.Embedding(voc_size, embed_dim)
        self.embed_M = nn.Embedding(voc_size, embed_dim)

    def forward(self, c, d):
        sum_w_E = self.embed_E(c).sum(1).unsqueeze(1).transpose(1, 2)
        w_M = self.embed_M(d)
        return w_M.matmul(sum_w_E).squeeze(2) / self.embed_dim


Regarding the loss, we can use nn.BCEWithLogitsLoss which implements

∑_t y_t log(1 + exp(−x_t)) + (1 − y_t) log(1 + exp(x_t)).

It takes care in particular of the numerical problem that may arise for large values of x_t if implemented “naively”.


Before training a model, we need to prepare data tensors of word IDs from a text file. We will use a 100Mb text file taken from Wikipedia and

• make it lower-case,

• remove all non-letter characters,

• replace all words that appear less than 100 times with ’*’,

• associate to each word a unique id.

From the resulting sequence of length T stored in a LongTensor, and the context size C, we will generate mini-batches, each of two tensors

• a ’context’ LongTensor c of dimension B × 2C , and

• a ’word’ LongTensor w of dimension B.


If the corpus is “The black cat plays with the black ball.”, we will get the following word IDs:

the: 0, black: 1, cat: 2, plays: 3, with: 4, ball: 5.

The corpus will be encoded as

the black cat plays with the black ball

0 1 2 3 4 0 1 5

and the data and label tensors will be

Words                        IDs          c             w
the black cat plays with     0 1 2 3 4    0, 1, 3, 4    2
black cat plays with the     1 2 3 4 0    1, 2, 4, 0    3
cat plays with the black     2 3 4 0 1    2, 3, 0, 1    4
plays with the black ball    3 4 0 1 5    3, 4, 1, 5    0
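The extract_batch helper used in the training loop below is not shown in the slides; a possible implementation consistent with these tensors could be:

import torch

# Hypothetical helper: from the full sequence of word IDs id_seq, build a batch
# of batch_size contexts c (of size batch_size x 2C) and central words w
# (of size batch_size), starting at position k.
def extract_batch(id_seq, k, batch_size, context_size):
    c = torch.LongTensor(batch_size, 2 * context_size)
    w = torch.LongTensor(batch_size)
    for b in range(batch_size):
        t = k + b + context_size                    # index of the central word
        left = id_seq[t - context_size : t]
        right = id_seq[t + 1 : t + context_size + 1]
        c[b] = torch.cat((left, right), 0)
        w[b] = id_seq[t]
    return c, w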


We can train the model for an epoch with:

for k in range(0, id_seq.size(0) - 2 * context_size - batch_size, batch_size):
    c, w = extract_batch(id_seq, k, batch_size, context_size)
    d = LongTensor(batch_size, 1 + nb_neg_samples).random_(voc_size)
    d[:, 0] = w
    target = FloatTensor(batch_size, 1 + nb_neg_samples).zero_()
    target.narrow(1, 0, 1).fill_(1)
    c, d, target = Variable(c), Variable(d), Variable(target)
    output = model(c, d)
    loss = bce_loss(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


Some nearest-neighbors for the cosine distance between the embeddings

d(w, w′) = (E_w · E_{w′}) / (‖E_w‖ ‖E_{w′}‖)

paris            bike             cat            fortress             powerful
0.61 parisian    0.61 bicycle     0.55 cats      0.61 fortresses      0.47 formidable
0.59 france      0.51 bicycles    0.54 dog       0.55 citadel         0.44 power
0.55 brussels    0.51 bikes       0.49 kitten    0.55 castle          0.44 potent
0.53 bordeaux    0.49 biking      0.44 feline    0.52 fortifications  0.40 fearsome
0.51 toulouse    0.47 motorcycle  0.42 pet       0.51 forts           0.40 destroy
0.51 vienna      0.43 cyclists    0.40 dogs      0.50 siege           0.39 wielded
0.51 strasbourg  0.42 riders      0.40 kittens   0.49 stronghold      0.38 versatile
0.49 munich      0.41 sled        0.39 hound     0.49 castles         0.38 capable
0.49 marseille   0.41 triathlon   0.39 squirrel  0.48 monastery       0.38 strongest
0.48 rouen       0.41 car         0.38 mouse     0.48 besieged        0.37 able
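Such a table can be computed directly from the rows of the embedding matrix (a sketch, assuming model is an instance of the CBOW module above and that word IDs can be mapped back to words):

import torch

# Cosine similarities between the embedding of word_id and all the other embeddings
def nearest_neighbors(model, word_id, nb = 10):
    E = model.embed_E.weight.data
    E = E / E.norm(2, 1, keepdim = True)    # normalize every row
    sim = E.mv(E[word_id])                  # cosine similarity with every word
    values, indices = sim.topk(nb + 1)      # +1 because the word itself comes first
    return values[1:], indices[1:]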


An alternative algorithm is the skip-gram model, which optimizes the embedding so that a word can be predicted by any individual word in its context (Mikolov et al., 2013a).

[Figure 1 of Mikolov et al. (2013a): the CBOW architecture predicts the current word w(t) from the sum of the projections of the context words w(t−2), w(t−1), w(t+1), w(t+2), while the Skip-gram architecture predicts the surrounding words given the current word.]

(Mikolov et al., 2013a)


Trained on large corpora, such models reflect semantic relations in the linear structure of the embedding space. E.g.

E[paris] − E[france] + E[italy] ≃ E[rome]
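Such an analogy query is a few lines of tensor arithmetic (a sketch; E is assumed to be the row-normalized embedding matrix from the previous sketch, and word_to_id / id_to_word are hypothetical vocabulary mappings):

# “a is to b what ? is to c”, by vector arithmetic on the embeddings
def analogy(E, word_to_id, id_to_word, a, b, c, nb = 5):
    x = E[word_to_id[a]] - E[word_to_id[b]] + E[word_to_id[c]]
    sim = E.mv(x / x.norm())
    _, indices = sim.topk(nb)
    return [id_to_word[int(i)] for i in indices]

analogy(E, word_to_id, id_to_word, 'paris', 'france', 'italy')  # 'rome' should rank high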

[The slide reproduces Table 8 of Mikolov et al. (2013a): examples of word pair relationships, using their best word vectors (Skip-gram model trained on 783M words with 300 dimensions).]

Relationship          Example 1            Example 2          Example 3
France - Paris        Italy: Rome          Japan: Tokyo       Florida: Tallahassee
big - bigger          small: larger        cold: colder       quick: quicker
Miami - Florida       Baltimore: Maryland  Dallas: Texas      Kona: Hawaii
Einstein - scientist  Messi: midfielder    Mozart: violinist  Picasso: painter
Sarkozy - France      Berlusconi: Italy    Merkel: Germany    Koizumi: Japan
copper - Cu           zinc: Zn             gold: Au           uranium: plutonium
Berlusconi - Silvio   Sarkozy: Nicolas     Putin: Medvedev    Obama: Barack
Microsoft - Windows   Google: Android      IBM: Linux         Apple: iPhone
Microsoft - Ballmer   Google: Yahoo        IBM: McNealy       Apple: Jobs
Japan - sushi         Germany: bratwurst   France: tapas      USA: pizza

(Mikolov et al., 2013a)


The main benefit of word embeddings is that they are trained with unsupervised corpora, hence possibly extremely large.

This modeling can then be leveraged for small-corpora tasks such as

• sentiment analysis,

• question answering,

• topic classification,

• etc.


Sequence-to-sequence translation


[The slide reproduces a page of Sutskever et al. (2014), including their Figure 1: the model reads an input sentence “ABC” and produces “WXYZ” as the output sentence, stopping after outputting the end-of-sentence token. The LSTM reads the input sentence in reverse, because doing so introduces many short-term dependencies in the data that make the optimization problem much easier.]

(Sutskever et al., 2014)


English to French translation.

Training:

• corpus 12M sentences, 348M French words, 30M English words,

• LSTM with 4 layers, one for encoding, one for decoding,

• 160,000 words input vocabulary, 80,000 output vocabulary,

• 1,000 dimensions word embedding, 384M parameters total,

• input sentence is reversed,

• gradient clipping.

The hidden state that contains the information to generate the translation is of dimension 8,000.

Inference is done with a “beam search”, which consists of greedily increasing the size of the predicted sequence while keeping a bag of the K best ones.
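A minimal sketch of this procedure, for a hypothetical next_log_probs(sequence) function returning the list of log-probabilities of the next symbol (the real decoder is also conditioned on the encoded source sentence):

# Beam search: grow every candidate sequence by one symbol, keep the K best
def beam_search(next_log_probs, bos, eos, K = 2, max_length = 20):
    beam = [([bos], 0.0)]                   # (sequence, cumulated log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beam:
            if seq[-1] == eos:              # finished sequences are kept as they are
                candidates.append((seq, score))
                continue
            for w, lp in enumerate(next_log_probs(seq)):
                candidates.append((seq + [w], score + lp))
        beam = sorted(candidates, key = lambda c: c[1], reverse = True)[:K]
        if all(seq[-1] == eos for seq, _ in beam):
            break
    return beam[0][0]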


Comparing a produced sentence to a reference one is complex, since it is related to their semantic content.

A widely used measure is the BLEU score, which counts the fraction of groups of one, two, three and four words (aka “n-grams”) from the generated sentence that appear in the reference translations (Papineni et al., 2002).

The exact definition is complex, and the validity of this score is disputable since it poorly accounts for semantics.
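Its core ingredient is an n-gram precision, which can be sketched as follows (a simplified illustration, not the exact BLEU of Papineni et al. (2002), which also clips counts and adds a brevity penalty):

# Fraction of n-grams of the candidate that appear in the reference (simplified)
def ngram_precision(candidate, reference, n):
    ngrams = lambda s: [tuple(s[i:i + n]) for i in range(len(s) - n + 1)]
    cand, ref = ngrams(candidate), set(ngrams(reference))
    return sum(1 for g in cand if g in ref) / max(len(cand), 1)

candidate = 'the cat sat on the mat'.split()
reference = 'the cat is on the mat'.split()
print([ngram_precision(candidate, reference, n) for n in [1, 2, 3, 4]])
# [0.83..., 0.6, 0.25, 0.0]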


[The slide reproduces results of Sutskever et al. (2014).]

Table 1: the performance of the LSTM on the WMT’14 English to French test set (ntst14). An ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

Method                                        test BLEU score (ntst14)
Bahdanau et al. [2]                           28.45
Baseline System [29]                          33.30
Single forward LSTM, beam size 12             26.17
Single reversed LSTM, beam size 12            30.59
Ensemble of 5 reversed LSTMs, beam size 1     33.00
Ensemble of 2 reversed LSTMs, beam size 12    33.27
Ensemble of 5 reversed LSTMs, beam size 2     34.50
Ensemble of 5 reversed LSTMs, beam size 12    34.81

Table 2: methods that use neural networks together with an SMT system on the WMT’14 English to French test set (ntst14).

Method                                                                  test BLEU score (ntst14)
Baseline System [29]                                                    33.30
Cho et al. [5]                                                          34.54
Best WMT’14 result [9]                                                  37.0
Rescoring the baseline 1000-best with a single forward LSTM             35.61
Rescoring the baseline 1000-best with a single reversed LSTM            35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs   36.5
Oracle Rescoring of the Baseline 1000-best lists                        ∼45

[Figure 2 of the paper: a 2-dimensional PCA projection of the LSTM hidden states obtained after processing a few phrases; the phrases are clustered by meaning, which in these examples is primarily a function of word order.]

(Sutskever et al., 2014)


Type SentenceOur model Ulrich UNK , membre du conseil d’ administration du constructeur automobile Audi ,

affirme qu’ il s’ agit d’ une pratique courante depuis des ann´ees pour que les telephonesportables puissent etre collectes avant les reunions duconseil d’ administration afin qu’ ilsne soient pas utilises comme appareils d’ ecoute a distance .

Truth Ulrich Hackenberg , membre du conseil d’ administration du constructeur automobile Audi ,declare que la collecte des telephones portables avant les reunions du conseil , afin qu’ ilsne puissent pas etre utilises comme appareils d’ ecoute `a distance , est une pratique courantedepuis des annees .

Our model “ Les telephones cellulaires , qui sont vraiment une question , non seulement parce qu’ ilspourraient potentiellement causer des interferences avec les appareils de navigation , maisnous savons , selon la FCC , qu’ ils pourraient interferer avec les tours de telephone cellulairelorsqu’ ils sont dans l’ air ” , dit UNK .

Truth “ Les telephones portables sont veritablement un probl`eme , non seulement parce qu’ ilspourraient eventuellement creer des interferences avec les instruments de navigation , maisparce que nous savons , d’ apres la FCC , qu’ ils pourraient perturber les antennes-relais detelephonie mobile s’ ils sont utilises a bord ” , a declare Rosenker .

Our model Avec la cremation , il y a un “ sentiment de violence contre lecorps d’ un etre cher ” ,qui sera “ reduit a une pile de cendres ” en tres peu de tempsau lieu d’ un processus dedecomposition “ qui accompagnera les etapes du deuil ” .

Truth Il y a , avec la cremation , “ une violence faite au corps aime” ,qui va etre “ reduit a un tas de cendres ” en tres peu de temps , et non apres un processus dedecomposition , qui “ accompagnerait les phases du deuil ” .

Table 3: A few examples of long translations produced by the LSTM alongside the ground truthtranslations. The reader can verify that the translations are sensible using Google translate.

4 7 8 12 17 22 28 35 79

test sentences sorted by their length

20

25

30

35

40

BLEU

score

LSTM (34.8)

baseline (33.3)

0 500 1000 1500 2000 2500 3000 3500

test sentences sorted by average word frequency rank

20

25

30

35

40

BLEU

score

LSTM (34.8)

baseline (33.3)

Figure 3: The left plot shows the performance of our system as a function of sentence length, where thex-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths.There is no degradation on sentences with less than 35 words,there is only a minor degradation on the longestsentences. The right plot shows the LSTM’s performance on sentences with progressively more rare words,where the x-axis corresponds to the test sentences sorted bytheir “average word frequency rank”.

… while being fairly insensitive to the replacement of an active voice with a passive voice. The two-dimensional projections are obtained using PCA.

4 Related work

There is a large body of work on applications of neural networks to machine translation. So far, the simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a […]





Method                                        test BLEU score (ntst14)
Bahdanau et al. [2]                           28.45
Baseline System [29]                          33.30
Single forward LSTM, beam size 12             26.17
Single reversed LSTM, beam size 12            30.59
Ensemble of 5 reversed LSTMs, beam size 1     33.00
Ensemble of 2 reversed LSTMs, beam size 12    33.27
Ensemble of 5 reversed LSTMs, beam size 2     34.50
Ensemble of 5 reversed LSTMs, beam size 12    34.81

Table 1: The performance of the LSTM on WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.
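The "beam size" in Table 1 refers to beam-search decoding: at each step only the k most probable partial translations are kept and extended. The following is a minimal, self-contained PyTorch sketch of that loop; ToyDecoder and every hyper-parameter are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Tiny LSTM decoder, only here to make the beam-search loop runnable."""
    def __init__(self, vocab_size=10, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def step(self, token, state):
        h, c = self.lstm(self.embed(token), state)
        return torch.log_softmax(self.proj(h), dim=-1), (h, c)

def beam_search(decoder, h0, c0, sos=1, eos=2, beam_size=4, max_len=20):
    # Each hypothesis: (cumulative log-prob, token list, LSTM state before
    # consuming the last token of the list).
    beams = [(0.0, [sos], (h0, c0))]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, toks, state in beams:
            logp, new_state = decoder.step(torch.tensor([toks[-1]]), state)
            k = min(beam_size, logp.size(-1))
            topv, topi = logp[0].topk(k)
            for v, i in zip(topv.tolist(), topi.tolist()):
                candidates.append((score + v, toks + [i], new_state))
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda t: t[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    return max(finished + beams, key=lambda t: t[0])[1]

decoder = ToyDecoder()
h0 = torch.zeros(1, 16); c0 = torch.zeros(1, 16)  # e.g. the encoder's final state
print(beam_search(decoder, h0, c0, beam_size=2))

A larger beam explores more hypotheses per step, which explains why an ensemble of 5 LSTMs with beam size 2 can be cheaper to decode than a single LSTM with beam size 12.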

Method                                                                  test BLEU score (ntst14)
Baseline System [29]                                                    33.30
Cho et al. [5]                                                          34.54
Best WMT'14 result [9]                                                  37.0
Rescoring the baseline 1000-best with a single forward LSTM             35.61
Rescoring the baseline 1000-best with a single reversed LSTM            35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs   36.5
Oracle Rescoring of the Baseline 1000-best lists                        ∼45

Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).

… task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the best WMT'14 result if it is used to rescore the 1000-best list of the baseline system.
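Rescoring, as in Table 2, means re-ranking the candidate translations produced by the SMT baseline with the neural model's log-likelihood. A minimal sketch of the idea follows; ToyLM, the interpolation weight, and the dummy n-best list are illustrative assumptions, not the paper's setup.

import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Tiny LSTM language model used to score token sequences."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, vocab_size)

    def log_likelihood(self, tokens):
        # tokens: 1D LongTensor holding one tokenized candidate translation
        x = tokens[None, :-1]                    # inputs
        y = tokens[None, 1:]                     # next-token targets
        h, _ = self.lstm(self.embed(x))
        logp = torch.log_softmax(self.proj(h), dim=-1)
        return logp.gather(-1, y[..., None]).sum().item()

def rescore(nbest, lm, smt_weight=0.5):
    """nbest: list of (smt_score, token_tensor) for one source sentence.
    Returns candidates re-ranked by an interpolation of the SMT score and
    the LSTM log-likelihood (the weight is a hypothetical choice)."""
    scored = [(smt_weight * s + (1 - smt_weight) * lm.log_likelihood(t), t)
              for s, t in nbest]
    return sorted(scored, key=lambda p: p[0], reverse=True)

lm = ToyLM()
nbest = [(-2.0, torch.randint(0, 100, (12,))),
         (-1.5, torch.randint(0, 100, (9,)))]
best_score, best_tokens = rescore(nbest, lm)[0]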

3.7 Performance on long sentences

We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

3.8 Model Analysis

[Figure 2: two 2-dimensional PCA projections of LSTM hidden states. Left panel phrases: "John respects Mary", "Mary respects John", "John admires Mary", "Mary admires John", "Mary is in love with John", "John is in love with Mary". Right panel phrases: "I gave her a card in the garden", "In the garden , I gave her a card", "She was given a card by me in the garden", "She gave me a card in the garden", "In the garden , she gave me a card", "I was given a card by her in the garden".]

Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the replacement of an active voice with a passive voice.


(Sutskever et al., 2014)


References

S. Bai, J. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR, abs/1803.01271, 2018.

K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.

J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

F. A. Gers, J. A. Schmidhuber, and F. A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research (JMLR), 3:115–143, 2003.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning (ICML), pages 2342–2350, 2015.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.


T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), 2013b.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318. Association for Computational Linguistics, 2002.

R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), 2013.

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.

P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.