


Improving Mongolian-Chinese Neural Machine Translation with Morphological Noise

Yatu JI*  [email protected]
Hongxu HOU*†  [email protected]
Nier WU*  [email protected]
Junjie CHEN*  [email protected]

*Computer Science Dept., Inner Mongolia University, Hohhot, China

Abstract

For the translation of an agglutinative language such as Mongolian, unknown (UNK) words arise not only from the quite restricted vocabulary, but mostly from the translation model's misunderstanding of morphological changes. In this study, we introduce a new adversarial training model to alleviate the UNK problem in Mongolian→Chinese machine translation. The training process can be described as three adversarial sub-models (a generator, a value screener and a discriminator) playing a win-win game. In this game, the added screener emphasizes that the discriminator should pay attention to the Mongolian morphological noise¹ added in the form of pseudo-data, and it improves training efficiency. The experimental results set a new state of the art on the Mongolian→Chinese task, and under this premise the training time is greatly shortened.

1 Introduction

The dominant neural machine translation (NMT) (Sutskever et al., 2014) models are based on recurrent neural networks (RNN, (Mikolov et al., 2011)), convolutional neural networks (CNN, (Gehring et al., 2017)), or the Transformer (Vaswani et al., 2017), which eliminates recurrent connections entirely and instead relies on repeated attention; all are driven by an attention mechanism (Bahdanau et al., 2014). A considerable weakness of these NMT systems is their inability to correctly translate very rare words: end-to-end NMT tends to use a relatively small vocabulary with a single <unk> symbol that represents every possible out-of-vocabulary (OOV) word.

¹These morphological noises exist in most agglutinative languages in the form of components appended to word stems, which determine the presentation or tense of words. Typical noises include the suffixes and cases in Mongolian, Korean, and Japanese.

The problem is more prominent in agglutinative-language tasks, because the varied morphology brings great confusion to model decoding. The changes of suffixes and component cases² in Mongolian largely deceive the translation model, directly resulting in a large amount of OOV during decoding. Each such OOV word is then crudely treated the same as an <unk> symbol.

Generally, there are three ways to address this problem. A usual practice is to speed up training (Morin and Bengio, 2005; Jean et al., 2015; Mnih et al., 2013); these approaches can maintain a very large vocabulary. However, they work well only when there are few unknown words in the target sentence, and it has been observed that translation performance degrades rapidly as the number of unknown words increases. Another direction exploits information in the context (Luong et al., 2015; Hermann et al., 2015; Gulcehre et al., 2016), motivated by the psychological evidence that humans naturally tend to point towards objects in the context. The last direction changes the input/output to a smaller resolution, such as characters (Graves, 2013a) and subword units (Sennrich et al., 2016). However, it is worth noting that the training process then usually becomes much harder, because the length of the sequences increases considerably.

For NLP tasks, generative adversarial networks (GANs) are still immature. Some studies, such as Chen et al. (2016) and Zhang et al. (2017), used GANs for semantic analysis and domain adaptation. Yu et al. (2016), Zhen et al. (2018), and Wu et al. (2018) successfully applied GANs to sequence generation tasks. Zhang Y. (2017) propose matching the high-dimensional latent feature distributions of real and synthetic sentences via a kernelized discrepancy metric, which eases adversarial training by alleviating the mode-collapsing problem.

²A Mongolian case is a special suffix used to determine a word's relationship to other words in a sentence.

In the present study, GAN is used for the UNK problem. The motivation is GAN's advantage in approaching real data effectively from noise through game training. To obtain generalizable adversarial training, we propose a noise-adding strategy that adds noise samples to the training set in the form of pseudo-data. The noise covers the main causes of UNK, such as the segmentation of suffixes and the handling of case components in Mongolian. A representative example illustrates the decoding search process for Mongolian sentences in adversarial training (Fig. 1). During decoding, the decoder usually cannot resolve the morphological variability of words (caused by morphological noise) through the vocabulary, which leads to OOV.


Figure 1: Given a sentence, Mongolian words face different suffix and case noises in each decoding pass of the adversarial training, which are the main reasons for <unk>. For instance, the verbs '(learn)' and '(read)' need to take verb suffixes and tense cases in order to associate with nouns in Mongolian. Conversely, ("learning") and (read+' '), which are confused by suffixes and cases, do not appear in the vocabulary. This directly causes <unk> to appear in the decoding process. The proposed model aims to improve the generalization ability to this noise through adversarial training.

Therefore, we introduce a GAN model with a value screener (VS-GAN), a generalization of GAN, which makes the adversarial training specific to the noise. The model also improves the efficiency of GAN training with a value iteration network (VIN) (Tamar et al., 2016) and addresses the problem of optimal parameter updating in reinforcement learning (RL) training. These are our first two contributions. The third contribution is a thorough empirical evaluation on four different noises. We compare against several strong baselines, including MIXER (Ranzato et al., 2015), the Transformer (Vaswani et al., 2017), and BR-CSGAN (Zhen et al., 2018). The experimental results show that VS-GAN achieves much better time efficiency and a new state-of-the-art result on Mongolian-Chinese MT.

2 GAN with the Value Screener

In this section, we describe the architecture of VS-GAN in detail. VS-GAN consists of the following components: a generator G, a value screener, and a discriminator D. Given a source-language sequence {x_1, ..., x_{N_x}} of length N_x, G aims to generate sentences {y'_1, ..., y'_{N_y}} that are indistinguishable by D. D attempts to discriminate between {y'_1, ..., y'_{N_y}} and human-translated sentences {y_1, ..., y_{N_y}}. The value screener uses the reward information generated by G to convert the decoding cost into a simple value, and determines whether the predictions of the current state need to be passed to D.
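As a reading aid, the sketch below lays out this data flow in Python; the G, D, and screener objects and their methods (generate, observe_rewards, update) are hypothetical placeholders for the components described in Sections 2.1-2.4, not the authors' implementation:

def vs_gan_epoch(G, D, screener, batches, v_star):
    """One illustrative pass of VS-GAN training: G samples translations, the
    screener decides from their reward-derived value whether D should see
    them, and both players are updated. All objects here are assumptions."""
    for src, ref in batches:
        fake = G.generate(src)                        # negative (generated) translations y'
        rewards = G.observe_rewards(src, fake, ref)   # per-sentence rewards (e.g. BLEU)
        decision, v_star = screener(rewards, v_star)  # value screening (Section 2.2)
        if decision == "pass_to_D":
            D.update(src, ref, fake)                  # discriminate real vs. generated
        G.update(src, fake, D)                        # policy-gradient step guided by D
    return v_star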

2.1 Generator G

The selection of G is individualized and targeted. In this work, we focus on long short-term memory (LSTM (Graves, 2013b)) with an attention mechanism and on the Transformer (Vaswani et al., 2017). The temporal structure of the LSTM enables it to capture dependency semantics in agglutinative languages, while the Transformer has refreshed state-of-the-art performance on several language pairs. For the policy optimization required in GAN training, we cast our problem in the RL framework (Mnih et al., 2013). This approach can handle the long-term reward problem because a standard model for sequential decision making and planning in RL training is the Markov decision process (MDP) (Dayan and Abbott, 2003). G can be viewed as an agent which interacts with the external environment (the words and the context vector at every timestep). The parameters of the agent define a policy θ, whose execution results in the agent selecting an action a ∈ A. In NMT, an action represents the prediction of the next word y'_t in the sequence at the t-th timestep. After taking an action, the agent updates its internal state s ∈ S (i.e., the hidden units). RL observes a reward R(s, a) once the end of a sequence (or the maximum sequence length) is reached. We can choose any reward function, and in this case we choose BLEU because it is the metric we use at test time.
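The following minimal sketch shows one way to compute such a sentence-level BLEU reward with NLTK; the smoothing choice and tokenization are our own assumptions, since the paper does not specify the exact BLEU variant used as the reward:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def sentence_reward(reference_tokens, hypothesis_tokens):
    # Terminal reward R(s, a): sentence-level BLEU of the sampled translation
    # against the reference. Smoothing avoids zero scores on short sentences.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference_tokens], hypothesis_tokens,
                         smoothing_function=smooth)

# Example: sentence_reward("我 读 了 很多 书".split(), "我 读 很多 书".split())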

2.2 Value Screener

So far, the constructed G is still confused by noise, because the effect of the noise has not been fully utilized due to the lack of attention from D. To solve this problem, we add a VIN-implemented value screener between G and D to enhance the generalization ability of G to the noise. In the VIN, the <unk> symbol corresponds to a low training reward, and a low training reward corresponds to a low value. This is what the screener is intended to emphasize.

To achieve the VIN, we introduce an interpretation of an approximate VI algorithm as a particular form of a standard CNN. VI in this form makes learning the MDP (Bellman, 1957; Bertsekas, 2012) parameters and reward function natural through backpropagation in the network, so we can train the entire policy end-to-end on the basis of this simplification. In the training process, each iteration of the VI algorithm can be seen as passing the previous value V_{t−1} and the reward R through a convolution layer and a max-pooling layer. In this analogy, the activation function in the convolution layer corresponds to the Q function. We can formulate the value iteration as:

V_t = max_a Q(s, a) = max_a [ R(s, a) + Σ_{t=1}^{N} P(s | s_{t−1}, a) V_{t−1} ],   (1)

where Q(s, a) indicates the value of action a under state s at the t-th timestep, and the reward R(s, a) and discounted transition probabilities P(s | s_{t−1}, a) are obtained from the G described in Section 2.1. N denotes the length of the sequence. Thus, the value V_n of a sequence is produced by applying the convolution layer recurrently several times according to the length of the sentence; for a batch, n ranges from 1 to the training batch size. The optimal value V_update = Average(V_1, ..., V_batchsize) is the average long-term return attainable from a state, and the value of the current predictions represents the cost of decoding in the current state. We select the value of the optimal pre-training model as the initial V* and compare it with V_update. We then observe the decoding effect of the current batch and decide whether the negative examples need to be taken as an input of D. The conditions of screening are as follows:

direct input to D,              if V_update < V*
screening, and V* := V_update,  if V_update > V*   (2)

Since the VIN is simply a form of CNN, once a VIN design is selected, implementing the screener is straightforward. The networks in the experiments all require only several lines of Tensor code.
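As a concrete illustration, the sketch below applies the screening rule of Eq. (2) to a batch of sampled sentences; the per-sentence value is computed with a plain Bellman-style backup over timestep rewards (a deliberate simplification of the conv + max-pooling VIN layers), and all shapes and values are assumptions rather than the authors' settings:

import numpy as np

def screen_batch(step_rewards, v_star, gamma=1.0):
    # step_rewards: list of per-timestep reward lists, one per sampled sentence.
    batch_values = []
    for rewards in step_rewards:
        v = 0.0
        for r in reversed(rewards):
            v = r + gamma * v              # simplified stand-in for Eq. (1)
        batch_values.append(v)
    v_update = float(np.mean(batch_values))  # V_update = Average(V_1, ..., V_batchsize)

    if v_update < v_star:
        return "pass_to_D", v_star         # Eq. (2): low value, let D see the batch
    return "screen_out", v_update          # Eq. (2): screen the batch and update V*

For example, screen_batch([[0.1, 0.0, 0.2], [0.0, 0.0, 0.1]], v_star=0.5) passes the batch to D, because its average value (0.2) is below V*.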

2.3 Discriminator D

We implement D on the basis of a CNN, because CNNs have advantages in dealing with variable-length sequences: CNN padding is used to transform the sentences into sequences of fixed length. A source matrix X_{1:N} and a target matrix Y_{1:N} are created to represent {x_1, ..., x_{N_x}} and {y'_1, ..., y'_{N_y}}, by concatenating the k-dimensional embeddings of every word into the final matrices x_{1:N} and y_{1:N}, respectively. A kernel w_j ∈ R^{l×k} applies a convolutional operation to a window of l words to produce a series of feature maps:

c_{ji} = a(w_j ⊕ X_{i:i+l−1} + b),   (3)

where b is a bias term and ⊕ denotes the sum of the element-wise product. We use ReLU to implement the nonlinear activation function a. A max-pooling operation is then applied over the feature maps:

c_j^max = max(c_{j1}, ..., c_{j,N−l+1}).   (4)

For different window sizes, we set corresponding kernels to extract the valid features and then concatenate them to form the source-sentence representation c_x for D. The target-sentence representation c_y is extracted from the target matrix Y_{1:N} in the same way. Given the source sentence, the probability that the target sentence is real can then be computed as:

p_D = sigmoid(T[c_x; c_y]),   (5)

where T is the transform matrix that maps the concatenation of c_x and c_y into a 2-dimensional embedding. The final probability is obtained by feeding this 2-dimensional mapping into the sigmoid function.
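A minimal PyTorch-style sketch of this discriminator follows; the window sizes, filter count, and the way the 2-dimensional mapping is reduced to a single probability are illustrative assumptions, not the authors' configuration:

import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Sketch of D (Eqs. 3-5): per-window convolutions, max-pooling over
    positions, concatenated source/target representations, sigmoid output."""
    def __init__(self, emb_dim=512, windows=(2, 3, 4), n_filters=64):
        super().__init__()
        # One kernel w_j per window size l, applied to the N x k embedding matrix.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, kernel_size=(l, emb_dim)) for l in windows]
        )
        # T: maps [c_x; c_y] to a 2-dimensional embedding (Eq. 5).
        self.transform = nn.Linear(2 * n_filters * len(windows), 2)

    def encode(self, emb):                             # emb: [batch, N, k]
        feats = []
        for conv in self.convs:
            c = torch.relu(conv(emb.unsqueeze(1)))     # Eq. (3), ReLU activation
            c = c.squeeze(3).max(dim=2).values         # Eq. (4), max over positions
            feats.append(c)
        return torch.cat(feats, dim=1)                 # sentence representation

    def forward(self, src_emb, tgt_emb):
        c_x, c_y = self.encode(src_emb), self.encode(tgt_emb)
        logits = self.transform(torch.cat([c_x, c_y], dim=1))
        return torch.sigmoid(logits[:, 1])             # p_D: probability of being real

Both sentences are assumed to be padded to a common length N no smaller than the largest window before embedding, as the padding step above describes.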

2.4 Training Process

We present a standard VS-GAN training process in the form of its data flow (Fig. 2):

• Pre-training G with the RL algorithm. We pre-train G to ensure that an optimal parameter set is directly involved in training and provides a good search space for beam search.

• Observing the reinforcement reward. Once the end of a sentence (or the maximum sequence length) is reached, a cumulative reward matrix R is generated. The observed reward measures the cumulative value of the agent (G) over the prediction process (actions) of a set of sequences.

• Value screening. The reward R is fed into a convolutional layer and a linear activation function; this layer corresponds to a particular action Q. The next-iteration value is then stacked with the reward and fed back into the convolutional layer N times, where N depends on the length of the sequence. Subsequently, a long-term value V_update is generated for the decoded sentences, and the batch is screened by the conditions in Eq. (2).

• Staying awake. D is dedicated to differentiating the screened negative results from the human-translated sentences, which provides the probability p_D.

• Adversarial game. When a win-win situation is achieved, adversarial training converges to an optimal state: G can generate confusing negative samples, and D discriminates efficiently between negative and human translations. Thus, the training objective is as follows:

J_θ = E_{(x,y)}[log p_D(x, y)] + E_{(x,y')}[log(1 − p_D(x, y'))],   (6)

where (x, y) is the ground-truth sentence pair and (x, y') is the sampled translation pair, used as positive and negative training data respectively, and p_D(·, ·) is the similarity probability produced by D as described above. J_θ can be regarded as a game between maximum and minimum expectations: the maximum expectation for the generator G, and the minimum expectation for D.

A common shortcoming of adversarial training in NLP applications is that the training process is non-trivial to design for text (Huszár, 2015). The discretely sampled y' makes it difficult to back-propagate the error signals from D to G directly, which makes J_θ non-differentiable w.r.t. G's model parameters θ. To solve this problem, Monte Carlo search under the policy of G is applied to sample the unknown tokens for estimating the signals. The objective of training G can then be described as minimizing the following loss:

Loss = E_{(x,y')}[log(1 − p_D(x, y'))].   (7)

We use log(1 − p_D(x, y')) as a Monte Carlo estimate of the signal. By simple derivation, we obtain the corresponding gradient with respect to θ_G:

∂Loss/∂θ_G = E_{y'}[log(1 − p_D(x, y')) ∂/∂θ_G log G(y' | x)],   (8)

where ∂/∂θ_G log G(y' | x) represents the gradients with respect to the parameters of the RL-based translation model. Therefore, the parameter update can be described as:

θ_G ← θ_G + l ∂Loss/∂θ_G,   (9)

where l is the learning rate and the gradient is back-propagated along the negative direction. Note that we have not observed high variance accompanying this computation.
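The update in Eqs. (7)-(9) amounts to a REINFORCE-style step in which the discriminator score weights the generator's log-probability. The sketch below illustrates this; G.sample and D.prob_real are hypothetical interfaces standing in for the generator's sampling routine and for p_D, and the single-sample estimate is a simplification of the Monte Carlo search described above:

import torch

def generator_policy_gradient_step(G, D, optimizer, src, max_len=50):
    # Sample a translation y' under the current policy of G, together with
    # its summed log-probability (hypothetical interface).
    y_sampled, log_prob = G.sample(src, max_len)
    with torch.no_grad():
        p_d = D.prob_real(src, y_sampled)       # p_D(x, y') from the discriminator
    # Loss = E[log(1 - p_D(x, y'))] (Eq. 7); the fixed signal weights the
    # generator's log-probability, giving the gradient of Eq. (8).
    signal = torch.log(torch.clamp(1.0 - p_d, min=1e-8))
    loss = signal * log_prob
    optimizer.zero_grad()
    loss.mean().backward()
    optimizer.step()                            # parameter update of Eq. (9)
    return loss.mean().item()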

Figure 2: Presentation of the VS-GAN model, in which different colors represent each component of VS-GAN.

3 Experiment and Analysis

3.1 Dataset and Noise Addition

We verify the effectiveness of our model on a language pair in which one of the languages involved is agglutinative: Mongolian-Chinese (M-C). We use data from the CLDC and the CWMT2017 evaluation campaign. To avoid allocating excessive training time to long sentences, all sentence pairs longer than 50 words on either the source or the target side are discarded. Finally, by adding noise, we divide the Mongolian training data into five categories³: {Original, BPE⁴, Original&Suffixes, Original&Case, Original&Suffixes&Case}. For the target Chinese, besides BPE processing, we adopt character granularity to provide a smaller unit corresponding to the morphological noise. Effective work on morphological segmentation (Ataman D, 2017; ThuyLinh Nguyen, 2010) can be applied to agglutinative languages; however, to be more specific and accurate, we use an independently developed Mongolian segmenter. The final training corpus consists of about 230K original sentences (including 1,000 validation and 1,000 test sentences) and the corresponding pseudo-data sentences. We tried several numbers of BPE merge operations⁵ on the data set; the final selection is 35,000 for Mongolian and 15,000 for Chinese.

³(Original&*) represents a mixed sample of the original data and the * segmentation of the original data.

⁴Note that since BPE can only represent an open vocabulary through variable-length character sequences, it is insensitive to morphological noise.

⁵https://github.com/rsennrich/subword-nmt
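For reference, the BPE preprocessing above can be reproduced with the subword-nmt tool from footnote 5 roughly as follows; the file names are placeholders, and only the merge counts come from this section:

import subprocess

# Learn and apply BPE on the Mongolian side with 35,000 merge operations.
subprocess.run("subword-nmt learn-bpe -s 35000 < train.mn > bpe.codes.mn",
               shell=True, check=True)
subprocess.run("subword-nmt apply-bpe -c bpe.codes.mn < train.mn > train.bpe.mn",
               shell=True, check=True)
# Chinese side: character granularity plus BPE with 15,000 merge operations.
subprocess.run("subword-nmt learn-bpe -s 15000 < train.zh > bpe.codes.zh",
               shell=True, check=True)
subprocess.run("subword-nmt apply-bpe -c bpe.codes.zh < train.zh > train.bpe.zh",
               shell=True, check=True)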

3.2 Experimental Setup

We select three strong baselines. The Transformer presents an outstanding approach for most MT tasks; MIXER addresses the exposure bias problem of traditional NMT well through RL; and BR-CSGAN is among the best endeavors to introduce generative adversarial training into NMT.

The screening conditions mentioned in Section 2 enable the model to be trained efficiently. One issue is that under such conditions V* gradually increases. Therefore, the following situation should be considered in the screening process: for example, batch1 with {V_1 = 1, ..., V_n = 10} and V_update = 5.5, versus batch2 with {V_1 = 4, ..., V_n = 6} and V_update = 5. We observe that batch1 contains worse sentences worth noting by D; however, because of its higher average value, batch1 will be screened out. We argue that such an operation is still reasonable, because the higher-value batches occur only at the end of training, and the n-gram nature of the BLEU calculation indicates that batch2 needs more attention.

For the LSTM and MIXER, we set the word-embedding dimension to 512 and the dropout rates to 0.1/0.1/0.3. We use beam search with a beam size of 4 and a length penalty of 0.6. For the Transformer, the Transformer-base configuration in (Vaswani et al., 2017) is an effective setting for our experiments. We set G to generate 500 negative examples per iteration, and the number of Monte Carlo search rollouts is set to 20.

3.3 Main Results and Analysis

We mainly analyze the experimental results in three aspects: BLEU evaluation, the number of <unk> symbols in the translations, and the time efficiency of model training.

• BLEU. We use the BLEU (Papineni et al., 2002) score as an evaluation metric to measure the degree of similarity between the generated and the human translation.

For G, we select the model with 50 epochs of pre-training as the initial state, and 80 adversarial training epochs are used to jointly train G and D. The results (Table 1) show that the GAN-based models are clearly superior to the baseline systems on every kind of noisy corpus, and VS-GAN outperforms each baseline by 2-3 BLEU on average. For the same model, the added noise provides excellent generalization ability at test time; notably, VS-GAN improves by 3.8 BLEU over the original corpus when both kinds of noise are added. We notice that when only case noise is added, the effect of VS-GAN is not outstanding. The reason is that the individual Mongolian case is not obviously 'helpful' to the production of <unk> symbols in Mongolian, so the screener is insensitive to it.

• UNK. We count the number of <unk> symbols produced by each system over 50 epochs of training when translating the source sentences (Fig. 3).


Figure 3: Number of <unk> symbols in the translations of the different models in each epoch.

For BR-CSGAN and VS-GAN, we directly count the number of <unk> symbols in the negative examples.

In comparison with the Transformer, MIXER optimizes BLEU through RL training, which can directly enhance the BLEU score of the translation; in terms of UNK, however, it is inefficient, and the optimal initial state cannot be effectively maintained during the rest of training (orange lines). Combining Table 1 and Fig. 3, we can see that the change in BLEU coincides with the change in the number of UNK symbols. Furthermore, we note that UNK not only affects the accuracy of source-word decoding, but also affects the semantic prediction of the entire sentence in translation.


Table 1: Training time consumption and BLEU score of the systems under different noise modes. We stop the pre-training of G (both Transformer and LSTM) once the validation accuracy reaches δ, which is set to 0.6 in BR-CSGAN and VS-GAN. For the pre-training of D, we consider the classification-accuracy threshold and set it to 0.7.

Columns: Original; BPE; Original&Suffixes; Original&Case; Original&Suffixes&Case
MIXER (Ranzato et al., 2015): 29.7; 26; 30.4; 28.6; 31.3
BR-CSGAN (G:LSTM) (Zhen et al., 2018): 29.9|(15+17)h; 30.8|(21+30)h; 31.7|(22+25)h; 31.1|(15+19)h; 32.3|(27+32)h
VS-GAN (G:LSTM): 29.6|(15+11)h; 31.5|(21+18)h; 32.5|(22+19)h; 30.8|(15+15)h; 35.4|(27+21)h
Transformer (Vaswani et al., 2017): 30.5; 31.5; 30.2; 29.8; 30.5
BR-CSGAN (G:Transformer) (Zhen et al., 2018): 27.4|(22+29)h; 28|(27+34)h; 28.9|(23+25)h; 28.1|(22+30)h; 32.1|(38+36)h
VS-GAN (G:Transformer): 28.8|(22+18)h; 31.1|(27+26)h; 32|(23+18)h; 29.4|(22+18)h; 33.2|(38+22)h

• Training Efficiency. In terms of training efficiency, we compare the two GAN-based models by counting the pre-training and adversarial training times in Table 1 (e.g., (15+17)h indicates 15 h of pre-training and 17 h of adversarial training). Reinforcement pre-training is the same for BR-CSGAN and VS-GAN. In adversarial training, VS-GAN shows a remarkable time reduction under each noise training strategy. This result depends on the screener for the negative generations, which lets D regulate G by following UNK directly; such a combination of structures can converge to an optimal state rapidly. From the results in Table 1, for the two GANs the training time of the LSTM is shorter than that of the Transformer. We attribute this to two reasons: i) the time consumed by the LSTM is mainly spent exploring long-distance dependencies in sequences, but most of our corpus consists of short sentences (<50 words), which bridges the gap between the LSTM and the Transformer and even lets the LSTM exceed the Transformer (when it achieves the same accuracy on the validation set); ii) according to our extensive experimental results on Mongolian-based NMT (including Mongolian-Chinese and Mongolian-Cyrillic Mongolian), the Transformer usually converges more slowly than the LSTM when the corpus size exceeds 0.2M.

4 Conclusion

We propose a GAN model with an additional VIN-based value screener to solve the UNK problem in Mongolian→Chinese MT, which is caused by the change of suffixes or component cases in Mongolian and by the limited vocabulary. In our experiments, we adopt a preprocessing method based on noise addition to enhance the generalization ability of the model for the UNK problem. Experimental results show that our approach surpasses the state-of-the-art results under a variety of noise-based training strategies and significantly saves training time. In future research, we will focus more on combining GAN with language features to enhance other agglutinative-language NMT tasks, such as using syntax-tree guidance for GAN training. Conversely, it is also a worthwhile attempt to modify the grammar tree constructed by adversarial training.

5 Acknowledgments

This study is supported by the Natural Science Foundation of Inner Mongolia (No. 2018MS06005) and the Mongolian Language Information Special Support Project of Inner Mongolia (No. MW-2018-MGYWXXH-302).

References

Duygu Ataman, Matteo Negri, Marco Turchi, et al. 2017. Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English. The Prague Bulletin of Mathematical Linguistics, 108(1):331–342.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

D. Bertsekas. 2012. Dynamic Programming and Optimal Control. Athena Scientific.

Xilun Chen, Yu Sun, et al. 2016. Adversarial deep averaging networks for cross-lingual sentiment classification. In Association for Computational Linguistics (ACL), pages 557–570.

P. Dayan and L. F. Abbott. 2003. Theoretical neuroscience: Computational and mathematical modelling of neural systems. Journal of Cognitive Neuroscience, 15(1):154–155.

Jonas Gehring, Michael Auli, David Grangier, et al. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), pages 1243–1252.

Alex Graves. 2013a. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Alex Graves. 2013b. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Caglar Gulcehre, Sungjin Ahn, et al. 2016. Pointing the unknown words. In Association for Computational Linguistics (ACL), pages 140–149.

Karl Moritz Hermann, Tomáš Kočiský, et al. 2015. Teaching machines to read and comprehend. In Neural Information Processing Systems (NIPS), pages 1693–1701.

Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In International Joint Conference on Natural Language Processing (IJCNLP), pages 1–10.

Minh-Thang Luong, Ilya Sutskever, et al. 2015. Addressing the rare word problem in neural machine translation. In International Joint Conference on Natural Language Processing (IJCNLP), pages 11–19.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, et al. 2011. Extensions of recurrent neural network language model. In International Conference on Acoustics, Speech, and Signal Processing, pages 5528–5531.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 246–252.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318.

Richard Bellman. 1957. Terminal control, time lags, and dynamic programming. National Academy of Sciences of the United States of America, 43(10):927–930.

Marc'Aurelio Ranzato, Sumit Chopra, et al. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pages 1715–1725.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pages 3104–3112.

A. Tamar, Y. Wu, G. Thomas, et al. 2016. Value iteration networks. In Neural Information Processing Systems (NIPS), pages 2154–2162.

ThuyLinh Nguyen, Stephan Vogel, and Noah A. Smith. 2010. Nonparametric word segmentation for machine translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 815–823.

Ashish Vaswani, Noam Shazeer, et al. 2017. Attention is all you need. In Neural Information Processing Systems (NIPS), pages 5998–6008.

L. Wu, Y. Xia, L. Zhao, et al. 2018. Adversarial neural machine translation. In Asian Conference on Machine Learning (ACML), pages 374–385.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI Conference on Artificial Intelligence (AAAI), pages 2852–2858.

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. Transactions of the Association for Computational Linguistics, 5(1):515–528.

Yizhe Zhang, Zhe Gan, Kai Fan, et al. 2017. Adversarial feature matching for text generation. In Proceedings of the International Conference on Machine Learning (ICML), pages 4006–4015.

Yang Zhen, Chen Wei, Wang Feng, and Xu Bo. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In North American Chapter of the Association for Computational Linguistics (NAACL), pages 1346–1355.