
Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions

Marcin Junczys-Dowmunt 1,2, Tomasz Dwojak 1,2, Hieu Hoang 3

1 Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznan
2 School of Informatics, University of Edinburgh
3 Moses Machine Translation CIC

{junczys,t.dwojak}@amu.edu.pl, [email protected]

Abstract

In this paper we provide the largest published comparison of translation quality for phrase-based SMT and neural machine translation across 30 translation directions. For ten directions we also include hierarchical phrase-based MT. Experiments are performed for the recently published United Nations Parallel Corpus v1.0 and its large six-way sentence-aligned subcorpus. In the second part of the paper we investigate aspects of translation speed, introducing AmuNMT, our efficient neural machine translation decoder. We demonstrate that current neural machine translation could already be used for in-production systems when comparing words-per-second ratios.

1. Introduction

We compare the performance of phrase-based SMT, hierarchical phrase-based SMT, and neural machine translation (NMT) across fifteen language pairs and thirty translation directions. [1] recently published the United Nations Parallel Corpus v1.0. It contains a subcorpus of ca. 11M sentences fully aligned across six languages (Arabic, Chinese, English, French, Russian, and Spanish) and official development and test sets, which makes this an ideal resource for experiments across multiple language pairs. It is also a compelling use case for in-domain translation with large bilingual in-house resources. We provide BLEU scores for the entire translation matrix for all official languages from the fully aligned subcorpus.

We also introduce AmuNMT1, our efficient neural machine translation decoder, and demonstrate that the current set-up could already be used instead of Moses in terms of translation speed when a single GPU per machine is available. Multiple GPUs would surpass the speed of the proposed in-production Moses configuration by far.

1 https://github.com/emjotde/amunmt

2. Training data

2.1. The UN corpus

The United Nations Parallel Corpus v1.0 [1] consists of human-translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. This subcorpus consists of sentences that are consistently aligned across all languages with the English primary documents. Statistics for the data are provided below:

Documents    Lines         English Tokens
86,307       11,365,709    334,953,817

Table 1: Statistics for the fully aligned subcorpus

Documents released in 2015 (excluded from the main corpus) were used to create official development and test sets for machine translation tasks. Development data was randomly selected from documents that were released in the first quarter of 2015; test data was selected from the second quarter. Both sets comprise 4,000 sentences that are 1-1 alignments across all official languages. As in the case of the fully aligned subcorpus, any translation direction can be evaluated.



2.2. Preprocessing

Sentences longer than 100 words were discarded. We lowercased the training data as was done in [1]; the data was tokenized with the Moses tokenizer. For Chinese segmentation we applied Jieba2 first.

2 https://github.com/fxsjy/jieba
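For concreteness, here is a minimal preprocessing sketch in Python, assuming the jieba package and sacremoses as stand-ins for the Jieba segmenter and the Moses tokenizer script used here; the function name is illustrative.

```python
# Preprocessing sketch: Chinese segmentation with Jieba, Moses-style tokenization,
# lowercasing, and removal of sentences longer than 100 tokens. sacremoses is a
# stand-in for the original Moses tokenizer script.
import jieba
from sacremoses import MosesTokenizer

MAX_LEN = 100
tokenize_en = MosesTokenizer(lang="en").tokenize

def preprocess_pair(src_zh, tgt_en):
    """Return (source_tokens, target_tokens), or None if either side is too long."""
    src_tokens = list(jieba.cut(src_zh))       # word segmentation for Chinese
    tgt_tokens = tokenize_en(tgt_en.lower())   # lowercase, then tokenize English
    if len(src_tokens) > MAX_LEN or len(tgt_tokens) > MAX_LEN:
        return None                            # discard overly long sentence pairs
    return src_tokens, tgt_tokens
```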

2.3. Subword units

To avoid the large-vocabulary problem in NMT models [2], we use byte-pair encoding (BPE) to achieve open-vocabulary translation with a fixed vocabulary of subword symbols [3]. For all languages we set the number of subword units to 30,000. Segmentation into subword units is applied after any other preprocessing step. During evaluation, subwords are reassembled.
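The sketch below illustrates the subword convention rather than the BPE algorithm itself: a greedy longest-match segmenter stands in for applying the learned merge operations of [3], while the "@@ " continuation markers and their removal during evaluation follow the usual subword-nmt convention. The tiny vocabulary is purely illustrative.

```python
# Subword segmentation sketch: greedy longest-match against a subword vocabulary
# (a simplification of applying learned BPE merges), with "@@ " continuation
# markers and their removal at evaluation time. The vocabulary is illustrative.
import re

def segment(word, subword_vocab):
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):        # try the longest piece first
            piece = word[start:end]
            if piece in subword_vocab or end - start == 1:
                pieces.append(piece)
                start = end
                break
    # every piece except the last one carries the continuation marker
    return [p + "@@" for p in pieces[:-1]] + [pieces[-1]]

def desegment(text):
    # reassemble subwords before BLEU evaluation
    return re.sub(r"@@( |$)", "", text)

vocab = {"trans", "lation", "deploy", "ment"}
print(" ".join(segment("translation", vocab)))   # trans@@ lation
print(desegment("deploy@@ ment ready"))          # deployment ready
```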

3. Phrase-based SMT baselines

[1] provided baseline BLEU scores for Moses [4] configurations that were trained on the 6-way subcorpus. Their configurations were the following:

The training corpora were split into four equally sized parts that were aligned with MGIZA++ [5], running 5 iterations of Model 1 and the HMM model on each part. A 5-gram language model was trained from the target parallel data, with 3-grams or higher order being pruned if they occurred only once. Apart from the default configuration with a lexical reordering model, a 5-gram operation sequence model [6] (all n-grams pruned if they occur only once) and a 9-gram word-class language model with word classes produced by word2vec [7] (3-grams and 4-grams are pruned if they occur only once, 5-grams and 6-grams if they occur only twice, etc.) were added, both trained with KenLM [8]. Significance pruning [9] was applied to the phrase table, and the compact phrase-table and reordering data structures [10] were used. During decoding with the cube-pruning algorithm, a stack size and cube-pruning pop limit of 1,000 were chosen. This configuration resembles the in-house translation systems deployed at the United Nations.

4. Neural translation systems

The neural machine translation system is an attentional encoder-decoder [11], which has been trained with Nematus [12]. We used mini-batches of size 40, a maximum sentence length of 100, word embeddings of size 500, and hidden layers of size 1024. We clip the gradient norm to 1.0 [13]. Models were trained with Adadelta [14], reshuffling the training corpus between epochs. The models have been trained for 1.2M iterations (one iteration corresponds to one mini-batch), saving every 30,000 iterations. On our NVIDIA GTX 1080 this corresponds to roughly 4 epochs and 8 days of training time. Models with English as their source or target data were later trained for another 1.2M iterations (another 2 epochs, 8 days). For ensembling, we chose the last four model checkpoints.
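For reference, the training hyperparameters listed above can be gathered into a single configuration sketch; this is only an illustrative summary, and the key names are descriptive rather than actual Nematus option names.

```python
# Illustrative summary of the training configuration described in this section;
# the dictionary keys are descriptive and do not mirror real Nematus options.
nmt_config = {
    "batch_size": 40,                 # mini-batch size
    "max_sentence_length": 100,
    "embedding_size": 500,
    "hidden_size": 1024,
    "gradient_clip_norm": 1.0,        # [13]
    "optimizer": "adadelta",          # [14]
    "shuffle_each_epoch": True,
    "save_every_iterations": 30_000,
    "total_iterations": 1_200_000,    # ~4 epochs, ~8 days on a GTX 1080
    "ensemble_last_checkpoints": 4,   # last four checkpoints are ensembled
}
```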

5. Phrase-based vs. NMT – full matrix

In Figure 1 we present the results for all thirty language pairs in the United Nations Parallel Corpus for the officially included test set. BLEU results are case-insensitive, tokenized with the Moses tokenizer. For Chinese we applied Jieba for segmentation. BPE subwords were concatenated.

Here we compare with NMT models that were trained for 4 epochs or 1.2M iterations. With the exception of fr-es, the neural system is always comparable to or better than the phrase-based system. Especially in cases where Chinese is one of the languages in a language pair, the improvement of NMT over PB-SMT is dramatic, with between 7 and 9 BLEU points. We also see large improvements for translations out of and into Arabic, although no special preprocessing has been applied for Arabic. Improvements are also present in the case of the highest scoring translation directions, en-es and es-en.

6. Phrase-based vs. Hiero vs. NMT – language pairs with English

The impressive results for any translation direction involving Chinese motivated us to experiment with hierarchical phrase-based machine translation (Hiero) as implemented in Moses. Hiero has been confirmed to outperform phrase-based SMT for the Chinese-English language pair. We decided to expand our experiments to all language pairs that include English, as these are the main translation directions the UN is working with. For these ten translation directions we created a hierarchical PB-SMT system with the same preprocessing settings as the shallow PB-SMT system.

To test the effects of prolonged training time, we also continued training of our neural systems for another four epochs or 1.2M iterations (2.4M in total), which increased training time to 16 days in total per neural system.


           ar-en  ar-es  ar-fr  ar-ru  ar-zh  en-ar  en-es  en-fr  en-ru  en-zh
PB-SMT      53.1   49.8   42.8   36.0   31.6   42.0   61.3   50.1   43.3   37.8
NMT 1.2M    56.0   52.2   44.7   38.8   39.7   45.1   62.1   51.5   45.3   46.8

           es-ar  es-en  es-fr  es-ru  es-zh  fr-ar  fr-en  fr-es  fr-ru  fr-zh
PB-SMT      38.1   59.9   49.8   39.7   31.3   34.4   52.2   52.4   36.5   30.0
NMT 1.2M    39.7   61.1   49.8   41.0   41.2   35.5   52.4   51.8   37.3   37.4

           ru-ar  ru-en  ru-es  ru-fr  ru-zh  zh-ar  zh-en  zh-es  zh-fr  zh-ru
PB-SMT      34.4   52.6   49.6   43.4   32.6   28.0   43.0   39.6   34.4   29.6
NMT 1.2M    35.7   52.9   50.5   44.1   39.4   35.4   51.8   48.3   42.0   36.4

Figure 1: Comparison between Moses baseline systems and neural models (BLEU) for the full language pair matrix of the 6-way corpus.


           en-ar  en-es  en-fr  en-ru  en-zh
PB-SMT      42.0   61.3   50.1   43.3   37.8
Hiero       42.6   60.4   49.3   43.2   41.7
NMT 1.2M    45.1   62.1   51.5   45.3   46.8
NMT 2.4M    45.6   62.7   51.9   46.0   47.6

           ar-en  es-en  fr-en  ru-en  zh-en
PB-SMT      53.1   59.9   52.2   52.6   43.0
Hiero       54.7   60.7   53.0   53.6   47.3
NMT 1.2M    56.0   61.1   52.4   52.9   51.8
NMT 2.4M    56.9   61.7   53.6   53.2   53.1

Figure 2: For all language pairs involving English, we also experimented with hierarchical machine translation (Hiero) and more training iterations for the neural models (BLEU). NMT 1.2M means 1.2 million iterations with a batch size of 40, training time ca. 8 days; NMT 2.4M means 2.4 million iterations accordingly.


Figure 2 summarizes the results. As expected, Hiero outperforms PB-SMT by a significant margin for Chinese-English and English-Chinese, but does not reach half the improvement of the NMT systems. For other language pairs we see mixed results with models trained for 1.2M iterations; for French-English and Russian-English, where results for PB-SMT and NMT are close, Hiero is the best system.

However, training the NMT system for another eight days helps, with gains between 0.3 and 1.3 BLEU. We did not observe improvements beyond 2M iterations. For our setting, it seems that stopping training after 10 days is a viable heuristic. Training with other corpus sizes, architectures, and hyper-parameters may behave differently.

7. Efficient decoding with AmuNMT

AmuNMT is a ground-up neural MT toolkit implementation, developed in C++. It currently consists of an efficient beam-search inference engine for models trained with Nematus. We focused mainly on efficiency and usability. Features of the AmuNMT decoder include:

• Multi-GPU, multi-CPU, and mixed GPU/CPU mode with sentence-wise threads (different sentences are decoded in different threads; see the sketch after this list);

• Low-latency CPU-based decoding with intra-sentence multi-threading (one sentence makes use of multiple threads during matrix operations);

• Compatibility with Nematus [12];

• Ensembling of multiple models;

• Vocabulary selection in the output layer [15, 16];

• Integrated segmentation into subword units [3].
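As referenced in the first bullet, the following is a minimal Python sketch of sentence-wise parallel decoding; AmuNMT itself implements this scheme in C++, and translate_sentence below is a hypothetical stand-in for a real decoder call.

```python
# Sentence-wise parallelism sketch: each sentence is decoded in its own worker
# thread, mirroring AmuNMT's multi-CPU mode. translate_sentence is hypothetical.
from concurrent.futures import ThreadPoolExecutor

def translate_sentence(sentence: str) -> str:
    # placeholder: a real decoder would run beam search over the model here
    return sentence.upper()

def translate_corpus(sentences, num_threads=16):
    # map() preserves input order, so output line i corresponds to input line i
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(translate_sentence, sentences))

print(translate_corpus(["one sentence per thread", "another sentence"]))
```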


Figure 3: Beam size versus speed (words per second) and quality (BLEU) for a single English-French model. Speed was evaluated with AmuNMT on a single GPU. The ragged appearance is due to cuBLAS GEMM kernel switching for different matrix sizes.

System            BLEU    Wps    Memory
Single best       49.68   ~860   301M
Last-4 ensemble   51.38   ~200   2.1G
Last-4 average    51.13   ~860   301M
Last-8 average    51.35   ~860   301M

Table 2: Checkpoint ensembling and averaging. Wps is translation speed in words per second on one GPU. Memory is memory consumption or total model size.

7.1. Checkpoint ensembling and averaging

Results in Figures 1 and 2 are reported for ensembles of the last four model checkpoints. Generally, this improves quality by up to 2 BLEU points, but reduces translation speed with growing ensemble size. An alternative found to work by [18] is checkpoint averaging, where a single model is produced by element-wise averaging of all corresponding model parameters.3 We observed that averaging over the last 8 checkpoints yields similar quality to ensembling the last 4 checkpoints (Table 2). Results for averaging more models are uneven, as increasingly weaker models are included. We did not investigate ensembles of models produced by separate training runs, which would likely yield further improvements.

3 In later experiments and discussions with Rico Sennrich, we have found that this effect seems to be an artifact of the Adadelta algorithm. Training with Adam results in better single models and seems to make checkpoint ensembling or averaging obsolete.
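A minimal sketch of checkpoint averaging, assuming each checkpoint is a Nematus-style .npz archive that stores the same set of named parameter arrays; the checkpoint file names are hypothetical.

```python
# Checkpoint averaging sketch: element-wise mean of every named parameter over
# the last N saved checkpoints. File names below are hypothetical.
import numpy as np

def average_checkpoints(paths):
    checkpoints = [np.load(p) for p in paths]
    names = checkpoints[0].files                 # parameter names in the archive
    return {name: np.mean([c[name] for c in checkpoints], axis=0) for name in names}

if __name__ == "__main__":
    last8 = [f"model.iter{i}.npz" for i in range(990_000, 1_200_001, 30_000)]
    np.savez("model.avg8.npz", **average_checkpoints(last8))
```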

7.2. Vocabulary selection

Our vocabulary size is already greatly reduced due to the use of subword units (30k items), but especially for the CPU version of AmuNMT decoding time is dominated by the calculations in the final output layer. Previous work [15] proposed to clip the per-sentence vocabulary to the K = 30k (out of 500k) most common target words and the K′ = 10 most probable translations for each source word. We re-use Moses lexical translation tables trained on data segmented into subword units to obtain the translation probabilities.

Our implementation is based on [15]; however, similar to [16], we find that relying mostly on K′ does not result in a deterioration of translation quality and that K can be greatly reduced. We empirically chose K = 75 and K′ = 75, which results on average in ca. 1,250 vocabulary items per sentence. BLEU scores remain the same although translations are slightly changed. For performance comparisons see Section 7.4.
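A sketch of this vocabulary selection scheme: the candidate set for a sentence is the union of the K globally most frequent target subwords and the K′ most probable translations of each source subword taken from a lexical translation table. The assumed "target source probability" column order and all names are illustrative, not the exact Moses file format.

```python
# Per-sentence vocabulary selection sketch with K = K' = 75. The lexical table
# is assumed to contain "target source probability" lines (an assumption, not
# necessarily the exact Moses format).
from collections import defaultdict

K, K_PRIME = 75, 75

def load_lexical_table(path):
    table = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            tgt, src, prob = line.split()
            table[src].append((float(prob), tgt))
    return table

def sentence_vocabulary(source_tokens, lex_table, frequent_targets):
    candidates = set(frequent_targets[:K])             # globally frequent subwords
    for token in source_tokens:
        best = sorted(lex_table.get(token, []), reverse=True)[:K_PRIME]
        candidates.update(tgt for _, tgt in best)      # per-token translation options
    return candidates                                  # ~1,250 items per sentence on average
```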

7.3. Beam size vs. speed and quality

Beam size has a large impact on decoding speed. In Figure 3 we plot the influence of beam size on decoding speed (as words per second) and translation quality (in BLEU) for the English-French model. The English part of the UN test set consists of ca. 120,000 tokens; the whole test set was translated for each experiment. As can be seen, beam sizes beyond 5-7 do not result in significant improvements, as the quality for a beam size of 5 is only 0.2 BLEU below the maximum. However, decoding speed is significantly slower for larger beams. We therefore choose a beam size of 5 for our experiments.
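A sketch of how such a beam-size sweep can be measured: decode() is a hypothetical hook into a beam-search decoder, and sacrebleu stands in for the tokenized, lowercased BLEU scoring used in the paper.

```python
# Beam-size sweep sketch: measure words per second and BLEU for several beam
# sizes. decode() is a hypothetical decoder call; sacrebleu is a stand-in metric.
import time
import sacrebleu

def decode(sentences, beam_size):
    return list(sentences)  # placeholder for a real beam-search decoder

def sweep(sentences, references, beam_sizes=(1, 2, 5, 10, 20, 50)):
    for beam in beam_sizes:
        start = time.perf_counter()
        hypotheses = decode(sentences, beam)
        elapsed = max(time.perf_counter() - start, 1e-9)
        wps = sum(len(h.split()) for h in hypotheses) / elapsed
        bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
        print(f"beam={beam:3d}  {wps:10.1f} words/s  BLEU={bleu:.2f}")
```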


(a) Words per second for the Google NMT system reported by [17]. We give this only as an example of a production-ready system:

Google CPU-88: 104.2    Google GPU T-K80: 45.5    Google TPU: 358.8

(b) Moses vs. Nematus vs. AmuNMT. Our CPUs are Intel Xeon E5-2620 2.40GHz, our GPUs are GeForce GTX 1080. CPU-16 means 16 CPU threads, GPU-1 means a single GPU. All NMT systems were run with a beam size of 5; systems marked with * use vocabulary selection:

Moses CPU-16: 455.3    Nematus CPU-16: 47.4    Nematus GPU-1: 269.4    AmuNMT CPU-16: 140.9    AmuNMT* CPU-16: 265.3    AmuNMT GPU-1: 864.7    AmuNMT* GPU-1: 1,191.5

Figure 4: Comparison of translation speed in words per second.


7.4. AmuNMT vs. Moses and Nematus

In Figure 4a we report speed in terms of words per second as provided by [17]. Although their models are more complex than ours, we quote these figures as a reference of deployment-ready performance for NMT.

We ran our experiments on an Intel Xeon E5-2620 2.40GHz server with four NVIDIA GeForce GTX 1080 GPUs. The phrase-based parameters are described in Section 3 and follow best practices for achieving a reasonable speed vs. quality trade-off [19]. The neural MT models are as described in the previous section.

We present the words-per-second ratio for our NMT models using AmuNMT and Nematus, executed on the CPU and GPU, in Figure 4b. For the CPU version we use 16 threads, translating one sentence per thread. We restrict the number of OpenBLAS threads to 1 per main Nematus thread. For the GPU version of Nematus we use 5 processes to maximize GPU saturation. As a baseline, the phrase-based model reaches 455 words per second using 16 threads.

The CPU-bound execution of Nematus reaches 47 words per second, while the GPU-bound execution achieves 270 words per second. In similar settings, CPU-bound AmuNMT is three times faster than Nematus CPU, but three times slower than Moses. With vocabulary selection we can nearly double the speed of AmuNMT CPU. The GPU-executed version of AmuNMT is more than three times faster than Nematus and nearly twice as fast as Moses, achieving 865 words per second; with vocabulary selection we reach 1,192. Even the speed of the CPU version would already make it possible to replace a Moses-based SMT system with an AmuNMT-based NMT system in a production environment without severely affecting translation throughput.

AmuNMT can parallelize to multiple GPUs, processing one sentence per GPU. Translation speed increases to 3,368 words per second with four GPUs, and to 4,423 with vocabulary selection. AmuNMT has a start-up time of less than 10 seconds, while Nematus may need several minutes until the first translation can be produced. Nevertheless, the model used in AmuNMT is an exact implementation of the Nematus model.

The size of the NMT model with the chosen parameters is approximately 300 MB, which means about 24 models could be loaded onto a single GPU with 8 GB of RAM. Hardly any overhead is required during translation. With multiple GPUs, access could be parallelized and optimally scheduled in a query-based server setting.


Figure 5: Latency measured in milliseconds per sentence. NMT systems were run with a beam size of 5. Lower is better. Translations were executed serially. Nematus CPU exceeded 8,500 ms and was omitted.

7.5. Low-latency translation

Until now, we have evaluated translation speed as the time it takes to translate a large test set using a large number of cores on a powerful server. This bulk-throughput measure is useful for judging the performance of the MT system for batch translation of large numbers of documents.

However, there are use cases where the latency for a single sentence may be important, for example predictive translation [20]. To compare per-sentence latency, we translate our test set with all tools serially, using at most one CPU thread or process. We do not aim at full GPU saturation, as this would not improve latency. We then average time over the number of sentences and report milliseconds per sentence in Figure 5; lower values are better. Here AmuNMT GPU compares very favourably against all other solutions, with a 20 times lower latency than Moses and Nematus. Latency between the CPU-only variants shows ratios similar to those for bulk translation.
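A minimal sketch of the latency measurement: sentences are translated strictly serially and the total wall-clock time is averaged over the test set; translate is a hypothetical single-sentence decoder call.

```python
# Per-sentence latency sketch: strictly serial translation, averaged over the
# test set and reported in milliseconds. translate() is a hypothetical call.
import time

def average_latency_ms(sentences, translate):
    start = time.perf_counter()
    for sentence in sentences:          # no batching, no parallel decoding
        translate(sentence)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(sentences)

# Example with an identity "decoder", just to show the call:
print(average_latency_ms(["a test sentence"] * 100, lambda s: s))
```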

8. Conclusions and future work

Although NMT systems are known to generalize better than phrase-based systems for out-of-domain data, it was unclear how they perform in a purely in-domain setting, which is of interest for any organization with significant resources of its own data, such as the UN or other governmental bodies.

We evaluated the performance of neural machine translation on all thirty translation directions of the United Nations Parallel Corpus v1.0. We showed that for all translation directions NMT is either on par with or surpasses phrase-based SMT. For some language pairs, the gains in BLEU are substantial. These include all pairs that have Chinese as a source or target language. Very respectable gains can also be observed for Arabic. For other language pairs there is generally some improvement. In the future we would like to verify these results with human evaluation.

We introduced our efficient neural machine translation beam-search decoder, AmuNMT, and demonstrated that high-quality and high-performance neural machine translation can be achieved on commodity hardware; the GPUs we tested on are available to the general public for gaming PCs and graphics workstations. A single GPU outmatches the performance of 16 CPU threads on server-grade Intel Xeon CPUs. Access to specialized hardware seems unnecessary when planning to switch to neural machine translation with lower-parametrized models.

Even the performance of the CPU-only version of AmuNMT is sufficient to set up demo systems, and it can be a viable solution for low-throughput settings. Training, however, requires a GPU. Still, one might start with one GPU for training and reuse the CPU machines on which Moses has been running for a first deployment. For future work, we plan to further improve the performance of AmuNMT, especially for CPU-only settings.

We did not cover architecture-related questions. Reducing the hidden-state size by half could improve performance roughly 4-fold, since the cost of the dominating matrix operations grows quadratically with the hidden-state size. Determining the influence on translation quality would require more experiments.

9. Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 688139 (SUMMA).

This work is sponsored by the Air Force Research Laboratory, prime contract FA8650-11-C-6160.


The views and conclusions contained in this document are those of the authors and should not be interpreted as representative of the official policies, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

10. References

[1] M. Ziemski, M. Junczys-Dowmunt, and B. Pouliquen, "The United Nations Parallel Corpus v1.0," in LREC 2016. ELRA, 2016.

[2] T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba, "Addressing the rare word problem in neural machine translation," in ACL, 2015.

[3] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," CoRR, vol. abs/1508.07909, 2015.

[4] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL, 2007, pp. 177–180.

[5] Q. Gao and S. Vogel, "Parallel implementations of word alignment tool," in Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008, pp. 49–57.

[6] N. Durrani, A. Fraser, H. Schmid, H. Hoang, and P. Koehn, "Can Markov models over minimal translation units help phrase-based SMT?" in ACL, 2013, pp. 399–405.

[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.

[8] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in ACL, 2013, pp. 690–696.

[9] H. Johnson, J. D. Martin, G. F. Foster, and R. Kuhn, "Improving translation quality by discarding most of the phrasetable," in EMNLP-CoNLL, 2007, pp. 967–975.

[10] M. Junczys-Dowmunt, "Phrasal Rank-Encoding: Exploiting phrase redundancy and translational relations for phrase table compression," Prague Bull. Math. Linguistics, vol. 98, pp. 63–74, 2012.

[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.

[12] R. Sennrich, B. Haddow, and A. Birch, "Edinburgh neural machine translation systems for WMT 16," in WMT, 2016, pp. 371–376.

[13] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in ICML, 2013, pp. 1310–1318.

[14] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," CoRR, vol. abs/1212.5701, 2012.

[15] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," in ACL, 2015, pp. 1–10.

[16] H. Mi, Z. Wang, and A. Ittycheriah, "Vocabulary manipulation for neural machine translation," in ACL, 2016.

[17] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," CoRR, vol. abs/1609.08144, 2016.

[18] M. Junczys-Dowmunt, T. Dwojak, and R. Sennrich, "The AMU-UEDIN submission to the WMT16 news translation task: Attention-based NMT models as feature functions in phrase-based SMT," in WMT, 2016, pp. 319–325.

[19] B. Pouliquen, C. Elizalde, M. Junczys-Dowmunt, C. Mazenc, and J. García-Verdugo, "Large-scale multiple language translation accelerator at the United Nations," in MT-Summit XIV, 2013, pp. 345–352.

[20] R. Knowles and P. Koehn, "Neural interactive translation prediction," in AMTA, 2016.