Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation


Page 1: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Matīss Rikters
University of Latvia

COLING 2016, 6th Workshop on Hybrid Approaches to Translation
Osaka, Japan
December 11, 2016

Page 2: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Contents

1. Introduction
2. Baseline System
3. Example Sentence
4. Neural Network Language Models
5. Results
6. Related publications
7. Future plans

Page 3: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Baseline System

Chunking
– Parse sentences with the Berkeley Parser (Petrov et al., 2006)
– Traverse the syntax tree bottom up, from right to left
– Add a word to the current chunk if:
  • the current chunk is not too long (sentence word count / 4), or
  • the word is non-alphabetic or only one symbol long, or
  • the word begins a genitive phrase («of»)
– Otherwise, initialize a new chunk with the word
– If chunking results in too many chunks, repeat the process, allowing more (than sentence word count / 4) words in a chunk (a sketch of this pass follows below)

Translation with online MT systems
– Google Translate; Bing Translator; Yandex.Translate; Hugo.lv

12-gram language model
– DGT-Translation Memory corpus (Steinberger, 2011): 3.1 million Latvian legal-domain sentences
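The chunk-boundary logic above can be sketched in a few lines of Python. This is a hedged illustration, not the system's implementation: it scans a flat token list right to left as a stand-in for the bottom-up, right-to-left traversal of the Berkeley Parser tree, and `chunk_tokens` and its parameters are hypothetical names.

```python
def chunk_tokens(tokens, max_chunks=None):
    """Sketch of the chunking pass: group tokens, scanning right to left."""
    max_len = max(1, len(tokens) // 4)        # sentence word count / 4
    while True:
        chunks, current = [], []
        for word in reversed(tokens):
            keep = (
                len(current) < max_len        # current chunk not too long
                or not word.isalpha()         # non-alphabetic token
                or len(word) == 1             # only one symbol long
                or word.lower() == "of"       # begins a genitive phrase
            )
            if keep:
                current.insert(0, word)       # prepend: we scan right to left
            else:
                chunks.insert(0, current)     # close the current chunk...
                current = [word]              # ...and start a new one
        if current:
            chunks.insert(0, current)
        if max_chunks is None or len(chunks) <= max_chunks:
            return chunks
        max_len += 1                          # too many chunks: allow longer ones
```

The `max_chunks` retry loop mirrors the "repeat the process, allowing more words in a chunk" rule from the slide.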

Page 4: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Baseline System (workflow)

Sentence tokenization → Syntactic analysis → Sentence chunking → Translation with online MT APIs (Google Translate, Bing Translator, LetsMT) → Selection of the best chunks → Sentence recomposition → Translation output

A toy sketch of this data flow follows below.
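Everything in this sketch is a placeholder: `engines` stands in for the online MT API calls, `score` for the language model, and `chunker` for the syntax-based chunking; the real system parses and recomposes far more carefully.

```python
def hybrid_translate(sentence, engines, score, chunker):
    """Toy pipeline: chunk, translate each chunk with every engine,
    keep the best-scoring candidate, and recompose the sentence."""
    chunks = chunker(sentence.split())                # sentence chunking
    selected = []
    for chunk in chunks:
        text = " ".join(chunk)
        candidates = [mt(text) for mt in engines]     # online MT APIs
        selected.append(min(candidates, key=score))   # best chunk by LM score
    return " ".join(selected)                         # sentence recomposition

# Stand-in "engines" and a trivial score, just to show the data flow:
print(hybrid_translate(
    "recently there has been an increased interest",
    engines=[str.title, str.upper],
    score=len,                                        # lower = "better" here
    chunker=lambda toks: [toks[:4], toks[4:]],
))
```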

Page 5: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Sentence Chunking

Page 6: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Choose the best candidate

KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history w_f^n:

p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})

where the probability p(w_n \mid w_f^{n-1}) and the backoff penalties b(w_i^{n-1}) are given by an already-estimated language model. Perplexity is then calculated using this probability: given an unknown probability distribution p and a proposed probability model q, the model is evaluated by determining how well it predicts a separate test sample x_1, x_2, ..., x_N drawn from p:

PPL(q) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}
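In practice this candidate scoring can be done with the kenlm Python bindings. The snippet below is a minimal sketch: `score` and `perplexity` are the module's actual methods, but the model path and candidate strings are placeholders.

```python
import kenlm  # Python bindings for KenLM (Heafield, 2011)

model = kenlm.Model("legal.lv.arpa")  # placeholder path to the 12-gram LM

def best_candidate(candidates):
    """Pick the hypothesis the LM finds most fluent (lowest perplexity)."""
    return min(candidates, key=model.perplexity)

hypotheses = [
    "chunk translation from engine A",  # placeholder candidate strings
    "chunk translation from engine B",
]
print(best_candidate(hypotheses))
print(model.score(hypotheses[0]))  # total log10 probability of the string
```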

Page 22: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Example sentence

Recently there has been an increased interest

in the automated discovery

of equivalent expressions in different languages.

Page 23: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Neural Language Models

• RWTHLM
  – CPU only
  – Feed-forward, recurrent (RNN) and long short-term memory (LSTM) NNs
• MemN2N
  – CPU or GPU
  – End-to-end memory network (RNN with attention)
• Char-RNN
  – CPU or GPU
  – RNNs, LSTMs and gated recurrent units (GRU)
  – Character level

Page 24: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Best Models

• RWTHLM: one feed-forward input layer with a 3-word history, followed by one linear layer of 200 neurons with a sigmoid activation function
• MemN2N: internal state dimension of 150, linear part of the state 75, number of hops set to six
• Char-RNN: 2 LSTM layers with 1,024 neurons each, dropout set to 0.5 (see the sketch below)
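For concreteness, the best Char-RNN configuration corresponds to a model definition roughly like the following PyTorch sketch. This is an assumption-laden translation: the original work uses Karpathy's Torch/Lua char-rnn, and the embedding size here is a guess.

```python
import torch.nn as nn

class CharLM(nn.Module):
    """2 LSTM layers x 1,024 units, dropout 0.5, as in the list above."""
    def __init__(self, vocab_size, hidden=1024, layers=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)  # embedding size assumed
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)       # next-character logits

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state
```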

Page 25: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Char-RNN

• A character-level model works better for highly inflected languages with less data

• Requires Torch scientific computing framework + additional packages

• Can run on CPU, NVIDIA GPU or AMD GPU

• Intended for generating new text, modified to score new text

More in Andrej Karpathy’s blog
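"Modified to score new text" essentially means summing the model's per-character log-probabilities instead of sampling from its output distribution. Below is a framework-neutral sketch, where the hypothetical `prob_next_char(history, ch)` stands in for the trained network's softmax.

```python
import math

def char_perplexity(text, prob_next_char):
    """Per-character perplexity of `text` under a character-level LM."""
    total_log = sum(math.log(prob_next_char(text[:i], ch))
                    for i, ch in enumerate(text))
    return math.exp(-total_log / len(text))

# A uniform toy "model" over 26 letters, just to show the call shape:
print(char_perplexity("abc", lambda history, ch: 1 / 26))  # -> 26.0
```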

Page 26: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Experiment Environment

Training
• Baseline KenLM and RWTHLM models
  – 8-core CPU with 16 GB of RAM
• MemN2N
  – GeForce Titan X (12 GB, 3,072 CUDA cores), 12-core CPU and 64 GB RAM
• Char-RNN
  – Radeon HD 7950 (3 GB, 1,792 cores), 8-core CPU and 16 GB RAM

Translation
• All models
  – 4-core CPU with 16 GB of RAM

Page 27: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Results

System     Perplexity   Training corpus size   Trained on   Training time   BLEU
KenLM      34.67        3.1M                   CPU          1 hour          19.23
RWTHLM     136.47       3.1M                   CPU          7 days          18.78
MemN2N     25.77        3.1M                   GPU          4 days          18.81
Char-RNN   24.46        1.5M                   GPU          2 days          19.53

Page 28: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

General domain

[Figure: perplexity and BLEU per training epoch, general domain; series BLEU-HY and BLEU-BG with linear trend lines]

Page 29: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Legal domain

[Figure: perplexity and BLEU per training epoch, legal domain; series BLEU-BG and BLEU-HY with linear trend lines]

Page 30: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Related publications

• Matīss Rikters. "Multi-system machine translation using online APIs for English-Latvian." ACL-IJCNLP 2015, 4th HyTra Workshop
• Matīss Rikters and Inguna Skadiņa. "Syntax-based multi-system machine translation." LREC 2016
• Matīss Rikters and Inguna Skadiņa. "Combining machine translated sentence chunks from multiple MT systems." CICLing 2016
• Matīss Rikters. "K-translate – interactive multi-system machine translation." Baltic DB&IS 2016
• Matīss Rikters. "Searching for the Best Translation Combination Across All Possible Variants." Baltic HLT 2016

Page 31: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Code on GitHub: https://github.com/M4t1ss

• Baseline system: http://ej.uz/ChunkMT
• Only the chunker + visualizer: http://ej.uz/chunker
• Interactive browser version: http://ej.uz/KTranslate
• With integrated usage of NN LMs: http://ej.uz/NNLMs

Page 32: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Future work

• More enhancements for the chunking step
  – Try dependency parsing instead of constituency parsing
• Choose the best translation candidate with MT quality estimation
  – QuEst++ (Specia et al., 2015)
  – SHEF-NN (Shah et al., 2015)
• Add special processing of multi-word expressions (MWEs)
• Handle MWEs in neural machine translation systems

Page 33: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

References

• Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA: The Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, 2010.
• Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.
• Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
• Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).
• Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." 2006.
• Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH, Vol. 2, 2010.
• Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
• Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 219, 125-132. 2010.
• Rikters, M., and I. Skadiņa. "Syntax-based multi-system machine translation." LREC 2016.
• Rikters, M., and I. Skadiņa. "Combining machine translated sentence chunks from multiple MT systems." CICLing 2016.
• Pal, Santanu, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing, 2014.
• Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL Main Conference Poster Sessions. Association for Computational Linguistics, 2006.
• Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015.
• Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." ACL-IJCNLP 2015 System Demonstrations, 2015.
• Steinberger, Ralf, et al. "DGT-TM: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226 (2013).
• Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058 (2006).

Page 34: Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation

Thank you!