Project Presentation on

Experiments with Different Models of Statistical Machine Translation

Submitted by: Khyati Gupta (14483), Rakhi Sharma (14514)

Contents

• Problem Statement

• Objective

• About the Project

• Flowchart

• Work Done

• Conclusion

• Future Work

• References

Problem Statement

• Machine translation has been a popular research field since the 1990s.

• However, little work has been done on Indian languages, and the current state of the art is weak because data resources are sparse.

• The success of an SMT system depends on the availability of a large parallel corpus.

• Such data is necessary to reliably estimate translation probabilities.

• We have worked on Hindi-to-English translation.

Objective

The objectives of our thesis are:

• Work on different models of statistical machine translation.

• Report the results obtained.

• The SMT models studied are:

SMT

STRING: PHRASE

TREE: HIERARCHICAL, SYNTAX

Introduction

What is translation?

The process of converting text from one language to another so that the original message is retained in the target language.

Source language = the language whose text is to be translated.

Target language = the language into which the text is translated.

What is machine translation?

Machine translation is automated translation, or "translation carried out by a computer." It is a natural language processing task that uses a bilingual data set and other language assets to build language and phrase models, which are then used to translate text from the source language to the target language.

About the Project

• Study the basics of SMT.

• Install Moses, IRSTLM and MGIZA.

• Study various models of SMT: phrase-based, syntax-based and hierarchical.

• Create a parallel corpus.

• Experiment with translation from Hindi to English using different models of SMT.

• Convert the parser's output into Moses format.

• Report results on the basis of the scores obtained.

• Identify the best model of SMT for a given corpus.

Flowchart of SMT

Bayesian Approach

• We apply the Bayesian approach, which has three components:

• Language model (LM): assigns a probability P(e) to any target string of words. An LM is a probability distribution over strings S that attempts to reflect how frequently a string S occurs as a sentence.

• Translation model (TM): assigns a probability P(f|e) to any pair of target and source strings.

• Decoder: determines the translation based on the probabilities of the LM and TM:

$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)$
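The decoding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate translations and every probability in `LM` and `TM` are invented toy values, and a real decoder searches a far larger space instead of enumerating candidates.

```python
import math

# Toy probabilities, invented purely for illustration (not from any trained model).
LM = {"give a betel": 0.6, "a betel give": 0.1}           # P(e)
TM = {("पान दो", "give a betel"): 0.5,                      # P(f|e)
      ("पान दो", "a betel give"): 0.7}

def decode(f, candidates):
    """Return the candidate e that maximises P(f|e) * P(e), computed in log space."""
    return max(candidates, key=lambda e: math.log(TM[(f, e)]) + math.log(LM[e]))

best = decode("पान दो", list(LM))
print(best)
```

Here the second candidate has the higher translation probability, but the language model prefers the fluent word order, so the product favours the first.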

Language Model

• A language model is a simple model of language that computes the probability of a sentence.

• Goal of the language model: detect good English.

• SMT uses the n-gram approach to compute LM probabilities.

• The probability of a sentence is the product of the conditional probabilities of its component words.

• The probability of each word is conditioned on the preceding words.

• Likelihood of a sentence: $P(S) = P(w_1) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_1 \dots w_{n-1})$

which the bigram model approximates as $P(S) \approx P(w_1) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_{n-1})$

• Example illustrating the bigram model: P(the barking dog) = P(the|<start>) × P(barking|the) × P(dog|barking)
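As a minimal sketch of the bigram computation, the following Python estimates bigram probabilities by relative frequency from a tiny corpus and multiplies them along a sentence. The three training sentences are invented for illustration, and no smoothing is applied, so any unseen bigram gives probability zero; toolkits such as IRSTLM or SRILM apply smoothing that this sketch omits.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and the contexts they start from."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<start>"] + sentence.split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_prob(sentence, contexts, bigrams):
    """P(S) as a product of P(w_i | w_{i-1}), estimated by relative frequency."""
    tokens = ["<start>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0                           # unseen context, no smoothing
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

corpus = ["the barking dog", "the dog", "a barking dog"]   # toy data
contexts, bigrams = train_bigram(corpus)
print(sentence_prob("the barking dog", contexts, bigrams))
```

With this toy corpus the three factors are P(the|<start>) = 2/3, P(barking|the) = 1/2 and P(dog|barking) = 1.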

Translation Models

P(f|e) is called the translation model. It gives better scores to accurate and complete translations. It is trained on bilingual Hindi-English parallel data.

Approaches for translation models are:

1. Phrase-based translation

• The sequences of words are called blocks or phrases. They are typically not linguistic phrases but phrasemes found using statistical methods from corpora.

2. Hierarchical phrase-based translation

• Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation.

• It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation, without reference to linguistically motivated syntactic constituents.

3. Syntax-based model

• The syntax model works on the syntactic categories of words and uses a context-free grammar (CFG).


Decoding

• The task of decoding in machine translation is to find the best-scoring translation according to these formulae.

• Given a Hindi sentence f, the decoder finds the English yield of the single best derivation that has Hindi yield f.

• The phrase-based model uses the beam search algorithm.

• Tree-based models use chart decoding.

System Overview

Component        Tool
Word Alignment   GIZA++, MGIZA
Library          BOOST
Decoder          Moses 5
Language Model   IRSTLM, SRILM
Corpus           English-Hindi

Work Done

Data Pre-Processing Flowchart

Sources (PDF) -> Convert PDF into JPEG -> Optical character recognition -> Bilingual text aligner

Data Conversion

PDF -> Convert to JPEG -> JPEG -> OCR (using Indisenz) -> Bilingual text alignment (using Microsoft Aligner)

Corpus Preparation

To prepare the data for training the translation system, we have to perform the following steps:

• Tokenisation: Spaces are inserted between words and punctuation, for example.

• Truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.

• Cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously misaligned sentences are removed.
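The cleaning step can be sketched as a simple filter over sentence pairs. This is an illustrative sketch, not the script used in the project: the function name `clean_corpus` and the thresholds `max_len` and `max_ratio` are assumptions chosen for the example, and the length ratio is used here as a rough proxy for "obviously misaligned".

```python
def clean_corpus(pairs, max_len=80, max_ratio=9.0):
    """Keep only sentence pairs that are non-empty, not too long,
    and whose token counts are not wildly different."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue                                  # empty sentence
        if n_src > max_len or n_tgt > max_len:
            continue                                  # sentence too long
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue                                  # probably misaligned
        kept.append((src, tgt))
    return kept

pairs = [("a b", "x y"), ("", "x"), ("a", "x " * 20)]   # toy data
print(clean_corpus(pairs))
```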

Training in Moses

1. Prepare data

• Training data has to be provided sentence aligned in two files, one for the foreign sentences, one for the English sentences

• The parallel corpus has to be converted into a format that is suitable to the GIZA++ toolkit.

• Two vocabulary files are generated and the parallel corpus is converted into a numberized format.

• The vocabulary files contain words, integer word identifiers and word count information.

2. Run GIZA++

• GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments.

Example sentence pair to be word-aligned:

मेरे दोस्त के लिए पान दो

GIVE A BETEL FOR MY FRIEND

3. Align words

• To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied.
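The two simplest ways of combining the two directional GIZA++ alignments are their intersection and their union; heuristics such as grow-diag-final start from the intersection and add selected points from the union. The sketch below shows only the two extremes, with alignment points represented as index pairs. The function name and the toy alignments are assumptions for illustration.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional word alignments.

    src_to_tgt: set of (source_index, target_index) pairs
    tgt_to_src: set of (target_index, source_index) pairs
    Returns (intersection, union), both as (source_index, target_index) pairs.
    """
    forward = set(src_to_tgt)
    backward = {(s, t) for (t, s) in tgt_to_src}   # flip to a common orientation
    return forward & backward, forward | backward

forward = {(0, 0), (1, 1), (2, 1)}     # toy Hindi-to-English alignment
backward = {(0, 0), (1, 1), (2, 2)}    # toy English-to-Hindi alignment
intersection, union = symmetrize(forward, backward)
print(sorted(intersection), sorted(union))
```

The intersection gives high-precision points, while the union gives high recall.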

4. Get lexical translation table: estimate a maximum likelihood lexical translation table. We estimate the w(e|f) word translation table as well as the inverse w(f|e) table.
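A maximum likelihood estimate of the lexical tables reduces to relative frequencies over aligned word pairs. The following is a minimal sketch under the assumption that the aligned pairs have already been read from the word alignment; the function name and the toy pairs are hypothetical.

```python
from collections import Counter

def lexical_tables(aligned_pairs):
    """Estimate w(e|f) and w(f|e) by relative frequency.

    aligned_pairs: list of (f_word, e_word) tuples taken from the word alignment.
    """
    pair_counts = Counter(aligned_pairs)
    f_counts = Counter(f for f, _ in aligned_pairs)
    e_counts = Counter(e for _, e in aligned_pairs)
    w_e_given_f = {(e, f): c / f_counts[f] for (f, e), c in pair_counts.items()}
    w_f_given_e = {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
    return w_e_given_f, w_f_given_e

pairs = [("घर", "house"), ("घर", "house"), ("घर", "home"), ("मकान", "house")]  # toy data
w_e_given_f, w_f_given_e = lexical_tables(pairs)
print(w_e_given_f[("house", "घर")], w_f_given_e[("घर", "house")])
```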

5. Extract phrases: all phrases are dumped into one big file.

6. Score phrases: estimate the phrase translation probability φ(e|f).

जहानाबाद *दरभंगा ||| darbhanga* navada* ||| 1 1 1 1 ||| 0-0 1-1 ||| 1 1
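A phrase-table entry like the one above is a sequence of fields separated by "|||". The sketch below splits such a line into source phrase, target phrase, scores, word alignment and counts. The field names used as dictionary keys are labels chosen for this example, not identifiers from Moses.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into its '|||'-separated fields."""
    fields = [field.strip() for field in line.split("|||")]
    source, target, scores, alignment, counts = fields[:5]
    return {
        "source": source.split(),
        "target": target.split(),
        "scores": [float(x) for x in scores.split()],
        # each alignment point is "sourceIndex-targetIndex"
        "alignment": [tuple(int(i) for i in point.split("-"))
                      for point in alignment.split()],
        "counts": [float(x) for x in counts.split()],
    }

entry = parse_phrase_table_line("a b ||| x y ||| 1 1 1 1 ||| 0-0 1-1 ||| 1 1")
print(entry["alignment"])
```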

7. Build lexicalized reordering model

Moses uses lexicalized reordering models for reordering.

8. Build generation models

The generation model is built from the target side of the parallel corpus.

9. Create configuration file

As a final step, a configuration file for the decoder is generated with all the correct paths for the generated model and a number of default parameter settings.

Tuning

• Once training is over, the parameters of the log-linear model have to be tuned to avoid overfitting on the training data and to produce the most desirable translation on any test set. This process is called tuning. The basic assumption behind tuning is that the model must be tuned according to the evaluation technique.

• That is why the tuning technique is known as Minimum Error Rate Training (MERT): the weights are chosen to minimise the error measured by the evaluation metric.
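The idea behind tuning can be illustrated with a toy reranker: each hypothesis has a vector of feature scores, the log-linear model ranks hypotheses by a weighted sum, and tuning picks the weights whose top-ranked hypotheses score best on a development set. This is only a sketch of the objective. Real MERT searches the weight space far more efficiently than the exhaustive loop below, and `exact_match` is a hypothetical stand-in for an evaluation metric such as BLEU; all feature values are invented.

```python
def best_hypothesis(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of feature scores."""
    return max(nbest, key=lambda h: sum(w * x for w, x in zip(weights, h["features"])))

def tune(dev_set, candidate_weights, metric):
    """Return the weight vector whose 1-best outputs maximise the metric."""
    def total(weights):
        return sum(metric(best_hypothesis(nbest, weights)["text"], reference)
                   for nbest, reference in dev_set)
    return max(candidate_weights, key=total)

def exact_match(hypothesis, reference):
    """Hypothetical stand-in metric: 1 if the output equals the reference."""
    return 1.0 if hypothesis == reference else 0.0

# Toy n-best list with two log-probability features (e.g. TM and LM scores).
nbest = [{"text": "good", "features": [-1.0, -3.0]},
         {"text": "bad", "features": [-2.0, -1.0]}]
dev_set = [(nbest, "good")]
print(tune(dev_set, [(1.0, 1.0), (1.0, 0.2)], exact_match))
```

With equal weights the second hypothesis wins; lowering the second weight lets the reference translation rank first, so that weight vector is selected.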

Working of the Models

1. Working of the Phrase-Based Model

• The Hindi sentence is first broken down into phrases based on statistics drawn from parallel corpora.

• These Hindi phrases are then translated into English phrases.

• The translated English phrases are reordered.

2. Working of the Hierarchical Model

• All the phases performed by Moses in the hierarchical model are the same as in the phrase-based model, but the rule extraction of the hierarchical model differs from that of phrase-based SMT.

It includes:

Data preparation

• Tokenization

• Truecasing

• Cleaning

Training

• Word alignment

• Rule extraction

• Glue rule

• Extract phrases with the phrase extraction table

• Reordering model

• Language modelling

Decoding

Tuning

BLEU score

Advantage of Hierarchical Model

• Hierarchical MT replaces the redundant rules used in phrase-based MT with a single rule.

• It also overcomes a problem of other models: it does not require annotated corpora at all, because it generates the annotation automatically.

• We are working on Hindi-to-English translation. English already has annotated data, and Hindi is automatically annotated by the hierarchical model.

• The grammar used is known as a synchronous context-free grammar.

Synchronous Context Free Grammar

• An SCFG is a kind of context-free grammar that generates pairs of strings.

• Example: S -> (I, मैं)

• This rule translates "I" in English to "मैं" in Hindi.

• This rule consists of terminals only, but rules may consist of terminals and non-terminals, as shown below.

• VP -> (V1 NP2, NP2 V1)

Rule Extraction with SCFG

• The hierarchical model not only reduces the size of the grammar; it also uses the same rules for parsing as well as translation.

Steps performed in rule extraction

• In the hierarchical model, phrases can be discontinuous: the sub-phrases inside them are replaced by the non-terminal X.

• Synchronization is required between sub-phrases. This model does not require a parser on the Hindi side because all phrases are labelled X.

This allows us to build useful translation rules such as

X -> (X1 का X2, X2 of X1)

• Some examples:

• भारत का प्रधान मंत्री -> Prime Minister of India

• जापान का प्रधान मंत्री -> Prime Minister of Japan

• चीन का वित्त मंत्री -> Finance Minister of China

• भारत का राष्ट्रीय पक्षी -> National bird of India

• The phrase-based model memorises all these phrases separately, but essentially they all have the same structure, i.e. X -> (X1 का X2, X2 of X1),

• where X1 is India or "भारत" and X2 is Prime Minister or "प्रधान मंत्री".
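To make the generalisation concrete, the sketch below applies the single rule X -> (X1 का X2, X2 of X1) to the examples, translating each sub-phrase with a small lexicon and swapping the two parts around "of". The lexicon and the function name are hypothetical; a real hierarchical decoder scores many competing rules with chart parsing instead of splitting on one fixed word.

```python
# Hypothetical toy lexicon covering the sub-phrases in the examples.
LEXICON = {
    "भारत": "India",
    "जापान": "Japan",
    "चीन": "China",
    "प्रधान मंत्री": "Prime Minister",
    "वित्त मंत्री": "Finance Minister",
    "राष्ट्रीय पक्षी": "National bird",
}

def translate(phrase):
    """Apply X -> (X1 का X2, X2 of X1); fall back to the lexicon for leaves."""
    left, separator, right = phrase.partition(" का ")
    if not separator:
        return LEXICON[phrase]                      # terminal phrase
    return f"{translate(right)} of {translate(left)}"

for hindi in ["भारत का प्रधान मंत्री", "चीन का वित्त मंत्री", "भारत का राष्ट्रीय पक्षी"]:
    print(hindi, "->", translate(hindi))
```

One rule plus six lexicon entries covers all the listed phrases, whereas a phrase-based table would store each full phrase separately.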

GLUE RULE

• Glue rules facilitate the concatenation of two trees originating from the same non-terminal. Here are the two glue rules:

• S -> (S1 X2, S1 X2)

• S -> (X1, X1)

• These two rules in conjunction can be used to concatenate discontiguous phrases. So, the input to the system is a sentence in Hindi and a set of SCFG rules extracted from the training set.

• To avoid a rule set of unmanageable size and to reduce decoding complexity, we typically set limits on possible rules:

• At most 2 non-terminal symbols

• At least one but at most 5 words per language

• Span of at most 15 words
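The limits above can be expressed as a small filter applied to each candidate rule during extraction. This is a sketch under assumptions: non-terminals are written as "X" followed by an index, "words per language" is read as the number of terminal words on each side of the rule, and the function name and parameters are hypothetical.

```python
def is_nonterminal(symbol):
    """Non-terminals are written X1, X2, ... in this sketch."""
    return symbol.startswith("X") and symbol[1:].isdigit()

def within_limits(source_side, target_side, span_length,
                  max_nonterminals=2, max_words=5, max_span=15):
    """Check one candidate rule against the three extraction limits."""
    nonterminals = [s for s in source_side if is_nonterminal(s)]
    if len(nonterminals) > max_nonterminals:
        return False
    for side in (source_side, target_side):
        words = [s for s in side if not is_nonterminal(s)]
        if not 1 <= len(words) <= max_words:
            return False
    return span_length <= max_span

print(within_limits(["X1", "का", "X2"], ["X2", "of", "X1"], span_length=6))
```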

3. Working of the Syntax Model

• Earlier models did not include any linguistic information in the training data, which produced grammatically incoherent output.

• The persistence of the reordering problem in translated text led to the development of the syntax-based model. In this model, Moses is trained on syntactic phrases on the target side.

• Syntactic information includes root word, word class and POS category. In our work, syntactic parsing is performed on the English side.

ADVANTAGES

• Since Hindi is a syntactically divergent language, this model overcomes the reordering problem faced in the phrase-based and hierarchical models.

• Syntax-based MT performs well for structurally divergent languages. Hindi follows SOV structure while English follows SVO structure.

• This model improves the grammaticality of the resulting sentence.

MODEL

Figure: the English parse tree for "He adores listening to music" is transformed into Hindi in three steps.

• Parse tree: VB -> PRP VB1 VB2 (He, adores); VB2 -> VB TO (listening); TO -> TO MN (to, music)

• Reordering: child nodes are reordered, e.g. VB -> PRP VB2 VB1 and VB2 -> TO VB

• Insertion: Hindi function words are inserted into the reordered tree

• Translation: वह संगीत सुनने के प्यार करते हैं

Working

• The string-to-tree model accepts a Hindi string as input, searches across multiple parsed English trees and finds the highest-scoring tree.

• Input is a string: व्यक्तिगत जीवन

• Translation rules:

• [SYM][X] personal [NN][X] [FRAG] ||| [SYM][X] व्यक्तिगत [NN][X] [X] ||| 0.0326378 0.6 0.0652757 1 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||

• [SYM][X] personal life [FRAG] ||| [SYM][X] व्यक्तिगत जीवन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||

• [SYM][X] personal life [TOP] ||| [SYM][X] व्यक्तिगत जीवन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||

• Decoding by Translation Rules-

• [0..3]: [3..3]=</s> [0..2]=S : S ->S -> S </s> :0-0 : c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-4,6,-11.5445,-5.99562,-7.46699,-1.60944,1.99979,-16.0431)

• [0..1]: [1..1]= X [0..0]=S : S ->S -> S X :0-0 1-1 : c=0 core=(0,-0,1,0,0,0,0,0.999896,0) 0core=(0,-2,3,-3.35156,-0.916291,-2.43527,0,0.999896,-7.74303)

• [0..0]: [0..0]=<s> : S ->S -> <s> :: c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-1,1,0,0,0,0,0,0)

• [1..1]: [1..1]=personel : X ->X -> व्यक्ति$गत :: c=0 core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,0) 0core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,-9.44562)

• The target tree it produces is

(TOP <s> (S (NP personal) (NP (NN life))) </s>)

• Output is a string: personal life

4. Working of Hybrid Translation

• The main disadvantage of statistical machine translation (SMT) is that it only translates phrases that were seen during training.

• Unseen phrases such as named entities are not translated.

• This leads to a low BLEU score. We can improve the BLEU score by translating named entities from an external source.

Working

Preprocessing -> Translation by Moses decoder -> Postprocessing

Preprocessing of data

Moses accepts data in the following format for hybrid translation:

आपको नए <n translation="monastery">आश्रम</n> के निर्माण के लिए कितने धन की आवश्यकता है

Translation by Moses decoder

We translate normally using the Moses decoder, which is trained on our data. The translation using the Moses decoder is:

How much money you need for the construction of the new आश्रम?

Here the word आश्रम is left untranslated.

Post-processing

The untranslated word can be translated by referring to the XML tags. The output obtained is:

How much money you need for the construction of the new monastery?
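The pre- and post-processing steps can be sketched as two small string functions. This is a minimal illustration, not the project's script: the function names and the one-entry dictionary are hypothetical, and only the `<n translation="...">` markup shown above is taken from the slides.

```python
import re
from xml.sax.saxutils import quoteattr

def mark_up(sentence, dictionary):
    """Preprocessing: wrap words found in an external dictionary in <n> tags."""
    tokens = []
    for token in sentence.split():
        if token in dictionary:
            tokens.append(f"<n translation={quoteattr(dictionary[token])}>{token}</n>")
        else:
            tokens.append(token)
    return " ".join(tokens)

def post_process(marked_source, decoder_output):
    """Postprocessing: replace words left untranslated using the XML tags."""
    pattern = r'<n translation="([^"]*)">([^<]*)</n>'
    for translation, word in re.findall(pattern, marked_source):
        decoder_output = decoder_output.replace(word, translation)
    return decoder_output

dictionary = {"आश्रम": "monastery"}                      # hypothetical external source
marked = mark_up("नए आश्रम के निर्माण", dictionary)
print(marked)
print(post_process(marked, "the construction of the new आश्रम"))
```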

Result of Hybrid Translation

• Exclusive: Only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.

• Inclusive: The XML-specified translation competes with all the phrase table choices for that span.

• Ignore: The XML-specified translation is ignored completely.

Xml-exclusive: 7.21

Xml-inclusive: 7.36

Xml-ignore: 6.18

Syntax Model Parsing Extended

BERKELEY PARSER

We used the Berkeley parser for parsing English in our project. Since we had a parser for English, we trained our system on string-to-tree and tree-to-string models.

Input: Economic Services

ENJU PARSER

With a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, this parser can effectively analyze syntactic/semantic structures of English sentences and provide a user with phrase structures and predicate-argument structures.

Motivation

• Moses accepts data for training the syntax model in XML format.

• <tree label="NP"> <tree label="DET"> the </tree> <tree label="NN"> cat </tree> </tree>

• There are a number of parsers available for parsing. Each parser has its own idiosyncratic input and output format. Hence, we need to convert the output of these parsers into the format compatible with Moses for the syntax model. There are 3 wrapper scripts available in the Moses decoder under scripts/training/wrappers for converting parser output into Moses format. These are:

• parse-en-collins.perl: used with the Collins parser available from MIT.

• parse-de-bitpar.perl: used with the BitPar parser available from the University of Munich.

• parse-de-berkeley.perl: used with the Berkeley parser available from UC Berkeley.

• We used the Enju parser for our experiment, and no wrapper script was available for it.

• Hence we wrote a wrapper script to convert Enju parser output to the Moses format for syntax trees.

Format Conversion

We designed a program to convert the XML output of the Enju parser to the Moses-compatible XML format. However, Enju and the Penn Treebank (PTB) have different syntactic categories.

Because the output of Enju is based on HPSG and differs from the annotation policy of PTB, tree structures and/or syntactic categories are often different from those given by PTB-style annotation. However, these mappings provide a clear image of what Enju expresses. So we mapped Enju categories to PTB style for our experiment.

Steps:

1. For every <sentence> tag, start an output string with <tree label="TOP">.

2. For every <cons> tag:

i. Retrieve its CAT value ($CAT_VALUE).

ii. Retrieve its XCAT value ($XCAT_VALUE).

iii. Check whether the XCAT value of the CONS element is non-empty.

iv. Find the corresponding POS tag by looking the category up in the mapping table.

v. Append <tree label="CONS_POS"> to the output string, where CONS_POS is the POS category derived from the mapping table.

3. For every <tok> tag:

i. Retrieve its POS value ($POS_VALUE).

ii. Append <tree label="POS"> to the output string, where POS is the POS category taken from the POS attribute of the tok tag.

4. For every closing </sentence> tag, add a closing </tree> tag.

5. For every closing </cons> tag, add a closing </tree> tag.

6. For every closing </tok> tag, add a closing </tree> tag.

7. All unnecessary attributes are omitted.
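The steps above can be sketched as a short recursive conversion. This is not the wrapper script written for the project, only an illustration of the procedure: the `ENJU_TO_PTB` table is a hypothetical three-entry excerpt, the sample input is a simplified fragment in the style of Enju's XML, and the way CAT and XCAT are combined into a lookup key is an assumption.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical excerpt of an Enju-to-PTB category mapping table.
ENJU_TO_PTB = {"NX": "NP", "VX": "VP", "PX": "PP"}

def to_moses(node):
    """Convert one Enju XML element (and its children) to Moses tree markup."""
    if node.tag == "sentence":
        label = "TOP"                                   # step 1
    elif node.tag == "cons":                            # step 2
        cat, xcat = node.get("cat", ""), node.get("xcat", "")
        key = f"{cat} {xcat}".strip()                   # use XCAT only if non-empty
        label = ENJU_TO_PTB.get(key, ENJU_TO_PTB.get(cat, cat))
    elif node.tag == "tok":                             # step 3
        label = node.get("pos", "X")
    else:
        raise ValueError(f"unexpected element: {node.tag}")
    parts = [f"<tree label={quoteattr(label)}>"]        # other attributes are dropped
    if node.text and node.text.strip():
        parts.append(escape(node.text.strip()))
    parts.extend(to_moses(child) for child in node)
    parts.append("</tree>")                             # steps 4 to 6
    return " ".join(parts)

enju_xml = ('<sentence id="s0"><cons id="c0" cat="NX" xcat="">'
            '<tok id="t0" pos="DT">the</tok><tok id="t1" pos="NN">cat</tok>'
            '</cons></sentence>')
print(to_moses(ET.fromstring(enju_xml)))
```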

Challenges

• The deep syntactic parser we used was Enju (Miyao and Tsujii, 2005), which is based on HPSG and outputs both (dependency-like) predicate-argument relations (Miyao, 2007) and phrase structure trees (although these do not follow the PTB scheme for phrase structure trees) in an XML format.

• The Berkeley parser is a phrase-structure grammar parser based on PTB grammar.

• The outputs of the two parsers differ in tree structure, since Enju's syntactic representation is richer. The Enju parser produces strictly binary trees while the Berkeley parser's trees are not strictly binary. The trees also differ in the number of levels and in structure.

• This made the task of converting Enju output to Moses format difficult.

Conclusion

• We trained the syntax model on the converted Enju output. There was no major effect on the BLEU score.

Result

INTERFACE

Phrase Based Translation

• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं

• Output: it regions are caled yamuna par and they new delhi these are also joined by many bridges from

Hierarchical Based

• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं

• Output: so these regions are caled yamuna par and they from new delhi पुलों by भली भांति जुड़े front are

Syntax Based

• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं

• Output: it caled yamuna par regions are and it from new delhi of the world the very popular from पुलों by भली bridges from are

Corpus

Type            Source
Gyan Nidhi      Downloaded from Joshua
Miscellaneous   PM speech (July 2015), Budget Data (2014), Vigyan Prashar magazine
ACL 2005        Made available by CDAC, Noida
Agriculture     www.pib.gov.in, Govt. of India

Result of Comparing Models of SMT

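Figure: Comparison of SMT Models. Bar chart of scores (y-axis "Score", 0 to 16) for the Phrase, Hierarchical, Syntax ST and Syntax TS models (legend "Models") on the Agriculture, ACL 2005, Gyan Nidhi and Misc. corpora (x-axis "Corpus"). Data labels, in the order shown: 3.48, 6.18, 3.61, 3.45; 3.27, 13.8, 4.3, 5.2; 2.93, 10.79, 3.21, 2.9; 1.2, 2.3, 0.9, 1.5.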

Conclusion

We developed a Hindi-to-English translation system and compared the results obtained by various models. During the course of this project, the various translation models were evaluated, and we conclude that the hierarchical model is the best approach for this task. The result was verified on the various English and Hindi sentence corpora. The project was completed and successfully tested, and it produced the desired results.

Future Work

We need to:

• Perform and compare results of the factored model on Moses.

• Find and replace OOV words.

• Compare the effect of replacing OOV words on the BLEU score.

• Transliterate unknown words.

• We propose using the "word2vec" technique for hybrid translation, which can automate the process of generating dictionaries and the phrase table.

References

• Koehn, P., Och, F. J. and Marcu, D. Statistical Phrase-Based Translation. Information Sciences Institute, Department of Computer Science, University of Southern California.

• Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP).

• Zens, R. and Ney, H. (2004). Improvements in phrase-based statistical machine translation. In Proceedings of HLT-NAACL 2004.

• Behera, B. Hierarchical Phrase-Based Statistical Machine Translation System. M.Tech. project dissertation, under the guidance of Prof. Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay.

• Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

• Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 263–270, Stroudsburg, PA, USA. Association for Computational Linguistics.

• Sinha, R. M. K. and Thakur, A. (2005). Machine translation of bi-lingual Hindi-English (Hinglish) text. 10th Machine Translation Summit (MT Summit X), Phuket, Thailand, pages 149–156.

• Sachdeva, K., Srivastava, R., Jain, S. and Sharma, D. M. Hindi to English Machine Translation: Using Effective Selection in Multi-Model SMT. Language Technologies Research Center, International Institute of Information Technology, Hyderabad.

• Ahmed, A. and Hanneman, G. Syntax-Based Statistical Machine Translation: A Review.

• Aswani, N. and Gaizauskas, R. (2005). A hybrid approach to align sentences and words in English–Hindi parallel corpora. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

• Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing (2nd edition). Prentice Hall.