Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
DESCRIPTION
Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for using their bi-texts. We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.

TRANSCRIPT
Combining, Adapting and Reusing Bi-texts between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar, August 13, 2014, Moscow, Russia
Plan
Part I: Introduction to Statistical Machine Translation
Part II: Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
Part III: Further Discussion on SMT
The Problem:
Lack of Resources
Overview
Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
Problem: such training bi-texts do not exist for most languages.
Idea: adapt a bi-text for a related resource-rich language.
Building an SMT System for a New Language Pair
In theory: building one requires only a few hours/days.
In practice: large bi-texts are needed, and they are only available for
the official languages of the UN: Arabic, Chinese, English, French, Russian, Spanish
the official languages of the EU
some other languages
However, most of the 6,500+ world languages remain resource-poor from an SMT viewpoint.
This number is even more striking if we consider language pairs.
Even resource-rich language pairs become resource-poor in new domains.
Most Language Pairs Have Few Resources
Zipfian distribution of language resources
Building a Bi-text for SMT
Small bi-texts: relatively easy to build.
Large bi-texts: hard to get, e.g., because of copyright.
Sources: parliament debates and legislation
national: Canada, Hong Kong
international: United Nations; European Union (Europarl, Acquis)
Becoming an official language of the EU is an easy recipe for getting rich in bi-texts quickly.
Not all languages are so “lucky”, but many can still benefit.
How Google/Bing (Yandex?) Translate Resource-Poor Languages
How do we translate from Russian to Malay?
Use Triangulation
Cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009)
Russian → English → Malay
Phrase Table Pivoting (Cohn & Lapata, 2007; Wu & Wang, 2007)
рамочное соглашение ||| framework agreement ||| 0.7 …
perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
THUS: рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
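The pivoting arithmetic above (0.7 × 0.8 ≈ 0.56) can be sketched as follows; this is a minimal illustration that assumes simple marginalization over shared English pivot phrases, not the full machinery of a real phrase-table pivoting implementation:

```python
# Toy phrase tables sharing an English side; probabilities are illustrative.
ru_en = {("рамочное соглашение", "framework agreement"): 0.7}
ms_en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}

def pivot(src_en, tgt_en):
    """Induce a src-tgt phrase table by marginalizing over the shared
    English pivot: p(tgt | src) = sum over en of p(en | src) * p(tgt | en)."""
    induced = {}
    for (src, en1), p1 in src_en.items():
        for (tgt, en2), p2 in tgt_en.items():
            if en1 == en2:  # same English pivot phrase
                induced[(src, tgt)] = induced.get((src, tgt), 0.0) + p1 * p2
    return induced

table = pivot(ru_en, ms_en)
print(round(table[("рамочное соглашение", "perjanjian kerangka kerja")], 2))  # 0.56
```

With more than one shared pivot phrase, the products are summed, which is exactly the marginalization the slide's "THUS" step performs for a single pivot.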
Idea: reuse bi-texts from related resource-rich languages to build an improved SMT system for a related resource-poor language.
NOTE 1: this is NOT triangulation; we focus on translation into English,
e.g., Indonesian-English using Malay-English, rather than Indonesian → English → Malay or Indonesian → Malay → English
NOTE 2: We exploit the fact that the source languages are related
What if We Want to Translate into English?
Resource-poor vs. Resource-rich
Resource-rich vs. Resource-poor Languages
Related EU – non-EU/unofficial languages: Swedish – Norwegian, Bulgarian – Macedonian, Irish – Scottish Gaelic, Standard German – Swiss German
Related EU languages: Spanish – Catalan, Czech – Slovak
Related languages outside Europe: Russian – Ukrainian, MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi), Hindi – Urdu, Turkish – Azerbaijani, Malay – Indonesian
We will explore these pairs.
Related languages have overlapping vocabulary (cognates) and similar word order and syntax.
Motivation
Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Improving
Indonesian-English SMT
Using Malay-English
Malay vs. Indonesian
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
(from Article 1 of the Universal Declaration of Human Rights)
Malay Can Look “More Indonesian”…
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Indonesian: Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
~75% exact word overlap
Post-edited Malay to look “Indonesian” (by an Indonesian speaker).
(from Article 1 of the Universal Declaration of Human Rights)
We attempt to do this automatically: adapt Malay to look Indonesian.
Then, use it to improve SMT…
Method at a Glance
We have a small Indonesian-English bi-text (poor) and a large Malay-English bi-text (rich).
Step 1: Adaptation — adapt the Malay side of Malay-English to obtain an “Indonesian”-English bi-text.
Step 2: Combination — train on Indonesian-English + “Indonesian”-English.
Note that we have no Malay-Indonesian bi-text!
Step 1:
Adapting Malay-English
to “Indonesian”-English
Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair:
1. Adapt the Malay sentence to “Indonesian”, using word-level paraphrases, phrase-level paraphrases, and cross-lingual morphology.
2. Pair the adapted “Indonesian” with the English side of the Malay-English sentence pair.
Thus, we generate a new “Indonesian”-English sentence pair.
Source Language Adaptation for Resource-Poor Machine Translation (EMNLP 2012). Pidong Wang, Preslav Nakov, Hwee Tou Ng
Word-Level Bi-text Adaptation: Motivation
In many cases, word-level substitutions are enough to adapt Malay to Indonesian (training data):
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Indonesian: PDB Malaysia akan mencapai 8 persen pada tahun 2010.
English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
Word-Level Bi-text Adaptation: Overview
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
Indonesian translation options for each Malay word (with weights) are obtained by pivoting over English; the resulting lattice is decoded using a large Indonesian LM.
Pair each adapted sentence with its English counterpart.
Thus, we generate a new “Indonesian”-English bi-text.
(diagram: in the ML-EN bi-text, a Malay sentence ML1 … ML5 is word-aligned to an English sentence EN1 … EN4; in the IN-EN bi-text, an Indonesian sentence IN1 … IN4 is word-aligned to an English sentence EN11 EN3 EN12; ML-IN paraphrase candidates are linked through the shared English words)
Word-Level Adaptation: Extracting Paraphrases
Note: we have no Malay-Indonesian bi-text, so we pivot.
Word-Level Adaptation: Issue 1
The IN-EN bi-text is small, thus: unreliable IN-EN word alignments → bad ML-IN paraphrases.
Solution: improve the IN-EN alignments using the ML-EN bi-text:
concatenate IN-EN × k + ML-EN, where k ≈ |ML-EN| / |IN-EN|
run word alignment
keep the alignments for one copy of IN-EN only
This works because of cognates between Malay and Indonesian.
Word-Level Adaptation: Issue 2
The IN-EN bi-text is small, thus: a small IN vocabulary for the ML-IN paraphrases.
Solution: add cross-lingual morphological variants:
Given an ML word: seperminuman
Find its ML lemma: minum
Propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
Note: the IN variants are from a larger monolingual IN text.
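The cross-lingual morphology step can be sketched as below; the lexicons here are hypothetical toy data (a real system would use morphological analyzers for Malay and Indonesian and an index built from the monolingual Indonesian text):

```python
# Hypothetical toy lexicons: a Malay word -> lemma map, and an index of
# Indonesian words by lemma, harvested from monolingual Indonesian text.
ml_lemma = {"seperminuman": "minum"}
in_words_by_lemma = {
    "minum": ["minum", "minuman", "diminum", "meminum", "peminum"],
}

def morphological_variants(ml_word):
    """Propose Indonesian candidates that share the Malay word's lemma."""
    lemma = ml_lemma.get(ml_word)
    return in_words_by_lemma.get(lemma, [])

print(morphological_variants("seperminuman"))
```

The returned candidates then compete in the adaptation lattice alongside the pivoted paraphrases, scored by the Indonesian LM.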
Word-Level Adaptation: Issue 3
Word-level pivoting ignores context, relies on the LM, and cannot drop/insert/merge/split/reorder words.
Solution: phrase-level pivoting:
build ML-EN and EN-IN phrase tables
induce an ML-IN phrase table (pivoting over EN)
adapt the ML side of ML-EN to get an “IN”-EN bi-text, using an Indonesian LM and n-best “IN” lists as before
also use cross-lingual morphological variants
This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
Step 2:
Combining
IN-EN + “IN”-EN
Combining the IN-EN and “IN”-EN bi-texts:
simple concatenation: IN-EN + “IN”-EN
balanced concatenation: IN-EN × k + “IN”-EN
sophisticated phrase table combination: improved word alignments for IN-EN; phrase table combination with extra features
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. (EMNLP 2009)Preslav Nakov, Hwee Tou Ng
Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages. (JAIR, 2012)Preslav Nakov, Hwee Tou Ng
Concatenating bi-texts
Merging phrase tables
Combined method
Bi-text Combination Strategies
Concatenating Bi-texts (1)
Summary: concatenate X1-Y and X2-Y (X1: resource-poor; X2: related, resource-rich).
Advantages:
improved word alignments, e.g., for rare words
more translation options, fewer unknown words
useful non-compositional phrases (improved fluency)
phrases with words from X2 that do not exist in X1 are simply ignored
Disadvantages:
X2-Y will dominate: it is larger
translation probabilities are messed up
phrases from X1-Y and X2-Y cannot be distinguished
Concatenating Bi-texts (2)
Concat×k: concatenate k copies of the original and one copy of the additional training bi-text.
Concat×k:align:
1. Concatenate k copies of the original and one copy of the additional bi-text.
2. Generate word alignments.
3. Truncate them, keeping only the alignments for one copy of the original bi-text.
4. Build a phrase table.
5. Tune the system using MERT.
The value of k is optimized on the development dataset.
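The concat×k:align recipe can be sketched as follows; `word_align` stands in for an external aligner such as GIZA++, and its call signature here is an assumption made for illustration:

```python
def concat_k_align(orig, extra, word_align, k):
    """Align k copies of the small bi-text concatenated with the large one,
    then keep the alignments for just one copy of the small bi-text.
    `orig` and `extra` are lists of (source, target) sentence pairs;
    `word_align` returns one alignment per input sentence pair."""
    corpus = orig * k + extra
    alignments = word_align(corpus)
    return alignments[:len(orig)]  # truncate: alignments for one copy of orig

# Toy aligner stub that "aligns" every sentence pair trivially.
fake_align = lambda corpus: [[(0, 0)] for _ in corpus]
pairs = [("ini buku", "this book")]
print(concat_k_align(pairs, [("a", "b")] * 3, fake_align, k=2))
```

The repetition gives the small bi-text enough statistical weight during alignment, while the truncation discards the duplicated (and the additional) sentence pairs before phrase extraction.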
Merging Phrase Tables (1)
Summary: build two separate phrase tables, then (a) use them together, (b) merge them, or (c) interpolate them.
Advantages:
phrases from X1-Y and X2-Y can be distinguished
the larger bi-text X2-Y does not dominate X1-Y
more translation options
probabilities are combined in a more principled manner
Disadvantages:
improved word alignments are not possible
Two-tables: Build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).
Merging Phrase Tables (2)
Merging Phrase Tables (3)
Interpolation: build two separate phrase tables, Torig and Textra, and combine them using linear interpolation:
Pr(e|s) = α · Pr_orig(e|s) + (1 − α) · Pr_extra(e|s)
The value of α is optimized on a development dataset.
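The interpolation formula can be sketched over dictionary-style phrase tables; one modeling assumption here is that an entry absent from one table contributes probability 0 on that side:

```python
def interpolate(t_orig, t_extra, alpha):
    """Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s)."""
    entries = set(t_orig) | set(t_extra)
    return {e: alpha * t_orig.get(e, 0.0) + (1 - alpha) * t_extra.get(e, 0.0)
            for e in entries}

t_orig = {("rumah", "house"): 0.6}
t_extra = {("rumah", "house"): 0.4, ("rumah", "home"): 0.5}
# ("rumah", "house"): 0.7*0.6 + 0.3*0.4 = 0.54; ("rumah", "home"): 0.3*0.5 = 0.15
print(interpolate(t_orig, t_extra, alpha=0.7))
```

In a real system each entry carries several feature scores rather than a single probability, but each conditional probability feature is interpolated in the same way.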
Merging Phrase Tables (4)
Merge:
1. Build separate phrase tables: Torig and Textra.
2. Keep all entries from Torig.
3. Add those entries from Textra that are not in Torig.
4. Add extra features:
F1: 1 if the entry came from Torig, 0 otherwise.
F2: 1 if the entry came from Textra, 0 otherwise.
F3: 1 if the entry was in both tables, 0 otherwise.
The feature weights are set using MERT, and the number of features is optimized on the development set.
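The Merge step can be sketched as below; one assumption made here for illustration is that an entry present in both tables keeps its Torig scores and gets F1 = 1, F2 = 0, F3 = 1:

```python
def merge(t_orig, t_extra):
    """Keep all Torig entries, add Textra entries missing from Torig,
    and append the provenance features F1, F2, F3 to each score vector."""
    merged = {}
    for entry, scores in t_orig.items():
        f3 = 1.0 if entry in t_extra else 0.0
        merged[entry] = scores + (1.0, 0.0, f3)       # F1=1, F2=0
    for entry, scores in t_extra.items():
        if entry not in t_orig:
            merged[entry] = scores + (0.0, 1.0, 0.0)  # F2=1
    return merged

t_orig = {("rumah", "house"): (0.6,)}
t_extra = {("rumah", "house"): (0.4,), ("rumah", "home"): (0.5,)}
print(merge(t_orig, t_extra))
```

The binary features let MERT learn how much to trust entries from each source table, which is what makes this combination "sophisticated" compared to plain concatenation.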
Combined Method
Use Merge to combine the phrase tables for concat×k:align (as Torig) and for concat×1 (as Textra).
Two parameters to tune:
the number of repetitions k
the number of extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2, and F3
This yields improved word alignments, improved lexical coverage, and phrases that can be distinguished by source table.
Experiments & Evaluation
Data (tokens)
Translation data (for IN-EN): IN2EN-train: 0.9M; IN2EN-dev: 37K; IN2EN-test: 37K; EN-monolingual: 5M
Adaptation data (for ML-EN → “IN”-EN): ML2EN: 8.6M; IN-monolingual: 20M
Isolated Experiments: Training on “IN”-EN only (BLEU)
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
WordParaph: 19.50
WordParaph+morph (A): 20.06
PhraseParaph: 20.63
PhraseParaph+morph (B): 20.89
System combination of A+B: 21.24
System combination using MEMT (Heafield and Lavie, 2010). Wang, Nakov & Ng (EMNLP 2012)
Combined Experiments: Training on IN-EN + “IN”-EN (BLEU)
simple concatenation: ML2EN baseline 18.49; system combination 21.55
balanced concatenation: ML2EN baseline 19.79; system combination 21.64
phrase table combination: ML2EN baseline 20.10; system combination 21.62
Wang, Nakov & Ng (EMNLP 2012)
Experiments: Improvements (BLEU)
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
phrase table combination: 20.10
best isolated system: 21.24
best combined system: 21.64
Wang, Nakov & Ng (EMNLP 2012)
Application to Other Languages & Domains
Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text: adapt BG-EN (11.5M words) to “MK”-EN (1.2M words).
Data: OPUS movie subtitles.
(BLEU)
BG2EN (A): 27.33
WordParaph+morph (B): 27.97
PhraseParaph+morph (C): 28.38
System combination of A+B+C: 29.05
Analysis
Paraphrasing Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
Wang, Nakov & Ng (EMNLP 2012)
Human Judgments
Is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.
Wang, Nakov & Ng (EMNLP 2012)
Reverse Adaptation
Idea: adapt the dev/test Indonesian input to “Malay”, then translate with a Malay-English system.
Input to SMT: a “Malay” lattice, or the 1-best “Malay” sentence from the lattice.
Adapting dev/test is worse than adapting the training bi-text: so, we need both the n-best adaptations and the LM.
Wang, Nakov & Ng (EMNLP 2012)
A Specialized Decoder (Instead of Moses)
Beam-Search Text Rewriting Decoder: The Algorithm
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation (NAACL 2013). Pidong Wang, Hwee Tou Ng
Beam-Search Text Rewriting Decoder: An Example (Twitter Normalization)
Wang, Nakov & Ng (NAACL 2013)
Hypothesis producers:
word-level mapping
phrase-level mapping
cross-lingual morphology mapping
Feature functions:
Indonesian LM
word penalty (target)
Malay word penalty (source)
phrase count
Wang, Nakov & Ng (NAACL 2013)
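A minimal sketch of such a beam-search rewriting decoder, with only a word-level hypothesis producer and a toy scoring function standing in for the Indonesian LM and penalty features; all names and numbers below are illustrative assumptions, not the published system:

```python
import heapq

def beam_decode(source, mappings, score_word, beam=5, keep_score=-0.1):
    """Rewrite `source` left to right.  At each position, the hypothesis
    producer proposes keeping the word or replacing it via `mappings`;
    hypotheses are scored incrementally and pruned to the top `beam`."""
    hyps = [(0.0, [])]  # (cumulative score, target words so far)
    for word in source:
        options = [(word, keep_score)] + mappings.get(word, [])
        expanded = [(score + map_score + score_word(prefix, cand),
                     prefix + [cand])
                    for score, prefix in hyps
                    for cand, map_score in options]
        hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])
    return hyps[0][1]

# Toy example: adapt two Malay words to Indonesian.
mappings = {"KDNK": [("PDB", -0.1)], "cecah": [("mencapai", -0.2)]}
in_lm = lambda prefix, w: 0.0 if w in {"PDB", "mencapai"} else -1.0
print(beam_decode(["KDNK", "cecah"], mappings, in_lm))  # ['PDB', 'mencapai']
```

The real decoder works at the sentence level with multiple producers (word, phrase, morphology) and a weighted combination of the feature functions listed above; the beam pruning shown here is the core control loop.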
Moses vs. the Specialized Decoder
Decoding level: phrase vs. sentence.
Features: Moses features vs. richer ones, e.g., a Malay word penalty.
Mappings: word-level + phrase-level (potentially, manual rules).
Cross-lingual variants: input lattice vs. feature function.
Wang, Nakov & Ng (NAACL 2013)
Moses vs. a Specialized Decoder: Isolated “IN”-EN Experiments (BLEU; Moses vs. specialized decoder)
WordPar: 19.50 vs. 20.39
WordPar+Morph: 20.06 vs. 20.46
PhrasePar: 20.63 vs. 20.85
PhrasePar+Morph: 20.89 vs. 21.07
System Combination: 21.24 vs. 21.76
Moses vs. a Specialized Decoder: Combining IN-EN and “IN”-EN (BLEU; ML2EN baseline / Moses / specialized decoder)
simple concat: 18.49 / 21.55 / 21.74
balanced concat: 19.79 / 21.64 / 21.81
phrase table combination: 20.10 / 21.62 / 22.03
Experiments: Improvements (BLEU)
ML2EN (baseline): 14.5
IN2EN (baseline): 18.67
phrase table combination: 20.1
best isolated system: 21.24
best combined system (Moses): 21.64
best combined system (specialized decoder): 22.03
MK→EN, Adapting BG-EN to “MK”-EN (BLEU)
BG2EN (A): 27.33
WordParaph+morph (B): 27.97
PhraseParaph+morph (C): 28.38
System combination of A+B+C: 29.05
With the specialized decoder: 29.35
Transliteration
Spanish vs. Portuguese
Spanish: Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.
Portuguese: Todos os seres humanos nascem livres e iguais em dignidade e em direitos. Dotados de razão e de consciência, devem agir uns para com os outros em espírito de fraternidade.
(from Article 1 of the Universal Declaration of Human Rights)
17% exact word overlap
67% approximate word overlap
The actual overlap is even higher.
Cognates
Linguistics definition: words derived from a common root, e.g., Latin tu (‘2nd person singular’), Old English thou, French tu, Spanish tú, German du, Greek sú. Orthography/phonetics/semantics: ignored.
Computational linguistics definition: words in different languages that are mutual translations and have a similar orthography, e.g., evolution vs. evolución vs. evolução vs. evoluzione. Orthography & semantics: important. Origin: ignored.
Cognates can differ a lot:
• night vs. nacht vs. nuit vs. notte vs. noite
• star vs. estrella vs. stella vs. étoile
• arbeit vs. rabota vs. robota (‘work’)
• father vs. père
• head vs. chef
Spelling Differences Between Cognates
Systematic spelling differences (Spanish – Portuguese):
different spelling: -ñ- ↔ -nh- (señor vs. senhor)
phonetic: -ción ↔ -ção (evolución vs. evolução); -é ↔ -ei (1st sing. past: visité vs. visitei); -ó ↔ -ou (3rd sing. past: visitó vs. visitou)
Occasional differences (Spanish – Portuguese): decir vs. dizer (‘to say’); Mario vs. Mário; María vs. Maria
Many of these can be learned automatically.
Automatic Transliteration
1. Extract likely cognates for Portuguese-Spanish.
2. Learn a character-level transliteration model.
3. Transliterate the Portuguese side of pt-en to look like Spanish.
Automatic Transliteration (2)
Extract pt-es cognates using English (en):
1. Induce pt-es word translation probabilities (pivoting over en).
2. Filter out word pairs whose translation probability is too low (thresholds: constants proposed in the literature).
3. Filter out word pairs whose orthographic similarity is too low, measured with the longest common subsequence.
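The orthographic-similarity filter can be sketched with the longest-common-subsequence ratio (LCSR); the threshold is one of the constants proposed in the literature, so it is left as a parameter here:

```python
def lcs_len(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def is_cognate_pair(w1, w2, threshold):
    """Keep a candidate pair only if its LCS ratio exceeds the threshold."""
    return lcs_len(w1, w2) / max(len(w1), len(w2)) >= threshold

print(lcs_len("evolución", "evolução"))  # shared subsequence "evolu" -> 5
```

Note that the accented characters count as distinct symbols here, which is why even clear cognates like evolución/evolução get a modest ratio; this is precisely the kind of systematic difference the transliteration model then learns to bridge.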
SMT-based Transliteration
Train & tune a monotone character-level SMT system.
Representation: words are split into space-separated character sequences.
Use it to transliterate the Portuguese side of pt-en.
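The character-level representation can be produced with a simple re-tokenization, marking original word boundaries with '_' so they can be restored after decoding (a simplified sketch; the actual systems also handle punctuation and sentence-final markers):

```python
def to_char_level(sentence):
    """Split every word into characters; '_' marks a word boundary."""
    return " _ ".join(" ".join(word) for word in sentence.split())

def from_char_level(chars):
    """Invert the transformation after character-level decoding."""
    return chars.replace(" ", "").replace("_", " ")

print(to_char_level("nacen libres"))               # n a c e n _ l i b r e s
print(from_char_level("n a c e n _ l i b r e s"))  # nacen libres
```

Once both sides of the training data are re-tokenized this way, a standard phrase-based system can be trained unchanged, with characters playing the role of words.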
ES→EN, Adapting PT-EN to “ES”-EN (BLEU; 10K ES-EN, 1.23M PT-EN)
PT-EN: original 5.34; transliterated 13.79
ES-EN: 22.87
phrase table combination: original 24.23; transliterated 26.24
Transliteration
vs.
Character-Level Translation
Macedonian vs. Bulgarian
MK→BG: Transliteration vs. Translation
(bar chart; BLEU scores: 10.74, 12.07, 22.74, 31.10, 32.19, 32.71, 33.94)
Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages (ACL 2012).Preslav Nakov, Jorg Tiedemann.
Character-Level SMT
MK: Никогаш не сум преспала цела сезона.
BG: Никога не съм спала цял сезон.
Character-level representation:
MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
Character-Level Phrase Pairs
Can cover:
word prefixes/suffixes
entire words
word sequences
combinations thereof
Max phrase length = 10; LM order = 10.
MK→BG: The Impact of Data Size
Slavic Languages in Europe
MK → SR, SL, CZ
Pivoting
MK→EN: Pivoting over BG
Macedonian: Никогаш не сум преспала цела сезона.
Bulgarian: Никога не съм спала цял сезон.
English: I’ve never slept for an entire season.
For related languages: subword transformations; character-level translation.
MK→EN: Pivoting over BG
Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets (RANLP 2013). Jorg Tiedemann, Preslav Nakov
MK→EN: Using Synthetic “MK”-EN Bi-texts
Translate Bulgarian to Macedonian in a BG-XX corpus.
All synthetic data combined (+mk-en): 36.69 BLEU. Tiedemann & Nakov (RANLP 2013)
Conclusion
Conclusion & Future Work
Adapt bi-texts for related resource-rich languages, using confusion networks: word-level & phrase-level paraphrasing; cross-lingual morphological analysis.
Character-level models: translation; transliteration; pivoting vs. synthetic data.
Future work: other languages & NLP problems; robustness to noise and domain shift.
Thank you!
Related Work
Related Work (1)
Machine translation between related languages, e.g.:
Cantonese–Mandarin (Zhang, 1998)
Czech–Slovak (Hajic & al., 2000)
Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
Irish–Scottish Gaelic (Scannell, 2006)
Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
We do not translate (no training data); we “adapt”.
Related Work (2)
Adapting dialects to the standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011): manual rules and/or language-specific tools.
Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011): informal text (spelling, abbreviations, slang); same language.
Related Work (3)
Adapting Brazilian to European Portuguese (Marujo & al., 2011): rule-based, language-dependent; tiny improvements for SMT.
Reusing bi-texts between related languages (Nakov & Ng, 2009): no language adaptation (just transliteration).
Cascaded/pivoted translation (Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009):
poor → rich → X requires an additional poor-rich bi-text
rich → X → poor does not use the poor-rich similarity
Ours: reuse the rich language's bi-text directly, exploiting the similarity between poor and rich.