Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
DESCRIPTION
Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for using their bi-texts. We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.

TRANSCRIPT
Combining, Adapting and Reusing Bi-texts between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar, August 13, 2014, Moscow, Russia
Plan
Part I: Introduction to Statistical Machine Translation
Part II: Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
Part III: Further Discussion on SMT
The Problem:
Lack of Resources
Overview
Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
Problem: such training bi-texts do not exist for most languages.
Idea: adapt a bi-text for a related resource-rich language.
Building an SMT System for a New Language Pair
In theory: building one requires only a few hours/days.
In practice: large bi-texts are needed, and they are only available for
the official languages of the UN: Arabic, Chinese, English, French, Russian, Spanish
the official languages of the EU
some other languages
However, most of the 6,500+ world languages remain resource-poor from an SMT viewpoint.
This number is even more striking if we consider language pairs.
Even resource-rich language pairs become resource-poor in new domains.
Most Language Pairs Have Few Resources
Zipfian distribution of language resources
Building a Bi-text for SMT
Small bi-texts: relatively easy to build.
Large bi-texts: hard to get, e.g., because of copyright.
Sources: parliament debates and legislation
national: Canada, Hong Kong
international: United Nations; European Union (Europarl, Acquis)
Becoming an official language of the EU is an easy recipe for getting rich in bi-texts quickly.
Not all languages are so “lucky”, but many can still benefit.
How Google/Bing (Yandex?) Translate Resource-Poor Languages
How do we translate from Russian to Malay?
Use Triangulation
Cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009)
Russian → English → Malay
Phrase Table Pivoting (Cohn & Lapata, 2007; Wu & Wang, 2007)
рамочное соглашение ||| framework agreement ||| 0.7 …
perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
THUS: рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
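The pivoting arithmetic above (0.7 × 0.8 ≈ 0.56) can be sketched as follows; this is a minimal illustration that assumes simple marginalization over shared English pivot phrases, not the full machinery of a real phrase-table pivoting implementation:

```python
# Toy phrase tables sharing an English side; probabilities are illustrative.
ru_en = {("рамочное соглашение", "framework agreement"): 0.7}
ms_en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}

def pivot(src_en, tgt_en):
    """Induce a src-tgt phrase table by marginalizing over the shared
    English pivot: p(tgt | src) = sum over en of p(en | src) * p(tgt | en)."""
    induced = {}
    for (src, en1), p1 in src_en.items():
        for (tgt, en2), p2 in tgt_en.items():
            if en1 == en2:  # same English pivot phrase
                induced[(src, tgt)] = induced.get((src, tgt), 0.0) + p1 * p2
    return induced

table = pivot(ru_en, ms_en)
print(round(table[("рамочное соглашение", "perjanjian kerangka kerja")], 2))  # 0.56
```

With more than one shared pivot phrase, the products are summed, which is exactly the marginalization the slide's "THUS" step performs for a single pivot.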
Idea: reuse bi-texts from related resource-rich languages to build an improved SMT system for a related resource-poor language.
NOTE 1: this is NOT triangulation; we focus on translation into English,
e.g., Indonesian-English using Malay-English, rather than Indonesian → English → Malay or Indonesian → Malay → English
NOTE 2: We exploit the fact that the source languages are related
What if We Want to Translate into English?
Resource-poor vs. Resource-rich
Resource-rich vs. Resource-poor Languages
Related EU – non-EU/unofficial languages: Swedish – Norwegian, Bulgarian – Macedonian, Irish – Scottish Gaelic, Standard German – Swiss German
Related EU languages: Spanish – Catalan, Czech – Slovak
Related languages outside Europe: Russian – Ukrainian, MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi), Hindi – Urdu, Turkish – Azerbaijani, Malay – Indonesian
We will explore these pairs.
Related languages have overlapping vocabulary (cognates) and similar word order and syntax.
Motivation
Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Improving
Indonesian-English SMT
Using Malay-English
Malay vs. Indonesian
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
(from Article 1 of the Universal Declaration of Human Rights)
Malay Can Look “More Indonesian”…
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Indonesian: Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
~75% exact word overlap
Post-edited Malay to look “Indonesian” (by an Indonesian speaker).
(from Article 1 of the Universal Declaration of Human Rights)
We attempt to do this automatically: adapt Malay to look Indonesian.
Then, use it to improve SMT…
Method at a Glance
We have a small Indonesian-English bi-text (poor) and a large Malay-English bi-text (rich).
Step 1: Adaptation — adapt the Malay side of Malay-English to obtain an “Indonesian”-English bi-text.
Step 2: Combination — train on Indonesian-English + “Indonesian”-English.
Note that we have no Malay-Indonesian bi-text!
Step 1:
Adapting Malay-English
to “Indonesian”-English
Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair:
1. Adapt the Malay sentence to “Indonesian”, using word-level paraphrases, phrase-level paraphrases, and cross-lingual morphology.
2. Pair the adapted “Indonesian” with the English side of the Malay-English sentence pair.
Thus, we generate a new “Indonesian”-English sentence pair.
Source Language Adaptation for Resource-Poor Machine Translation (EMNLP 2012). Pidong Wang, Preslav Nakov, Hwee Tou Ng
Word-Level Bi-text Adaptation: Motivation
In many cases, word-level substitutions are enough to adapt Malay to Indonesian (training data):
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Indonesian: PDB Malaysia akan mencapai 8 persen pada tahun 2010.
English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
Word-Level Bi-text Adaptation: Overview
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
Indonesian translation options for each Malay word (with weights) are obtained by pivoting over English; the resulting lattice is decoded using a large Indonesian LM.
Pair each adapted sentence with its English counterpart.
Thus, we generate a new “Indonesian”-English bi-text.
(diagram: in the ML-EN bi-text, a Malay sentence ML1 … ML5 is word-aligned to an English sentence EN1 … EN4; in the IN-EN bi-text, an Indonesian sentence IN1 … IN4 is word-aligned to an English sentence EN11 EN3 EN12; ML-IN paraphrase candidates are linked through the shared English words)
Word-Level Adaptation: Extracting Paraphrases
Note: we have no Malay-Indonesian bi-text, so we pivot.
Word-Level Adaptation: Issue 1
The IN-EN bi-text is small, thus: unreliable IN-EN word alignments → bad ML-IN paraphrases.
Solution: improve the IN-EN alignments using the ML-EN bi-text:
concatenate IN-EN × k + ML-EN, where k ≈ |ML-EN| / |IN-EN|
run word alignment
keep the alignments for one copy of IN-EN only
This works because of cognates between Malay and Indonesian.
Word-Level Adaptation: Issue 2
The IN-EN bi-text is small, thus: a small IN vocabulary for the ML-IN paraphrases.
Solution: add cross-lingual morphological variants:
Given an ML word: seperminuman
Find its ML lemma: minum
Propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
Note: the IN variants are from a larger monolingual IN text.
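The cross-lingual morphology step can be sketched as below; the lexicons here are hypothetical toy data (a real system would use morphological analyzers for Malay and Indonesian and an index built from the monolingual Indonesian text):

```python
# Hypothetical toy lexicons: a Malay word -> lemma map, and an index of
# Indonesian words by lemma, harvested from monolingual Indonesian text.
ml_lemma = {"seperminuman": "minum"}
in_words_by_lemma = {
    "minum": ["minum", "minuman", "diminum", "meminum", "peminum"],
}

def morphological_variants(ml_word):
    """Propose Indonesian candidates that share the Malay word's lemma."""
    lemma = ml_lemma.get(ml_word)
    return in_words_by_lemma.get(lemma, [])

print(morphological_variants("seperminuman"))
```

The returned candidates then compete in the adaptation lattice alongside the pivoted paraphrases, scored by the Indonesian LM.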
Word-Level Adaptation: Issue 3
Word-level pivoting ignores context, relies on the LM, and cannot drop/insert/merge/split/reorder words.
Solution: phrase-level pivoting:
build ML-EN and EN-IN phrase tables
induce an ML-IN phrase table (pivoting over EN)
adapt the ML side of ML-EN to get an “IN”-EN bi-text, using an Indonesian LM and n-best “IN” lists as before
also use cross-lingual morphological variants
This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
Step 2:
Combining
IN-EN + “IN”-EN
Combining the IN-EN and “IN”-EN bi-texts:
simple concatenation: IN-EN + “IN”-EN
balanced concatenation: IN-EN × k + “IN”-EN
sophisticated phrase table combination: improved word alignments for IN-EN; phrase table combination with extra features
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. (EMNLP 2009)Preslav Nakov, Hwee Tou Ng
Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages. (JAIR, 2012)Preslav Nakov, Hwee Tou Ng
Concatenating bi-texts
Merging phrase tables
Combined method
Bi-text Combination Strategies
Concatenating Bi-texts (1)
Summary: concatenate X1-Y and X2-Y (X1: resource-poor; X2: related, resource-rich).
Advantages:
improved word alignments, e.g., for rare words
more translation options, fewer unknown words
useful non-compositional phrases (improved fluency)
phrases with words from X2 that do not exist in X1 are simply ignored
Disadvantages:
X2-Y will dominate: it is larger
translation probabilities are messed up
phrases from X1-Y and X2-Y cannot be distinguished
Concatenating Bi-texts (2)
Concat×k: concatenate k copies of the original and one copy of the additional training bi-text.
Concat×k:align:
1. Concatenate k copies of the original and one copy of the additional bi-text.
2. Generate word alignments.
3. Truncate them, keeping only the alignments for one copy of the original bi-text.
4. Build a phrase table.
5. Tune the system using MERT.
The value of k is optimized on the development dataset.
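The concat×k:align recipe can be sketched as follows; `word_align` stands in for an external aligner such as GIZA++, and its call signature here is an assumption made for illustration:

```python
def concat_k_align(orig, extra, word_align, k):
    """Align k copies of the small bi-text concatenated with the large one,
    then keep the alignments for just one copy of the small bi-text.
    `orig` and `extra` are lists of (source, target) sentence pairs;
    `word_align` returns one alignment per input sentence pair."""
    corpus = orig * k + extra
    alignments = word_align(corpus)
    return alignments[:len(orig)]  # truncate: alignments for one copy of orig

# Toy aligner stub that "aligns" every sentence pair trivially.
fake_align = lambda corpus: [[(0, 0)] for _ in corpus]
pairs = [("ini buku", "this book")]
print(concat_k_align(pairs, [("a", "b")] * 3, fake_align, k=2))
```

The repetition gives the small bi-text enough statistical weight during alignment, while the truncation discards the duplicated (and the additional) sentence pairs before phrase extraction.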
Merging Phrase Tables (1)
Summary: build two separate phrase tables, then (a) use them together, (b) merge them, or (c) interpolate them.
Advantages:
phrases from X1-Y and X2-Y can be distinguished
the larger bi-text X2-Y does not dominate X1-Y
more translation options
probabilities are combined in a more principled manner
Disadvantages:
improved word alignments are not possible
Two-tables: Build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).
Merging Phrase Tables (2)
Merging Phrase Tables (3)
Interpolation: build two separate phrase tables, Torig and Textra, and combine them using linear interpolation:
Pr(e|s) = α · Pr_orig(e|s) + (1 − α) · Pr_extra(e|s)
The value of α is optimized on a development dataset.
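The interpolation formula can be sketched over dictionary-style phrase tables; one modeling assumption here is that an entry absent from one table contributes probability 0 on that side:

```python
def interpolate(t_orig, t_extra, alpha):
    """Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s)."""
    entries = set(t_orig) | set(t_extra)
    return {e: alpha * t_orig.get(e, 0.0) + (1 - alpha) * t_extra.get(e, 0.0)
            for e in entries}

t_orig = {("rumah", "house"): 0.6}
t_extra = {("rumah", "house"): 0.4, ("rumah", "home"): 0.5}
# ("rumah", "house"): 0.7*0.6 + 0.3*0.4 = 0.54; ("rumah", "home"): 0.3*0.5 = 0.15
print(interpolate(t_orig, t_extra, alpha=0.7))
```

In a real system each entry carries several feature scores rather than a single probability, but each conditional probability feature is interpolated in the same way.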
Merging Phrase Tables (4)
Merge:
1. Build separate phrase tables: Torig and Textra.
2. Keep all entries from Torig.
3. Add those entries from Textra that are not in Torig.
4. Add extra features:
F1: 1 if the entry came from Torig, 0 otherwise.
F2: 1 if the entry came from Textra, 0 otherwise.
F3: 1 if the entry was in both tables, 0 otherwise.
The feature weights are set using MERT, and the number of features is optimized on the development set.
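The Merge step can be sketched as below; one assumption made here for illustration is that an entry present in both tables keeps its Torig scores and gets F1 = 1, F2 = 0, F3 = 1:

```python
def merge(t_orig, t_extra):
    """Keep all Torig entries, add Textra entries missing from Torig,
    and append the provenance features F1, F2, F3 to each score vector."""
    merged = {}
    for entry, scores in t_orig.items():
        f3 = 1.0 if entry in t_extra else 0.0
        merged[entry] = scores + (1.0, 0.0, f3)       # F1=1, F2=0
    for entry, scores in t_extra.items():
        if entry not in t_orig:
            merged[entry] = scores + (0.0, 1.0, 0.0)  # F2=1
    return merged

t_orig = {("rumah", "house"): (0.6,)}
t_extra = {("rumah", "house"): (0.4,), ("rumah", "home"): (0.5,)}
print(merge(t_orig, t_extra))
```

The binary features let MERT learn how much to trust entries from each source table, which is what makes this combination "sophisticated" compared to plain concatenation.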
Combined Method
Use Merge to combine the phrase tables for concat×k:align (as Torig) and for concat×1 (as Textra).
Two parameters to tune:
the number of repetitions k
the number of extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2, and F3
This yields improved word alignments, improved lexical coverage, and phrases that can be distinguished by source table.
Experiments & Evaluation
Data (tokens)
Translation data (for IN-EN): IN2EN-train: 0.9M; IN2EN-dev: 37K; IN2EN-test: 37K; EN-monolingual: 5M
Adaptation data (for ML-EN → “IN”-EN): ML2EN: 8.6M; IN-monolingual: 20M
Isolated Experiments: Training on “IN”-EN only (BLEU)
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
WordParaph: 19.50
WordParaph+morph (A): 20.06
PhraseParaph: 20.63
PhraseParaph+morph (B): 20.89
System combination of A+B: 21.24
System combination using MEMT (Heafield and Lavie, 2010). Wang, Nakov & Ng (EMNLP 2012)
Combined Experiments: Training on IN-EN + “IN”-EN (BLEU)
simple concatenation: ML2EN baseline 18.49; system combination 21.55
balanced concatenation: ML2EN baseline 19.79; system combination 21.64
phrase table combination: ML2EN baseline 20.10; system combination 21.62
Wang, Nakov & Ng (EMNLP 2012)
Experiments: Improvements (BLEU)
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
phrase table combination: 20.10
best isolated system: 21.24
best combined system: 21.64
Wang, Nakov & Ng (EMNLP 2012)
Application to Other Languages & Domains
Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text: adapt BG-EN (11.5M words) to “MK”-EN (1.2M words).
Data: OPUS movie subtitles.
(BLEU)
BG2EN (A): 27.33
WordParaph+morph (B): 27.97
PhraseParaph+morph (C): 28.38
System combination of A+B+C: 29.05
Analysis
Paraphrasing Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
Wang, Nakov & Ng (EMNLP 2012)
Human Judgments
Is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.
Wang, Nakov & Ng (EMNLP 2012)
Reverse Adaptation
Idea: adapt the dev/test Indonesian input to “Malay”, then translate with a Malay-English system.
Input to SMT: a “Malay” lattice, or the 1-best “Malay” sentence from the lattice.
Adapting dev/test is worse than adapting the training bi-text: so, we need both the n-best adaptations and the LM.
Wang, Nakov & Ng (EMNLP 2012)
A Specialized Decoder (Instead of Moses)
Beam-Search Text Rewriting Decoder: The Algorithm
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation (NAACL 2013). Pidong Wang, Hwee Tou Ng
Beam-Search Text Rewriting Decoder: An Example (Twitter Normalization)
Wang, Nakov & Ng (NAACL 2013)
Hypothesis producers:
word-level mapping
phrase-level mapping
cross-lingual morphology mapping
Feature functions:
Indonesian LM
word penalty (target)
Malay word penalty (source)
phrase count
Wang, Nakov & Ng (NAACL 2013)
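A minimal sketch of such a beam-search rewriting decoder, with only a word-level hypothesis producer and a toy scoring function standing in for the Indonesian LM and penalty features; all names and numbers below are illustrative assumptions, not the published system:

```python
import heapq

def beam_decode(source, mappings, score_word, beam=5, keep_score=-0.1):
    """Rewrite `source` left to right.  At each position, the hypothesis
    producer proposes keeping the word or replacing it via `mappings`;
    hypotheses are scored incrementally and pruned to the top `beam`."""
    hyps = [(0.0, [])]  # (cumulative score, target words so far)
    for word in source:
        options = [(word, keep_score)] + mappings.get(word, [])
        expanded = [(score + map_score + score_word(prefix, cand),
                     prefix + [cand])
                    for score, prefix in hyps
                    for cand, map_score in options]
        hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])
    return hyps[0][1]

# Toy example: adapt two Malay words to Indonesian.
mappings = {"KDNK": [("PDB", -0.1)], "cecah": [("mencapai", -0.2)]}
in_lm = lambda prefix, w: 0.0 if w in {"PDB", "mencapai"} else -1.0
print(beam_decode(["KDNK", "cecah"], mappings, in_lm))  # ['PDB', 'mencapai']
```

The real decoder works at the sentence level with multiple producers (word, phrase, morphology) and a weighted combination of the feature functions listed above; the beam pruning shown here is the core control loop.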
Moses vs. the Specialized Decoder
Decoding level: phrase vs. sentence.
Features: Moses features vs. richer ones, e.g., a Malay word penalty.
Mappings: word-level + phrase-level (potentially, manual rules).
Cross-lingual variants: input lattice vs. feature function.
Wang, Nakov & Ng (NAACL 2013)
Moses vs. a Specialized Decoder: Isolated “IN”-EN Experiments (BLEU; Moses vs. specialized decoder)
WordPar: 19.50 vs. 20.39
WordPar+Morph: 20.06 vs. 20.46
PhrasePar: 20.63 vs. 20.85
PhrasePar+Morph: 20.89 vs. 21.07
System Combination: 21.24 vs. 21.76
Moses vs. a Specialized Decoder: Combining IN-EN and “IN”-EN (BLEU; ML2EN baseline / Moses / specialized decoder)
simple concat: 18.49 / 21.55 / 21.74
balanced concat: 19.79 / 21.64 / 21.81
phrase table combination: 20.10 / 21.62 / 22.03
Experiments: Improvements (BLEU)
ML2EN (baseline): 14.5
IN2EN (baseline): 18.67
phrase table combination: 20.1
best isolated system: 21.24
best combined system (Moses): 21.64
best combined system (specialized decoder): 22.03
MK→EN, Adapting BG-EN to “MK”-EN (BLEU)
BG2EN (A): 27.33
WordParaph+morph (B): 27.97
PhraseParaph+morph (C): 28.38
System combination of A+B+C: 29.05
With the specialized decoder: 29.35
Transliteration
Spanish vs. Portuguese
Spanish: Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.
Portuguese: Todos os seres humanos nascem livres e iguais em dignidade e em direitos. Dotados de razão e de consciência, devem agir uns para com os outros em espírito de fraternidade.
(from Article 1 of the Universal Declaration of Human Rights)
17% exact word overlap
67% approximate word overlap
The actual overlap is even higher.
Cognates
Linguistics definition: words derived from a common root, e.g., Latin tu (‘2nd person singular’), Old English thou, French tu, Spanish tú, German du, Greek sú. Orthography/phonetics/semantics: ignored.
Computational linguistics definition: words in different languages that are mutual translations and have a similar orthography, e.g., evolution vs. evolución vs. evolução vs. evoluzione. Orthography & semantics: important. Origin: ignored.
Cognates can differ a lot:
• night vs. nacht vs. nuit vs. notte vs. noite
• star vs. estrella vs. stella vs. étoile
• arbeit vs. rabota vs. robota (‘work’)
• father vs. père
• head vs. chef
Spelling Differences Between Cognates
Systematic spelling differences (Spanish – Portuguese):
different spelling: -ñ- ↔ -nh- (señor vs. senhor)
phonetic: -ción ↔ -ção (evolución vs. evolução); -é ↔ -ei (1st sing. past: visité vs. visitei); -ó ↔ -ou (3rd sing. past: visitó vs. visitou)
Occasional differences (Spanish – Portuguese): decir vs. dizer (‘to say’); Mario vs. Mário; María vs. Maria
Many of these can be learned automatically.
Automatic Transliteration
1. Extract likely cognates for Portuguese-Spanish.
2. Learn a character-level transliteration model.
3. Transliterate the Portuguese side of pt-en to look like Spanish.
Automatic Transliteration (2)
Extract pt-es cognates using English (en):
1. Induce pt-es word translation probabilities (pivoting over en).
2. Filter out word pairs whose translation probability is too low (thresholds: constants proposed in the literature).
3. Filter out word pairs whose orthographic similarity is too low, measured with the longest common subsequence.
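The orthographic-similarity filter can be sketched with the longest-common-subsequence ratio (LCSR); the threshold is one of the constants proposed in the literature, so it is left as a parameter here:

```python
def lcs_len(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def is_cognate_pair(w1, w2, threshold):
    """Keep a candidate pair only if its LCS ratio exceeds the threshold."""
    return lcs_len(w1, w2) / max(len(w1), len(w2)) >= threshold

print(lcs_len("evolución", "evolução"))  # shared subsequence "evolu" -> 5
```

Note that the accented characters count as distinct symbols here, which is why even clear cognates like evolución/evolução get a modest ratio; this is precisely the kind of systematic difference the transliteration model then learns to bridge.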
SMT-based Transliteration
Train & tune a monotone character-level SMT system.
Representation: words are split into space-separated character sequences.
Use it to transliterate the Portuguese side of pt-en.
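The character-level representation can be produced with a simple re-tokenization, marking original word boundaries with '_' so they can be restored after decoding (a simplified sketch; the actual systems also handle punctuation and sentence-final markers):

```python
def to_char_level(sentence):
    """Split every word into characters; '_' marks a word boundary."""
    return " _ ".join(" ".join(word) for word in sentence.split())

def from_char_level(chars):
    """Invert the transformation after character-level decoding."""
    return chars.replace(" ", "").replace("_", " ")

print(to_char_level("nacen libres"))               # n a c e n _ l i b r e s
print(from_char_level("n a c e n _ l i b r e s"))  # nacen libres
```

Once both sides of the training data are re-tokenized this way, a standard phrase-based system can be trained unchanged, with characters playing the role of words.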
ES→EN, Adapting PT-EN to “ES”-EN (BLEU; 10K ES-EN, 1.23M PT-EN)
PT-EN: original 5.34; transliterated 13.79
ES-EN: 22.87
phrase table combination: original 24.23; transliterated 26.24
Transliteration
vs.
Character-Level Translation
Macedonian vs. Bulgarian
MK→BG: Transliteration vs. Translation
(bar chart; BLEU scores: 10.74, 12.07, 22.74, 31.10, 32.19, 32.71, 33.94)
Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages (ACL 2012).Preslav Nakov, Jorg Tiedemann.
Character-Level SMT
MK: Никогаш не сум преспала цела сезона.
BG: Никога не съм спала цял сезон.
Character-level representation:
MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
Character-Level Phrase Pairs
Can cover:
word prefixes/suffixes
entire words
word sequences
combinations thereof
Max phrase length = 10; LM order = 10.
MK→BG: The Impact of Data Size
Slavic Languages in Europe
MK → SR, SL, CZ
Pivoting
MK→EN: Pivoting over BG
Macedonian: Никогаш не сум преспала цела сезона.
Bulgarian: Никога не съм спала цял сезон.
English: I’ve never slept for an entire season.
For related languages: subword transformations; character-level translation.
MK→EN: Pivoting over BG
Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets (RANLP 2013). Jorg Tiedemann, Preslav Nakov
MK→EN: Using Synthetic “MK”-EN Bi-texts
Translate Bulgarian to Macedonian in a BG-XX corpus.
All synthetic data combined (+mk-en): 36.69 BLEU. Tiedemann & Nakov (RANLP 2013)
Conclusion
Conclusion & Future Work
Adapt bi-texts for related resource-rich languages, using confusion networks: word-level & phrase-level paraphrasing; cross-lingual morphological analysis.
Character-level models: translation; transliteration; pivoting vs. synthetic data.
Future work: other languages & NLP problems; robustness to noise and domain shift.
Thank you!
Related Work
Related Work (1)
Machine translation between related languages, e.g.:
Cantonese–Mandarin (Zhang, 1998)
Czech–Slovak (Hajic & al., 2000)
Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
Irish–Scottish Gaelic (Scannell, 2006)
Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
We do not translate (no training data); we “adapt”.
Related Work (2)
Adapting dialects to the standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011): manual rules and/or language-specific tools.
Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011): informal text (spelling, abbreviations, slang); same language.
Related Work (3)
Adapting Brazilian to European Portuguese (Marujo & al., 2011): rule-based, language-dependent; tiny improvements for SMT.
Reusing bi-texts between related languages (Nakov & Ng, 2009): no language adaptation (just transliteration).
Cascaded/pivoted translation (Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009):
poor → rich → X requires an additional poor-rich bi-text
rich → X → poor does not use the poor-rich similarity
Ours: reuse the rich language's bi-text directly, exploiting the similarity between poor and rich.