Using Pivot/Bridge Languages
![Page 1: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/1.jpg)
Using Pivot/Bridge Languages
Matthias Eck
![Page 2: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/2.jpg)
General Problem
Resources are available between languages A and B… and between languages B and C… but not C and A
How to train translation models between C and A?
[Diagram: languages A, B, and C. Resources connect A-B and B-C, but not C-A.]
![Page 3: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/3.jpg)
1st paper
Multipath Translation Lexicon Induction via Bridge Languages
Gideon S. Mann and David Yarowsky NAACL 2001
Method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages
![Page 4: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/4.jpg)
Lexicon via Cognate pairs
Lexicon: mapping of a word in the source language to words in the target language
Here: the lexicon is built between arbitrary languages using models of cognate pairs and cognate distance
![Page 5: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/5.jpg)
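General idea
[Diagram: Romance family. A dictionary links the source language to a bridge language, and a cognate model links the bridge to the target language. Languages shown: English, Spanish, Portuguese, Italian, French, Romanian.]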
![Page 6: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/6.jpg)
Cognate pairs can make up significant portion of lexicon if languages are in the same family and close
Translation pairs:

| English | French | Relationship |
|---|---|---|
| nephew | neveu | typical cognate pair |
| father | père | historically related, but now distant |
| water | eau | not related |
![Page 7: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/7.jpg)
Cognate string edit distance
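Obvious condition for a good distance D:

$$\forall s \in S,\; c, n \in T:\quad \text{if } (s,c) \text{ is a cognate pair and } (s,n) \text{ is a non-cognate pair, then } D(s,c) < D(s,n)$$

So we choose

$$\hat{t} = \arg\min_{t \in T} D(s,t)$$

…as the translation for s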
![Page 8: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/8.jpg)
Used distance measures
L: Levenshtein distance
  Minimum sum of the costs of edit operations required to transform one string into another
  Deletion, substitution, insertion: traditional cost 1
S: Stochastic transducers
  Probabilistic costs for each possible edit operation
H: Hidden Markov Model
  Each character has separate edit operation parameters
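The unit-cost Levenshtein distance (measure L) can be sketched with the usual dynamic program. This is a generic textbook implementation, not code from the paper; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertion, deletion, substitution each cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("nephew", "neveu"))  # 3
```

The stochastic-transducer and HMM measures replace the fixed costs with learned, operation-specific ones; they are not sketched here.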
![Page 9: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/9.jpg)
Distance Measures
Variants of Levenshtein distance:
  L-V: vowel substitutions cost only 0.5
  L-S/L-A: probabilities obtained by S are filtered into 3 classes: 0.5, 0.75, 1
    L-S: each language pair trained separately
    L-A: trained collectively for all Romance languages
Limitation: the method cannot discover translation pairs that have no surface-form relationship
Assumed cognate pairs: Levenshtein edit distance < 3
  Few false positives
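A minimal sketch of the L-V idea: the same dynamic program, but substituting one vowel for another costs 0.5 instead of 1. It assumes that "vowel" means the ASCII letters a, e, i, o, u (accented vowels would have to be added), and the name `levenshtein_v` is hypothetical.

```python
VOWELS = set("aeiou")  # assumption: ASCII vowels only


def levenshtein_v(a: str, b: str, vowel_sub_cost: float = 0.5) -> float:
    """Levenshtein distance with a reduced cost for vowel-to-vowel substitution."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_sub_cost  # cheap vowel-for-vowel substitution
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein_v("casa", "cosa"))  # 0.5
```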
![Page 10: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/10.jpg)
Intra Family Translation Lexicon Induction
Family: Romance languages
Available: dictionary (English/bridge language)
General evaluation algorithm:
1. Select 100 word pairs from dictionary for testing
2. For adaptive metrics: select hypothesized word pairs (edit distance < 3) as cognate pairs and train on them
3. For each word in source language select closest word from the 100 target words
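Steps 1 and 3 of this evaluation can be sketched as a nearest-neighbor search over the target side of the test pairs. The training step for adaptive metrics is omitted, the word pairs are toy examples rather than the paper's data, and the names `lev` and `nearest_neighbor_accuracy` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Compact recursive unit-cost Levenshtein distance."""
    if not a or not b:
        return len(a) + len(b)
    return min(lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1,
               lev(a[1:], b[1:]) + (a[0] != b[0]))


def nearest_neighbor_accuracy(test_pairs, distance=lev):
    """For each source word pick the closest target word; return the fraction correct."""
    targets = [t for _, t in test_pairs]
    correct = 0
    for source, gold in test_pairs:
        guess = min(targets, key=lambda t: distance(source, t))
        correct += (guess == gold)
    return correct / len(test_pairs)


# Toy Spanish/Portuguese pairs, for illustration only.
pairs = [("nuevo", "novo"), ("noche", "noite"), ("agua", "água"), ("libro", "livro")]
print(nearest_neighbor_accuracy(pairs))
```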
![Page 11: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/11.jpg)
Results
Source Languages: Spanish, French, Italian, Romanian
Target Language: Portuguese
1000 word pairs in dictionary for Spanish/Portuguese
900 for other language pairs
![Page 12: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/12.jpg)
Results
Pure Levenshtein distance works surprisingly well
S gives a boost on French-Portuguese
  Reason could be that Spanish-Portuguese are closer than French-Portuguese
L-S usually best
![Page 13: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/13.jpg)
Consonant-to-consonant edit operations
Most probable for French – Portuguese:

| French | Portuguese |
|---|---|
| n | m |
| c | g |
| p | f |
| g | n |
| b | v |
| p | f |
| x | s |
| s | c |
| c | q |
| g | v |
| t | d |
![Page 14: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/14.jpg)
Analysis
![Page 15: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/15.jpg)
Analysis - Example
![Page 16: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/16.jpg)
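Multiple bridge languages
[Diagram: Slavic family. A dictionary links the source language to several bridge languages, and cognate models link each bridge to the target language. Languages shown: English, Czech, Ukrainian, Russian, Polish, Serbian.]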
![Page 17: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/17.jpg)
Translation Lexicon Induction
Algorithm (One or more bridge languages)
For each word s ∈ S
  For each bridge language B
    Translate s → b ∈ B
    ∀ t ∈ T, calculate D(b,t)
    Rank t by D(b,t)
  Score t using information from all bridges
  Select highest scored t
  Map s → t
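The loop above can be sketched as follows. The slide does not say how the per-bridge rankings are merged, so the reciprocal-rank voting used here is an assumption, as are the names `induce_translation` and `lev` and the toy data.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Compact recursive unit-cost Levenshtein distance."""
    if not a or not b:
        return len(a) + len(b)
    return min(lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1,
               lev(a[1:], b[1:]) + (a[0] != b[0]))


def induce_translation(source_word, bridge_dicts, target_vocab, distance=lev):
    """bridge_dicts maps bridge language -> {source word: bridge word}."""
    scores = {t: 0.0 for t in target_vocab}
    for dictionary in bridge_dicts.values():
        b = dictionary.get(source_word)
        if b is None:
            continue  # this bridge has no entry for the source word
        ranked = sorted(target_vocab, key=lambda t: distance(b, t))
        for rank, t in enumerate(ranked):
            scores[t] += 1.0 / (rank + 1)  # assumption: reciprocal-rank voting
    return max(target_vocab, key=lambda t: scores[t])


# Toy example: English -> Portuguese via Spanish and Italian.
bridges = {"es": {"night": "noche"}, "it": {"night": "notte"}}
vocab = ["livro", "novo", "noite"]
print(induce_translation("night", bridges, vocab))  # noite
```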
![Page 18: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/18.jpg)
Results
One bridge language, but multiple paths
![Page 19: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/19.jpg)
Examples
![Page 20: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/20.jpg)
Different Pathways
English to Portuguese (via Romance languages)
English to Norwegian (via Germanic languages)
English to Ukrainian (via Slavic languages)
Portuguese to English (via Germanic languages, French)
![Page 21: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/21.jpg)
Results
![Page 22: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/22.jpg)
2nd Paper
Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages
Charles Schafer and David Yarowsky COLING 2002
Improves results of first paper by introducing additional similarity scores between candidate translations
![Page 23: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/23.jpg)
Basic Idea
Decompose:
P(English|Serbian) = P(English|Czech) x P(Czech|Serbian)
For any language L close to Czech: P(English|L) = P(English|Czech) x P(Czech|L)
P(Czech|L), as presented, was estimated using similarity on cognate pairs
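At the word level, this decomposition can be read as summing over the bridge words that connect a source word to an English word. The sketch below assumes that reading; the function name `compose` and all probabilities are invented for illustration.

```python
from collections import defaultdict


def compose(p_bridge_given_src, p_eng_given_bridge):
    """P(e|s) = sum over bridge words c of P(e|c) * P(c|s)."""
    out = defaultdict(float)
    for c, p_c in p_bridge_given_src.items():
        for e, p_e in p_eng_given_bridge.get(c, {}).items():
            out[e] += p_e * p_c
    return dict(out)


# Invented numbers: one Serbian word, two Czech candidates from the cognate model.
p_czech_given_serbian = {"prazdny": 0.7, "prizen": 0.3}
p_english_given_czech = {
    "prazdny": {"empty": 0.6, "blank": 0.4},
    "prizen": {"favor": 0.5, "grace": 0.3, "patronage": 0.2},
}
p_english = compose(p_czech_given_serbian, p_english_given_czech)
print(max(p_english, key=p_english.get))  # empty
```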
![Page 24: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/24.jpg)
Covered Languages
English via Czech (bridge): Polish, Slovak, Ukrainian, Bulgarian, Serbian, Slovene
English via Hindi (bridge): Nepali, Bengali, Marathi, Gujarati, Punjabi
![Page 25: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/25.jpg)
Resources
Serbian – Czech – English
  Czech – English dictionary: 171k word pairs
  Corpora: English: 192M words, Serbian: 12M words (news data from web)
Gujarati – Hindi – English
  Hindi – English dictionary: 74k word pairs
  Corpora: Gujarati: 2M words
![Page 26: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/26.jpg)
Problem with Cognate Pairs
| Serbian | Czech candidate | English | Result |
|---|---|---|---|
| prazan | prizen | favor, grace, patronage | not correct |
| | pazen | | |
| | prazdny | blank, empty | correct |
![Page 27: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/27.jpg)
Idea
Introduce additional similarity models:
  Weighted Levenshtein Similarity
  Context Similarity
  Date distributional Similarity
  Relative frequency Similarity
  Burstiness Similarity and Inverse Document Frequency
  Use of Additional Bridge Languages
Combine them with weighted string distance
![Page 28: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/28.jpg)
Weighted Levenshtein Similarity
1st iteration: vowel cluster operations have half the cost of single consonant substitutions, insertions and deletions
dist(vowel+, vowel+)
Use highest weighted of the top 2000 to re-estimate edit weights
Some high probability substitutions:
![Page 29: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/29.jpg)
Context Similarity
Compare narrow and wide contexts for candidates
Context: bag of words (narrow: radius 1 / wide: radius 10)
1. Calculate context on source language (Serbian)
2. Translate to English using current estimations
3. Compare with English contexts via cosine similarity
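The three steps can be sketched with sparse count vectors. Splitting a source word's count evenly among its translations is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, word, radius):
    """Step 1: bag of words within `radius` tokens of each occurrence of `word`."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
            ctx.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return ctx


def translate_context(ctx, lexicon):
    """Step 2: project counts into English (assumption: split evenly among translations)."""
    out = Counter()
    for w, n in ctx.items():
        translations = lexicon.get(w, [])
        for e in translations:
            out[e] += n / len(translations)
    return out


def cosine(u, v):
    """Step 3: cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Narrow and wide contexts differ only in `radius` (1 versus 10); each candidate's English context vector is built with `context_vector` on the English corpus and compared against the translated source vector.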
![Page 30: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/30.jpg)
Context Similarity - Example
Context of "nezavisnost" in Serbian corpus with frequencies:
pravo: 2, suvereniteti: 3, deklaracije: 3, pokrajina: 4
![Page 31: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/31.jpg)
Context Similarity - Example
Translate with initial lexicon
Serbian context of "nezavisnost": pravo: 2, suvereniteti: 3, deklaracije: 3, pokrajina: 4
Translated English context: justice: 2, majesty: 1.5, sovereignty: 1.5, declaration: 1.5, country: 4, ornamental: 1.5
![Page 32: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/32.jpg)
Context Similarity - Example
Context of candidates in English corpus
Candidates: expression, religion, independence, freedom
[Table: context-word counts for each candidate in the English corpus]
![Page 33: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/33.jpg)
Context Similarity - Example
Cosine similarity (COS) finds the correct candidate (independence)
![Page 34: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/34.jpg)
Date distributional Similarity
News data: events are reported in parallel in multiple languages (+/- 2 days)
Construct term frequency vectors over time and compare candidates
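A minimal sketch of comparing two words by their frequency over time. The slide does not say how the +/- 2 day tolerance enters the comparison; summing each day's count with its neighbors inside the window is an assumption made here, and the function names are hypothetical.

```python
import math


def smooth(counts, window=2):
    """Sum each day's count with its neighbors within +/- `window` days."""
    n = len(counts)
    return [sum(counts[max(0, d - window):min(n, d + window + 1)]) for d in range(n)]


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def date_similarity(src_counts, tgt_counts, window=2):
    """Compare per-day term frequency vectors of a source word and a candidate."""
    return cosine(smooth(src_counts, window), smooth(tgt_counts, window))


# A story reported one day later in the other language still matches well.
print(date_similarity([0, 0, 1, 0, 0], [0, 0, 0, 1, 0]))
```

Without the window (`window=0`), the two toy vectors above do not overlap at all and the similarity drops to 0.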
![Page 35: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/35.jpg)
Date distributional Similarity
![Page 36: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/36.jpg)
Relative Frequencies
Word and translation are likely to have similar relative frequencies
Modest frequency variations are expected
Useful to rule out pairings with several orders of magnitude difference in relative frequency
Ratio of logs of frequencies correlates well with translational compatibility
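One way to turn the "ratio of logs" into a score between 0 and 1 is sketched below. The exact formula used in the paper is not given on the slide, so this min/max form and the name `rf_similarity` are assumptions.

```python
import math


def rf_similarity(count_a, total_a, count_b, total_b):
    """Ratio of log relative frequencies, arranged so that 1.0 means identical."""
    la = abs(math.log(count_a / total_a))
    lb = abs(math.log(count_b / total_b))
    if max(la, lb) == 0.0:
        return 1.0  # both words make up their entire corpus
    return min(la, lb) / max(la, lb)


# Same relative frequency in corpora of different size: score 1.0.
print(rf_similarity(10, 1_000, 100, 10_000))
# Three orders of magnitude apart: a much lower score.
print(rf_similarity(10, 1_000, 10, 1_000_000))
```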
![Page 37: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/37.jpg)
Relative Frequency Similarity
Correct translation “laud” has higher RF Score than higher ranked incorrect candidates “calibre”, “quarter” and “class”
![Page 38: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/38.jpg)
Burstiness Similarity
Define Burstiness to measure differences
![Page 39: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/39.jpg)
Burstiness Similarity
Burstiness matches better for correct translations “laud” and “praise”
![Page 40: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/40.jpg)
Combine the different measures
1. Weighted Levenshtein distance to get initial candidate pairs
2. Calculate 8 similarity measures
  Weighted Levenshtein
  Wide bag-of-words context similarity
  Narrow bag-of-words context similarity
  Local News date distribution similarity
  All News date distribution similarity
  IDF similarity
  Burstiness similarity
![Page 41: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/41.jpg)
Combine the different measures
3. Integrate similarity measures into a single similarity function:
  1. POS similarity
    Bias in favor of compatible parts of speech (N, V, ADJ)
    Penalty for non-matching candidates
  2. Sort candidates for each score in decreasing order
    Assign ranks 0, 1, … and normalize by count
  3. Scoring: similarity models have associated weights
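The rank-based combination in sub-steps 2 and 3 can be sketched as follows. Combining the normalized ranks by a weighted sum is an assumption (the slide only says that models have weights), the POS bias is left out, and every model is assumed to score every candidate. The name `combine` and the toy scores are hypothetical.

```python
def combine(scores_by_model, weights):
    """scores_by_model: {model: {candidate: similarity}}, higher means more similar.

    Returns the candidates ordered from best to worst combined score.
    """
    combined = {}
    for model, scores in scores_by_model.items():
        ranked = sorted(scores, key=scores.get, reverse=True)  # decreasing similarity
        n = len(ranked)
        for rank, cand in enumerate(ranked):  # ranks 0, 1, ...
            # Normalize the rank by the candidate count, then apply the model weight.
            combined[cand] = combined.get(cand, 0.0) + weights[model] * rank / n
    return sorted(combined, key=combined.get)  # lowest weighted rank first


scores = {
    "levenshtein": {"empty": 0.9, "favor": 0.8, "blank": 0.5},
    "context": {"empty": 0.7, "favor": 0.1, "blank": 0.6},
}
weights = {"levenshtein": 1.0, "context": 2.0}
print(combine(scores, weights))  # ['empty', 'blank', 'favor']
```

Working on ranks rather than raw scores keeps measures with very different scales comparable before the weights are applied.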
![Page 42: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/42.jpg)
Weight Allocation
![Page 43: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/43.jpg)
Evaluation
3 evaluation criteria:
  Exact match accuracy
  Percentage of correct English translations in the top k ranks
  Median position of the per-word highest-ranked correct translation
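A sketch of the three criteria over ranked candidate lists. Positions are 1-based, the top-k figure is read as the share of words with a correct translation among the first k candidates, and the median is taken only over words whose correct translation appears at all; these readings and the function names are assumptions.

```python
from statistics import median


def best_correct_rank(ranked, gold):
    """1-based position of the highest-ranked correct translation, or None."""
    for pos, cand in enumerate(ranked, start=1):
        if cand in gold:
            return pos
    return None


def evaluate_rankings(predictions, references, k=10):
    """Return (exact-match accuracy, top-k accuracy, median best position)."""
    ranks = [best_correct_rank(predictions[w], references[w]) for w in references]
    n = len(ranks)
    exact = sum(r == 1 for r in ranks) / n
    top_k = sum(r is not None and r <= k for r in ranks) / n
    found = [r for r in ranks if r is not None]
    return exact, top_k, (median(found) if found else None)


predictions = {"a": ["x", "y"], "b": ["y", "x"], "c": ["z", "q", "x"]}
references = {"a": {"x"}, "b": {"x"}, "c": {"x"}}
print(evaluate_rankings(predictions, references, k=2))
```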
![Page 44: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/44.jpg)
Results
![Page 45: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/45.jpg)
Results
Improvements with second bridge language
![Page 46: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/46.jpg)
Additional Bridge Language Work
Interlingua based Statistical Machine Translation
Manuel Kauers, Stephan Vogel, Christian Fügen, Alex Waibel
ICSLP 2002
Paper covers SMT from text to a structured Interlingua format (IF)
An English/IF corpus is available… but we also want to translate other languages into IF
[Diagram: English → IF]
![Page 47: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/47.jpg)
Generalized problem
Assume we have translation models F → E and G → F… but we want G → E
Decompose:
Because:
[Diagram: G → F → E]
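The formulas for this slide are not part of the transcript. One standard way to write such a decomposition, given here only as an assumed sketch and not necessarily the authors' formulation, marginalizes over bridge-language sentences $f$ and assumes that $e$ depends on $g$ only through $f$:

```latex
\begin{align}
P(e \mid g) &= \sum_{f} P(e, f \mid g) \\
            &= \sum_{f} P(e \mid f, g)\, P(f \mid g) \\
            &\approx \sum_{f} P(e \mid f)\, P(f \mid g)
\end{align}
```

Keeping only the single best $f$ in the sum corresponds to translating G → F and then F → E in sequence.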
![Page 48: Using Pivot/Bridge Languages](https://reader035.vdocuments.site/reader035/viewer/2022062408/56813b22550346895da3ddb3/html5/thumbnails/48.jpg)
And just translating…
Experiments done during PF-STAR project 2003/2004
Training data: 48k lines of BTEC data
Test data: 506 lines, test set for CSTAR 2003
Translating Chinese → Italian
Also via a bridge language: Chinese → English → Italian
| System | Ch → It | Ch → En → It |
|---|---|---|
| ITC-IRST | 0.1769/4.5251 | 0.1695/4.4343 |
| CMU/UKA | 0.2030/4.8210 | 0.2238/4.9453 |