Cognates and Word Alignment in Bitexts
Greg Kondrak
University of Alberta

Outline

Background
Improving LCSR
Cognates vs. word alignment links
Experiments & results

Motivation

Claim: words that are orthographically similar are more likely to be mutual translations than words that are not similar.
Reason: the existence of cognates, which are usually orthographically and semantically similar.
Use: considering cognates can improve word alignment and translation models.

Objective

Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

MT applications

sentence alignment
word alignment
improving translation models
inducing translation lexicons
aid in manual alignment

Cognates

Similar in orthography or pronunciation.
Often mutual translations. May include:
– genetic cognates
– lexical loans
– names
– numbers
– punctuation

The task of cognate identification

Input: two words
Output: the likelihood that they are cognate
One method: compute their orthographic/phonetic/semantic similarity

Scope

The measures that we consider are:
– language-independent
– orthography-based
– operate on the level of individual letters
– binary identity function

Similarity measures

Prefix method
Dice coefficient
Longest Common Subsequence Ratio (LCSR)
Edit distance
Phonetic alignment
Many other methods

IDENT

1 if two words are identical, 0 otherwise.
The simplest similarity measure.
e.g. IDENT(colour, couleur) = 0

PREFIX

The ratio of the length of the longest common prefix of two words to the length of the longer word.
e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.29

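The PREFIX measure can be sketched in a few lines; the function name is mine, but the definition follows the slide exactly:

```python
def prefix_sim(x: str, y: str) -> float:
    """PREFIX: longest common prefix length over the longer word's length."""
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return k / max(len(x), len(y))

# PREFIX(colour, couleur): common prefix "co", longer word has 7 letters
print(prefix_sim("colour", "couleur"))  # 2/7 ≈ 0.2857
```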
DICE coefficient

The ratio of the number of common letter bigrams to the total number of letter bigrams.
e.g. DICE(colour, couleur) = 6/11 = 0.55

colour:  co ol lo ou ur
couleur: co ou ul le eu ur

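A minimal sketch of the DICE coefficient (name mine); bigrams are counted with multiplicity, which matches the slide's example:

```python
from collections import Counter

def dice_sim(x: str, y: str) -> float:
    """DICE: twice the number of shared letter bigrams over the total bigram count."""
    bx = Counter(x[i:i + 2] for i in range(len(x) - 1))
    by = Counter(y[i:i + 2] for i in range(len(y) - 1))
    shared = sum((bx & by).values())           # multiset intersection: co, ou, ur
    total = sum(bx.values()) + sum(by.values())
    return 2 * shared / total if total else 0.0

print(dice_sim("colour", "couleur"))  # 6/11 ≈ 0.5455
```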
Longest Common Subsequence Ratio (LCSR)

The ratio of the length of the longest common subsequence of two words to the length of the longer word.
e.g. LCSR(colour, couleur) = 5/7 = 0.71

c o - l o - u r
c o u l - e u r

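LCSR reduces to the classic longest-common-subsequence dynamic program; a self-contained sketch (function names mine):

```python
def lcs_len(x: str, y: str) -> int:
    """Longest common subsequence length via the standard O(mn) dynamic program."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(x: str, y: str) -> float:
    """LCSR: LCS length over the longer word's length."""
    return lcs_len(x, y) / max(len(x), len(y))

print(lcsr("colour", "couleur"))  # 5/7 ≈ 0.714
```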
LCSR

Method of choice in several papers.
Weak point: insensitive to word length.
Example:
– LCSR(walls, allés) = 0.8
– LCSR(sanctuary, sanctuaire) = 0.8
Sometimes a minimal word length is imposed.
A principled solution?

The random model

Assumption: strings are generated randomly from a given distribution of letters.
Problem: what is the probability of seeing k matches between two strings of length m and n?

A special case

Assumption: k = 0 (no matches)
t – alphabet size
S(n,i) – Stirling number of the second kind

Pr(LCS(m,n) = 0) = (1 / t^(m+n)) · Σ_{i=1}^{min(n,t)} S(n,i) · t!/(t−i)! · (t−i)^m

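For uniform letter probabilities, the k = 0 formula can be checked numerically. The code below is my reading of the (garbled) slide formula: condition on the i distinct letters of the length-n string, then require the length-m string to avoid all of them. Function names are mine.

```python
from math import comb, factorial

def stirling2(n: int, i: int) -> int:
    """Stirling number of the second kind, via inclusion-exclusion."""
    return sum((-1) ** j * comb(i, j) * (i - j) ** n for j in range(i + 1)) // factorial(i)

def pr_no_match(m: int, n: int, t: int) -> float:
    """Pr(LCS = 0) for two uniform random strings of lengths m and n over t letters."""
    # S(n,i) * t!/(t-i)! counts n-strings using exactly i distinct letters;
    # (t-i)^m counts m-strings avoiding those letters
    total = sum(
        stirling2(n, i) * (factorial(t) // factorial(t - i)) * (t - i) ** m
        for i in range(1, min(n, t) + 1)
    )
    return total / t ** (m + n)
```

For instance, with t = 2 and m = n = 2, only the pairs (aa, bb) and (bb, aa) share no letter, so the probability is 2/16.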
The problem

What is the probability of seeing k matches between two strings of length m and n?
An exact analytical formula is unlikely to exist.
A very similar problem has been studied in bioinformatics as statistical significance of alignment scores.
Approximations developed in bioinformatics are not applicable to words because of length differences.

Solutions for the general case

Sampling
– Not reliable for small probability values
– Works well for low k/n ratios (uninteresting)
– Depends on a given alphabet size and letter frequencies
– No insight
Inexact approximation
– Works well for high k/n ratios (interesting)
– Easy to use

Formula 1

p = Σ_{j=1}^{t} p_j²  (probability of a match)

Pr(LCS ≥ k) ≈ C(n,k) · C(m,k) · p^k · (1−p)^((n−k)(m−k))
            = C(n,k) · C(m,k) · exp(k·log p + (n−k)(m−k)·log(1−p))

Pr(LCS = k) = Pr(LCS ≥ k) − Pr(LCS ≥ k+1)

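A sketch of Formula 1 as reconstructed here (the slide's rendering is garbled, so the exact exponent is my reading); the log-space form delays numerical underflow. Function names are mine.

```python
from math import comb, exp, log

def match_prob(letter_probs) -> float:
    """p = sum_j p_j^2: probability that two random letters match."""
    return sum(q * q for q in letter_probs)

def pr_lcs_ge(k: int, m: int, n: int, p: float) -> float:
    """Formula 1: Pr(LCS >= k) ~ C(n,k) C(m,k) p^k (1-p)^((n-k)(m-k)).
    Valid for 0 <= k <= min(m, n) and 0 < p < 1."""
    if k <= 0:
        return 1.0
    return exp(log(comb(n, k)) + log(comb(m, k))
               + k * log(p) + (n - k) * (m - k) * log(1 - p))

def pr_lcs_eq(k: int, m: int, n: int, p: float) -> float:
    """Pr(LCS = k) = Pr(LCS >= k) - Pr(LCS >= k+1)."""
    upper = pr_lcs_ge(k + 1, m, n, p) if k < min(m, n) else 0.0
    return pr_lcs_ge(k, m, n, p) - upper
```

In the exact case k = m = n this collapses to p^n, matching the property noted on the next slide.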
Formula 1

Exact for k = m = n: Pr(L_{n,n} = n) = p^n
Inexact in general.
Reason: implicit independence assumption.
Lower bound for the actual probability.
Good approximation for high k/n ratios.
Runs into numerical problems for larger n.

Formula 2

E(x_k) = C(n,k) · C(m,k) · p^k ≈ Pr(LCS ≥ k)

Expected number of pairs of k-letter substrings.
Approximates the required probability for high k/n ratios.

Formula 2

Does not work for low k/n ratios.
Not monotonic.
Simpler than Formula 1.
More robust against numerical underflow for very long words.

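Formula 2 is a one-line expectation; the sketch below (name mine) also illustrates the non-monotonicity just noted:

```python
from math import comb

def expected_pairs(k: int, m: int, n: int, p: float) -> float:
    """Formula 2: E(x_k) = C(n,k) C(m,k) p^k, the expected number of
    pairs of k-letter substrings that match position for position."""
    return comb(n, k) * comb(m, k) * p ** k

# with p = 0.06 and m = n = 10, E(x_k) first rises and then falls with k
vals = [expected_pairs(k, 10, 10, 0.06) for k in (1, 2, 3)]  # ≈ 6.0, 7.29, 3.11
```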
Comparison of both formulas

Both are exact for k = m = n.
For k close to max(m,n):
– both formulas are good approximations
– their values are very close
Both can be quickly computed using dynamic programming.

LCSF

A new similarity measure based on Formula 2.
LCSR(X,Y) = k/n
LCSF(X,Y) = max(−log(C(n,k)² · p^k), 0)
LCSF is as fast as LCSR because its values, which depend only on k and n, can be pre-computed and stored.

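A sketch of LCSF under this reconstruction (the slide's functional form is garbled, so treat the formula as my reading of it). Since the score depends only on k and n, a table indexed by (k, n) can be filled once, making LCSF as cheap as LCSR at lookup time. Names and the cutoff MAX_LEN are mine.

```python
from math import comb, log

def lcsf_score(k: int, n: int, p: float) -> float:
    """LCSF(X, Y) = max(-log(C(n,k)^2 * p^k), 0),
    with k = LCS length and n = length of the longer word."""
    if k == 0:
        return 0.0
    return max(-(2 * log(comb(n, k)) + k * log(p)), 0.0)

# precompute for all word lengths up to MAX_LEN; p for a uniform 26-letter alphabet
MAX_LEN, p = 20, 1 / 26
TABLE = {(k, n): lcsf_score(k, n, p) for n in range(1, MAX_LEN + 1) for k in range(n + 1)}
```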
Evaluation – motivation

Intrinsic evaluation of orthographic similarity is difficult and subjective.
My idea: extrinsic evaluation on cognates and word-aligned bitexts.
– Most cross-language cognates are orthographically similar, and vice versa.
– Cognation is binary and not subjective.

Cognates vs. alignment links

Manual identification of cognates is tedious.
Manually word-aligned bitexts are available, but only some of the links are between cognates.
Question #1: can we use manually-constructed word alignment links instead?

Manual vs. automatic alignment links

Automatically word-aligned bitexts are easily obtainable, but a good fraction of the links are wrong.
Question #2: can we use machine-generated word alignment links instead?

Evaluation methodology

Assumption: a word-aligned bitext.
Treat aligned sentences as bags of words.
Compute similarity for all word pairs.
Order word pairs by their similarity value.
Compute precision against a gold standard:
– either a cognate list or alignment links

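The steps above can be sketched as follows; the similarity function is pluggable, and all names are mine:

```python
from itertools import product

def ranked_pairs(bitext, sim):
    """Treat each aligned sentence pair as two bags of words, score every
    cross-language word pair, and order pairs by decreasing similarity."""
    pairs = {(s, t) for src, tgt in bitext for s, t in product(src, tgt)}
    return sorted(pairs, key=lambda st: sim(*st), reverse=True)

def precision_at(ranked, gold, n: int) -> float:
    """Precision of the top-n pairs against a gold standard
    (a cognate list or a set of alignment links)."""
    return sum(1 for pair in ranked[:n] if pair in gold) / n
```

Any of the measures above (IDENT, PREFIX, DICE, LCSR, LCSF) can be passed in as `sim`.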
Test data

Blinker bitext (French-English)
– 250 Bible verse pairs
– manual word alignment
– all cognates manually identified
Hansards (French-English)
– 500 sentences
– manual and automatic word alignment
Romanian-English
– 248 sentences
– manually aligned

Blinker results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF, evaluated against cognates and against alignment links; y-axis from 0 to 0.9]

Hansards results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF against manual and automatic alignment links; y-axis from 0 to 0.8]

Romanian-English results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF against manual alignment links; y-axis from 0 to 0.7]

Contributions

We showed that word alignment links can be used instead of cognates for evaluating word similarity measures.
We proposed a new similarity measure which outperforms LCSR.

Future work

Extend our approach to length normalization to edit distance and other similarity measures.
Incorporate cognate information into statistical MT models as an additional feature function.

Thank you