
Page 1: Finding Translations for Low-Frequency Words in Comparable Corpora


Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev, Andrea Mulloni

ILP, University of Wolverhampton, UK

Contact email: [email protected]

Page 2: Overview

- Distributional Hypothesis and bilingual lexicon acquisition
- The effect of data sparseness
- Methods to model co-occurrence vectors of low-frequency words
- Experimental evaluation
- Conclusions

Page 3: Distributional Hypothesis in the bilingual context

- Words of different languages that appear in similar contexts are translationally equivalent
- Acquisition of bilingual lexicons from comparable, rather than parallel, corpora
- Bilingual comparable corpora: not translated texts, but texts of the same topic, size, and style of presentation
- Advantages over parallel corpora: broad coverage, easy domain portability, a virtually unlimited number of language pairs
- Parallel corpora = restoration of existing dictionaries

Page 4: General approach

- Comparable corpora in languages L1 and L2; words to be aligned: N1 and N2
- Extract co-occurrence data on N1 and N2 from the respective corpora: V1 and V2
- Create co-occurrence matrices N1×V1, each cell containing f(v,n) or p(v|n)
- Create a translation matrix V1×V2 using a bilingual lexicon:
  - equivalences between only the core vocabularies
  - each cell encodes a translation probability
  - used to map a vector from L1 into the vector space of L2
- Words with the most similar vectors are taken to be equivalent (see the sketch below)
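The pipeline can be sketched in a few lines of Python. The matrices below are random toys standing in for real co-occurrence data, and cosine similarity is used only for brevity (the experiments later measure similarity with the Jensen-Shannon divergence):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence matrices: rows are nouns, columns are context verbs,
# cells hold p(v|n) (rows normalised to sum to 1).
M1 = rng.random((5, 4)); M1 /= M1.sum(axis=1, keepdims=True)   # N1 x V1
M2 = rng.random((6, 3)); M2 /= M2.sum(axis=1, keepdims=True)   # N2 x V2

# Translation matrix over the core vocabularies: T[i, j] is the
# probability that source verb i translates as target verb j.
T = rng.random((4, 3)); T /= T.sum(axis=1, keepdims=True)      # V1 x V2

# Map each source vector into the target vector space ...
M1_mapped = M1 @ T                                             # N1 x V2

# ... then take the target noun with the most similar vector.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, vec in enumerate(M1_mapped):
    best = max(range(len(M2)), key=lambda j: cosine(vec, M2[j]))
    print(f"source noun {i} -> candidate target noun {best}")
```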

Page 5: Data sparseness

- The approach works quite unreliably on all but very frequent words (e.g., Gaussier et al. 2004)
- Polysemy and synonymy: many-to-many correspondences between the two vocabularies
- Noise introduced during the translation between vector spaces

Page 6: Data sparseness

[Chart: mean rank (0-180) of the correct equivalent by frequency rank (1-100 … 901-1000) for the six language pairs En-Fr, En-Ge, En-Sp, Fr-Ge, Fr-Sp, Ge-Sp]

Page 7: Dealing with data sparseness

- How can one deal with data sparseness?
- Various smoothing techniques exist: Good-Turing, Kneser-Ney, Katz's back-off
- Previous comparative studies: class-based smoothing (Resnik 1993), web-based smoothing (Keller & Lapata 2003), distance-based averaging (Pereira et al. 1993; Dagan et al. 1999)

Page 8: Distance-based averaging

The probability of an unseen co-occurrence, p*(v|n), is estimated from the known probabilities of N′, a set of nearest neighbours of n:

p*(v|n) = (1/norm) · Σ_{n′ ∈ N′} w(n, n′) · p(v|n′)

- w is the weight with which n′ influences the average of the known probabilities of N′; it is computed from the distance/similarity between n and n′
- norm is a normalisation factor
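A minimal sketch of the estimate, assuming hypothetical containers: probs maps (noun, verb) pairs to known probabilities, neighbours maps a noun to its set N′, and sim(n, n2) supplies the weight w:

```python
def dba_estimate(v, n, probs, neighbours, sim):
    """p*(v|n): similarity-weighted average of p(v|n') over the
    nearest neighbours n' of n, following the formula above."""
    weights = {n2: sim(n, n2) for n2 in neighbours[n]}
    norm = sum(weights.values())            # normalisation factor
    return sum(w * probs.get((n2, v), 0.0)
               for n2, w in weights.items()) / norm

# Toy usage with two neighbours of "automobile".
probs = {("car", "drive"): 0.4, ("vehicle", "drive"): 0.3}
neighbours = {"automobile": ["car", "vehicle"]}
sim = lambda a, b: 1.0                      # uniform weights for the demo
print(dba_estimate("drive", "automobile", probs, neighbours, sim))  # 0.35
```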

Page 9: Adjusting probabilities for rare co-occurrences

- DBA was used to predict unseen probabilities
- We would like to predict unseen probabilities, but also to adjust seen yet unreliable ones:

p*(v|n) = (1 − γ) · p(v|n) + γ · (1/norm) · Σ_{n′ ∈ N′} w(n, n′) · p(v|n′)

- 0 ≤ γ ≤ 1 is the degree to which the seen probability is smoothed with data on the neighbours
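A sketch of the adjustment, under the reading that γ weights the neighbour-based estimate (the slide defines γ only as the degree of smoothing, so this placement is an assumption):

```python
def adjusted_probability(p_seen, p_dba, gamma):
    """Interpolate a seen but possibly unreliable probability p(v|n)
    with the DBA estimate p*(v|n); gamma = 0 keeps the corpus value,
    gamma = 1 trusts the neighbours entirely (assumed reading)."""
    assert 0.0 <= gamma <= 1.0
    return (1.0 - gamma) * p_seen + gamma * p_dba
```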

Problem: how does one estimate γ?

Page 10: Heuristical estimation of γ

- The less frequent n is, the more it gets smoothed
- Corpus counts are log-transformed to downplay differences between frequent words
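The slide does not spell out the function, so the following is only one plausible reading: γ falls with the log-transformed corpus count, giving rare words heavy smoothing while flattening differences among frequent words.

```python
import math

def gamma_heuristic(freq, max_freq):
    """Hypothetical heuristic: gamma close to 1 (heavy smoothing) for
    rare words, close to 0 for the most frequent ones; log-transformed
    counts downplay differences between frequent words."""
    return 1.0 - math.log(1 + freq) / math.log(1 + max_freq)
```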

Page 11: Performance-based estimation of γ

- The exact relationship between the corpus frequency of n and γ is determined on held-out pairs
- The held-out data are split into frequency ranges, and the mean rank of the correct equivalent in each range is computed
- A function g(x) is interpolated along the mean-rank points
- g(n) is the predicted rank for n; RR is the random rank, the lower bound on mean rank
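A sketch of the procedure with numpy, using invented held-out numbers; the mapping from the predicted rank g(n) to γ is likewise an assumption (the closer g(n) is to the random rank RR, the more smoothing is applied):

```python
import numpy as np

# Invented held-out statistics: mean rank of the correct equivalent
# observed in each frequency range.
range_centres = np.array([50, 150, 250, 350, 450])
mean_ranks = np.array([160.0, 120.0, 90.0, 70.0, 60.0])
RR = 500.0  # random rank: the mean rank expected by chance

def g(freq):
    """Predicted rank for a word of the given corpus frequency,
    interpolated along the held-out mean-rank points."""
    return np.interp(freq, range_centres, mean_ranks)

def gamma_perf(freq):
    """Hypothetical mapping from predicted performance to a smoothing
    weight: near-chance performance triggers near-total smoothing."""
    return g(freq) / RR
```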

Page 12: Smoothing functions

[Chart: γ values (0-1) across frequency ranks (1-100 … 901-1000) for En-Fr and Ge-Sp under the performance-based (perf) and heuristic (heur) estimates]

Page 13: Less frequent neighbours

Remove less frequent neighbours in order to avoid "diluting" corpus-attested probabilities, as in the sketch below.
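A one-function sketch, with a hypothetical freq mapping from words to corpus counts:

```python
def reliable_neighbours(n, neighbours, freq):
    """Keep only neighbours at least as frequent as n itself, so that
    poorly attested probabilities do not dilute corpus-attested ones."""
    return [n2 for n2 in neighbours[n] if freq[n2] >= freq[n]]
```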

Page 14: Experimental setup

- 6 language pairs: all combinations of English, French, German, and Spanish
- Corpora and parsers:
  - EN: WSJ (1987-89), Connexor FDG
  - FR: Le Monde (1994-96), Xerox Xelda
  - GE: die Tageszeitung (1987-89, 1994-98), Versley
  - SP: EFE (1994-95), Connexor FDG
- Verb-direct object pairs were extracted from each corpus

Page 15: Experimental setup

- Translation matrices:
  - equivalents between verb synsets in EuroWordNet
  - translation probabilities distributed equally among the different translations of a source word (see the sketch below)
- Evaluation samples of noun pairs:
  - 1000 pairs from EWN for each language pair
  - sampled from equidistant positions in a sorted frequency list
  - divided into 10 frequency ranges
  - each noun may have several translations in the sample (1.06 to 1.15 translations)
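A sketch of how the matrix cells can be filled, with a toy two-entry lexicon standing in for the EuroWordNet equivalents:

```python
from collections import defaultdict

bilingual_lexicon = {                 # hypothetical entries
    "buy": ["acheter", "acquérir"],
    "sell": ["vendre"],
}

# Distribute the probability mass equally among the translations
# of each source verb.
translation_prob = defaultdict(float)
for src, targets in bilingual_lexicon.items():
    for tgt in targets:
        translation_prob[(src, tgt)] = 1.0 / len(targets)

print(translation_prob[("buy", "acheter")])   # 0.5
```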

Page 16: Experimental setup

- Assignment algorithm:
  - pairs each source noun with a correct target noun
  - similarity measured using the Jensen-Shannon divergence
  - the Kuhn-Munkres algorithm determines the optimal assignment over the entire set (see the sketch below)
- Evaluation measure: mean rank of the correct equivalent
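A sketch of the assignment step using scipy, with random vectors standing in for the mapped co-occurrence vectors (scipy's jensenshannon returns the square root of the divergence, so it is squared here to recover it):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment   # Kuhn-Munkres
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

# Toy probability vectors for 4 source and 4 target nouns.
src = rng.random((4, 6)); src /= src.sum(axis=1, keepdims=True)
tgt = rng.random((4, 6)); tgt /= tgt.sum(axis=1, keepdims=True)

# Cost matrix of pairwise Jensen-Shannon divergences.
cost = np.array([[jensenshannon(s, t) ** 2 for t in tgt] for s in src])

# Globally optimal one-to-one assignment over the entire set.
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"source noun {r} -> target noun {c}")
```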

Page 17: Baseline: no smoothing

[Chart: mean rank (0-180) of the correct equivalent by frequency rank (1-100 … 901-1000) for the six language pairs En-Fr, En-Ge, En-Sp, Fr-Ge, Fr-Sp, Ge-Sp]

Page 18: DBA: replace p(v|n) with p*(v|n)

Page 19: Discard less frequent neighbours

Significant reduction of Mean Rank: Fr-Ge, Fr-Sp, Ge-Sp

Page 20: Heuristical estimation of γ

Significant reduction of Mean Rank: all language pairs

Page 21: Performance-based estimation of γ

Significant reduction of Mean Rank: all language pairs

Page 22: Relationship between k, frequency and Mean Rank

[Chart: mean rank (0-180) as a function of the number of neighbours k (0, 2, 5, 10, 20, 40, 60, 100, 140, 200, 300, 500, 800) across frequency ranges (100, 300, 500, 700, 900)]

Page 23: Conclusions

- Smoothing co-occurrence data on rare words using intra-language similarities improves retrieval of their translational equivalents
- Two extensions of DBA to smooth rare co-occurrences:
  - heuristical: the amount of smoothing is a linear function of frequency
  - performance-based: the smoothing function is estimated on held-out data
- Both lead to considerable improvement:
  - up to a 48-rank reduction (from 146 to 99, 32%) in the low-frequency ranges
  - up to a 27-rank reduction (from 81 to 54, 33%) overall