
1

Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics

Chris Biemann, Uwe Quasthoff

University of Leipzig, NLP-Dept.

Friday, May 20, 2005

NODALIDA 2005

2

Problem Description

Given:
• certain amounts of sentence-aligned parallel texts

Not available:
• morphology, grammar, semantic etc. information
• string similarity for cognates
• bilingual dictionary

Wanted:
• bilingual dictionaries
• alignment on word level

3

Broad Picture

• Calculation of translingual statistically significant co-occurrences yields ranked translation candidates

• For alignment, the highest-ranked translation candidates that occur in the sentence pair are linked.

4

Co-occurrence Statistics

• Co-occurrence: occurrence of two words within a well-defined unit of information (sentence, nearest neighbors, window...)

• Significant co-occurrences reflect relations between words

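• Threshold on significance measure (log-likelihood):

$$\mathrm{sig}(A,B) = x - k \log x + \log k!, \qquad x = \frac{ab}{n}$$

with n = number of sentences, k = number of units containing both A and B, and a, b = number of units containing A and B, respectively.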
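The measure can be sketched in a few lines of Python. This is a minimal sketch, assuming the formula reads sig(A,B) = x - k log x + log k! with x = ab/n, that a and b are the numbers of units containing A and B, and that the logarithm is the natural one (the slides do not state the base). The function name `sig` mirrors the slide; everything else is illustrative.

```python
import math


def sig(a: int, b: int, k: int, n: int) -> float:
    """Significance x - k*log(x) + log(k!) with x = a*b/n.

    a, b: number of units containing A and B (assumed meaning)
    k:    number of units containing both A and B
    n:    total number of units (sentences)
    """
    if min(a, b, n) <= 0 or k < 0:
        raise ValueError("a, b, n must be positive and k non-negative")
    x = a * b / n
    # math.lgamma(k + 1) equals log(k!) and avoids computing huge factorials.
    return x - k * math.log(x) + math.lgamma(k + 1)


# Two words that each occur in 100 of 1,000,000 sentences:
# seeing them together 50 times is far more significant than 5 times.
strong = sig(100, 100, 50, 1_000_000)
weak = sig(100, 100, 5, 1_000_000)
```

Using `lgamma` rather than `math.factorial` keeps the computation stable for the large joint counts that frequent word pairs produce.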

5

Trans-co-occurrences: Translingual co-occurrences

'Normal' co-occurrences:
• Calculation performed on sentence basis
• Co-occurrents can be found frequently together in sentences

Trans-co-occurrences:
• Calculation performed on bilingual sentence pairs
• Co-occurrents can be found frequently together in bilingual sentence pairs
• Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents
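The counting step can be illustrated with a short Python sketch. It treats each aligned sentence pair as one unit, counts how often a source word and a target word appear in the same pair, and ranks target words by significance. The function name `trans_cooccurrences`, the `min_sig` parameter, the toy corpus, and the filter that keeps only pairs observed more often than expected (k > x) are assumptions for illustration; the slides do not specify them. The significance formula is assumed to be x - k log x + log k! with x = ab/n and natural logarithms.

```python
import math
from collections import Counter
from itertools import product


def trans_cooccurrences(pairs, min_sig=0.0):
    """Rank target-language words for each source-language word.

    pairs: list of (source_tokens, target_tokens), one per aligned sentence pair.
    Returns: source word -> list of (target word, significance), best first.
    """
    n = len(pairs)
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        # Count each word at most once per unit (sentence pair).
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        joint.update(product(src_types, tgt_types))

    ranked = {}
    for (s, t), k in joint.items():
        x = src_freq[s] * tgt_freq[t] / n  # expected joint count
        if k <= x:
            continue  # assumption: keep only positively associated pairs
        score = x - k * math.log(x) + math.lgamma(k + 1)
        if score >= min_sig:
            ranked.setdefault(s, []).append((t, score))
    for candidates in ranked.values():
        candidates.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up toy corpus, only to show the mechanics.
pairs = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Haus"], ["a", "house"]),
    (["ein", "Buch"], ["a", "book"]),
]
ranked = trans_cooccurrences(pairs)
```

On a real corpus the nested product over all word pairs grows quickly, so a practical implementation would likely prune rare words or very long sentences first.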

6

Data: Europarl

• Transcripts of European Parliament proceedings, about 1 million sentences per language

• Available for Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish

• Experiments carried out for English-Danish, English-Dutch, English-German, English-Finnish, English-Italian, English-Portuguese and English-Swedish (chosen because of dictionary availability)

7

Example: Gesellschaft@de society@en

Die@de drogenfreie@de Gesellschaft@de wird@de es@de aber@de nie@de geben@de .@de But@en there@en never@en will@en be@en a@en drug-free@en society@en .@en

Unsere@de Gesellschaft@de neigt@de leider@de dazu@de ,@de Gesetze@de zu@de umgehen@de .@de Unfortunately@en ,@en our@en society@en is@en inclined@en to@en skirt@en round@en the@en law@en .@en

Zum@de Glück@de kommt@de das@de in@de einer@de demokratischen@de Gesellschaft@de selten@de vor@de .@de Fortunately@en ,@en in@en a@en democratic@en society@en this@en is@en rare@en .@en

Herr@de Präsident@de !@de Wir@de leben@de in@de einer@de paradoxen@de Gesellschaft@de .@de Mr@en President@en ,@en we@en live@en in@en a@en paradoxical@en society@en .@en

Ich@de sprach@de vom@de Paradoxon@de unserer@de Gesellschaft@de .@de I@en mentioned@en what@en is@en paradoxical@en in@en society@en .@en

Zeit@de ist@de Macht@de in@de unserer@de Gesellschaft@de .@de Time@en is@en power@en in@en our@en society@en .@en

In all sentence pairs, Gesellschaft@de and society@en occur together.

8

Example: top-ranked trans-co-occurrences

Gesellschaft: society@en (12082), social@en (342), our@en (274), in@en (237), societies@en (226), Society@en (187), women@en (183), as@en a@en whole@en (182), of@en our@en (168), open@en society@en (165), democratic@en (159), company@en (137), modern@en (134), children@en (120), values@en (120), economy@en (119), of@en a@en (111), knowledge-based@en (110), European@en (105), civil@en society@en (102)

society: Gesellschaft@de (12082), unserer@de (466), einer@de (379), gesellschaftlichen@de (328), Wissensgesellschaft@de (312), Menschen@de (233), gesellschaftliche@de (219), Frauen@de (213), Zivilgesellschaft@de (179), Gesellschaften@de (173), Informationsgesellschaft@de (161), modernen@de (157), sozialen@de (155), Wirtschaft@de (132), Leben@de (119), Familie@de (118), Gesellschaftsmodell@de (108), demokratischen@de (108), soziale@de (98), Schichten@de (97)

kaum: hardly@en (825), scarcely@en (470), little@en (362), barely@en (278), hardly@en any@en (254), very@en little@en (186), almost@en (88), difficult@en (68), unlikely@en (63), virtually@en (53), scarcely@en any@en (51), impossible@en (47), or@en no@en (40), there@en is@en (38), hardly@en ever@en (37), any@en (32), hardly@en anything@en (32), surprising@en (31), hardly@en a@en (29), hard@en (28)

hardly: kaum@de (825), wohl@de kaum@de (138), schwerlich@de (64), nicht@de (51), verwunderlich@de (43), kann@de (37), wenig@de (37), wundern@de (25), man@de (21), dürfte@de (17), gar@de nicht@de (17), auch@de nicht@de (16), gerade@de (16), überrascht@de (15), fast@de (14), überraschen@de (14), praktisch@de (13), ist@de (12), schlecht@de (12), verwundern@de (12)

9

Evaluation

• What is the quality of determined translation equivalents?
• Evaluation by comparing results to bilingual dictionaries (freelang) to measure precision
• Method:
- Only words that are in the dictionary and have automatic translations are taken into account
- Determine portion of matches in the 3 highest-ranked trans-co-occurrences

Problems:
• Some translations are correct but not found in the dictionary
• Dictionaries are not adapted to domain
• Inflection: dictionaries contain lemmas -> prefix matching
• Unknown multiword units
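The evaluation method can be sketched as follows. This is a minimal sketch using exact string matching, which also makes the inflection problem visible; the function name `top3_precision` and the dictionary entries are hypothetical, while the candidate lists are taken from the en-de sample data in this talk.

```python
def top3_precision(candidates, dictionary):
    """Share of evaluated words with a dictionary match among the top 3 candidates.

    candidates: word -> ranked list of automatic translations (best first)
    dictionary: word -> set of reference translations
    """
    evaluated = hits = 0
    for word, ranked in candidates.items():
        # Only words that are in the dictionary AND have automatic
        # translations are taken into account.
        if word not in dictionary or not ranked:
            continue
        evaluated += 1
        if any(cand in dictionary[word] for cand in ranked[:3]):
            hits += 1
    return hits / evaluated if evaluated else 0.0


candidates = {
    "office": ["Büro", "Amt", "Büros"],
    "approach": ["Ansatz", "Konzept", "Vorgehensweise"],
    "essential": ["wesentlichen", "wesentliche", "ist"],
    "pollutants": ["Schadstoffe", "Schadstoffen", "Emission"],
}
# Hypothetical dictionary entries; "pollutants" is deliberately missing.
dictionary = {
    "office": {"Büro", "Amt"},
    "approach": {"Ansatz"},
    "essential": {"wesentlich"},
}
precision = top3_precision(candidates, dictionary)
```

Here "essential" counts as a miss because the dictionary holds the lemma "wesentlich" while the candidates are inflected forms, which is the case prefix matching is meant to handle.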

10

Prefix matching

• Prefix match prfx(A,B) of two strings A and B is defined by

$$\mathrm{prfx}(A,B) = \frac{\text{length of common prefix of } A \text{ and } B}{\max(\mathrm{length}(A), \mathrm{length}(B))}$$

Examples:
prfx(Herbert, Herberts) = 7/8 = 0.875
prfx(Baustelle, Baugenehmigung) = 3/14 = 0.2143
prfx(Häuserkampf, Häuserkämpfe) = 7/12 = 0.5833

A quite crude measure, but it deals more or less with the inflection problem.
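A direct Python rendering of the definition, as a minimal sketch: the length of the common prefix divided by the length of the longer string. Returning 0 for two empty strings is an assumption, since the slide does not cover that case.

```python
def prfx(a: str, b: str) -> float:
    """Prefix match: common-prefix length over the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings score 0
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return common / longest


examples = [
    ("Herbert", "Herberts"),          # 7/8
    ("Baustelle", "Baugenehmigung"),  # 3/14
    ("Häuserkampf", "Häuserkämpfe"),  # 7/12
]
scores = [prfx(a, b) for a, b in examples]
```

The third example shows the crudeness of the measure: an umlaut change in the stem cuts the common prefix short even though the two forms belong to the same lemma.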

11

Sample data from en-de

co1-3: top trans-co-occurrences; p1-3: largest prefix match with some dictionary entry of "word".

word (en) | co1 (de) | p1 | co2 (de) | p2 | co3 (de) | p3
absolutely essential | absolut | 0 | unbedingt | 0.166 | unbedingt notwendig | 0.10
essential | wesentlichen | 0.83 | wesentliche | 0.909 | ist | 0
office | Büro | 1 | Amt | 1 | Büros | 0.8
pollutants | Schadstoffe | 1 | Schadstoffen | 0.916 | Emission | 0
expertise | Fachwissen | 0 | Sachverstand | 1 | Sachkenntnis | 1
prescribed | vorgeschrieben | 1 | vorgeschriebenen | 0.875 | vorgeschriebene | 0.93
means | bedeutet | 1 | Mittel | 1 | heißt | 0.09
bill | Gesetzentwurf | 0.15 | Gesetzesentwurf | 0.133 | Rechnung | 1
approach | Ansatz | 1 | Konzept | 0 | Vorgehensweise | 0
audit | Prüfung | 0 | Audit | 1 | Rechnungsprüfung | 1
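The p values in the sample data are the largest prefix match of a candidate against any dictionary entry of the English word. A minimal sketch of that step is below; the helper `best_prefix_match` and the dictionary entries are hypothetical, chosen so that the scores agree with the "office" and "pollutants" rows.

```python
import os.path


def prfx(a: str, b: str) -> float:
    """Common-prefix length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # os.path.commonprefix compares character by character, so it also
    # works for ordinary words, not only for file paths.
    return len(os.path.commonprefix([a, b])) / longest


def best_prefix_match(candidate: str, dict_entries) -> float:
    """Largest prefix match of a candidate with any dictionary entry."""
    return max((prfx(candidate, entry) for entry in dict_entries), default=0.0)


# Hypothetical dictionary entries for two English words.
office_entries = ["Büro", "Amt"]
pollutants_entries = ["Schadstoffe"]

p_bueros = best_prefix_match("Büros", office_entries)                 # 4/5
p_schadstoffen = best_prefix_match("Schadstoffen", pollutants_entries)  # 11/12
p_emission = best_prefix_match("Emission", pollutants_entries)        # no shared prefix
```

Taking the maximum over all entries means one good dictionary translation is enough for a candidate to count as a (partial) match.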

12

Results for freelang-evaluation

[Figure: six bar charts of dictionary-based precision, one per language pair: English-Danish, English-Swedish, English-Finnish, English-Dutch, English-Portuguese, English-German. x-axis: co-occurrence position (sim1, sim2, sim3, sim 1 or 2 or 3); y-axis: % correct (0 to 100). Colours: blue: prfx = 1; red: 0.8 <= prfx < 1; yellow: 0.6 <= prfx < 0.8.]

13

Manual Evaluation on Random Samples of 1000 Words

Better results:
- no domain-dependent deficiency of dictionary
- no problems with inflection

[Figure: Manual evaluation: correct and partially correct for 1st translation candidate. x-axis: language pair (de-en, en-de, da-en, en-da, sv-en, en-sv, nl-en, en-nl); y-axis: precision (0 to 1); series: correct, partially correct.]

[Figure: Manual evaluation: correct and partially correct for 2nd translation candidate. x-axis: language pair (de-en, en-de, da-en, en-da, sv-en, en-sv, nl-en, en-nl); y-axis: precision (0 to 1); series: correct, partially correct.]

14

Coverage on types

Proportion of words with at least 3 trans-co-occurrences in types list

Proportion of types with at least 3 trans-co-occurrences for English-other language:

Language | DE | DA | FI | NL | SV | PT | IT
Proportion | 0.431 | 0.421 | 0.348 | 0.455 | 0.422 | 0.512 | 0.516

15

Coverage on tokens

Proportion of tokens having at least 3 trans-co-occurrences in running text.

Coverage on types with at least 3 trans-co-occurrences for English-other language:

Language | DE | DA | FI | NL | SV | PT | IT
Proportion | 0.871 | 0.866 | 0.836 | 0.889 | 0.882 | 0.879 | 0.874
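The difference between type coverage and token coverage can be made concrete with a small sketch. It assumes a word counts as covered when it has at least three ranked trans-co-occurrences; the function name `coverage`, its parameters, and the toy data are illustrative only.

```python
from collections import Counter


def coverage(tokens, candidates, min_candidates=3):
    """Return (type coverage, token coverage) for a running text.

    tokens:     list of word tokens of the running text
    candidates: word -> list of trans-co-occurrences
    """
    freq = Counter(tokens)
    covered = {w for w in freq if len(candidates.get(w, ())) >= min_candidates}
    # Type coverage: each distinct word counts once.
    type_cov = len(covered) / len(freq) if freq else 0.0
    # Token coverage: each word counts as often as it occurs.
    token_cov = sum(freq[w] for w in covered) / len(tokens) if tokens else 0.0
    return type_cov, token_cov


# Made-up toy data: the frequent word is covered, the rare one is not.
tokens = ["the", "the", "the", "house", "zebra"]
candidates = {
    "the": ["der", "die", "das"],
    "house": ["Haus", "Hauses", "Parlament"],
    "zebra": ["Zebra"],
}
type_cov, token_cov = coverage(tokens, candidates)
```

Because frequent words are the ones most likely to have enough trans-co-occurrences, token coverage comes out much higher than type coverage, matching the pattern in the reported figures.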

16

Comparison with [Sahlgren 2004]

[Figure: Precision from 1st to 3rd translation, Freq > 100, en->de. x-axis: # candidate translation (1, 2, 3); y-axis: precision (0 to 60); series: sim=1, sim>0.8, sim>0.6, [Sahlgren 2004].]

17

Comparison with [Sahlgren 2004]

[Figure: Precision for 9 frequency ranges, en->de. x-axis: frequency (ticks at 1, 10, 100, 1000, 10000, 100000, 1000000); y-axis: precision (0 to 1); series: sim1=1, sim1>0.6, [Sahlgren 2004].]

18

Alignment

Given:
• Bilingual sentence pair

Wanted:
• Which word corresponds with which?

Method:
• Scan sentence 1 word by word and link each word to the highest-ranked word in its trans-co-occurrences that can be found in sentence 2.
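The method amounts to a greedy lookup per word, sketched below. The function name `align` and the ranked lists in the example are hypothetical. Multiword candidates such as "as a whole" are not handled, and several source words may link to the same target word.

```python
def align(sentence1, sentence2, ranked):
    """Link each word of sentence1 to its best-ranked candidate in sentence2.

    ranked: word -> list of trans-co-occurrences, best first.
    Returns a list of (word, linked_word or None, rank or None).
    """
    available = set(sentence2)
    links = []
    for word in sentence1:
        link = (word, None, None)  # default: no alignment found
        for rank, cand in enumerate(ranked.get(word, []), start=1):
            if cand in available:
                link = (word, cand, rank)
                break  # the highest-ranked candidate wins
        links.append(link)
    return links


# Hypothetical ranked lists, for illustration only.
ranked = {
    "unserer": ["of", "our"],
    "Gesellschaft": ["society", "social", "our"],
}
links = align(
    ["in", "unserer", "Gesellschaft"],
    ["in", "our", "society"],
    ranked,
)
```

The returned rank corresponds to the arrow index shown in the alignment examples: rank 1 means the top trans-co-occurrence was present in the other sentence.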

19

Alignment: Example 1

Red words: no alignment
Blue arrows: errors
Arrow index: rank in trans-co-occurrences

Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union .

Agriculture only provides 5.5 % of employment in (the Union) .

[Figure: the sentence pair is shown twice, each time with alignment arrows labelled by rank.]

20

Alignment: Example 2

Grey Arrows: Multiple alignments for frequent words.

Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen ,

By not setting a specific date (for the) Member States (to implement) the directive

sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen ,

and instead giving them a period of three months after its (entry into force) ,

führen wir eine Flexibilitätsklausel ein ,

we are introducing a flexibility clause

die eine unverzügliche Umsetzung gewährleistet .

which ensures that the directive will be implemented without delay .

[Figure: alignment arrows between the German and English segments, labelled by rank in trans-co-occurrences.]

21

Further work

Dictionary acquisition:
• document-level aligned texts
• weakly parallel texts or corpora

Alignment:
• Dealing with cognates
• Symmetric alignment
• Alignment of phrases and multiword units

22

END@en

beenden@de Beendigung@de Ende@de

fine@it terminare@it finire@it

lopettaa@fi lopettamaan@fi lopettamiseksi@fi

chegar@pt fim@pt termo@pt

einde@nl eind@nl jaar@nl

23

References

• (Biemann et al. 2004) Biemann, C.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, C. (2004): Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea, Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg

• (Sahlgren 2004) Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal

• (Koehn 2002) Koehn, P. (2002): Europarl: A multilingual corpus for evaluation of machine translation, http://people.csail.mit.edu/people/koehn/publications/europarl/

24

Alignment Evaluation: Strong's numbers in the Bible

English-Russian

25

Alignment Evaluation: Strong's numbers in the Bible

English-German
