1
Dictionary Acquisition Using Parallel Text
and Co-occurrence Statistics
Chris Biemann, Uwe Quasthoff
University of Leipzig, NLP-Dept.
Friday, May 20, 2005
NODALIDA 2005
2
Problem Description
Given:
• certain amounts of sentence-aligned parallel texts
Not available:
• morphological, grammatical, semantic, etc. information
• string similarity for cognates
• bilingual dictionary
Wanted:
• bilingual dictionaries
• alignment on word level
3
Broad Picture
• Calculation of translingual statistically significant co-occurrences yields ranked translation candidates
• For alignment, the highest-ranked translation candidates that occur in the sentence pair are linked.
4
Co-occurrence Statistics
• Co-occurrence: occurrence of two words within a well-defined unit of information (sentence, nearest neighbors, window...)
• Significant co-occurrences reflect relations between words
• Threshold on significance measure (log-likelihood):
$$\mathrm{sig}(A,B) = x - k \log x + \log k!$$
with $n$ = number of sentences, $x = \frac{ab}{n}$, and $k$ = number of units containing A and B
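A minimal Python sketch of this significance measure may help. It assumes that a and b are the numbers of units containing A and B respectively (the slide does not define them), that the natural logarithm is meant, and the function name `significance` is made up for illustration.

```python
import math


def significance(a: int, b: int, k: int, n: int) -> float:
    """sig(A,B) = x - k*log(x) + log(k!), with x = a*b/n.

    a, b: units containing A resp. B (assumed meaning, not stated on the slide)
    k:    units containing both A and B
    n:    total number of units (sentences)
    """
    if min(a, b, k, n) <= 0:
        raise ValueError("all counts must be positive")
    x = a * b / n
    # math.lgamma(k + 1) equals log(k!) and avoids computing huge factorials.
    return x - k * math.log(x) + math.lgamma(k + 1)
```

With a and b fixed, the value grows as the joint count k rises above the expected count x, which is what makes it usable as a threshold.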
5
Trans-co-occurrences: Translingual co-occurrences
'Normal' co-occurrences:
• Calculation performed on sentence basis
• Co-occurrents can be found frequently together in sentences
Trans-co-occurrences:
• Calculation performed on bilingual sentence pairs
• Co-occurrents can be found frequently together in bilingual sentence pairs
• Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents
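The counting over bilingual sentence pairs could be sketched as follows. This is only an illustration under assumptions: sentences are already tokenised, each word is counted once per sentence, and the names `trans_cooccurrences` and `min_sig` are hypothetical rather than taken from the talk.

```python
import math
from collections import Counter, defaultdict


def significance(a, b, k, n):
    # sig(A,B) = x - k*log(x) + log(k!), with x = a*b/n
    x = a * b / n
    return x - k * math.log(x) + math.lgamma(k + 1)


def trans_cooccurrences(pairs, min_sig=0.0):
    """Rank target-language words for each source-language word.

    pairs: list of (source_tokens, target_tokens) aligned sentence pairs.
    Returns {source_word: [(target_word, sig), ...]}, best candidate first.
    """
    n = len(pairs)
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_types, tgt_types = set(src), set(tgt)
        src_freq.update(src_types)   # sentence pairs containing the source word
        tgt_freq.update(tgt_types)   # sentence pairs containing the target word
        joint.update((s, t) for s in src_types for t in tgt_types)

    ranked = defaultdict(list)
    for (s, t), k in joint.items():
        sig = significance(src_freq[s], tgt_freq[t], k, n)
        if sig > min_sig:  # hypothetical threshold parameter
            ranked[s].append((t, sig))
    for s in ranked:
        # Sort by descending significance; ties are broken alphabetically.
        ranked[s].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranked)


toy_pairs = [
    (["die", "Gesellschaft"], ["the", "society"]),
    (["die", "Zeit"], ["the", "time"]),
    (["die", "Gesellschaft", "lebt"], ["the", "society", "lives"]),
    (["die", "Macht"], ["the", "power"]),
]
toy_ranked = trans_cooccurrences(toy_pairs)
```

On this toy corpus the top candidate for "Gesellschaft" is "society", ahead of the function word "the", which occurs in every sentence pair and is therefore less significant.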
6
Data: Europarl
• Transcriptions of European Parliament, about 1 million sentences per language
• Available for Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish
• Experiments carried out for: English-Danish, English-Dutch, English-German, English-Finnish, English-Italian, English-Portuguese, English-Swedish
(chosen because of dictionary availability)
7
Example: Gesellschaft@de society@en
Die@de drogenfreie@de Gesellschaft@de wird@de es@de aber@de nie@de geben@de .@de But@en there@en never@en will@en be@en a@en drug-free@en society@en .@en
Unsere@de Gesellschaft@de neigt@de leider@de dazu@de ,@de Gesetze@de zu@de umgehen@de .@de Unfortunately@en ,@en our@en society@en is@en inclined@en to@en skirt@en round@en the@en law@en .@en
Zum@de Glück@de kommt@de das@de in@de einer@de demokratischen@de Gesellschaft@de selten@de vor@de .@de Fortunately@en ,@en in@en a@en democratic@en society@en this@en is@en rare@en .@en
Herr@de Präsident@de !@de Wir@de leben@de in@de einer@de paradoxen@de Gesellschaft@de .@de Mr@en President@en ,@en we@en live@en in@en a@en paradoxical@en society@en .@en
Ich@de sprach@de vom@de Paradoxon@de unserer@de Gesellschaft@de .@de I@en mentioned@en what@en is@en paradoxical@en in@en society@en .@en
Zeit@de ist@de Macht@de in@de unserer@de Gesellschaft@de .@de Time@en is@en power@en in@en our@en society@en .@en .
In all sentence pairs, Gesellschaft@de and society@en occur together.
8
Example: top-ranked trans-co-occurrences
Gesellschaft: society@en (12082), social@en (342), our@en (274), in@en (237), societies@en (226), Society@en (187), women@en (183), as@en a@en whole@en (182), of@en our@en (168), open@en society@en (165), democratic@en (159), company@en (137), modern@en (134), children@en (120), values@en (120), economy@en (119), of@en a@en (111), knowledge-based@en (110), European@en (105), civil@en society@en (102)
society: Gesellschaft@de (12082), unserer@de (466), einer@de (379), gesellschaftlichen@de (328), Wissensgesellschaft@de (312), Menschen@de (233), gesellschaftliche@de (219), Frauen@de (213), Zivilgesellschaft@de (179), Gesellschaften@de (173), Informationsgesellschaft@de (161), modernen@de (157), sozialen@de (155), Wirtschaft@de (132), Leben@de (119), Familie@de (118), Gesellschaftsmodell@de (108), demokratischen@de (108), soziale@de (98), Schichten@de (97)
kaum: hardly@en (825), scarcely@en (470), little@en (362), barely@en (278), hardly@en any@en (254), very@en little@en (186), almost@en (88), difficult@en (68), unlikely@en (63), virtually@en (53), scarcely@en any@en (51), impossible@en (47), or@en no@en (40), there@en is@en (38), hardly@en ever@en (37), any@en (32), hardly@en anything@en (32), surprising@en (31), hardly@en a@en (29), hard@en (28)
hardly: kaum@de (825), wohl@de kaum@de (138), schwerlich@de (64), nicht@de (51), verwunderlich@de (43), kann@de (37), wenig@de (37), wundern@de (25), man@de (21), dürfte@de (17), gar@de nicht@de (17), auch@de nicht@de (16), gerade@de (16), überrascht@de (15), fast@de (14), überraschen@de (14), praktisch@de (13), ist@de (12), schlecht@de (12), verwundern@de (12)
9
Evaluation
• What is the quality of the determined translation equivalents?
• Evaluation by comparing results to bilingual dictionaries (freelang) to measure precision
• Method:
- Only words that are in the dictionary and have automatic translations are taken into account
- Determine the proportion of matches among the 3 highest-ranked trans-co-occurrences
Problems:
• Some translations are correct but not found in the dictionary
• Dictionaries are not adapted to the domain
• Inflection: dictionaries contain lemmas -> prefix matching
• Unknown multiword units
10
Prefix matching
• Prefix match prfx(A,B) of two strings A and B is defined by
$$\mathrm{prfx}(A,B) = \frac{\text{length of common prefix of } A \text{ and } B}{\max(\mathrm{length}(A), \mathrm{length}(B))}$$
Examples:
prfx(Herbert, Herberts) = 7/8 = 0.875
prfx(Baustelle, Baugenehmigung) = 3/14 = 0.2142
prfx(Häuserkampf, Häuserkämpfe) = 7/12 = 0.5833
A quite crude measure, but it deals more or less with the inflection problem
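The prefix match is simple enough to sketch directly in Python. Two details below are assumptions not stated on the slide: strings are Unicode-normalised (NFC) so that umlauts count as one character, and two empty strings score 0.

```python
import unicodedata


def prfx(a: str, b: str) -> float:
    """Length of the common prefix divided by the length of the longer string."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if not a and not b:
        return 0.0  # assumption: undefined case mapped to 0
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return common / max(len(a), len(b))
```

Applied to the three example pairs, this gives 7/8, 3/14 and 7/12.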
11
Sample data from en-de
co1-3: top trans-co-occurrences, p1-3: largest prefix match with some dict. entry of “word“.
word (en)             co1 (de)        p1    co2 (de)          p2     co3 (de)             p3
absolutely essential  absolut         0     unbedingt         0.166  unbedingt notwendig  0.10
essential             wesentlichen    0.83  wesentliche       0.909  ist                  0
office                Büro            1     Amt               1      Büros                0.8
pollutants            Schadstoffe     1     Schadstoffen      0.916  Emission             0
expertise             Fachwissen      0     Sachverstand      1      Sachkenntnis         1
prescribed            vorgeschrieben  1     vorgeschriebenen  0.875  vorgeschriebene      0.93
means                 bedeutet        1     Mittel            1      heißt                0.09
bill                  Gesetzentwurf   0.15  Gesetzesentwurf   0.133  Rechnung             1
approach              Ansatz          1     Konzept           0      Vorgehensweise       0
audit                 Prüfung         0     Audit             1      Rechnungsprüfung     1
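How the p values and a per-rank precision could be computed is sketched below. The dictionary entries are invented to be consistent with two rows of the sample data (they are not the actual freelang entries), and the function names and the `threshold` parameter are hypothetical.

```python
def prfx(a: str, b: str) -> float:
    # Common-prefix length over the length of the longer string.
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    longest = max(len(a), len(b))
    return common / longest if longest else 0.0


def best_match(candidate: str, entries: list) -> float:
    """Largest prefix match of a candidate with any dictionary entry."""
    return max((prfx(candidate, entry) for entry in entries), default=0.0)


def precision_at_rank(rank, ranked, dictionary, threshold=1.0):
    """Share of evaluated words whose rank-th candidate scores >= threshold.

    Words without dictionary entries or without a rank-th candidate are skipped,
    mirroring the rule that only words present in both resources are evaluated.
    """
    hits = total = 0
    for word, candidates in ranked.items():
        entries = dictionary.get(word)
        if not entries or len(candidates) < rank:
            continue
        total += 1
        if best_match(candidates[rank - 1], entries) >= threshold:
            hits += 1
    return hits / total if total else 0.0


# Candidates taken from the sample table; dictionary entries are illustrative.
sample_ranked = {
    "office": ["Büro", "Amt", "Büros"],
    "approach": ["Ansatz", "Konzept", "Vorgehensweise"],
}
sample_dictionary = {"office": ["Büro", "Amt"], "approach": ["Ansatz"]}
office_scores = [best_match(c, sample_dictionary["office"]) for c in sample_ranked["office"]]
```

For "office" this reproduces the row 1, 1, 0.8 from the table; lowering `threshold` to 0.8 or 0.6 corresponds to the looser match classes used in the result charts.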
12
Results for freelang-evaluation
blue: prfx=1, red: 0.8<=prfx<1, yellow: 0.6<=prfx<0.8
[Figure: six bar charts, one per language pair (English-Danish, English-Swedish, English-Finnish, English-Dutch, English-Portuguese, English-German). x-axis: co-occurrence position (sim1, sim2, sim3, sim 1 or 2 or 3); y-axis: % correct (0 to 100); series: sim=1, 0.8<=sim<1, 0.6<=sim<0.8.]
13
Manual Evaluation on random samples of 1000 words
Better results:
- no domain-dependent deficiency of dictionary
- no problems with inflection
[Figure: Manual Evaluation: correct and partially correct for 1st translation candidate. x-axis: language pair (de-en, en-de, da-en, en-da, sv-en, en-sv, nl-en, en-nl); y-axis: precision (0 to 1); series: correct, partially correct.]
[Figure: Manual Evaluation: correct and partially correct for 2nd translation candidate. x-axis: language pair (de-en, en-de, da-en, en-da, sv-en, en-sv, nl-en, en-nl); y-axis: precision (0 to 1); series: correct, partially correct.]
14
Coverage on types
Proportion of words in the types list with at least 3 trans-co-occurrences, for English-other language:
DE     DA     FI     NL     SV     PT     IT
0.431  0.421  0.348  0.455  0.422  0.512  0.516
15
Coverage on tokens
Proportion of tokens having at least 3 trans-co-occurrences in running text, for English-other language:
DE     DA     FI     NL     SV     PT     IT
0.871  0.866  0.836  0.889  0.882  0.879  0.874
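The difference between type and token coverage can be made concrete with a small sketch. It assumes a tokenised running text and a mapping from each word to its list of trans-co-occurrences; the names `coverage` and `min_candidates` are illustrative only.

```python
from collections import Counter


def coverage(tokens, ranked, min_candidates=3):
    """Return (type coverage, token coverage) for a running text.

    A word counts as covered if it has at least `min_candidates`
    trans-co-occurrences in `ranked`.
    """
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    covered = {w for w in counts if len(ranked.get(w, [])) >= min_candidates}
    type_cov = len(covered) / len(counts)                          # each distinct word once
    token_cov = sum(counts[w] for w in covered) / len(tokens)      # weighted by frequency
    return type_cov, token_cov


toy_tokens = ["the", "the", "the", "society", "paradoxical"]
toy_ranked = {
    "the": ["die", "der", "das"],
    "society": ["Gesellschaft", "unserer", "einer"],
    "paradoxical": ["paradoxen"],
}
toy_type_cov, toy_token_cov = coverage(toy_tokens, toy_ranked)
```

Because covered words tend to be the frequent ones, token coverage comes out higher than type coverage, as in the figures reported on the two coverage slides.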
16
Comparison with [Sahlgren 2004]
[Figure: Precision from 1st to 3rd translation, Freq > 100, en->de. x-axis: # candidate translation (1, 2, 3); y-axis: precision (0 to 60); series: sim=1, sim>0.8, sim>0.6, [Sahlgren 2004].]
17
Comparison with [Sahlgren 2004]
[Figure: Precision for 9 frequency ranges, en->de. x-axis: frequency (logarithmic, 1 to 1000000); y-axis: precision (0 to 1); series: sim1=1, sim1>0.6, [Sahlgren 2004].]
18
Alignment
Given:
• Bilingual sentence pair
Wanted:
• Which word corresponds to which?
Method:
• Scan sentence 1 word by word and link each word to the highest-ranked word in its trans-co-occurrences that can be found in sentence 2.
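The linking method above can be sketched in a few lines of Python. The candidate lists in the example are invented for illustration, and the function name `align` is hypothetical.

```python
def align(sentence1, sentence2, ranked):
    """Link each word of sentence 1 to its best-ranked candidate found in sentence 2.

    ranked: {word: [candidate, ...]} with the most significant candidate first.
    Returns a list of (word, (linked_word, rank)) or (word, None) if no link exists.
    """
    present = set(sentence2)
    links = []
    for word in sentence1:
        link = None
        for rank, candidate in enumerate(ranked.get(word, []), start=1):
            if candidate in present:
                link = (candidate, rank)  # rank corresponds to the arrow index
                break
        links.append((word, link))
    return links


# Illustrative candidate lists, not actual output of the method.
toy_ranked = {
    "Landwirtschaft": ["agriculture", "Agriculture", "farming"],
    "nur": ["only", "just"],
    "Union": ["Union"],
}
toy_links = align(
    ["Die", "Landwirtschaft", "nur", "Union"],
    ["Agriculture", "only", "Union"],
    toy_ranked,
)
```

Words without any candidate in the other sentence stay unlinked. The sketch runs in one direction only, so several words of sentence 1 may link to the same word of sentence 2.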
19
Alignment: Example 1
Red words: no alignment
Blue arrows: errors
Arrow index: rank in trans-co-occurrences
Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union .
Agriculture only provides 5.5 % of employment in (the Union) .
[Figure: two alignment diagrams for this sentence pair, with arrows labelled by rank in the trans-co-occurrences.]
20
Alignment: Example 2
Grey Arrows: Multiple alignments for frequent words.
Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen ,
By not setting a specific date (for the) Member States (to implement) the directive
sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen ,
and instead giving them a period of three months after its (entry into force) ,
führen wir eine Flexibilitätsklausel ein ,
we are introducing a flexibility clause
die eine unverzügliche Umsetzung gewährleistet .
which ensures that the directive will be implemented without delay .
[Figure: alignment arrows between the German and English segments, labelled by rank in the trans-co-occurrences.]
21
Further work
Dictionary acquisition:
• document-level aligned texts
• weakly parallel texts or corpora
Alignment:
• Dealing with cognates
• Symmetric alignment
• Alignment of phrases and multiword units
22
END@en
beenden@de Beendigung@de Ende@de
fine@it terminare@it finire@it
lopettaa@fi lopettamaan@fi lopettamiseksi@fi
chegar@pt fim@pt termo@pt
einde@nl eind@nl jaar@nl
23
References
• (Biemann et al 2004): Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr.: Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CicLING 2004, Seoul, Korea and Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg
• (Sahlgren 2004) Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal
• (Koehn 2002) Koehn, P. (2002): Europarl: A multilingual corpus for evaluation of machine translation, http://people.csail.mit.edu/people/koehn/publications/europarl/
24
Alignment Evaluation: Strong's numbers in the Bible
English-Russian
25
Alignment Evaluation: Strong's numbers in the Bible
English-German