lemmatizer czechtoenglish ml

Effects of Lemmatization on Czech-English Statistical MT. Ashley Gill, University of Washington, Seattle, WA, [email protected]. Parinita, University of Washington, Seattle, WA, [email protected]

Upload: parinita-thakur-rahi

Post on 11-Jun-2015


Page 1: Lemmatizer czechtoenglish ml

Effects of Lemmatization on Czech-English Statistical MT

Ashley Gill
University of Washington
Seattle, WA
[email protected]

Parinita
University of Washington
Seattle, WA
[email protected]

Page 2: Lemmatizer czechtoenglish ml

Motivation

• Morphologically rich language -> English
• Source (Czech)
  - functions expressed as endings (inflections)
  - fewer instances of the surface form of a word (prefix+stem+suffix) occur in the corpus: data sparsity
  - free word order
• Target (English)
  - word order
  - function words
• Goal
  - to improve word alignments
• Approach
  - analyze surface word forms into lemma and morphology, e.g.: car +plural
  - translate lemma and morphology separately
  - generate target surface form
  - experiment with the different POS

Page 3: Lemmatizer czechtoenglish ml

Experiments

• The most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006).
• Baseline: no changes
• ALemma: all words were lemmatized
• NLemma: only nouns were lemmatized
• VLemma: only verbs were lemmatized

Page 4: Lemmatizer czechtoenglish ml

System Overview

[Diagram. Components shown: Source Corpus, Lemmatizer (ALemma, NLemma, VLemma), Lemmatized Source Corpus, Baseline, Moses Toolkit, Target Translation]

Page 5: Lemmatizer czechtoenglish ml

Lemmatizer

• ‘The Free Morphology (FM)’ tool (Hajic 2001)
• universal (i.e., language-independent) morphology tool (FMAnalyze.pl)
• analysis of word forms for inflective languages
• includes a frequency-based, high-coverage Czech dictionary
• Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category; we used tags for nouns and verbs

Examples:
Input: Prezident rezignoval na svou funkci.
Output:
<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>
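The FM markup in the example can be read with a couple of regular expressions. The following is only a minimal sketch, not the tooling used for the slides: it assumes that every token opens with an <f ...> or <d ...> element, that <MMl> carries a lemma and each <MMt> carries one positional tag, and the function name parse_fm is hypothetical.

```python
import re

# Assumption: a token starts with <f ...> or <d ...>, followed by <MMl>/<MMt> fields.
TOKEN_RE = re.compile(r"<[fd][^>]*>([^<]*)((?:<MM[lt]>[^<]*)+)")
FIELD_RE = re.compile(r"<MM([lt])>([^<]*)")


def parse_fm(markup):
    """Return a list of (form, lemma, tags) tuples from FM-style markup."""
    tokens = []
    for form, fields in TOKEN_RE.findall(markup):
        lemma, tags = None, []
        for kind, value in FIELD_RE.findall(fields):
            if kind == "l" and lemma is None:
                lemma = value.strip()       # keep only the first lemma
            elif kind == "t":
                tags.append(value.strip())  # a form may have several tags
        tokens.append((form.strip(), lemma, tags))
    return tokens


sample = (
    "<csts><f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----"
    "<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---"
    "<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------</csts>"
)
for form, lemma, tags in parse_fm(sample):
    print(form, lemma, tags)
```

Ambiguous forms such as "na" keep all of their tags in the list, so a later step can decide how to pick one.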

Page 6: Lemmatizer czechtoenglish ml

Preprocessing

• The output from the FM has one token per line
• No markup for sentence delimiter
• Inserted a simple sentence delimiter, “*”, in the corpus (it does not occur naturally in the corpus)
• For each word from the FM file:
  - ALemma experiment: use the lemma instead of the original word
  - NLemma experiment: use the lemma only if the first position of the FM output markup is “N” (denoting a noun)
  - VLemma experiment: use the lemma only if the first position of the FM output markup is “V” (denoting a verb)
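The selection rule for the three experiments can be sketched as follows. This is an illustration under stated assumptions rather than the original script: tokens are taken to be (form, lemma, tags) tuples, the first listed tag is used when a form is ambiguous, and the names choose_token and build_corpus are hypothetical.

```python
def choose_token(form, lemma, tags, experiment):
    """Return the lemma or the surface form for one token."""
    # First position of the first tag; using the first tag is an assumption.
    pos = tags[0][:1] if tags else ""
    lemmatize = (
        experiment == "alemma"
        or (experiment == "nlemma" and pos == "N")
        or (experiment == "vlemma" and pos == "V")
    )
    return lemma if lemmatize and lemma else form


def build_corpus(tokens, experiment, delimiter="*"):
    """Group tokens into sentences, splitting at the inserted delimiter."""
    sentences, current = [], []
    for form, lemma, tags in tokens:
        if form == delimiter:
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(choose_token(form, lemma, tags, experiment))
    if current:
        sentences.append(" ".join(current))
    return sentences


tokens = [
    ("Prezident", "prezident", ["NNMS1-----A----"]),
    ("rezignoval", "rezignovat_:T", ["VpYS---XR-AA---"]),
    ("na", "na", ["RR--4----------", "RR--6----------"]),
    ("svou", "svůj-1_^(přivlast.)", ["P8FS4---------1", "P8FS7---------1"]),
    ("funkci", "funkce", ["NNFS3-----A----", "NNFS4-----A----"]),
    (".", ".", ["Z:-------------"]),
    ("*", None, []),
]
for name in ("baseline", "alemma", "nlemma", "vlemma"):
    print(name, build_corpus(tokens, name))
```

FM lemmas can carry technical suffixes such as "_:T". The slides do not say whether these were stripped, so the sketch leaves them in place.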

Page 7: Lemmatizer czechtoenglish ml

Corpus: News Commentary corpus

System              Input sentences   Output sentences
baseline            35000             14453
ALemma              70048             13136
NLemma              70048             15737
VLemma              70048             20686
original baseline   70048             62610

For the comparison baseline we used a corpus of about half the size: 35,000 lines, of which 14,453 remain after removing sentences longer than 40 tokens.
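A length filter of this kind might look like the sketch below. It assumes whitespace-tokenized text and drops a sentence pair when either side is empty or longer than the limit; checking both sides is an assumption, since the slide only states the 40-token cutoff. The function name filter_by_length is hypothetical.

```python
def filter_by_length(src_sents, tgt_sents, max_tokens=40):
    """Keep sentence pairs where both sides have 1..max_tokens tokens."""
    kept = []
    for src, tgt in zip(src_sents, tgt_sents):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if 0 < n_src <= max_tokens and 0 < n_tgt <= max_tokens:
            kept.append((src, tgt))
    return kept


czech = ["prezident rezignoval na svou funkci .", "slovo " * 41, ""]
english = ["the president resigned from his post .", "word " * 41, "orphan line"]
print(len(filter_by_length(czech, english)))
```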

Page 8: Lemmatizer czechtoenglish ml

Results:

BASELINE: BLEU = 4.24, 27.9/9.3/2.2/0.7 (BP=0.931, ratio=0.933, hyp_len=46470, ref_len=49805)

ALEMMA: BLEU = 8.60, 36.4/13.7/5.2/2.1 (BP=1.000, ratio=1.177, hyp_len=58645, ref_len=49805)

NLEMMA: BLEU = 10.09, 40.0/15.7/6.2/2.7 (BP=1.000, ratio=1.108, hyp_len=55174, ref_len=49805)

VLEMMA: BLEU = 13.06, 44.1/19.1/8.5/4.1 (BP=1.000, ratio=1.017, hyp_len=50652, ref_len=49805)

ORIGINAL BASELINE (full corpus): BLEU = 18.89, 53.0/27.0/14.1/7.9 (BP=0.946, ratio=0.947, hyp_len=47182, ref_len=49805)

BLEU improves over the half-size baseline: it roughly doubles for ALEMMA (4.24 to 8.60) and roughly triples when only verbs are lemmatized (4.24 to 13.06).
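Each result line lists the BLEU score, the four n-gram precisions, and the brevity penalty (BP). The standard relation between them is BLEU = BP * exp(mean of the log precisions), with BP = exp(1 - ref_len / hyp_len) when the hypothesis is shorter than the reference and 1 otherwise. A small sketch that recomputes the scores from the printed figures (function names are illustrative):

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """BP is 1 unless the hypothesis is shorter than the reference."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(precisions, hyp_len, ref_len):
    """BLEU (0-100) from n-gram precisions given in percent."""
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


reported = {
    "BASELINE": ([27.9, 9.3, 2.2, 0.7], 46470, 49805),
    "ALEMMA": ([36.4, 13.7, 5.2, 2.1], 58645, 49805),
    "NLEMMA": ([40.0, 15.7, 6.2, 2.7], 55174, 49805),
    "VLEMMA": ([44.1, 19.1, 8.5, 4.1], 50652, 49805),
}
for name, (precisions, hyp_len, ref_len) in reported.items():
    print(name, round(bleu(precisions, hyp_len, ref_len), 2))
```

The recomputed values differ slightly from the reported ones because the printed precisions are rounded to one decimal place.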

Page 9: Lemmatizer czechtoenglish ml

BASELINE OUTPUT:
rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné aaverze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALEMMA OUTPUT:
rasov~R , divided europe
in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this iis an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Page 10: Lemmatizer czechtoenglish ml

Dictionary

Czech      English      POS
rasismus   racialism    Noun
rasismus   racism       Noun
penova     foam         Noun
averze     abhorrence   Noun
averze     disliking    Noun
averze     loathing     Noun
imigrant   immigrant    Noun

VLemma-Dict-Output:
race-specific divided europe
in fact the extreme right is its rasismus [racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova [foam] national fronts - all of this is happening parties or movement would be held and of from the common averze [loathing] towards imigrant [immigrants] and pushing of makes it easier to this view , to question the immigrants .

VLemma Output:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .
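The dictionary step on this slide glosses Czech words that the decoder left untranslated. A minimal sketch of such a lookup, using only the entries listed on the slide, is given below. How a gloss is chosen when a word has several senses is not stated; the sketch takes the last listed sense, which happens to match the glosses shown for rasismus and averze, and the name annotate is hypothetical.

```python
# Entries copied from the slide's dictionary table (all nouns).
DICTIONARY = {
    "rasismus": ["racialism", "racism"],
    "penova": ["foam"],
    "averze": ["abhorrence", "disliking", "loathing"],
    "imigrant": ["immigrant"],
}


def annotate(sentence, dictionary=DICTIONARY):
    """Append a bracketed English gloss to every token found in the dictionary."""
    out = []
    for token in sentence.split():
        senses = dictionary.get(token.lower())
        # Assumption: use the last listed sense when there are several.
        out.append(f"{token} [{senses[-1]}]" if senses else token)
    return " ".join(out)


print(annotate("the extreme right is its rasismus and that use immigration"))
print(annotate("from the common averze towards imigrant"))
```

The slide shows the plural gloss "immigrants" for imigrant, so the original lookup presumably also adjusted number; the sketch does not attempt that.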

Page 11: Lemmatizer czechtoenglish ml

Limitations

• No use of syntax/POS in sentence reordering
• Phenomena such as ‘pronoun dropping’, which occurs in Czech, were not tested for accuracy in translation
• No human cross-evaluation to better understand the improvement in results
• Does not cover the effect of target-language morphology on translation (Zhang et al., 2007)

Page 12: Lemmatizer czechtoenglish ml

FUTURE DIRECTION

• Add syntactic information to improve word reordering and language modeling.
• Carry out experiments with other languages too.
• Test a pipeline of lemmatization to improve word alignment? But what about syntax?

VLEMMA:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

[Diagram labels: Lemmatize nouns, Source Language, English]