Disambiguation of homographic adjective and adverb forms in Croatian texts Danijela Merkler*, Daša Berović*, Željko Agić** * Department of Linguistics ** Department of Information Sciences Faculty of Humanities and Social Sciences, University of Zagreb [email protected] ; [email protected] ; [email protected] NooJ 2011 Dubrovnik 2011-06-15


Page 1: Disambiguation of homographic adjective and adverb forms in Croatian texts

Danijela Merkler*, Daša Berović*, Željko Agić**

* Department of Linguistics
** Department of Information Sciences

Faculty of Humanities and Social Sciences, University of Zagreb

[email protected]; [email protected]; [email protected]

NooJ 2011, Dubrovnik, 2011-06-15

Page 2: Talk overview

- project ACCURAT
- problem and corpora
- modeling local grammars and applying them
- statistical evaluation

Page 3: ACCURAT FP7 project

- main goal: to develop methods and techniques to overcome one of the central problems of machine translation, the lack of linguistic resources for under-resourced areas of machine translation
- key innovation: creation of a methodology and tools to measure, find and use comparable corpora to improve the quality of MT
- the ACCURAT project will contribute significantly not only to the theory of MT, but also to corpus linguistics, information extraction and natural language processing in general

Page 4: Scientific objectives

- create comparability metrics: develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
- establish research methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora
- disambiguation: an important process for POS and MSD tagging

Page 5: Problem

- parallel and comparable resources are sparse for Croatian when paired with any of the languages included in the project, especially if the other language is under-resourced as well
- importance of high-quality annotation for existing language resources for Croatian
  - building (factored) language models for MT
  - using text anchors in comparable resources
- MSD-tagging and lemmatization errors detected in existing Croatian language resources
  - e.g. Croatian National Corpus v2.5 (automatically lemmatized and MSD-tagged), manually annotated subcorpora, Croatian Dependency Treebank
  - manual analysis of their annotation reveals regular patterns in these errors

Page 6: Problem

- forms of descriptive adjectives in the nominative singular case in the neuter gender are the same as the forms of the adverbs that are made from those adjectives by suffixation
- these adverbs are realized in context
- in most cases the adverb is made from an adjective that has an abstract meaning
- there are several types of word forms
  - forms of adverbs and adjectives that occur with no semantic constraints: razdragano (gleeful), bahato (arrogant), ubrzano (rapidly), uzrujano (upset), umiljato (cuddly)
  - forms that are made from verbs: drhtavo (shaking), laskavo (flattering), šepavo (lame)
  - forms that have a dual meaning (concrete and abstract): mlako (lukewarm), šugavo (itchy), mračno (darkly), hladno (cold), gorko (bitter)
  - forms that denote spatial and temporal relations: rano (early), duboko (deeply), plitko (shallow), lijevo (left)
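The ambiguous forms described on this slide can be located from the morphosyntactic tags alone. The following is a minimal sketch, assuming MULTEXT-East style MSD strings in which an adjective tag carries gender, number and case at fixed positions; the function name `is_homograph_candidate` and the exact position layout are assumptions made for illustration and should be checked against the MULTEXT-East specification that is actually used.

```python
def is_homograph_candidate(msd: str) -> bool:
    """Return True for adjective MSDs marked neuter, singular, nominative.

    Assumed layout of an adjective MSD (an assumption, not from the slides):
    index 0 = category 'A', 1 = type, 2 = degree,
    index 3 = gender, 4 = number, 5 = case.
    """
    if len(msd) < 6 or msd[0] != "A":
        return False
    gender, number, case = msd[3], msd[4], msd[5]
    return (gender, number, case) == ("n", "s", "n")


if __name__ == "__main__":
    # Example tags are illustrative only.
    for tag in ("Agpnsny", "Agpmsny", "Rgp"):
        print(tag, is_homograph_candidate(tag))
```

Only forms flagged this way need to be examined by context rules; all other adjective forms cannot be confused with the derived adverb.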

Page 7: Corpora

- Croatia Weekly
  - 100 kw newspaper corpus (newspaper published from 1998 to 2000, 118 issues)
  - it covers different domains: politics, economy and finance, tourism, ecology, culture, art, sports
  - part of the Croatian side of the Croatian-English Parallel Corpus
  - manually lemmatized and MSD-tagged using the MULTEXT-East v3 morphosyntactic specifications
- 1984
  - Orwell's "1984" corpus, manually lemmatized and MSD-tagged using MULTEXT-East v4
  - languages: En, Ro, Sl, Cs, Et, Hu, Sr, Bg, Ru, Mk, Hr...
  - encoded in TEI P4 (XML)

Page 8: Corpora

- imported the corpora to NooJ
  - used the NooJ XML import feature
  - kept the MSD feature annotations for adjectives, adverbs, nouns and verbs
  - converted the annotations for these PoS from MULTEXT-East to the NooJ format for lexical resources
  - modified feature annotations, e.g. MTE verb type from auxiliary, copulative to PG (auxiliary verb) in NooJ
- preprocessing enabled designing the rules without using Croatian resources for NooJ, i.e. skipping NooJ linguistic analysis
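The annotation conversion described on this slide can be pictured as a small mapping from MSD strings to category-plus-feature tags. This is a rough sketch under stated assumptions, not the conversion the authors used: the function name `mte_to_nooj`, the MTE verb-type codes `a` (auxiliary) and `c` (copula), and the `V+PG` spelling of the output are all assumptions. Only the four retained parts of speech and the PG label come from the slide.

```python
from typing import Optional

# Parts of speech kept during import: adjectives, adverbs, nouns, verbs.
KEPT_CATEGORIES = {"A", "R", "N", "V"}

# MTE verb types collapsed into PG (auxiliary verb).
# The one-letter codes are assumed, not quoted from the slides.
AUXILIARY_TYPES = {"a", "c"}


def mte_to_nooj(msd: str) -> Optional[str]:
    """Map an MTE MSD to a NooJ-style tag, or None if the PoS is not kept."""
    if not msd or msd[0] not in KEPT_CATEGORIES:
        return None
    tag = msd[0]
    # Auxiliary and copulative verbs both become V+PG.
    if tag == "V" and len(msd) > 1 and msd[1] in AUXILIARY_TYPES:
        tag += "+PG"
    return tag


if __name__ == "__main__":
    # Example tags are illustrative only.
    for msd in ("Var3s", "Vcr3s", "Vmr3s", "Agpnsny", "Sl"):
        print(msd, "->", mte_to_nooj(msd))
```

Collapsing both verb types into one label means a single grammar node can match every auxiliary-like verb, which is what the context patterns rely on.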

Page 9: Patterns

- we noticed several types of patterns in which adverbs that are homographic with adjectives occur
- they are defined by their contextual environment
- notation: A = adjective, R = adverb, N = noun, V = verb, Vpg = auxiliary verb

1) Vpg + A* + V → Vpg + R* + V

2) Vpg + A + A* → Vpg + R + A*

3) A* + V → R* + V

4) A + A* + N → R + A* + N
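The four context patterns can be read as retagging rules over a sequence of tagged tokens. The sketch below is one possible reading, not the NooJ grammars themselves: it treats every slot as exactly one token and ignores the asterisk, whose precise meaning the slides do not state. The `Token` class, the `neut_sg_nom` flag, `RULES` and `disambiguate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    form: str
    pos: str                    # "A", "R", "V" or "N"
    aux: bool = False           # True for auxiliary verbs (Vpg)
    neut_sg_nom: bool = False   # adjective form that is homographic with an adverb


# (context slots, index of the adjective slot that is retagged as adverb)
RULES: List[Tuple[Tuple[str, ...], int]] = [
    (("Vpg", "A", "V"), 1),   # 1) Vpg + A* + V  -> Vpg + R* + V
    (("Vpg", "A", "A"), 1),   # 2) Vpg + A + A*  -> Vpg + R + A*
    (("A", "V"), 0),          # 3) A* + V        -> R* + V
    (("A", "A", "N"), 0),     # 4) A + A* + N    -> R + A* + N
]


def slot_matches(slot: str, tok: Token) -> bool:
    """A 'Vpg' slot needs an auxiliary verb; other slots compare the PoS only."""
    if slot == "Vpg":
        return tok.pos == "V" and tok.aux
    return tok.pos == slot


def disambiguate(tokens: Sequence[Token]) -> List[Token]:
    """Retag homographic adjectives as adverbs where a pattern matches."""
    retag = set()
    for start in range(len(tokens)):
        for slots, target in RULES:
            window = tokens[start:start + len(slots)]
            if len(window) < len(slots):
                continue
            if all(slot_matches(s, t) for s, t in zip(slots, window)):
                # Only forms that can actually be adverbs are retagged.
                if window[target].neut_sg_nom:
                    retag.add(start + target)
    # All matches are computed on the original tags, then applied at once.
    return [replace(t, pos="R") if i in retag else t
            for i, t in enumerate(tokens)]


if __name__ == "__main__":
    sentence = [
        Token("je", "V", aux=True),
        Token("brzo", "A", neut_sg_nom=True),
        Token("hodao", "V"),
    ]
    print([t.pos for t in disambiguate(sentence)])
```

Matching on the original tags before applying any change keeps the rules independent of each other, so the order of `RULES` does not affect the result.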

Page 10: Vpg + A* + V

Page 11: Vpg + A + A*

Page 12: A* + V

Page 13: A + A* + N

Page 14: Statistics 1

- manually checked concordances
- errors frequently include the word sve, so we upgraded all grammars in order not to recognize sve

pattern         cw100    orwell   cw100 + orwell
Vpg + A* + V    64 %     62 %     63 %
Vpg + A + A*    100 %    100 %    100 %
A* + V          82 %     54 %     67 %
A + A* + N      69 %     75 %     70 %
total           77 %     61 %     70 %
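The combined column is not the mean of the two corpus columns, which is what one would expect if each figure is the share of manually checked concordance lines judged correct and the combined figure pools the lines of both corpora. The following sketch illustrates that reading. It is an assumption about how the percentages were obtained, and the counts in it are hypothetical, not taken from the slides.

```python
from typing import Tuple


def precision(correct: int, checked: int) -> float:
    """Share of manually checked matches that were judged correct."""
    if checked <= 0:
        raise ValueError("at least one checked match is required")
    return correct / checked


def pooled_precision(*counts: Tuple[int, int]) -> float:
    """Precision over several corpora, pooling (correct, checked) counts."""
    correct = sum(c for c, _ in counts)
    checked = sum(n for _, n in counts)
    return precision(correct, checked)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    newspaper = (9, 10)   # 90 % correct
    novel = (10, 20)      # 50 % correct
    # Pooled: 19 correct out of 30 checked, not the 70 % mean of the two rates.
    print(round(100 * pooled_precision(newspaper, novel)))
```

Pooling weights each corpus by the number of matches it contributes, so a corpus with more matches pulls the combined figure towards its own rate.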

Page 15: Example of upgraded grammar
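The upgrade reported with the first statistics, grammars that no longer recognize sve, amounts to a lexical exclusion on top of the context patterns. A minimal sketch of that idea follows. It filters already extracted matches rather than changing a NooJ grammar, and the names `EXCLUDED_FORMS` and `drop_excluded` are hypothetical. Treating any occurrence of sve inside a match as disqualifying is also an assumption, since the slides do not say which slot the word filled.

```python
from typing import FrozenSet, Iterable, List, Sequence

# Word forms that the upgraded grammars should not recognize.
EXCLUDED_FORMS: FrozenSet[str] = frozenset({"sve"})


def drop_excluded(
    matches: Iterable[Sequence[str]],
    excluded: FrozenSet[str] = EXCLUDED_FORMS,
) -> List[Sequence[str]]:
    """Keep only matches that contain none of the excluded word forms."""
    return [
        match
        for match in matches
        if not any(word.lower() in excluded for word in match)
    ]


if __name__ == "__main__":
    found = [("je", "sve", "rekao"), ("je", "brzo", "hodao")]
    print(drop_excluded(found))
```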

Page 16: Statistics 2

- the results improved after we applied the new grammars (total for both corpora: from 70 % to 83 %)
- significant difference between the newspaper corpus and the literary corpus (total: 89 % vs. 73 %)

pattern         cw100    orwell   cw100 + orwell
Vpg + A* + V    100 %    83 %     92 %
Vpg + A + A*    100 %    100 %    100 %
A* + V          87 %     63 %     74 %
A + A* + N      78 %     100 %    82 %
total           89 %     73 %     83 %

Page 17: Future work

- forms of relational adjectives in the nominative singular case in the masculine gender are the same as the forms of the adverbs that are made from those adjectives by suffixation (junački, pučki, bratski, životinjski)
- disambiguation of these forms also depends on the grammatical context in which they occur, so it can be done in a similar way
- applying the disambiguation rules to other Croatian language resources

Page 18: Thank you for your attention.

The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347.

www.accurat-project.eu