Tagset Reductions in Morphosyntactic Tagging of Croatian Texts Željko Agić, Marko Tadić and Zdravko Dovedan University of Zagreb {zagic, mtadic, zdovedan}@ffzg.hr


Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

Željko Agić, Marko Tadić and Zdravko Dovedan
University of Zagreb

{zagic, mtadic, zdovedan}@ffzg.hr


Introduction

• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context

• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data

• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best-performing taggers are based on SVM, CRF, HMM


Introduction

• data-driven tagging modules
  • the tagger and the data
  • data implies a tagset encoding word (sub)categories

• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English have at most 100 different tags
  • 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
  • accuracy of state-of-the-art taggers drops by ca. 10% on Croatian


Tagging Croatian texts

• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to theirs (96-97% EN, 85-86% HR)
  • all are highly dependent on unknown word counts

• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?
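Tagger voting is only raised as a possible improvement here, so the following is a minimal sketch of one common variant (per-token majority voting), not a description of CroTag. The function name `vote` and the tie-breaking rule (prefer the earliest tagger in the list) are assumptions made for illustration; the tags in the usage note are sample strings only.

```python
from collections import Counter


def vote(tag_sequences):
    """Combine the outputs of several taggers by per-token majority vote.

    tag_sequences: list of tag lists, one per tagger, all for the same sentence.
    Ties are broken in favour of the tagger listed first (an assumption).
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for proposals in zip(*tag_sequences):
        counts = Counter(proposals)
        best = max(counts.values())
        # Walk the proposals in tagger order and take the first top-voted tag.
        voted.append(next(tag for tag in proposals if counts[tag] == best))
    return voted
```

For example, `vote([["Ncmsn", "Vmip3s"], ["Ncmsa", "Vmip3s"], ["Ncmsn", "Vmip3p"]])` keeps the tag that two of the three taggers agree on at each position.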


From another perspective...

• goals of tagging
  • reaching perfect accuracy on full tagset or
  • making large-scale NLP systems perform better?

• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian
    Is it Ivo (m.) or Iva (f.) Sanader?

• specific tasks may require specific tagset design
  • keeping speed and memory footprint
  • reducing tagset size means raising accuracy


Reducing the tagset

• MulText East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, singular, nominative

• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
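Because the tagset is positional, a reduction can be sketched as masking the positions that are to be dropped. The snippet below is a rough illustration under stated assumptions, not the procedure used for the subsets above: `reduce_tag`, `tagset_size` and the `KEEP_POSITIONS` map are hypothetical names, and the position table covers only the noun example given above (position 0 is the category letter, position 1 the noun type). Real attribute positions for other categories should be taken from the MTE v3 specification.

```python
# Hypothetical map: category letter -> positions to keep (0 = category letter).
# Only the noun entry follows the Ncmsn example; everything else is left out.
KEEP_POSITIONS = {
    "N": (0, 1),  # keep category and noun type, drop gender, number, case
}


def reduce_tag(tag, keep_positions=None):
    """Mask every position of a positional tag that is not explicitly kept.

    Categories without an entry in keep_positions keep only the category
    letter, which corresponds to a part-of-speech-only reduction.
    """
    if not tag:
        return tag
    keep = (keep_positions or {}).get(tag[0], (0,))
    masked = [ch if i in keep else "-" for i, ch in enumerate(tag)]
    # Trailing placeholders carry no information, so strip them.
    return "".join(masked).rstrip("-")


def tagset_size(tags, keep_positions=None):
    """Number of distinct tags that remain after the reduction."""
    return len({reduce_tag(tag, keep_positions) for tag in tags})
```

With these assumptions, `reduce_tag("Ncmsn")` gives the bare category `N`, while `reduce_tag("Ncmsn", KEEP_POSITIONS)` keeps the noun type and gives `Nc`. Running `tagset_size` over the tags observed in a corpus shows how quickly the tagset shrinks as positions are masked.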


Results


More results

• adjectives, nouns and pronouns
  • the categories most difficult to tag for Croatian
  • combination of frequency and tags used
  • maybe these are most important to tag accurately?

F1-measures on adjectives, nouns and pronouns

type        subset 0     subset 4     subset 5
Adjective   0.64±0.04    0.74±0.05    0.92±0.02
Noun        0.79±0.03    0.86±0.03    0.95±0.01
Pronoun     0.76±0.03    0.87±0.04    0.99±0.01
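The slides do not say how the per-category F1 was computed, so the following is only one plausible reading, sketched under assumptions: a token is a true positive for a category when the predicted tag equals the gold tag and the gold tag belongs to that category; a wrong prediction inside the category counts as a false positive, and a missed gold tag of the category counts as a false negative. The function name `category_f1` and the sample tags are hypothetical.

```python
def category_f1(gold, predicted, category):
    """F1 for one category letter (e.g. "N"), comparing full tags.

    Uses F1 = 2*TP / (2*TP + FP + FN), which equals the harmonic mean of
    precision and recall. Returns 0.0 when the category never occurs.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            if g.startswith(category):
                tp += 1
        else:
            if p.startswith(category):
                fp += 1  # predicted a tag of this category, but the wrong one
            if g.startswith(category):
                fn += 1  # a gold tag of this category was not recovered

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Under this definition a reduced tagset helps in two ways: fewer distinct tags per category means fewer ways to be wrong inside the category, and confusions between tags that collapse to the same reduced tag stop counting as errors.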


Conclusions

• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency

• reductions are illustrative
  • careful tagset design required with regard to requirements

• further work
  • as mentioned: reaching perfect accuracy on full tagset or making large-scale NLP systems perform better?


Your questions?

Computational Linguistic Models and Language Technologies for Croatian

rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr
