Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

Željko Agić, Marko Tadić and Zdravko Dovedan
University of Zagreb

{zagic, mtadic, zdovedan}@ffzg.hr

Introduction

• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context

• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data

• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best performing taggers are based on SVM, CRF, HMM

Introduction

• data-driven tagging modules
  • the tagger and the data
  • data implies a tagset encoding word (sub)categories

• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English have at most 100 different tags
  • 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
  • accuracy of state-of-the-art taggers drops by ca. 10%

Tagging Croatian texts

• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to theirs (96-97% EN, 85-86% HR)
  • all are highly dependent on unknown word counts

• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?

From another perspective...

• goals of tagging
  • reaching perfect accuracy on the full tagset, or
  • making large-scale NLP systems perform better?

• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian

Is it Ivo (m.) or Iva (f.) Sanader?

• specific tasks may require specific tagset design
  • preserving speed and memory footprint
  • reducing tagset size means raising accuracy

Reducing the tagset

• MulText East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, etc.
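The positional encoding can be illustrated with a minimal Python sketch. The slide confirms only the first three positions of Ncmsn (noun, common, masculine); the number and case positions and all value letters in `NOUN_POSITIONS` below are assumptions made for illustration, and `decode_noun_tag` is a hypothetical helper, not part of any tagger mentioned here.

```python
# Assumed layout of a noun tag. Only "N" = noun, "c" = common and
# "m" = masculine come from the slide; everything else is an assumption.
NOUN_POSITIONS = [
    ("category", {"N": "noun"}),
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]


def decode_noun_tag(tag):
    """Map each letter of a positional noun tag to an attribute/value pair."""
    decoded = {}
    for letter, (attribute, values) in zip(tag, NOUN_POSITIONS):
        # "-" is treated as "not applicable"; unknown letters are kept raw.
        if letter != "-":
            decoded[attribute] = values.get(letter, letter)
    return decoded


print(decode_noun_tag("Ncmsn"))
```

Because every position carries one attribute, the number of possible tags grows multiplicatively with the number of encoded categories, which is why the full tagset is so large.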

• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
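A reduction of this kind amounts to mapping each full tag onto a shorter one and counting the distinct tags that remain. The following Python sketch shows two such mappings under stated assumptions: the sample tags are illustrative, case is assumed to sit at index 4 of a noun tag, and the function names are hypothetical. It is not the procedure used to build the subsets above.

```python
def pos_only(tag):
    """Keep the part-of-speech letter only (in the spirit of subset 6)."""
    return tag[:1]


def drop_position(tag, index):
    """Remove one positional attribute from a tag."""
    return tag[:index] + tag[index + 1:]


def reduced_tagset(tags, reduce):
    """Apply a reduction to every tag and return the distinct results."""
    return sorted({reduce(tag) for tag in tags})


# Illustrative noun tags; the position of case (index 4) is an assumption.
sample_tags = ["Ncmsn", "Ncmsg", "Ncfsn", "Npmsn"]
CASE_INDEX = 4

without_case = reduced_tagset(sample_tags, lambda t: drop_position(t, CASE_INDEX))
only_pos = reduced_tagset(sample_tags, pos_only)

print(len(sample_tags), len(without_case), len(only_pos))
```

On this toy sample the tag count falls from four to three once case is removed and to one when only part of speech is kept, mirroring on a small scale the shrinking tag counts listed for the subsets.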

Results

More results

• adjectives, nouns and pronouns
  • the most difficult categories to tag for Croatian
  • combination of frequency and tags used
  • maybe these are most important to tag accurately?

F1-measures on adjectives, nouns and pronouns

type        subset 0     subset 4     subset 5
Adjective   0.64±0.04    0.74±0.05    0.92±0.02
Noun        0.79±0.03    0.86±0.03    0.95±0.01
Pronoun     0.76±0.03    0.87±0.04    0.99±0.01
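For reference, a per-category F1-measure can be computed as the harmonic mean of precision and recall over aligned gold and predicted tag sequences. The Python sketch below is one plausible way to do this, assuming a tag counts as correct only on an exact match and that a category is identified by the first letter of the tag. The slides do not state how the figures above were computed, so the function and the toy data are illustrative only.

```python
def f1_for_category(gold, predicted, category):
    """F1 for one word category over aligned gold and predicted tags."""
    gold_count = sum(1 for g in gold if g.startswith(category))
    pred_count = sum(1 for p in predicted if p.startswith(category))
    # A prediction is correct only if the whole tag matches the gold tag.
    correct = sum(
        1 for g, p in zip(gold, predicted)
        if g == p and g.startswith(category)
    )
    if gold_count == 0 or pred_count == 0 or correct == 0:
        return 0.0
    precision = correct / pred_count
    recall = correct / gold_count
    return 2 * precision * recall / (precision + recall)


# Toy data using part-of-speech-only tags (N = noun, A = adjective).
gold_tags = ["N", "N", "A", "N"]
predicted_tags = ["N", "A", "N", "N"]

noun_f1 = f1_for_category(gold_tags, predicted_tags, "N")
print(round(noun_f1, 2))
```

With richer tags, the exact-match requirement means an error in any single position (for instance case) counts against the category, which is consistent with scores rising as positions are removed from the tagset.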

Conclusions

• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency

• reductions are illustrative
  • careful tagset design is required with regard to requirements

• further work
  • as mentioned: reaching perfect accuracy on the full tagset, or making large-scale NLP systems perform better?

Your questions?

Computational Linguistic Models and Language Technologies for Croatian

rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr
