Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Željko Agić, Marko Tadić and Zdravko Dovedan
University of Zagreb
{zagic, mtadic, zdovedan}@ffzg.hr
Introduction
• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context
• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data
• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best-performing taggers are based on SVM, CRF, HMM
Introduction
• data-driven tagging modules
  • the tagger and the data
  • data implies a tagset encoding word (sub)categories
• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English have at most 100 different tags
  • 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
  • accuracy of state-of-the-art taggers drops by ca. 10% on Croatian
Tagging Croatian texts
• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to theirs (96-97% EN, 85-86% HR)
  • all three are highly dependent on unknown word counts
• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?
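The idea of using the inflectional lexicon for unknown words can be sketched as follows. This is a minimal illustration under stated assumptions, not CroTag's actual implementation: the function names, the toy word forms and their tags, and the fallback tag set are all hypothetical.

```python
def unknown_rate(tokens, training_vocab):
    """Fraction of tokens that never occurred in the training corpus."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in training_vocab)
    return unknown / len(tokens)


def candidate_tags(word, training_vocab, inflectional_lexicon, fallback_tags):
    """Return the set of tags the tagger should consider for a word.

    Known words use the tags seen in training; unknown words are looked up
    in the inflectional lexicon; only words missing from both fall back to
    a broad open-class tag set (an assumption made for this sketch).
    """
    if word in training_vocab:
        return training_vocab[word]
    if word in inflectional_lexicon:
        return inflectional_lexicon[word]
    return fallback_tags


# Toy data, for illustration only.
TRAINING_VOCAB = {"kuća": {"Ncfsn"}}
INFLECTIONAL_LEXICON = {"kuće": {"Ncfsg", "Ncfpn", "Ncfpa"}}
FALLBACK_TAGS = {"Ncmsn", "Ncfsn", "Npmsn"}

tokens = ["kuća", "kuće", "xyz"]
rate = unknown_rate(tokens, TRAINING_VOCAB)
lexicon_hit = candidate_tags("kuće", TRAINING_VOCAB, INFLECTIONAL_LEXICON, FALLBACK_TAGS)
lexicon_miss = candidate_tags("xyz", TRAINING_VOCAB, INFLECTIONAL_LEXICON, FALLBACK_TAGS)
```

In this sketch the lexicon narrows the search space for an unknown word from the whole fallback set to the few tags its word form can actually carry, which is one plausible reason why a lexicon would reduce the dependence on unknown word counts.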
From another perspective...
• goals of tagging
  • reaching perfect accuracy on the full tagset, or
  • making large-scale NLP systems perform better?
• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian
    Is it Ivo (m.) or Iva (f.) Sanader?
• specific tasks may require specific tagset design
  • keeping speed and memory footprint
  • reducing tagset size means raising accuracy
Reducing the tagset
• MULTEXT-East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, etc.
• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
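Because the tagset is positional, a reduction amounts to dropping character positions from each tag. The sketch below shows this for noun tags only, assuming the usual MULTEXT-East noun layout (part of speech, type, gender, number, case); the helper names and the toy tag list are hypothetical, and the exact mappings behind the six subsets above are not reproduced here.

```python
# Assumed attribute positions in a noun tag such as "Ncmsn".
NOUN_POSITIONS = {"pos": 0, "type": 1, "gender": 2, "number": 3, "case": 4}


def reduce_noun_tag(tag, keep):
    """Keep only the named attributes of a noun tag, in positional order."""
    kept = sorted(NOUN_POSITIONS[name] for name in keep)
    return "".join(tag[i] for i in kept if i < len(tag))


def pos_only(tag):
    """Strongest reduction: keep only the part-of-speech letter."""
    return tag[:1]


# Toy noun tags, for illustration only.
tags = ["Ncmsn", "Ncmsg", "Ncfsn", "Npmsn"]

no_case = reduce_noun_tag("Ncmsn", ["pos", "type", "gender", "number"])
type_only = reduce_noun_tag("Ncmsn", ["pos", "type"])

# Distinct tags remaining after each reduction of the toy list.
without_case = {reduce_noun_tag(t, ["pos", "type", "gender", "number"]) for t in tags}
with_type_only = {reduce_noun_tag(t, ["pos", "type"]) for t in tags}
with_pos_only = {pos_only(t) for t in tags}
```

On the toy list, the four full tags collapse to three tags without case, two with noun type only, and one with part of speech only, mirroring on a small scale how the subsets shrink from 800 to 13 tags.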
![Page 7: Tagset Reductions in Morphosyntactic Tagging of Croatian Texts Željko Agić, Marko Tadić and Zdravko Dovedan University of Zagreb {zagic, mtadic, zdovedan}@ffzg.hr](https://reader035.vdocuments.site/reader035/viewer/2022080917/56649efb5503460f94c0d6d7/html5/thumbnails/7.jpg)
Results
More results
• adjectives, nouns and pronouns
  • the most difficult categories to tag for Croatian
  • owing to the combination of their frequency and the tags they use
  • maybe these are the most important to tag accurately?
F1-measures on adjectives, nouns and pronouns

| Type | Subset 0 | Subset 4 | Subset 5 |
|---|---|---|---|
| Adjective | 0.64±0.04 | 0.74±0.05 | 0.92±0.02 |
| Noun | 0.79±0.03 | 0.86±0.03 | 0.95±0.01 |
| Pronoun | 0.76±0.03 | 0.87±0.04 | 0.99±0.01 |
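For readers unfamiliar with per-category F1, the following is a minimal sketch of one way such scores could be computed from gold and predicted tag sequences. It is not necessarily the evaluation procedure used for the figures above: the function name, the convention of grouping tags by their first letter, and the toy tag sequences are assumptions.

```python
from collections import Counter


def f1_per_category(gold, predicted, category_of=lambda tag: tag[:1]):
    """Compute F1 per category from parallel gold and predicted tag lists.

    A token counts as a true positive for its category only if the full tag
    matches. A mismatch is a false negative for the gold tag's category and
    a false positive for the predicted tag's category.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[category_of(g)] += 1
        else:
            fn[category_of(g)] += 1
            fp[category_of(p)] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores


# Toy sequences, for illustration only.
gold = ["Ncmsn", "Ncmsg", "Afpmsn", "Vmip3s"]
predicted = ["Ncmsn", "Ncmsn", "Afpmsn", "Vmip3s"]
scores = f1_per_category(gold, predicted)
```

Here the single case error on a noun lowers the noun score to 0.5 while adjectives and verbs stay at 1.0, which illustrates why categories with many fine-grained tags score lower on the full tagset than on reduced subsets.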
Conclusions
• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency
• reductions are illustrative
  • careful tagset design is required with regard to requirements
• further work
  • as mentioned: reaching perfect accuracy on the full tagset, or making large-scale NLP systems perform better?
Your questions?
Computational Linguistic Models and Language Technologies for Croatian
rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr