Tagset Reductions in Morphosyntactic Tagging of Croatian Texts Željko Agić, Marko Tadić and Zdravko Dovedan University of Zagreb {zagic, mtadic, zdovedan}@ffzg.hr


Introduction

• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context

• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data

• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best-performing taggers are based on SVM, CRF, HMM

Introduction

• data-driven tagging modules
  • the tagger and the data
  • data implies a tagset encoding word (sub)categories

• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English have at most ca. 100 different tags
  • 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
  • accuracy of state-of-the-art taggers drops by ca. 10% on Croatian

Tagging Croatian texts

• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to theirs (96-97% EN, 85-86% HR)
  • all are highly dependent on unknown word counts

• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?
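The two improvement ideas above can be illustrated with a minimal sketch. It is not CroTag's implementation: the function names, the toy word lists and the tie-breaking rule are all assumptions made for illustration, and the tags are used only as opaque strings written in MTE style.

```python
from collections import Counter


def candidate_tags(word, seen_tags, lexicon, all_tags):
    """Pick the set of tags a tagger should consider for `word`.

    seen_tags: word form -> tags observed in the training corpus
    lexicon:   word form -> tags licensed by an inflectional lexicon
    all_tags:  fallback when the word is in neither resource
    """
    if word in seen_tags:
        return seen_tags[word]
    if word in lexicon:
        # Unknown to the tagger, but the lexicon narrows the search space.
        return lexicon[word]
    return all_tags


def vote(tag_sequences):
    """Majority vote per token; ties go to the earliest tagger in the list."""
    voted = []
    for tags in zip(*tag_sequences):
        counts = Counter(tags)
        top = max(counts.values())
        voted.append(next(t for t in tags if counts[t] == top))
    return voted


# Toy data (hypothetical, not taken from the corpus used in the talk).
seen = {"žena": {"Ncfsn", "Ncfpg"}}
lexicon = {"kuće": {"Ncfsg", "Ncfpn", "Ncfpa"}}
all_tags = {"Ncfsn", "Ncfpg", "Ncfsg", "Ncfpn", "Ncfpa", "Ncmsn"}

print(candidate_tags("kuće", seen, lexicon, all_tags))
print(vote([["N", "V", "A"], ["N", "N", "A"], ["P", "V", "A"]]))
```

In this sketch the lexicon only restricts the candidate set; choosing among the remaining tags is still left to the tagger's own model.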

From another perspective...

• goals of tagging
  • reaching perfect accuracy on full tagset or
  • making large-scale NLP systems perform better?

• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian
    Is it Ivo (m.) or Iva (f.) Sanader?

• specific tasks may require specific tagset design
  • keeping speed and memory footprint
  • reducing tagset size means raising accuracy

Reducing the tagset

• MULTEXT-East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, singular, nominative

• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
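A minimal sketch of how a positional tag can be decoded and reduced. It covers nouns only, and the two reduction functions correspond only loosely to subsets 6 and 5; the exact reduction rules used in the talk are not given on the slide, so the function names, the attribute table and the toy tag list are assumptions.

```python
# Noun attributes by position after the leading "N" (illustrative table).
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]


def decode_noun(tag):
    """Expand a noun tag such as 'Ncmsn' into named features."""
    if not tag.startswith("N"):
        raise ValueError("this sketch only decodes noun tags")
    features = {"pos": "noun"}
    for (name, values), letter in zip(NOUN_ATTRIBUTES, tag[1:]):
        if letter != "-":  # '-' marks an attribute that does not apply
            features[name] = values.get(letter, letter)
    return features


def pos_only(tag):
    """Keep part-of-speech information only (first letter)."""
    return tag[:1]


def noun_type_only(tag):
    """Keep the noun type for nouns, part of speech for everything else."""
    return tag[:2] if tag.startswith("N") else tag[:1]


def tagset_size(tags, reduce):
    """Number of distinct tags left after applying a reduction."""
    return len({reduce(t) for t in tags})


# Toy tags in MTE style, used here only as opaque strings.
tags = ["Ncmsn", "Ncmsg", "Ncfsn", "Npmsn", "Vmip3s", "Afpmsn"]

print(decode_noun("Ncmsn"))
print(tagset_size(tags, lambda t: t))    # full toy tagset
print(tagset_size(tags, noun_type_only))
print(tagset_size(tags, pos_only))
```

Because the tagset is positional, each reduction is a simple string operation, which makes it cheap to retrain and compare a tagger on several subsets.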

Results

More results

• adjectives, nouns and pronouns
  • the most difficult categories to tag for Croatian
  • combination of frequency and number of tags used
  • maybe these are most important to tag accurately?

F1-measures on adjectives, nouns and pronouns

type        subset 0     subset 4     subset 5
Adjective   0.64±0.04    0.74±0.05    0.92±0.02
Noun        0.79±0.03    0.86±0.03    0.95±0.01
Pronoun     0.76±0.03    0.87±0.04    0.99±0.01
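For reference, a per-category F1-measure can be computed as sketched below. This is the standard definition (harmonic mean of precision and recall). How the talk matched gold and predicted tags within a category, and how the ± values were obtained, is not stated on the slide, so the label-level comparison and the toy data here are assumptions.

```python
def f1_for_category(gold, predicted, category):
    """F1 for one label, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): N = noun, A = adjective, P = pronoun.
gold = ["N", "A", "N", "P", "A"]
predicted = ["N", "N", "N", "P", "A"]

for label in ("A", "N", "P"):
    print(label, round(f1_for_category(gold, predicted, label), 2))
```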

Conclusions

• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency

• reductions are illustrative
  • careful tagset design required with regard to requirements

• further work
  • as mentioned: reaching perfect accuracy on full tagset or making large-scale NLP systems perform better?

Your questions?

Computational Linguistic Models and Language Technologies for Croatian

rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr
