The Oslo-Bergen Tagger OBT+stat - a short presentation
André Lynum, Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad
Morphosyntactic tagger and lemmatizer
• Bokmål and Nynorsk
• Based on lexicon and linguistic rules
• Statistical disambiguation for completely unambiguous output (currently Bokmål only)
Purpose
• Annotation for linguistic research (e.g. the Oslo Corpus)
• Large-scale corpus annotation (e.g. NoWaC, in progress)
Applications
• Grammar checker in Microsoft Word and others
• Open source and commercial translation systems (Apertium, NyNo, Kaldera)
• Commercial Content Management Systems (TextUrgy)
Resources
Lexicon based on Norsk ordbank
Bokmål: 151 229 entries
Nynorsk: 126 323 entries
Resources
Hand-made Constraint Grammar rules
Bokmål: 2214 morphological rules
Nynorsk: 3849 morphological rules
Resources
Development and test corpora
Training/development corpus: approx. 120,000 words each for Bokmål and Nynorsk
Test/evaluation corpus: approx. 30,000 words each for Bokmål and Nynorsk
Resources
Dependency syntax for both Bokmål and Nynorsk
Technology
Multitagger: Common Lisp
CG Disambiguator: VislCG3 (C++)
Statistical Disambiguator: Ruby, HunPos
![Page 10: The Oslo-Bergen Tagger OBT+stat - a short presentation](https://reader035.vdocuments.site/reader035/viewer/2022062304/568145a8550346895db29f4c/html5/thumbnails/10.jpg)
Pipeline
Results
Competitive results on varied domains
Multitagger
• Sophisticated tokenizer, morphological analyzer and compound word analyzer (guesser)
• Enumerates all possible tags and lemmas
• Tags composed of detailed morphosyntactic information
Multitagger output
<word>Dette</word>
"<dette>"
    "dette" verb inf i2 pa4
    "dette" pron nøyt ent pers 3
    "dette" det dem nøyt ent
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell fem ub ent samset
    "testsetning" subst appell mask ub ent samset
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
Multitagger output
<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
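The cohort layout shown on these slides (a `<word>` line, a quoted `"<form>"` line, then one candidate reading per line) is straightforward to read programmatically. The following is a minimal sketch, assuming each reading sits on its own indented line; `parse_cohorts`, `Cohort` and `Reading` are illustrative names and not part of OBT.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Reading:
    lemma: str
    tags: list

@dataclass
class Cohort:
    word: str
    readings: list = field(default_factory=list)

WORD_RE = re.compile(r"<word>(.*?)</word>")
# A reading line is indented and starts with the quoted lemma.
READING_RE = re.compile(r'^\s+"([^"]*)"\s*(.*)$')

def parse_cohorts(text):
    """Collect every word together with all readings listed for it."""
    cohorts = []
    for line in text.splitlines():
        word = WORD_RE.match(line)
        if word:
            cohorts.append(Cohort(word.group(1)))
            continue
        reading = READING_RE.match(line)
        if reading and cohorts:
            cohorts[-1].readings.append(
                Reading(reading.group(1), reading.group(2).split())
            )
    return cohorts

SAMPLE = '''<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1'''

cohorts = parse_cohorts(SAMPLE)
for r in cohorts[0].readings:
    print(cohorts[0].word, r.lemma, r.tags)
```

The unindented `"<en>"` line is skipped on purpose, since the surface form is already taken from the `<word>` line.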
CG Disambiguator
• Based on detailed Constraint Grammar rulesets for Bokmål and Nynorsk
• Rules compatible with the state of the art VislCG3 disambiguator
• Efficiently disambiguates multitagger cohorts with high precision
• Leaves some ambiguity by design
CG Rules
#:2553
SELECT:2553 (subst mask ent) IF
    (NOT 0 farlige-mask-subst)
    (NOT 0 fv)
    (NOT 0 adj)
    (NOT -1 komma/konj)
    (**-1C mask-det LINK NOT 0 nr2-det LINK NOT *1 ikke-adv-adj);
# "en vidunderlig vakker sommerfugl"
Example output
<word>Dette</word>
"<dette>"
    "dette" pron nøyt ent pers 3 SELECT:2607
;   "dette" verb inf i2 pa4 SELECT:2607
;   "dette" det dem nøyt ent SELECT:2607
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant SELECT:2762
;   "en" adv REMOVE:3689
;   "en" pron ent pers hum SELECT:2762
;   "ene" verb imp tr1 SELECT:2762
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell mask ub ent samset SELECT:2553
;   "testsetning" subst appell fem ub ent samset SELECT:2553
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
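In this traced output, readings marked with a leading `;` have been discarded, and the trailing `SELECT:n` or `REMOVE:n` names the rule responsible. A minimal sketch of separating surviving from discarded readings follows; it assumes one reading per line with the `;` at the start of the line, and `split_trace` is a hypothetical helper, not part of OBT or VislCG3.

```python
import re

# Optional ';' marker, indentation, quoted lemma, then tags and trace info.
READING_RE = re.compile(r'^(;?)\s+"([^"]*)"\s*(.*)$')
RULE_RE = re.compile(r"^(SELECT|REMOVE):\d+$")

def split_trace(lines):
    """Return (kept, removed) lists of (lemma, tags, rule) tuples."""
    kept, removed = [], []
    for line in lines:
        m = READING_RE.match(line)
        if not m:
            continue  # <word> lines and "<form>" lines are ignored here
        tokens = m.group(3).split()
        rule = tokens[-1] if tokens and RULE_RE.match(tokens[-1]) else None
        tags = tokens[:-1] if rule else tokens
        entry = (m.group(2), tags, rule)
        (removed if m.group(1) else kept).append(entry)
    return kept, removed

TRACE = '''"<en>"
    "en" det mask ent kvant SELECT:2762
;   "en" adv REMOVE:3689
;   "en" pron ent pers hum SELECT:2762
;   "ene" verb imp tr1 SELECT:2762'''

kept, removed = split_trace(TRACE.splitlines())
print("kept:", kept)
print("removed:", removed)
```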
Example of ambiguity left unresolved
<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl
<word>kan</word>
"<kan>"
    "kunne" verb pres tr1 tr3 <aux1/infinitiv>
<word>være</word>
"<være>"
    "være" verb inf tr5
    "være" verb inf a5 pr1 pr2 <aux1/perf_part>
;   "være" subst appell nøyt ubøy REMOVE:3123
<word>vanskelige</word>
"<vanskelige>"
    "vanskelig" adj fl pos
;   "vanskelig" adj be ent pos REMOVE:2318
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
Example of ambiguity left unresolved
<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl
Example of unresolved ambiguity
<word>Det</word>
"<det>"
    "det" pron nøyt ent pers 3 SELECT:2607
;   "det" det dem nøyt ent SELECT:2607
<word>dreier</word>
"<dreier>"
    "dreie" verb pres tr1 i2 tr11 SELECT:2467
;   "drei" subst appell mask ub fl SELECT:2467
;   "dreier" subst appell mask ub ent SELECT:2467
<word>seg</word>
"<seg>"
    "seg" pron akk refl SELECT:3333
;   "sige" verb pret i2 a3 pa4 SELECT:3333
<word>om</word>
"<om>"
    "om" prep SELECT:2653
;   "om" sbu SELECT:2653
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
Example of unresolved ambiguity
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>
Example of lemma ambiguity
<word>Det</word>
"<det>"
    "Det" subst prop <*>
<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064
;   "gammel" adj fl pos SELECT:3064
;   "gammal" adj fl pos SELECT:3064
<word>testamentet</word>
"<testamentet>"
    "testament" subst appell nøyt be ent
    "testamente" subst appell nøyt be ent
<word>.</word>
"<.>"
Example of lemma ambiguity
<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064
Example of lemma ambiguity
<word>Oslo</word>
"<oslo>"
    "Oslo" subst prop
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent
<word>vår</word>
"<vår>"
    "vår" det mask ent poss SELECT:2689
;   "vår" det fem ent poss SELECT:2689
;   "vår" subst appell mask ub ent SELECT:2689
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
Example of lemma ambiguity
<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent
Example of unwanted ambiguity
Livet på jorden har tilpasset seg og tildels utnyttet de skiftende forhold.
Example of unwanted ambiguity
<word>og</word>
"<og>"
    "og" konj
    "og" konj clb
;   "og" adv REMOVE:2227
<word>til dels</word>
"<til dels>"
    "til dels" adv prep+subst @adv
<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1
;   "utnytte" adj nøyt ub ent <perf-part> tr1 REMOVE:2274
;   "utnytte" adj ub m/f ent <perf-part> tr1 REMOVE:2274
<word>de</word>
"<de>"
    "de" det dem fl SELECT:2780
;   "de" pron fl pers 3 nom SELECT:2780
<word>skiftende</word>
"<skiftende>"
    "skifte" adj <pres-part> tr1 i1 i2 tr11 pa1 pa2 pa5 tr13
<word>forhold</word>
Example of unwanted ambiguity
<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1
Statistical disambiguator
• Uses a statistical model to fully disambiguate
• Simple model based on existing resources
• Must discriminate between the ambiguities left by the CG disambiguator
Earlier ambiguities - now resolved
<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl <Correct!>
    "setning" subst appell mask ub fl
Earlier ambiguities - now resolved
<word>om</word>
"<om>"
    "om" prep <Correct!>
    "om" sbu
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk> <Correct!>
    "åndsverk" subst appell nøyt ub ent <*verk>
Earlier ambiguities - now resolved
<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos <Correct!>
    "gammal" adj be ent pos
    "gammel" adj fl pos
    "gammal" adj fl pos
Earlier ambiguities - now resolved
<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent <Correct!>
Statistical disambiguation process
• Statistical tagger is run independently of the CG disambiguator
• The output is aligned
• Statistical tagger result used to select among ambiguous results
• Simple lemma disambiguation
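The selection step can be pictured roughly as follows. This is a minimal sketch, assuming the two outputs are already aligned token by token and that the statistical tagger's tag can be compared directly with the tag string of each CG reading; `select_reading` is a hypothetical name, not the function used in OBT+stat.

```python
def select_reading(readings, predicted_tag):
    """Keep the CG readings whose tag equals the statistical tagger's tag.

    readings: list of (lemma, tag) tuples left by the CG disambiguator.
    Returns (remaining_readings, covered). covered is False when the
    predicted tag matches none of the readings, in which case the CG
    output is returned unchanged.
    """
    matching = [r for r in readings if r[1] == predicted_tag]
    if matching:
        return matching, True
    return list(readings), False

setninger = [
    ("setning", "subst appell fem ub fl"),
    ("setning", "subst appell mask ub fl"),
]
print(select_reading(setninger, "subst appell fem ub fl"))

# Two lemmas sharing one tag survive tag selection together,
# which is why a separate lemma disambiguation step is needed.
byen = [
    ("bye", "subst appell mask be ent"),
    ("by", "subst appell mask be ent"),
]
print(select_reading(byen, "subst appell mask be ent"))
```

Because the CG readings are only filtered and never extended, the statistical tagger cannot introduce an analysis that the rules have already ruled out.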
HMM modelling
• Robust performance on smaller amounts of training data
• Good unknown word handling
• Cheap and mature
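For readers unfamiliar with HMM tagging, the core decoding step can be sketched in a few lines. This is a toy bigram Viterbi decoder with invented probabilities, meant only to illustrate the idea; it is not how HunPos is implemented, and it omits the unknown-word handling mentioned above. All names and numbers are assumptions.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence under a bigram HMM, computed in log space.

    Probabilities missing from a table fall back to `floor`, a crude
    stand-in for real smoothing.
    """
    if not words:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Invented toy parameters, for illustration only.
TAGS = ["det", "pron", "subst"]
START = {"det": 0.5, "pron": 0.4, "subst": 0.1}
TRANS = {("det", "subst"): 0.8, ("pron", "subst"): 0.1}
EMIT = {("det", "en"): 0.3, ("pron", "en"): 0.1, ("subst", "testsetning"): 0.01}

print(viterbi(["en", "testsetning"], TAGS, START, TRANS, EMIT))
```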
Our HMM model
• Trained on 122 523 words in 8178 sentences
• Variety of domains
• More than 350 distinct tags
• Not very good accuracy really
HMM model integration
Ambiguities in ca. 4.5% of tokens
Coverage ca. 80%
Lemma disambiguation
Mainly resolved by tag disambiguation
But some are still ambiguous
Using word form frequencies
Idea: lemmas occur as word forms in large corpora
Use word frequencies from NoWaC to disambiguate among lemmas
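A minimal sketch of this heuristic, assuming a precomputed table of word-form counts: among the candidate lemmas, pick the one that occurs most often as a word form. The function name `pick_lemma` and the counts in `FORM_FREQ` are invented for illustration and are not taken from NoWaC.

```python
def pick_lemma(candidates, form_freq):
    """Return the candidate lemma most frequent as a word form.

    Returns None when there are no candidates, when no candidate is
    attested, or when the top count is tied, so the caller can fall
    back to another strategy.
    """
    if not candidates:
        return None
    counts = [(form_freq.get(lemma, 0), lemma) for lemma in candidates]
    top = max(count for count, _ in counts)
    winners = [lemma for count, lemma in counts if count == top]
    if top == 0 or len(winners) > 1:
        return None
    return winners[0]

# Hypothetical counts, not real corpus figures.
FORM_FREQ = {"by": 1000, "bye": 3}

print(pick_lemma(["bye", "by"], FORM_FREQ))
print(pick_lemma(["gammel", "gammal"], FORM_FREQ))
```

Returning `None` on ties or unattested lemmas keeps the unresolved cases explicit instead of silently picking the first candidate.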
Remaining ambiguities
Randomly selected
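A sketch of such a last-resort choice is given below. The fixed seed is an assumption added here so that repeated runs over the same corpus give the same output; the slides do not say whether the actual implementation seeds its random generator. `pick_remaining` is a hypothetical name.

```python
import random

def pick_remaining(readings, rng):
    """Choose one reading uniformly at random from those still left."""
    return rng.choice(readings)

# Fixed seed: an assumption made here for repeatable runs.
rng = random.Random(0)
testamentet = ["testament", "testamente"]
print(pick_remaining(testamentet, rng))
```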
Expectations
• Cheap and cheerful modeling
• Facing a variety of hard disambiguation decisions
• On a large morphosyntactic tagset
• Evaluated on a slightly eclectic corpus
Results: CG Disambiguation
Precision 96.03%
Recall 99.02%
F-score 97.2%
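The slides do not define these measures. A common convention for taggers that may leave several readings per token, assumed here, is: recall is the share of tokens whose correct reading is still among the output readings, and precision is the number of correct readings divided by the total number of readings output. The sketch below uses hypothetical names and toy data.

```python
def evaluate(output, gold):
    """Precision and recall for possibly ambiguous tagger output.

    output: one set of readings per token.
    gold:   the single correct reading for each token.
    """
    total_readings = sum(len(readings) for readings in output)
    correct = sum(1 for readings, g in zip(output, gold) if g in readings)
    precision = correct / total_readings
    recall = correct / len(gold)
    return precision, recall

def f_score(precision, recall):
    """Balanced F-score, the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Toy data: the second token keeps two readings, the third lost the right one.
output = [{"a"}, {"b", "c"}, {"d"}]
gold = ["a", "b", "x"]
p, r = evaluate(output, gold)
print(p, r, f_score(p, r))

# Harmonic mean of the precision and recall reported on the slide.
print(round(100 * f_score(0.9603, 0.9902), 2))
```

Under this formula the reported precision and recall combine to roughly 97.5, slightly above the 97.2% shown, so the slide's F-score may rest on a different averaging or rounding.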
Results: Full disambiguation
Accuracy 96.56%
Results: Full disambiguation
Overall accuracy 96.56%
Tagging accuracy 96.74%
Lemma accuracy 98.33%
Details
Tagger coverage 79.39%
Tagger accuracy 81.70%
Lemma coverage 54.23%
Lemma accuracy 86.71%
Forthcoming (technical)
• Optimizing for very large corpora (> billion words)
• More sophisticated modeling
• Discriminative modeling or MBT modeling
• Constrained decoding
• Better lemma disambiguation
Forthcoming (theoretical)
• Finding the best division of labor between data driven and rule driven approaches
• Pivoting on specific errors and ambiguities
• Working more with syntax (CG3 dependency trees)
Links
• http://tekstlab.uio.no/obt-ny/index.html
• http://github.com/andrely/OBT-Stat