dan zeman - sdjt · interset (zeman, 2008) ... syntax content words are related by dependency...
TRANSCRIPT
29.9.2016, Ljubljana 4
Why?
● Linguistic research– Corpus query
● Training tools (parsers) for NLP– Downstream applications
29.9.2016, Ljubljana 7
Universal Dependencies
StanfordDependencies
CLEAR
Google UD
Stanford UD
HamleDT
Interset
Googleuniversal tags
http://universaldependencies.org/
29.9.2016, Ljubljana 8
Universal Dependencies
http://universaldependencies.org/
Universal Dependencies
29.9.2016, Ljubljana 9
Universal Dependencies
● Milestones:– 2014-04: EACL Göteborg, kick-off meeting– 2014-10: UD guidelines version 1– 2015-01: released 10 treebanks of 10 languages (UD 1.0)– 2015-05: released 19 treebanks of 18 languages (UD 1.1)– 2015-11: released 37 treebanks of 33 languages (UD 1.2)– 2016-05: released 54 treebanks of 40 languages (UD 1.3)– 2016-11: UD release 1.4, ~7 new languages– 2016 fall: UD guidelines version 2
http://universaldependencies.org/
29.9.2016, Ljubljana 10
Goals and Requirements● Cross-linguistically consistent grammatical annotation● Support multilingual research and development in NLP● Based on common usage and existing de facto standards● Caveats:
– Not a new linguistic theory –but linguistically informed and relevant
– Not an ideal parsing representation –but useful for comparative evaluation
– Not the ultimate annotation scheme –but a lightweight lingua franca
Not“Universal”
in the strictlytypological
sense!
29.9.2016, Ljubljana 11
Design Principles● Dependency
– Widely used in practical NLP systems– Available in treebanks for many languages
● Lexicalism– Basic annotation units are words – syntactic words– Words have morphological properties– Words enter into syntactic relations
● Recoverability– Transparent mapping from input text to word segmentation
29.9.2016, Ljubljana 12
Golden Rules
● Maximize parallelism– Don’t annotate the same thing in different ways– Don’t make different things look the same
● But don’t overdo it– Don’t annotate things that are not there– Balance: is it still the same thing?– Allow language-specific extensions
29.9.2016, Ljubljana 13
Part-of-Speech TagsOpen Closed OtherADJ ADP PUNCTADV AUX SYMINTJ CONJ X
NOUN DETPROPN NUMVERB PART
PRONSCONJ
● Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset (Petrov et al., 2012)
● All languages use the same inventory, but not all tags have to be used by all languages
29.9.2016, Ljubljana 14
FeaturesLexical Inflectional / Nominal Inflectional / Verbal
PronType Gender VerbFormNumType Animacy Mood
Poss Number TenseReflex Case Aspect
Definite VoiceDegree Person
Negative
● Standardized inventory of morphological features, based on Interset (Zeman, 2008)
● Languages select relevant features and can add language-specific features or values with documentation
29.9.2016, Ljubljana 15
Syntax
● Content words are related by dependency relations● Function words attach to closest content word● Punctuation attach to head of phrase or clause
29.9.2016, Ljubljana 16
Syntax
● Content words are related by dependency relations● Function words attach to closest content word● Punctuation attach to head of phrase or clause
Not“dependency”in the strictly
syntacticsense!
29.9.2016, Ljubljana 19
Dependency Relations● Taxonomy of 40 universal grammatical relations,
broadly attested in language typology (de Marneffe et al., 2014)– Language-specific subtypes may be added
● Organizing principles– Three types of structures: nominals, clauses, modifiers– Core arguments vs. other dependents (not arguments
vs. adjuncts)
29.9.2016, Ljubljana 20
Dependents of Clausal Predicates
Nominal Clausal Other
Core
nsubjnsubjpass
dobjiobj
csubjcsubjpass
ccompxcomp
Non-Core
nmodvocative
discourseexpl
advcl
advmodnegaux
auxpasscop
markpunct
29.9.2016, Ljubljana 22
Dependents of Nominals
Nominal Clausal Other
nmodappos
nummodacl
amoddetnegcase
29.9.2016, Ljubljana 23
“Stanford-style” Coordination
● Coordinate structures are headed by the first conjunct– Subsequent conjuncts depend on it via the conj relation– Conjunctions depend on it via the cc relation– Punctuation marks depend on it via the punct relation
29.9.2016, Ljubljana 24
Multiword Expressions
Relation Examplesmwe in spite of, as well as, ad hocname Roger Bacon, New York
compound phone book, four thousand, dress upgoeswith notwith standing, with out
● UD annotation does not permit “words with spaces”– Multiword expressions are analyzed using special relations– The mwe, name and goeswith relations are always head-initial– The compound relation reflects the internal structure
29.9.2016, Ljubljana 25
Other Relations
Relation Explanationparataxis Loosely linked clauses of same rank
list Lists without syntactic structureremnant Orphans in ellipsis linked to parallel elements
reparandum Disfluency linked to (speech) repairforeign Elements within opaque stretches of code switching
dep Unspecified dependencyroot Syntactically independent element of clause/phrase
29.9.2016, Ljubljana 26
Language-Specific Relations● Language-specific relations are subtypes of universal
relations added to capture important phenomena● Subtyping permits us to “back off” to universal relations
Relation Explanationacl:relcl Relative clause
compound:prt Verb particle (dress up)nmod:poss Genitive nominal (Mary ’s book)nmod:agent Agent in passive (saved by the bell)cc:preconj Preconjunction (both … and)det:predet Predeterminer (all those …)
29.9.2016, Ljubljana 27
Word Segmentation● Must be reproducible on new data● Surface tokens vs. syntactic words● Chinese, Vietnamese etc.: no clues, non-trivial algorithm● Arabic, Tamil etc.: part of morphological analysis● Spanish, German etc.: rather limited cases of contractions● Others: only punctuation (low-level tokenization)
29.9.2016, Ljubljana 28
Word Segmentation● Fusions
– al = a + el– naň = na + něj
● Clitics– vámonos = vamos + nos– изменяться = изменять + ся– potrafilibyśmy
= potrafili + by + jesteśmy
29.9.2016, Ljubljana 29
Where Are We Now?● Two years of UD version 1● 4 treebank releases (every 6 months)● 54 (61) treebanks● 40 (47) languages (over 50% world’s population)● Over 11M tokens; treebanks range from 1K to 1.5M● Over 120 contributors
– language group consistency SIGs– version 2 guidelines coming soon
29.9.2016, Ljubljana 31
Where Are We Going?
● UD guidelines version 2 coming soon
● Consistency checking
29.9.2016, Ljubljana 33
… almost
Childs of you bevary acute!
From RenetteLouwLouw (own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], through Wikimedia Commons
29.9.2016, Ljubljana 34
Consistency Checking
● Automatic tests catch only a fraction
● Focus groups on– Romance, Germanic, Slavic, Uralic, Turkic languages
29.9.2016, Ljubljana 36
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 37
Pronouns and Determiners● English + Romance languages: DET = article or pronominal
adjective (this, which, every)● We don’t have this category! (Traditionally → PRON.)● Some authors do recognize
determiners in Slavic!
29.9.2016, Ljubljana 38
Pronouns and Determiners● English + Romance languages: DET = article or pronominal
adjective (this, which, every)● We don’t have this category! (Traditionally → PRON.)● We have the words (except for articles).● Currently functional borderline (but ellipsis?)
This.DET car is expensive.This.PRON is expensive.
● Less strict in UD v2.
29.9.2016, Ljubljana 39
Pronouns Only● Personal pronouns (including reflexives, but not possessives)● Interrogative who, what● Indefinite and negative derivatives● Relative [cs] jenž
– cs: já, ty, on, my, vy, oni, se, kdo, co, někdo, něco, nikdo, nic– sk: ja, ty, on, my, vy, oni, sa, kto, čo, niekto, niečo, nikto, nič– pl: ja, ty, on, my, wy, oni, się, kto, co, ktoś, coś, nikt, nic– ru: я, ты, он, мы, вы, они, ся, кто, что, кто-нибудь, что-нибудь, никто, ничто– sl: jaz, ti, on, mi, vi, oni, se, kdo, kaj, nekdo, nekaj, nihče, nič– hr: ja, ti, on, mi, vi, oni, se, tko, što, neki, nešto, nitko, ništa– bg: аз, ти, ние, вие, се, кой, кое, някой, нещо, никой, нищо– cu: азъ, тꙑ, мꙑ, вꙑ, и, сѧ, къто, чьто
29.9.2016, Ljubljana 40
Possessives: Determiners● If they occur without a noun … ellipsis
Můj otec je starší. Tvůj má ale více zkušeností.My father is older. But yours is more experienced.
● sl: moj, tvoj, njegov, njen, najin, vajin, njun, naš, vaš, njihov, svoj● bg: мой, твой, негов, неин, наш, ваш, техен, свой● cs: můj, tvůj, jeho, její, náš, váš, jejich, svůj● sk: môj, tvoj, jeho, jej, náš, váš, ich, svoj● cu: мои, твои, нашь, вашь, свои / его, еѩ, ею, ихъ
29.9.2016, Ljubljana 41
Both Possible?● Demonstratives
– cs: ten, to, tento, tenhle, tamten, …– sl: ta, to, tisti, oni, takšen, …
● Adjectival interrogatives/relatives, indefinites, negatives– jaký, který, čí, nějaký, některý, něčí, každý, žádný– všechen, všichni, všechno
● Relative pronouns cannot be explained by ellipsis!– Muž, kterého *muže jsem vám představil.– The man, which *man I introduced to you.
29.9.2016, Ljubljana 45
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 57
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 58
Verb Forms
● Conflicting terminologies in traditional grammars
● Participle … verb or adjective?● Converb … verb or adverb?
● Tags and features apply to individual words!
29.9.2016, Ljubljana 59
Verb Forms
● POS tags and features apply to individual words!● A ko so se leta 1942 vračali, …
– past tense● … da ne bi v Atene prišli …
– conditional mood● … v prihodnje ne bodo vozili zgolj les …
– future tense
29.9.2016, Ljubljana 60
Verb Forms
● POS tags and features apply to individual words!● A ko so se leta 1942 vračali, …
– past tense● … da ne bi v Atene prišli …
– conditional mood● … v prihodnje ne bodo vozili zgolj les …
– future tense
Present
Conditional
Future Participle
Participle
Participle
Past???
29.9.2016, Ljubljana 61
Verb Forms● vračali, prišli, vozili
● [cs] “active participle” / “past tense”● [ru] “past tense” / “finite!”
– Active participle is something else: нарушивший● [bg] “participle + past (aorist) / imperfect” (two subtypes)● [cu] “participle + resultative aspect” (lang-spec)● “l-participle”
– But that would be a language-specific verb form.
29.9.2016, Ljubljana 62
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 63
Core Arguments
● Easier cross-linguistically than argument-adjunct?
● Subject of intransitive verb● Agent of transitive verb● Patient (direct object) of transitive verb
● Indirect object? Dative only?
29.9.2016, Ljubljana 64
Core vs. Oblique Dependents
● Core arguments: what exactly is it?● English:
– He gave John the book. (iobj)– He gave the book to John. (nmod)
● Spanish:– Dio el libro a John. (iobj)
● Czech:– Every Obj is translated to dobj, regardless the case and the
presence of preposition
29.9.2016, Ljubljana 65
dobj / iobj● Not as easy as accusative vs. dative.● Default: dobj● Heuristics for iobj
– Cením si vaší pomoci. (Gen)I appreciate your help.
– Čelíme velkým problémům. (Dat)We are facing big problems.
– Nedisponuje takovým rozpočtem. (Ins)He does not have such budget.
– Učí mou dceru fyziku. (2 × Acc)He teaches my daughter physics.
29.9.2016, Ljubljana 66
All Slavic Treebanks HaveNon-Accusative “Direct” Objects
● podrobit se testu; odpovídají smlouvě; jednat s někým● mówi o niej; używa wielkich słów● от которых зависит; относится к программам● potrebuje informacij; slediti evropskim smernicam;
ukvarjal se bom orožjem● odriče se imuniteta; priključiti se naporima● се характеризира с развитие; моля за внимание
29.9.2016, Ljubljana 67
Reflexive Pronouns● Direct or indirect object (dobj, iobj):Řízl se do prstu / Řízl ho do prstu.– Including reciprocal usage:
Políbili se. / They kissed each other.● Inherently reflexive verbs: smát se, bát se / laugh, fear
– expl:pv (pronominal verb; previously compound)
● Reflexive passive:To se snadněji řekne než udělá. / That is easier said than done.– expl:pass (previously auxpass:reflex)
● Impersonal construction (~ passive?):Zde se mluví německy. / German is spoken here.– expl:impers
29.9.2016, Ljubljana 68
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 73
Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions
29.9.2016, Ljubljana 77
Wrapping Up
● UD has had a great start
● Still a long way to go.Consistency matters!
● Get involved. It’s fun!
29.9.2016, Ljubljana 78
Thank you!Questions?
Děkuji!Otázky?
Ďakujem!Otázky?
Dziękuję!Pytania?
Hvala!Pitanja?
Hvala!Vprašanja?
Благодаря!Въпроси?
Спасибо!Вопросы?
Благодарѭ!Въпроси?