dan zeman - sdjt · interset (zeman, 2008) ... syntax content words are related by dependency...

70
29.9.2016, Ljubljana 1 Dan Zeman Charles University, Prague

Upload: duongbao

Post on 23-May-2018

224 views

Category:

Documents


1 download

TRANSCRIPT

29.9.2016, Ljubljana 1

Dan ZemanCharles University, Prague

29.9.2016, Ljubljana 2

Dependency Treebanks

29.9.2016, Ljubljana 3

Dependency Treebanks

29.9.2016, Ljubljana 4

Why?

● Linguistic research– Corpus query

● Training tools (parsers) for NLP– Downstream applications

29.9.2016, Ljubljana 5

29.9.2016, Ljubljana 6

29.9.2016, Ljubljana 7

Universal Dependencies

StanfordDependencies

CLEAR

Google UD

Stanford UD

HamleDT

Interset

Googleuniversal tags

http://universaldependencies.org/

29.9.2016, Ljubljana 8

Universal Dependencies

http://universaldependencies.org/

Universal Dependencies

29.9.2016, Ljubljana 9

Universal Dependencies

● Milestones:– 2014-04: EACL Göteborg, kick-off meeting– 2014-10: UD guidelines version 1– 2015-01: released 10 treebanks of 10 languages (UD 1.0)– 2015-05: released 19 treebanks of 18 languages (UD 1.1)– 2015-11: released 37 treebanks of 33 languages (UD 1.2)– 2016-05: released 54 treebanks of 40 languages (UD 1.3)– 2016-11: UD release 1.4, ~7 new languages– 2016 fall: UD guidelines version 2

http://universaldependencies.org/

29.9.2016, Ljubljana 10

Goals and Requirements● Cross-linguistically consistent grammatical annotation● Support multilingual research and development in NLP● Based on common usage and existing de facto standards● Caveats:

– Not a new linguistic theory –but linguistically informed and relevant

– Not an ideal parsing representation –but useful for comparative evaluation

– Not the ultimate annotation scheme –but a lightweight lingua franca

Not“Universal”

in the strictlytypological

sense!

29.9.2016, Ljubljana 11

Design Principles● Dependency

– Widely used in practical NLP systems– Available in treebanks for many languages

● Lexicalism– Basic annotation units are words – syntactic words– Words have morphological properties– Words enter into syntactic relations

● Recoverability– Transparent mapping from input text to word segmentation

29.9.2016, Ljubljana 12

Golden Rules

● Maximize parallelism– Don’t annotate the same thing in different ways– Don’t make different things look the same

● But don’t overdo it– Don’t annotate things that are not there– Balance: is it still the same thing?– Allow language-specific extensions

29.9.2016, Ljubljana 13

Part-of-Speech TagsOpen Closed OtherADJ ADP PUNCTADV AUX SYMINTJ CONJ X

NOUN DETPROPN NUMVERB PART

PRONSCONJ

● Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset (Petrov et al., 2012)

● All languages use the same inventory, but not all tags have to be used by all languages

29.9.2016, Ljubljana 14

FeaturesLexical Inflectional / Nominal Inflectional / Verbal

PronType Gender VerbFormNumType Animacy Mood

Poss Number TenseReflex Case Aspect

Definite VoiceDegree Person

Negative

● Standardized inventory of morphological features, based on Interset (Zeman, 2008)

● Languages select relevant features and can add language-specific features or values with documentation

29.9.2016, Ljubljana 15

Syntax

● Content words are related by dependency relations● Function words attach to closest content word● Punctuation attach to head of phrase or clause

29.9.2016, Ljubljana 16

Syntax

● Content words are related by dependency relations● Function words attach to closest content word● Punctuation attach to head of phrase or clause

Not“dependency”in the strictly

syntacticsense!

29.9.2016, Ljubljana 17

29.9.2016, Ljubljana 19

Dependency Relations● Taxonomy of 40 universal grammatical relations,

broadly attested in language typology (de Marneffe et al., 2014)– Language-specific subtypes may be added

● Organizing principles– Three types of structures: nominals, clauses, modifiers– Core arguments vs. other dependents (not arguments

vs. adjuncts)

29.9.2016, Ljubljana 20

Dependents of Clausal Predicates

Nominal Clausal Other

Core

nsubjnsubjpass

dobjiobj

csubjcsubjpass

ccompxcomp

Non-Core

nmodvocative

discourseexpl

advcl

advmodnegaux

auxpasscop

markpunct

29.9.2016, Ljubljana 21

29.9.2016, Ljubljana 22

Dependents of Nominals

Nominal Clausal Other

nmodappos

nummodacl

amoddetnegcase

29.9.2016, Ljubljana 23

“Stanford-style” Coordination

● Coordinate structures are headed by the first conjunct– Subsequent conjuncts depend on it via the conj relation– Conjunctions depend on it via the cc relation– Punctuation marks depend on it via the punct relation

29.9.2016, Ljubljana 24

Multiword Expressions

Relation Examplesmwe in spite of, as well as, ad hocname Roger Bacon, New York

compound phone book, four thousand, dress upgoeswith notwith standing, with out

● UD annotation does not permit “words with spaces”– Multiword expressions are analyzed using special relations– The mwe, name and goeswith relations are always head-initial– The compound relation reflects the internal structure

29.9.2016, Ljubljana 25

Other Relations

Relation Explanationparataxis Loosely linked clauses of same rank

list Lists without syntactic structureremnant Orphans in ellipsis linked to parallel elements

reparandum Disfluency linked to (speech) repairforeign Elements within opaque stretches of code switching

dep Unspecified dependencyroot Syntactically independent element of clause/phrase

29.9.2016, Ljubljana 26

Language-Specific Relations● Language-specific relations are subtypes of universal

relations added to capture important phenomena● Subtyping permits us to “back off” to universal relations

Relation Explanationacl:relcl Relative clause

compound:prt Verb particle (dress up)nmod:poss Genitive nominal (Mary ’s book)nmod:agent Agent in passive (saved by the bell)cc:preconj Preconjunction (both … and)det:predet Predeterminer (all those …)

29.9.2016, Ljubljana 27

Word Segmentation● Must be reproducible on new data● Surface tokens vs. syntactic words● Chinese, Vietnamese etc.: no clues, non-trivial algorithm● Arabic, Tamil etc.: part of morphological analysis● Spanish, German etc.: rather limited cases of contractions● Others: only punctuation (low-level tokenization)

29.9.2016, Ljubljana 28

Word Segmentation● Fusions

– al = a + el– naň = na + něj

● Clitics– vámonos = vamos + nos– изменяться = изменять + ся– potrafilibyśmy

= potrafili + by + jesteśmy

29.9.2016, Ljubljana 29

Where Are We Now?● Two years of UD version 1● 4 treebank releases (every 6 months)● 54 (61) treebanks● 40 (47) languages (over 50% world’s population)● Over 11M tokens; treebanks range from 1K to 1.5M● Over 120 contributors

– language group consistency SIGs– version 2 guidelines coming soon

29.9.2016, Ljubljana 30

47 Languages and Growing

29.9.2016, Ljubljana 31

Where Are We Going?

● UD guidelines version 2 coming soon

● Consistency checking

29.9.2016, Ljubljana 32

Common vocabulary is great …

… because we finallyunderstand each other …

29.9.2016, Ljubljana 33

… almost

Childs of you bevary acute!

From RenetteLouwLouw (own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], through Wikimedia Commons

29.9.2016, Ljubljana 34

Consistency Checking

● Automatic tests catch only a fraction

● Focus groups on– Romance, Germanic, Slavic, Uralic, Turkic languages

29.9.2016, Ljubljana 35

Existing Slavic Treebanks

?

29.9.2016, Ljubljana 36

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 37

Pronouns and Determiners● English + Romance languages: DET = article or pronominal

adjective (this, which, every)● We don’t have this category! (Traditionally → PRON.)● Some authors do recognize

determiners in Slavic!

29.9.2016, Ljubljana 38

Pronouns and Determiners● English + Romance languages: DET = article or pronominal

adjective (this, which, every)● We don’t have this category! (Traditionally → PRON.)● We have the words (except for articles).● Currently functional borderline (but ellipsis?)

This.DET car is expensive.This.PRON is expensive.

● Less strict in UD v2.

29.9.2016, Ljubljana 39

Pronouns Only● Personal pronouns (including reflexives, but not possessives)● Interrogative who, what● Indefinite and negative derivatives● Relative [cs] jenž

– cs: já, ty, on, my, vy, oni, se, kdo, co, někdo, něco, nikdo, nic– sk: ja, ty, on, my, vy, oni, sa, kto, čo, niekto, niečo, nikto, nič– pl: ja, ty, on, my, wy, oni, się, kto, co, ktoś, coś, nikt, nic– ru: я, ты, он, мы, вы, они, ся, кто, что, кто-нибудь, что-нибудь, никто, ничто– sl: jaz, ti, on, mi, vi, oni, se, kdo, kaj, nekdo, nekaj, nihče, nič– hr: ja, ti, on, mi, vi, oni, se, tko, što, neki, nešto, nitko, ništa– bg: аз, ти, ние, вие, се, кой, кое, някой, нещо, никой, нищо– cu: азъ, тꙑ, мꙑ, вꙑ, и, сѧ, къто, чьто

29.9.2016, Ljubljana 40

Possessives: Determiners● If they occur without a noun … ellipsis

Můj otec je starší. Tvůj má ale více zkušeností.My father is older. But yours is more experienced.

● sl: moj, tvoj, njegov, njen, najin, vajin, njun, naš, vaš, njihov, svoj● bg: мой, твой, негов, неин, наш, ваш, техен, свой● cs: můj, tvůj, jeho, její, náš, váš, jejich, svůj● sk: môj, tvoj, jeho, jej, náš, váš, ich, svoj● cu: мои, твои, нашь, вашь, свои / его, еѩ, ею, ихъ

29.9.2016, Ljubljana 41

Both Possible?● Demonstratives

– cs: ten, to, tento, tenhle, tamten, …– sl: ta, to, tisti, oni, takšen, …

● Adjectival interrogatives/relatives, indefinites, negatives– jaký, který, čí, nějaký, některý, něčí, každý, žádný– všechen, všichni, všechno

● Relative pronouns cannot be explained by ellipsis!– Muž, kterého *muže jsem vám představil.– The man, which *man I introduced to you.

29.9.2016, Ljubljana 45

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 46

Quantified Noun Phrase

29.9.2016, Ljubljana 47

Quantified Noun Phrase

Genitive!

29.9.2016, Ljubljana 48

Quantified Noun Phrase

29.9.2016, Ljubljana 49

Quantified Noun Phrase

29.9.2016, Ljubljana 50

Pronominal Quantifiers

29.9.2016, Ljubljana 56

Language-Specific Labels

29.9.2016, Ljubljana 57

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 58

Verb Forms

● Conflicting terminologies in traditional grammars

● Participle … verb or adjective?● Converb … verb or adverb?

● Tags and features apply to individual words!

29.9.2016, Ljubljana 59

Verb Forms

● POS tags and features apply to individual words!● A ko so se leta 1942 vračali, …

– past tense● … da ne bi v Atene prišli …

– conditional mood● … v prihodnje ne bodo vozili zgolj les …

– future tense

29.9.2016, Ljubljana 60

Verb Forms

● POS tags and features apply to individual words!● A ko so se leta 1942 vračali, …

– past tense● … da ne bi v Atene prišli …

– conditional mood● … v prihodnje ne bodo vozili zgolj les …

– future tense

Present

Conditional

Future Participle

Participle

Participle

Past???

29.9.2016, Ljubljana 61

Verb Forms● vračali, prišli, vozili

● [cs] “active participle” / “past tense”● [ru] “past tense” / “finite!”

– Active participle is something else: нарушивший● [bg] “participle + past (aorist) / imperfect” (two subtypes)● [cu] “participle + resultative aspect” (lang-spec)● “l-participle”

– But that would be a language-specific verb form.

29.9.2016, Ljubljana 62

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 63

Core Arguments

● Easier cross-linguistically than argument-adjunct?

● Subject of intransitive verb● Agent of transitive verb● Patient (direct object) of transitive verb

● Indirect object? Dative only?

29.9.2016, Ljubljana 64

Core vs. Oblique Dependents

● Core arguments: what exactly is it?● English:

– He gave John the book. (iobj)– He gave the book to John. (nmod)

● Spanish:– Dio el libro a John. (iobj)

● Czech:– Every Obj is translated to dobj, regardless the case and the

presence of preposition

29.9.2016, Ljubljana 65

dobj / iobj● Not as easy as accusative vs. dative.● Default: dobj● Heuristics for iobj

– Cením si vaší pomoci. (Gen)I appreciate your help.

– Čelíme velkým problémům. (Dat)We are facing big problems.

– Nedisponuje takovým rozpočtem. (Ins)He does not have such budget.

– Učí mou dceru fyziku. (2 × Acc)He teaches my daughter physics.

29.9.2016, Ljubljana 66

All Slavic Treebanks HaveNon-Accusative “Direct” Objects

● podrobit se testu; odpovídají smlouvě; jednat s někým● mówi o niej; używa wielkich słów● от которых зависит; относится к программам● potrebuje informacij; slediti evropskim smernicam;

ukvarjal se bom orožjem● odriče se imuniteta; priključiti se naporima● се характеризира с развитие; моля за внимание

29.9.2016, Ljubljana 67

Reflexive Pronouns● Direct or indirect object (dobj, iobj):Řízl se do prstu / Řízl ho do prstu.– Including reciprocal usage:

Políbili se. / They kissed each other.● Inherently reflexive verbs: smát se, bát se / laugh, fear

– expl:pv (pronominal verb; previously compound)

● Reflexive passive:To se snadněji řekne než udělá. / That is easier said than done.– expl:pass (previously auxpass:reflex)

● Impersonal construction (~ passive?):Zde se mluví německy. / German is spoken here.– expl:impers

29.9.2016, Ljubljana 68

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 69

Modal Auxiliary in English

29.9.2016, Ljubljana 70

Modal Verb in Czech

29.9.2016, Ljubljana 71

Modal Adverb in Russian

29.9.2016, Ljubljana 72

Modal / Control Verb in English

29.9.2016, Ljubljana 73

Issues of Slavic Languages in UD● Pronouns vs. determiners, numerals and quantifiers● Attachment of cardinal numbers● Verbs, participles, adjectives● Core arguments● Reflexive pronouns (clitics)● Auxiliary verbs and modal verbs● Comparative constructions

29.9.2016, Ljubljana 74

Comparative Constructions

29.9.2016, Ljubljana 75

Comparative Constructions

29.9.2016, Ljubljana 76

Wrapping Up

29.9.2016, Ljubljana 77

Wrapping Up

● UD has had a great start

● Still a long way to go.Consistency matters!

● Get involved. It’s fun!

29.9.2016, Ljubljana 78

Thank you!Questions?

Děkuji!Otázky?

Ďakujem!Otázky?

Dziękuję!Pytania?

Hvala!Pitanja?

Hvala!Vprašanja?

Благодаря!Въпроси?

Спасибо!Вопросы?

Благодарѭ!Въпроси?

29.9.2016, Ljubljana 79