
Universal Dependencies for Portuguese

Alexandre Rademaker (1, 3), Fabricio Chalub (1), Livy Real (6),
Claudia Freitas (4), Eckhard Bick (5), Valeria de Paiva (2)

1 IBM Research, Brazil
2 Nuance Communications, USA
3 FGV/EMAp, Brazil
4 PUC-Rio, Brazil
5 University of Southern Denmark, Denmark
6 USP, Brazil

Dependency Linguistics 2017, Pisa


Linguistic Resources are Important

- We probably do not need to explain this here.
- At IBM Research Brazil, many projects deal with language understanding and information extraction in Portuguese and English (technical domains such as O&G and mining).
- We started with, and still work on, OpenWordnet-PT, a Portuguese version of Princeton WordNet that is very useful in many applications.


Universal Dependencies

- Universal Dependencies (UD) offer the promise of greater parallelism between languages.
- Syntactic dependencies are not too far from semantic dependencies, which is useful for many applications.
- Manning's Law: the annotation must be satisfactory on linguistic grounds for individual languages, good for linguistic typology, suitable for rapid and consistent annotation, suitable for parsing with high accuracy, easily comprehended by non-linguists, and supportive of downstream language understanding tasks.
- UD 1.2 included the first UD_Portuguese; UD 1.3 added one more treebank, UD_Portuguese-BR (from Google's treebanks).


The corpus Bosque

'Bosque' means 'woods' in Portuguese. It consists of running news text from both Portugal and Brazil, chunked into sentences and syntactically analyzed as tree structures, first by automatic parsing (PALAVRAS) and then fully revised by linguists.

PALAVRAS is a rule-based Constraint Grammar (CG) system designed for Portuguese. It produces deep linguistic analyses with tags at the morphological, syntactic (dependency) and semantic levels.


The Corpus Bosque: UD 1.2 version of UD_Portuguese

- UD release 1.2 was the first release to include a Portuguese treebank, UD_Portuguese, which is based on the corpus Bosque (Floresta Sintá(c)tica project, from Linguateca and VISL).
- It was based on Bosque 7.3 (AD format), as converted for the CoNLL-X Shared Task on dependency parsing (2006).
- The CoNLL version was converted to the Prague dependency style as part of HamleDT (since 2011).
- Later versions of HamleDT added a conversion to Stanford dependencies (2014) and to Universal Dependencies (HamleDT 3.0, 2015).

Bosque → CoNLL-X → Prague Deps → Stanford Deps → UD
More at http://www.linguateca.pt/Floresta/levantamento.html


The Corpus Bosque: In UD 1.2

The conversion process started from the AD format. In the end we decided to implement a direct conversion script from AD to the CoNLL-X format instead of relying on the pipeline of Eckhard Bick's scripts. However, as far as possible, his head rules are implemented. One detail that is probably different from his rules is the linkage in case of more than one auxiliary in combination with a coordinated main verb, especially if the main verbs are accompanied by auxiliary particles. There just wasn't enough time to do this, sorry.

In some cases the Bosque trees contain ambiguities that the annotators could not resolve. For the training data, ambiguity was resolved by simply taking the first annotated possibility. For the test data, sentences that contain ambiguity were discarded.

http://www.linguateca.pt/floresta/CoNLL-X/readme.conll
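The ambiguity policy described above (first analysis for training data, discard for test data) can be sketched in a few lines. This is only an illustration: the data layout (each sentence as a list of alternative analyses) and the function name `resolve_ambiguity` are assumptions, not the actual CoNLL-X conversion script.

```python
from typing import List, Sequence

Analysis = str  # hypothetical stand-in for one annotated tree


def resolve_ambiguity(sentences: Sequence[Sequence[Analysis]], split: str) -> List[Analysis]:
    """Apply the train/test ambiguity policy to sentences with alternative analyses."""
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    resolved = []
    for alternatives in sentences:
        if not alternatives:
            continue  # nothing annotated for this sentence
        if len(alternatives) == 1:
            resolved.append(alternatives[0])
        elif split == "train":
            # Training data: keep the first annotated possibility.
            resolved.append(alternatives[0])
        # Test data: sentences that contain ambiguity are discarded.
    return resolved
```

Keeping the policy in one function makes it easy to see that the test set shrinks while the training set keeps every sentence.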


The Corpus UD: The consolidation

- Between September 2015 and March 2016, a set of UD conversion rules for the CG input was written, as described in (Bick 2016), and applied to the updated version of the dependency-style Bosque (Linguateca version 7.5 of Mar 2016).
- In Oct 2016 we started a team effort, through consistency checking and discussion, aiming at full compatibility with UD.
- The first version of our data, compliant with UD 1.4, was included in UD release 1.4 as UD_Portuguese-Bosque. UD 1.4 therefore contained UD_Portuguese, UD_Portuguese-Bosque and UD_Portuguese-BR.
- We accepted the challenge to update UD_Portuguese-Bosque to the UD 2.0 guidelines and replace the previous UD_Portuguese corpus.


Why Bosque: Why not create a new one from scratch?

- Besides the original tagset and the CoNLL 2006 tagset, there are versions in CG, AD (phrase structure tree), tgrep, Penn TreeBank and TIGER formats. All of these are available from http://www.linguateca.pt/Floresta and http://corpora.di.uminho.pt/linguateca/FS/fs.html
- Having different versions of the same material fosters the study of different tagsets and their impact on NLP systems.
- We had two researchers on the team who had already worked on previous versions of Bosque.
- But conversion to the UD scheme was much more complicated than initially planned.


Why the effort

- Incorporate changes and additions made in the original treebank after 2006.
- Circumvent possible information loss due to previous conversions.
- A comparison of the results of two different conversions might yield interesting insights.
- We wanted to build a framework where manual revision work and consistency checks could be coordinated with automatic parser annotation and conversion rules, addressing systematic errors and thus fixing them automatically based on a few examples, rather than repeatedly fixing the same kind of error manually.
- We intend to enlarge the treebank, and therefore deem it important to be able to maintain a close link between live parser output and the UD conversion method: integrate the UD conversion grammar in PALAVRAS.
- Having the corpus revised by native Portuguese linguists helps ensure a better annotation quality.


The CG conversion grammar

- The conversion grammar ultimately used for the first conversion of Bosque to UD contained some 530 rules.
- 70 were simple feature-mapping rules and 130 were local MWE-splitting rules, assigning internal structure, POS and features to the MWEs from Bosque.
- The remaining rules handled UD-specific dependency and function label changes in a context-dependent fashion.
- The main issues were the raising of copula dependents to subject complements, the inversion of prepositional dependency, and the change from syntactic to semantic verb chain dependency.
- With respect to punctuation attachment, the grammar actually went beyond conversion, identifying meaningful head tokens for commas, parentheses, etc.
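To give a feel for what a "simple feature-mapping rule" does, here is a minimal Python sketch (the real rules are written in CG, not Python). The table covers only the POS tags that appear in the example sentence shown later in this talk (Esse carro foi achado ...); the name `map_pos` and the use of the secondary tag `aux` to choose AUX are assumptions read off that example.

```python
# PALAVRAS POS -> UD POS, limited to the tags seen in the example sentence.
POS_MAP = {"DET": "DET", "N": "NOUN", "PRP": "ADP", "PROP": "PROPN"}


def map_pos(pos, secondary_tags=()):
    """Map one PALAVRAS POS tag to a UD POS tag (illustrative subset only)."""
    if pos == "V":
        # In the example, the verb carrying <aux> becomes AUX and the one
        # carrying <mv> (main verb) becomes VERB.
        return "AUX" if "aux" in secondary_tags else "VERB"
    return POS_MAP[pos]  # raises KeyError for tags outside this sketch
```

A table lookup like this is enough for the context-free part; the context-dependent rules mentioned above need access to neighbouring tokens and are not covered here.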


Similarities and differences

PALAVRAS niceline format:

Esse [esse] <*> <dem> DET M S >N 1->2
carro [carro] <V> N M S SUBJ> 2->3
foi [ser] <fmc> <aux> V PS 3S IND VFIN FS-STA 3->0
achado [achar] <vH> <mv> V PCP M S ICL-AUX< 4->3
em [em] <sam-> PRP <ADVL 5->4
o [o] <-sam> <artd> DET M S >N 6->7
início [início] <temp> N M S P< 7->5
de [de] <sam-> <np-close> PRP N< 8->7
a [o] <-sam> <artd> DET F S >N 9->10
tarde [tarde] <per> N F S P< 10->8
em [em] <np-close> PRP N< 11->10
Engenheiro Marcilac [Engenheiro=Marcilac] <civ> <*> <heur> <foreign> PROP M S P< 12->11
. 13->0
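The niceline layout above can be read mechanically. The sketch below is an assumption-laden reading of the example only: it takes each line to be `form [lemma] <secondary tags> POS features FUNCTION id->head`, with secondary tags written in angle brackets. The names `NiceToken` and `parse_niceline` are hypothetical, and punctuation-only lines are not handled.

```python
import re
from typing import List, NamedTuple


class NiceToken(NamedTuple):
    form: str
    lemma: str
    secondary: List[str]  # tags written as <...>
    tags: List[str]       # POS first, then morphological features
    function: str         # syntactic function, e.g. ">N" or "P<"
    index: int
    head: int


LINE_RE = re.compile(
    r"^(?P<form>.+?) \[(?P<lemma>[^\]]+)\] (?P<rest>.+) #?(?P<idx>\d+)->(?P<head>\d+)$"
)


def parse_niceline(line: str) -> NiceToken:
    """Parse one token line of the niceline example into its fields."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a niceline token line: {line!r}")
    fields = m.group("rest").split()

    def is_secondary(f: str) -> bool:
        return f.startswith("<") and f.endswith(">")

    secondary = [f[1:-1] for f in fields if is_secondary(f)]
    plain = [f for f in fields if not is_secondary(f)]
    *tags, function = plain  # the last plain field is the syntactic function
    return NiceToken(m.group("form"), m.group("lemma"), secondary, tags,
                     function.lstrip("@"), int(m.group("idx")), int(m.group("head")))
```

For instance, `parse_niceline("foi [ser] <fmc> <aux> V PS 3S IND VFIN FS-STA 3->0")` yields lemma `ser`, POS `V`, function `FS-STA` and head `0`.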


Similarities and differences (cont.)

PALAVRAS analysis:

Token                 Lemma                 POS   Function
Esse                  esse                  DET   >N
carro                 carro                 N     SUBJ>
foi                   ser                   V     root
achado                achar                 V     ICL-AUX<
em                    em                    PRP   <ADVL
o                     o                     DET   >N
início                início                N     P<
de                    de                    PRP   N<
a                     o                     DET   >N
tarde                 tarde                 N     P<
em                    em                    PRP   N<
Engenheiro=Marcilac   Engenheiro=Marcilac   PROP  P<

UD analysis:

Token       Lemma       UPOS   Relation
Esse        esse        DET    det
carro       carro       NOUN   nsubj:pass
foi         ser         AUX    aux:pass
achado      achar       VERB   root
em          em          ADP    case
o           o           DET    det
início      início      NOUN   obl
de          de          ADP    case
a           o           DET    det
tarde       tarde       NOUN   nmod
em          em          ADP    case
Engenheiro  Engenheiro  PROPN  obl
Marsilac    Marsilac    PROPN  flat:name
.           .           PUNCT  punct
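One structural difference between the two analyses above is the inversion of prepositional dependency: in PALAVRAS the preposition heads its complement (P<), while in UD the noun attaches to the governor and the preposition hangs below it as case. The following is a minimal sketch of that single step, using tokens 4 to 10 of the example with the heads from the niceline listing. The dictionary layout and the name `invert_prepositions` are assumptions; the verb-chain change and all relabelling are left out.

```python
def invert_prepositions(tokens):
    """Return {id: new_head} after moving each preposition below its complement.

    tokens maps id -> {"pos": ..., "func": ..., "head": ...} (PALAVRAS-style).
    """
    new_heads = {i: t["head"] for i, t in tokens.items()}
    for i, t in tokens.items():
        parent = tokens.get(t["head"])
        if t["func"] == "P<" and parent is not None and parent["pos"] == "PRP":
            new_heads[i] = parent["head"]  # complement takes over the preposition's head
            new_heads[t["head"]] = i       # preposition now depends on its complement
    return new_heads


# "achado em o início de a tarde", ids and heads as in the niceline example.
example = {
    4: {"pos": "V", "func": "ICL-AUX<", "head": 3},
    5: {"pos": "PRP", "func": "<ADVL", "head": 4},
    6: {"pos": "DET", "func": ">N", "head": 7},
    7: {"pos": "N", "func": "P<", "head": 5},
    8: {"pos": "PRP", "func": "N<", "head": 7},
    9: {"pos": "DET", "func": ">N", "head": 10},
    10: {"pos": "N", "func": "P<", "head": 8},
}
```

After the call, início (7) attaches to achado (4) and tarde (10) to início (7), while both prepositions depend on their nouns, which matches the case/obl/nmod pattern in the UD analysis.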



- The UD version retains the additional tags for NP definiteness and complex tenses, as well as the original syntactic function tags and secondary morphological tags (xpostag field).
- It keeps its original linguistic focus, but in addition it can be used for the new machine learning scenarios.
- We retain the tags on sentence roots for their functions, such as question (FS-QUE), command (FS-COM) or statement (FS-STA).
- In some cases, the stored original function tags make it possible to recover a valency relation otherwise lost in the underspecified UD edge label, such as the distinction between free adverbial prepositional phrases (e.g. trabalhar em (ADV), 'work at') and valency-bound adverbials (e.g. morar em (ARG), 'live at').


Improving the data: gender

Gender is one of the hallmarks of Romance languages, and annotation can be complicated, as some words appear to have an underspecified gender. There are adjectives such as grande (big) or feliz (happy) that have only one form for both genders. Sometimes we can tell by the context, sometimes not.

Ex. CP652-3: Por enquanto estamos felizes só com o reconhecimento implícito. (For now we are happy with only the implicit recognition.)

In such cases we use the value Unsp (for unspecified).


Improving the data: MWEs

The PALAVRAS annotation has MWEs tokenized as a single word.

The UD version 1 guidelines proposed the dependency relations mwe or compound, so a process of dismembering these single-token MWEs and assigning each of their components a POS tag was initiated.

In UD version 2, different tags for MWEs are used (flat, fixed and name), but this conversion could be done automatically.


Improving the data: Participles

How to deal with participles was also a challenging issue. PALAVRAS tags all participles as verbs with the PCP feature.

In UD, they can be VERB or ADJ.

We worked on a set of linguistic rules to semi-automatically re-tag participles.


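Improving the data: Ellipses

In version 1, ellipsis cases were dealt with via a remnant dependency relation. In version 2, the remnant relation was discarded and a new treatment was proposed: the relation orphan.

Ex.: "Opala lasted 23 years, Chevette 20 [ ]"

O Opala durou 23 anos, o Chevette 20.

[Figure: two dependency trees for this sentence. Both share det (O, Opala), nsubj (Opala), nummod (23), obj (anos) and det (o, Chevette). The version 1 tree attaches Chevette and 20 with remnant; the version 2 tree uses parataxis for Chevette and orphan for 20.]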


Tokenization: MWE

- The first conversion did not handle UD's tokenization: the original treebank's MWEs and the (syntactically motivated) splitting of Portuguese contractions (preposition plus article/determiner/pronoun), e.g. "neste" to "em + este" (in this).
- The problem in token splitting is the need to assign (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. It is not a simple table conversion.
- CG3 offers context-based manipulation not only of tags but also of entire tokens: it can split MWE tokens, add POS features and add dependency links.
- The MWE 'ao vivo' (live), for instance, is an ADV as a whole, while 'ao' is a contraction (ADP 'a' + DET 'o') and 'vivo' (live) is an ADJ.
- We adopted a MWEPOS= attribute in the misc field.
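The tokenization side of contraction splitting can be sketched as follows. CoNLL-U represents a contraction as a range line (e.g. `1-2`) carrying the surface form, followed by one line per syntactic word. The lookup table holds only the two contractions mentioned above; the DET tag for 'este' and the function name `split_contractions` are assumptions, and the harder parts listed above (internal dependency links and hook-up points) are deliberately left out.

```python
# Contractions from the text: "neste" = em + este, "ao" = a + o.
CONTRACTIONS = {
    "neste": [("em", "ADP"), ("este", "DET")],  # DET for "este" is an assumption
    "ao": [("a", "ADP"), ("o", "DET")],
}


def split_contractions(words):
    """Return CoNLL-U style (ID, FORM, UPOS) rows; contractions get a range row."""
    rows = []
    next_id = 1
    for word in words:
        parts = CONTRACTIONS.get(word.lower())
        if parts is None:
            rows.append((str(next_id), word, "_"))  # UPOS left unspecified
            next_id += 1
            continue
        last = next_id + len(parts) - 1
        rows.append((f"{next_id}-{last}", word, "_"))  # surface token, no annotation
        for form, upos in parts:
            rows.append((str(next_id), form, upos))
            next_id += 1
    return rows
```

Calling `split_contractions(["ao", "vivo"])` gives the range row `1-2 ao` followed by `1 a ADP`, `2 o DET` and `3 vivo`, which shows why later word IDs shift and every incoming or outgoing dependency has to be re-anchored.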


Tokenization: clitics

- Another issue related to tokenization is the problem of clitics in Portuguese. In Portuguese we have mesoclitics, that is, clitics that come inside the verb and change the verbal structure.
- Ex. CP895-1: Poder-se-á dizer que o estilo resulta da sua profissão: fotojornalista. (It can be said that the style results from his profession: photojournalist.)
- We decided to follow the traditional Portuguese grammars. In the example above, 'poder-se-á' is 'poderá/VERB' followed by 'se/PRON': (it can) in the future, plus the reflexive.
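The analysis of 'poder-se-á' as 'poderá' plus 'se' can be illustrated with a small string-level sketch. It assumes the regular pattern only (infinitive stem ending in -r, hyphen, clitic, hyphen, future or conditional ending); forms where the stem itself changes are not covered, and `split_mesoclitic` is a hypothetical helper, not part of the conversion grammar.

```python
import re

# infinitive stem + "-" + clitic + "-" + tense ending, e.g. Poder-se-á
MESOCLITIC_RE = re.compile(r"^(?P<stem>\w+r)-(?P<clitic>\w+)-(?P<ending>\w+)$")


def split_mesoclitic(token):
    """Split a mesoclitic form into (verb, clitic) pairs with POS, or return None."""
    m = MESOCLITIC_RE.match(token)
    if m is None:
        return None  # not a mesoclitic form (e.g. plain enclisis like "aprova-se")
    verb = m.group("stem") + m.group("ending")  # rejoin stem and ending
    return [(verb, "VERB"), (m.group("clitic"), "PRON")]
```

Here `split_mesoclitic("Poder-se-á")` returns the verb `Poderá` and the pronoun `se`, while an enclitic form with a single hyphen is left untouched.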


The particle 'se'

1. Reflexive and reciprocal constructions. CF314-2: Você se acha louca? (Do you think you are crazy?)

2. Pronominal verbs. CF340-2: O ciclista espanhol, 48, se suicidou em Caupenne d'Armagnac, no sul da França, com um tiro. (The Spanish cyclist, 48, killed himself in Caupenne d'Armagnac, south of France, with a single shot.)

3. Pronominal passive voice. CF32-2: Primeiro aprova-se o texto enxuto e depois negocia-se a aprovação, sem prazo definido, das leis complementares e ordinárias. (First the short text is approved and then, without a definite deadline, the approval of the complementary and ordinary statutes is negotiated.)

4. Indeterminate subject constructions. CP263-3: Pense-se em Kingsley Amis, Malcolm Bradbury e Albert Finney. (One can think of Kingsley Amis, Malcolm Bradbury and Albert Finney.)



- In terms of Universal Dependencies, this indicates that in both cases (3) and (4) we could have the particle se as the subject of the verb, although the subject remains non-explicit: "vende-se casas" (Houses are sold).
- But in UD the 'nsubj' role is only applied to semantic arguments of a predicate; when there is an empty argument in a grammatical subject position (a pleonastic or expletive), it is labeled as expl.
- UD creates a certain uniformity between cases (2), (3) and (4). Since we consider the distinction between (2) (which has an explicit subject) and (3) and (4) (which do not) relevant, we keep this information: cases (3) and (4) carry the label SUBJ INDEF in the misc field.


Negation

The treatment of negation changed from UD version 1 to version 2.

In UD version 2, a polarity feature was introduced (Polarity=Neg).

We treat não (and other words, such as some uses of nada, 'nothing') as adverbs, so not many words are tagged PART.

CP153-4: Não estava nada à espera disto. ([I] was not waiting nothing for it.) Both are ADV. Sometimes the second is a pronoun:

CP778-11: A coincidência de funerárias e queijarias na nossa circunstância não significava nada. ('The coincidence of mortuaries and cheesemakers in our circumstances did not mean nothing.') Here we have obj(significava, nada).


Appositives

So far we have used a classic and comprehensive notion of appositives (non-restrictive and restrictive):

a) this was already the original analysis provided by PALAVRAS; b) this is a gray area of the UD guidelines; c) in our view, the decision favors consistent analysis.

'president Obama' would be appos (restrictive appositive) if we agree that Obama describes, defines or modifies president. But for UD, since it is not reversible, it is not appos.

However, there are always borderline cases.

It is not clear to us why 'I met the president Obama' should receive a different analysis. So these cases were also tagged as 'appos' in our corpus, but we recognize the issue is still open.


Numbers

Bosque has 9,368 sentences and 227,653 tokens, with 18,140 unique lemmas.

At the moment we still have 957 'dep' relations, which we want to investigate, since this dependency is mostly used when no other relation is applicable.

We also plan to check the coverage of the classes of verbs, nouns, adjectives and adverbs against OpenWordNet-PT.
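Counts of this kind can be reproduced from a CoNLL-U file with a few lines of Python. This is a rough sketch, not the statistics script used for the figures above: the function name `treebank_stats` is hypothetical, and it assumes that only regular word lines are counted (multiword-token ranges such as `1-2` and empty nodes such as `1.1` are skipped), which may differ from how the reported numbers were computed.

```python
from collections import Counter


def treebank_stats(conllu_text):
    """Count sentences, words, unique lemmas and relation labels in CoNLL-U text."""
    sentences = 0
    tokens = 0
    lemmas = set()
    deprels = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        # Keep plain word lines only: the ID column is a single integer.
        words = [r for r in rows if r[0].isdigit()]
        if not words:
            continue
        sentences += 1
        tokens += len(words)
        lemmas.update(r[2] for r in words)   # column 3: LEMMA
        deprels.update(r[7] for r in words)  # column 8: DEPREL
    return {"sentences": sentences, "tokens": tokens,
            "lemmas": len(lemmas), "deprels": deprels}
```

With the result in hand, `stats["deprels"]["dep"]` gives the number of 'dep' relations still to be investigated.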


Comparison and Assessment

Some big discrepancies in numbers between the 1.2 and 1.4/2.0 versions of UD_Portuguese, as computed by the statistics script, were easy to see.

Our version had many more cases of auxiliary verbs than UD_Portuguese in UD 1.2, probably because verbs like 'continuar' (to continue), 'começar' (to start) and 'acabar' (to end) can also be seen as modal auxiliaries, and that was our decision.

Ex. CP269-3: O soldado disparou para o ar, mas o indivíduo continuou a avançar e foi atingido mortalmente. (The soldier fired into the air, but the individual continued to advance and was fatally struck.)


Comparison and Assessment (cont.)

We found that our version of the Bosque had many more cases of apposition dependencies (appos).

In addition to our choice to include restrictive appositives under the tag appos, the difference in numbers reflects different choices in the alignment and conversion.

In the annotation provided by PALAVRAS, the syntactic function N<PRED (non-identifying apposition) can and should be converted into appos, but in the UD_Portuguese of UD 1.2 all these cases were converted into nmod.

When we looked at the appos relation, considering the different POS tag pairs being related, we found around 50 distinct POS tag pairs. This still needs investigation.
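A survey of POS tag pairs like the one mentioned above can be sketched as follows. The function name `appos_pos_pairs` is hypothetical and the sketch assumes plain CoNLL-U input; it simply counts (head UPOS, dependent UPOS) pairs for every word attached with appos.

```python
from collections import Counter


def appos_pos_pairs(conllu_text):
    """Count (head UPOS, dependent UPOS) pairs linked by the appos relation."""
    pairs = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        words = [r for r in rows if r[0].isdigit()]
        upos = {r[0]: r[3] for r in words}  # word ID -> UPOS
        for r in words:
            if r[7] == "appos":
                # r[6] is the HEAD column; fall back to "ROOT" if it is 0.
                pairs[(upos.get(r[6], "ROOT"), r[3])] += 1
    return pairs
```

Sorting the resulting counter by frequency puts the expected pairs at the top and leaves the rare, suspicious pairs at the bottom, which is where manual inspection is most useful.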


Contributions

We implemented the cl-conllu library. It is written in Common Lisp, open source and freely available.

Since our group has not yet settled on any particular dependency editor, we also implemented an online CoNLL-U validation service.
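To illustrate what a structural CoNLL-U check involves, here is a minimal sketch in Python. It is neither the official UD validation script nor the online service mentioned above, and it is not based on cl-conllu; the function name `validate_sentence` and the choice of checks (ten columns, consecutive word IDs, heads in range, exactly one root) are assumptions for illustration.

```python
def validate_sentence(block):
    """Return a list of error strings for one CoNLL-U sentence block."""
    errors = []
    rows = [line.split("\t") for line in block.splitlines()
            if line and not line.startswith("#")]
    words = []
    for r in rows:
        if len(r) != 10:
            errors.append(f"expected 10 columns, got {len(r)}: {r[0]}")
        elif r[0].isdigit():  # skip multiword-token ranges and empty nodes
            words.append(r)
    ids = [int(r[0]) for r in words]
    if ids != list(range(1, len(ids) + 1)):
        errors.append("word IDs are not consecutive from 1")
    roots = 0
    for r in words:
        head = r[6]
        if not head.isdigit() or int(head) > len(words):
            errors.append(f"bad HEAD for word {r[0]}: {head}")
        elif head == r[0]:
            errors.append(f"word {r[0]} is its own head")
        elif int(head) == 0:
            roots += 1
    if roots != 1:
        errors.append(f"expected exactly one root, found {roots}")
    return errors
```

An empty list means the sentence passed these basic checks; it says nothing about whether the labels are linguistically right.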


What's Next

We should note that this work is not finished.

While our treebank is once again syntactically validated by the UD script, we are sure that many errors remain.

First, because, like other treebanks, we still have so-called 'semantic' failures, as described by the UD second level of validation.

But mostly because we know that many phenomena are not yet amenable to validation. Coordination, ellipsis and negation remain big issues.

A challenge: the lack of an editor. A tabular one is easier for linguists, but to facilitate collaboration it must be web-based too.


Problems

- Some cases of <n> were not converted to NOUN, although the dependencies are right:
  "A direção do novo semanal será assinada por Ewaldo Ruy" (The direction of the new weekly will be assumed by Ewaldo Ruy)
  "Pesquisadores acham que as linhas podem ser falhas geológicas" (Researchers believe that the lines may be geological faults)
- Many cases of reported speech and parataxis are inconsistently annotated.
- The relation discourse is not consistently annotated.
- Numerals also need to be revised: cases of 'trinta e sete' (37) and 'cento e dezesseis' (116) must be flat.
- Some obl that have the PALAVRAS tag PIV should be obj.
- We are now revising the appositional modifier: appos versus nmod.


Thanks

