PDT 1
Zdeněk Žabokrtský
e-mail: [email protected]
Czech Technical University, Department of Computer Science
the following presentation can be downloaded from http://obelix.ijs.si/ZdenekZabokrtsky/PDT/
PDT 2
The Prague Dependency Treebank (PDT)
• long-term project aimed at a complex annotation of a
part of the Czech National Corpus with a rich annotation
scheme
• Institute of Formal and Applied Linguistics
– established in 1990 at the Faculty of Mathematics and
Physics, Charles University, Prague
– Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, …
– http://ufal.ms.mff.cuni.cz
PDT 3
The Prague Dependency Treebank
• inspiration:
– the Penn Treebank (the most widely used syntactically
annotated corpus of English)
• motivation:
– the treebank can obviously be used for further linguistic
research
– more accurate results can be obtained when using
annotated corpora than when using texts in their raw form
(unsupervised training)
PDT 4
Source of the text data
• provided by Institute of the Czech National Corpus (ICNC)
• text sample for PDT – 456 705 tokens (words and punctuation marks) in 26 610
sentences, divided into 576 files, 50 sentences per file
– 40 % - general newspaper articles (Lidové noviny, Mladá Fronta)
– 20 % - economic news and analysis (Českomoravský profit)
– 20 % - popular science magazine (Vesmír)
– 20 % - information technology texts
• divided into
– a training set (19 126 sentences)
– a development test set (3 697)
– a cross-evaluation test data set (3 787)
PDT 5
Institute of the Czech National Corpus
• founded 1994 at the Faculty of Philosophy, Charles University
• head of the institute: prof. František Čermák
• 100 million words
• freely accessible: http://ucnk.ff.cuni.cz
– query language CQP (Corpus Query Processor, developed
at the University of Stuttgart)
– regular expressions
– examples of queries:
disku[s|z]e
.+nést
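The two example queries can be tried out with ordinary regular expressions. The following is a minimal sketch in Python, not the CQP engine itself: it assumes that a query pattern has to match a whole token (hence `re.fullmatch`), and the token list is hypothetical, chosen only for illustration.

```python
import re

# The two query patterns from the slide.
QUERIES = [r"disku[s|z]e", r".+nést"]

# Hypothetical token list, used only to illustrate the matching.
TOKENS = ["diskuse", "diskuze", "diskutovat", "nést", "přinést", "odnést"]


def matching_tokens(pattern, tokens):
    """Return the tokens that the pattern matches as a whole (assumed behaviour)."""
    compiled = re.compile(pattern)
    return [token for token in tokens if compiled.fullmatch(token)]


for query in QUERIES:
    print(query, matching_tokens(query, TOKENS))
```

Under this assumption `.+nést` finds only prefixed verbs, because `.+` requires at least one character before `nést`. Inside the character class `[s|z]` the bar is a literal character, so `[sz]` would be the tighter way to write the first query.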
PDT 6
CNC: query example
• query: .+nosit
• response:
tačí se trochu vybavit , <nanosit> kupu listí a sena - já ho
ie Každý mistr by se měl <honosit> nějakým rekordem či jedin
anční tísni by měly dítě <donosit> . Bezvýhradná povinnost p
í hladovění bude schopna <donosit> plod . Mimochodem i u sou
evítané těhotenství tzv. <donosit> a dítěte se vzdát ve pros
mž sedíme , nepostavil . <Vynosit> tuny kamení na zádech , t
byl v nebezpečí a naděje <donosit> dítě žádná . Jeden večer
6 - Živit mateř . mlékem <Nanosit> 57 - Ukončit létání 58 -
odstatně větší a může se <honosit> řadou úctyhodných přívlas
vy , v pokoji nekouřit , <nenosit> domů alkohol . Dodržovat
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
. . .
PDT 7
Layered structure of PDT
• morphological level
– full morphological tagging (word forms, lemmas, morphological tags)
• analytical level
– surface syntax
– syntactic annotation using dependency syntax (captures analytical functions such as Subject, Object,...)
• tectogrammatical level
– level of linguistic meaning (tectogrammatical functions such as Actor, Patient,...)
raw text → morphologically tagged text → analytic tree structures (ATS) → tectogrammatical tree structures (TGTS)
PDT 8
The Morphological Level
• a tag and a lemma are assigned to each word form
from the input text
• 3030 tags (Czech is an inflectionally rich language)
• 6 tag variables
– number, case, gender
– degree of comparison, person, negation
• example:
– VPS3A - verb (indicative, present tense, sing., 3rd
person, affirmative)
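A tag such as VPS3A can be read position by position. The sketch below decodes only present-tense verb tags of this five-character shape. The positional layout (part of speech, verb form, number, person, negation) and the letter values other than those in VPS3A are assumptions inferred from the example, not a specification of the tag set.

```python
# Assumed value tables; only "S", "3" and "A" are confirmed by the VPS3A example.
NUMBER = {"S": "singular", "P": "plural"}
NEGATION = {"A": "affirmative", "N": "negated"}


def decode_present_verb_tag(tag):
    """Decode a five-character tag of the assumed form V P <number> <person> <negation>."""
    if len(tag) != 5 or not tag.startswith("VP"):
        raise ValueError(f"not a present-tense verb tag: {tag!r}")
    return {
        "pos": "verb",
        "form": "indicative, present tense",
        "number": NUMBER[tag[2]],
        "person": int(tag[3]),
        "negation": NEGATION[tag[4]],
    }


print(decode_present_verb_tag("VPS3A"))
```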
PDT 9
Morphological Analysis
• an automatic process:
– input: word form
– output: a set of possible lemmas, each lemma accompanied by a set of possible tags
• currently covers 720000 Czech lemmas, based on 210000 stems
• can recognize 20 million word forms
• output ambiguity:
– there may be as many as 5 different lemmas for a given word form
– as many as 27 different tags for a given lemma
– example: učení - NNS1A, NNS2A, NNS3A,...,NNP5A
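The shape of the analyser's output (word form, then lemmas, then tags per lemma) can be sketched as a nested mapping. This is only an illustration of the data structure, not the analyser: the lookup table is a tiny hand-made one, its entries are copied from the tagged sentences shown on other slides of this presentation, and the function names are hypothetical.

```python
# word form -> lemma -> possible tags (before disambiguation)
ANALYSES = {
    "období": {
        "období": [
            "NNP1A", "NNP2A", "NNP4A", "NNP5A", "NNS1A",
            "NNS2A", "NNS3A", "NNS4A", "NNS5A", "NNS6A",
        ],
    },
    "takovou": {"takový": ["AFS41A", "AFS71A"]},
    "to": {"ten": ["PDNS1", "PDNS4"]},
}


def analyse(word_form):
    """Return all (lemma -> tags) candidates for a word form; empty if unknown."""
    return ANALYSES.get(word_form, {})


def ambiguity(word_form):
    """Return (number of candidate lemmas, total number of candidate tags)."""
    candidates = analyse(word_form)
    return len(candidates), sum(len(tags) for tags in candidates.values())


print(ambiguity("období"))
```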
PDT 10
The whole process of morphological tagging
• automatic morphological analysis
• manual disambiguation
– 2 annotators
– in the full text context
– special software tool
• automatic comparison
• manual correction
raw text → unambiguously tagged text
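The automatic comparison step can be pictured as a position-by-position check of the two annotators' choices, with every mismatch handed over to manual correction. The sketch below assumes each annotation is a list of (lemma, tag) pairs aligned by token. The function name and the two sample annotations are hypothetical.

```python
def disagreements(annotation_a, annotation_b):
    """Return the token positions where the two annotators chose differently."""
    if len(annotation_a) != len(annotation_b):
        raise ValueError("annotations must cover the same tokens")
    return [
        position
        for position, (choice_a, choice_b) in enumerate(zip(annotation_a, annotation_b))
        if choice_a != choice_b
    ]


# Hypothetical choices for the tokens "takovou publicitu".
annotator_1 = [("takový", "AFS41A"), ("publicita", "NFS4A")]
annotator_2 = [("takový", "AFS71A"), ("publicita", "NFS4A")]

print(disagreements(annotator_1, annotator_2))
```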
PDT 11
Data Format
• Standard Generalized Markup Language (SGML)
• a sample of DTD (Document Type Definition) related to the morphological level:
<!ELEMENT MMl - O (#PCDATA & R? & E? & e? & T* & MMt*)
-- lemma (base form), description see the l tag;
machine assigned (by a morphological analysis program),
NOT disambiguated
-->
<!ELEMENT MDl - O (#PCDATA & R? & E? & e? & T* & MDt*)
-- lemma (base form), description see the l tag;
machine assigned (by a tagger), disambiguated
       if more than 1: n-best -->
. . .
<!ELEMENT MMt - O (#PCDATA)
-- morphological tag(s) as assigned by morphology,
NOT disambiguated
-->
<!ELEMENT MDt - O (#PCDATA)
-- morphological tag(s) as assigned by machine, disambiguated,
possibly also with weight/prob; if more than 1: n-best
-->
PDT 12
Example of tagged sentence
• Ty mají pak někdy takovou publicitu, že to dotyčnou kancelář zlikviduje. (English: "These then sometimes get such publicity that it destroys the office in question.")
<s id=cmpr9415:025-p19s2/bcc14zua.fs/#18>
<f cap>Ty<MMl>ty<MMt>PP2S1<MMt>PP2S5<MMl>ten<MMt>...
... PDFP1<MMt>PDFP4<MMt>PDIP1<MMt>PDIP4<MMt>PDMP4<A>Sb<r>1<g>2
<f>mají<MMl>mít<MMt>VPP3A<A>Pred<r>2<g>0
<f>pak<MMl>pak<MMt>DB<A>Adv<r>3<g>2
<f>někdy<MMl>někdy<MMt>DB<A>Adv<r>4<g>2
<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6
<f>publicitu<MMl>publicita<MMt>NFS4A<A>Obj<r>6<g>2
<D>
<d>,<MMl>,<MMt>ZIP<A>AuxX<r>7<g>8
<f>že<MMl>že<MMt>JS<A>AuxC<r>8<g>6
<f>to<MMl>ten<MMt>PDNS1<MMt>PDNS4<A>Sb<r>9<g>13
<f>dotyčnou<MMl>dotyčný<MMt>AFS41A<MMt>AFS71A<A>Atr<r>10<g>11
<f>kancelář<MMl>kancelář<MMt>NFS1A<MMt>NFS4A<A>Obj<r>11<g>13
<f>prakticky<MMl>prakticky_^(*1ý)<MMt>DG1A<A>Adv<r>12<g>13
<f>zlikviduje<MMl>zlikvidovat_:W<MMt>VPS3A<A>Obj<r>13<g>8
<D>
<d>.<MMl>.<MMt>ZIP<A>AuxK<r>14<g>0
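Each token line above can be split into its fields with a small parser. The following sketch is not the project's own tooling. It assumes, from the sample alone, that `<f>` and `<d>` introduce a word form or punctuation mark, `<MMl>` a lemma, `<MMt>` a tag belonging to the preceding lemma, `<A>` the analytical function, `<r>` the position in the sentence and `<g>` the position of the governing node.

```python
import re

# One field = "<name>" followed by its value (everything up to the next "<").
FIELD = re.compile(r"<([^<>]+)>([^<]*)")


def parse_token_line(line):
    """Parse one token line into a dictionary (field meanings are assumptions)."""
    token = {"form": None, "lemmas": [], "afun": None, "ord": None, "gov": None}
    for name, value in FIELD.findall(line):
        name = name.split()[0]  # "f cap" -> "f"
        if name in ("f", "d"):
            token["form"] = value
        elif name == "MMl":
            token["lemmas"].append({"lemma": value, "tags": []})
        elif name == "MMt":
            token["lemmas"][-1]["tags"].append(value)
        elif name == "A":
            token["afun"] = value
        elif name == "r":
            token["ord"] = int(value)
        elif name == "g":
            token["gov"] = int(value)
    return token


print(parse_token_line("<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6"))
```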
PDT 13
The Analytical Level
• the dependency structure was chosen to represent the syntactic
relations within the sentence.
• output of the analytical level: analytical tree structure (ATS)
– oriented, acyclic graph with one entry node
– every word form and punctuation mark is represented as a node
– the nodes are annotated by attribute-value pairs
• new attribute: analytical function
– determines the relation between the dependent node and its governing
node
– values: Sb, Obj, Adv, Atr,....
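A tree of this kind can be rebuilt from per-node governor positions. The sketch below uses the first six tokens of the tagged sentence "Ty mají pak někdy takovou publicitu, ..." from the morphological-level example. It assumes that the `<r>` value is the node's position, the `<g>` value is the position of its governing node, and 0 stands for the entry node. The function names are hypothetical.

```python
from collections import defaultdict

# (position, word form, analytical function, governor position)
NODES = [
    (1, "Ty", "Sb", 2),
    (2, "mají", "Pred", 0),
    (3, "pak", "Adv", 2),
    (4, "někdy", "Adv", 2),
    (5, "takovou", "Atr", 6),
    (6, "publicitu", "Obj", 2),
]


def children_by_governor(nodes):
    """Map each governor position to the positions of its dependent nodes."""
    children = defaultdict(list)
    for position, _form, _afun, governor in nodes:
        children[governor].append(position)
    return dict(children)


def render(nodes, governor=0, depth=0):
    """Return the subtree below `governor` as indented lines."""
    by_position = {node[0]: node for node in nodes}
    children = children_by_governor(nodes)
    lines = []
    for position in children.get(governor, []):
        _, form, afun, _ = by_position[position]
        lines.append("  " * depth + f"{form} ({afun})")
        lines.extend(render(nodes, position, depth + 1))
    return lines


print("\n".join(render(NODES)))
```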
PDT 14
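Example of ATS
• V návrzích na případné změny vycházejí ze svých většinou
několikaletých podnikatelských zkušeností. (English: "In their proposals for possible changes, they draw on their mostly several-year business experience.")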
PDT 15
Selected attributes of ATS’s nodes
PDT 16
Selected values of the analytical function
PDT 17
Example of tagged sentence
• ...ve sledovaném období žádný okres nezlepšil svoji pozici... (English: "...in the period under review no district improved its position...")
<f>ve<MMl>v<MMt>RV4<MMt>RV6<A>AuxP<r>4<g>9
<f>sledovaném<MMl>sledovaný_^(*2t)<MMt>AIS61A<MMt>AMS61A<MMt>ANS61A<A>Atr<r>5<g>6
<f>období<MMl>období<MMt>NNP1A<MMt>NNP2A<MMt>NNP4A<MMt>NNP5A<MMt>NNS1A<MMt>NNS2A<MMt>NNS3A<MMt>NNS4A<MMt>NNS5A<MMt>NNS6A<A>Adv<r>6<g>4
<f>žádný<MMl>žádný<MMt>PNFIS4<MMt>PNFYS1<MMt>PNFYS5<A>Atr<r>7<g>8
<f>okres<MMl>okres<MMt>NIS1A<MMt>NIS4A<A>Sb<r>8<g>9
<f>nezlepšil<MMl>zlepšit_:W<MMt>VRYSN<A>Pred_Co<r>9<g>11
<f>pozici<MMl>pozice<MMt>NFS3A<MMt>NFS4A<MMt>NFS6A<A>Obj<r>10<g>9
PDT 18
The Tectogrammatical Level
• based on the framework of the Functional Generative
Description as developed by Petr Sgall
• in comparison to the ATSs, the tectogrammatical tree
structures (TGTSs) have the following characteristics:
– only autosemantic words have their own node; function words
(conjunctions, prepositions) are attached as indices to the
autosemantic words to which they belong
– nodes are added in case of clearly specified deletions on the
surface level
– analytical functions are substituted by tectogrammatical
functions (functors), such as Actor, Patient, Addressee,...
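The first of these characteristics, folding function words into the autosemantic words they belong to, can be illustrated on part of the analytical annotation of "ve sledovaném období žádný okres nezlepšil". This is a heavily simplified sketch, not the project's transformation procedure: it assumes that the analytical functions AuxP and AuxC mark prepositions and conjunctions, and it neither restores deleted nodes nor assigns functors. The function name is hypothetical.

```python
# Assumption: these analytical functions mark function words.
FUNCTION_AFUNS = {"AuxP", "AuxC"}

# position -> node, taken from the tagged fragment of the analytical example
ATS_NODES = {
    4: {"form": "ve", "afun": "AuxP", "gov": 9},
    5: {"form": "sledovaném", "afun": "Atr", "gov": 6},
    6: {"form": "období", "afun": "Adv", "gov": 4},
    8: {"form": "okres", "afun": "Sb", "gov": 9},
    9: {"form": "nezlepšil", "afun": "Pred_Co", "gov": 11},
}


def collapse_function_words(nodes):
    """Drop function-word nodes and record them on the word they introduce."""
    result = {}
    for position, node in nodes.items():
        if node["afun"] in FUNCTION_AFUNS:
            continue  # function words get no node of their own
        governor = node["gov"]
        attached = []
        # Skip over any chain of function words above this node.
        while governor in nodes and nodes[governor]["afun"] in FUNCTION_AFUNS:
            attached.append(nodes[governor]["form"])
            governor = nodes[governor]["gov"]
        result[position] = {
            "form": node["form"],
            "gov": governor,
            "function_words": attached,
        }
    return result


for position, node in collapse_function_words(ATS_NODES).items():
    print(position, node)
```

In the result, "období" depends directly on "nezlepšil" and carries the preposition "ve" as an attribute, while the other nodes keep their governors.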
PDT 19
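Example of TGTS
• Podle předběžných odhadů se totiž počítá, že do soukromého vlastnictví
bude prodáno minimálně 10000 bytů (English: "According to preliminary estimates, it is expected that at least 10000 flats will be sold into private ownership")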
PDT 20
Selected attributes of a TGTS node
PDT 21
Functors
• tectogrammatical counterparts of analytical functions
• about 40 functors in 2 groups:
– actants
• Actor, Patient, Addressee, Origin, Effect
– free modifiers
• LOC, DIR1, RSTR, TWHEN, TTIL,...
• provide more detailed information about the relation
to the governing node than the analytical function
PDT 22
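Example of ATS ...
• Kdo chce investovat dvě stě tisíc korun do nového automobilu, nelekne
se, že benzín byl změnou zákona trochu zdražen. (English: "Whoever wants to invest two hundred thousand crowns in a new car will not be alarmed that petrol has become slightly more expensive due to a change in the law.")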
PDT 23
... and the corresponding TGTS
PDT 24
Tectogrammatical tagging
• 2 parallel streams, both starting from the ATS treebank:
– smaller set of fully tagged TGTSs
– larger set of partially tagged TGTSs (only changes of tree structure, functor and TFA assignment)
PDT 25
Problems of automatic functor assignment
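• za roh ("around the corner") - DIR3
• za hodinu ("in an hour") - TWHEN
• za svobodu ("for freedom") - OBJ
• po otci
– TWHEN (Přišel po otci. "He came after his father.")
– NORM (Jmenuje se po otci. "He is named after his father.")
– HER (Zdědil dům po otci. "He inherited the house from his father.")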
– . . .
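The difficulty can be made concrete by grouping the examples above by preposition and by whole phrase. The sketch below only tabulates the slide's own examples (the list for "po otci" is incomplete, as the ellipsis indicates). The function names are hypothetical.

```python
from collections import defaultdict

# (preposition, noun form, functor) triples taken from the examples above.
EXAMPLES = [
    ("za", "roh", "DIR3"),
    ("za", "hodinu", "TWHEN"),
    ("za", "svobodu", "OBJ"),
    ("po", "otci", "TWHEN"),
    ("po", "otci", "NORM"),
    ("po", "otci", "HER"),
]


def functors_by_preposition(examples):
    """Collect the functors observed for each preposition."""
    observed = defaultdict(list)
    for preposition, _noun, functor in examples:
        if functor not in observed[preposition]:
            observed[preposition].append(functor)
    return dict(observed)


def functors_by_phrase(examples):
    """Collect the functors observed for each preposition + noun phrase."""
    observed = defaultdict(list)
    for preposition, noun, functor in examples:
        if functor not in observed[(preposition, noun)]:
            observed[(preposition, noun)].append(functor)
    return dict(observed)


print(functors_by_preposition(EXAMPLES))
print(functors_by_phrase(EXAMPLES)[("po", "otci")])
```

For "za" the noun is enough to separate the three functors in these examples, whereas "po otci" stays three-way ambiguous even as a whole phrase, so the choice has to come from the governing verb or wider context.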
PDT 26
Summary
• the current state of the art:
– there are several manually annotated files of TGTSs
– methods for automatic transformation from ATS into
TGTS form are in development
Czech National Corpus → morphologically tagged corpus → ATS treebank → TGTS treebank
(timeline marks: September 1994, November 1996, March 2000)