GLARF-ULA: ULA08 Workshop, March 19, 2007
GLARF-ULA: Working Towards Usability
Unified Linguistic Annotation Workshop
Adam Meyers
New York University
March 19, 2008
Outline
• Introduction to the GLARF Approach
• What is a standard anyway?
• Improving & Distributing Easy to Use Parts
• Participation in CONLL
• Chinese GLARF
GLARF Approach to ULA
• A Typed Feature Structure Representation
• Produces a single-theory analysis
  – Not Reversible
• GLARF System combines:
  – hand-annotation
  – automatically generated annotation
  – combination of manual/automatic annotation
Example Sentence
• Meanwhile, they made three bids.
  – Offset of first character = 123
• Meanwhile: ARG1 = previous S, ARG2 = current S
  – PDTB
• made: ARG0 = they, ARG1 = three bids
  – PropBank
• bids: ARG0 = they, Support = made
  – NomBank
• (S (ADVP (RB Meanwhile)) (, ,) (NP (PRP they)) (VP (VBN made) (NP (CD three) (NNS bids))) (. .))
  – Penn Treebank
GLARF TFS
(S (ADV (ADVP (HEAD (ADVX (HEAD (RB Meanwhile 0))
                          (P-ARG1 (S (EC-TYPE PB) (INDEX 0+0)))
                          (P-ARG2 (S (EC-TYPE PB) (INDEX 0)))))
              (POINTER 0:1)))
   (PUNCTUATION (, , 1))
   (SBJ (NP (HEAD (PRP they 2)) (INDEX 1) (POINTER 2:1)))
   (PRD (VP (HEAD (VX (HEAD (VBN made 3))
                      (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                      (P-ARG1 (NP (EC-TYPE PB) (INDEX 3)))
                      (INDEX 2)))
            (OBJ (NP (T-POS (CD three 4))
                     (HEAD (NX (HEAD (NNS bids 5))
                               (P-ARG0-Supp (NP (EC-TYPE PB) (INDEX 1)))
                               (Support (VX (EC-TYPE PB) (INDEX 2)))))
                     (INDEX 3) (POINTER 4:1)))
            (POINTER 3:1)))
   (PUNCTUATION (. . 6))
   (POINTER 0:2) (TREE-NUM 1) (INDEX 0))
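In the TFS above, an argument such as P-ARG0 is not copied into the predicate; it is a pointer node (marked with EC-TYPE) whose INDEX matches the INDEX of the phrase it refers to. The following is a minimal Python sketch of how such a bracketed structure could be read and its pointers resolved. It is not part of the GLARF distribution: the function names (tokenize, parse, feature, index_table, words, resolve_args) are hypothetical, and FRAGMENT is a trimmed version of the example, keeping only the subject and the verb's P-ARG0.

```python
import re

# Trimmed fragment of the example TFS (an assumption made for brevity):
# only the subject and the verb's P-ARG0 pointer are kept.
FRAGMENT = """
(S (SBJ (NP (HEAD (PRP they 2)) (INDEX 1)))
   (PRD (VP (HEAD (VX (HEAD (VBN made 3))
                      (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                      (INDEX 2))))))
"""

def tokenize(text):
    # Parentheses are structure; everything else splits on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one '(label child ...)' node; return ((label, children), next_pos)."""
    assert tokens[pos] == "("
    label = tokens[pos + 1]
    children = []
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1

def feature(node, name):
    """Atomic value of feature `name` directly under node, or None."""
    for child in node[1]:
        if (isinstance(child, tuple) and child[0] == name
                and child[1] and isinstance(child[1][0], str)):
            return child[1][0]
    return None

def index_table(node, table=None):
    """Map each INDEX value to the non-pointer node that carries it."""
    table = {} if table is None else table
    index = feature(node, "INDEX")
    if index is not None and feature(node, "EC-TYPE") is None:
        table[index] = node
    for child in node[1]:
        if isinstance(child, tuple):
            index_table(child, table)
    return table

def words(node):
    """Surface words under a node; leaves look like (POS word token-number)."""
    children = node[1]
    if len(children) == 2 and all(isinstance(c, str) for c in children):
        return [children[0]]
    found = []
    for child in children:
        if isinstance(child, tuple):
            found.extend(words(child))
    return found

def resolve_args(node, table):
    """Follow each P-ARG pointer under node to the words of its antecedent."""
    resolved = {}
    for child in node[1]:
        if isinstance(child, tuple) and child[0].startswith("P-ARG"):
            pointer = child[1][0]
            resolved[child[0]] = words(table[feature(pointer, "INDEX")])
    return resolved

tree, _ = parse(tokenize(FRAGMENT))
table = index_table(tree)
print(resolve_args(table["2"], table))
```

Keeping arguments as index pointers rather than copies is what lets one phrase (here the NP with INDEX 1) serve as an argument of several predicates, as "they" does for both "made" and "bids" in the full example.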
What is a Standard Anyway?
• Wide Usage (VHS/Betamax, cassette/8-track, Windows/MAC)
  – Quality, the first of its kind, etc.
  – Papers written by happy users
  – A Shared Task like CONLL
• What need does GLARF-ULA fill?
  – Unified Detailed Linguistic Annotation
    • German, Czech, Japanese, but not English
  – A la carte analyses with compatible encodings insufficient
  – Because it is desirable to have common
    • tokenization, phrase boundaries, POS tags, etc.
    • obvious to GALE participants (part of SRI team uses GLARF)
• Working toward a standard, not necessarily GLARF
  – Make the “useful” pieces available
  – Contribute to the CONLL representation
Parts of GLARF-ULA that non-GLARF-users Want
• Last Year’s ULA meeting
  – Tokenization splits around hyphens
    • Based on NomBank and NE tags
  – Offset information
  – Possibly POS correction (if accurate)
• CONLL
  – Tokenization splits around hyphens
    • All real words (not just NomBank)
    • NE tags
  – NP-internal relations
    • apposition, relative, possessive, etc.
  – NE modification relations
    • POST-HON, TITLE
CONLL Splitting at Hyphens/Slashes 1
• Split tokens:
  – Assign POS tags
    • Automatic results for sample of 179 tokens
    • 153 correct (85.5%), 14 incorrect (7.8%), 12 unclear (6.7%)
  – Decimal token numbers
    • (VP (NP (NNP New 6) (NNP York 7.1)) (HYPH - 7.2) (VBN based 7.3))
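The decimal numbering can be illustrated with a short Python sketch: a token that is split at a hyphen or slash keeps its original token number, and each resulting piece gets a decimal suffix, so "York-based" as token 7 becomes 7.1, 7.2 and 7.3. This is only an illustration under stated assumptions, not the GLARF implementation; the function name split_with_decimal_numbers and the should_split predicate are hypothetical, and by default every hyphenated or slashed token is split.

```python
import re

def split_with_decimal_numbers(tokens, first_number, should_split=lambda tok: True):
    """Number tokens from first_number; split at '-' or '/' into N.1, N.2, ...

    should_split is a hypothetical hook for deciding whether a given token
    may be split at all; the default splits every candidate.
    """
    numbered = []
    for number, token in enumerate(tokens, start=first_number):
        # The capturing group keeps the hyphen or slash as its own piece.
        parts = [part for part in re.split(r"([-/])", token) if part]
        if len(parts) > 1 and should_split(token):
            for sub, part in enumerate(parts, start=1):
                numbered.append((part, f"{number}.{sub}"))
        else:
            numbered.append((token, str(number)))
    return numbered

for word, number in split_with_decimal_numbers(["New", "York-based"], first_number=6):
    print(number, word)
```

Because unsplit tokens keep their original integer numbers, annotations that refer to the original tokenization remain valid after splitting.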
CONLL Splitting at Hyphens/Slashes 2
• Split Segments iff:
  – COMLEX words, numbers, prefixes (from a list)
  – Required by BBN NE tags (we made a gazetteer)
• Relations from GLARF
  – Conjunction cases: Japan-U.S. agreement
  – Everything else (distinguish HMOD/HEAD)
    • GLARF distinguishes them further
NP-internal Relations
• NP internal relations used for CONLL
  – Title: Mr. John Smith
  – Post-Hon: John Smith Jr. III, Inc., Ph.D., etc.
  – APPOsite: John Smith, president of the U.S.
  – SUFFIX: John 's
  – Near 100% accuracy for small sample
    • 45 correct, 2 unclear
• All NP GLARF Roles
  – RELATIVE, COMP, A-POS, T-POS, Q-POS, etc.
  – 224 correct (83.9%), 32 wrong (12%), 11 unclear (4.1%)
Automatic GLARF for ULA-OANC-1
• Out of the Box with Charniak parser
  – Role Precision for 1st 5 sentences in Kaufman
  – NomBank: 8/10 (80%)
  – PropBank: 25/31 (81%)
  – PDTB: 7/11 (64%)
• Tune Charniak results
• Run/Tune on Treebank (and other hand data)
• Process CONLL style
• Use for LAW 2 WG task
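The role precision figures above are the number of correct roles divided by the number of roles the system proposed, rounded to a whole percent. A small Python sketch of that arithmetic, using only the counts given on the slide (the names counts and precision_percent are illustrative):

```python
# (correct, proposed) role counts from the slide, per annotation layer.
counts = {"NomBank": (8, 10), "PropBank": (25, 31), "PDTB": (7, 11)}

def precision_percent(correct, proposed):
    """Precision as a whole-number percentage: correct / proposed."""
    return round(100 * correct / proposed)

for layer, (correct, proposed) in counts.items():
    print(f"{layer}: {correct}/{proposed} ({precision_percent(correct, proposed)}%)")
```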
Chinese TreeBank and PropBank
• 警方 正在 调查 此 事
  – Gloss: police now investigate this matter
  – “The police are investigating this matter.”
• TreeBank: (IP (NP (NN 警方)) (VP (ADVP (AD 正在)) (VP (VV 调查) (NP (DP (DT 此)) (NP (NN 事))))))
• PropBank: predicate = 调查, Arg0 = 警方, Arg1 = 此 事
Chinese GLARF
(IP (SBJ (NP (HEAD (NN 警方)) (INDEX 1)))
    (PRD (VP (ADV (ADVP (HEAD (AD 正在))))
             (HEAD (VX (HEAD (VV 调查))
                       (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                       (P-ARG1 (NP (EC-TYPE PB) (INDEX 2)))))
             (OBJ (NP (T-POS (DP (HEAD (DT 此))))
                      (HEAD (NX (HEAD (NN 事))))
                      (INDEX 2))))))
Summary
• Helped build a CONLL standard
  – Adopting the “useful” parts of GLARF
• Interoperability
  – Automatic GLARF
  – Input Annotation (hand or automatic)
• Extend to Chinese (and Japanese)
Future for GLARF-ULA
• NE-like integration, e.g. TIMEX, Opinion
  – Structure-changing vs. match dependency head
  – NEs with markable Nom/PropBank structure
• PDTB and NomBank overlap occasionally
  – For example, As a result, etc.
  – adjudication procedures needed
• TimeML relations, NonOvert PDTB
• More CONLL integration