GLARF-ULA: ULA08 Workshop March 19, 2007 GLARF-ULA: Working Towards Usability Unified Linguistic Annotation Workshop Adam Meyers New York University March 19, 2008


GLARF-ULA: ULA08 Workshop March 19, 2007

GLARF-ULA: Working Towards Usability

Unified Linguistic Annotation Workshop

Adam Meyers

New York University

March 19, 2008


Outline

• Introduction to the GLARF Approach
• What is a standard anyway?
• Improving & Distributing Easy to Use Parts
• Participation in CONLL
• Chinese GLARF


GLARF Approach to ULA

• A Typed Feature Structure Representation
• Produces a single-theory analysis
  – Not Reversible
• GLARF System combines:
  – hand-annotation
  – automatically generated annotation
  – combination of manual/automatic annotation


Example Sentence

• Meanwhile, they made three bids.
  – Offset of first character = 123
• Meanwhile: ARG1 = previous S, ARG2 = current S
  – PDTB
• made: ARG0 = they, ARG1 = three bids
  – PropBank
• bids: ARG0 = they, Support = made
  – NomBank
• (S (ADVP (RB Meanwhile)) (, ,)
     (NP (PRP they))
     (VP (VBN made)
         (NP (CD three) (NNS bids)))
     (. .))
  – Penn Treebank
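
The offset and the Penn Treebank bracketing on this slide can be tied together with a short sketch. The code below is an illustration only, not part of the GLARF system: the helper names `leaves` and `token_offsets` are hypothetical, the end offsets are assumed to be exclusive, and tokens are located by a simple left-to-right search in the sentence text.

```python
import re

def leaves(tree: str):
    """Return (POS, word) pairs for every leaf of a bracketed tree."""
    # A leaf is a parenthesis pair holding exactly two atoms: "(TAG word)".
    return re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", tree)

def token_offsets(text: str, words, start: int):
    """Map each word to (word, begin, end) document offsets (end exclusive)."""
    spans, cursor = [], 0
    for word in words:
        begin = text.index(word, cursor)  # first occurrence at or after cursor
        cursor = begin + len(word)
        spans.append((word, start + begin, start + cursor))
    return spans

TREE = ("(S (ADVP (RB Meanwhile)) (, ,) (NP (PRP they)) "
        "(VP (VBN made) (NP (CD three) (NNS bids))) (. .))")
SENTENCE = "Meanwhile, they made three bids."

tagged = leaves(TREE)
# 123 is the offset of the first character, as given on the slide.
spans = token_offsets(SENTENCE, [word for _, word in tagged], start=123)

for word, begin, end in spans:
    print(f"{word!r:12} {begin}-{end}")
```

With character offsets attached to every token, annotations from PDTB, PropBank, NomBank and the Penn Treebank can be aligned on the same spans even when they were produced separately.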


GLARF TFS

(S (ADV (ADVP (HEAD (ADVX (HEAD (RB Meanwhile 0))
                          (P-ARG1 (S (EC-TYPE PB) (INDEX 0+0)))
                          (P-ARG2 (S (EC-TYPE PB) (INDEX 0)))))
              (POINTER 0:1)))
   (PUNCTUATION (, , 1))
   (SBJ (NP (HEAD (PRP they 2)) (INDEX 1) (POINTER 2:1)))
   (PRD (VP (HEAD (VX (HEAD (VBN made 3))
                      (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                      (P-ARG1 (NP (EC-TYPE PB) (INDEX 3)))
                      (INDEX 2)))
            (OBJ (NP (T-POS (CD three 4))
                     (HEAD (NX (HEAD (NNS bids 5))
                               (P-ARG0-Supp (NP (EC-TYPE PB) (INDEX 1)))
                               (Support (VX (EC-TYPE PB) (INDEX 2)))))
                     (INDEX 3) (POINTER 4:1)))
            (POINTER 3:1)))
   (PUNCTUATION (. . 6))
   (POINTER 0:2) (TREE-NUM 1) (INDEX 0))
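
One way to read this structure is to follow each (INDEX n) inside an (EC-TYPE PB) element back to the phrase that carries the same index. The sketch below does that in Python. It is an illustration under stated assumptions, not the GLARF toolkit: the function names are hypothetical, the bracketing in `GLARF` is a hand-balanced copy of the slide's structure, and treating (INDEX n) on an EC-TYPE element as a pointer to the phrase with the same index is an interpretation of the slide. The index 0+0, which appears to point at the previous sentence, is left unresolved.

```python
import re

GLARF = """
(S (ADV (ADVP (HEAD (ADVX (HEAD (RB Meanwhile 0))
                          (P-ARG1 (S (EC-TYPE PB) (INDEX 0+0)))
                          (P-ARG2 (S (EC-TYPE PB) (INDEX 0)))))
              (POINTER 0:1)))
   (PUNCTUATION (, , 1))
   (SBJ (NP (HEAD (PRP they 2)) (INDEX 1) (POINTER 2:1)))
   (PRD (VP (HEAD (VX (HEAD (VBN made 3))
                      (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                      (P-ARG1 (NP (EC-TYPE PB) (INDEX 3)))
                      (INDEX 2)))
            (OBJ (NP (T-POS (CD three 4))
                     (HEAD (NX (HEAD (NNS bids 5))
                               (P-ARG0-Supp (NP (EC-TYPE PB) (INDEX 1)))
                               (Support (VX (EC-TYPE PB) (INDEX 2)))))
                     (INDEX 3) (POINTER 4:1)))
            (POINTER 3:1)))
   (PUNCTUATION (. . 6))
   (POINTER 0:2) (TREE-NUM 1) (INDEX 0))
"""

def parse(text):
    """Parse one bracketed expression into nested lists of strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    def read(pos):
        node, pos = [], pos + 1          # skip the opening "("
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                sub, pos = read(pos)
                node.append(sub)
            else:
                node.append(tokens[pos])
                pos += 1
        return node, pos + 1             # skip the closing ")"
    return read(0)[0]

def child(node, label):
    """First sub-list of node with the given label, or None."""
    return next((c for c in node[1:] if isinstance(c, list) and c[0] == label), None)

def words(node):
    """Surface words under node; a leaf looks like [POS, word, token_number]."""
    if len(node) == 3 and all(isinstance(x, str) for x in node):
        return [node[1]]
    return [w for c in node[1:] if isinstance(c, list) for w in words(c)]

def roles(tree):
    """Return (predicate, role, filler text or None) triples."""
    antecedents, found = {}, []
    def walk(node):
        index = child(node, "INDEX")
        if index and child(node, "EC-TYPE") is None:
            antecedents[index[1]] = " ".join(words(node))
        for c in node[1:]:
            if isinstance(c, list):
                if c[0].startswith("P-ARG") or c[0] == "Support":
                    predicate = " ".join(words(child(node, "HEAD")))
                    found.append((predicate, c[0], child(c[1], "INDEX")[1]))
                walk(c)
    walk(tree)
    return [(p, r, antecedents.get(i)) for p, r, i in found]

for predicate, role, filler in roles(parse(GLARF)):
    print(predicate, role, filler)
```

Resolving the pointers this way recovers the PropBank and NomBank relations of the example sentence (for instance, they as P-ARG0 of made, and made as Support of bids) from the single feature structure.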


What is a Standard Anyway?

• Wide Usage (VHS/Betamax, cassette/8-track, Windows/Mac)
  – Quality, the first of its kind, etc.
  – Papers written by happy users
  – A Shared Task like CONLL
• What need does GLARF-ULA fill?
  – Unified Detailed Linguistic Annotation
    • German, Czech, Japanese, but not English
  – A la carte analyses with compatible encodings insufficient
  – Because it is desirable to have common
    • tokenization, phrase boundaries, POS tags, etc.
    • obvious to GALE participants (part of SRI team uses GLARF)
• Working toward a standard, not necessarily GLARF
  – Make the “useful” pieces available
  – Contribute to the CONLL representation


Parts of GLARF-ULA that non-GLARF-users Want

• Last Year’s ULA meeting
  – Tokenization splits around hyphens
    • Based on NomBank and NE tags
  – Offset information
  – Possibly POS correction (if accurate)
• CONLL
  – Tokenization splits around hyphens
    • All real words (not just NomBank)
    • NE tags
  – NP-internal relations
    • apposition, relative, possessive, etc.
  – NE modification relations
    • POST-HON, TITLE


CONLL Splitting at Hyphens/Slashes 1

• Split tokens:
  – Assign POS tags
    • Automatic results for sample of 179 tokens
      – 153 correct (85.5%), 14 incorrect (7.8%), 12 unclear (6.7%)
  – Decimal token numbers
    • (VP (NP (NNP New 6)
              (NNP York 7.1))
          (HYPH - 7.2)
          (VBN based 7.3))
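
The decimal numbering can be sketched as follows. This is a minimal illustration, not the CONLL or GLARF implementation: the function name `split_with_decimal_ids` is hypothetical, and the sketch splits at every hyphen or slash without applying any conditions on when a split is appropriate and without assigning POS tags.

```python
import re

def split_with_decimal_ids(tokens, first_id=0):
    """Split tokens at hyphens/slashes and number the pieces n.1, n.2, ...

    Unsplit tokens keep their whole-number ID, so the original
    token numbering stays recoverable from the decimal IDs.
    """
    out = []
    for number, token in enumerate(tokens, start=first_id):
        # The capturing group keeps the hyphen or slash as its own piece.
        parts = [p for p in re.split(r"([-/])", token) if p]
        if len(parts) == 1:
            out.append((token, str(number)))
        else:
            out.extend((p, f"{number}.{k}") for k, p in enumerate(parts, start=1))
    return out

print(split_with_decimal_ids(["New", "York-based"], first_id=6))
```

Because the pieces of token 7 are numbered 7.1, 7.2 and 7.3, annotation that refers to the unsplit token 7 can still be mapped onto the split version.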


CONLL Splitting at Hyphens/Slashes 2

• Split Segments iff:
  – COMLEX words, numbers, prefixes (from a list)
  – Required by BBN NE tags (we made a gazetteer)
• Relations from GLARF
  – Conjunction cases: Japan-U.S. agreement
  – Everything else (distinguish HMOD/HEAD)
    • GLARF distinguishes them further
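
A rough sketch of the split test follows. The slide does not spell out the exact rule, so this is one possible reading: split when every segment is a known word, number or listed prefix, or when a segment matches a named entity. The sets `LEXICON`, `PREFIXES` and `NE_GAZETTEER` are tiny hypothetical stand-ins for COMLEX, the prefix list and the gazetteer built from BBN NE tags, and `should_split` is an illustrative name.

```python
import re

LEXICON = {"york", "based", "japan", "agreement"}  # stand-in for COMLEX
PREFIXES = {"anti", "non", "pre"}                  # stand-in for the prefix list
NE_GAZETTEER = {"New York", "Japan", "U.S."}       # stand-in for the NE gazetteer

def is_known(segment):
    """True for a lexicon word, a listed prefix, or a number."""
    lowered = segment.lower()
    return (lowered in LEXICON or lowered in PREFIXES
            or re.fullmatch(r"\d+(\.\d+)?", segment) is not None)

def should_split(token, previous=""):
    """Decide whether to split token at its hyphens/slashes (assumed rule)."""
    segments = [s for s in re.split(r"[-/]", token) if s]
    if len(segments) < 2:
        return False
    if all(is_known(s) for s in segments):
        return True
    # NE condition: a segment, alone or joined to the previous token, is a listed name.
    return any(s in NE_GAZETTEER or f"{previous} {s}".strip() in NE_GAZETTEER
               for s in segments)

print(should_split("York-based", previous="New"))
print(should_split("Japan-U.S."))
print(should_split("Smith-Jones"))
```

Under these assumptions York-based and Japan-U.S. are split, while a hyphenated name with no known segments is left as a single token.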


NP-internal Relations

• NP internal relations used for CONLL
  – Title: Mr. John Smith
  – Post-Hon: John Smith Jr. III, Inc., Ph.D., etc.
  – APPOsite: John Smith, president of the U.S.
  – SUFFIX: John 's
  – Near 100% accuracy for small sample
    • 45 correct, 2 unclear
• All NP GLARF Roles
  – RELATIVE, COMP, A-POS, T-POS, Q-POS, etc.
  – 224 correct (83.9%), 32 wrong (12.0%), 11 unclear (4.1%)


Automatic GLARF for ULA-OANC-1

• Out of the Box with Charniak parser
  – Role Precision for 1st 5 sentences in Kaufman
  – NomBank: 8/10 (80%)
  – PropBank: 25/31 (81%)
  – PDTB: 7/11 (64%)
• Tune Charniak results
• Run/Tune on Treebank (and other hand data)
• Process CONLL style
• Use for LAW 2 WG task
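
The role precision percentages on this slide are the correct/total counts rounded to whole percents. A minimal check, using only the counts given above (the names `counts` and `percent` are illustrative):

```python
from fractions import Fraction

# (correct roles, proposed roles) per resource, as listed on the slide.
counts = {"NomBank": (8, 10), "PropBank": (25, 31), "PDTB": (7, 11)}

def percent(correct, total):
    """Precision as a whole-number percentage."""
    return round(100 * Fraction(correct, total))

for resource, (correct, total) in counts.items():
    print(f"{resource}: {correct}/{total} ({percent(correct, total)}%)")
```

The sample is small (52 roles over five sentences), so these percentages are best read as a first indication rather than a stable estimate.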


Chinese TreeBank and PropBank

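警方 正在 调查 此 事
police now investigate this matter
“The police are investigating this matter.”

[Figure: Chinese TreeBank parse tree for the sentence (phrase labels IP, NP, VP, ADVP, DP; POS tags NN, AD, VV, DT) with PropBank labels Arg0, predicate and Arg1]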


Chinese GLARF

(IP (SBJ (NP (HEAD (NN 警方))
             (INDEX 1)))
    (PRD (VP (ADV (ADVP (HEAD (AD 正在))))
             (HEAD (VX (HEAD (VV 调查))
                       (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                       (P-ARG1 (NP (EC-TYPE PB) (INDEX 2)))))
             (OBJ (NP (T-POS (DP (HEAD (DT 此))))
                      (HEAD (NX (HEAD (NN 事))))
                      (INDEX 2))))))


Summary

• Helped build a CONLL standard
  – Adopting the “useful” parts of GLARF
• Interoperability
  – Automatic GLARF
  – Input Annotation (hand or automatic)
• Extend to Chinese (and Japanese)


Future for GLARF-ULA

• NE-like integration, e.g. TIMEX, Opinion
  – Structure-changing vs. match dependency head
  – NEs with markable Nom/PropBank structure
• PDTB and NomBank overlap occasionally
  – For example, As a result, etc.
  – adjudication procedures needed
• TimeML relations, NonOvert PDTB
• More CONLL integration