

NAACL 2007

Treebank-Based Acquisition of Multilingual LFG Resources 1

Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer

Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland

Treebank Workshop NAACL 2007


Lexical-Functional Grammar (LFG)

• “Shallow” grammar: defines a language (set of strings)
• “Deep” grammar: as above + maps strings to a “meaning” representation: predicate-argument structure, dependencies, simple logical form …; usually involves some form of long-distance dependency (LDD) resolution
• Deep grammars (HPSG, LFG, CCG, TAG …) are usually hand-crafted
• Very difficult & expensive to scale to unrestricted text
• Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)!!

• LFG: [Kaplan and Bresnan, 82; Dalrymple, 2001; Bresnan, 2001]
• Constraint-based (“unification”), lexicalised
• c(onstituent)-structure & f(unctional)-structure
• c-str: surface configuration (CFG trees)
• f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)
• f-str: AVM (feature-structure) encoding of dependencies/pred-arg.
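To make the idea of an f-structure as an attribute-value matrix concrete, here is a minimal sketch that encodes one as nested Python dictionaries. The example sentence ("John wants to leave"), the NUM/PERS/TENSE features and the helper `pred_args` are illustrative assumptions, not taken from the slides. The shared SUBJ value shows one way a re-entrancy (here subject control) could be represented: both paths point to the same object.

```python
# Hypothetical f-structure for "John wants to leave", written as a nested dict (AVM).
john = {"PRED": "John", "NUM": "sg", "PERS": 3}

f_structure = {
    "PRED": "want<SUBJ,XCOMP>",
    "TENSE": "pres",
    "SUBJ": john,
    "XCOMP": {
        "PRED": "leave<SUBJ>",
        "SUBJ": john,  # re-entrancy: the very same object as the matrix SUBJ
    },
}


def pred_args(fs, path=()):
    """Yield (path, PRED) pairs for every sub-f-structure, depth first."""
    if "PRED" in fs:
        yield path, fs["PRED"]
    for attr, value in fs.items():
        if isinstance(value, dict):
            yield from pred_args(value, path + (attr,))


deps = list(pred_args(f_structure))
```

Walking the structure with `pred_args` lists each predicate together with the grammatical-function path that leads to it, which is roughly the dependency view of the same information.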


Lexical-Functional Grammar LFG


Lexical-Functional Grammar LFG

• Treebank: trees
• How do we get from trees to f-structures?
• What’s missing is the equations!

• Automatic f-structure annotation algorithm
• Traverses tree and assigns LFG equations
• Principle-based c-str/f-str interface
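Once a tree carries equations, an f-structure can be read off it. The sketch below is a drastically simplified illustration, not a real LFG constraint solver: it only handles the two equation shapes "↑=↓" and "↑GF=↓", and the tuple format `(equation, label, word_or_children)` and the function name `solve` are assumptions made for this example.

```python
def solve(tree, up=None):
    """Build an f-structure (nested dict) from an equation-annotated tree.

    tree = (equation, label, word_or_children); the root has equation None.
    "↑=↓"   : the node shares its mother's f-structure.
    "↑GF=↓" : the node's f-structure is the value of GF in the mother's.
    """
    equation, _label, rest = tree
    if up is None:                      # root node: start a fresh f-structure
        down = {}
    elif equation == "↑=↓":             # head: same f-structure as the mother
        down = up
    else:                               # e.g. "↑SUBJ=↓" -> "SUBJ"
        gf = equation[1:-2]
        down = up.setdefault(gf, {})
    if isinstance(rest, str):           # preterminal: the word supplies PRED
        down["PRED"] = rest
    else:
        for child in rest:
            solve(child, down)
    return down


# (S (NP John) (VP (VBZ sees) (NP him))), already annotated with equations
annotated = (None, "S", [
    ("↑SUBJ=↓", "NP", [("↑=↓", "NNP", "John")]),
    ("↑=↓", "VP", [
        ("↑=↓", "VBZ", "sees"),
        ("↑OBJ=↓", "NP", [("↑=↓", "PRP", "him")]),
    ]),
])

fstr = solve(annotated)
```

For this toy tree, `fstr` holds the verb's PRED at the top level with SUBJ and OBJ as embedded f-structures.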


F-Structure Annotation Algorithm

• Algorithm exploits:

– Categorial information (NP, VP, VBZ, …)

– Configurational information:
  • Local head, left/right of head
  • Leftmost NP sister to right of V(erbal) head: (↑OBJ)=↓

– Morphological information:
  • Him: (↑OBJ)=↓

– “Functional” tag information:
  • -LGS (↑PASSIVE)=+, -SBJ, -CLR, …

– Trace/co-indexation information
  • Translate traces + co-indexation to corresponding re-entrancies at f-str.
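The information sources listed above can be combined in a small annotation pass. The following is a minimal sketch under stated assumptions, not the algorithm from the slides: the `Node` class, the toy head table `HEADS`, the rule order and the ADJN fallback are all hypothetical, and morphological and trace information are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str                               # e.g. "NP-SBJ", "VBZ"
    children: list = field(default_factory=list)
    word: Optional[str] = None
    equation: Optional[str] = None


# Toy head table (assumption): mother category -> head daughter categories by priority.
HEADS = {"S": ["VP"], "VP": ["VBZ", "VBD", "VB", "VP"],
         "NP": ["NN", "NNS", "NNP", "PRP"]}


def category(label):
    """Strip functional tags: 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]


def find_head(node):
    cats = [category(c.label) for c in node.children]
    for cand in HEADS.get(category(node.label), []):
        if cand in cats:
            return cats.index(cand)
    return len(cats) - 1                     # fallback: rightmost daughter


def annotate(node):
    """Assign an LFG equation to every daughter, top-down."""
    if not node.children:
        return
    h = find_head(node)
    mother = category(node.label)
    obj_assigned = False
    for i, child in enumerate(node.children):
        tags = child.label.split("-")[1:]
        if i == h:                           # categorial: local head
            child.equation = "↑=↓"
        elif "SBJ" in tags:                  # "functional" tag information
            child.equation = "↑SUBJ=↓"
        elif (mother == "VP" and category(child.label) == "NP"
              and i > h and not obj_assigned):
            child.equation = "↑OBJ=↓"        # leftmost NP right of verbal head
            obj_assigned = True
        else:                                # catch-all default
            child.equation = "↓∈↑ADJN"
        annotate(child)


tree = Node("S", [
    Node("NP-SBJ", [Node("NNP", word="John")]),
    Node("VP", [Node("VBZ", word="sees"),
                Node("NP", [Node("PRP", word="him")])]),
])
annotate(tree)
```

After the call, the subject NP carries ↑SUBJ=↓, the VP and the verb carry ↑=↓, and the NP to the right of the verb carries ↑OBJ=↓.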


F-Structure Annotation Algorithm

Left-Right Context Annotation Principles

Coordination Annotation Principles

Catch-All and Clean-Up

Traces

Proto F-Structures

Proper F-Structures

Head-Lexicalization [Magerman,1994]

Lemmatization + Macros Lexical Entries

Defaults – “Functional Tags”


Treebank Annotation: Control & Wh-Rel. LDD


Multilingual Treebank-Based LFG Resources

• English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB-resources (QuestionBank), transfer

• Pilots/proof of concept: multilingual treebank-based LFG acquisition:

– German: TIGER (Cahill et al 2003, 2005)

– Chinese: CTB (Burke et al 2004)

– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)

• GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German


Multilingual Treebank-Based LFG Resources

Language   Treebank    Size              Coding/Data
English    Penn-II     50,000            CFG+traces+FT
Chinese    CTB 5.1     18,000            CFG+traces+FT
Japanese   KTC 4.0     38,000            Dep (+traces)
German     TIGER 2.0   50,000            Graphs+CFG+Dep
German     TüBa-D/Z    22,000            CFG+Dep+f-traces
Spanish    Cast3LB     3,500             CFG+Dep+f-traces
Arabic     ATB         300,000 (words)
French     P7T         20,000            CFG+Dep+f-traces
Total                  > 200,000


Q2

• What was missing in TB resource?

– F-structures, pred-argument structure, dependencies => f-structure annotation algorithm

– Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)

– GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …

• What was done by hand?

– F-structure annotation algorithm (principle-based c-/f-str interface)

– No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG – but see P7T)

– No manual additions (unlike CCG/HPSG/TAG)

– Future work …


Q3

• Methodological Issues - Quality Assurance:

• Evaluation against hand-crafted/corrected Gold Standard DepBanks

– PARC 700

– CBS 500

– PropBank

– Own Gold standard DepBanks for: English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)

• CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)

• Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
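Evaluation against a gold-standard DepBank is commonly scored by comparing sets of dependency triples. The sketch below computes precision, recall and f-score that way; the triple format `(relation, head, dependent)`, the function name `prf` and the toy data are assumptions for illustration, and the slides do not spell out the exact scoring procedure.

```python
def prf(gold, test):
    """Precision, recall and f-score of test triples against gold triples."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (invented): four gold triples, four parser triples, three in common.
gold = [("subj", "sees", "John"), ("obj", "sees", "him"),
        ("tense", "sees", "pres"), ("num", "John", "sg")]
test = [("subj", "sees", "John"), ("obj", "sees", "him"),
        ("tense", "sees", "pres"), ("num", "John", "pl")]

p, r, f = prf(gold, test)
```

With three of four triples matching on each side, precision, recall and f-score all come out at 0.75 for this toy pair.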


Q4

• Phrase Structure or Dependencies?

• Both!!! Why?:

• Phrase Structure good for parsing and generation => tap into lots of mature, efficient & well understood technology (but see dependency parsing)

• Dependencies close to f-structure/predicate-argument structures …

– Penn-II: CFG-trees + traces/co-indexation + “functional” labels/tags

– TIGER: graphs + CFG-categories + grammatical function labels + LDDs through crossing edges

– Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths


Q5 & Q6

• Pros/Cons Formalism-Specific Treebank?

– Formalism-Specific Treebank? Bad! Limits usefulness/user group/…

– Better to have generic TB with CFG + Dep Label + LDDs + other feature labels (as required). And then extract LFG/HPSG/CCG/TAG/Dependency Grammars

• Grammar First vs. Treebank First?

– Depends on what you want to do …

– If you want high-quality, wide-coverage resources (that can parse unrestricted text) then it's definitely better to do treebanking first (or use bootstrapping)

– Problem: many traditionally trained linguists see treebanking as a menial task

– Highly qualified and interesting task: empirical linguistics: confront rather than invent data

– Sociological task: how to make treebanking/bootstrapping sexy?


Some Resources

• ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier

• http://www.computing.dcu.ie/~josef/Malaga06.ppt

• LFG parser demo: http://lfg-demo.computing.dcu.ie/lfgparser.html

• A. Cahill and J. van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations, COLING/ACL 2006, Sydney, Australia

• J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions, COLING/ACL 2006, Sydney, Australia

• R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005

• A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language and Computation; Kluwer Academic Press, 2005

• R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005

• M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLING-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004

• A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of ACL-04, pp. 320-7, Barcelona, Spain, 2004

• A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation, In M. Butt and T. Holloway-King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002