EMPOWER 2 Empirical Methods for Multilingual Processing, ‘Onoring Words, Enabling Rapid Ramp-up Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Tony Kroch, Lyle Ungar University of Pennsylvania March 23, 2000 TIDES KICKOFF

Page 1

EMPOWER2

Empirical Methods for Multilingual Processing,

‘Onoring Words, Enabling Rapid Ramp-up

Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Tony Kroch, Lyle Ungar

University of Pennsylvania

March 23, 2000

TIDES KICKOFF

Page 2

Penn approach

• Relies on lexically based linguistic analysis
  – Humans annotate naturally occurring text
    • (hand correct output of automatic parsers, e.g. Fidditch, XTAG)
  – Train statistical POS taggers, parsers, etc.
  – Common thread is predicate-argument structure

Hypothesis: More linguistically sophisticated analyzers → more accurate output

Page 3

EMPOWER2 Approach

• Annotations enriched with semantics and pragmatics

• Provide companion lexicons for annotated corpora

• Extend our coverage to other languages

Goal – Parallel annotated corpora/lexicons will enable rapid ramp-up of MT

Page 4

TOOLS/RESOURCES

• Morphological Analyzers

• Stochastic parsers

• Lexicalized grammars

• Lexical classifications for cross-lingual mappings

LANGUAGES

• English

• Chinese

• Korean

• Hindi/Tamil

Faster development of Tools/annotation:

Page 5

Current Status

• English Q&A using coreference
• English annotation
  – adding semantics to Penn TreeBank
  – creating companion lexicon
• Korean/English annotation
  – syntactic annotation and some semantics
  – companion transfer lexicon
• Chinese annotation
  – syntactic annotation (Chinese TreeBank)

Page 6

English Q&A – Tom Morton

TREC-8 Approach

• Extract sentences based on:
  – Words in the sentence
  – Category of the answer
  – Words in co-reference relationships
    • pronouns
    • common nouns
    • dates
• Results – Placed 4th out of 20 participants.
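
The three extraction criteria above can be illustrated with a small scoring sketch. This is not the TREC-8 system itself: the function names, the crude tokenizer, and the one-point bonus for the answer category are all assumptions made for illustration, and the co-reference links are assumed to come from a separate resolver.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_sentence(question, sentence, coref_words=(), has_answer_category=False):
    """Score a candidate sentence against a question.

    coref_words: words the sentence gains through co-reference links,
        e.g. the antecedent of a pronoun (supplied by a hypothetical resolver).
    has_answer_category: True if the sentence contains an entity of the
        expected answer type, e.g. a person for a "who" question.
    """
    question_words = tokens(question)
    sentence_words = tokens(sentence) | {w.lower() for w in coref_words}
    overlap = len(question_words & sentence_words)
    # The bonus weight of 1 is arbitrary; a real system would tune it.
    return overlap + (1 if has_answer_category else 0)
```

With a sentence such as "There was the suit he wore on the day he killed Oswald", resolving "he" to "Jack Ruby" lets the sentence match a question that mentions Ruby by name, which plain word overlap would miss.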

Page 7

Examples (pronouns)

• Who killed Lee Harvey Oswald? - demo
  – ..., and the hat. There was the suit he wore on the day he (JACK RUBY) killed Oswald, a diamond-studded watch, a silver and diamond ring, two pairs of swim trunks, a shower cap, an athletic supporter and a letter written to a woman. Other than th…
• Future Plans
  – use WordNet and syntactic constructions to determine semantic categories of noun phrases
  – Cross-document co-reference

Page 8

Semantic Annotation – Hoa Dang, Joseph Rosenzweig, John Duda

• Current syntactic annotation
  – POS, phrase structure bracketing
  – Logical Subject, locative, temporal adjuncts
• New semantic augmentations
  – Sense tag verbs and noun arguments/adjuncts
  – Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)

Page 9

First Experiment (Siglex99)

• WSJ 5K word corpus
  – running text
  – WordNet 1.6
• 2100 words sense tagged twice (10 days)
  – 89% inter-annotator agreement
  – 700 verb tokens – 81% agreement (disagreement in 90/350 verb tokens)
• Automatic predicate-argument labeling
  – 81% precision on 162 structures
  – Hand corrected 2100 words in one day
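
The slide does not say how inter-annotator agreement was computed. A minimal sketch, assuming it is the fraction of tokens on which two annotators chose the same sense, follows; Cohen's kappa is included only as a common chance-corrected alternative and is not claimed to be the measure used here. The sense labels in the usage note are toy data, not corpus figures.

```python
from collections import Counter

def percent_agreement(tags_a, tags_b):
    """Fraction of items on which two annotators assigned the same tag."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

def cohen_kappa(tags_a, tags_b):
    """Agreement corrected for chance, using each annotator's tag distribution."""
    n = len(tags_a)
    observed = percent_agreement(tags_a, tags_b)
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    # Probability that both annotators pick the same tag by chance.
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)
```

For example, with `["WN2", "WN3", "WN1", "WN1"]` from one annotator and `["WN2", "WN3", "WN1", "WN2"]` from the other, `percent_agreement` gives 0.75 while `cohen_kappa` is lower, because some of the matches would be expected by chance.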

Page 10

Example

• I was shaking the whole time.

<arg0> <WN2> <temporal>

• The walls shook; the building rocked.

<arg1> <WN3>; <arg1> <WN1>

Page 11

Second Experiment: Methodology
(150K target – Penn TreeBank II, with Christiane Fellbaum)

• Sense tagging
  – Two human annotators (replace one with automatic WSD if possible)
  – WordNet senses, but allow for revision of entries
• Predicate argument labels – Rosenzweig’s converter
  – Uses TreeBank “cues”
  – Consults lexical semantic KB
    • Verb subcategorization frames and alternations
    • Ontology of noun-phrase referents
    • Multi-word lexical items
• XML annotation in external file referencing IDs
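
The last bullet describes stand-off annotation: labels live in a separate XML file and point at token IDs in the source text. A minimal sketch of that idea follows. The element and attribute names (`annotations`, `label`, `targets`, `type`, `value`) and the token IDs are hypothetical, not the project's actual schema, and the alignment of labels to words is one reading of the "I was shaking the whole time" example.

```python
import xml.etree.ElementTree as ET

def build_standoff(sentence_id, annotations):
    """Build a stand-off annotation element that refers to tokens by ID only."""
    root = ET.Element("annotations", {"sentence": sentence_id})
    for token_ids, kind, value in annotations:
        ET.SubElement(
            root,
            "label",
            {"targets": " ".join(token_ids), "type": kind, "value": value},
        )
    return root

# Tokens stay in the source file; here they are a plain ID-to-word mapping.
tokens = {"t1": "I", "t2": "was", "t3": "shaking",
          "t4": "the", "t5": "whole", "t6": "time"}

annotations = [
    (["t1"], "arg", "arg0"),                     # subject labeled arg0
    (["t3"], "sense", "WN2"),                    # sense tag on the verb
    (["t4", "t5", "t6"], "adjunct", "temporal"), # temporal adjunct span
]

root = build_standoff("s1", annotations)
xml_text = ET.tostring(root, encoding="unicode")
```

Because the labels only reference IDs, sense tags and predicate-argument labels can be revised or replaced without touching the TreeBank text itself.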

Page 12

Predicate-Argument Labeling: one raid tree – Rosenzweig’s converter

Page 13

Predicate-Argument Labeling: one raid tree – Rosenzweig’s converter

Page 14

New language/English MT Components

• New language
  – Morphological Analyzer (POStags)
  – Parser/Generator
  – TreeBank
  – Companion pred-arg lexicon
• English
  – POStagger
  – Parser/Generator
  – TreeBank
  – Companion pred-arg lexicon

Transfer Lexicon

Page 15

Korean/English MT
Chunghye Han, Juntae Yoon, Meesook Kim, Eonsuk Ko
(CoGenTex/Penn/Systran: ARL)

• Parallel TreeBanks for Korean/English enable
  – Training of domain-specific Korean parsers
    • Collins parser and SuperTagger (also English)
  – Alignment of Korean/English structures
    • Attempt automatic and semi-automatic testing and generation of transfer lexicon (with CoGenTex)
    • Apply statistical MT techniques
• Lexical semantics (Systran, mapped to EuroWordNet-IL) should improve
  – Accuracy of parsers
  – Recovery of dropped arguments
• http://www.cis.upenn.edu/~xtag/koreantag/index.html

Page 16

Additional Korean/English parallel data?

• Current parallel corpus not public domain

• Can use tools trained on this corpus to quickly annotate additional corpora
  – Translate sections of Penn TreeBank into Korean?
  – Use existing Korean newswire text – translate into English?
  – Both?

Page 17

Example translation

Page 18

Transfer lexicon entries: Mapping predicate argument structures across languages
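
The slide's own entries are not preserved in this transcript. As a rough illustration of what mapping predicate-argument structures across languages can mean, the sketch below pairs a source predicate with a target predicate and a table saying which target argument each source argument becomes. The class, the function, and the placeholder predicates `src_verb` and `tgt_verb` are all hypothetical; only the arg0/arg1/arg2 labels come from the earlier slides.

```python
from dataclasses import dataclass

@dataclass
class TransferEntry:
    """One hypothetical transfer-lexicon entry."""
    source_pred: str
    target_pred: str
    arg_map: dict  # source argument label -> target argument label

def transfer(entry, source_args):
    """Map a source predicate-argument structure onto the target predicate.

    source_args: dict from source argument label to its filler.
    Returns (target predicate, dict from target label to filler).
    """
    target_args = {
        entry.arg_map[label]: filler for label, filler in source_args.items()
    }
    return entry.target_pred, target_args

# Placeholder entry in which arg1 and arg2 trade places across languages.
entry = TransferEntry(
    source_pred="src_verb",
    target_pred="tgt_verb",
    arg_map={"arg0": "arg0", "arg1": "arg2", "arg2": "arg1"},
)
result = transfer(entry, {"arg0": "A", "arg1": "B", "arg2": "C"})
```

Keeping the argument mapping explicit in each entry is what lets two predicates be paired even when their arguments are realized in different syntactic positions.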

Page 19

Chinese TreeBank – DOD
Fei Xia, Nianwen Xue, Fu-dong Chiou

http://www.ldc.upenn.edu/ctb/index.html

• Workshop of interested members of Chinese community, June ‘98

• Guidelines and sample files posted on web
  – Segmentation, March ’99
  – POS tagging, March ’99
  – Bracketing, First Pass, October ’99
  – Bracketing, Second Pass, May ’00

• 95%+ inter-annotator consistency

• Release of 100K annotated data, July ’00

• Follow-up workshop, Hong Kong, ACL ’00

Page 20

Page 21

Goal for Chinese

• Parallel, annotated corpora – Hong Kong news?

• Parse English with WSJ trained parsers, correct
• Extend English TreeBank lexicon as needed
• Parse Chinese with CTB trained parsers, correct
• Start with lexicon extracted from CTB, extend

Experiment with using semi-automated techniques wherever possible to speed up process

Page 22

Past results

• XTAG project http://www.cis.upenn.edu/~xtag/

• Penn TreeBank http://www.cis.upenn.edu/~treebank/

• Enabled the development of tools: POS taggers, parsers, co-reference, etc. http://www.ircs.upenn.edu/knowledge/licensing.html