Semantic Annotation for Interlingual Representation of Multilingual Texts

Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)

LREC 2004 Workshop: “Beyond Named Entity Recognition: Semantic labelling for NLP tasks”

IAMTC (Interlingua Annotation of Multilingual Corpora) Project

• Collaboration:
  – New Mexico State University
  – University of Maryland
  – Columbia University
  – MITRE
  – Carnegie Mellon University
  – ISI, University of Southern California

Goals of IAMTC

• Interlingua design
  – Three levels of depth
• Annotation methodology
  – Manuals, tools, evaluations
• Annotated multi-parallel texts
  – Foreign language original and multiple English translations
  – Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish

Getting at Meaning (Two translations of a Korean original text)

Translation 1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. The Subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.

Translation 2: Starting January 1st of next year customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.

Color Key

• Black: same meaning and same expression

• Green: small syntactic difference

• Blue: Lexical difference

• Red: Not contained in the other text

• Purple: Larger difference
  – Need to use some inference to know that the meaning is the same

Getting at Meaning (Two translations of a Japanese original text)

Translation 1: This year, • too, • in addition to • the birth • of Mitsubishi Chemical, • which has already been announced, • other rather large-scale mergers • may continue, • and be recorded • as a "year of mergers."

Translation 2: This year, • which has already seen • the announcement • of the birth • of Mitsubishi Chemical Corporation • as well as • the continuous • numbers of big mergers, • may • too • be recorded • as the "year of the merger" • for all we know.

More lexical similarity.

More differences in dependency relations.

Toward a ‘Theory of Annotation’

• Recently, sharp increase in number of annotated resources being built:
  – Penn Treebank, PropBank, many others…
• For annotation, need:
  – Theory behind phenomena being annotated (for)
  – Annotation termsets (even WordNet, FrameNet, VerbNet, HowNet…)
  – Standard (?) annotation corpus (same old Treebank?)
  – Annotation tools: they make an immense difference
  – Carefully considered annotation procedure (interleaving per text vs. per sentence, etc.)
  – Reconciliation and consistency checking procedures
  – Evaluation measures, appropriately defined

Corpus and Data

• Initial corpus
  – 10+ texts in each language
  – 2+ translations each into English
• Interlingua designed for MT
  – Multiple English translations of same source show translation divergences. Some phenomena:
    • Lexical level: word changes
    • Syntactic level: phrasing, thematization, nominalization
    • Semantic level: additional/different content
    • Discourse level: multi-clause structure, anaphor
    • Pragmatic level: Speech Acts, implicatures, style, interpersonal
• Causes of divergence
  – Genuine ambiguity/vagueness of source meaning
  – Translator error/reinterpretation

IL Development: Staged, deepening

• IL0: simple dependency tree gives structure
• IL1: semantic annotations for Nouns, Verbs, Adjs, Advs, and Theta Roles
  – Not yet ‘semantic’: “buy” ≠ “sell”, many remaining simplifications
  – Concept ‘senses’ from ISI’s Omega ontology
  – Theta Roles from Dorr’s LCS work
  – Elaborate annotation manuals
  – Tiamat annotation interface
  – Post-annotation reconciliation process and interface
  – Evaluation scores: annotator agreement
• IL2: that comes next…

Details of IL0

• Deep syntactic dependency representation:
  – Removes auxiliary verbs, determiners, and some function words
  – Normalizes passives, clefts, etc.
  – Includes syntactic roles (Subj, Obj)
• Construction:
  – Dependency parsed using Connexor (English)
    • Tapanainen and Järvinen, 1997
  – Hand-corrected
• Extensive manual and instructions on IAMTC Wiki website
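
The removal of auxiliaries and determiners described for IL0 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Token` fields, the `Det`/`Aux` tag names, and the toy sentence are invented, and this is not Connexor's output format or the project's actual procedure. It only shows the general idea of dropping selected nodes from a dependency tree and reattaching their dependents to the nearest surviving head.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int   # position in the sentence (1-based)
    form: str  # surface word
    pos: str   # part-of-speech tag (hypothetical tag set)
    head: int  # idx of the governing token, 0 for the root
    rel: str   # syntactic relation to the head


# Hypothetical tags for the word classes that IL0 leaves out.
DROPPED_POS = {"Det", "Aux"}


def prune(tokens):
    """Drop determiners and auxiliaries; reattach their dependents upward."""
    by_idx = {t.idx: t for t in tokens}

    def surviving_head(head):
        # Climb the tree until we reach the root (0) or a token that is kept.
        while head != 0 and by_idx[head].pos in DROPPED_POS:
            head = by_idx[head].head
        return head

    return [
        Token(t.idx, t.form, t.pos, surviving_head(t.head), t.rel)
        for t in tokens
        if t.pos not in DROPPED_POS
    ]


# Invented toy parse of "the study has led them".
sentence = [
    Token(1, "the", "Det", 2, "Det"),
    Token(2, "study", "N", 4, "Subj"),
    Token(3, "has", "Aux", 4, "Aux"),
    Token(4, "led", "V", 0, "Root"),
    Token(5, "them", "Pron", 4, "Obj"),
]

il0 = prune(sentence)
for t in il0:
    print(t.form, t.pos, t.rel, "head:", t.head)
```

Normalizing passives and clefts would need construction-specific rules and is not attempted in this sketch.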

Example of IL0

TrEd, Pajas, 1998

Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL0

• Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Word              POS   Relation
announced         V     Root
Mohamed           PN    Subj
Sheikh            PN    Mod
Defense_Minister  PN    Mod
who               Pron  Subj
also              Adv   Mod
of                P     Mod
UAE               PN    Obj
at                P     Mod
ceremony          N     Obj
inauguration      N     Mod

Details of IL1

• Intermediate semantic representation:
  – Annotations performed manually by each person alone
    • Associate open-class lexical items with Omega Ontology items
    • Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR…
  – No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure…
• Nodes may receive more than one concept
  – Average: about 1.2
• Manual under development; annotation tool built

Example of IL1

Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL1: internal representation

The study led them to ask the Czech government to recapitalize CSA at this level.

[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]

Semantic Roles

Concepts from the Omega Ontology
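
The bracketed records on this slide can be read into a simple data structure, as in the sketch below. The field names (`idx`, `word`, `pos`, `lemma`, `role`, `concepts`) are assumptions based on the slide's two callouts ("Semantic Roles" and "Concepts from the Omega Ontology"); the slides do not name the columns, and this is not the project's own loader. The placeholder `---` is treated as "no value".

```python
import re
from typing import NamedTuple, Optional, Tuple


class Node(NamedTuple):
    idx: int                   # token position in the sentence
    word: str                  # surface form
    pos: str                   # part of speech
    lemma: str                 # lemma (or lexical key such as value_at)
    role: Optional[str]        # semantic (theta) role, None if "---"
    concepts: Tuple[str, ...]  # Omega concept labels, empty if none assigned


RECORDS = """
[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]
"""


def parse(text):
    """Turn each [...] record into a Node, mapping '---' to a missing value."""
    nodes = []
    for body in re.findall(r"\[([^\]]+)\]", text):
        idx, word, pos, lemma, role, *concepts = [f.strip() for f in body.split(",")]
        nodes.append(Node(
            int(idx), word, pos, lemma,
            None if role == "---" else role,
            tuple(c for c in concepts if c != "---"),
        ))
    return nodes


nodes = parse(RECORDS)
annotated = [n for n in nodes if n.concepts]
print(len(nodes), "nodes,", len(annotated), "with at least one concept")
for n in sorted(nodes, key=lambda n: n.idx):
    print(n.idx, n.word, n.role, n.concepts)
```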

Details of IL2 – In development

• Start capturing meaning:
  – Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…)
  – Conversives (buy vs. sell) at the FrameNet level
  – Non-literal language usage (open the door to customers vs. start doing business)
  – Extended paraphrases involving syntax, lexicon, grammatical features
  – Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions
• Still excluded:
  – Quantification and negation
  – Discourse structure
  – Pragmatics

Omega ontology

• Single set of all semantic terms, taxonomized and interconnected (http://omega.isi.edu)
• Merger of existing ontologies and other resources:
  – Manually built top structure from ISI
  – WordNet (110,000 nodes) from Princeton
  – Mikrokosmos (6000 nodes) from NMSU
  – Penman Upper Model (300 nodes) from ISI
  – 1-million+ instances (people, locations) from ISI
  – TAP domain relations from Stanford…
• Undergoing constant reconciliation and pruning
• Used in several past projects (metadata formation for database integration; MT; QA; summarization)

Dependency parser and Omega ontology

Omega (ISI): 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances

URL: http://omega.isi.edu

Dependency parser (Prague)

Tiamat: annotation interface

For each new sentence:

Candidate concepts

Step 1: find Omega concepts for objects and events

Step 2: select event frame (theta roles)

Evaluation webpage

Evaluation

• Three approaches to evaluation:
  – Inter-annotator agreement: completed
  – Sentence generation from extracted annotation structure: to be completed
  – Comparison of interlingual structures (graph comparisons): not planned
• Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation?
  – Impacts ontology, theta-roles: coverage and precision
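
The slides do not say which agreement measure was used. As a rough illustration only, the sketch below computes one simple possibility: for each pair of annotators, the fraction of shared nodes that received the same theta role, averaged over all pairs. The function name, the annotator labels A/B/C, and the role assignments are all invented toy data, not project results.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator):
    """Mean fraction of nodes on which each pair of annotators chose the same label."""
    scores = []
    for a, b in combinations(sorted(labels_by_annotator), 2):
        la, lb = labels_by_annotator[a], labels_by_annotator[b]
        shared = la.keys() & lb.keys()  # nodes that both annotators labeled
        if shared:
            same = sum(la[n] == lb[n] for n in shared)
            scores.append(same / len(shared))
    return sum(scores) / len(scores) if scores else 0.0


# Invented toy data: node index -> theta role chosen by each annotator.
roles = {
    "A": {2: "AGENT", 4: "THEME", 9: "GOAL", 12: "THEME"},
    "B": {2: "AGENT", 4: "THEME", 9: "GOAL", 12: "GOAL"},
    "C": {2: "AGENT", 4: "GOAL", 9: "GOAL", 12: "THEME"},
}

score = pairwise_agreement(roles)
print(round(score, 3))
```

Raw agreement of this kind does not correct for chance; a chance-corrected statistic would be a different design choice, and concept selection (where a node may carry more than one concept) would need a set-based comparison instead of simple equality.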

Annotation Issues

1. Post-annotation consistency checking
  – Novice annotators may make inconsistent annotations within the same text.
  – Intra-annotator consistency checking procedure
    • e.g., if two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences

2. Post-annotation reconciliation

Question: How much can annotators be brought into agreement?

• Procedure:
  – Annotator sees all annotations, votes Yes/Maybe/No on each
  – Annotators then discuss all differences (telephone conference)
  – Annotators then vote again, independently
  – We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreement
• Result:
  – Annotators derive common methodology
  – Small errors and oversights removed during discussion
  – Inter-annotator agreement improved
  – Serious problems of interpretation or error identified
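
The last step of the reconciliation procedure (collapse Yes and Maybe, compare with No) can be sketched as follows. The data layout, the annotator labels, and the example items are assumptions made for illustration; only the Yes/Maybe/No vote values and the collapsing rule come from the slide.

```python
# Yes and Maybe are collapsed into one category and contrasted with No.
COLLAPSED = {"Yes": "accept", "Maybe": "accept", "No": "reject"}


def serious_disagreements(votes):
    """Return the items on which annotators still split between accept and reject."""
    flagged = []
    for item, by_annotator in votes.items():
        # An unknown vote value raises KeyError instead of being silently ignored.
        collapsed = {COLLAPSED[v] for v in by_annotator.values()}
        if len(collapsed) > 1:
            flagged.append(item)
    return flagged


# Invented second-round votes on three candidate annotations.
votes = {
    "lead -> GUIDE": {"A": "Yes", "B": "Maybe", "C": "Yes"},
    "lead -> LEAD<GET": {"A": "Yes", "B": "No", "C": "Maybe"},
    "csa -> AIRLINE<LINE": {"A": "No", "B": "No", "C": "No"},
}

flagged = serious_disagreements(votes)
print(flagged)
```

In this reading, a unanimous No is not a disagreement: the annotators agree that the candidate annotation should be rejected.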

Annotation across Translations

Question: How different are the translations?

• Procedure:
  – Annotator sees annotations across both translations, identifies differences of form and meaning
  – Annotator selects ‘true’ meaning(s)
• Results (work still in progress):
  – Impacts ontology richness/conciseness
  – Improvement in Interlingua representation ‘depth’
  – Useful for IL2 design development
• Observations:
  – This is very hard work
  – Methodology unclear: what is seen first, how to show alternatives, what to do with results…

Principal problems to date

• Proper nouns
  – Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.)
• Noun compounds
  – Alternatives: tag head only; parse and tag whole structure
• Omega is too rich
  – Hard to distinguish one concept from the others
  – Granularity of concept selection
• Light verbs
  – Proposed solution: rephrase to remove light verb if possible (“take a shower” → “shower”, but “take a shower” ?)
• Vagueness and ambiguity
  – Annotate all plausible senses (“propose” as Urge and Suggest)
• Idioms and metaphors
  – Proposed solution: ?

Discussion and conclusion

• Results are encouraging
  – But more work must be done to solidify them
• Outcomes: how have we done?
  – IL design: partly, and IL2 in the works
  – Annotation methodology, manuals, tools, evals: yes
  – Annotated parallel texts: approx. 150 done
    • Six texts, two translations, 10-12 annotators
• Next steps
  – Foreign language annotation standards and tools
  – Development of IL2
  – Addressing coverage gaps (1/3 of open class words marked as having no concept)
  – Generation of surface structure from deep structure
    • Is it possible?

Contact information

• URLs and Wiki pages:
  – Project website: http://aitc.aitcnet.org/nsf/iamtc/