Semi-automatic compound nouns annotation for data integration systems

Sonia Bergamaschi, Serena Sorrentino (www.dbgroup.unimo.it)


Page 1: Semi-automatic compound nouns annotation for data integration systems

Tuesday, 23 June 2009

SEBD 2009

Sonia Bergamaschi

Serena Sorrentino

www.dbgroup.unimo.it

Dipartimento di Ingegneria dell’Informazione

Università di Modena e Reggio Emilia, via Vignolese 905, 41100 Modena

Page 2: The Problem

• Data integration systems: produce a comprehensive global schema that successfully integrates data from heterogeneous structured and semi-structured data sources

– Starting from the “meanings” associated with schema elements, it is possible to discover mappings among the elements of different schemata

• Lexical Annotation:

– the explicit inclusion of the “meaning” of a data source element (i.e. class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case)

– Automatic Lexical Annotation becomes crucial as a starting point for mapping discovery

• Problem:

– many schema names are non-dictionary words (compound nouns, acronyms, abbreviations etc.), i.e. they are not present in the lexical resource

– in this work, we concentrate only on non-dictionary Compound Nouns (CNs)

– the result of lexical annotation is strongly affected by the presence of these non-dictionary CNs in the schema

Page 3: Proposed Solution & Motivation

• In some approaches the constituents of a CN are treated as single words. E.g. the CN “teacher_judgment” is split into two tokens (“teacher” and “judgment”) and its relatedness to other source elements is calculated as the average relatedness between each token and the other elements

• With this approach a large set of relationships among different schemata is discovered, including a large number of false positives

• We propose a semi-automatic method for the lexical annotation of non-dictionary CNs
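To make the token-averaging approach described above concrete, here is a minimal sketch. The relatedness scores are invented toy values and the function names are hypothetical; they are not taken from the slides or from any real system. The sketch only illustrates how averaging over tokens can give a CN a moderate score against unrelated elements.

```python
from itertools import product
from statistics import mean

# Toy pairwise relatedness scores between single words.
# ASSUMPTION: these numbers are invented for illustration only.
TOY_RELATEDNESS = {
    frozenset(["teacher", "professor"]): 0.9,
    frozenset(["judgment", "professor"]): 0.1,
    frozenset(["teacher", "court"]): 0.1,
    frozenset(["judgment", "court"]): 0.8,
}


def word_relatedness(a: str, b: str) -> float:
    """Look up the toy relatedness of two single words (0.0 if unknown)."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset([a, b]), 0.0)


def token_average_relatedness(name_a: str, name_b: str) -> float:
    """Split both names on '_' and average relatedness over all token pairs."""
    tokens_a = name_a.lower().split("_")
    tokens_b = name_b.lower().split("_")
    return mean(word_relatedness(x, y) for x, y in product(tokens_a, tokens_b))


if __name__ == "__main__":
    # Both scores come out moderate, although the CN as a whole means
    # something different from either "professor" or "court".
    print(token_average_relatedness("teacher_judgment", "professor"))
    print(token_average_relatedness("teacher_judgment", "court"))
```

With a permissive threshold, both pairs would be reported as related, which is the kind of false positive the slide refers to.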

Page 4: Compound Noun annotation

• Compound Noun (CN): a word composed of more than one word; the component words are called CN constituents

– In order to perform semi-automatic CN annotation, a method for their interpretation has to be devised

• The interpretation of a CN is the task of determining the semantic relationships among the constituents of a CN

• CNs can be divided into four categories: endocentric, exocentric, copulative and appositional; we consider only endocentric CNs

• Endocentric CN: consists of a head (i.e. the part that contains the basic meaning of the whole CN) and modifiers, which restrict this meaning. A CN exhibits a modifier-head structure: a sequence of nouns composed of a head noun and one or more modifiers, where the head noun always occurs after the modifiers
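The modifier-head structure can be sketched in a few lines. This is a simplification under stated assumptions: it assumes schema names are delimited by underscores or hyphens and simply takes the last token as the head, whereas the slides rely on a parser for the syntactic analysis. The names `CompoundNoun` and `split_endocentric` are hypothetical.

```python
from typing import NamedTuple, Tuple


class CompoundNoun(NamedTuple):
    modifiers: Tuple[str, ...]  # one or more modifiers, in original order
    head: str                   # the head noun, which occurs last


def split_endocentric(name: str) -> CompoundNoun:
    """Split an underscore- or hyphen-delimited name into modifiers and head.

    ASSUMPTION: the name is an endocentric CN, so the last token is the head.
    """
    tokens = [t for t in name.lower().replace("-", "_").split("_") if t]
    if len(tokens) < 2:
        raise ValueError(f"{name!r} is not a compound noun")
    return CompoundNoun(modifiers=tuple(tokens[:-1]), head=tokens[-1])


if __name__ == "__main__":
    print(split_endocentric("teacher_judgment"))
```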

Page 5: Compound Noun annotation

• Our restriction is motivated by several observations:

– the vast majority of schema CNs fall into the endocentric category

– endocentric CNs are the most common type of CN in English

– exocentric and copulative CNs, which are represented by a single word, are often present in a dictionary (e.g. “loudmouth”, “sleepwalk”, etc.)

– appositional compounds are not very common in English and are less likely to be used as schema elements (e.g. “sweet-sour”)

• Our method can be summed up in four main steps:

– CN constituents disambiguation

– redundant constituents identification and pruning

– CN interpretation via semantic relationships

– creation of a new WN meaning for a CN

Page 6: CN constituents disambiguation & pruning

• CN constituents disambiguation

– Compound Noun syntactic analysis: syntactic analysis of CN constituents, performed by a parser

– Disambiguating head and modifier: by applying our CWSD (Combined Word Sense Disambiguation) algorithm, each word is automatically mapped to its corresponding WordNet 2.0 synsets

• Redundant constituents identification and pruning

– Redundant words: words that do not contribute new information, i.e. whose information can be derived from the schema or from the lexical resource

– E.g. for the attribute “company_address” of the class “company”, the constituent “company” is not considered, as the relationship holding between a class and its attributes is implicit in the schema
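The schema-derived case of pruning can be sketched as follows. This is only an illustration under assumptions: it covers constituents that repeat the enclosing class name, not redundancy derived from the lexical resource, and the function name `prune_redundant` is hypothetical.

```python
from typing import List


def prune_redundant(attribute: str, class_name: str) -> List[str]:
    """Drop attribute constituents that merely repeat the enclosing class name.

    ASSUMPTION: names are underscore-delimited; only schema-derived
    redundancy is handled here.
    """
    constituents = attribute.lower().split("_")
    class_tokens = set(class_name.lower().split("_"))
    kept = [c for c in constituents if c not in class_tokens]
    # If every constituent was redundant, keep the original ones
    # so that the attribute is not left without a name.
    return kept or constituents


if __name__ == "__main__":
    print(prune_redundant("company_address", "company"))
```

After pruning, "company_address" is left with the single constituent "address", so it can be annotated as an ordinary dictionary word.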

Page 7: CN interpretation via semantic relationships

• Our goal is to select, among a set of predefined semantic relationships, the one that best captures the relation between the head and the modifier

• 9 possible semantic relationships: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM (Levi’s semantic relationships set)

• the semantic relationship between the head and the modifier of a CN is the same as the one holding between their top-level WN nouns in the WN hierarchy

• The top-level concepts of the WN hierarchy are the 25 unique beginners for WN English nouns defined by Miller

Page 8: CN interpretation via semantic relationships

• To each pair of unique beginners we associate the relationship from Levi's set that best describes their combined meaning

• For example, we interpret the CN “teacher judgment” by the MAKE relationship: “teacher” is a hyponym of “person”, “judgment” is a hyponym of “act”, and for the pair (person, act) of unique beginners we choose the relationship MAKE

[Figure: hyponym chain Person#1 … Educator#1 … Teacher#1 and hyponym link Act#2, Judgment#2; the MAKE relationship between Person#1 and Act#2 is carried over to Teacher#1 and Judgment#2]
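The lookup described on this slide can be sketched with a small table. Everything here is a toy stand-in: the hypernym map collapses the intermediate WordNet levels that the figure elides, only two of the 25 unique beginners are listed, and the only table entry is the (person, act) to MAKE pair given in the slides. The function names are hypothetical.

```python
from typing import Optional

# Toy hypernym links. ASSUMPTION: intermediate WordNet levels are collapsed.
TOY_HYPERNYM = {"teacher": "educator", "educator": "person", "judgment": "act"}

# Two of Miller's 25 unique beginners, enough for the slide's example.
UNIQUE_BEGINNERS = {"person", "act"}

# (modifier beginner, head beginner) -> Levi relationship.
# Only the pair stated in the slides is filled in.
RELATION_TABLE = {("person", "act"): "MAKE"}


def unique_beginner(word: str) -> Optional[str]:
    """Climb the toy hypernym chain until a unique beginner is reached."""
    seen = set()
    while word not in UNIQUE_BEGINNERS:
        if word in seen or word not in TOY_HYPERNYM:
            return None  # unknown word or cyclic data
        seen.add(word)
        word = TOY_HYPERNYM[word]
    return word


def interpret(modifier: str, head: str) -> Optional[str]:
    """Return the Levi relationship associated with the pair of unique beginners."""
    mod_top, head_top = unique_beginner(modifier), unique_beginner(head)
    if mod_top is None or head_top is None:
        return None
    return RELATION_TABLE.get((mod_top, head_top))


if __name__ == "__main__":
    print(interpret("teacher", "judgment"))
```

The table is keyed on ordered pairs, so swapping modifier and head is a different lookup; a full table would need an entry for each ordered pair of unique beginners.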

Page 9: Creation of a new WN meaning for a CN

• (a) Gloss definition: we create the gloss to be associated with a CN, starting from the relationship associated with the CN and exploiting the glosses of the CN constituents

Teacher#1 gloss (modifier): A person whose occupation is teaching.

judgment#2 gloss (head): The act of judging or assessing a person or situation or event.

Modifier + MAKE + Head

Teacher_judgment gloss: A person whose occupation is teaching make the act of judging or assessing a person or situation or event.
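The gloss construction shown on this slide amounts to a string template: modifier gloss, then the relationship, then the head gloss. A minimal sketch follows; the function name `compose_gloss` and the exact punctuation handling are assumptions, and the two constituent glosses are typed in as plain strings rather than read from WordNet.

```python
def compose_gloss(modifier_gloss: str, relation: str, head_gloss: str) -> str:
    """Join modifier gloss + relationship + head gloss into one sentence.

    ASSUMPTION: the relationship name is inserted in lower case with no
    inflection, which reproduces the mechanical wording of the example.
    """
    left = modifier_gloss.strip().rstrip(".")
    right = head_gloss.strip()
    if right:
        # Lower-case the first letter so the head gloss reads as a continuation.
        right = right[0].lower() + right[1:]
    return f"{left} {relation.lower()} {right}"


if __name__ == "__main__":
    print(compose_gloss(
        "A person whose occupation is teaching.",
        "MAKE",
        "The act of judging or assessing a person or situation or event.",
    ))
```

The result is not fully grammatical ("make" rather than "makes"), which suits a semi-automatic method where a user can revise the generated gloss.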

Page 10: Creation of a new WN meaning for a CN

• (b) Inclusion of the new CN meaning in WN: as the concept denoted by a CN is a subset of the concept denoted by the head, we create:

– a hyponym relationship between the new CN meaning and its head meaning

– a generic relationship RT (Related Term), corresponding to WN relationships such as member meronym, part meronym etc., between the CN meaning and its modifier

• We use the WNEditor tool to create/manage the new meaning and to set new relationships between it and WN meanings

[Figure: WNEditor view. The new synset Teacher_judgment#1 is linked to judgment#2 by a hypernym/hyponym relationship and to Teacher#1 by a Related To relationship; the figure labels the synsets SYNSETµ and SYNSETβ]
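The two relationships created for a new CN meaning can be sketched with an in-memory structure. This is not the WNEditor API, which the slides do not describe; the `Synset` class, the function name, and the choice to store RT only on the CN side are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Synset:
    """Hypothetical in-memory stand-in for a WordNet meaning."""
    name: str
    gloss: str = ""
    relations: List[Tuple[str, str]] = field(default_factory=list)


def add_compound_meaning(cn_name: str, gloss: str,
                         head: Synset, modifier: Synset) -> Synset:
    """Create the CN meaning, link it below its head and relate it to its modifier."""
    cn = Synset(cn_name, gloss)
    # The CN denotes a subset of the head concept: hypernym/hyponym pair.
    cn.relations.append(("hypernym", head.name))
    head.relations.append(("hyponym", cn.name))
    # Generic Related Term link to the modifier.
    # ASSUMPTION: recorded on the CN side only.
    cn.relations.append(("RT", modifier.name))
    return cn


if __name__ == "__main__":
    judgment = Synset("judgment#2")
    teacher = Synset("teacher#1")
    cn = add_compound_meaning("teacher_judgment#1", "", judgment, teacher)
    print(cn.relations)
    print(judgment.relations)
```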

Page 11: Example

[Figure: Teacher_judgment#1 with its hypernym and Related To relationships]

Page 12: Evaluation: Experimental Result

• CN annotation extends the automatic annotation tool within the MOMIS system

• Evaluation over a real data source environment: three sources of an application scenario of the NeP4B project (491 schema elements), which contain many CNs (about 50%).

• Without CN annotation, CWSD obtains a very low recall value. Our method increases recall without significantly worsening precision. However, recall is still not very high, owing to the presence of many acronyms.

• A CN has been considered correctly annotated if the Levi's relationship selected manually by the user is the same as the one returned by our method
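The correctness criterion above leads to the usual precision and recall computation. The sketch below assumes precision is measured over the CNs the method annotated and recall over all CNs in the gold standard; the slides do not spell out these definitions, and the element names and labels are invented toy data, not the NeP4B results.

```python
from typing import Dict, Optional, Tuple


def precision_recall(gold: Dict[str, str],
                     predicted: Dict[str, Optional[str]]) -> Tuple[float, float]:
    """Compare predicted Levi relationships with the user's manual choices.

    A CN counts as correct when the predicted relationship equals the gold one.
    A value of None in `predicted` means the method produced no annotation.
    """
    annotated = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, rel in annotated.items() if gold.get(k) == rel)
    precision = correct / len(annotated) if annotated else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data (invented): four CNs, three annotated, two of them correctly.
    gold = {"cn_a": "MAKE", "cn_b": "HAVE", "cn_c": "FOR", "cn_d": "IN"}
    predicted = {"cn_a": "MAKE", "cn_b": "HAVE", "cn_c": "ABOUT", "cn_d": None}
    print(precision_recall(gold, predicted))
```

Elements left unannotated, such as unexpanded acronyms, lower recall but leave precision unchanged under these definitions.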

Page 13: Conclusion

• The experimental results showed the effectiveness of our method, which significantly improves the result of the lexical annotation process

• Our method may be applied in general in the context of mapping discovery, ontology merging and data integration systems

• Future work will be devoted to investigating the role of the set of semantic relationships chosen for the CN interpretation process

• We will extend the tool with a component that handles the expansion of acronyms and abbreviations (to appear at the 28th International Conference on Conceptual Modeling, ER 2009)

Page 14: Thanks for your attention!