ESWC 2011 BLOOMS+

Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton. Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod Vasquez, Mariana Damova, Pascal Hitzler and Amit P. Sheth. Kno.e.sis, Wright State University, Dayton, OH; Ontotext, Sofia, Bulgaria; Accenture Technology Labs, San Jose, CA

Upload: prateek-jain

Post on 01-Jul-2015

DESCRIPTION

Presentation during ESWC 2011 for BLOOMS+

TRANSCRIPT

Page 1: ESWC 2011 BLOOMS+

Contextual Ontology Alignment of LOD with

an Upper Ontology: A Case Study with

Proton

Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod Vasquez, Mariana

Damova, Pascal Hitzler and Amit P. Sheth

Kno.e.sis, Wright State University, Dayton, OH

Ontotext, Sofia, Bulgaria,

Accenture Technology Labs, San Jose, CA

Page 2: ESWC 2011 BLOOMS+

Outline

• Introduction

• Background

• Challenges

• Existing Approaches

• BLOOMS+ Approach

• Conclusion & Future Work

• References

Page 4: ESWC 2011 BLOOMS+

Web of Data

Page 5: ESWC 2011 BLOOMS+

Linked Open Data

• “The term Linked Data is used to describe a method of exposing,

sharing, and connecting data via de-referenceable URIs on the

Web.” - Wikipedia

• Datasets part of Linked Open Data include

– Geographical Datasets

– Movies

– Life Science, Genes, Proteins

– General Information (Wikipedia), Customer Reviews, …

– US Census, Senator Voting Records, …

• Links primarily at instance level to assert equality between

entities

Example: linkedMDB:film/77 owl:sameAs dbpedia:resource/Pulp_Fiction

• By September 2010 LOD is estimated to have 25 billion RDF

triples, interlinked by around 395 million RDF links.

Page 7: ESWC 2011 BLOOMS+

If everything is nice, why am I here?

• Lack of Conceptual Description of Datasets

• Absence of Schema Level Links

• Lack of expressivity

• Difficulties with respect to querying using SPARQL

– Schema heterogeneity

– Entity disambiguation

– Ranking of results

Page 8: ESWC 2011 BLOOMS+

What can be done?

• Relationships are at the heart of Semantics.

• LOD captures instance level relationships, but lacks class level

relationships.

– Superclass

– Subclass

– Equivalence

• How to find these relationships?

– Perform a matching of the LOD ontologies using state-of-the-art schema

matching tools.

• Desirable

– Considering the size of LOD, the results should at least be something a human can

curate.

Page 9: ESWC 2011 BLOOMS+

Schema Matching

• Schema matching is the process of identifying that two objects

are semantically related.

• Given two schemas, DB1.Student (Name, SSN, Level, Major, Marks)

and DB2.Grad-Student (Name, ID, Major, Grades), possible

matches would be DB1.Student ≈ DB2.Grad-Student, DB1.SSN =

DB2.ID, etc., and possible transformations or mappings would be

DB1.Marks to DB2.Grades (100-90 A; 90-80 B; ...).

• Need for high quality data for querying and analytics in large

enterprises.

• Schema mapping provides a way of resolving discrepancies in

data.
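As a rough illustration of the Student / Grad-Student example above, the following Python sketch renames DB1 columns to their DB2 counterparts and converts marks into letter grades. The function and variable names are hypothetical. The Level column is dropped on the assumption that DB2 has no counterpart for it, and every grade band below B is an assumption, since the slide only gives 100-90 A and 90-80 B.

```python
# Column correspondences taken from the slide. DB1.Level has no DB2 counterpart
# listed there, so this sketch simply drops it (an assumption).
COLUMN_MAP = {"Name": "Name", "SSN": "ID", "Major": "Major", "Marks": "Grades"}

# Lower bound of each band. Only the A and B bands come from the slide;
# the C and D bands and the fallback "F" are assumptions for illustration.
GRADE_BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def marks_to_grade(marks: float) -> str:
    """Transform a DB1 numeric mark (0-100) into a DB2 letter grade."""
    if not 0 <= marks <= 100:
        raise ValueError(f"marks out of range: {marks}")
    for lower_bound, grade in GRADE_BANDS:
        if marks >= lower_bound:
            return grade
    return "F"


def student_to_grad_student(record: dict) -> dict:
    """Map one DB1.Student record onto the DB2.Grad-Student schema."""
    mapped = {}
    for source_column, target_column in COLUMN_MAP.items():
        value = record[source_column]
        if source_column == "Marks":
            value = marks_to_grade(value)
        mapped[target_column] = value
    return mapped
```

The match (which columns correspond) and the transformation (how values are converted) are kept separate here, which mirrors the distinction the slide draws between matches and mappings.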

Page 10: ESWC 2011 BLOOMS+

Why does it matter?

• Massive amounts of data within an enterprise refer

to the same entities, but the terminology differs.

• Enterprise information asset awareness.

• Finding relevant and related schemata.

• Project planning.

– Can project-specific requirements be fulfilled with the data at

hand?

• Generating an exchange schema.

– Collaboration with clients which use different schemas.

Reference: K. Smith, P. Mork, L. Seligman, A. Rosenthal, M. Morse, D. Allen, and M. Li. The Role

of Schema Matching in Large Enterprises. CIDR, 2009.

Page 12: ESWC 2011 BLOOMS+

Existing Approaches

“A survey of approaches to automatic schema matching” by Erhard Rahm and Philip A. Bernstein, The VLDB

Journal 10: 334–350 (2001)

Page 14: ESWC 2011 BLOOMS+

Our Approach

Use knowledge contributed by users to improve structured knowledge contributed by users

Page 15: ESWC 2011 BLOOMS+

Rabbit out of a hat?

• Traditional auxiliary data sources (such as WordNet and upper-level

ontologies) have limited coverage and are insufficient for LOD

datasets.

• LOD datasets cover diverse domains.

• Community-generated data, although noisy, is rich in

– Content

– Structure

• It also has a “self-healing property”.

• Problems like schema matching have a dimension of context

associated with them. Because community-generated data is

created by a diverse set of people, it captures diverse

contexts.

Page 16: ESWC 2011 BLOOMS+

Wikipedia

• The English version alone contains more than 2.9 million

articles.

• It is continually expanded by approximately 100,000 active

volunteer editors world-wide.

• Allows multiple points of view to be mentioned with their proper

contexts.

• Article creation/correction is an ongoing activity with no down

time.

Page 17: ESWC 2011 BLOOMS+

Schema Matching on LOD using Wikipedia

Categorization

• On Wikipedia, categories are used to organize the entire project.

• Wikipedia's category system consists of overlapping trees.

• Simple rules for categorization

– “If logical membership of one category implies logical

membership of a second, then the first category should be

made a subcategory”

– “Pages are not placed directly into every possible category,

only into the most specific one in any branch”

– “Every Wikipedia article should belong to at least one

category.”

Page 18: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 1

• Pre-process the input schema

• Remove property restrictions

• Remove individuals, properties

• Tokenize the class names

• Remove underscores, hyphens and other delimiters

• Break down complex class names

– example: SemanticWeb => Semantic Web
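The tokenization part of this step can be sketched in Python as follows. This is only an illustration under assumptions: the function name is hypothetical, "other delimiters" is taken to mean any non-alphanumeric character, and complex names are assumed to be split at lower-to-upper camel-case boundaries, as in the SemanticWeb example.

```python
import re


def tokenize_class_name(name: str) -> str:
    """Turn an ontology class name into a space-separated phrase."""
    # Replace underscores, hyphens and any other non-alphanumeric
    # delimiters with a single space.
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", name)
    # Insert a space wherever a lowercase letter or digit is directly
    # followed by an uppercase letter ("SemanticWeb" -> "Semantic Web").
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(spaced.split())


print(tokenize_class_name("SemanticWeb"))  # Semantic Web
```

Removing property restrictions, individuals and properties would happen before this, on the ontology itself, and is not covered by the sketch.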

Page 19: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 2

• For each concept name processed in the previous step

– Identify the articles in Wikipedia corresponding to the concept.

– Each article related to the concept indicates one sense in which the

word is used.

• For each article found in the previous step

– Identify the Wikipedia categories to which it belongs.

– For each category found, find its parent categories up to level 4.

• Once the “BLOOMS tree” for each sense of the source

concept is created (Ti), use it for comparison with the

“BLOOMS trees” of the target concept (Tj).

– BLOOMS trees are created for individual senses of the concepts.
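A minimal sketch of the tree construction for one sense, in Python. It assumes the Wikipedia data has already been fetched into two plain dictionaries (article to categories, category to parent categories); these dictionaries, the function name, and the convention that an article's own categories sit at level 1 are all assumptions made for illustration. The category names in the usage example are invented.

```python
from collections import deque
from typing import Dict, List


def build_blooms_tree(
    article: str,
    article_categories: Dict[str, List[str]],
    parent_categories: Dict[str, List[str]],
    max_level: int = 4,
) -> Dict[str, int]:
    """Return {category: level} for one sense (article) of a concept."""
    tree: Dict[str, int] = {}
    # Breadth-first walk, so a category reachable by several paths
    # (the category system is made of overlapping trees) keeps its
    # shallowest level.
    queue = deque((cat, 1) for cat in article_categories.get(article, []))
    while queue:
        category, level = queue.popleft()
        if category in tree or level > max_level:
            continue
        tree[category] = level
        for parent in parent_categories.get(category, []):
            queue.append((parent, level + 1))
    return tree


# Invented toy data, not real Wikipedia categories.
cats = {"Jaguar (car)": ["Car manufacturers"]}
parents = {
    "Car manufacturers": ["Automotive industry"],
    "Automotive industry": ["Vehicles"],
    "Vehicles": ["Transport"],
    "Transport": ["Society"],
}
tree = build_blooms_tree("Jaguar (car)", cats, parents)
```

With this toy data the walk stops at "Transport" (level 4), so "Society" is left out of the tree.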

Page 20: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 3

• In the tree Ti, find n, the number of nodes that also occur

in Tj.

• Compute the overlap Os between the source and target trees.

• Exponentiating the inverse depth of a common node gives less

weight to nodes which appear lower in the hierarchy (generic

nodes).

• Taking the log of the tree size avoids bias against large trees.
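The overlap formula itself is not preserved in this transcript. The Python sketch below assumes a form that is consistent with the two remarks above and, as best it can be reconstructed, with the accompanying paper: Overlap(Ti, Tj) = log( Σ over common nodes n of (1 + e^(1/d(n) − 1)) ) / log(2·|Ti|), where d(n) is the depth of n in Ti. Treat the exact formula, the function name, and the {node: depth} representation (depths starting at 1) as assumptions.

```python
import math
from typing import Dict


def overlap(source_tree: Dict[str, int], target_tree: Dict[str, int]) -> float:
    """Assumed overlap of source_tree (Ti) with target_tree (Tj).

    Both trees map node name -> depth, with depth >= 1.
    """
    common = set(source_tree) & set(target_tree)
    if not common:
        return 0.0
    # A node at depth 1 contributes 1 + e^0 = 2; deeper (more generic)
    # nodes contribute less, approaching 1 + e^-1.
    total = sum(1.0 + math.exp(1.0 / source_tree[n] - 1.0) for n in common)
    # Dividing by log(2 * |Ti|) normalises by the size of the source tree.
    return math.log(total) / math.log(2 * len(source_tree))


small = {"x": 1}
large = {"x": 1, "y": 2}
forward = overlap(small, large)
backward = overlap(large, small)
```

Under this assumed form the measure is not symmetric: every node of the small tree occurs in the large one, while only half of the large tree is covered by the small one, so `forward` is larger than `backward`. That asymmetry is what allows subclass and equivalence relations to be told apart later.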

Page 21: ESWC 2011 BLOOMS+

Contextual Similarity

• BLOOMS+ computes contextual similarity between a source

class C and target D to further determine if they should be

aligned.

• Information about super classes of C and D is a good source of

contextual information.

• If the super classes agree, it is a good alignment; otherwise it

should be penalized.

• For example, if Jaguar has super classes such as Car and Vehicle,

and Cat has super classes such as Feline and Mammal, then an

alignment between Jaguar and Cat should be penalized because its contextual similarity

is low.

Page 22: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 4

• BLOOMS+ retrieves all super classes of C and D up to level 2

(can be changed). The sets of super classes are N(C) and N(D).

• For each BLOOMS+ tree pair (Ti, Tj) between C and D, BLOOMS+

determines the number of supported super classes in N(C) and N(D) in

the following way.

• A super class c ∈ N(C) is supported by Tj if either of the following

conditions is satisfied:

– The name of c matches a node in Tj.

– The Wikipedia article (or article category) corresponding to c,

based on a Wikipedia search web service call using the name

of c, matches a node in Tj.
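The support check can be sketched in Python as below. The names are hypothetical, matching is assumed to be case-insensitive string equality, and the Wikipedia search web service is replaced by an optional caller-supplied function, since the slide does not specify the service. Presumably the same check is applied to the super classes of D against the other tree; that mirrored use is an assumption.

```python
from typing import Callable, Iterable, List, Optional


def _normalize(name: str) -> str:
    """Lower-case and collapse whitespace so names compare loosely."""
    return " ".join(name.lower().split())


def count_supported(
    superclasses: Iterable[str],
    tree_nodes: Iterable[str],
    wikipedia_lookup: Optional[Callable[[str], List[str]]] = None,
) -> int:
    """Count the super classes that are supported by a BLOOMS+ tree."""
    nodes = {_normalize(node) for node in tree_nodes}
    supported = 0
    for name in superclasses:
        # Condition 1: the class name itself matches a tree node.
        candidates = {_normalize(name)}
        # Condition 2: an article or category title found for the name
        # matches a tree node (the lookup is a stand-in, not a real API).
        if wikipedia_lookup is not None:
            candidates.update(_normalize(t) for t in wikipedia_lookup(name))
        if candidates & nodes:
            supported += 1
    return supported


def toy_lookup(name: str) -> List[str]:
    """Invented stand-in for a Wikipedia search call."""
    return ["Vehicles"] if name == "Vehicle" else []


by_name = count_supported(["Car", "Vehicle"], ["car", "Vehicles"])
with_lookup = count_supported(["Car", "Vehicle"], ["car", "Vehicles"], toy_lookup)
```

Here only "Car" is supported by name alone, while the stand-in lookup also resolves "Vehicle" to a node of the tree.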

Page 23: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 5

• BLOOMS+ computes the overall contextual similarity between C

and D with respect to Ti and Tj using the harmonic mean, which

is instantiated as:

• We chose the harmonic mean to emphasize super class

neighborhoods that are not well supported (and hence should

significantly lower the overall contextual similarity).
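The instantiated formula is not preserved in this transcript. A minimal Python sketch, assuming the two inputs are the fractions of supported super classes in N(C) and N(D) (called `r_c` and `r_d` here, both hypothetical names), is the ordinary harmonic mean of two values:

```python
def contextual_similarity(r_c: float, r_d: float) -> float:
    """Harmonic mean of the two supported fractions, each in [0, 1]."""
    if r_c + r_d == 0:
        # Neither neighbourhood is supported at all.
        return 0.0
    return 2.0 * r_c * r_d / (r_c + r_d)


balanced = contextual_similarity(0.75, 0.75)
lopsided = contextual_similarity(1.0, 0.5)
unsupported = contextual_similarity(1.0, 0.0)
```

The lopsided case illustrates the stated design choice: the arithmetic mean of 1.0 and 0.5 would be 0.75, while the harmonic mean is about 0.67, and a completely unsupported neighbourhood drives the result to zero.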

Page 24: ESWC 2011 BLOOMS+

BLOOMS+ Approach – Step 6

• BLOOMS+ computes the overall similarity between classes C

and D w.r.t. BLOOMS+ trees Ti and Tj by taking the weighted

average of the class and contextual similarity.

• BLOOMS+ defaults alpha and beta to 1 to give equal importance.

• BLOOMS+ then selects the tree pair (Ti,Tj) ∈ FC × FD (the sets of BLOOMS+ trees for C and D) with the

highest overall similarity score and, if this score is greater than

the alignment threshold HA, aligns C and D.
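A sketch of this step in Python, under assumptions: the weighted average is normalised by alpha + beta (which equals 2 at the default weights), the function names are hypothetical, and the per-pair scores are assumed to have been computed already and stored in a dictionary keyed by tree indices.

```python
from typing import Dict, Optional, Tuple


def overall_similarity(
    class_sim: float, context_sim: float, alpha: float = 1.0, beta: float = 1.0
) -> float:
    """Weighted average of class similarity and contextual similarity."""
    return (alpha * class_sim + beta * context_sim) / (alpha + beta)


def best_tree_pair(
    scores: Dict[Tuple[int, int], float], threshold: float
) -> Optional[Tuple[int, int]]:
    """Return the (i, j) pair with the highest score if it exceeds the
    alignment threshold, otherwise None (no alignment)."""
    if not scores:
        return None
    pair = max(scores, key=scores.get)
    return pair if scores[pair] > threshold else None


# Invented scores for two source senses and two target senses.
pair_scores = {(0, 0): 0.6, (0, 1): 0.9, (1, 0): 0.2, (1, 1): 0.4}
chosen = best_tree_pair(pair_scores, threshold=0.5)
rejected = best_tree_pair(pair_scores, threshold=0.95)
```

Picking the single best pair means the alignment is decided by the most compatible pair of senses rather than by an average over all senses.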

Page 25: ESWC 2011 BLOOMS+

Alignment decision

• If O(Ti,Tj) = O(Tj,Ti), then BLOOMS+ sets

– C owl:equivalentClass D.

• If O(Ti,Tj) < O(Tj,Ti), then BLOOMS+ sets

– C rdfs:subClassOf D.

• Otherwise, BLOOMS+ sets D rdfs:subClassOf C.
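The decision rule can be written as a small Python function. The function name and the use of a floating-point tolerance for the equality test are assumptions; the slide only states the three cases.

```python
import math


def alignment_triple(c: str, d: str, o_ij: float, o_ji: float) -> str:
    """Return the alignment statement for classes c and d.

    o_ij is O(Ti, Tj) and o_ji is O(Tj, Ti).
    """
    # Exact float equality is fragile, so "equal" is tested with a small
    # tolerance (an assumption of this sketch).
    if math.isclose(o_ij, o_ji, abs_tol=1e-9):
        return f"{c} owl:equivalentClass {d}"
    if o_ij < o_ji:
        return f"{c} rdfs:subClassOf {d}"
    return f"{d} rdfs:subClassOf {c}"


print(alignment_triple("C", "D", 0.3, 0.6))  # C rdfs:subClassOf D
```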

Page 26: ESWC 2011 BLOOMS+

Results BLOOMS+

Page 28: ESWC 2011 BLOOMS+

Conclusion

• We have presented a system called BLOOMS+ for performing

ontology alignment using contextual information.

• BLOOMS+ has been evaluated on the alignment of three different

LOD ontologies to PROTON, using mappings created manually by human experts

for a real-world application called FactForge.

• To the best of our knowledge, BLOOMS+ is the only system

which utilizes contextual information present in the ontology and

the Wikipedia category hierarchy for ontology matching.

• BLOOMS+ significantly outperforms state-of-the-art solutions for

the task of ontology alignment.

Page 29: ESWC 2011 BLOOMS+

Future Work

• Extend BLOOMS to utilize contextual information available on

community generated data.

• New weighting mechanism for identifying matches between the

concepts in the dataset.

• Develop a polling mechanism for identifying the best source to

assist in the process of schema alignment.

• Allow seamless querying across datasets by utilizing the

generated alignments (preliminary work LOQUS).

Page 30: ESWC 2011 BLOOMS+

References

• Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod Vasquez, Mariana

Damova, Pascal Hitzler and Amit P. Sheth, “Contextual Ontology Alignment

of LOD with an Upper Ontology: A Case Study with Proton”. Proceedings of

the 8th Extended Semantic Web Conference 2011, volume 6643 of Lecture

Notes in Computer Science, Heidelberg, 2011. Springer Berlin

• Prateek Jain, Pascal Hitzler, Amit P. Sheth, Kunal Verma, Peter Z. Yeh:

Ontology Alignment for Linked Open Data. Proceedings of the 9th

International Semantic Web Conference 2010, Shanghai, China, November

7th-11th, 2010. Pages 402-417.

• Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma, and Amit P. Sheth,

Linked Data Is Merely More Data. In: Dan Brickley, Vinay K. Chaudhri, Harry

Halpin, and Deborah McGuinness: Linked Data Meets Artificial Intelligence.

Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp.

82-86. ISBN 978-1-57735-461-1.

Page 31: ESWC 2011 BLOOMS+

Thank You!

Questions?