CIS Department @ UO Eugene, OR, May 21, 2010

Towards Mutual Understanding: Ontologies, Ontology Matching, and their Applications

Jingshan Huang
Assistant Professor
School of Computer and Information Sciences
University of South Alabama
http://cis.usouthal.edu/~huang/



Presentation Outline

• Research Motivation

• Learning-Based Ontology Matching – SOCCER

• Ongoing Research

• Summary

Research Motivation – Overview

• Information from heterogeneous sources has different semantics

Example: “Long” (English) vs. “Long” (Chinese Pinyin) -> 龙 (dragon)

• Integrating the information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics

• Ontologies are a formal model to encode semantics

• Ontological techniques are critical in semantic integration

Quick Facts

• What is Ontology?
– a computational model of some domain of the world
– describes the semantics of the terms used in the domain
– often captured in the form of a DAG (directed acyclic graph)
– a finite set of concepts + properties + relationships

• What is Ontology Heterogeneity?
– an inherent characteristic of ontologies developed by different parties for the same (or similar) domains
– the heterogeneous semantics may occur in different ways:
(1) different terms could be used for the same concept;
(2) an identical term could be adopted for different concepts;
(3) properties and relationships could be different
– “translation” is far from good enough, not even close…

• What is Ontology Matching?
– a.k.a. “Ontology Alignment” or “Ontology Mapping”
– the process of determining correspondences between concepts from heterogeneous ontologies
– involving many different relationships, e.g., equivalentWith, subClassOf, superClassOf, and siblings

Heterogeneity in Ontologies – A Simple Example

[Figure: diagram with the labels “President”, “sex”, “Person”, and “female or male”]

• Formal definition of ontologies
– A knowledge representation model of some portion of the world
– It reflects its designers’ conceptual views

• Ontology = Concepts + Relationships + Constraints

1. Concept – a category, e.g., “President”

2. Property – maps between concepts and data types, e.g., “gender” of “President”

3. Relationship – maps between concepts, e.g., “President” is a subClassOf “People”

4. Constraint – on properties or relationships, e.g., “gender”: range = “male”

Concept semantics: name + properties + relationships
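The "name + properties + relationships" view of a concept can be sketched as a small data structure. This is only an illustrative sketch: the class name `Concept`, its fields, and the helper `ancestors` are hypothetical names chosen here, not part of the talk, and only subClassOf links are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept: name + properties + relationships (hypothetical layout)."""
    name: str
    # Property name -> data type or allowed range, e.g. {"gender": "male"}.
    properties: dict[str, str] = field(default_factory=dict)
    # Names of direct super-concepts (targets of subClassOf links).
    superclasses: list[str] = field(default_factory=list)


def ancestors(concept_name: str, ontology: dict[str, Concept]) -> list[str]:
    """Collect every super-concept reachable through subClassOf links."""
    seen: list[str] = []
    stack = list(ontology[concept_name].superclasses)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            if parent in ontology:
                stack.extend(ontology[parent].superclasses)
    return seen


# The slide's example: "President" is a subClassOf "People", with a
# "gender" property whose range is constrained to "male".
ontology = {
    "People": Concept("People"),
    "President": Concept("President", {"gender": "male"}, ["People"]),
}
president_ancestors = ancestors("President", ontology)
```

Keeping the ancestor list separate from the direct superclasses makes it easy to compare relationships between two concepts later on.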

Heterogeneity in Ontologies – Running Example

The Semantic Web

Heterogeneity in Ontologies – Running Example (cont.)

1. Typing “professor university” into Swoogle returns 129 different results

2. All are created and maintained by ontology professionals

Heterogeneity in Ontologies – Running Example (cont.)


Research Motivation – Summary

• Semantic integration is important in Computer Science and Information Technology

• Ontologies are the foundation for semantic integration; at the same time, they are inherently heterogeneous

• The only way out – match/align ontologies so that their different semantics can be understood

Ontology matching is far from being solved despite its importance and the number of researchers that have investigated it

Classification For Current Algorithms

1. Rule-Based Matching
– Consider schema information alone
– Specify a set of rules
– Apply them to schema information

2. Learning-Based Matching
– Consider both schema and instances
– Apply different machine learning techniques

Pros and Cons for Current Approaches

1. Rule-Based Matching
– Is relatively fast (pro)
– Ignores instance information (con)
– Uses ad hoc predefined weights (con)
concept semantics: name + properties + relationships

2. Learning-Based Matching
– Obtains extra clues from instances (pro)
– Runs longer (con)
– Has difficulty in getting sufficient instances (con)


SOCCER (Similar Ontology Concept ClustERing) – a learning-based algorithm

•Challenges and main idea

•Details

•Evaluation

Problems with Existing Matching Algorithms

• Rule-Based Matching
– Ignores instance information
– Requires ad hoc predefined weights

• Learning-Based Matching
– Runs longer
– Has difficulty in getting sufficient instances

Try to:
1. Adopt machine learning techniques to avoid ad hoc predefined weights
2. Base learning on schema information alone to avoid the difficulty in getting sufficient instances

The goal: To find equivalent concept pairs among different ontologies, which is the first and most critical step in semantic integration

Challenges

It is very difficult for machines to learn how to match ontology schemas when they are given schema information alone

1. Diversities in terminology

2. Diversities in relationships

Current learning-based algorithms make use of instances, more or less

Anecdotally, instances usually have much less variety than schemas

Main Idea of SOCCER

• Equivalent concepts from different ontologies tend to stay “closer” to each other in a clustering space with structural dimensions

• Each cluster contains a number of concepts that are from different ontologies and are equivalent to each other

• SOCCER aims at finding such clusters by exploiting ontology schemas alone

Details – Overview

• Build a three-dimensional vector for each concept, corresponding to name, properties, and relationships

• Calculate the similarity between pairwise concepts

• Apply an agglomerative algorithm to generate clusters

Therefore, SOCCER has two phases:
Phase I – weight learning
Phase II – clustering

SOCCER Phase I – learn weights (1)

Learning problem’s formal description

– Task T: match two ontologies

– Performance measure P: Precision, Recall, F-Measure, and Overall with regard to manual matching

– Training experience E: a set of equivalent concept pairs by manual matching

– Target function V: takes a pair of concepts b as its argument

– Target function representation: $\hat{V}(b) = \sum_{i=1}^{3} (w_i \times s_i)$

SOCCER Phase I – learn weights (2)

• Hypothesis space: weight vector (w1, w2, w3)

• Learning objective: find the weight vector that best fits the training examples

• Training rule: delta rule

• Searching strategy: minimize the training error

SOCCER Phase I – learn weights (3)

•Similarity in concept names: $s_1 = 1 - d/l$
d: edit distance between two strings
l: length of the longer string

•Similarity in concept properties: $s_2 = n/m$
n: number of pairs of matched properties
m: smaller cardinality of lists p1 and p2

•Similarity in concept relationships (super/subClassOf): $s_3$
calculate the similarity values for pairwise concepts in ancestor lists and choose the maximum value
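The three similarity measures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the SOCCER implementation: the function names are hypothetical, two properties are assumed to "match" when their names are equal ignoring case, and the pairwise similarity used for ancestor lists is assumed to be the name similarity s1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(name1: str, name2: str) -> float:
    """s1 = 1 - d/l, with d the edit distance and l the longer length."""
    longer = max(len(name1), len(name2))
    if longer == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - edit_distance(name1, name2) / longer


def property_similarity(p1: list[str], p2: list[str]) -> float:
    """s2 = n/m; matching by case-insensitive name equality is an assumption."""
    set1 = {p.lower() for p in p1}
    set2 = {p.lower() for p in p2}
    smaller = min(len(set1), len(set2))
    if smaller == 0:
        return 0.0
    return len(set1 & set2) / smaller


def relationship_similarity(ancestors1: list[str], ancestors2: list[str]) -> float:
    """s3 = maximum pairwise similarity between the two ancestor lists.

    Using name_similarity for the pairwise comparison is an assumption.
    """
    return max(
        (name_similarity(a, b) for a in ancestors1 for b in ancestors2),
        default=0.0,
    )
```

For example, `name_similarity("kitten", "sitting")` evaluates to 1 - 3/7, because the edit distance between the two strings is 3 and the longer string has 7 characters.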

SOCCER Phase I – learn weights (4)

• Overall similarity: $s = \sum_{i=1}^{3} (w_i s_i)$

• Create a matrix M between O1 and O2 (n1 x n2)

1. cell[i, j] stores the similarity between the ith concept in O1 and the jth concept in O2

2. wi’s are randomly initialized, and then updated by the learning process
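A short sketch of how the similarity matrix could be filled in, assuming the three component similarities (s1, s2, s3) have already been computed for every concept pair. The function names and the sample numbers are illustrative assumptions, not values from the talk.

```python
def overall_similarity(weights, sims):
    """s = w1*s1 + w2*s2 + w3*s3 for one pair of concepts."""
    return sum(w * s for w, s in zip(weights, sims))


def similarity_matrix(components, weights):
    """Build M, where M[i][j] is the overall similarity between the
    i-th concept of O1 and the j-th concept of O2.

    components[i][j] is the (s1, s2, s3) triple for that pair.
    """
    return [[overall_similarity(weights, cell) for cell in row] for row in components]


# Made-up component similarities for a 2 x 2 example.
components = [
    [(1.0, 0.5, 0.0), (0.2, 0.0, 0.0)],
    [(0.1, 0.0, 0.5), (0.9, 1.0, 1.0)],
]
weights = (0.5, 0.3, 0.2)  # hypothetical weights; SOCCER learns them in Phase I
M = similarity_matrix(components, weights)
```

Because the matrix depends on the weights, it has to be recalculated whenever the learning process updates them.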

SOCCER Phase I – learn weights (5)

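Training error:

$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ (standard delta-rule form)

$E(w) = \frac{1}{2} \sum_{d \in D} [(t_r - o_d) + (t_c - o_d)]^2$ (form using the row and column targets)

Weight update rule:

$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) s_{id}$ (standard delta-rule form)

$\Delta w_i = \eta \sum_{d \in D} [(t_r - o_d) + (t_c - o_d)] s_{id}$ (form using the row and column targets)

D: training example set

td: target value for training example d

tr: maximum value for row i

tc: maximum value for column j

od: network output for a specific training example d

η: the learning rate

sid: the si value for d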
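The weight update can be illustrated with a small batch gradient-descent loop. This sketch implements the standard delta rule, in which each training example carries a single target value t_d; the variant with row and column targets would replace `(t - o)` with `(t_r - o) + (t_c - o)`. The function name, learning rate, epoch count, and training data are all assumptions made for illustration.

```python
import random


def train_weights(examples, eta=0.1, epochs=2000, seed=0):
    """Learn (w1, w2, w3) with the batch delta rule.

    examples: list of ((s1, s2, s3), target) pairs.
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(3)]  # random initialization
    for _ in range(epochs):
        delta = [0.0, 0.0, 0.0]
        for s, t in examples:
            o = sum(wi * si for wi, si in zip(w, s))  # output for this example
            for i in range(3):
                delta[i] += eta * (t - o) * s[i]
        w = [wi + di for wi, di in zip(w, delta)]
    return w


def training_error(w, examples):
    """E(w) = 1/2 * sum over examples of (t_d - o_d)^2."""
    return 0.5 * sum(
        (t - sum(wi * si for wi, si in zip(w, s))) ** 2 for s, t in examples
    )


# Made-up training data generated from known weights, so that the
# learned weights can be checked against them.
true_w = (0.5, 0.3, 0.2)
vectors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
examples = [(s, sum(w * x for w, x in zip(true_w, s))) for s in vectors]
learned = train_weights(examples)
```

With a learning rate that is small enough for the data, the training error shrinks at every epoch and the weights settle on the values that best fit the examples.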

SOCCER Phase II – clustering (1)

• Apply the learned weights to recalculate similarity matrices for pairwise ontologies

• Cluster similar concepts among a set of ontologies

Input: A set of ontologies and the corresponding matrices

1. Each concept forms a singleton cluster

2. Find two clusters, (a) and (b), with maximum similarity

3. If s[(a), (b)] > threshold, go to step 4; else go to step 7

4. Merge (a) and (b) into (a, b)

5. Update matrix: s[(a, b), (c)] = (s[(a), (c)] + s[(b), (c)])/2

6. Repeat steps 2 and 3

7. Output current clusters

The key is then to determine the threshold
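The seven steps above can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `cluster_concepts` is hypothetical, concepts are represented as (ontology, name) tuples, and any pair missing from the similarity table (for example, two concepts from the same ontology) is assumed to have similarity 0.

```python
from itertools import combinations


def cluster_concepts(concepts, sim, threshold):
    """Agglomerative clustering with the averaging update from step 5.

    concepts: unique hashable concept identifiers.
    sim: dict mapping (concept, concept) pairs to similarity values.
    """
    def key(x, y):
        return frozenset((x, y))

    # Step 1: each concept forms a singleton cluster.
    clusters = [frozenset([c]) for c in concepts]
    s = {}
    for a, b in combinations(concepts, 2):
        value = sim.get((a, b), sim.get((b, a), 0.0))
        s[key(frozenset([a]), frozenset([b]))] = value

    while len(clusters) > 1:
        # Step 2: find the two clusters with maximum similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: s[key(*pair)])
        # Step 3: stop when the best pair does not exceed the threshold.
        if s[key(a, b)] <= threshold:
            break
        # Step 4: merge (a) and (b) into (a, b).
        merged = a | b
        rest = [c for c in clusters if c not in (a, b)]
        # Step 5: s[(a, b), (c)] = (s[(a), (c)] + s[(b), (c)]) / 2.
        for c in rest:
            s[key(merged, c)] = (s[key(a, c)] + s[key(b, c)]) / 2
        clusters = rest + [merged]  # Step 6: repeat from step 2.
    return clusters  # Step 7: output current clusters.


# Made-up concepts and similarities from two hypothetical ontologies.
c0, c1 = ("O1", "Professor"), ("O2", "Prof")
c2, c3 = ("O1", "Student"), ("O2", "Pupil")
sim = {(c0, c1): 0.9, (c2, c3): 0.8, (c0, c3): 0.1, (c2, c1): 0.2}
result = cluster_concepts([c0, c1, c2, c3], sim, threshold=0.5)
```

In this example the two high-similarity pairs are merged first, and the loop stops because the averaged similarity between the two resulting clusters falls below the threshold.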

SOCCER Phase II – clustering (2)

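Let the number of concepts in Oi be ni (i in [1, k])

WLOG, suppose n1 is the largest one in ni’s

Total number of clusters should be in $[n_1, \sum_{i=1}^{k} n_i]$

Assuming concepts within one ontology are never merged with each other, the lower bound is reached when every concept of the other ontologies joins a cluster containing a concept of O1, and the upper bound is reached when no clusters are merged at all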
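As a quick check on a clustering result, the bounds can be computed directly from the ontology sizes. The function name and the sizes used below are hypothetical; the talk does not list the sizes of its test ontologies at this point.

```python
def cluster_count_bounds(sizes):
    """Return (lower, upper) bounds on the number of clusters.

    sizes: number of concepts in each ontology. The lower bound is the
    size of the largest ontology; the upper bound is the total number
    of concepts (every concept left in its own singleton cluster).
    """
    return max(sizes), sum(sizes)


lower, upper = cluster_count_bounds([5, 3, 4])  # made-up ontology sizes
```

A threshold that yields a cluster count outside this interval can be rejected immediately.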

Evaluation Strategy

The hypothesis: a set of clusters exists across different ontologies

Need to show:

1. Weight learning is correct

2. Resultant clusters are meaningful

Evaluation – test ontologies (1)

Test ontologies are eight independently developed, real-world ones

1. http://www.csd.abdn.ac.uk/~cmckenzi/playpen/rdf/akt_ontology_LITE.owl

2. http://www.mindswap.org/2004/SSSW04/aktive-portal-ontology-latest.owl

3. http://annotation.semanticweb.org/iswc/iswc.owl

4. http://www.mondeca.com/owl/moses/ita.owl

5. http://protege.stanford.edu/plugins/owl/owl-library/ka.owl

6. http://ontoware.org/frs/download.php/18/semiport.owl

7. http://www.mondeca.com/owl/moses/univ.owl

8. http://reliant.teknowledge.com/DAML/Mid-level-ontology.owl

Evaluation – test ontologies (2)

Characteristics of test ontologies

Evaluation – result (1)

Weight convergence

Evaluation – result (2)

Clustering result

Evaluation – Four Measures

• Precision p – percentage of correct predictions over all predictions

• Recall r – percentage of correct predictions over correct matching

• F-Measure f (= $\frac{2pr}{p + r}$) – a.k.a. Harmonic Mean, avoids the bias from adopting Precision or Recall alone

• Overall o (= $r(2 - \frac{1}{p})$) – Post-Match Effort, i.e., how much human effort is needed to remove false matches and add missed ones

Evaluation – result (3)

Four measures

$p = \frac{257}{309} = 0.83$

$r = \frac{257}{257 + 86} = 0.75$

$f = \frac{2 \times 0.83 \times 0.75}{0.83 + 0.75} = 0.79$

$o = 0.75 \times (2 - \frac{1}{0.83}) = 0.6$
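The four measures can be recomputed from the counts reported on this slide: 257 correct predictions, 309 predictions in total, and 257 + 86 correct matches. The sketch below assumes that 86 is the number of correct matches the algorithm missed; the function and parameter names are hypothetical.

```python
def four_measures(correct, predicted, expected):
    """Precision, Recall, F-Measure, and Overall.

    correct: number of correct predictions
    predicted: number of all predictions
    expected: number of correct matches (from manual matching)
    """
    p = correct / predicted
    r = correct / expected
    f = 2 * p * r / (p + r)   # harmonic mean of p and r
    o = r * (2 - 1 / p)       # post-match effort
    return p, r, f, o


p, r, f, o = four_measures(correct=257, predicted=309, expected=257 + 86)
rounded = tuple(round(x, 2) for x in (p, r, f, o))
```

Rounded to two decimals, the values agree with the slide: 0.83, 0.75, 0.79, and 0.6.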

SOCCER Summary

• SOCCER: A learning-based ontology matching algorithm, and the first one based on ontology schemas alone

• Our contributions:

1. An artificial neural network (ANN) technique was integrated so that the weights for different semantic aspects can be learned instead of being specified by a human in advance

2. Moreover, the learning was carried out based on the ontology schemas alone, which distinguishes SOCCER from most other learning-based algorithms.


Ongoing Research: Bioinformatics/Medical Informatics (1)

• An abundance of medical/biological digital data has promised a profound impact in both the quality and rate of discovery and innovation

• Worldwide health scientists are producing, accessing, analyzing, integrating, and storing massive amounts of digital medical data daily

• If we were able to effectively transfer and integrate data from all possible resources, then the following would be granted:
– A deeper understanding of all these data sets
– Better exposed knowledge
– Appropriate insights and actions that follow

• But…in many cases, the data users are not the data producers, and they thus face challenges in harnessing data in unforeseen/unplanned ways

• Fortunately, ontological techniques can render help in this regard!

• Ontological techniques have been widely applied to medical and biological research

• The most successful example is the Gene Ontology (GO) project
– The GO’s aim: to standardize the representation of gene and gene product attributes across species and databases
– Three ontologies in the GO: Cellular Component, Molecular Function, and Biological Process
– The GO provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data
– It also provides tools to access and process such data
– The focus of the GO is to describe how gene products behave in a cellular context

• Ontologies constructed under the auspices of the OBO (Open Biomedical Ontologies) group exhibit great variety

• Semantic integration becomes an indispensable step in biological and biomedical data mining

Ongoing Research: Bioinformatics/Medical Informatics (2)

An Experiment in Bio Data Mining

1. The characteristics of many biomedical ontologies: i) a rich set of super/subClassOf relationships; ii) numeric strings adopted as concept names; and iii) little, if any, instance data

2. SOCCER suitably serves the goal of integrating semantics in computational biology

Ongoing Research: Bioinformatics/Medical Informatics (3)

Ongoing Research: Digital Forensics (1)

• Challenges exist in Digital Forensics
– to maintain the integrity of evidence found by different parties (usually from distributed geographic areas, or even with cultural barriers)
– the accurate interpretation of evidence
– the trustworthy conclusion drawn thereafter

• Different parties are likely to adopt different formats and metadata for storing evidence’s contents – due to different people’s specific needs

• The seamless communication among different parties, along with the knowledge sharing and reuse that follow, becomes a non-trivial problem

• Being a formal knowledge representation model, ontologies may help us to handle the aforementioned challenges in Digital Forensics

• But …

There is no such central ontology that is large enough to include all concepts of interest to every individual criminal investigator

• Anyone can design ontologies according to his/her own conceptual view; ontological heterogeneity is thus an inherent feature

• That is, each individual party that needs a conceptual model will have to provide its own particular extensions, which are different from and incompatible with extensions added by other parties

Ongoing Research: Digital Forensics (2)

• An agreed-upon, global, and “all-in-one” ontology is not a feasible solution

• Different groups should maintain their own conceptual models, while utilizing ontological techniques to synthesize their data with others’ models

• This way, it is possible to effectively decouple the evidence semantics from its logical description and organization

Digital Investigation Evidence Acquisition Model Based on Ontology Matching (DIEAOM) to facilitate:
(1) knowledge collection from disparate, heterogeneous evidence sources
(2) knowledge sharing and reuse
(3) decision support for criminal investigators

Ongoing Research: Digital Forensics (3)

• The DIEAOM aims to synthesize vast amounts of evidence from different parties by matching conceptual models

• Our goal is to benefit the current criminal investigation procedure with higher automation, enhanced effectiveness, and better knowledge sharing and reuse

Other Research Opportunities (1) Heterogeneous Knowledge Acquisition/Management

• Increasing growth in the scale, complexity, and diversity of data has been witnessed in recent years

• In addition, the data are often used in ways not envisioned by those who created them

• New techniques are thus needed to repurpose, transform, and integrate multiple and uncoordinated data sources; interoperability is the fundamental goal

• In order to better achieve interoperability among distributed knowledge sources, accurate and effective semantic integration is the first, critical step to handle the heterogeneity in data

Other Research Opportunities (2) Component-Based Software Engineering

• Engineered software is decomposed into functional or logical components, with well-defined interfaces for communication across components

• Reusability is an important feature of a high quality component

• (Semi)automated methodology to annotate, discover, compose, and execute the software components

• Semantic integration techniques are important and fundamental in such automation processes

Other Research Opportunities (3) Semantics-Enriched Image Knowledge Bases

• Create image knowledge bases by using ontologies to semantically encode image features

• Semantic search allows users to make use of concept search, instead of traditional keyword search

• It also paves the way for more advanced search strategies

• Users can specialize or generalize a query with the help of a concept hierarchy

• Queries can be formed using information from ontologies


Summary

• Information from heterogeneous sources has different semantics, and semantic integration is necessary to make better use of all available information

• As a formal knowledge representation model, ontologies can render help in this regard

• SOCCER, the first learning-based approach that relies on schemas alone, was developed to tackle the ontology-matching problem, which is a critical component in semantic integration

• Ontological techniques can be applied to many areas to generate challenging interdisciplinary research topics

Thank you!!!

•Suggestions?

•Comments?

•Questions?