Evaluating Ontology-Mapping Tools: Requirements and Experience
Natalya F. Noy
Mark A. Musen
Stanford Medical Informatics
Stanford University
Types Of Ontology Tools
There is not just ONE class of ONTOLOGY TOOLS:
- Development tools: Protégé-2000, OntoEdit, OilEd, WebODE, Ontolingua
- Mapping tools: PROMPT, ONION, OBSERVER, Chimaera, FCA-Merge, GLUE
Evaluation Parameters for Ontology-Development Tools
- Interoperability with other tools
  - Ability to import ontologies from other languages
  - Ability to export ontologies to other languages
- Expressiveness of the knowledge model
- Scalability
- Extensibility
- Availability and capabilities of inference services
- Usability of tools
Evaluation Parameters for Ontology-Mapping Tools
Can try to reuse evaluation parameters for development tools, but:
- Development tools share similar tasks, inputs, and outputs
- Mapping tools have different tasks, inputs, and outputs
Development Tools
- Input: domain knowledge, ontologies to reuse, requirements
- Task: create an ontology
- Output: domain ontology
Mapping Tools: Tasks
- Merging two ontologies A and B into one: C = Merge(A, B) (iPROMPT, Chimaera)
- Mapping between two ontologies: Map(A, B) (Anchor-PROMPT, GLUE, FCA-Merge)
- Creating an articulation ontology between A and B (ONION)
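Viewed as operations, the three tasks have different signatures. The following is a minimal sketch in Python type hints; the Ontology and ArticulationRule types and the function names are placeholders for illustration, not any tool's actual API:

```python
from typing import List, Tuple

class Ontology: ...          # placeholder for a source or merged ontology
class ArticulationRule: ...  # placeholder for a rule linking terms of two ontologies

def merge(a: Ontology, b: Ontology) -> Ontology:
    """Merging (iPROMPT, Chimaera): produce a single merged ontology C = Merge(A, B)."""
    ...

def map_ontologies(a: Ontology, b: Ontology) -> List[Tuple[str, str]]:
    """Mapping (Anchor-PROMPT, GLUE, FCA-Merge): produce pairs of related terms."""
    ...

def articulate(a: Ontology, b: Ontology) -> List[ArticulationRule]:
    """Articulation (ONION): produce an articulation ontology as a set of rules."""
    ...
```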
Mapping Tools: Inputs
- iPROMPT, Chimaera: classes, slots and facets
- GLUE: classes, instance data
- FCA-Merge: classes, shared instances
- OBSERVER: classes, DL definitions
Mapping Tools: Outputs and User Interaction
- GUI for interactive merging: iPROMPT, Chimaera
- Lists of pairs of related terms: Anchor-PROMPT, GLUE, FCA-Merge
- List of articulation rules: ONION
Can We Compare Mapping Tools?
- Yes, we can!
- We can compare tools in the same group
- How do we define a group?
Architectural Comparison Criteria
- Input requirements: ontology elements
  - Used for analysis
  - Required for analysis
- Modeling paradigm
  - Frame-based
  - Description Logic
- Level of user interaction
  - Batch mode
  - Interactive
  - User feedback: required? used?
Architectural Criteria (cont’d)
- Type of output
  - Set of rules
  - Ontology of mappings
  - List of suggestions
  - Set of pairs of related terms
- Content of output
  - Matching classes
  - Matching instances
  - Matching slots
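To make the grouping concrete, here is a minimal sketch in Python of how tools described along these architectural criteria could be partitioned into comparable groups. The attribute names and the characterizations of individual tools are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

# Hypothetical characterization of tools along the architectural criteria;
# the values below are illustrative, not authoritative descriptions of the tools.
tools = {
    "iPROMPT":  {"paradigm": "frame-based", "interaction": "interactive", "output": "suggestions"},
    "Chimaera": {"paradigm": "frame-based", "interaction": "interactive", "output": "suggestions"},
    "GLUE":     {"paradigm": "frame-based", "interaction": "batch",       "output": "related-term pairs"},
    "ONION":    {"paradigm": "frame-based", "interaction": "batch",       "output": "articulation rules"},
}

# Tools that agree on every architectural criterion fall into the same group
# and can then be compared on performance.
groups = defaultdict(list)
for name, features in tools.items():
    key = tuple(sorted(features.items()))
    groups[key].append(name)

for key, members in groups.items():
    print(dict(key), "->", members)
```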
From Large Pool To Small Groups
- Architectural criteria partition the space of mapping tools into small groups
- Performance criteria compare tools within a single group
Resources Required For Comparison Experiments
- Source ontologies
  - Pairs of ontologies covering similar domains
  - Ontologies of different size, complexity, and level of overlap
- "Gold standard" results
  - Human-generated correspondences between terms
  - Pairs of terms, rules, explicit mappings
Resources Required (cont’d)
- Metrics for comparing performance
  - Precision: how many of the tool's suggestions are correct
  - Recall: how many of the correct matches the tool found
  - Distance between ontologies
  - Use of inference techniques
  - Analysis of taxonomic relationships (à la OntoClean)
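As a concrete illustration, here is a minimal sketch in Python of computing precision and recall of a tool's suggested matches against a human-generated gold standard; the term pairs are invented for the example:

```python
# Hypothetical gold-standard correspondences and tool suggestions,
# each represented as a pair of terms from the two source ontologies.
gold_standard = {("Employee", "Worker"), ("Publication", "Paper"), ("Department", "Dept")}
suggestions   = {("Employee", "Worker"), ("Publication", "Paper"), ("Activity", "Dept")}

correct = suggestions & gold_standard

precision = len(correct) / len(suggestions)     # fraction of suggestions that are correct
recall    = len(correct) / len(gold_standard)   # fraction of correct matches that were found

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```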
- Experiment controls
  - Design
  - Protocol
- Data recorded during the experiment
  - Suggestions that the tool produced
  - Operations that the user performed
  - Suggestions that the user followed
Where Will The Resources Come From?
- Ideally, from researchers who do not belong to any of the evaluated projects
- Realistically, as a by-product of stand-alone evaluation experiments
Evaluation Experiment: iPROMPT
- iPROMPT is a plug-in to Protégé-2000 and an interactive ontology-merging tool
- iPROMPT uses the class hierarchy and slot and facet values for its analysis
- iPROMPT matches classes, slots, and instances
Evaluation Experiment
- 4 users merged the same 2 source ontologies
- We measured
  - Acceptability of iPROMPT's suggestions
  - Differences in the resulting ontologies
Sources
- Input: two ontologies from the DAML ontology library
- CMU ontology: employees of an academic organization, publications, relationships among research groups
- UMD ontology: individuals, CS departments, activities
Experimental Design
- Users' expertise: familiar with Protégé-2000, not familiar with PROMPT
- Experiment materials: the iPROMPT software, a detailed tutorial, a tutorial example, evaluation files
- Users performed the experiment on their own, with no questions or interaction with the developers
Experiment Results
- Quality of iPROMPT suggestions: recall 96.9%, precision 88.6%
- Resulting ontologies
  - Difference measure: fraction of frames that have a different name and type
  - The resulting ontologies differ by ~30%
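A minimal sketch of such a difference measure follows, in Python; the frame representation and the exact comparison rule are assumptions for illustration, since the slide only names the measure:

```python
# Each resulting ontology is represented, for illustration, as a mapping
# from frame name to frame type (e.g., "class", "slot", "instance").
def ontology_difference(frames_a, frames_b):
    """Fraction of frames that do not appear with the same name and type in both ontologies."""
    all_frames = set(frames_a) | set(frames_b)
    differing = [f for f in all_frames if frames_a.get(f) != frames_b.get(f)]
    return len(differing) / len(all_frames)

ontology_1 = {"Employee": "class", "works-for": "slot", "CMU": "instance"}
ontology_2 = {"Employee": "class", "works-for": "slot", "UMD": "instance"}

print(f"difference: {ontology_difference(ontology_1, ontology_2):.0%}")
```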
Limitations In The Experiment
- Only 4 participants
- Variability in Protégé expertise
- Recall and precision figures without comparison to other tools are not very meaningful
- Need better distance metrics
Research Questions
- Which pragmatic criteria are most helpful in finding the best tool for a task?
- How do we develop a "gold standard" merged ontology? Does such an ontology exist?
- How do we define a good distance metric to compare results to the gold standard?
- Can we reuse tools and metrics developed for evaluating ontologies themselves?