towards distributed information retrieval in the semantic web: query reformulation using the...

Towards Distributed Information Retrieval in

the Semantic Web:Query Reformulation Usingthe Framework

[email protected]

Wednesday 14th of June, 2006

Raphaël Troncy, Umberto Straccia

Motivation

Various SW repositories, using different vocabularies, distributed on the web

Already large amounts of data out thereSwoogle hits 1.5M unique Semantic Web

documents (05/06/2006)Problem:How to search and retrieve information in

such an environment?

Example scenario

Kim Clijsters (courtesy of AFP)

Montenegro independence (courtesy of Euronews)

NewsML

SportsML

EventsML

iCalendar

TimeML

<newsItem schema="0.7" version="2"> <itemMeta> <contentClass code="ccls:photo" /> </itemMeta> <contentMeta> <infoSource literal="AFP" /> <locCreated code="city:Paris"/> </contentMeta> ...</newsItem>

EXIF

EBU P/Meta

MXF

MPEG-7

Distributed Search in the SW:Resource selectionSelect a subset of some relevant

resourcesQuery reformulationReformulate the information need into

the vocabulary used by the resource Data fusion and rank aggregationMerged and ranked all the results

together

Resource Selection

Compute an approximation of the content of each resources

For some random queries, an approximation consists of:The ontology the resource relies onSome instances (sampling annotated

documents)

Query reformulation

Transformation rules:From the query vocabulary to the vocabularies

used by the resources Semantic Web: Ontology Alignment

Establishing relationships holding between the entities (subsumption, equivalence, disjointness…)

With a confidence measure Automatically computed

1..0

an ontologyalignment framework

oMAP: Ontology Alignment Tool

TerminologicalClassifiers

Machine Learning-based Classifiers

Structural andSemantics-based

Classifiers

- Formal and open framework- Classifiers customization (parameter, chaining)

subsumption

equivalent

oMAP: A Formal Framework

Sources of inspiration:Formal work in data exchange [Fagin et al.,

2003]GLUE: combining several specialized

components for finding the best set of mappings [Doan et al., 2003]

Notation:A mapping is a triple: M = (T, S, ∑)S and T are the source and target ontologiesSi is an OWL entity (class, datatype property,

object property) of the ontology∑ is a set of mapping rules: αij Tj ← Si

oMAP: Combining Classifiers

Weight of a mapping rule:αij = w (Si,Tj, ∑)

Using different classifiers:w (Si,Tj,CLk) is the classifier's

approximation of the rule Tj ← Si

Combining the approximations:Use of a priority list: CL1 CL2 … CLn Weighted average of the classifiers

prediction

Terminological Classifiers

Same entity names (or URI)

Same entity name stems

otherwise 0

name, same have , if 1),,(

ji

Nji

TSCLTSw

otherwise 0

stem, same have , if 1),,(

ji

Sji

TSCLTSw


String distance name

Iterative substring matching

See [Stoilos et al., ISWC'05]

))(),(max(

),(),,(

ji

jinLevenshteiLDji TlengthSlength

TSdistCLTSw

),(),(),(),,( jijijiISji TSwinklerTSDiffTSCommCLTSw


WordNet distance name

lcs is the longest common substring between Si

and Tj

sim =

otherwise

)()(

*2,max

synonyms, are , if 1

),,(

ji

ji

WNji

TlengthSlength

lcssim

TS

CLTSw

)()(

)()(

ji

ji

TsynonymSsynonym

TsynonymSsynonym

Machine Learning-Based Classifiers

Collecting bag of words:label for the named individualsdata value for the datatype propertiestype for the anonymous individuals and the

range of object properties…

Recursion on the OWL definition:depth parameter

Use statistical methods on the collected bag of words

Machine Learning-Based Classifiers

ExampleIndividual (x1 type (Conference)

value (label "European SW Conf") value (location x2) )

Individual (x2 type (Address)

value (city "Budva") value (country "Montenegro") )

u1 = (" European SW Conf ", "Address")u2 = ("Address", "Budva", "Montenegro")

Naïve Bayes text classifier

kNN text classifier

jTux um

iiNBji SmSCLTSw),(

)Pr()Pr(),,(

Structural and Semantics-Based Classifier

∑ is a set of mapping rules: αij Tj ← Si

∑ sets are computed by taking the OWL definition of the entities to alignrecursively in the OWL structure... without looping thanks to cycles

detection

Structural and Semantics-Based Classifier If Si and Tj are property names:

If Si and Tj are concept names1:

otherwise ),,('

if 0),,(

ji

ij

ji TSw

STTSw

otherwise ),,(max),,('1)Set(

1

and 0D if ),,('

if 0

),,(

),(

t

setDjCijisetji

ijji

ij

ji

DCwTSw

STTSw

ST

TSw

1 Where D = D(Si) * D(Tj) ; D(Si) represents the set of concepts directly parent of Si

Structural and Semantics-Based ClassifierLet CS=(QR.C) and DT=(Q’R’.D), then1:

Let CS=(op C1…Cm) and DT=(op’ D1…Dm), then2:

),,(),',()',(),,( DCwRRwQQwDCw QTS

),min(

),,(max

)',(),,(),(

nm

DCw

opopwDCwsetDjCi

jiset

opTS

1 Where Q,Q’ are quantifiers, R,R’ are property names and C,D concept expressions

2 Where op, op’ are concept constructors and n,m ≥ 1

Evaluation

OAEI Contests (2004, 2005, 2006): http://oaei.ontologymatching.org/Systematic benchmark tests on

bibliographic dataTests 2xx: aligning an ontology with

variations of itself where each OWL constructs are discarded or modified one per oneTests 3xx: four real bibliographic ontologies

Web categories alignmenthttp://oaei.ontologymatching.org/2005/

results/

http://oaei.ontologymatching.org/



http://oaei.ontologymatching.org/2005/results/





Benchmark Tests

dublin20 0.92

Falcon 0.91

FOAM 0.90

oMAP 0.85

CMS 0.81

OLA 0.80

ctxMatch 0.72

edna 0.45

Falcon 0.89

OLA 0.74

dublin20 0.72

FOAM 0.69

oMAP 0.68

edna 0.61

ctxMatch 0.20

CMS 0.18oMAP:4th with the global F-Measure1st on 3xx tests (real ontologies to align)

Precision Recall

Aligning Web Categories

Aligning Google, Loksmart and Yahoo web categories [Avesani et al., ISWC'05]

Blind tests: only recall results are available

ctxMatch

FOAM CMS Dublin20

Falcon OLA oMAP

9.4% 11.9% 14.1% 26.5% 31.2% 32.0% 34.4%

Distributed Search in the SW

Q: retrieve course material dealing with history of the Americas and "Columbus"query(d)<- History_Americas(d,"Columbus")

is re-written as two queries0.63 query(d)<-

Latin_American_History(d,"Columbus")0.84 query(d)<- American_History(d,"Columbus")

Each document score is then multiplied with the confidence score of the rule.

university courses

Conclusion

Distributed Search in the SWresource selection / query reformulation /

data fusion and rank aggregationoMAP: a formal framework for aligning

automatically OWL ontologiesCombining several specific classifiersTerminological classifiersMachine learning-based classifiersStructural and semantics-based classifier

Future Work

Implementing the three steps proposedKeyword-based or structured (SPARQL) queriesRanked list of results

oMAPUsing additional classifiers:KL-distance, other resources, background K, etc.Straightforward theoretically but practically

difficult!Finding complex alignmentname = firstName + lastName

OWL and rule-based languages:Take into account this additional expressivity

http://www.cwi.nl/~troncy/oMAP/

Any questions ?






Structural and Semantics-Based ClassifierPossible values for wop and wQ

weights

wop wQ⊓ ⊔ ¬

⊓ 1 1/4 0

⊔ 1 0

¬ 1

1 1/4

1

n n m

1 1/3

m

1

towards distributed information retrieval in the semantic web: query reformulation using the...

Documents