Towards Distributed Information Retrieval in
the Semantic Web: Query Reformulation Using the oMAP Framework
Wednesday 14th of June, 2006
Raphaël Troncy, Umberto Straccia
Motivation
Various SW repositories, using different vocabularies, distributed on the web
Already large amounts of data out there
- Swoogle hits 1.5M unique Semantic Web documents (05/06/2006)
Problem: How to search and retrieve information in such an environment?
Example scenario
Kim Clijsters (courtesy of AFP)
Montenegro independence (courtesy of Euronews)
NewsML
SportsML
EventsML
iCalendar
TimeML
<newsItem schema="0.7" version="2"> <itemMeta> <contentClass code="ccls:photo" /> </itemMeta> <contentMeta> <infoSource literal="AFP" /> <locCreated code="city:Paris"/> </contentMeta> ...</newsItem>
EXIF
EBU P/Meta
MXF
MPEG-7
Distributed Search in the SW:
Resource selection
- Select a subset of relevant resources
Query reformulation
- Reformulate the information need into the vocabulary used by each resource
Data fusion and rank aggregation
- Merge and rank all the results together
Resource Selection
Compute an approximation of the content of each resource
Based on some random queries, an approximation consists of:
- The ontology the resource relies on
- Some instances (sampling annotated documents)
Query reformulation
Transformation rules:
- From the query vocabulary to the vocabularies used by the resources
Semantic Web: Ontology Alignment
- Establishing relationships holding between the entities (subsumption, equivalence, disjointness…)
- With a confidence measure between 0 and 1
- Automatically computed
oMAP: Ontology Alignment Tool
oMAP: an ontology alignment framework
- Terminological classifiers
- Machine learning-based classifiers
- Structural and semantics-based classifiers
- Formal and open framework
- Classifier customization (parameters, chaining)
[Figure labels: subsumption, equivalent]
oMAP: A Formal Framework
Sources of inspiration:
- Formal work in data exchange [Fagin et al., 2003]
- GLUE: combining several specialized components for finding the best set of mappings [Doan et al., 2003]
Notation:
- A mapping is a triple: M = (T, S, Σ)
- S and T are the source and target ontologies
- S_i is an OWL entity (class, datatype property, object property) of the ontology
- Σ is a set of mapping rules: α_ij T_j ← S_i
oMAP: Combining Classifiers
Weight of a mapping rule:
- α_ij = w(S_i, T_j, Σ)
Using different classifiers:
- w(S_i, T_j, CL_k) is the classifier's approximation of the rule T_j ← S_i
Combining the approximations:
- Use of a priority list: CL_1, CL_2, …, CL_n
- Weighted average of the classifiers' predictions
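The two combination strategies named on this slide can be sketched as follows. This is a minimal illustration, not oMAP's implementation: the slide does not say how the priority list is evaluated, so the sketch assumes that a classifier may abstain (return None) and that the first classifier in priority order that answers wins. The function names, the toy classifiers and the importance values are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A classifier returns a weight in [0, 1], or None when it has no opinion
# (the ability to abstain is an assumption of this sketch).
Classifier = Callable[[str, str], Optional[float]]


def combine_priority(s: str, t: str, classifiers: Sequence[Classifier]) -> float:
    """Return the answer of the first classifier, in priority order, that gives one."""
    for classify in classifiers:
        weight = classify(s, t)
        if weight is not None:
            return weight
    return 0.0  # assumption: nobody answered, so no evidence for the rule


def combine_weighted(s: str, t: str, classifiers: Sequence[Classifier],
                     importances: Sequence[float]) -> float:
    """Weighted average of the predictions of the classifiers that answered."""
    answered = [(classify(s, t), imp) for classify, imp in zip(classifiers, importances)]
    answered = [(w, imp) for w, imp in answered if w is not None]
    total = sum(imp for _, imp in answered)
    if total == 0:
        return 0.0
    return sum(w * imp for w, imp in answered) / total


def same_name(s: str, t: str) -> Optional[float]:
    """Toy classifier: certain on identical names, abstains otherwise."""
    return 1.0 if s == t else None


def constant_half(s: str, t: str) -> Optional[float]:
    """Toy fallback classifier that always answers 0.5."""
    return 0.5
```

With these toy classifiers, `combine_priority("Author", "Writer", [same_name, constant_half])` falls through to the second classifier, while the weighted average only averages over the classifiers that produced a value.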
Terminological Classifiers
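Same entity names (or URI)
\[ w(S_i, T_j, CL_N) = \begin{cases} 1 & \text{if } S_i, T_j \text{ have the same name} \\ 0 & \text{otherwise} \end{cases} \]
Same entity name stems
\[ w(S_i, T_j, CL_S) = \begin{cases} 1 & \text{if } S_i, T_j \text{ have the same stem} \\ 0 & \text{otherwise} \end{cases} \]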
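A minimal sketch of the two binary classifiers on this slide. Several choices here are assumptions rather than anything stated in the deck: names are compared case-insensitively, the "name" of a URI is taken to be its fragment or last path segment, and the stemmer is a deliberately crude suffix stripper standing in for a real stemming algorithm. All function names are hypothetical.

```python
def local_name(uri: str) -> str:
    """Fragment or last path segment of a URI; plain names pass through unchanged."""
    return uri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def crude_stem(word: str) -> str:
    """Very rough stemmer (assumption): strips one common English suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def cl_name(s: str, t: str) -> int:
    """1 if both entities have the same URI or the same (case-folded) local name, else 0."""
    if s == t:
        return 1
    return 1 if local_name(s).lower() == local_name(t).lower() else 0


def cl_stem(s: str, t: str) -> int:
    """1 if the local names of both entities share the same stem, else 0."""
    return 1 if crude_stem(local_name(s)) == crude_stem(local_name(t)) else 0
```

For instance, `cl_name("Publications", "Publication")` is 0 while `cl_stem("Publications", "Publication")` is 1, which is the kind of near-miss the stem classifier is meant to catch.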
Terminological Classifiers
String distance on names
\[ w(S_i, T_j, CL_{LD}) = \frac{dist_{Levenshtein}(S_i, T_j)}{\max(length(S_i), length(T_j))} \]
Iterative substring matching
See [Stoilos et al., ISWC'05]
\[ w(S_i, T_j, CL_{IS}) = Comm(S_i, T_j) - Diff(S_i, T_j) + winkler(S_i, T_j) \]
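The Levenshtein-based weight can be sketched in a few lines. The function `w_ld` follows the formula as it appears on the slide, that is, the edit distance divided by the length of the longer name. Since that quantity is 0 for identical names, the helper `similarity_ld` (one minus the normalized distance) is added as an assumption for readers who want a score where higher means more similar; it is not taken from the slide. The iterative substring metric of Stoilos et al. is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def w_ld(s: str, t: str) -> float:
    """Edit distance normalized by the longer name, as written on the slide."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def similarity_ld(s: str, t: str) -> float:
    """Assumption of this sketch: turn the normalized distance into a similarity."""
    return 1.0 - w_ld(s, t)
```

For example, "Author" and "Authors" are one edit apart, so the normalized distance is 1/7.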
Terminological Classifiers
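WordNet distance on names
\[ w(S_i, T_j, CL_{WN}) = \begin{cases} 1 & \text{if } S_i, T_j \text{ are synonyms} \\ \max\left(sim, \frac{2 \cdot lcs}{length(S_i) + length(T_j)}\right) & \text{otherwise} \end{cases} \]
lcs is the length of the longest common substring between S_i and T_j
\[ sim = \frac{|synonym(S_i) \cap synonym(T_j)|}{|synonym(S_i) \cup synonym(T_j)|} \]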
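The WordNet-based weight can be illustrated without WordNet itself. In the sketch below, a small hand-written synonym table (`SYNONYMS`, entirely hypothetical) stands in for the WordNet lookup; the rest follows the slide: weight 1 for synonyms, otherwise the larger of the synonym-set overlap and the normalized longest common substring. Treating `lcs` as a length and reading the overlap as intersection over union are interpretations of the slide, not confirmed details.

```python
# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "author": {"writer", "creator"},
    "writer": {"author"},
    "car": {"auto", "machine"},
    "automobile": {"auto", "machine", "motorcar"},
}


def synonyms(word: str) -> set:
    return SYNONYMS.get(word.lower(), set())


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def w_wn(s: str, t: str) -> float:
    s, t = s.lower(), t.lower()
    syn_s, syn_t = synonyms(s), synonyms(t)
    if t in syn_s or s in syn_t:
        return 1.0
    union = syn_s | syn_t
    sim = len(syn_s & syn_t) / len(union) if union else 0.0
    total_len = len(s) + len(t)
    lcs_score = 2 * longest_common_substring_len(s, t) / total_len if total_len else 0.0
    return max(sim, lcs_score)
```

With this toy table, "car" and "automobile" are not listed as direct synonyms but share two of three synonyms, so the overlap term (2/3) dominates the substring term (2/13).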
Machine Learning-Based Classifiers
Collecting bag of words:
- label for the named individuals
- data value for the datatype properties
- type for the anonymous individuals and the range of object properties…
Recursion on the OWL definition:
- depth parameter
Use statistical methods on the collected bag of words
Machine Learning-Based Classifiers
Example
Individual (x1 type (Conference)
    value (label "European SW Conf")
    value (location x2) )
Individual (x2 type (Address)
    value (city "Budva")
    value (country "Montenegro") )
u1 = ("European SW Conf", "Address")
u2 = ("Address", "Budva", "Montenegro")
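The collection rules can be tried on the example above with a small sketch. The dictionary encoding of the individuals is made up for illustration, and one reading of the example is assumed: x1 is treated as a named individual (so its label is collected) and x2 as an anonymous one (so its type is collected). Under that assumption, a depth of 0 reproduces u1 and u2; a larger depth also pulls in the words of the individuals reached through object properties.

```python
# Toy encoding of the two individuals from the example (hypothetical structure).
# Assumption: x1 is a named individual, x2 is an anonymous one.
INDIVIDUALS = {
    "x1": {"named": True, "type": "Conference", "label": "European SW Conf",
           "data": {}, "objects": {"location": "x2"}},
    "x2": {"named": False, "type": "Address", "label": None,
           "data": {"city": "Budva", "country": "Montenegro"}, "objects": {}},
}


def bag_of_words(ind_id, depth=0, _seen=None):
    """Collect the words describing one individual, recursing up to `depth` links."""
    seen = (_seen or frozenset()) | {ind_id}
    ind = INDIVIDUALS[ind_id]
    # Label for named individuals, type for anonymous ones.
    words = [ind["label"]] if ind["named"] and ind["label"] else [ind["type"]]
    # Data values of the datatype properties.
    words.extend(ind["data"].values())
    # Type of the range of each object property, then optional recursion.
    for target_id in ind["objects"].values():
        words.append(INDIVIDUALS[target_id]["type"])
        if depth > 0 and target_id not in seen:  # `seen` prevents looping on cycles
            words.extend(bag_of_words(target_id, depth - 1, seen))
    return words
```

Calling `bag_of_words("x1")` and `bag_of_words("x2")` gives the two bags of the example, while `bag_of_words("x1", depth=1)` additionally contains the words collected for x2.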
Naïve Bayes text classifier
\[ w(S_i, T_j, CL_{NB}) = \sum_{(x,u) \in T_j} \Pr(S_i) \cdot \prod_{m \in u} \Pr(m \mid S_i) \]
kNN text classifier
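A rough sketch of the Naïve Bayes weight, reading the slide's formula as a sum over the bags u of the target entity of Pr(S_i) times the product of Pr(m | S_i) for each word m in u. The estimation details are assumptions of this sketch and not taken from the deck: priors are the share of training bags per source entity, word probabilities use add-one (Laplace) smoothing, and each collected string is treated as a single token. The training data and the target entity are made up.

```python
from collections import Counter
from math import prod


def train(bags_by_entity):
    """bags_by_entity maps a source entity to its list of bags of words."""
    vocabulary = {m for bags in bags_by_entity.values() for u in bags for m in u}
    total_bags = sum(len(bags) for bags in bags_by_entity.values())
    model = {}
    for entity, bags in bags_by_entity.items():
        counts = Counter(m for u in bags for m in u)
        model[entity] = {
            "prior": len(bags) / total_bags,  # estimate of Pr(S_i)
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocabulary


def pr_word(model, vocabulary, entity, word):
    """Add-one smoothed estimate of Pr(m | S_i) (smoothing is an assumption)."""
    stats = model[entity]
    return (stats["counts"][word] + 1) / (stats["total"] + len(vocabulary))


def w_nb(model, vocabulary, source_entity, target_bags):
    """Sum over the target's bags of Pr(S_i) * product of Pr(m | S_i)."""
    prior = model[source_entity]["prior"]
    return sum(
        prior * prod(pr_word(model, vocabulary, source_entity, m) for m in u)
        for u in target_bags
    )


# Made-up training bags for two source entities.
SOURCE_BAGS = {
    "Conference": [["European SW Conf", "Address"], ["ISWC", "Address"]],
    "Address": [["Address", "Budva", "Montenegro"]],
}
MODEL, VOCAB = train(SOURCE_BAGS)
# Bags collected for a hypothetical target entity.
TARGET_BAGS = [["European SW Conf", "Address"]]
```

On this toy data the target's bag is better explained by the source entity Conference than by Address, so the rule mapping Conference to the target receives the higher weight.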
Structural and Semantics-Based Classifier
Σ is a set of mapping rules: α_ij T_j ← S_i
Σ sets are computed by taking the OWL definition of the entities to align
- recursively in the OWL structure
- ... without looping, thanks to cycle detection
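Structural and Semantics-Based Classifier
If S_i and T_j are property names:
\[ w(S_i, T_j, \Sigma) = \begin{cases} 0 & \text{if } T_j \leftarrow S_i \notin \Sigma \\ w'(S_i, T_j, \Sigma) & \text{otherwise} \end{cases} \]
If S_i and T_j are concept names¹:
\[ w(S_i, T_j, \Sigma) = \begin{cases} 0 & \text{if } T_j \leftarrow S_i \notin \Sigma \\ w'(S_i, T_j, \Sigma) & \text{if } D = \emptyset \text{ and } T_j \leftarrow S_i \in \Sigma \\ \frac{1}{|Set| + 1} \left( w'(S_i, T_j, \Sigma) + \max_{Set} \sum_{(C_i, D_j) \in Set} w(C_i, D_j, \Sigma) \right) & \text{otherwise} \end{cases} \]
¹ Where D = D(S_i) × D(T_j); D(S_i) represents the set of concepts that are direct parents of S_i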
Structural and Semantics-Based Classifier
Let C_S = (Q R.C) and D_T = (Q' R'.D), then¹:
\[ w(C_S, D_T, \Sigma) = w_Q(Q, Q') \cdot w(R, R', \Sigma) \cdot w(C, D, \Sigma) \]
Let C_S = (op C_1 … C_m) and D_T = (op' D_1 … D_n), then²:
\[ w(C_S, D_T, \Sigma) = w_{op}(op, op') \cdot \frac{\max_{Set} \sum_{(C_i, D_j) \in Set} w(C_i, D_j, \Sigma)}{\min(m, n)} \]
¹ Where Q, Q' are quantifiers, R, R' are property names and C, D concept expressions
² Where op, op' are concept constructors and n, m ≥ 1
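The constructor case can be sketched as follows, under a stated interpretation: the weight is read as w_op(op, op') times the best summed weight over a pairing of the operands, divided by min(m, n). The slide does not define Set, so the sketch assumes it ranges over one-to-one pairings of the operands of both expressions and finds the best one by brute force, which is only reasonable for a handful of operands. The w_op values are those listed on the deck's backup slide; the operator names, function names and leaf weights are hypothetical.

```python
from itertools import permutations

# w_op values as listed on the deck's backup slide, stored symmetrically.
W_OP = {
    frozenset(["and"]): 1.0,
    frozenset(["or"]): 1.0,
    frozenset(["not"]): 1.0,
    frozenset(["and", "or"]): 0.25,
    frozenset(["and", "not"]): 0.0,
    frozenset(["or", "not"]): 0.0,
}


def w_op(op1, op2):
    return W_OP[frozenset([op1, op2])]


def best_pairing_sum(cs, ds, w):
    """Maximum summed weight over one-to-one pairings (assumed meaning of Set)."""
    if len(cs) > len(ds):
        return max(sum(w(c, d) for c, d in zip(perm, ds))
                   for perm in permutations(cs, len(ds)))
    return max(sum(w(c, d) for c, d in zip(cs, perm))
               for perm in permutations(ds, len(cs)))


def w_constructor(op1, cs, op2, ds, w):
    """Weight between (op1 C_1 ... C_m) and (op2 D_1 ... D_n)."""
    return w_op(op1, op2) * best_pairing_sum(cs, ds, w) / min(len(cs), len(ds))


# Hypothetical weights between operand concepts; unlisted pairs get 0.1.
LEAF = {("Person", "Human"): 0.9, ("Student", "Pupil"): 0.8}


def leaf_weight(c, d):
    return LEAF.get((c, d), 0.1)
```

For the intersection of Person and Student against the intersection of Pupil, Human and Thing, the best pairing matches Person with Human and Student with Pupil, giving (0.9 + 0.8) / 2. Comparing an intersection with a union scales that value by 1/4.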
Evaluation
OAEI Contests (2004, 2005, 2006): http://oaei.ontologymatching.org/
Systematic benchmark tests on bibliographic data
- Tests 2xx: aligning an ontology with variations of itself, where OWL constructs are discarded or modified one at a time
- Tests 3xx: four real bibliographic ontologies
Web categories alignment
http://oaei.ontologymatching.org/2005/results/
Benchmark Tests
Precision          Recall
dublin20  0.92     Falcon    0.89
Falcon    0.91     OLA       0.74
FOAM      0.90     dublin20  0.72
oMAP      0.85     FOAM      0.69
CMS       0.81     oMAP      0.68
OLA       0.80     edna      0.61
ctxMatch  0.72     ctxMatch  0.20
edna      0.45     CMS       0.18
oMAP:
- 4th with the global F-Measure
- 1st on 3xx tests (real ontologies to align)
Aligning Web Categories
Aligning Google, Looksmart and Yahoo web categories [Avesani et al., ISWC'05]
Blind tests: only recall results are available
System    Recall
ctxMatch   9.4%
FOAM      11.9%
CMS       14.1%
Dublin20  26.5%
Falcon    31.2%
OLA       32.0%
oMAP      34.4%
Distributed Search in the SW
Q: retrieve course material dealing with history of the Americas and "Columbus"
query(d) <- History_Americas(d, "Columbus")
is rewritten as two queries:
0.63 query(d) <- Latin_American_History(d, "Columbus")
0.84 query(d) <- American_History(d, "Columbus")
Each document score is then multiplied by the confidence score of the rule.
[Figure label: university courses]
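The rewriting step on this slide can be sketched with the two rules and their confidence scores. Everything else below is made up for illustration: the per-concept retrieval results, the `search` stub that stands in for querying a resource, and the fusion policy of keeping the best weighted score when a document is returned by several rewritten queries (the slide only states that each score is multiplied by the rule's confidence).

```python
# Mapping rules from the slide: (confidence, concept in the resource's vocabulary).
RULES = [(0.63, "Latin_American_History"), (0.84, "American_History")]

# Hypothetical retrieval results: concept -> {document id: retrieval score}.
TOY_RESULTS = {
    "Latin_American_History": {"d1": 0.9, "d2": 0.5},
    "American_History": {"d2": 0.8, "d3": 0.5},
}


def search(concept, keyword):
    """Stand-in for querying a resource; the keyword is ignored in this toy."""
    return TOY_RESULTS.get(concept, {})


def reformulate_and_merge(rules, keyword):
    """Run every rewritten query and weight each score by the rule's confidence."""
    merged = {}
    for confidence, concept in rules:
        for doc, score in search(concept, keyword).items():
            weighted = confidence * score
            # Assumption: keep the best weighted score per document.
            merged[doc] = max(merged.get(doc, 0.0), weighted)
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


ranking = reformulate_and_merge(RULES, "Columbus")
```

On this toy data, document d2 is found by both rewritten queries and keeps the score obtained through the more confident rule, which places it first in the merged ranking.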
Conclusion
Distributed Search in the SW
- resource selection / query reformulation / data fusion and rank aggregation
oMAP: a formal framework for automatically aligning OWL ontologies
Combining several specific classifiers
- Terminological classifiers
- Machine learning-based classifiers
- Structural and semantics-based classifier
Future Work
Implementing the three steps proposed
- Keyword-based or structured (SPARQL) queries
- Ranked list of results
oMAP
- Using additional classifiers: KL-distance, other resources, background knowledge, etc.
- Straightforward theoretically but practically difficult!
- Finding complex alignments, e.g. name = firstName + lastName
OWL and rule-based languages:
- Take into account this additional expressivity
http://www.cwi.nl/~troncy/oMAP/
Any questions?
Structural and Semantics-Based Classifier
Possible values for w_op and w_Q weights

w_op | ⊓   | ⊔   | ¬
⊓    | 1   | 1/4 | 0
⊔    |     | 1   | 0
¬    |     |     | 1

w_Q | ∃   | ∀
∃   | 1   | 1/4
∀   |     | 1

w_Q | ≤ n | ≥ n
≤ m | 1   | 1/3
≥ m |     | 1