TRANSCRIPT · 2010-11-22
PUTTING THE SPIRIT OF THE WEB BACK INTO SEMANTIC WEB QUERYING
Cosmin Basca, Abraham Bernstein
Motivation
Vision: towards a globally query-able and truly open Semantic Web
We want to:
- Query the Web of Data (WoD) on demand
- Provide up-to-date results (within the query execution interval, typically seconds)
- Impose no or limited restrictions on data publishers
- Be flexible regarding participating triple stores
- Preserve the "openness" of the WoD
Openness
By "openness" we mean:
- Servers are assumed to be independent (unaware of other servers) and heterogeneous
- We assume no control and only limited knowledge over their distribution and availability
- Data publishers do not have to adhere to fixed guidelines
Motivating example
Consider sites holding LOD: Linked Movie Database and DBpedia data. Find out which movies, and related information, were produced by "Producers Circle" studios:

SELECT ?title ?photoCollection ?name WHERE {
  ?film dc:title ?title ;
        movie:actor ?actor ;
        owl:sameAs ?sameFilm .    # link to other datasets
  ?actor a foaf:Person ;
         movie:actor_name ?name .
  ?sameFilm dbpedia:hasPhotoCollection ?photoCollection .
  ?sameFilm dbpedia:studio "Producers Circle" .
}
Problem
- Key space: a given in the Semantic Web via URIs
- Tradeoff between globalism and performance (address space vs. size in bytes)
- Joining datasets
- Currently no system or algorithm achieves the goal entirely
Problem
[Figure, built up over several slides: existing systems positioned by Restrictiveness (low to high) vs. Intended Addressing Space (local to global). Fixed-id partitioning, clustered, and cloud stores (Sesame, AllegroGraph, 4Store, YARS2) sit at the local, more restrictive end; triple-level federation (DARQ, SemWiq) and URI/instance-level federation (RDF Peers, Hartig et al.) move toward global addressing. The low-restrictiveness, global-addressing region — the Goal — remains empty, marked "?".]
Closer to Goal: Avalanche
[Figure, built up over several slides: a client poses a SPARQL query (the example query above) to an Avalanche SPARQL endpoint (1); participating endpoints are discovered via an endpoints directory or search engine (2); the query is fanned out to the discovered Avalanche SPARQL endpoints (3).]
Challenges and Implications
- The Web of Data is growing: LOD ~25B triples (Sept 2010)
- Lack of (high-quality) statistics for join estimations
- Physical constraints: bandwidth, latency, unavailability, many sites
- Completeness is not considered: first K results
- Exponential search space due to flexibility: efficient heuristics needed to search it
Architecture
[Figure: the AVALANCHE Mediator Execution Pipeline. In the query preprocessing phase, the Query Parser and Statistics Requester consult the AVALANCHE endpoints and a Web directory or search engine. In the query execution phase, the Plan Generator feeds a PlansQueue consumed by several Executors; Materializers turn plans from the FinishedPlansQueue into results on the ResultsQueue, monitored by the Query Stopper.]
Planning
- Greedy multipath search inspired by Best-First Search
- The total space is O(n³)! — its size increases by M * H with each exploratory step (H = number of sites, M = number of paths)
- In practice the space is tractable: most queries are not fully connected graphs!
- Can be further reduced with a windowed approach
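A minimal sketch of the greedy multipath idea, always expanding the currently most promising partial plan. The path records and utility function are toy stand-ins (Avalanche's real states pair paths with hosts, and the windowed reduction is not shown):

```python
import heapq
from itertools import count

def greedy_multipath_search(paths, n_patterns, utility, max_plans=4):
    """Greedily combine subquery paths into full query plans, Best-First
    style: a max-heap (via negated utility) orders partial plans, and the
    best one is repeatedly popped and joined with paths that add coverage."""
    tie = count()  # tie-breaker so the heap never compares plan tuples
    frontier = [(-utility(p), next(tie), frozenset(p["patterns"]), (p["name"],))
                for p in paths]
    heapq.heapify(frontier)
    finished = []
    while frontier and len(finished) < max_plans:
        neg_u, _, covered, plan = heapq.heappop(frontier)
        if len(covered) == n_patterns:      # full query graph covered
            finished.append((-neg_u, plan))
            continue
        for p in paths:                      # try joining every other path
            extra = frozenset(p["patterns"]) - covered
            if extra:                        # only joins that add coverage
                heapq.heappush(frontier, (neg_u - utility(p), next(tie),
                                          covered | extra, plan + (p["name"],)))
    return finished

# Toy input: three paths covering four triple patterns between them.
paths = [{"name": "A", "patterns": [0, 1], "u": 3},
         {"name": "B", "patterns": [1, 2], "u": 2},
         {"name": "C", "patterns": [2, 3], "u": 5}]
plans = greedy_multipath_search(paths, 4, lambda p: p["u"])
```

The first finished plan joins the highest-utility path C with A, covering all four patterns.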
Planning
The example query has 7 triple patterns and 6 unbounded variables. If the query graph were undirected and fully connected, there would be

    n(n − 1) · 2^(n−3) = 6 · 5 · 2³ = 240

possible paths (n = 6 variables). In practice we have a sparse, directed graph: 11 paths.
- Search step: each path is assigned to all servers involved, i.e. for 100 hosts: 1100 states
- Paths are joined to form the full query graph; with 4 joins on average to the full graph: 4400 ordered plans
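The counting above can be checked directly. A tiny sketch; the 4400 figure is read here as 4 average joins times 1100 states, which is one plausible reading of the slide's arithmetic:

```python
# Paths in a fully connected undirected query graph with n unbounded variables.
def possible_paths(n):
    return n * (n - 1) * 2 ** (n - 3)

full = possible_paths(6)   # worst case for the example query's 6 variables
states = 11 * 100          # 11 actual paths x 100 hosts
plans = 4 * states         # ~4 joins to the full graph (reading of "4400")
```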
Planning Heuristics
Two variants: Default and Extended.

Default utility U and cost C of extending a plan at node N1 with node N2:

    U = Edges(N1) / CNT_N1,                                                    first node
    U = min(CNT_N1, CNT_N2),                                                   otherwise

    C = 1,                                                                     first node
    C = (L + CNT_N2/B + (CNT_N1 + CNT_N2)/CNT_N1) · Edges(Query)/Edges(N2),    otherwise
Extended utility:

    EU = w1 · JOIN_{N1,N2} + w2 · U_{N1,N2},    N1, N2 selective
    EU = w2 · U_{N1,N2},                        otherwise

Join-size estimate from Bloom filters:

    JOIN_{N1,N2} ≈ −(1/k) · ln( m·(Z1 + Z2 − Z12) / (Z1·Z2) ) / ln(1 − 1/m)

L = latency, B = bandwidth; the cost terms capture executing the remote and the local subquery. w1, w2 are scaling factors (to aid convergence). Bloom filters are expensive and are exchanged only for selective queries. Zi = number of 0 bits in Bloom filter i, k = number of Bloom hash functions, m = size in bits of the Bloom filter.
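The Bloom-filter join estimate can be sketched end to end. The SHA-256-based hashing and all sizes below are illustrative choices, not Avalanche's; note that Z1 + Z2 − Z12 equals the zero-bit count of the union (bitwise-OR) filter, which is what the code computes:

```python
import hashlib
import math

def positions(x, m, k):
    # k deterministic bit positions for element x (illustrative hashing)
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
            for i in range(k)]

def bloom(items, m, k):
    bits = [0] * m
    for x in items:
        for p in positions(x, m, k):
            bits[p] = 1
    return bits

def estimate_join(bits1, bits2, m, k):
    """Estimate |A ∩ B| from two Bloom filters via the slide's formula:
    the zero counts yield per-filter cardinalities t(Z) = ln(Z/m)/(k·ln(1-1/m)),
    and the estimate is t1 + t2 - t_union."""
    z1, z2 = bits1.count(0), bits2.count(0)
    z_union = sum(1 for a, b in zip(bits1, bits2) if a | b == 0)  # Z1+Z2-Z12
    return -(1.0 / k) * math.log(m * z_union / (z1 * z2)) / math.log(1 - 1.0 / m)

m, k = 4096, 3
A = [f"item{i}" for i in range(400)]           # 400 elements
B = [f"item{i}" for i in range(300, 700)]      # 400 elements, 100 shared with A
est = estimate_join(bloom(A, m, k), bloom(B, m, k), m, k)
```

With these sizes the estimate lands close to the true intersection of 100, at a fraction of the cost of shipping the bindings themselves.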
Execution
The example query is split into three subqueries, joined on ?actor and ?sameFilm (subquery labels chosen to match the message sequence below):

q1: ?actor a foaf:Person . ?actor movie:actor_name ?name .
q2: ?film dc:title ?title . ?film movie:actor ?actor . ?film owl:sameAs ?sameFilm .
q3: ?sameFilm dbpedia:hasPhotoCollection ?photoCollection . ?sameFilm dbpedia:studio "Producers Circle" .

1) Join(q1, q2)
2) R1 = Execute(q1)
3) Send(R1)
4) FR2 = ExecuteFilter(R1)
5) Join(q2, q3)
6) Send(FR2)
7) FR3 = ExecuteFilter(FR2)
8) Update(q3, q2)
9) R3 = FR3
10) Send(R3)
11) Update(q2, q1)
12) R2 = Filter(FR2, FR3)
13) Send(R2)
14) R1 = Filter(R1, R2)
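A minimal re-enactment of the filter exchange above, with invented toy data and Python dicts standing in for endpoints. Avalanche itself ships bindings (or Bloom filters of them) between real SPARQL endpoints; here ?film and ?sameFilm are collapsed into one identifier for brevity:

```python
# Three "endpoints", each holding the data for one subquery.
actors = {"actorA": "Alice", "actorC": "Carol"}   # q1 host: ?actor -> ?name
films = {"film1": "actorA", "film2": "actorB"}    # q2 host: ?film -> ?actor
studios = {"film1": "Producers Circle"}           # q3 host: ?sameFilm -> studio

# 2-3) execute q1 locally and send the ?actor bindings R1
R1 = set(actors)
# 4) the q2 host keeps only films whose ?actor joins with R1 (semi-join filter)
FR2 = {film: actor for film, actor in films.items() if actor in R1}
# 6-7) send FR2; the q3 host keeps only films satisfying the studio pattern
FR3 = {film for film in FR2 if film in studios}   # 9) R3 = FR3
# 10-14) results flow back, shrinking each earlier intermediate to the join
R2 = {film: actor for film, actor in FR2.items() if film in FR3}
R1 = {actor: actors[actor] for actor in R2.values()}
```

Each hop only forwards bindings that survived the previous filter, so the intermediate results shrink monotonically toward the final join.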
Materializing and Stopping
- Materialization: same as execution, but the string representation is requested from endpoints that completed the plan
- Stopping: timeout; relative saturation (few new results received over a sliding window); first K results
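The stopping conditions can be sketched as a small monitor. Window size, threshold, and the class itself are hypothetical illustrations (the timeout condition is omitted):

```python
from collections import deque

class SaturationStopper:
    """Stop when fewer than min_new new results arrive over a sliding
    window of result batches, or once first_k results have been seen."""

    def __init__(self, window=5, min_new=1, first_k=None):
        self.window = deque(maxlen=window)  # new-result counts per batch
        self.min_new = min_new
        self.first_k = first_k
        self.seen = set()

    def observe(self, batch):
        # Record how many results in this batch are genuinely new.
        new = [r for r in batch if r not in self.seen]
        self.seen.update(new)
        self.window.append(len(new))
        return self.should_stop()

    def should_stop(self):
        if self.first_k is not None and len(self.seen) >= self.first_k:
            return True   # first-K condition satisfied
        full = len(self.window) == self.window.maxlen
        return full and sum(self.window) < self.min_new  # saturation
```

Feeding it successive (possibly overlapping) result batches, it fires once a full window passes with essentially no new results, or earlier if the first-K target is met.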
Preliminary Results (5 sites, 35 million triples)
[Figure: execution time in seconds (0 to 13.5) for queries Q1–Q3; bars compare time to first results and time to total results under the default and extended heuristics.]
Preliminary Results (5 sites, 35 million triples)
[Figure: number of unique results (0 to 280) for queries Q1–Q3; bars compare first results and total results under the default and extended heuristics.]
Preliminary Results (5 sites, 35 million triples)
[Figure: planner convergence — number of new results (0 to 180) vs. total results so far (1 to 10,000, log scale) for Q1–Q3, with the saturation point marked for each query.]
Conclusions
Avalanche:
- Makes no or limited assumptions about data distribution, partitioning, and availability
- Provides up-to-date results as exposed by the endpoints
- Is flexible, since it requires no knowledge of triple-store structure
Demo
See Avalanche live: visit us at the ISWC demo and poster session.
Thank you.