web of data usage mining
What you should learn:
• describe the architectural differences between content negotiation and Linked Data queries;
• develop applications that use different strategies to consume Linked Data;
• develop usage mining methods that exploit the atomic parts of the SPARQL query language.
Linked Data principles
1. Use URIs as names for “Things” (resources).
2. Use HTTP URIs to allow the access to resources on the Web.
3. On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL).
4. Set RDF links to resources published by other parties to allow the discovery of more resources.
http://dbpedia.org/resource/Berlin
http://dbpedia.org/page/Berlin
http://dbpedia.org/data/Berlin
yago-res:Berlin (S) owl:sameAs (P) dbpedia:Berlin (O)
http://www.w3.org/DesignIssues/LinkedData.html
Content Negotiation
Linked Data exploits RDF
http://markus-luczak.de#me foaf:name "Markus Luczak-Roesch"
http://markus-luczak.de#me foaf:knows http://hannes.muehleisen.org#me
Linked Data vocabularies
• Vocabulary reuse:
  – Geo
  – FOAF
  – GoodRelations
  – SIOC
  – DOAP
  – …
• Vocabulary development:
  – Thing
    • Person
      – OfficeHolder
      – …
    • …
http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/OfficeHolder
http://xmlns.com/foaf/0.1/knows
Linked Data vocabularies
• Mixing:
  – Geo
  – FOAF
  – Dublin Core
  – DBpedia Ontology
  – ...
http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://dbpedia.org/ontology/leader
http://dbpedia.org/ontology/City
Linked Data is self-descriptive
[Figure: a graph split into an instance level and a schema level. Instance level: int:resA, int:resB, ext:resA, and the literal "ABC", connected by foaf:name and owl:sameAs. Schema level: ont:ClassA, foaf:Person, foaf:Agent, connected by owl:equivalentClass and rdfs:subClassOf. rdf:type edges link the instance level to the schema level.]
[Figure, built up over several slides: relational records become RDF and are then linked to other data sets.]

u_id | firstname | surname
45   | Markus    | Luczak-Roesch
…    | …         | …

http://markus-luczak.de#me rdf:type foaf:Person
http://markus-luczak.de#me foaf:name "Markus Luczak-Roesch"

c_id | city   | country | inhabitants
67   | Berlin | Germany | 3.375.222
…    | …      | …       | …

dbpedia:Berlin dbp:population "3.375.222"

Linking the two:
http://markus-luczak.de#me dbp:birthPlace dbpedia:Berlin

Adding more links:
http://markus-luczak.de#me foaf:basedNear http://markus-luczak.de/res/Soton
http://markus-luczak.de/res/Soton rdfs:seeAlso dbpedia:Southampton
dbpedia:Berlin skos:subject dbpedia:Cities_in_Europe
dbpedia:Southampton skos:subject dbpedia:Cities_in_Europe

Connecting schema and further instances:
foaf:Person owl:equivalentClass dbp:Person
dbpedia:Benny_Hill rdf:type dbp:Person
dbpedia:Benny_Hill dbp:birthPlace dbpedia:Southampton
Linked Data Infrastructure
Image source: Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
Consuming Linked Data
• stateless
• request-response
[Figure: timeline (t) between Client and Server across the TCP lifecycle: open connection, request, response, close connection. Derived from R. Tolksdorf.]
Consuming Linked Data
Request (Client to Server):
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Firefox/10.0.3
Host: markus-luczak.de:80
Accept: */*

Response (Server to Client):
HTTP/1.1 200 OK
Server: Apache/2.0.49
Content-Language: en
Content-Type: text/html
Content-length: 2990

<!DOCTYPE html> <html xml:lang="en" …

Derived from R. Tolksdorf
Consuming Linked Data
[Figure: a Client sends an HTTP GET for the information resource http://example.com/content/index, which has two representations: Representation 1 (index.html) and Representation 2 (index.rdf).]
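A minimal sketch of how a server could pick between the two representations, assuming a simplified reading of the Accept header (exact media types, q-values, and `*/*` only). The function names and the `REPRESENTATIONS` table are illustrative and not taken from the slides.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(entries, key=lambda e: e[1], reverse=True)


def negotiate(accept_header, representations):
    """Return the file to serve for one information resource, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            # Assumption: the first configured representation is the default.
            return next(iter(representations.values()))
        if media_type in representations:
            return representations[media_type]
    return None


# Hypothetical mapping for http://example.com/content/index
REPRESENTATIONS = {
    "text/html": "index.html",
    "application/rdf+xml": "index.rdf",
}

print(negotiate("application/rdf+xml", REPRESENTATIONS))
print(negotiate("text/html,application/rdf+xml;q=0.9", REPRESENTATIONS))
```

A browser-style request ends up at index.html, while a Linked Data client asking for application/rdf+xml gets index.rdf from the same URI.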
Consuming Linked Data
• Discover URIs
  – Lookup services
    • http://rkbexplorer.com
  – Web of Data search engines
    • http://sindice.com
    • http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
Consuming Linked Data
• Discover additional data for the resource at hand
• Follow links ("follow your nose")
  – rdfs:seeAlso
  – owl:sameAs
• Co-reference services
  – http://sameas.org
• Web of Data search engines
Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.
SPARQL-recap
• Basic principle: pattern matching
  – describe pattern
  – query RDF triple set ("RDF graph")
  – matching subset comes into results
[Figure: the pattern ?s dbp:birthPlace http://dbpedia.org/resource/Berlin is laid over a graph containing dbp:Klaus_Wowereit, dbp:Reinhard_Mey, dbp:Axel_Springer, dbp:Berlin, and the literal "Berlino". The matching subset binds ?s to dbp:Klaus_Wowereit and dbp:Reinhard_Mey.]
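The pattern-matching principle can be sketched in a few lines of Python. This is an illustration only: triples are plain tuples of strings, variables start with `?`, and the `rdfs:label` predicate for the "Berlino" literal is an assumption (the slide does not show which property carries it).

```python
def is_var(term):
    """Variables are written like ?s, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Return one variable binding per triple that matches the pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice must bind to the same value.
                if binding.get(p_term, t_term) != t_term:
                    break
                binding[p_term] = t_term
            elif p_term != t_term:
                break
        else:
            solutions.append(binding)
    return solutions


# Toy graph based on the node labels visible on the slide.
GRAPH = [
    ("dbp:Klaus_Wowereit", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Reinhard_Mey", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Berlin", "rdfs:label", "Berlino"),  # predicate is an assumption
]

for solution in match_pattern(("?s", "dbp:birthPlace", "dbp:Berlin"), GRAPH):
    print(solution["?s"])
```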
SPARQL queries on the Web
• RESTful service endpoint
GET /sparql?query=PREFIX+rdf… HTTP/1.1
Host: dbpedia.org
http://www.w3.org/TR/rdf-sparql-XMLres/
http://www.w3.org/TR/rdf-sparql-json-res/
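As a sketch of what such a request looks like from the client side, the snippet below only builds the GET URL and does not send it. The endpoint path /sparql on dbpedia.org follows the request line on the slide; the query text itself is an illustrative assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query into the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


# Illustrative query; any SPARQL query string is encoded the same way.
QUERY = (
    "SELECT ?s WHERE { "
    "?s <http://dbpedia.org/property/birthPlace> "
    "<http://dbpedia.org/resource/Berlin> }"
)

url = sparql_get_url("http://dbpedia.org/sparql", QUERY)
print(url)

# The server (and anyone reading its access log) can decode the query again.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Because the full query travels in the URL, it ends up verbatim in the server's access log, which is what makes SPARQL usage mining possible.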
Querying Linked Data
[Figure: dbp:birthPlace links pointing to dbp:Berlin from resources published in different places: dbp:Klaus_Wowereit, dbp:Reinhard_Mey, and http://www.markus-luczak.de/me.]
Querying Linked Data
• distribution of data creates challenges for querying them
• Query approaches
  – follow-up queries ← application-dependent, proprietary
  – query a central data repository (e.g. LOD cache) ← trivial
  – federated queries ← more interesting
    • idea: query a mediator that distributes the sub-queries and returns the aggregated result (as of SPARQL 1.1)
  – link traversal ← very interesting
    • idea: follow links in the results retrieved from a source to expand the data dynamically
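The link traversal idea can be sketched as a breadth-first walk over dereferenceable URIs. To stay runnable offline, the sketch replaces HTTP lookups with a dictionary named `WEB`; the `ex:` identifiers and the contents of each "document" are hypothetical stand-ins, not real data.

```python
from collections import deque

# Hypothetical stand-in for the Web: URI -> triples returned on lookup.
WEB = {
    "ex:me": [("ex:me", "foaf:basedNear", "ex:Soton")],
    "ex:Soton": [("ex:Soton", "rdfs:seeAlso", "dbpedia:Southampton")],
    "dbpedia:Southampton": [
        ("dbpedia:Benny_Hill", "dbp:birthPlace", "dbpedia:Southampton"),
    ],
}


def lookup(uri):
    """Pretend to dereference a URI; unknown URIs yield no data."""
    return WEB.get(uri, [])


def traverse(seed, max_documents=10):
    """Collect triples by following subjects and objects of retrieved data."""
    seen, queue, graph = set(), deque([seed]), set()
    while queue and len(seen) < max_documents:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for triple in lookup(uri):
            graph.add(triple)
            # Follow your nose: every subject and object is a candidate link.
            for term in (triple[0], triple[2]):
                if term not in seen:
                    queue.append(term)
    return graph


for triple in sorted(traverse("ex:me")):
    print(triple)
```

The `max_documents` bound is a design choice for the sketch: on the real Web the traversal would otherwise have no natural stopping point.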
[Figure: two ways for a User or Client/Application (data consumer) to access a Dataset (data publisher) over HTTP.
Resource-centered access: GET /resource/resA; the publisher performs graph creation and content negotiation and returns application/rdf+xml, …
Query pattern access: GET /sparql?query=SELECT …; the publisher does the query processing (evaluate and perform query, create result set) and returns text/xml, …; the consumer processes and selects the result.]
http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/
A game of pairs with SPARQL
SPARQL queries are self-descriptive data themselves
{ ?s1 foaf:name "Markus Luczak-Roesch" .    (TP)
  ?s1 rdf:type dbp:Person }                 (TP)
Each line is a triple pattern (TP); together they form a basic graph pattern (BGP).

Applied to the data
http://markus-luczak.de#me foaf:name "Markus Luczak-Roesch"
http://markus-luczak.de#me rdf:type foaf:Person
the first TP matches (✔), the second TP does not (✗), and so the BGP fails (✗).

{ dbpedia:Benny_Hill dbp:birthPlace ?o1 .
  ?s dbp:basedNear ?o1 .
  ?s foaf:name ?o2 }
✔ ✗ ✗ ✗
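The "game of pairs" can be replayed in code: decompose a BGP into its triple patterns, test each one alone, then test the whole BGP with joins. This is a sketch under the assumption that the data set consists of exactly the two triples shown for http://markus-luczak.de#me; the function names are illustrative.

```python
DATASET = [
    ("http://markus-luczak.de#me", "foaf:name", "Markus Luczak-Roesch"),
    ("http://markus-luczak.de#me", "rdf:type", "foaf:Person"),
]


def extend(binding, pattern, triple):
    """Try to extend a binding so that the pattern equals the triple."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:
                return None  # variable already bound to something else
        elif p != t:
            return None
    return new


def solve(bgp, dataset):
    """Evaluate a basic graph pattern by joining its triple patterns."""
    solutions = [{}]
    for pattern in bgp:
        next_solutions = []
        for binding in solutions:
            for triple in dataset:
                extended = extend(binding, pattern, triple)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


def diagnose(bgp, dataset):
    """Report success per triple pattern and for the BGP as a whole."""
    per_pattern = [bool(solve([tp], dataset)) for tp in bgp]
    return per_pattern, bool(solve(bgp, dataset))


BGP = [
    ("?s1", "foaf:name", "Markus Luczak-Roesch"),
    ("?s1", "rdf:type", "dbp:Person"),
]
print(diagnose(BGP, DATASET))
```

The per-pattern flags are what a usage miner is after: they show which primitive of a failing query has no counterpart in the data.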
Statistical analysis
missing facts
• ns:Band ns:knownFor ?x
• ns:Band ns:nationality ?y
inconsistent data
• ns:Band ns:instrument ?x
• ns:Band ns:genre ?y
• ns:Band ns:associatedBand ?z
Statistical analysis
(a) SWC (b) DBpedia (c) LGD
Figure 20: Usage of the concepts of the multi-ontologies (edges are hidden). Source: own illustration.

… of these data sets still has great potential for improvement. For example, it is possible to use a larger number of specialized concepts. Likewise, more concepts from areas other than persons, places, and documents could in theory be used.

Experiment 3. This experiment evaluated which results the developed ubKCE algorithm yields depending on the weighting of the aspects density and usage. The aim is to answer the question of whether key concepts can be determined from usage data at all. The answer is clearly yes. The results of Experiment 2 already showed that many of the KCE key concepts are also the ones used most heavily by users of the data set. Depending on the chosen weighting of the aspects, however, the agreement between the ubKCE results and the key concepts determined by KCE varies. For example, a weighting of 100% usage to 0% density generally yields different key concepts than a 30:70 ratio. Agreement with KCE generally increases the higher the weight of density in ubKCE. This is not surprising, since the density aspect was adopted from KCE. In the evaluation of the KCE algorithm, [27] stated that from 70% agreement between the KCE results and the key concepts determined by experts it is no longer possible to tell whether a particular result was produced by a human or by the algorithm. To find a weighting that agrees as closely as possible with the KCE result while relying as strongly as possible on the usage aspect, the present work also aims for this 70% agreement. According to [27], this agreement means that one could not tell whether a result came from an expert, from KCE, or from ubKCE. Through the different weightings of the aspects, the summary of the …

Source: Master's thesis of Markus Bischoff
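To make the effect of the weighting tangible, here is a toy sketch. It assumes, purely for illustration, that the usage and density aspects are blended linearly; the excerpt does not state how ubKCE actually combines them, and the scores in `TOY_SCORES` are made up.

```python
def rank_concepts(scores, usage_weight):
    """Rank concepts by a weighted blend of (usage, density) scores.

    Assumption: a linear combination; not confirmed as the ubKCE formula.
    """
    if not 0.0 <= usage_weight <= 1.0:
        raise ValueError("usage_weight must be between 0 and 1")
    density_weight = 1.0 - usage_weight
    combined = {
        concept: usage_weight * usage + density_weight * density
        for concept, (usage, density) in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical (usage, density) scores per concept, each in [0, 1].
TOY_SCORES = {
    "Person": (0.9, 0.2),
    "Place": (0.4, 0.8),
    "Document": (0.1, 0.5),
}

print(rank_concepts(TOY_SCORES, usage_weight=1.0))  # 100% usage, 0% density
print(rank_concepts(TOY_SCORES, usage_weight=0.3))  # 30:70 usage to density
```

Even on three concepts the top-ranked concept changes between the two weightings, which mirrors the observation that 100:0 and 30:70 generally yield different key concepts.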
Estimating the effects of change
Usage-dependent maintenance of structured Web data sets
to be added to the DBpedia 3.4 data set conforming to our approach (see footnote 16).

Table 7.14: Recommended predicates to be added to the data set and the estimated effects of change.

Primitive to add   | Effects of change | Exists in data set
dbp:manufacturer   | 0.004505372       | x
dbp:firstFlight    | 0.004505372       | x
dbp:introduced     | 0.004505372       | x
dbp:nationalOrigin | 0.004505372       |
dbo:thumbnail      | 0.021986718       | x
dbo:director       | 0.025047524       |
dbp:director       | 0.02503915        | x
dbp:abstract       | 0.025797024       | x
dbo:starring       | 0.034066643       |
dbp:starring       | 0.034066643       | x
dbp:stars          | 0.034066643       | x
skos:Concept       | 0.040946128       | x
skos:broader       | 0.04116386        | x
dbp:redirect       | 0.066441677       | x
Since this change recommendation is only additive, it is clear that no negative effects are estimated. However, it is possible to estimate the positive potential of a change and consequently to prioritize the changes to be performed in case of conflicting or contradicting recommendations.

More complex and also subtractive change recommendations may emerge from additive ones. This is typified by the recommendation to add dbo:director and dbp:director, for example, to the data set, which appear to be contradicting. Hence, they should either be matched to each other by an owl:equivalentProperty relation or one of the two should be eliminated.
7.3.3 Further data set analysis
Our case study has shown how the usage-dependent data set maintenance approach performs in the context of a cross-domain data set like DBpedia. We will now present results from our studies with SWDF and LGD as two different domain-specific data sets.

16 To save space we apply the following namespace prefixes in addition to the ones defined before: dbo: http://dbpedia.org/ontology/, dbp: http://dbpedia.org/property/.
[Figure: processing pipeline. Log files → Selected log files → Preprocessed queries → Decomposed queries and transaction tables → Patterns → Change recommendations, each rated in [0,1].]
What’s in your SPARQL shopping bag?
{ ?s1 foaf:name "Markus Luczak-Roesch" .
  ?s1 rdf:type dbp:Person }
{ dbpedia:Benny_Hill dbp:birthPlace ?o1 .
  ?s dbp:basedNear ?o1 .
  ?s foaf:name ?o2 }
[Figure: the logged queries are grouped into transactions T1, T2, …; queries within 30 mins., from the same IP and the same user agent, fall into one transaction.]
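The shopping-bag analogy can be sketched as a sessionization step: queries from the same IP and user agent within 30 minutes are merged into one transaction whose items are the queried primitives. The entry layout and the choice of a fixed window starting at the first query are assumptions of this sketch, as are the sample log entries.

```python
from datetime import datetime, timedelta


def build_transactions(entries, window=timedelta(minutes=30)):
    """Group query primitives into transactions per (IP, user agent).

    entries: iterable of (timestamp, ip, user_agent, primitives).
    Assumption: a transaction spans 30 minutes from its first query.
    """
    transactions = []
    open_tx = {}  # (ip, agent) -> (start time, set of primitives)
    for ts, ip, agent, primitives in sorted(entries, key=lambda e: e[0]):
        key = (ip, agent)
        current = open_tx.get(key)
        if current is None or ts - current[0] > window:
            current = (ts, set())
            open_tx[key] = current
            transactions.append(current[1])
        current[1].update(primitives)
    return transactions


t0 = datetime(2014, 10, 17, 7, 0, 0)
# Hypothetical log entries; primitives come from the two queries on the slide.
ENTRIES = [
    (t0, "10.0.0.1", "agentA", {"foaf:name", "rdf:type"}),
    (t0 + timedelta(minutes=10), "10.0.0.1", "agentA", {"dbp:birthPlace"}),
    (t0 + timedelta(minutes=45), "10.0.0.1", "agentA", {"dbp:basedNear"}),
    (t0 + timedelta(minutes=5), "10.0.0.2", "agentB", {"foaf:name"}),
]

for number, items in enumerate(build_transactions(ENTRIES), start=1):
    print(f"T{number}", sorted(items))
```

The resulting item sets are the input format that frequent-itemset and association-rule mining expect.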
Figure 7.20: Visualization of association rules computed by application of the unknown predicates restriction in the context of the LGD log file (size: support, color: lift).
LGD
Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.
SPARQL
7. Evaluation
The visualization shows how primitives on the left hand side (LHS) of a rule imply particular ones on the right hand side (RHS) and which likelihood such an association has. In our specific case this allows us to analyze which primitives are queried together frequently in failing queries. We spot two characteristic usage patterns: (1) the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft; (2) the properties and classes queried in the context of an object variable. These can be further analyzed by exporting the association rules to GraphML and visualizing the network by use of a network visualization and analysis tool like Gephi (see footnote 15), for example. Figure 7.13 depicts one filtered network representation for our example case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5) to derive a well-arranged visualization of the most important primitives in failing queries. Nodes represent LHS and RHS of the computed rules. Edges point from the LHS to the RHS of the particular rules.
[Node labels of Figure 7.13:]
{?pred_variable, http://dbpedia.org/property/name}
{?pred_variable, http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/ontology/Aircraft}
{http://dbpedia.org/property/abstract, http://dbpedia.org/property/name}
{http://dbpedia.org/property/abstract, http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/property/firstFlight}
{http://dbpedia.org/property/introduced}
{http://dbpedia.org/property/manufacturer}
{http://dbpedia.org/property/name, http://dbpedia.org/property/type}
{http://dbpedia.org/property/name, http://xmlns.com/foaf/0.1/depiction}
{http://dbpedia.org/property/nationalOrigin}
{http://dbpedia.org/property/type, http://xmlns.com/foaf/0.1/depiction}
Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter applied to reduce nodes with degree lower than 5).
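For readers who want the rule measures spelled out, the sketch below computes support, confidence, and lift for one rule using the standard textbook definitions. The transactions are a made-up toy set built from primitive names that appear in the figure, not the log data analyzed in the thesis.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    s_lhs = support(lhs, transactions)
    s_rhs = support(rhs, transactions)
    s_both = support(set(lhs) | set(rhs), transactions)
    if s_lhs == 0 or s_rhs == 0:
        raise ValueError("rule sides must occur in at least one transaction")
    confidence = s_both / s_lhs
    lift = confidence / s_rhs
    return s_both, confidence, lift


# Hypothetical transactions of primitives from failing queries.
TRANSACTIONS = [
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"},
    {"dbo:Aircraft", "dbp:manufacturer"},
    {"dbp:name", "foaf:depiction"},
    {"dbo:Aircraft", "dbp:name"},
]

s, c, l = rule_metrics({"dbo:Aircraft"}, {"dbp:manufacturer"}, TRANSACTIONS)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 means the right hand side shows up more often together with the left hand side than it would by chance.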
Table 7.14 lists an exemplary set of primitives which would be recommended

15 http://gephi.org/
{ ?s1 foaf:name "Markus Luczak-Roesch" .
  ?s1 rdf:type dbp:Person }
Query applied to dataset:
http://markus-luczak.de#me foaf:name "Markus Luczak-Roesch"    ✔
http://markus-luczak.de#me rdf:type foaf:Person                ✗
Whole query: ✗
The server can trace detailed usage.
Linked Data Fragments
Querying Datasets on the Web with High Availability

[Figure 1: a spectrum of HTTP interfaces. At one end: generic requests, high client effort, high server availability (data dump). At the other end: specific requests, high server effort, low server availability (SPARQL result). Linked Data documents and triple pattern fragments lie in between. All of them are various types of Linked Data Fragments.]

Fig. 1: All HTTP triple interfaces offer Linked Data Fragments of a dataset. They differ in the specificity of the data they contain, and thus the effort needed to create them.
3.2 Formal definitions

As a basis for our formalization, we use the following concepts of the RDF data model [16] and the SPARQL query language [12]. We write $\mathcal{U}$, $\mathcal{B}$, $\mathcal{L}$, and $\mathcal{V}$ to denote the sets of all URIs, blank nodes, literals, and variables, respectively. Then, $\mathcal{T} = (\mathcal{U} \cup \mathcal{B}) \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$ is the (infinite) set of all RDF triples. Any tuple $tp \in (\mathcal{U} \cup \mathcal{V}) \times (\mathcal{U} \cup \mathcal{V}) \times (\mathcal{U} \cup \mathcal{L} \cup \mathcal{V})$ is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL graph pattern, typically denoted by $P$, combines triple patterns (or BGPs) using specific operators [12,20]. The standard (set-based) query semantics for SPARQL defines the query result of such a graph pattern $P$ over a set of RDF triples $G \subseteq \mathcal{T}$ as a set that we denote by $[[P]]_G$ and that consists of partial mappings $\mu : \mathcal{V} \to (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$, which are called solution mappings. An RDF triple $t$ is a matching triple for a triple pattern $tp$ if there exists a solution mapping $\mu$ such that $t = \mu[tp]$, where $\mu[tp]$ denotes the triple (pattern) that we obtain by replacing the variables in $tp$ according to $\mu$.

For the sake of a more straightforward formalization, in this paper, we assume without loss of generality that every dataset $G$ published via some kind of fragments on the Web is a finite set of blank-node-free RDF triples; i.e., $G \subseteq \mathcal{T}^*$ where $\mathcal{T}^* = \mathcal{U} \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{L})$. Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector:

Definition 1 (selector). A selector is a partial function $s : 2^{\mathcal{T}} \to \{\text{true}, \text{false}\}$.

A more concrete type of this abstract notion are triple pattern selectors, which select triples that match a certain triple pattern:

Definition 2 (triple pattern selector). Given a triple pattern $tp$, the triple pattern selector for $tp$ is the selector $s_{tp}$ that, for any singleton set $\{t\} \subseteq 2^{\mathcal{T}}$, is defined by

$$s_{tp}(\{t\}) = \begin{cases} \text{true} & \text{if } t \text{ is a matching triple for } tp, \\ \text{false} & \text{else.} \end{cases}$$

When publishing data on the Web, we should equip its representations with hypermedia controls [1, 8, 9]. We encounter them on a daily basis when browsing HTML pages; they are usually present as hyperlinks or forms. What all these controls have in common is that, given some (possibly empty) input, they result in our browser performing a request for a specific URL.

Definition 3 (control). A control is a function that maps from some set to $\mathcal{U}$.
Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com.
xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] "GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309 "http://fragments.dbpedia.org/2014/en" …
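A log line like this one already contains the requested triple pattern, so usage mining can start from the access log alone. The sketch below recovers the pattern with the standard library; the regular expression covers only the request part of the common log format, the copy of the log line is cut after the byte count, and rendering empty parameters as `?s`, `?p`, `?o` is a convention chosen here.

```python
import re
from urllib.parse import urlsplit, parse_qs

# The log line from the slide, shortened after the response size.
LOG_LINE = (
    'xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] '
    '"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" '
    '200 1309'
)

REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def triple_pattern_from_log(line):
    """Extract (subject, predicate, object) from a fragment request line."""
    match = REQUEST_RE.search(line)
    if match is None or match.group("method") != "GET":
        return None
    query = urlsplit(match.group("target")).query
    params = parse_qs(query, keep_blank_values=True)

    def term(name, variable):
        value = params.get(name, [""])[0]
        return value if value else variable  # empty parameter = variable

    return (term("subject", "?s"), term("predicate", "?p"), term("object", "?o"))


print(triple_pattern_from_log(LOG_LINE))
```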
Data: (predecessor $I_p$, BGP $B = \{tp_1, \ldots, tp_n\}$ with $n \geq 2$, start page $\phi_0$)
$I \leftarrow$ nil; $c \leftarrow$ the triple pattern control in the control set $C_0$ of $\phi_0$;
Function BasicGraphPatternIterator.next()
    $\mu \leftarrow$ nil;
    while $\mu$ = nil do
        while $I$ = nil do
            $\mu_p \leftarrow I_p$.next();
            return nil if $\mu_p$ = nil;
            $\Phi \leftarrow \{\phi_1^i \mid \phi_1^i =$ HTTP GET first fragment page using URL $c(\mu_p[tp_i])\}$;
            $\epsilon \leftarrow i$ such that $cnt_{\phi_1^i} = \min(\{cnt_{\phi_1^1}, \ldots, cnt_{\phi_1^n}\})$;
            $I_\epsilon \leftarrow$ TriplePatternIterator(StartIterator(), $\mu_p[tp_\epsilon]$, $\phi_1^\epsilon$);
            $I \leftarrow$ BasicGraphPatternIterator($I_\epsilon$, $\{\mu[tp] \mid tp \in B \setminus \{tp_\epsilon\}\}$, $\phi_1^\epsilon$);
        $\mu \leftarrow I$.next();
    return $\mu \cup \mu_p$;

Algorithm 1: For all mappings $\mu_p$ of a predecessor $I_p$, a BGP iterator for a pattern $B = \{tp_1, \ldots, tp_n\}$ creates a triple pattern iterator $I_\epsilon$ for the least frequent pattern $tp_\epsilon$, passed to a BGP iterator for the remainder of $P$.
fetches the first page of the corresponding LDF. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new BGP iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests.

To evaluate a SPARQL query over a triple pattern fragment collection, we proceed as follows. For each BGP of the query, a BGP iterator is created. Dedicated iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL, but their implementation need not be LDF-specific; they can reuse the triple pattern fragment BGP iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally.
[Figure 2:
B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }
    ?city subject Capitals_in_Europe. → Zagreb, Budapest, Rome, ...
B' = { ?person a Architect. ?person birthPlace Zagreb. }
    ?person birthPlace Zagreb. → Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...
B'' = { Drago_Ibler a Architect. }
    ...]

Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
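The decomposition can be imitated with a recursive Python generator. This is a simplified sketch, not the client described in the paper: an in-memory list stands in for the fragment server, `len()` of the matches stands in for the cnt metadata, paging and HTTP are left out, and the toy triples (including who is typed as an architect) are assumptions for illustration.

```python
def is_var(term):
    return term.startswith("?")


def substitute(pattern, mapping):
    """Replace already bound variables in a triple pattern."""
    return tuple(mapping.get(t, t) if is_var(t) else t for t in pattern)


def fragment(pattern, dataset):
    """Stand-in for fetching a triple pattern fragment (matches = count).

    Simplification: repeated variables inside one pattern are not checked.
    """
    return [tr for tr in dataset
            if all(is_var(p) or p == v for p, v in zip(pattern, tr))]


def evaluate(bgp, dataset, mapping=None):
    """Yield solution mappings, always expanding the rarest pattern first."""
    mapping = mapping or {}
    if not bgp:
        yield dict(mapping)
        return
    patterns = [substitute(tp, mapping) for tp in bgp]
    fragments = [fragment(tp, dataset) for tp in patterns]
    # Choose the pattern with the fewest matches for this specific mapping.
    i = min(range(len(patterns)), key=lambda k: len(fragments[k]))
    rest = patterns[:i] + patterns[i + 1:]
    for triple in fragments[i]:
        bound = {p: v for p, v in zip(patterns[i], triple) if is_var(p)}
        yield from evaluate(rest, dataset, {**mapping, **bound})


# Toy data loosely following Fig. 2; the type statements are assumptions.
DATA = [
    ("Drago_Ibler", "a", "Architect"),
    ("Juraj_Neidhardt", "a", "Architect"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Juraj_Neidhardt", "birthPlace", "Zagreb"),
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Budapest", "subject", "Capitals_in_Europe"),
    ("Rome", "subject", "Capitals_in_Europe"),
]

BGP = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]

for solution in evaluate(BGP, DATA):
    print(solution)
```

Because the pattern choice is repeated inside every recursive call, the evaluation order can differ per partial mapping, which is the point of the dynamic pipeline. Being a generator, it also delivers results incrementally.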
4.2 Dynamic iterator pipelines
A common approach to implement query execution in database systems is through iterators that are typically arranged in a tree or a pipeline, based on which query results are computed recursively [10]. Such a pipelined approach has also been studied for Linked Data query processing [13, 15]. In order to enable incremental results and allow the straightforward addition of SPARQL operators, we implement a triple pattern fragments client using iterators.

The previous algorithm, however, cannot be implemented by a static iterator pipeline. For instance, consider a query for architects born in European capitals:
SELECT ?person ?city WHERE {
  ?person a dbpedia-owl:Architect.                 # tp1
  ?person dbpprop:birthPlace ?city.                # tp2
  ?city dc:subject dbpedia:Capitals_in_Europe.     # tp3
} LIMIT 100
Suppose the pipeline begins by finding ?city mappings for tp3. It then needs to choose whether it will next consider tp1 or tp2. The optimal choice, however, differs depending on the value of ?city:
– For dbpedia:Paris, there are ±1,900 matches for tp2, and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp1.
– For dbpedia:Vilnius, there are 164 matches for tp2, and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp2.
With a static pipeline, we would have to choose the pipeline structure in advance and subsequently reuse it.
In order to generate an optimized pipeline for each (sub-)query, we propose a divide-and-conquer strategy in which a query is decomposed dynamically into subqueries depending on partial solution mappings. The main function of an iterator is next(), which either returns a mapping or nil if no mappings are left.

We first introduce a trivial start iterator, which outputs the empty mapping $\mu_0$ on the first call to next(), and nil on all subsequent calls.

Next, we implement a previously defined triple pattern iterator [15] for triple pattern fragments. This iterator $I_{tp}$ is initialized with a predecessor iterator $I_p$, a triple pattern $tp$, and a page $\phi_0$ of an arbitrary triple pattern fragment of a collection $F$. The iterator then extends mappings from its predecessor by reading triples from the LDF corresponding to triple pattern $tp$. The URL of this LDF is retrieved through the collection control in the start page $\phi_0$. Each call to $I_{tp}$.next() results in mappings for $tp$ in $F$, depending on the predecessor’s mappings.

To solve BGPs of SPARQL queries, we introduce a triple pattern fragment BGP iterator. Such a BGP iterator is initialized with a predecessor $I_p$, a BGP $B = \{tp_1, \ldots, tp_n\}$, and an arbitrary triple pattern fragment page $\phi_0$ of a collection $F$. For an empty pattern ($n = 0$), a BGP iterator is equal to a start iterator. For a pattern length $n = 1$, it is constructed by creating a triple pattern iterator for $(I_p, tp_1, \phi_0)$. For $n \geq 2$, a BGP iterator uses Algorithm 1.

BGP iterators evaluate a BGP by recursively decomposing it into smaller iterators. For each triple pattern in the BGP mapped by each result of $I_p$, the iterator
Wikidata
• API access to
  • items
  • edit history
  • items’ discussions
  • items’ access statistics
  • and more
• Linked Data interface
• MediaWiki API
• Wikidata Query
• SPARQL
• Linked Data Fragments
Access to more than “just” usage.
Thank you very much! @mluczak | http://markus-luczak.de
http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/
References
• Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn2011), CEUR WS.
• Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin, Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
• Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective Query Patterns. COLD, 782.
• Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data (ILD 2012), 61.
• Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International Workshop on Usage Analysis and the Web of Data, Lyon, France.
• Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv preprint arXiv:1103.5043.
• Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin Heidelberg.
• Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
• Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through Linked Data Fragments. In LDOW.