March 20, 2014 Ralf Schenkel
FEDERATED QUERY PROCESSING FOR LINKED DATA:
SOLUTIONS AND CHALLENGES
joint work with Luis Galárraga, Peter Haase, Katja Hose, Steffen Metzger, Michael Schmidt, Andreas Schwarte
Ralf Schenkel
Outline of the Talk
• Introduction
• Querying Federations of Knowledge Bases
• Challenges
• Cooperative Knowledge Services
Semantic Information
Resource Description Framework (RDF):
• Represent knowledge about resources (things) in a machine-readable way.
• Resources and their relations are identified by URIs
• Statements (triples) represent facts; prefixes abbreviate the URIs

Subject: <http://www.mpii.de/yago/resource/John_Doe>
Predicate: <http://xmlns.com/foaf/0.1/name>
Object: “John Doe”

The same triple with prefixes:
PREFIX yago: <http://www.mpii.de/yago/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
yago:John_Doe foaf:name “John Doe”
RDF & SPARQL
RDF data can be seen as a data graph
[Figure: graph with nodes yago:John_Doe, yago:Max_Mustermann, “John Doe”, “Max Mustermann” and edges foaf:name, foaf:knows]

Subject              Predicate   Object
yago:John_Doe        foaf:name   “John Doe”
yago:John_Doe        foaf:knows  yago:Max_Mustermann
yago:Max_Mustermann  foaf:name   “Max Mustermann”

SPARQL: Query language for RDF from the W3C, for graph pattern queries on the knowledge base
Ontologies for Representing Knowledge
[Figure: example ontology graph. Classes: resource, person, location, politician, scientists, city, connected by subclassOf. Instances/entities (URIs): Honolulu and an entity with labels “Barack Obama” and “44th president”, connected to classes by isA. Relations: bornIn (with domain and range), bornOn (04-08-1961), label.]
More complex „schema“ languages based on Description Logics (OWL, OWL2)
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
[Figure: data graph with nodes Jim_Carrey, Mike_Myers, actor, Newmarket, Scarborough, Ontario, Canada and edges isA, bornIn, locatedIn]
Find subgraphs of this form (variables: ?person, ?loc; constants: actor, Ontario):
SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario .
}
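To make the idea of matching a graph pattern concrete, here is a minimal Python sketch that evaluates the three triple patterns over a small in-memory triple list. It is an illustration only, not how a SPARQL engine is implemented. The assignment of the bornIn edges (Jim_Carrey to Newmarket, Mike_Myers to Scarborough) is an assumption about the example graph, and the names `TRIPLES`, `match_pattern`, and `evaluate` are hypothetical.

```python
# Toy knowledge base as (subject, predicate, object) triples.
# Assumption: the bornIn edges below are one plausible reading of the figure.
TRIPLES = [
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ontario", "locatedIn", "Canada"),
]


def match_pattern(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, or None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # Variable: bind it, or check consistency with an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            # Constant: must match exactly.
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate a conjunction of triple patterns, one pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


QUERY = [
    ("?person", "isA", "actor"),
    ("?person", "bornIn", "?loc"),
    ("?loc", "locatedIn", "Ontario"),
]

persons = sorted({b["?person"] for b in evaluate(QUERY, TRIPLES)})
print(persons)
```

Each pattern narrows or extends the set of variable bindings produced by the previous ones, which is the same nested-loop idea the naive federated strategy later applies across remote sources.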
Examples for Semantic Data
• General Knowledge Bases: DBPedia, Freebase, YAGO
• Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
• Linked Data as large integrated knowledge base
Linked Data Principles
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things
http://www.w3.org/DesignIssues/LinkedData.html
Semantic Data Grows Rapidly
More than 31 billion triples in the LOD cloud
DBPedia: 3.6 million entities, 1.2 billion triples
Varying access to datasets
Semantic data can be provided
• as set of triples for download
• through a SPARQL endpoint
• by dereferencing URIs of entities
and may or may not come with
• links to other data sets (such as owl:sameAs)
• RDFS or OWL schema description
• meta data describing content (such as VoID)
Queries can be complex, too
SELECT DISTINCT ?a ?b ?lat ?long WHERE {
  ?a dbpedia:spouse ?b .
  ?a dbpedia:wikilink dbpediares:actor .
  ?b dbpedia:wikilink dbpediares:actor .
  ?a dbpedia:placeOfBirth ?c .
  ?b dbpedia:placeOfBirth ?c .
  ?c owl:sameAs ?c2 .
  ?c2 pos:lat ?lat .
  ?c2 pos:long ?long .
}
Q7 on BTC2008 in [Neumann & Weikum, 2009]
Find actors that are married to each other and were born in the same place, together with the coordinates of that place
… and may do useful things
SELECT ?genedescription ?taxonomy ?interaction WHERE {
  ?interaction biopax2:PARTICIPANTS ?p .
  ?interaction biopax2:NAME ?interactionname .
  ?p biopax2:PHYSICAL-ENTITY ?protein .
  ?protein skos:exactMatch ?uniprotaccession .
  ?uniprotaccession core:organism ?taxonomy .
  ?taxonomy core:scientificName 'Homo sapiens' .
  ?geneid gene:uniprotAccession ?uniprotaccession .
  ?geneid gene:description ?genedescription .
  ?geneid gene:chromosome 'Y' .
}
LLD8 on Bio2RDF in [Schwarte et al., 2012]
Select all human genes located on the Y-chromosome with known molecular interactions
Alternatives for Query Processing
• Centralized repository of semantic data (data warehouse)
• Virtual integration of SPARQL endpoints through federation
• Live exploration of URIs in query and partial results
• Hybrid methods combining any of the above
Choice of method depends on (among others)
• availability of data from sources
• application requirements (speed, freshness, availability)
Good overview article by Olaf Hartig in Datenbank-Spektrum 13(2), 2013
Centralized Repository
Simple to set up:
– Copy all data to central server in offline process
– Execute query at central server
[Figure: three data sources are copied into a central repository, which answers the query]
Many problems: volume of data (>31 billion triples), changes of base data, sources may not provide RDF dump (only SPARQL access)
Advantages: Highly efficient, highly available, easy to add reasoning
Querying with Live Exploration
Example: evaluate the triple pattern <http://dbpedia.org/Braunschweig> :population ?p
1. HTTP GET on the URI at dbpedia.org
2. Response: ... <http://dbpedia.org/Braunschweig> rdfs:label „Braunschweig“; dbpedia:country dbpedia:Germany; :population 250556; ...
3. Resulting binding: ?p = 250556
Pros:
• Always executed on the live data
• No SPARQL endpoint or dataset needed
• Query „self-indexing“
Cons:
• Sources may be slow to respond
• Query optimization difficult for complex queries or frequent constants
Outline of the Talk
• Introduction
• Querying Federations of Knowledge Bases
  – FedX
  – BBQ
• Challenges
• Cooperative Knowledge Services
Federated Query Processing
Advantages:
• Access to live data
• No local storage and maintenance
• On-demand access to sources
[Figure: a query is sent to the federation layer, which accesses three data sources through their SPARQL endpoints]
Federation layer at central server
• Computes (distributed) execution plan
• Fetches subresults from local sources (SPARQL)
• Combines subresults
But:
• Sources provide only limited level of cooperation
• Only limited information about data in each source
• User must select sources to include in federation
Naive Federated Processing
• Iteratively evaluate triple patterns at all sources
• For each resulting binding, fill value in next triple pattern and submit to all sources (nested loop join)
• Continue until all patterns are evaluated
Example: 3 triple patterns, 4 sources
?Country ns:capital ?Capital .
?Country ns:population ?CountryPop .
?Capital ns:population ?CapitalPop .
Step 1: Evaluate the first pattern at all sources:
• 200 results from source 1: (?Country, ?Capital) bindings
• no results from other sources
• overall 4 requests
Step 2: For each of the 200 ?Country bindings:
• replace ?Country by value (e.g., „Austria ns:population ?CP“)
• submit to all sources
• overall 200*4 requests, 100 results (the same from sources 2 and 3)
Step 3: For each of the 100 ?Capital bindings with matching ?Country bindings:
• replace ?Capital by value (e.g., „Wien ns:population ?CP“)
• submit to all sources
• overall 100*4 requests, 100 results
Many unnecessary requests: Sources do not have results or overlap in results; inefficient NL join
Our approach: Apply techniques from logical, physical, and cost-based query optimization
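The request counts of the example add up as follows. This small Python calculation only restates the arithmetic from the slide; the function name `naive_request_count` is hypothetical.

```python
def naive_request_count(bindings_per_step, num_sources):
    """Count requests of the naive nested-loop strategy.

    bindings_per_step[i] is the number of input bindings for pattern i;
    every binding is sent as one request to every source.
    """
    return sum(n * num_sources for n in bindings_per_step)


# Pattern 1 has no input bindings, so it is sent once to each source.
# Pattern 2 is instantiated for 200 ?Country bindings.
# Pattern 3 is instantiated for 100 ?Capital bindings.
total_requests = naive_request_count([1, 200, 100], num_sources=4)
print(total_requests)  # 4 + 800 + 400 = 1204
```

In this example, more than 1,200 remote requests are issued to produce 100 final results, which is what motivates the optimization techniques that follow.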
Query Optimization in FedX
Specific optimization techniques in FedX:
1. Source Selection
2. Exclusive Groups
3. Join Order
4. Bound Joins
A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: ESWC 2011 (demo), ISWC 2011
Technique 1: Source Selection
Which sources contribute results for a pattern?
• One SPARQL ASK request per source
• Local cache to reduce remote communication (with time-based invalidation), which saves requests on subsequent queries with this pattern
• Annotate triple patterns with relevant sources (for constructing the query)
Example: Federation (DBpedia, NYTimes, LinkedMDB)
?Country ns:capital ?Capital .
DBpedia: ASK { ?Country ns:capital ?Capital . } → TRUE
NYTimes: ASK { ?Country ns:capital ?Capital . } → FALSE
LinkedMDB: ASK { ?Country ns:capital ?Capital . } → FALSE
Only DBpedia is relevant for this triple pattern
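A minimal Python sketch of ASK-based source selection with a time-based cache might look as follows. It is not FedX code. The class `SourceSelector`, the `fake_ask` function, and the contents of `HAS_PREDICATE` are hypothetical stand-ins for real SPARQL ASK requests against remote endpoints.

```python
import time


class SourceSelector:
    """Finds relevant sources per triple pattern, caching ASK answers."""

    def __init__(self, ask, sources, ttl_seconds=300.0, clock=time.monotonic):
        self.ask = ask              # callable(source, pattern) -> bool
        self.sources = sources
        self.ttl = ttl_seconds      # time-based invalidation of cache entries
        self.clock = clock
        self.cache = {}             # (source, pattern) -> (answer, timestamp)
        self.remote_requests = 0

    def relevant_sources(self, pattern):
        relevant = []
        for source in self.sources:
            key = (source, pattern)
            now = self.clock()
            entry = self.cache.get(key)
            if entry is None or now - entry[1] > self.ttl:
                # Cache miss or expired entry: send one ASK to the source.
                self.remote_requests += 1
                entry = (self.ask(source, pattern), now)
                self.cache[key] = entry
            if entry[0]:
                relevant.append(source)
        return relevant


# Simulated endpoints (assumption): which predicates each source knows.
HAS_PREDICATE = {
    "DBpedia": {"ns:capital", "ns:population"},
    "NYTimes": set(),
    "LinkedMDB": set(),
}


def fake_ask(source, pattern):
    return pattern[1] in HAS_PREDICATE[source]


selector = SourceSelector(fake_ask, ["DBpedia", "NYTimes", "LinkedMDB"])
pattern = ("?Country", "ns:capital", "?Capital")
first = selector.relevant_sources(pattern)
requests_after_first = selector.remote_requests
second = selector.relevant_sources(pattern)   # answered from the cache
print(first, requests_after_first, selector.remote_requests)
```

The second lookup for the same pattern issues no further remote requests, which is the saving the cache is meant to provide.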
Technique 2: Exclusive Groups
Group joining triple patterns with the same single relevant source
• Needs only a single request
• Evaluate join at the source, no communication needed
Example: Federation (DBpedia, NYTimes, LinkedMDB)
SELECT ?President ?Party ?Title WHERE {
  ?President rdf:type dbpedia:President .   # @ DBpedia
  ?President dbpedia:Party ?Party .         # @ DBpedia
  ?President dc:title ?Title .              # @ DBpedia, NYTimes
}
Source selection annotates each pattern with its relevant sources; the first two patterns form an exclusive group at DBpedia.
Execute multiple triple patterns in a single request
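The grouping step can be sketched in a few lines of Python. This is a simplified illustration under the assumption that all patterns in a group join with each other, a condition the sketch does not check. The function name `exclusive_groups` is hypothetical.

```python
def exclusive_groups(annotated_patterns):
    """Split patterns into per-source exclusive groups and the rest.

    annotated_patterns: list of (pattern, relevant_sources) pairs as
    produced by source selection.
    """
    groups = {}      # source -> patterns that only this source can answer
    remaining = []   # patterns with several relevant sources
    for pattern, sources in annotated_patterns:
        if len(sources) == 1:
            groups.setdefault(sources[0], []).append(pattern)
        else:
            remaining.append((pattern, sources))
    return groups, remaining


annotated = [
    ("?President rdf:type dbpedia:President", ["DBpedia"]),
    ("?President dbpedia:Party ?Party", ["DBpedia"]),
    ("?President dc:title ?Title", ["DBpedia", "NYTimes"]),
]
groups, remaining = exclusive_groups(annotated)
print(groups)
print(remaining)
```

The two DBpedia-only patterns can then be sent to DBpedia as one subquery, while the dc:title pattern still has to be evaluated at both of its relevant sources.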
Technique 3: Join Order
Determine optimal execution order of
• triple patterns
• joins
in order to minimize intermediate results
Example: Federation (DBpedia, LinkedMDB), 100 results
SELECT ?actor WHERE {
  ?actor rdf:type imdb:actor .   # >1 million results in LinkedMDB
  ?actor bornIn Salzburg .       # 1000 results in DBPedia
}
Execute second triple pattern first
Need for selectivity and join statistics at federated level
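If cardinality estimates were available at the federation level, a simple greedy ordering could be sketched as below. This is an assumption-laden illustration, not the join ordering implemented in FedX. The function names and the concrete estimate of 1,000,000 (standing in for ">1 million") are hypothetical.

```python
def variables(pattern):
    """Variables of a triple pattern given as a whitespace-separated string."""
    return {term for term in pattern.split() if term.startswith("?")}


def greedy_join_order(estimates):
    """Order patterns by estimated result size, preferring connected ones.

    estimates: dict mapping pattern -> estimated number of results.
    """
    remaining = dict(estimates)
    order, bound = [], set()
    while remaining:
        # Prefer patterns sharing a variable with what is already evaluated,
        # to avoid cross products; fall back to all remaining patterns.
        connected = [p for p in remaining if variables(p) & bound]
        candidates = connected or list(remaining)
        best = min(candidates, key=lambda p: remaining[p])
        order.append(best)
        bound |= variables(best)
        del remaining[best]
    return order


order = greedy_join_order({
    "?actor rdf:type imdb:actor": 1_000_000,  # stand-in for ">1 million"
    "?actor bornIn Salzburg": 1000,
})
print(order)
```

With these estimates the selective bornIn pattern is evaluated first, so at most 1000 bindings are passed on to the second pattern.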
Technique 4: Bound Joins
Perform joins in a block nested loop fashion
• Connect bound triple patterns with SPARQL UNIONs
• Apply local post-processing to retain correctness
• Rename variables to represent original bindings
Example: Process join for patterns (?S type U) and (?S p ?O), where results for left argument (?S type U) are already computed
Block input: ?S=s1, ?S=s2, ?S=s3, ?S=s4, ?S=s5
Before (NLJ): one request per binding
SELECT ?O WHERE { s1 p ?O }
SELECT ?O WHERE { s2 p ?O }
SELECT ?O WHERE { s3 p ?O }
SELECT ?O WHERE { s4 p ?O }
SELECT ?O WHERE { s5 p ?O }
Now (bound joins): execute in a single remote request
SELECT ?O_1 ?O_2 .. ?O_5 WHERE {
  { s1 p ?O_1 } UNION
  { s2 p ?O_2 } UNION
  …
  { s5 p ?O_5 } }
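The rewriting into a single UNION query and the local post-processing step can be sketched in Python as follows. This is a minimal illustration rather than FedX code. The function names `bound_join_query` and `unrename`, and the representation of result rows as dictionaries keyed by variable name, are assumptions.

```python
def bound_join_query(subjects, predicate):
    """Build one UNION query for a block of subject bindings."""
    select_vars = " ".join(f"?O_{i}" for i in range(1, len(subjects) + 1))
    branches = " UNION ".join(
        f"{{ {s} {predicate} ?O_{i} }}" for i, s in enumerate(subjects, start=1)
    )
    return f"SELECT {select_vars} WHERE {{ {branches} }}"


def unrename(rows, subjects):
    """Map renamed result variables O_i back to (subject, object) pairs."""
    pairs = []
    for row in rows:
        for var, value in row.items():
            index = int(var.rsplit("_", 1)[1])   # "O_2" -> 2
            pairs.append((subjects[index - 1], value))
    return pairs


block = ["s1", "s2", "s3", "s4", "s5"]
query = bound_join_query(block, "p")
print(query)

# Hypothetical result rows: each row binds the variable of one UNION branch.
pairs = unrename([{"O_2": "x"}, {"O_5": "y"}], block)
print(pairs)
```

The variable index records which input binding produced each result, so the join with the already computed left argument can be completed locally.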
Evaluation
Benchmarks using FedBench: SPARQL Federation
Often large improvements over state‐of‐the‐art systems
Outline of the Talk
• Introduction
• Querying Federations of Knowledge Bases
  – FedX
  – BBQ
• Challenges
• Cooperative Knowledge Services
Revisiting the Source Selection Problem
SPARQL example (simplified – no prefixes etc.):
SELECT ?a WHERE {
  ?a dc:authorOf ?p .
  ?p dc:publishedAt SIGMOD2012 .
}
Source selection problem: Which of the 325 sources to query?
Obvious problem: overlapping sources
• Many sources contain the same facts
• Many duplicate results
• Many unnecessary requests
Example for Overlapping Sources
[Figure: overlapping result sets of Source 1, Source 2, and Source 3]
• 6 results overall
• 2 sources enough to retrieve all results
• Source 1 alone is „optimal“ if
  – only one access possible
  – or 5 results are enough
Our contribution: Determine „optimal“ set of sources without seeing the results
[SWIM@SIGMOD 2012]
Problem Definition
Given SPARQL query with triple patterns P and possible sources S, compute query plan qp ⊆ P × S (which pattern is executed at which source) such that
– all results are retrieved with a minimal number of requests to sources (minimal exact plan)
– as many results as possible are retrieved with |qp| ≤ max (maximize recall)
– as few requests as possible are performed to retrieve at least r results (minimal approximate plan)
BBQ: High-Level Solution Overview
1. Extend ASK operation to provide concise yet expressive summary of result bindings of each variable (instead of boolean yes/no)
2. Estimate source overlap with summaries
3. Select sources incrementally based on benefit
Functional properties of summaries for sets:
– Size of set (number of distinct elements)
– Size of union of two sets
– Size of intersection of two sets
– Summary smaller than the data
– Data must not be reproducible from the summary
Examples: Bloom Filters, kmv synopsis, …
Significant reduction of query cost compared to standard solutions
Source Selection for Single Triple Pattern
• Benefit of a source: number of new results it can contribute
• Incremental selection algorithm:
  – Maintain summary for union of results from sources already selected
  – Estimate source benefit from summary
  – Select source with highest benefit
• Stop when target (# results or # requests) reached
• Finally: Evaluate triple pattern at all selected sources; select more sources if too few results
Example (Single Triple Pattern)
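Summaries returned by the sources (number of results: bit vector):
Source 1: 5: 1 1 0 1 1 0
Source 2: 2: 0 1 1 0 0 0
Source 3: 3: 1 0 1 0 1 0
1. ASK each source
   current result summary: 0: 0 0 0 0 0 0
2. Select source with highest number of results
   current result summary: 5: 1 1 0 1 1 0 (Source 1)
3. Stop if stopping condition is met (recall or number of results)
4. Compute benefit for each remaining source (own result count minus estimated overlap with the current summary)
   Source 2: 2 - overlap(0 1 1 0 0 0, 1 1 0 1 1 0) = 2 - 1 = 1
   Source 3: 3 - overlap(1 0 1 0 1 0, 1 1 0 1 1 0) = 3 - 2 = 1
5. Select source with highest benefit
6. Continue with step 3
   current result summary: 6: 1 1 1 1 1 0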
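The example can be replayed with a short Python sketch of the greedy selection loop. It assumes that each summary is a pair of result count and bit vector, and that the overlap with the current summary is estimated as the number of bits set in both vectors, which reproduces the benefit values of the example. The function name `select_sources` and the tie-breaking by insertion order are assumptions, not part of BBQ as published.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def select_sources(summaries, target_results):
    """Greedy source selection for a single triple pattern.

    summaries: dict name -> (result_count, bit_vector_as_int)
    Returns (selected source names, estimated result count, union bits).
    """
    remaining = dict(summaries)
    selected = []
    current_count, current_bits = 0, 0

    def benefit(name):
        count, bits = remaining[name]
        # New results = own results minus estimated overlap with the union.
        return count - popcount(bits & current_bits)

    while remaining and current_count < target_results:
        best = max(remaining, key=benefit)   # ties: first in insertion order
        gain = benefit(best)
        if gain <= 0:
            break                            # no source adds new results
        selected.append(best)
        current_count += gain
        current_bits |= remaining.pop(best)[1]
    return selected, current_count, current_bits


summaries = {
    "Source 1": (5, 0b110110),
    "Source 2": (2, 0b011000),
    "Source 3": (3, 0b101010),
}
selected, count, bits = select_sources(summaries, target_results=6)
print(selected, count, format(bits, "06b"))
```

Sources 2 and 3 both have benefit 1 after Source 1 is chosen, so either one completes the result. The sketch simply takes the first.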
Star-Shaped Queries
Multiple triple patterns with a single identical variable
• Not enough to consider each triple pattern separately
• Need to focus on the intersection of the result sets
• Extended incremental algorithm:
  – Init: Pick one source for each triple pattern with most results
  – Benefit of evaluating a triple pattern at a source: number of new results in the intersection
  – Estimated by intersection of per-pattern summaries (union of summaries from each selected source)
Example:
?x imdb:gender „female“ .
?x imdb:bornIn dbpedia:Germany .
?x imdb:actedIn imdb:Titanic .
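The estimate for star-shaped queries can be illustrated with bit-vector summaries: per pattern, the summaries of the selected sources are combined by union (bitwise OR), and the per-pattern unions are then intersected (bitwise AND). The sketch below is a simplification that counts set bits as the size estimate. The bit vectors and the function names are hypothetical.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def star_estimate(selected):
    """Estimate the result size of a star-shaped query.

    selected: one list per triple pattern, holding the bit-vector
    summaries of the sources currently selected for that pattern.
    """
    intersection = None
    for per_source in selected:
        union = 0
        for bits in per_source:
            union |= bits                  # union over selected sources
        intersection = union if intersection is None else intersection & union
    return popcount(intersection or 0)


def benefit(selected, pattern_index, new_bits):
    """Estimated new results in the intersection if a source is added."""
    extended = [list(s) for s in selected]
    extended[pattern_index].append(new_bits)
    return star_estimate(extended) - star_estimate(selected)


# Hypothetical summaries for the three patterns of the example.
selected = [
    [0b110110],             # ?x imdb:gender „female“
    [0b011000, 0b000110],   # ?x imdb:bornIn dbpedia:Germany (two sources)
    [0b010100],             # ?x imdb:actedIn imdb:Titanic
]
estimate = star_estimate(selected)
gain = benefit(selected, 2, 0b000010)
print(estimate, gain)
```

A source is only worth adding for a pattern if it increases the estimated intersection, even when it contributes many results for that pattern in isolation.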
Complex Queries
Queries with >1 variable and >1 triple patterns
Example:
imdb:Tom_Cruise imdb:actedIn ?m .
?m imdb:producedBy ?p .
• Summaries not applicable for whole query:
  – no connection of summaries for variables ?m and ?p
  – Do new bindings for ?p join with existing bindings for ?m?
• But: separate source selection for each pattern possible
• Plus: exclude join candidates at execution time
  – reduces effort for nested-loop joins
  – run full query at sources if no cross-joins possible
naive: 3x3 joins; improved: 6 joins; best: 3 local joins
Experimental Evaluation: Setup
• RDF Dataset from first 100,000 IMDB movies and their actors and directors
• Generate overlapping partitions
  – For movies based on genre (28 partitions)
  – For persons based on birthplace and birthdate (22 partitions)
• Queries:
  – 20 single triple patterns
  – 20 star-shaped queries
• Consider minimal exact plan
• Bloom filters of different sizes, kmv synopsis
Triple Pattern Queries
Far fewer requests while retrieving (almost) all results
Extensions
• Include owl:sameAs information when checking for overlap: BBQ with SALSA
• Use precomputed index instead of extended ASK queries: DAW (DERI & U Leipzig @ ISWC 2013)
Outline of the Talk
• Introduction
• Querying Federations of Knowledge Bases
• Challenges
• Cooperative Knowledge Services
Challenge 1: Optimization & QP
• Increase sources' level of cooperation:
  – Export extensive selectivity and join statistics (improves federated join order)
  – Interfaces beyond SPARQL (enables more efficient joins)
• Caching of data at federated level (reduces latency and risk of unavailable sources)
• Best-effort execution for given cost budget (time, messages, money), considering
  – overlap of sources
  – fraction of results retrieved
Challenge 2: Data Quality
• Data provided by sources may be incorrect for a number of reasons:
  – outdated: (Ralf_Schenkel, worksFor, MPI-INF)
  – incomplete: (Leslie_Lamport, hasWon, TuringAward) missing from KB
  – extraction errors
  – missing/incorrect links between entities
• Unclear how to measure quality of a source or how to assess correctness of a single fact
• Some applications need quality guarantees
Challenge 3: Heterogeneity of Sources
• Integrate semantic data from many different kinds of sources in a single query processor
  – SPARQL endpoints
  – Downloadable datasets
  – Dereferenceable URIs
  – Web services
  – Relational databases, XML, …
  – Text documents, Web tables, Excel sheets, …
Outline of the Talk
• Introduction
• Querying Federations of Knowledge Bases
• Challenges
• Cooperative Knowledge Services
Most of Today's Ontologies/Endpoints are Black Boxes
[Figure: internals of a typical knowledge base behind a SPARQL interface (queries in, results out). Components shown: source documents, information extraction, NLP tools, NER, WSD, fact candidates, base facts, rules, reasoner, type system, schema.]
• Often highly customized and tuned
• Impossible to replace components
• Impossible to replace sources
• No provenance, quality, freshness info
Collaborative Knowledge Services
• From monolithic knowledge bases to collaboration of specialized services
• Different classes of services:
  – Storage services (RDF stores, relational wrappers, Web services)
  – Creation services (Information extraction, crowd sourcing)
  – Reasoning services
  – Query and aggregation services
[SSW@VLDB2012]
On-the-fly Combination of Services
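[Figure: a query is answered by combining services on the fly]
Query: q = { ?movie rdf:type imdb:horror . BS_Cinema bs:shows ?movie }
Subqueries: q1, q2
Services shown: two aggregation services, IMDB, MovieDB, a reasoner, MyCines, an extractor for www.bs-cinema.de
Reasoner rule: ?show hasMovie ?movie, ?show happensAt ?cinema ⇒ ?cinema bs:shows ?movie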
Core Challenges of this Framework
• Per-query on-the-fly combination of services
• Description of service properties (functional and domain)
• Service quality and cost
• Uncertain and contradicting facts
• Trust and authority of services
• Provenance of results
• Feedback and exchange of information
• Query routing, optimization, processing
Linked Data Cloud is an instance of this framework!
Summary
• Fast-growing volume of semantic data and number of independent data sources are big challenges for data management
• Federated solutions as one viable alternative for querying Linked Open Data
• Cooperation beyond plain SPARQL interfaces required
• Next Big Thing: Collaborative Knowledge Services