
March 20, 2014 Ralf Schenkel

FEDERATED QUERY PROCESSING FOR LINKED DATA:

SOLUTIONS AND CHALLENGES

joint work with Luis Galárraga, Peter Haase, Katja Hose, Steffen Metzger, Michael Schmidt, Andreas Schwarte

Ralf Schenkel


Outline of the Talk

• Introduction
• Querying Federations of Knowledge Bases
• Challenges
• Cooperative Knowledge Services


Semantic Information

Resource Description Framework:
• Represent knowledge about resources (things) in a machine-readable way.
• Resources and their relations identified by URIs
• Statements (triples) with prefixes represent facts

Full URIs: <http://www.mpii.de/yago/resource/John_Doe> (subject), <http://xmlns.com/foaf/0.1/name> (predicate)

PREFIX yago: <http://www.mpii.de/yago/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Subject Predicate Object
yago:John_Doe foaf:name “John Doe”


RDF & SPARQL

RDF data can be seen as data graph

[Figure: data graph with nodes yago:John_Doe, “John Doe”, yago:Max_Mustermann, “Max Mustermann” and edges foaf:name, foaf:knows, foaf:name]

Subject Predicate Object

yago:John_Doe foaf:name “John Doe”

yago:John_Doe foaf:knows yago:Max_Mustermann

yago:Max_Mustermann foaf:name “Max Mustermann”

SPARQL: Query language for RDF from the W3C for graph pattern queries on the knowledge base


Ontologies for Representing Knowledge

[Figure: example ontology. Classes: resource, person, politician, scientists, location, city (linked by subclassOf). Instances/entities (URIs): e.g. Honolulu (linked to classes by isA). Relations: bornIn (with domain and range), bornOn (04‐08‐1961), label (“Barack Obama”, “44th president”).]

More complex „schema“ languages based on Description Logics (OWL, OWL2)


SPARQL – Example

Example query: Find all actors from Ontario (that are in the knowledge base)

[Figure: data graph with nodes Jim_Carrey, Mike_Myers, actor, Newmarket, Scarborough, Ontario, Canada and edges isA, bornIn, locatedIn]

Find subgraphs of this form:

[Figure: query graph with variables ?person and ?loc, constants actor and Ontario, and edges isA, bornIn, locatedIn]

SELECT ?person WHERE {
  ?person isA actor.
  ?person bornIn ?loc.
  ?loc locatedIn Ontario.
}
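The graph pattern query above can be sketched as a tiny in-memory matcher. This is only an illustration: the assignment of bornIn edges (Jim_Carrey to Newmarket, Mike_Myers to Scarborough) is an assumption read off the figure, and the helper names `match_pattern` and `evaluate` are hypothetical.

```python
# Toy triple store; which actor was born where is an assumption (see lead-in).
TRIPLES = {
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ontario", "locatedIn", "Canada"),
}


def match_pattern(pattern, triple, binding):
    """Return the binding extended so that pattern matches triple, or None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it or check the existing binding
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:  # constant: must match exactly
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate a conjunction of triple patterns by extending bindings step by step."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


QUERY = [
    ("?person", "isA", "actor"),
    ("?person", "bornIn", "?loc"),
    ("?loc", "locatedIn", "Ontario"),
]

actors = sorted({b["?person"] for b in evaluate(QUERY, TRIPLES)})
print(actors)  # ['Jim_Carrey', 'Mike_Myers']
```

Each pattern contributes bindings for its variables (?person, ?loc); constants (actor, Ontario) only filter.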


Examples for Semantic Data

• General Knowledge Bases: DBPedia, Freebase, YAGO
• Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
• Linked Data as large integrated knowledge base


Linked Data Principles

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things

http://www.w3.org/DesignIssues/LinkedData.html


Semantic Data Grows Rapidly

More than 31 billion triples in the LOD cloud
DBPedia: 3.6 million entities, 1.2 billion triples


Varying access to datasets

Semantic data can be provided
• as set of triples for download
• through a SPARQL endpoint
• by dereferencing URIs of entities

and may or may not come with
• links to other data sets (such as owl:sameAs)
• RDFS or OWL schema description
• meta data describing content (such as VoID)


Queries can be complex, too

SELECT DISTINCT ?a ?b ?lat ?long WHERE {
  ?a dbpedia:spouse ?b.
  ?a dbpedia:wikilink dbpediares:actor.
  ?b dbpedia:wikilink dbpediares:actor.
  ?a dbpedia:placeOfBirth ?c.
  ?b dbpedia:placeOfBirth ?c.
  ?c owl:sameAs ?c2.
  ?c2 pos:lat ?lat.
  ?c2 pos:long ?long.
}

Q7 on BTC2008 in [Neumann & Weikum, 2009]
Find actors that are married to each other and were born in the same place, together with the coordinates of that place


… and may do useful things

SELECT ?genedescription ?taxonomy ?interaction WHERE {
  ?interaction biopax2:PARTICIPANTS ?p .
  ?interaction biopax2:NAME ?interactionname .
  ?p biopax2:PHYSICAL-ENTITY ?protein .
  ?protein skos:exactMatch ?uniprotaccession .
  ?uniprotaccession core:organism ?taxonomy .
  ?taxonomy core:scientificName 'Homo sapiens' .
  ?geneid gene:uniprotAccession ?uniprotaccession .
  ?geneid gene:description ?genedescription .
  ?geneid gene:chromosome 'Y' .
}

LLD8 on Bio2RDF in [Schwarte et al., 2012]
Select all human genes located on the Y‐chromosome with known molecular interactions


Alternatives for Query Processing

• Centralized repository of semantic data (data warehouse)
• Virtual integration of SPARQL endpoints through federation
• Live exploration of URIs in query and partial results
• Hybrid methods combining any of the above

Choice of method depends on (among others)
• availability of data from sources
• application requirements (speed, freshness, availability)

Good overview article by Olaf Hartig in Datenbank‐Spektrum 13(2), 2013


Centralized Repository

Simple to set up:
– Copy all data to central server in offline process
– Execute query at central server

[Figure: three data sources feeding a central repository, which receives the query]

Many problems: volume of data (>31 billion triples), changes of base data, sources may not provide RDF dump (only SPARQL access)

Advantages: Highly efficient, highly available, easy to add reasoning


Querying with Live Exploration

Query pattern: <http://dbpedia.org/Braunschweig> :population ?p

HTTP GET to dbpedia.org returns:
...
<http://dbpedia.org/Braunschweig>
  rdf:label „Braunschweig“;
  dbpedia:country dbpedia:Germany;
  :population 250556;
...

Result: ?p = 250556

Pros:
Always executed on the live data
No SPARQL endpoint or dataset needed
Query „self-indexing“

Cons:
Sources may be slow to respond
Query optimization difficult for complex queries or frequent constants


Outline of the Talk

• Introduction
• Querying Federations of Knowledge Bases
  – FedX
  – BBQ
• Challenges
• Cooperative Knowledge Services


Federated Query Processing

Advantages:
• Access to live data
• No local storage and maintenance
• On‐demand access to sources

[Figure: the query goes to a federation layer, which accesses three data sources through their SPARQL endpoints]

Federation layer at central server
• Computes (distributed) execution plan
• Fetches subresults from local sources (SPARQL)
• Combines subresults

But:
• Sources provide only limited level of cooperation
• Only limited information about data in each source
• User must select sources to include in federation


Naive Federated Processing

• Iteratively evaluate triple patterns at all sources
• For each resulting binding, fill value in next triple pattern and submit to all sources (nested loop join)
• Continue until all patterns are evaluated

Example: 3 triple patterns, 4 sources

?Country ns:capital ?Capital.
?Country ns:population ?CountryPop.
?Capital ns:population ?CapitalPop.

First pattern: evaluate it at all sources:
• 200 results from source 1: (?Country, ?Capital) bindings
• no results from other sources
• overall 4 requests

Second pattern: for each of the 200 ?Country bindings:
• replace ?Country by value (e.g., „Austria ns:population ?CP“)
• submit to all sources
• overall 200*4 requests, 100 results (the same from sources 2 and 3)

Third pattern: for each of the 100 ?Capital bindings with matching ?Country binding:
• replace ?Capital by value (e.g., „Wien ns:population ?CP“)
• submit to all sources
• overall 100*4 requests, 100 results

Many unnecessary requests: Sources do not have results or overlap in results; inefficient NL join

Our approach: Apply techniques from logical, physical, and cost‐based query optimization
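The request count of the naive strategy in the example (3 triple patterns, 4 sources) can be tallied with a few lines. This is a back-of-the-envelope sketch; the function name `naive_request_count` is hypothetical.

```python
def naive_request_count(num_sources, bindings_per_step):
    """Requests issued by the naive nested-loop strategy.

    The first pattern is sent once to every source; every later pattern is
    sent to every source once per input binding.
    """
    requests = num_sources
    for n_bindings in bindings_per_step:
        requests += n_bindings * num_sources
    return requests


# 200 ?Country bindings feed the second pattern, 100 ?Capital bindings the third.
total = naive_request_count(num_sources=4, bindings_per_step=[200, 100])
print(total)  # 4 + 200*4 + 100*4 = 1204
```

More than a thousand remote requests for a three-pattern query is what the optimizations on the following slides aim to avoid.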


Query Optimization in FedX

Specific optimization techniques in FedX:
1. Source Selection
2. Exclusive Groups
3. Join Order
4. Bound Joins

A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: ESWC 2011 (demo), ISWC 2011


Technique 1: Source Selection

Which sources contribute results for a pattern?
• One SPARQL ASK request per source
• Local cache to reduce remote communication (with time-based invalidation); saves requests on subsequent queries with this pattern
• Annotate triple patterns with relevant sources (for constructing the query)

Example: Federation (DBpedia, NYTimes, LinkedMDB)

?Country ns:capital ?Capital.

DBpedia:   ASK WHERE { ?Country ns:capital ?Capital. }   TRUE
NYTimes:   ASK WHERE { ?Country ns:capital ?Capital. }   FALSE
LinkedMDB: ASK WHERE { ?Country ns:capital ?Capital. }   FALSE

Only DBpedia is relevant for this triple pattern
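ASK-based source selection with a time-invalidated cache might look roughly like the sketch below. It is not FedX code: `Source` is an in-memory stand-in for a SPARQL endpoint that answers ASK by predicate only (a simplification), the predicate sets for NYTimes and LinkedMDB are made up, and `SourceSelector` and its `ttl_seconds` parameter are hypothetical names.

```python
import time


class Source:
    """In-memory stand-in for a SPARQL endpoint (assumption, not a real client)."""

    def __init__(self, name, predicates):
        self.name = name
        self.predicates = set(predicates)
        self.ask_count = 0  # number of simulated remote ASK requests

    def ask(self, pattern):
        self.ask_count += 1
        _, predicate, _ = pattern
        return predicate in self.predicates  # simplified ASK: predicate check only


class SourceSelector:
    def __init__(self, sources, ttl_seconds=300.0, clock=time.monotonic):
        self.sources = sources
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}  # (source name, pattern) -> (answer, timestamp)

    def relevant_sources(self, pattern):
        """Annotate a triple pattern with the sources that can contribute results."""
        relevant = []
        for source in self.sources:
            key = (source.name, pattern)
            now = self.clock()
            entry = self.cache.get(key)
            if entry is None or now - entry[1] > self.ttl:  # miss or expired entry
                entry = (source.ask(pattern), now)
                self.cache[key] = entry
            if entry[0]:
                relevant.append(source.name)
        return relevant


federation = [
    Source("DBpedia", {"ns:capital", "ns:population"}),
    Source("NYTimes", {"nyt:topic"}),      # made-up predicate
    Source("LinkedMDB", {"imdb:actor"}),   # made-up predicate
]
selector = SourceSelector(federation)
pattern = ("?Country", "ns:capital", "?Capital")
first = selector.relevant_sources(pattern)
second = selector.relevant_sources(pattern)  # served from the cache
ask_counts = [s.ask_count for s in federation]
print(first, ask_counts)  # ['DBpedia'] [1, 1, 1]
```

The second lookup issues no further ASK requests, which is the saving the cache is meant to provide for repeated patterns.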


Technique 2: Exclusive Groups

Group joining triple patterns with the same single relevant source

• Needs only a single request
• Evaluate join at the source, no communication needed

Example: Federation (DBpedia, NYTimes, LinkedMDB)

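SELECT ?President ?Party ?Title WHERE {
  ?President rdf:type dbpedia:President .
  ?President dbpedia:Party ?Party .
  ?President dc:title ?Title .
}

Source Selection:
  ?President rdf:type dbpedia:President    @ DBpedia
  ?President dbpedia:Party ?Party          @ DBpedia
  ?President dc:title ?Title               @ DBpedia, NYTimes

Exclusive Group: the first two patterns (only relevant @ DBpedia)
Execute multiple triple patterns in a single request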
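Forming exclusive groups from source-annotated patterns can be sketched as follows. The sketch groups by the single relevant source only and, as a simplification, does not check that the grouped patterns actually join on a shared variable; `exclusive_groups` is a hypothetical helper, not the FedX API.

```python
from collections import defaultdict


def exclusive_groups(annotated_patterns):
    """Split patterns into per-source exclusive groups and the remaining patterns.

    annotated_patterns: list of (pattern string, set of relevant source names).
    """
    groups = defaultdict(list)
    remaining = []
    for pattern, sources in annotated_patterns:
        if len(sources) == 1:  # exactly one relevant source: exclusive
            groups[next(iter(sources))].append(pattern)
        else:
            remaining.append((pattern, sources))
    return dict(groups), remaining


annotated = [
    ("?President rdf:type dbpedia:President", {"DBpedia"}),
    ("?President dbpedia:Party ?Party", {"DBpedia"}),
    ("?President dc:title ?Title", {"DBpedia", "NYTimes"}),
]
groups, remaining = exclusive_groups(annotated)

# One request to DBpedia evaluates the whole group, including its join.
subquery = "SELECT * WHERE { " + " . ".join(groups["DBpedia"]) + " }"
print(subquery)
```

The third pattern stays outside the group because two sources are relevant for it, so it still has to be sent to both.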


Technique 3: Join Order

Determine optimal execution order of
• triple patterns
• joins
in order to minimize intermediate results

Example: Federation (DBpedia, LinkedMDB), 100 results

SELECT ?actor WHERE {
  ?actor rdf:type imdb:actor .    (>1 million results in LinkedMDB)
  ?actor bornIn Salzburg .        (1000 results in DBPedia)
}

Execute second triple pattern first

Need for selectivity and join statistics at federated level
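If cardinality estimates were available at the federation level, the simplest ordering rule would be to evaluate the most selective pattern first. The sketch below assumes such estimates exist (the slide notes that obtaining them is the open need); `order_patterns` is a hypothetical helper, and 1,000,000 stands in for the ">1 million" lower bound from the example.

```python
def order_patterns(patterns_with_estimates):
    """Order triple patterns by ascending estimated result count."""
    ranked = sorted(patterns_with_estimates, key=lambda item: item[1])
    return [pattern for pattern, _ in ranked]


estimates = [
    ("?actor rdf:type imdb:actor", 1_000_000),  # lower bound from the slide
    ("?actor bornIn Salzburg", 1_000),
]
plan = order_patterns(estimates)
print(plan[0])  # ?actor bornIn Salzburg
```

Starting with the 1000 Salzburg bindings means only those are joined against the actor pattern, instead of shipping more than a million actor bindings first.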


Technique 4: Bound Joins

Perform joins in a block nested loop fashion
• Connect bound triple patterns with SPARQL UNION
• Apply local post-processing to retain correctness
• Rename variables to represent original bindings

Example: Process join for patterns (?S type U) and (?S p ?O), where results for left argument (?S type U) are already computed

Block input:
?S=s1
?S=s2
?S=s3
?S=s4
?S=s5

Before (NLJ):
SELECT ?O WHERE { s1 p ?O }
SELECT ?O WHERE { s2 p ?O }
SELECT ?O WHERE { s3 p ?O }
SELECT ?O WHERE { s4 p ?O }
SELECT ?O WHERE { s5 p ?O }

Now (bound joins):
SELECT ?O_1 ?O_2 .. ?O_5 WHERE {
  { s1 p ?O_1 } UNION
  { s2 p ?O_2 } UNION
  …
  { s5 p ?O_5 } }

Execute in a single remote request
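Building the UNION query for one block of bindings, and mapping the renamed variables back to the bindings they came from, could be sketched like this. The function names and the row format (one dict per result row, keyed by variable name without the leading `?`) are assumptions for illustration, not the FedX implementation.

```python
def bound_join_query(block, predicate="p", var="O"):
    """Build one UNION query that evaluates (?S p ?O) for a block of ?S bindings."""
    numbered = list(enumerate(block, start=1))
    select_vars = " ".join(f"?{var}_{i}" for i, _ in numbered)
    branches = " UNION ".join(
        f"{{ {subject} {predicate} ?{var}_{i} }}" for i, subject in numbered
    )
    return f"SELECT {select_vars} WHERE {{ {branches} }}"


def map_back(rows, block):
    """Local post-processing: turn renamed variables back into (?S, ?O) pairs."""
    pairs = []
    for row in rows:
        for name, value in row.items():
            index = int(name.rsplit("_", 1)[1]) - 1  # "O_3" -> position 2
            pairs.append((block[index], value))
    return pairs


block = ["s1", "s2", "s3", "s4", "s5"]
query = bound_join_query(block)
print(query)

# Hypothetical rows returned by the endpoint: each row binds one ?O_i.
rows = [{"O_1": "a"}, {"O_3": "b"}, {"O_3": "c"}]
print(map_back(rows, block))  # [('s1', 'a'), ('s3', 'b'), ('s3', 'c')]
```

The index in the variable name is what lets the federation layer recover which input binding produced each result without extra requests.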


Evaluation

Benchmarks using FedBench: SPARQL Federation

Often large improvements over state‐of‐the‐art systems


Outline of the Talk

• Introduction
• Querying Federations of Knowledge Bases
  – FedX
  – BBQ
• Challenges
• Cooperative Knowledge Services


Revisiting the Source Selection Problem

SPARQL example (simplified – no prefixes etc.):
SELECT ?a WHERE {
  ?a dc:authorOf ?p.
  ?p dc:publishedAt SIGMOD2012 .
}

Source selection problem: Which of the 325 sources to query?

Many sources contain the same facts, hence many duplicate results

Many unnecessary requests

Obvious problem: overlapping sources


Example for Overlapping Sources

[Figure: overlapping result sets of Source 1, Source 2, and Source 3]

• 6 results overall
• 2 sources enough to retrieve all results
• Source 1 alone is „optimal“ if
  – only one access possible
  – or 5 results are enough

Our contribution: Determine „optimal“ set of sources without seeing the results

[SWIM@SIGMOD 2012]


Problem Definition

Given SPARQL query with triple patterns P and possible sources S, compute query plan qp ⊆ P × S (which pattern is executed at which source) such that
– all results are retrieved with a minimal number of requests to sources (minimal exact plan)
– as many results as possible are retrieved with |qp| ≤ max (maximize recall)
– as few requests as possible are performed to retrieve at least r results (minimal approximate plan)


BBQ: High-Level Solution Overview

1. Extend ASK operation to provide concise yet expressive summary of result bindings of each variable (instead of boolean yes/no)
2. Estimate source overlap with summaries
3. Select sources incrementally based on benefit

Functional properties of summaries for sets:
– Size of set (number of distinct elements)
– Size of union of two sets
– Size of intersection of two sets
– Summary smaller than the data
– Data not be reproducible from the summary

Examples: Bloom Filters, kmv synopsis, …

Significant reduction of query cost compared to standard solutions
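A Bloom filter is one way to obtain the set properties listed above. The sketch below is a generic textbook construction, not the summary structure used in BBQ: the class name `BloomSummary`, the parameters `m` (bits) and `k` (hash functions), and the SHA-256 based hashing are all assumptions. Set size is estimated from the number of set bits, the union is a bitwise OR, and the intersection follows from inclusion-exclusion.

```python
import hashlib
import math


class BloomSummary:
    """Fixed-size set summary; m and k are illustrative defaults."""

    def __init__(self, m=8192, k=3, bits=0):
        self.m, self.k, self.bits = m, k, bits

    def add(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % self.m
            self.bits |= 1 << position

    def union(self, other):
        # Both summaries must share m and k for the OR to be meaningful.
        return BloomSummary(self.m, self.k, self.bits | other.bits)

    def estimated_size(self):
        """Standard estimate n ~ -(m/k) * ln(1 - X/m), X = number of set bits."""
        set_bits = bin(self.bits).count("1")
        if set_bits >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)


def estimated_intersection(a, b):
    """Inclusion-exclusion: |A ∩ B| ~ |A| + |B| - |A ∪ B|."""
    return a.estimated_size() + b.estimated_size() - a.union(b).estimated_size()


source_a, source_b = BloomSummary(), BloomSummary()
for i in range(100):          # results 0..99
    source_a.add(f"result{i}")
for i in range(50, 150):      # results 50..149, so 50 are shared
    source_b.add(f"result{i}")

size_a = source_a.estimated_size()
overlap = estimated_intersection(source_a, source_b)
print(round(size_a), round(overlap))
```

The estimates are approximate by design; a larger `m` trades summary size for accuracy.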


Source Selection for Single Triple Pattern

• Benefit of a source: number of new results it can contribute
• Incremental selection algorithm:
  – Maintain summary for union of results from sources already selected
  – Estimate source benefit from summary
  – Select source with highest benefit
• Stop when target (# results or # requests) reached
• Finally: Evaluate triple pattern at all selected sources; select more sources if too few results


Example (Single Triple Pattern)

Summaries per source (number of results: bit vector):
Source 1   5: 1 1 0 1 1 0
Source 2   2: 0 1 1 0 0 0
Source 3   3: 1 0 1 0 1 0

1. ASK each source
   current result summary   0: 0 0 0 0 0 0
2. Select source with highest number of results
   current result summary   5: 1 1 0 1 1 0
3. Stop if stopping condition is met (recall or number of results)
4. Compute benefit for each remaining source
   Source 2: 2 ‐ |(2: 0 1 1 0 0 0) ∩ (5: 1 1 0 1 1 0)| = 1
   Source 3: 3 ‐ |(3: 1 0 1 0 1 0) ∩ (5: 1 1 0 1 1 0)| = 1
5. Select source with highest benefit
6. Continue with step 3
   current result summary   6: 1 1 1 1 1 0
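The incremental selection can be replayed on the three summaries of this example. In this sketch the overlap between a source and the current result summary is approximated by the number of shared set bits, which is a crude stand-in for a proper intersection estimate; `select_sources` and its parameters are hypothetical names.

```python
def popcount(bits):
    return bin(bits).count("1")


def benefit(summary, current_bits):
    """Estimated new results: result count minus bits already covered."""
    count, bits = summary
    return count - popcount(bits & current_bits)


def select_sources(summaries, target_results=None, max_requests=None):
    """Greedy source selection; summaries maps name -> (result count, bit vector)."""
    remaining = dict(summaries)
    selected, current_bits, current_count = [], 0, 0
    while remaining:
        if target_results is not None and current_count >= target_results:
            break
        if max_requests is not None and len(selected) >= max_requests:
            break
        # Ties are broken in favor of the source listed first.
        best = max(remaining, key=lambda name: benefit(remaining[name], current_bits))
        gain = benefit(remaining[best], current_bits)
        if gain <= 0:  # no remaining source is expected to add new results
            break
        selected.append(best)
        current_bits |= remaining[best][1]
        current_count += gain
        del remaining[best]
    return selected, current_count


SUMMARIES = {
    "Source 1": (5, 0b110110),
    "Source 2": (2, 0b011000),
    "Source 3": (3, 0b101010),
}

plan, expected_results = select_sources(SUMMARIES)
print(plan, expected_results)  # ['Source 1', 'Source 2'] 6
```

With `target_results=5` or `max_requests=1`, the same function stops after Source 1, matching the earlier observation that Source 1 alone is "optimal" in those cases.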


Star-Shaped Queries

Multiple triple patterns with a single identical variable

• Not enough to consider each triple pattern separately
• Need to focus on the intersection of the result sets
• Extended incremental algorithm:
  – Init: Pick one source for each triple pattern with most results
  – Benefit of evaluating a triple pattern at a source: number of new results in the intersection
  – Estimated by intersection of per-pattern summaries (union of summaries from each selected source)

?x imdb:gender „female“.
?x imdb:bornIn dbpedia:Germany.
?x imdb:actedIn imdb:Titanic.


Complex Queries

Queries with >1 variable and >1 triple patterns

• Summaries not applicable for whole query:
  – no connection of summaries for variables ?m and ?p
  – Do new bindings for ?p join with existing bindings for ?m?
• But: separate source selection for each pattern possible
• Plus: exclude join candidates at execution time
  – reduces effort for nested-loop joins
  – run full query at sources if no cross-joins possible

imdb:Tom_Cruise imdb:actedIn ?m.
?m imdb:producedBy ?p.

naive: 3x3 joins
improved: 6 joins
best: 3 local joins


Experimental Evaluation: Setup

• RDF Dataset from first 100,000 IMDB movies and their actors and directors
• Generate overlapping partitions
  – For movies based on genre (28 partitions)
  – For persons based on birthplace and birthdate (22 partitions)
• Queries:
  – 20 single triple patterns
  – 20 star-shaped queries
• Consider minimal exact plan
• Bloom filters of different sizes, kmv synopsis


Triple Pattern Queries

Far fewer requests while retrieving (almost) all results


Extensions

• Include owl:sameAs information when checking for overlap: BBQ with SALSA
• Use precomputed index instead of extended ASK queries: DAW (DERI & U Leipzig @ ISWC 2013)


Outline of the Talk



Challenge 1: Optimization & QP

• Increase sources‘ level of cooperation:
  – Export extensive selectivity and join statistics (improves federated join order)
  – Interfaces beyond SPARQL (enables more efficient joins)
• Caching of data at federated level (reduces latency and risk of unavailable sources)
• Best-effort execution for given cost budget (time, messages, money), considering
  – overlap of sources
  – fraction of results retrieved


Challenge 2: Data Quality

• Data provided by sources may be incorrect for a number of reasons:
  – outdated: (Ralf_Schenkel, worksFor, MPI-INF)
  – incomplete: (Leslie_Lamport, hasWon, TuringAward) missing from KB
  – extraction errors
  – missing/incorrect links between entities
• Unclear how to measure quality of a source or how to assess correctness of a single fact

• Some applications need quality guarantees


Challenge 3: Heterogeneity of Sources

• Integrate semantic data from many different kinds of sources in a single query processor
  – SPARQL endpoints
  – Downloadable datasets
  – Dereferenceable URIs
  – Web services
  – Relational databases, XML, …
  – Text documents, Web tables, Excel sheets, …


Outline of the Talk



Most of Today‘s Ontologies/Endpoints are Black Boxes

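[Figure: a knowledge base as a black box that receives SPARQL and returns results; components shown: Source documents, Information Extraction, NLP Tools, NER, WSD, Fact Candidates, Base Facts, Type System, Schema, Rules, Reasoner]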

Often highly customized and tuned
Impossible to replace components
Impossible to replace sources
No provenance, quality, freshness info


Collaborative Knowledge Services

• From monolithic knowledge bases to collaboration of specialized services
• Different classes of services:
  – Storage services (RDF stores, relational wrappers, Web services)
  – Creation services (Information extraction, crowd sourcing)
  – Reasoning services
  – Query and aggregation services

[SSW@VLDB2012]


On-the-fly Combination of Services

[Figure: a query q is split into subqueries q1 and q2 and answered by combining services on the fly: Aggregation, IMDB MovieDB, Reasoner, MyCines Extractor (www.bs‐cinema.de)]

q = { ?movie rdf:type imdb:horror. BS_Cinema bs:shows ?movie }

Reasoner rule: ?show hasMovie ?movie, ?show happensAt ?cinema ⇒ ?cinema bs:shows ?movie


Core Challenges of this Framework

• Per-query on-the-fly combination of services
• Description of service properties (functional and domain)
• Service quality and cost
• Uncertain and contradicting facts
• Trust and authority of services
• Provenance of results
• Feedback and exchange of information
• Query routing, optimization, processing

Linked Data Cloud is an instance of this framework!


Summary

• Fast growing volume of semantic data and number of independent data sources are big challenges for data management
• Federated solutions as one viable alternative for querying Linked Open Data
• Cooperation beyond plain SPARQL interfaces required

• Next Big Thing: Collaborative Knowledge Services
