Towards distributed RDF querying

Citation for published version (APA): Vdovjak, R., Houben, G. J. P. M., & Stuckenschmidt, H. (2004). Towards distributed RDF querying. 1-1. Abstract from Dutch-Belgian Database Day 2004 (DBDBD 2004), December 3, 2004, Antwerp, Belgium.

Document status and date: Published: 01/01/2004

Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)


https://research.tue.nl/en/publications/towards-distributed-rdf-querying(6339b331-2cfe-4844-a41b-844d6715ed5b).html

Towards Distributed RDF Querying

Richard Vdovjak (TU Eindhoven)
Geert-Jan Houben (TU Eindhoven)
Heiner Stuckenschmidt (VU Amsterdam)

Layout

• Introduction / motivation
• Hera
• Integration model
• Query Processing
• Optimization

Motivation

• Demand for combining distributed data on the Web
  ◦ Comparative Shopping
  ◦ Virtual Museums
  ◦ Digital Libraries
  ◦ Web Portals
• User:
  ◦ formulate his query
  ◦ split the query
  ◦ assemble results

[Figure: the user sends Query 1, Query 2, and Query 3 separately to three Web sites.]

Sem. Web / RDF(S) to the Rescue (?)

• RDFS
  ◦ The Pivot Language of the Semantic Web
  ◦ Solves the problem of syntactic heterogeneity of different sources
  ◦ Provides basic modelling primitives and reasoning techniques for conceptual knowledge
  ◦ Built on an extremely simple data model that avoids complications of other object oriented formalisms
  ◦ And (most importantly): It is a standard!

The RDF data model

• statements are <Subject, Predicate, Object> triples:
  ◦ <“http://wwwis.win.tue.nl/~houben/”, dc:creator, “Geert-Jan”>
• statements describe properties of resources
• a resource is everything that has an identifier: URI
• Directed Labelled Multi-graph

[Figure: example graph with an edge labelled dc:creator.]
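As a minimal sketch of this data model, statements can be held as plain tuples and grouped into a directed labelled multigraph. Python is used for illustration only; the second triple and the helper name `outgoing` are assumptions, not part of the slides.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
triples = [
    ("http://wwwis.win.tue.nl/~houben/", "dc:creator", "Geert-Jan"),
    # Hypothetical second statement, added only to show several edges on one node.
    ("http://wwwis.win.tue.nl/~houben/", "dc:title", "Home page"),
]

def outgoing(statements):
    """Group statements by subject: node -> list of (edge label, target)."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = outgoing(triples)
```

Because edges are stored in a list per node, the same pair of nodes can be connected by several differently labelled edges, which is what makes the structure a multigraph.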

Sem. Web / RDF(S) to the Rescue (?)

• Benefits
  ◦ Common Syntax
  ◦ Formal (limited) Semantics
  ◦ Flexible: easy to express anything about anything
• Our User?
  ◦ formulate his query
  ◦ split the query
  ◦ assemble results

...(still unhappy)

[Figure: the user still sends Query 1, Query 2, and Query 3 separately, now to three repositories of RDF data.]

Possible Solution: Data Warehousing

• Gather All Data Locally
• Watch for Updates
  ◦ Insert New Data
  ◦ Delete Old Data
• Evaluate (locally) User Query

[Figure: data warehousing. RDF data from three repositories is gathered in one local repository, against which the user's query is evaluated.]

Data Warehousing: Problems

• Performance Bottleneck
• Freshness
• Copyright / Data Ownership issues

[Figure: Insert into Local Repository: Processing Speed. x-axis: Size (stored triplets * 1000), 0 to 80; y-axis: Triplets/s, 0 to 250.]

Another Solution: Virtual Repository

• Split the User Query into Sub-Queries
• Translate Sub-Queries to Source Schemata
• Distribute Sub-Queries
• Assemble Results

[Figure: virtual repository. The user's query goes to a mediator with a path index, which sends sub-queries to three repositories of RDF data.]
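The four steps (split, translate, distribute, assemble) can be sketched as a tiny in-memory mediator. This is only an illustration under stated assumptions: the source names, the source-side property names (`author`, `fullName`), and the sample data are hypothetical, and a remote call is replaced by a dictionary lookup.

```python
# Hypothetical sources: each maps its own property name to (subject, object) pairs.
SOURCES = {
    "source1": {"author": [("comic1", "writer1")]},
    "source2": {"fullName": [("writer1", "Alice")]},
}
# Hypothetical mapping: mediator property -> (source, property in the source schema).
MAPPING = {
    "hasWriter": ("source1", "author"),
    "name": ("source2", "fullName"),
}

def split(query_path):
    """Split a path query into one single-property sub-query per step."""
    return [[prop] for prop in query_path]

def translate(sub_query):
    """Translate a sub-query to the schema of the source that can answer it."""
    return MAPPING[sub_query[0]]

def distribute(source, local_prop):
    """Stand-in for a remote call: look the property up in the source."""
    return SOURCES[source].get(local_prop, [])

def assemble(tables):
    """Chain partial results: the object of one step is the subject of the next."""
    rows = [list(pair) for pair in tables[0]]
    for table in tables[1:]:
        rows = [row + [obj] for row in rows for subj, obj in table if subj == row[-1]]
    return rows

def answer(query_path):
    tables = [distribute(*translate(sq)) for sq in split(query_path)]
    return assemble(tables)

result = answer(["hasWriter", "name"])
```

Here the user asks one query over the mediator's vocabulary and never sees that two different sources, with different property names, contributed to the answer.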

Challenges / Requirements for Integration of RDF(S) Sources

• Unified Interface to data sources
  ◦ Usually only a part of a source is of interest
  ◦ “Freshness” of data must be guaranteed
  ◦ Sources are many, autonomous, and volatile
  ◦ Semantic Heterogeneity
    ▪ Schema heterogeneity
    ▪ Designation heterogeneity
• Distributed Query processing
  ◦ Complex/Join queries (SeRQL language)
  ◦ Flexibility w.r.t. user needs/query
    ▪ (the order of importance changes)
  ◦ Correctness, Completeness
  ◦ Performance

Distinguishing aspects from Distributed Databases

Hera Design Methodology

[Figure: Hera Suite and design methodology. Design phases: Requirements Analysis, Conceptual Design, Integration Design, Application Design, Adaptation Design, Presentation Design. Models: Integration Model, Conceptual Model, Application Model, User Model, Presentation Model. Engines: Integration Engine, Application Engine, Adaptation Engine, Presentation Engine, Cuypers Engine. Layers: Semantic Layer, Application Layer, Presentation Layer. Other labels: (Search) Agent, End User, info request, (meta) data, info request (slice), presentation, RQL / RDF, XSLT / XML, HTML/WML.]

Hera

• Data retrieval is just a beginning
• Navigational Structure
• Presentation Rendering
• Adaptation / Adaptivity

[Figure: The Semantic Layer. Labels: Hera Front-end (User Query, Presentation, html/smil), Conceptual Model Access Point, Hera Back-end (Integration Model (IM), IM Instance), Source Clusters, (Search) Agent request, RDF/data.]

Conceptual Model (CM)

• Interface between data retrieval and presentation generation (via SeRQL)
• CM exists on its own (even without instances)
  ◦ Made by a system designer
  ◦ Top down approach
• Specifies the application’s semantics: what the information system is about
• Expressed in RDF(S)
• Populated on demand with data from several heterogeneous information sources
• Challenge: Map the Sources to the defined CM


Integration Model

Integration Model (IM)

• A generic framework for describing, integrating and relating concepts from sources to their CM counterparts
• View Definition/Translation Language +
• Object reconciliation language +
• Language for programming the mediator +
• Expressed in RDFS
• Instantiated by the integration designer into IMIs: program the mediator to overcome the semantic heterogeneity between the sources and the CM

Integration Model in RDF(S)

[Figure: the Integration Model as an RDF(S) schema. Capitalised labels (classes): PathExpression, Articulation, Decoration, ProcInstruction, Comparator, Transformer, Source, SubSchema, ClassList, Ranking, Restriction, List, From, To, FromNode, ToNode, FromEdge, ToEdge, FromList, PrimaryNode, EdgeNode, Rdf:Property, Rdf:Resource, Literal. Lower-case labels (properties): source, target, starts, ends, endsList, follow, backtrack, idByValue, idByProperty, producedBy, nodeProducedBy, edgeProducedBy, obtainedFromNode, obtainedFromEdge, obtainedFromList, appliesToArt, appliesToSource, forSource, fromTarget, toTarget, hasClassList, identicalFor, rankValue, dimension, restrictionValue, restrictionPath, address, name, property, subClassOf, subPropertyOf.]

Schema heterogeneity

• Sources are autonomous and can therefore differ a lot from each other
• Mappings are formed through the notion of Path Expressions (PE), which form articulations
• An articulation is a pair of two PEs, one in an external source, one in the CM
  ◦ consists of a link between the start-classes and a link between the ending-edges
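One way to picture an articulation is as a small data structure. The sketch below is illustrative only: the class and field names are assumptions made for the example (they are not the IM's RDFS vocabulary), and the source-side path is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PathExpression:
    start_class: str   # class where the path starts
    edges: tuple       # property names along the path; the last is the ending edge

@dataclass(frozen=True)
class Articulation:
    source_pe: PathExpression   # path in the external source
    cm_pe: PathExpression       # corresponding path in the CM

    def class_link(self):
        """Link between the two start-classes."""
        return (self.source_pe.start_class, self.cm_pe.start_class)

    def edge_link(self):
        """Link between the two ending-edges."""
        return (self.source_pe.edges[-1], self.cm_pe.edges[-1])

# Hypothetical example: the source reaches the writer's name via different properties.
art = Articulation(
    source_pe=PathExpression("Book", ("author", "fullName")),
    cm_pe=PathExpression("Comic", ("hasWriter", "name")),
)
```

The two paths need not have the same length; only the start-classes and the ending-edges are tied together.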

Schema heterogeneity

• Design the CM like source 1:

Schema heterogeneity

• Sometimes a value should be mapped to a list of values
• A transformer is needed for the necessary action
• Denoted in the IM as “obtained by list”


Designation heterogeneity

• Different sources may have different ways to uniquely identify instances
• Need to define the identifying properties (“primary key”) of objects in every source
• Consolidate them into the CM so that a join can be performed across multiple sources


Designation heterogeneity

• IM offers three kinds of data-identification:
  1. idByUri: every object has a unique URI
  2. idByValue: a value is unique for an object
  3. idByProperty: a “super resource” defines the uniqueness of an object
• Examples:


Designation heterogeneity: idByUri

• An object is uniquely defined by its URI
• Can be used/imposed within closed communities, e.g. corporate IS
• Does not work World Wide


Designation heterogeneity: idByValue

• Value based (similarly to the relational model)
• An object is uniquely defined by one (or more) of its own properties


Designation heterogeneity: idByProperty

• idBy-information provided by a super-resource
• Recursive path until either idByUri or idByValue is encountered

Join Example

• If the primary key is the same, the data is joined
• The two idBy-paths are different, but joining is still possible if the end-values for the primary key are the same
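The three identification kinds and the join rule can be sketched as follows. This is a toy model, not the mediator's actual code: the dictionary layout, the `idBy` / `idProperty` keys, and the sample objects are assumptions made for the example.

```python
def identity(obj, objects):
    """Return a comparable key for obj, following the three idBy kinds."""
    kind = obj["idBy"]
    if kind == "uri":
        return ("uri", obj["uri"])
    if kind == "value":
        return ("value", obj[obj["idProperty"]])
    if kind == "property":
        # Recurse into the super-resource until idByUri or idByValue is reached.
        return identity(objects[obj[obj["idProperty"]]], objects)
    raise ValueError(f"unknown identification kind: {kind}")

# Hypothetical source 1: a writer identified by the value of its name.
source1 = {"w1": {"idBy": "value", "idProperty": "name", "name": "Alice", "age": 42}}
# Hypothetical source 2: a resource identified through a super-resource.
source2 = {
    "x9": {"idBy": "property", "idProperty": "describedBy", "describedBy": "p7",
           "hasPortrait": "alice.jpg"},
    "p7": {"idBy": "value", "idProperty": "label", "label": "Alice"},
}

def join(left, left_ids, right, right_ids):
    """Pair up objects from two sources whose identity keys are equal."""
    by_key = {identity(left[i], left): left[i] for i in left_ids}
    pairs = []
    for i in right_ids:
        match = by_key.get(identity(right[i], right))
        if match is not None:
            pairs.append((match, right[i]))
    return pairs

joined = join(source1, ["w1"], source2, ["x9"])
```

The two objects are reached by different idBy-paths (one direct, one through `p7`), yet both end in the value "Alice", so they are joined.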


Different sources - different qualities

• Sources can “get points” for certain qualities:
  ◦ Data reliability
  ◦ Data quality: e.g. picture quality for a Photo database
  ◦ Reachability
  ◦ Speed
• Sources can be consulted in different order based on the current user preferences
• Decorations
  ◦ making background knowledge explicit
  ◦ “exported” into the CM (extending the Schema)
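A minimal sketch of preference-based source ordering, assuming each source carries a numeric score per quality dimension and the user supplies weights. The source names, scores, and the weighted-sum rule are all assumptions for illustration; the slides do not say how the points are combined.

```python
# Hypothetical scores per source (higher is better), one per quality dimension.
SCORES = {
    "museumA": {"reliability": 0.9, "quality": 0.6, "reachability": 0.8, "speed": 0.4},
    "museumB": {"reliability": 0.7, "quality": 0.9, "reachability": 0.9, "speed": 0.9},
}

def consultation_order(scores, preferences):
    """Order sources by the preference-weighted sum of their scores."""
    def total(source):
        return sum(weight * scores[source].get(dim, 0.0)
                   for dim, weight in preferences.items())
    return sorted(scores, key=total, reverse=True)

fast_first = consultation_order(SCORES, {"speed": 1.0})
reliable_first = consultation_order(SCORES, {"reliability": 1.0})
```

Changing only the weights changes the order in which the same sources are consulted.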


Extra IM features

• Process Instructions: needed to compare (primary) keys
  ◦ Transformers/Comparators
    ▪ Conversion functions (date of birth -> personal number)
    ▪ Look-up table value translations
    ▪ Unit conversions (km -> mile)
    ▪ Format conversion (tif -> jpg)
• Direct translation
  ◦ In case of homogeneous schemas/sources
  ◦ Lists of classes with identical outgoing edges
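Transformers and comparators can be thought of as small functions plugged into the mediator. The sketch below shows a unit conversion, a look-up table translation, and a comparator built from two transformers; the function names and the country table are hypothetical.

```python
KM_PER_MILE = 1.609344  # exact definition of the international mile in km

def km_to_mile(km):
    """Unit conversion transformer (km -> mile)."""
    return km / KM_PER_MILE

# Hypothetical look-up table translating source codes to CM values.
COUNTRY_CODES = {"NL": "Netherlands", "BE": "Belgium"}

def lookup(table):
    """Build a look-up table transformer."""
    return lambda value: table[value]

def comparator(transform_left, transform_right):
    """Compare two keys after bringing both into the same representation."""
    return lambda left, right: transform_left(left) == transform_right(right)

same_country = comparator(lookup(COUNTRY_CODES), str.strip)
```

For example, `same_country("NL", " Netherlands ")` compares a coded value from one source with a padded full name from another.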



Query Processing

CM Instance (CMI)

• The Hera presentation engine needs more data than a “bare” user query returns
• The user query is extended to retrieve literal values (the real content)
• An RDF graph is constructed out of the “flat” SeRQL output
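Turning a flat result table back into a graph can be sketched as below. The row layout, the `ex:` identifiers, and the helper name are assumptions for the example; the actual SeRQL result format is not shown in the slides.

```python
# Hypothetical flat result rows of an extended query "all writers + name, age".
COLUMNS = ("writer", "name", "age")
ROWS = [
    ("ex:writer1", "Alice", "42"),
    ("ex:writer2", "Bob", "37"),
]

def rows_to_triples(columns, rows, rdf_type):
    """Turn each row into a typed node plus one literal-valued edge per column."""
    triples = []
    property_cols = columns[1:]   # the first column holds the subject
    for row in rows:
        subject, values = row[0], row[1:]
        triples.append((subject, "rdf:type", rdf_type))
        for prop, value in zip(property_cols, values):
            if value is not None:   # skip unbound values
                triples.append((subject, prop, value))
    return triples

cmi = rows_to_triples(COLUMNS, ROWS, "Writer")
```

Each row yields one node of the requested class and one edge per retrieved literal, which is the shape the presentation engine consumes.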

CMI generation

• User Query:
  ◦ Retrieve all writers
• Extended query:
  ◦ Retrieves also the name, age, hasPortrait P4, P5

[Figure: the given CM. Classes: Comic, Picture, Person, Writer, Animator. Properties: hasPicture, hasWriter, hasDrawer, drawStyle, hasPortrait, name, age, with Literal values. Further labels: P1 to P5.]


Optimization

Initial Performance experiments

• Queries against small-size applications (e.g. the comics database or virtual museum) are answered within milliseconds
• Medium Test Set: RDF version of WordNet
  ◦ Naturally split in four parts: Nouns (10 MB), Glossary (15 MB), Similar-To Definitions (2 MB), Hyponyms (8 MB)
• Test Setting:
  ◦ Three local stores: Vrije Universiteit Amsterdam, CWI (Amsterdam) and Eindhoven Technical University; Mediator at Vrije Universiteit
• Results
  ◦ When installed locally, the performance of the distributed system is between 50 and 200% of the original Sesame (200-1000 msec.)
  ◦ The performance drops with the size of the result set due to extensive joining and communication overhead

Where the Time Goes?

                     Large Result Set   Small Result Set
Local Processing           31%                75%
Communication              47%                 9%
Result Joining             21%                 9%
Mediator Overhead           1%                 7%

• Depends on many factors
  ◦ Data Size / Source Processing Speed
  ◦ Query (complexity, result size ...)
  ◦ Connection Speed

Improving the Performance

• Large applications (hundreds of MB) require sophisticated optimization techniques:
• Currently
  ◦ Schema/path Indexing
  ◦ Join ordering
  ◦ Algebraic optimizations
• Work in progress
  ◦ Reducing the transferred data
    ▪ requires an architecture change (similar to P2P)

Schema/Path Index Hierarchy

• Central place to store the translatable paths from the articulations
• The main idea: pushing long paths to sources is more efficient than joining many small paths at the mediator
  ◦ fewer joins
  ◦ less data traffic
• The index is constructed out of a pool of articulations
• Articulations can be added, deleted, or modified on the fly: flexible source management
• The index is able to infer “new” paths on its own:


• Inference rules:
  ◦ Inclusion of “super paths”
    Given: A → B and B → C
    Insert: A → C
  ◦ Transitive closure of subClassOf / subPropertyOf
    Given: A → B, C → D and B –subClassOf→ C
    Insert: A → D


• Key = path (sequence of properties)
• Value = list of sources to which this path can be translated
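The index layout and the two inference rules can be sketched together. This is a toy version under stated assumptions: keys also carry the start and end class so that paths can be chained, a composed path is assumed translatable only at sources that can translate both parts, the `max_len` bound is added to guarantee termination, and the sample entries are hypothetical.

```python
def close_index(index, sub_class_of=(), max_len=4):
    """Add inferred paths until nothing new appears.

    index: {(start_class, properties, end_class): set of sources}
    sub_class_of: (sub, super) class pairs, assumed already transitively closed.
    """
    closed = {key: set(sources) for key, sources in index.items()}
    subclass = set(sub_class_of)
    changed = True
    while changed:
        changed = False
        entries = list(closed.items())
        for (a, p1, b), s1 in entries:
            for (c, p2, d), s2 in entries:
                # Rule 1: B == C. Rule 2: B is a subclass of C.
                connects = b == c or (b, c) in subclass
                if not connects or len(p1) + len(p2) > max_len:
                    continue
                shared = s1 & s2   # assumption: both parts must be translatable
                key = (a, p1 + p2, d)
                if shared and not shared <= closed.get(key, set()):
                    closed.setdefault(key, set()).update(shared)
                    changed = True
    return closed

index = {
    ("Comic", ("hasWriter",), "Writer"): {"s1", "s2"},
    ("Person", ("name",), "Literal"): {"s1"},
}
closed = close_index(index, sub_class_of=[("Writer", "Person")])
```

With the inferred entry in place, the two-step path `hasWriter`, `name` can be pushed to `s1` as one sub-query instead of being joined at the mediator.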


Performance Results: Full Index vs 1-path Index

[Figure: two charts, Joining Time and Evaluation Time, each comparing the Full Index with the 1-Index. x-axis: Result Size per Table (100, 1000, 10000, 100000); y-axis: Time [ms].]

From an RDF Path to Relational Tables

Path: A –x→ B –y→ C –z→ D

Source1 (x)     Source2 (y)     Source3 (z)
A    B          B    C          C    D
a1   b1         b1   c1         c1   d1
a2   b2         b3   c2         c2   d2

Result: Source1(x) ⋈ Source2(y) ⋈ Source3(z)
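The slide's example can be replayed with a small natural join. The sketch assumes the first table's columns are A and B (matching its values a1, b1) and represents rows as dictionaries; the helper name is an assumption.

```python
def natural_join(left, right):
    """Join two lists of row dicts on the columns they share."""
    if not left or not right:
        return []
    shared = set(left[0]) & set(right[0])
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[col] == r[col] for col in shared)
    ]

source1_x = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}]
source2_y = [{"B": "b1", "C": "c1"}, {"B": "b3", "C": "c2"}]
source3_z = [{"C": "c1", "D": "d1"}, {"C": "c2", "D": "d2"}]

result = natural_join(natural_join(source1_x, source2_y), source3_z)
```

Only the chain a1, b1, c1, d1 survives: a2 has no matching B value in Source2, and b3 has no matching row in Source1.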


The Problem of Join Ordering

• Determine the optimal order of join execution in a (chain) query
• Different strategies have different execution costs due to different selectivity of the join operations
• Problem is NP-hard: we consider heuristic solutions

A Cost Model for RDF Querying

• Data Access Costs
  ◦ Initializing the transmission
  ◦ Transmitting the data
• Join Costs
  ◦ Nested Loop Join
  ◦ Hash Join (potentially faster, but less flexible w.r.t. determining object identity)
• Costs of a Query Plan:
  ◦ Transmission costs for all relations
  ◦ Join costs for the chosen footprint
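A minimal sketch of such a cost model, using textbook approximations rather than the authors' actual formulas: transmission is a fixed set-up cost plus a per-row cost, a nested loop join compares every pair of rows, and a hash join touches each row once. All constants, parameter names, and the selectivity-based size estimate are assumptions.

```python
def transmission_cost(rows, init_cost=50.0, cost_per_row=0.1):
    """Data access cost: fixed set-up cost plus a per-row transfer cost."""
    return init_cost + cost_per_row * rows

def nested_loop_cost(left_rows, right_rows):
    """Every left row is compared with every right row."""
    return left_rows * right_rows

def hash_join_cost(left_rows, right_rows):
    """Build a hash table on one input, probe it with the other."""
    return left_rows + right_rows

def plan_cost(sizes, selectivities, join_cost=nested_loop_cost):
    """Cost of joining relations left to right in the given order.

    sizes[i] is the row count of relation i; selectivities[i] is the assumed
    fraction of the cross product kept when relation i + 1 is joined in.
    """
    total = sum(transmission_cost(n) for n in sizes)
    current = sizes[0]
    for size, sel in zip(sizes[1:], selectivities):
        total += join_cost(current, size)
        current = current * size * sel
    return total

nested = plan_cost([100, 200], [0.01])
hashed = plan_cost([100, 200], [0.01], join_cost=hash_join_cost)
```

With these toy numbers the transmission part is identical for both plans, so the difference between `nested` and `hashed` comes entirely from the join algorithm.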

Join Ordering Heuristics

• Complexity of the task demands heuristic approaches; the performance of heuristics depends on the class of queries.
• For chain queries the following performs best [Steinbrunn et al. 1997]:
• 1) Iterative Improvement
  ◦ Start with random solutions
  ◦ Improve solution using a greedy heuristic
  ◦ Result: local optima
• 2) Simulated Annealing
  ◦ Further improve solutions, allowing for increasing costs with a certain probability
  ◦ Helps to get out of local optima and converge towards better solutions
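The two phases can be sketched on a toy chain query. Everything concrete here is an assumption for illustration: the relation sizes, the selectivities, the cost function (sum of intermediate result sizes), the swap neighbourhood, and the cooling schedule are not taken from the slides or from Steinbrunn et al.

```python
import math
import random

# Hypothetical chain query R0 - R1 - R2 - R3: row counts and the selectivity
# of the join between neighbours i and i + 1.
SIZES = [1000, 10, 500, 20]
SELECTIVITY = [0.01, 0.05, 0.1]

def cost(order):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    joined = {order[0]}
    rows = SIZES[order[0]]
    total = 0.0
    for rel in order[1:]:
        rows *= SIZES[rel]
        # Apply the selectivity of each chain edge linking rel to the relations
        # joined so far; with no such edge this step is a cross product.
        if rel - 1 in joined:
            rows *= SELECTIVITY[rel - 1]
        if rel + 1 in joined:
            rows *= SELECTIVITY[rel]
        joined.add(rel)
        total += rows
    return total

def neighbour(order, rng):
    """Swap two random positions."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new

def iterative_improvement(order, rng, tries=200):
    """Greedy descent: accept only cheaper neighbours (ends in a local optimum)."""
    for _ in range(tries):
        candidate = neighbour(order, rng)
        if cost(candidate) < cost(order):
            order = candidate
    return order

def simulated_annealing(order, rng, temperature=1000.0, cooling=0.95, steps=500):
    """Also accept costlier neighbours with probability exp(-delta / T)."""
    best = order
    for _ in range(steps):
        candidate = neighbour(order, rng)
        delta = cost(candidate) - cost(order)
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            order = candidate
            if cost(order) < cost(best):
                best = order
        temperature = max(temperature * cooling, 1e-9)
    return best

rng = random.Random(0)                       # fixed seed for repeatability
start = rng.sample(range(len(SIZES)), len(SIZES))
local = iterative_improvement(start, rng)
final = simulated_annealing(local, rng)
```

Simulated annealing starts from the local optimum found by iterative improvement and keeps the cheapest order seen, so `final` is never worse than `local`.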

Optimizing Communication: Collaborating Network of Mediators

• One coordinator + Set of cooperating nodes
• The coordinator generates the (global) query plan for other nodes
• Instructions:
  ◦ obtain data
    ▪ receive a table
    ▪ use your local data
  ◦ join obtained data in a given order
  ◦ ship data to another node

[Figure: a network of mediators, each with its own path index and attached repositories of RDF data; the user's query enters at one mediator.]

Conclusions

• Virtual Repositories are a viable solution for building distributed WIS
• Challenges:
  ◦ Semantic heterogeneity (addressed by the IM)
    ▪ Source schemas can differ
    ▪ URI is not enough for joining
    ▪ Flexible Evaluation
  ◦ Performance / Scalability (addressed by the Optimizer + Architecture)
    ▪ Path Indexing
    ▪ Join Ordering
    ▪ Collaborating Mediators

Questions
