dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
DESCRIPTION
dipLODocus[RDF] is a new system for RDF data processing that efficiently supports both simple transactional queries and complex analytics. dipLODocus[RDF] is based on a novel hybrid storage model that considers RDF data both from a graph perspective (by storing RDF subgraphs, or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).
http://diuf.unifr.ch/main/xi/diplodocus/
TRANSCRIPT
Short and Long-Tail RDF Analytics for Massive Webs of Data
Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
International Semantic Web Conference, 26th October 2011, Bonn, Germany
Motivation
● increasingly large semantic/LoD data sets
● increasingly complex queries
○ real-time analytic queries
■ like "returning the professor who supervises the most students"
urgent need for a more efficient and scalable solution for RDF data management
3 recipes to speed-up
○ collocation
○ collocation
○ collocation
Why collocation?
Because by collocating data together we can reduce I/O operations, which are one of the biggest bottlenecks in database systems.
Outline
● architecture
● main idea
● data structures
● basic operations (inserts, queries)
● evaluation & results
● future work
System Architecture
Main Idea - Hybrid Storage
Main Idea - Data Structures
Declarative Templates
Template Matching
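The template-matching step can be sketched as follows; the (subject type, predicate) pairs and template IDs below are illustrative assumptions, not the system's actual catalog.

```python
# Hypothetical sketch: a declarative template fixes a (subject type,
# predicate) pair. An incoming triple is matched against the declared
# templates, and the resulting template ID decides where its value
# is physically stored (molecule slot or literal list).
templates = {
    ("Student", "takesCourse"): 1,  # invented example templates
    ("Student", "name"): 2,
}

def match_template(subject_type, predicate):
    """Return the template ID for a triple, or None if unmatched."""
    return templates.get((subject_type, predicate))

print(match_template("Student", "name"))  # 2
```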
Molecule Clusters
● extremely compact sub-graphs
● precomputed joins
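As a rough illustration (the names and layout are assumed, not the actual on-disk format), a molecule co-locates all triples around a root subject, so a lookup touching several predicates of the same entity needs no join at query time:

```python
# Hypothetical in-memory picture of a molecule cluster: everything
# known about one root subject is stored together.
molecule = {
    "root": "Student1",
    "edges": {                      # predicate -> objects, co-located
        "type": ["Student"],
        "takesCourse": ["Course0", "Course1"],
        "advisor": ["Professor3"],
    },
}

def molecule_lookup(mol, predicate):
    """Return all objects for a predicate: a precomputed join, no seek."""
    return mol["edges"].get(predicate, [])

print(molecule_lookup(molecule, "takesCourse"))  # ['Course0', 'Course1']
```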
List of Literals
● extremely compact list of sorted values
Hash Table
● lexicographic tree to encode URIs
● template-based indexing
● extremely compact lists of homologous nodes
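A minimal stand-in for the URI-encoding step (a plain dictionary replaces the actual lexicographic tree; the class and method names are invented for illustration):

```python
# Sketch only: URIs are mapped to small integer keys so that molecules
# and literal lists store fixed-size keys instead of long strings.
# A real lexicographic tree would also share common URI prefixes.
class UriEncoder:
    def __init__(self):
        self.uri_to_id = {}
        self.id_to_uri = []

    def encode(self, uri):
        """Assign (or look up) the integer key for a URI."""
        if uri not in self.uri_to_id:
            self.uri_to_id[uri] = len(self.id_to_uri)
            self.id_to_uri.append(uri)
        return self.uri_to_id[uri]

    def decode(self, key):
        """Recover the URI from its integer key."""
        return self.id_to_uri[key]

enc = UriEncoder()
k = enc.encode("http://example.org/Student1")
print(enc.decode(k))  # http://example.org/Student1
```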
Basic operations - inserts
n-pass algorithm
Basic operations - queries - triple patterns
?x type Student.
?x takesCourse Course0.

?x type Student.
?x takesCourse Course0.
?x takesCourse Course1.
=> intersection of sorted lists
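The intersection step above can be sketched with the classic two-pointer merge over sorted key lists (the keys below are made-up examples):

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted key lists, O(len(a)+len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# illustrative keys: subjects typed Student vs. subjects taking Course0
students = [1, 3, 5, 7, 9]
course0_takers = [2, 3, 5, 8, 9]
print(intersect_sorted(students, course0_takers))  # [3, 5, 9]
```

A third pattern simply adds one more intersection over the same kind of list.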
Basic operations - queries - molecule queries
?a name 'Student1'.
?a ?b ?c.
?c ?d ?e.
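A molecule query like this anchors on one node and then explores everything within a given scope (number of hops), which maps directly onto the stored sub-graph. A sketch with an assumed adjacency-list representation:

```python
# Illustrative graph: subject -> list of (predicate, object) pairs.
graph = {
    "s1": [("name", "Student1"), ("takesCourse", "c0")],
    "c0": [("name", "Course0")],
}

def molecule_scope(graph, root, scope):
    """Collect all triples reachable from root within `scope` hops."""
    triples, frontier = [], [root]
    for _ in range(scope):
        next_frontier = []
        for s in frontier:
            for p, o in graph.get(s, []):
                triples.append((s, p, o))
                next_frontier.append(o)
        frontier = next_frontier
    return triples

print(molecule_scope(graph, "s1", 2))
```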
Basic operations - queries - aggregates and analytics
?x type Student.
?x age ?y
filter (?y < 21)
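With the "vertical" layout, such a filter runs over one compact sorted list of literal values. A sketch under an assumed layout (values paired with subject keys, sorted by value), where the range filter becomes a binary search plus a scan of one contiguous region:

```python
import bisect

# Assumed example data: ages of all Students as sorted (value, key) pairs.
ages = [(18, 4), (19, 1), (20, 7), (21, 2), (25, 3)]

def students_younger_than(limit):
    """Subject keys whose age value is below `limit`."""
    cut = bisect.bisect_left(ages, (limit, -1))  # first entry >= limit
    return [key for _value, key in ages[:cut]]

print(students_younger_than(21))  # [4, 1, 7]
```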
Performance Evaluation
We used the Lehigh University Benchmark.
We generated two datasets, for 10 and 100 universities:
● 1 272 814 distinct triples and 315 003 distinct strings
● 13 876 209 distinct triples and 3 301 868 distinct strings
We compared the runtime execution for 14 LUBM queries and 3 analytic queries inspired by BowlognaBench:
● returning the professor who supervises the most students
● returning the big molecule containing everything around Student0 within scope 2
● returning names for all graduate students
Results - LUBM - 10 Universities
Results - LUBM - 100 Universities
Results - analytic 10 Universities
Results - analytic 100 Universities
Future work
● open source
○ cleaning code
○ extending code
● parallelising operations
○ multi-core architecture
○ cloud
● automated database design
Conclusions
● advanced data collocation
○ molecules: RDF sub-graphs
○ lists of literals: compact sorted lists of values
○ hash table indexed by templates
● slower inserts and updates
○ compact ordered structures
○ data redundancy
● 30 times faster on LUBM queries
● 350 times faster on analytic queries
Thank you for your attention
Update Manager - lazy updates
Transitivity
● Inheritance Manager
○ typeX subClassOf typeY
● Query
○ ?z type typeY
■ ?z type typeY
■ ?z type typeX
● subClassOf
● subPropertyOf
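The rewriting above — a query on a supertype becomes the union of queries over all its (transitive) subclasses — can be sketched as follows; the type names and the flat dictionary are illustrative assumptions:

```python
# Hypothetical sketch of the Inheritance Manager's query expansion.
sub_class_of = {"typeX": "typeY"}  # typeX subClassOf typeY

def expand_types(query_type):
    """All types whose instances also match query_type, via subClassOf."""
    types = {query_type}
    changed = True
    while changed:                      # fixpoint handles chains of subclasses
        changed = False
        for sub, sup in sub_class_of.items():
            if sup in types and sub not in types:
                types.add(sub)
                changed = True
    return types

print(sorted(expand_types("typeY")))  # ['typeX', 'typeY']
```

The same fixpoint works unchanged for subPropertyOf by swapping in a property hierarchy.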
Serialising Molecules
#TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
#TEMPLATES - the number of templates in the molecule
TEMPLATE_SIZE - the size of a template entry in bytes
#TRIPLES - the number of triples in the molecule
KEY_SIZE - the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
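A worked example of the formula above, with invented counts (3 templates, 20 triples) and 8-byte entries as on the Intel 64 / Linux setup mentioned:

```python
TEMPLATE_SIZE = 8   # bytes per template entry (assumed equal to key size)
KEY_SIZE = 8        # bytes per key (Intel 64, Linux)

def molecule_bytes(n_templates, n_triples):
    """Serialized size: #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE."""
    return n_templates * TEMPLATE_SIZE + n_triples * KEY_SIZE

print(molecule_bytes(3, 20))  # 184 bytes
```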