dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
DESCRIPTION
dipLODocus[RDF] is a new system for RDF data processing that efficiently supports both simple transactional queries and complex analytics. dipLODocus[RDF] is based on a novel hybrid storage model that considers RDF data both from a graph perspective (by storing RDF subgraphs, or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute).
http://diuf.unifr.ch/main/xi/diplodocus/
TRANSCRIPT
Short and Long-Tail RDF Analytics for Massive Webs of Data
Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
International Semantic Web Conference, 26th October 2011, Bonn, Germany
Motivation
● increasingly large semantic/LoD data sets
● increasingly complex queries
○ real-time analytic queries
■ like "returning the professor who supervises the most students"
urgent need for a more efficient and scalable solution for RDF data management
3 recipes to speed-up
○ collocation
○ collocation
○ collocation
Why collocation?
Because by collocating data together we can reduce I/O operations, which are one of the biggest bottlenecks in database systems.
Outline
● architecture
● main idea
● data structures
● basic operations (inserts, queries)
● evaluation & results
● future work
System Architecture
Main Idea - Hybrid Storage
Main Idea - Data Structures
Declarative Templates
Template Matching
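The template-matching step can be sketched as follows; the (subject type, predicate) pairs and template IDs below are illustrative assumptions, not the system's actual catalog.

```python
# Hypothetical sketch: a declarative template fixes a (subject type,
# predicate) pair. An incoming triple is matched against the declared
# templates, and the resulting template ID decides where its value
# is physically stored (molecule slot or literal list).
templates = {
    ("Student", "takesCourse"): 1,  # invented example templates
    ("Student", "name"): 2,
}

def match_template(subject_type, predicate):
    """Return the template ID for a triple, or None if unmatched."""
    return templates.get((subject_type, predicate))

print(match_template("Student", "name"))  # 2
```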
Molecule Clusters
● extremely compact sub-graphs
● precomputed joins
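As a rough illustration (the names and layout are assumed, not the actual on-disk format), a molecule co-locates all triples around a root subject, so a lookup touching several predicates of the same entity needs no join at query time:

```python
# Hypothetical in-memory picture of a molecule cluster: everything
# known about one root subject is stored together.
molecule = {
    "root": "Student1",
    "edges": {                      # predicate -> objects, co-located
        "type": ["Student"],
        "takesCourse": ["Course0", "Course1"],
        "advisor": ["Professor3"],
    },
}

def molecule_lookup(mol, predicate):
    """Return all objects for a predicate: a precomputed join, no seek."""
    return mol["edges"].get(predicate, [])

print(molecule_lookup(molecule, "takesCourse"))  # ['Course0', 'Course1']
```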
List of Literals
● extremely compact list of sorted values
Hash Table
● lexicographic tree to encode URIs
● template-based indexing
● extremely compact lists of homologous nodes
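A minimal stand-in for the URI-encoding step (a plain dictionary replaces the actual lexicographic tree; the class and method names are invented for illustration):

```python
# Sketch only: URIs are mapped to small integer keys so that molecules
# and literal lists store fixed-size keys instead of long strings.
# A real lexicographic tree would also share common URI prefixes.
class UriEncoder:
    def __init__(self):
        self.uri_to_id = {}
        self.id_to_uri = []

    def encode(self, uri):
        """Assign (or look up) the integer key for a URI."""
        if uri not in self.uri_to_id:
            self.uri_to_id[uri] = len(self.id_to_uri)
            self.id_to_uri.append(uri)
        return self.uri_to_id[uri]

    def decode(self, key):
        """Recover the URI from its integer key."""
        return self.id_to_uri[key]

enc = UriEncoder()
k = enc.encode("http://example.org/Student1")
print(enc.decode(k))  # http://example.org/Student1
```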
Basic operations - inserts
n-pass algorithm
Basic operations - queries - triple patterns
?x type Student.
?x takesCourse Course0.

?x type Student.
?x takesCourse Course0.
?x takesCourse Course1.
=> intersection of sorted lists
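The intersection step above can be sketched with the classic two-pointer merge over sorted key lists (the keys below are made-up examples):

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted key lists, O(len(a)+len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# illustrative keys: subjects typed Student vs. subjects taking Course0
students = [1, 3, 5, 7, 9]
course0_takers = [2, 3, 5, 8, 9]
print(intersect_sorted(students, course0_takers))  # [3, 5, 9]
```

A third pattern simply adds one more intersection over the same kind of list.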
Basic operations - queries - molecule queries
?a name 'Student1'.
?a ?b ?c.
?c ?d ?e.
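A molecule query like this anchors on one node and then explores everything within a given scope (number of hops), which maps directly onto the stored sub-graph. A sketch with an assumed adjacency-list representation:

```python
# Illustrative graph: subject -> list of (predicate, object) pairs.
graph = {
    "s1": [("name", "Student1"), ("takesCourse", "c0")],
    "c0": [("name", "Course0")],
}

def molecule_scope(graph, root, scope):
    """Collect all triples reachable from root within `scope` hops."""
    triples, frontier = [], [root]
    for _ in range(scope):
        next_frontier = []
        for s in frontier:
            for p, o in graph.get(s, []):
                triples.append((s, p, o))
                next_frontier.append(o)
        frontier = next_frontier
    return triples

print(molecule_scope(graph, "s1", 2))
```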
Basic operations - queries - aggregates and analytics
?x type Student.
?x age ?y
filter (?y < 21)
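With the "vertical" layout, such a filter runs over one compact sorted list of literal values. A sketch under an assumed layout (values paired with subject keys, sorted by value), where the range filter becomes a binary search plus a scan of one contiguous region:

```python
import bisect

# Assumed example data: ages of all Students as sorted (value, key) pairs.
ages = [(18, 4), (19, 1), (20, 7), (21, 2), (25, 3)]

def students_younger_than(limit):
    """Subject keys whose age value is below `limit`."""
    cut = bisect.bisect_left(ages, (limit, -1))  # first entry >= limit
    return [key for _value, key in ages[:cut]]

print(students_younger_than(21))  # [4, 1, 7]
```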
Performance Evaluation
We used the Lehigh University Benchmark.
We generated two datasets, for 10 and 100 universities:
● 1 272 814 distinct triples and 315 003 distinct strings
● 13 876 209 distinct triples and 3 301 868 distinct strings
We compared the runtime execution for 14 LUBM queries and 3 analytic queries inspired by BowlognaBench:
● returning the professor who supervises the most students
● returning the big molecule containing everything around Student0 within scope 2
● returning names for all graduate students
Results - LUBM - 10 Universities
Results - LUBM - 100 Universities
Results - analytic 10 Universities
Results - analytic 100 Universities
Future work
● open source
○ cleaning code
○ extending code
● parallelising operations
○ multi-core architecture
○ cloud
● automated database design
Conclusions
● advanced data collocation
○ molecules: RDF sub-graphs
○ lists of literals: compact sorted lists of values
○ hash table indexed by templates
● slower inserts and updates
○ compact ordered structures
○ data redundancy
● 30 times faster on LUBM queries
● 350 times faster on analytic queries
Thank you for your attention
Update Manager - lazy updates
Transitivity
● Inheritance Manager
○ typeX subClassOf typeY
● Query
○ ?z type typeY
■ ?z type typeY
■ ?z type typeX
● subClassOf
● subPropertyOf
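The rewriting above — a query on a supertype becomes the union of queries over all its (transitive) subclasses — can be sketched as follows; the type names and the flat dictionary are illustrative assumptions:

```python
# Hypothetical sketch of the Inheritance Manager's query expansion.
sub_class_of = {"typeX": "typeY"}  # typeX subClassOf typeY

def expand_types(query_type):
    """All types whose instances also match query_type, via subClassOf."""
    types = {query_type}
    changed = True
    while changed:                      # fixpoint handles chains of subclasses
        changed = False
        for sub, sup in sub_class_of.items():
            if sup in types and sub not in types:
                types.add(sub)
                changed = True
    return types

print(sorted(expand_types("typeY")))  # ['typeX', 'typeY']
```

The same fixpoint works unchanged for subPropertyOf by swapping in a property hierarchy.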
Serialising Molecules
#TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
#TEMPLATES - the number of templates in the molecule
TEMPLATE_SIZE - the size of a template entry in bytes
#TRIPLES - the number of triples in the molecule
KEY_SIZE - the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
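A worked example of the formula above, with invented counts (3 templates, 20 triples) and 8-byte entries as on the Intel 64 / Linux setup mentioned:

```python
TEMPLATE_SIZE = 8   # bytes per template entry (assumed equal to key size)
KEY_SIZE = 8        # bytes per key (Intel 64, Linux)

def molecule_bytes(n_templates, n_triples):
    """Serialized size: #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE."""
    return n_templates * TEMPLATE_SIZE + n_triples * KEY_SIZE

print(molecule_bytes(3, 20))  # 184 bytes
```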