dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data

Short and Long-Tail RDF Analytics for Massive Webs of Data Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland International Semantic Web Conference 26th October 2011, Bonn, Germany

DESCRIPTION

dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute). http://diuf.unifr.ch/main/xi/diplodocus/

TRANSCRIPT

Page 1: dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data

Short and Long-Tail RDF Analytics for Massive Webs of Data

Marcin Wylot, Jigé Pont, Mariusz Wiśniewski, and Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg, Switzerland

International Semantic Web Conference, 26th October 2011, Bonn, Germany

Page 2

Motivation

● increasingly large semantic/LoD data sets

● increasingly complex queries
○ real-time analytic queries
■ e.g. “return the professor who supervises the most students”

urgent need for a more efficient and scalable solution for RDF data management

Page 6

3 recipes to speed-up

○ collocation
○ collocation
○ collocation

Page 7

Why collocation?

Because collocating related data reduces I/O operations, which are one of the biggest bottlenecks in database systems.
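To make the I/O argument concrete, here is a minimal sketch that counts how many fixed-size pages must be read to fetch all triples of one subject, first with triples interleaved in arrival order and then collocated by subject. The page size, the toy data, and the helper name `pages_touched` are assumptions chosen for illustration; they are not taken from dipLODocus[RDF].

```python
PAGE_SIZE = 4  # triples per page; an illustrative assumption


def pages_touched(triples, subject, page_size=PAGE_SIZE):
    """Return the set of page numbers holding at least one triple of `subject`."""
    return {i // page_size for i, (s, _, _) in enumerate(triples) if s == subject}


# Toy data: 4 students with 4 properties each, interleaved as they might arrive.
props = ["type", "name", "age", "takesCourse"]
scattered = [(f"Student{s}", p, f"v{s}{j}")
             for j, p in enumerate(props) for s in range(4)]

# Collocated layout: all triples of a subject are stored next to each other.
collocated = sorted(scattered, key=lambda t: t[0])

scattered_reads = len(pages_touched(scattered, "Student1"))
collocated_reads = len(pages_touched(collocated, "Student1"))
print(scattered_reads, collocated_reads)  # 4 1
```

In this toy layout the interleaved storage touches every page, while the collocated storage touches a single one.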

Page 8

Outline

● architecture

● main idea

● data structures

● basic operations (inserts, queries)

● evaluation & results

● future work

Page 9

System Architecture

Page 10

Main Idea - Hybrid Storage

Page 11

Main Idea - data structures

Page 12

Declarative Templates

Page 13

Template Matching

Page 14

Molecule Clusters

● extremely compact sub-graphs
● precomputed joins
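The "precomputed joins" point can be illustrated with a small sketch: if the triples around an entity are grouped into one molecule at load time, a star-shaped query on that entity becomes a single lookup instead of a join across patterns. The function name `build_molecules` and the sample triples are hypothetical, and the real system stores molecules in a compact, template-driven form rather than as Python lists.

```python
from collections import defaultdict


def build_molecules(triples):
    """Group triples by subject so each entity's sub-graph is stored together."""
    molecules = defaultdict(list)
    for s, p, o in triples:
        molecules[s].append((p, o))
    return dict(molecules)


triples = [
    ("Student1", "type", "Student"),
    ("Student2", "type", "Student"),
    ("Student1", "name", "Alice"),
    ("Student1", "takesCourse", "Course0"),
    ("Student2", "name", "Bob"),
]

molecules = build_molecules(triples)

# One lookup returns what would otherwise need a join on ?x across
#   ?x type ?t. ?x name ?n. ?x takesCourse ?c.
student1 = dict(molecules["Student1"])
print(student1)
```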

Page 15

List of Literals

● extremely compact list of sorted values
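As a rough sketch of such a structure, the class below keeps the literal values of one attribute in sorted order together with the key of the molecule each value belongs to. The class name `LiteralList` and the integer keys are assumptions made for illustration; the slides do not describe the actual layout.

```python
import bisect


class LiteralList:
    """Sorted list of (value, key) pairs for one attribute, e.g. `age`."""

    def __init__(self):
        self._items = []

    def insert(self, value, key):
        # Keeping the list ordered costs O(n) per insert because elements shift.
        bisect.insort(self._items, (value, key))

    def values(self):
        return [v for v, _ in self._items]

    def keys_in_order(self):
        return [k for _, k in self._items]


ages = LiteralList()
for value, key in [(23, 101), (19, 102), (21, 103), (20, 104)]:
    ages.insert(value, key)

print(ages.values())         # [19, 20, 21, 23]
print(ages.keys_in_order())  # [102, 104, 103, 101]
```

Sorted storage is what makes range filters and aggregates cheap to evaluate, at the price of more expensive inserts.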

Page 16

Hash Table

lexicographic tree to encode URIs

template based indexing

extremely compact lists of homologous nodes
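One plausible reading of "lexicographic tree to encode URIs" is a trie that maps each distinct URI to a compact integer key, so that the rest of the system can work with fixed-size keys instead of strings. The sketch below assumes that reading; the class name `UriTrie` and the example.org URIs are hypothetical.

```python
class UriTrie:
    """Character trie that assigns a compact integer key to each distinct URI."""

    def __init__(self):
        self._root = {}
        self._next_key = 0

    def encode(self, uri):
        """Return the key for `uri`, inserting it on first sight."""
        node = self._root
        for ch in uri:
            node = node.setdefault(ch, {})
        # The empty string cannot collide with a single-character edge label.
        if "" not in node:
            node[""] = self._next_key
            self._next_key += 1
        return node[""]


trie = UriTrie()
k1 = trie.encode("http://example.org/Student1")
k2 = trie.encode("http://example.org/Student2")
k3 = trie.encode("http://example.org/Student1")
print(k1, k2, k3)  # 0 1 0
```

URIs that share a long prefix share most of their path in the tree, which keeps the encoding compact.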

Page 17

Basic operations - inserts

n-pass algorithm

Page 18

Basic operations - queries - triple patterns

?x type Student. ?x takesCourse Course0.

?x type Student. ?x takesCourse Course0. ?x takesCourse Course1.

=> intersection of sorted lists
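A minimal sketch of that intersection step, assuming each triple pattern yields an ascending list of integer keys (the lists below are made up for illustration):

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of keys with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sorted key lists, one per triple pattern.
students = [1, 2, 3, 5, 8, 9]   # ?x type Student
takes_course0 = [2, 3, 4, 8]    # ?x takesCourse Course0
takes_course1 = [3, 7, 8, 9]    # ?x takesCourse Course1

two_patterns = intersect_sorted(students, takes_course0)
three_patterns = intersect_sorted(two_patterns, takes_course1)
print(two_patterns, three_patterns)  # [2, 3, 8] [3, 8]
```

Because the inputs are already sorted, each intersection runs in time linear in the list lengths and needs no hashing or re-sorting.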

Page 19

Basic operations - queries - molecule queries

?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.
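The query above asks for everything within two hops of the node named 'Student1'. A sketch of that scope-limited expansion over plain triples follows; the function name `molecule_within_scope` and the sample data are assumptions, and a system that already stores molecules would answer this from the stored sub-graph rather than by traversing triples.

```python
from collections import defaultdict


def molecule_within_scope(triples, root, scope):
    """Collect triples reachable from `root` by following at most `scope` edges."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
    result, frontier, seen = [], [root], {root}
    for _ in range(scope):
        next_frontier = []
        for s in frontier:
            for p, o in by_subject[s]:
                result.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return result


triples = [
    ("s1", "name", "Student1"),
    ("s1", "takesCourse", "c0"),
    ("c0", "taughtBy", "p0"),
    ("p0", "name", "Professor0"),
]

# ?a name 'Student1'
roots = [s for s, p, o in triples if p == "name" and o == "Student1"]
found = molecule_within_scope(triples, roots[0], scope=2)
print(found)
```

With `scope=2` the professor's name is not returned, since it lies three edges away from the root.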

Page 20

Basic operations - queries - aggregates and analytics

?x type Student. ?x age ?y. filter (?y < 21)
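A filter like this can be answered from a sorted list of literals with a binary search, followed by a check against the keys of type Student. The sketch below assumes `(age, key)` pairs sorted by age and a list of student keys; all values and the helper name `keys_with_value_below` are hypothetical.

```python
import bisect

# Hypothetical sorted list of (age, key) pairs and the keys of type Student.
ages = [(18, 7), (19, 2), (20, 9), (21, 3), (25, 5)]
student_keys = [2, 3, 5, 9]


def keys_with_value_below(sorted_pairs, limit):
    """Return keys whose value is < limit, using binary search on the values."""
    # (limit,) sorts before every (limit, key), so the cut excludes value == limit.
    cut = bisect.bisect_left(sorted_pairs, (limit,))
    return [key for _, key in sorted_pairs[:cut]]


young = keys_with_value_below(ages, 21)
students = set(student_keys)
young_students = sorted(k for k in young if k in students)
print(young_students)  # [2, 9]
```

Key 7 is younger than 21 but is not a student, so it is dropped by the second step.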

Page 21

Performance Evaluation

We used the Lehigh University Benchmark (LUBM).

We generated two datasets, for 10 and 100 universities:
● 10 universities: 1 272 814 distinct triples and 315 003 distinct strings
● 100 universities: 13 876 209 distinct triples and 3 301 868 distinct strings

We compared the execution time of the 14 LUBM queries and 3 analytic queries inspired by BowlognaBench:
● returning the professor who supervises the most students
● returning a big molecule containing everything around Student0 within scope 2
● returning names for all graduate students

Page 22

Results - LUBM - 10 Universities

Page 23

Results - LUBM - 100 Universities

Page 24

Results - analytic 10 Universities

Page 25

Results - analytic 100 Universities

Page 26

Future work

● open source
○ cleaning code
○ extending code

● parallelising operations
○ multi-core architecture
○ cloud

● automated database design

Page 27

Conclusions

● advanced data collocation
○ molecules, RDF sub-graphs
○ lists of literals, compact sorted list of values
○ hash table indexed by templates

● slower inserts and updates
○ compact ordered structures
○ data redundancy

● 30 times faster on LUBM queries
● 350 times faster on analytic queries

Page 28

Thank you for your attention

Page 29

Update Manager - lazy updates

Page 30

Transitivity

● Inheritance Manager
○ typeX subClassOf typeY

● Query
○ ?z type typeY
■ ?z type typeY
■ ?z type typeX

● subClassOf
● subPropertyOf
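The rewriting shown above, where a pattern on typeY is expanded to also match its subclass typeX, can be sketched as a transitive walk over subClassOf pairs. The function names and the extra class `typeW` are hypothetical, added only to show a second level of inheritance.

```python
def subclasses_of(target, sub_class_of):
    """Return `target` plus every type that is transitively a subclass of it."""
    result, stack = {target}, [target]
    while stack:
        current = stack.pop()
        for child, parent in sub_class_of:
            if parent == current and child not in result:
                result.add(child)
                stack.append(child)
    return result


def expand_type_pattern(var, target, sub_class_of):
    """Rewrite `?var type target` into one pattern per matching type."""
    return [f"{var} type {t}" for t in sorted(subclasses_of(target, sub_class_of))]


# (sub, super) pairs; typeW is a hypothetical extra level below typeX.
hierarchy = [("typeX", "typeY"), ("typeW", "typeX")]
patterns = expand_type_pattern("?z", "typeY", hierarchy)
print(patterns)  # ['?z type typeW', '?z type typeX', '?z type typeY']
```

The same expansion would apply to subPropertyOf by walking property pairs instead of class pairs.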

Page 31

Serialising Molecules

#TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE

#TEMPLATES - the number of templates in the molecule
TEMPLATE_SIZE - the size of a key in bytes
#TRIPLES - the number of triples in the molecule
KEY_SIZE - the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
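As a worked example of the formula, the sketch below evaluates it for a hypothetical molecule with 3 templates and 10 triples. KEY_SIZE is 8 as stated above; TEMPLATE_SIZE is also taken as 8 here because the slide describes it as the size of a key, which is an assumption about the intended value.

```python
def molecule_size_bytes(num_templates, num_triples, template_size, key_size=8):
    """Size estimate: #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE."""
    return num_templates * template_size + num_triples * key_size


# Hypothetical molecule: 3 templates, 10 triples, 8-byte template entries and keys.
size = molecule_size_bytes(num_templates=3, num_triples=10, template_size=8)
print(size)  # 3 * 8 + 10 * 8 = 104
```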