Exchange and Consumption of Huge RDF Data

Upload: mario-arias

Post on 29-Aug-2014

DESCRIPTION

Huge RDF datasets are currently exchanged in textual RDF formats, so consumers must post-process them with RDF stores before local consumption tasks such as indexing and SPARQL querying. This is a painful task that demands considerable time and computational resources. A first approach to lightweight data exchange is HDT, a compact (binary) RDF serialization format. In this paper, we show how to enhance the exchanged HDT with additional structures that support some basic forms of SPARQL query resolution without the need to "unpack" the data. Experiments show that i) exchange efficiency outperforms universal compression, ii) post-processing becomes a fast process, and iii) query performance at consumption is competitive.

TRANSCRIPT

Page 1: Exchange and Consumption of Huge RDF Data

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Exchange and Consumption of Huge RDF Data

Miguel A. Martínez-Prieto (1, 2)

Mario Arias (1, 3)

Javier D. Fernández (1, 2)

1. Department of Computer Science, Universidad de Valladolid (Spain)
2. Department of Computer Science, Universidad de Chile (Chile)
3. Digital Enterprise Research Institute, National University of Ireland Galway

Page 2: Exchange and Consumption of Huge RDF Data

Sharing RDF in the Web of Data

[Figure: applications (APP) sharing RDF data in three stages.
1. Publication: data generation (sensors, applications).
2. Exchange: RDF dumps, SPARQL endpoints / APIs, dereferenceable URIs.
3. Consumption: parsing / indexing, reasoning.]

• Dataset analysis.
• Setting up a SPARQL server.
• Vocabulary interlinking / integration.
• Browsing and visualization.
• Exchange between servers.
• Data-intensive tasks.

Page 3: Exchange and Consumption of Huge RDF Data

Dataset Exchange Workflow

1. Publication: Convert, Serialize, Compress
2. Exchange: Transfer
3. Consumption: Decompress, Parse, Index

If RDF is meant to be machine-processable, why are we using plain-text serialization formats?

Page 4: Exchange and Consumption of Huge RDF Data

HDT: RDF Binary Format

• Compact data structure for RDF.
• W3C Submission: http://www.w3.org/Submission/2011/03/
• Open-source C++/Java library.

Page 5: Exchange and Consumption of Huge RDF Data

HDT Focused on Querying (FoQ)

Contributions of this paper:
• A complementary index that makes HDT fully queryable.
• An analysis of how HDT reduces exchange and indexing time.
• An evaluation of in-memory query performance.

Page 6: Exchange and Consumption of Huge RDF Data

Dictionary

• Maps strings to consecutive IDs {1..n}.
• Lexicographically sorted, no duplicates.
• Section compression is explained in [8].
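The mapping described on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `Dictionary` and its methods are hypothetical, all terms share a single ID space, and no compression is applied, so it is not the layout used by the HDT library.

```python
from bisect import bisect_left


class Dictionary:
    """Toy string dictionary: sorted, duplicate-free terms mapped to IDs 1..n."""

    def __init__(self, terms):
        # Lexicographic order without duplicates, as stated on the slide.
        self._terms = sorted(set(terms))

    def string_to_id(self, term):
        """Return the 1-based ID of `term`, or 0 if it is unknown (an assumed convention)."""
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return i + 1
        return 0

    def id_to_string(self, ident):
        """Return the term stored under a 1-based ID."""
        return self._terms[ident - 1]


d = Dictionary(["ex:bob", "ex:alice", "ex:knows", "ex:alice"])
print(d.string_to_id("ex:bob"), d.id_to_string(3))
```

Because the terms are sorted, a lookup by string is a binary search and a lookup by ID is a direct array access.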

Page 7: Exchange and Consumption of Huge RDF Data

Triples Model

Triples (S P O):
1 2 6
1 3 2
2 1 3
2 2 4
2 2 5
2 4 1
3 3 2

The same triples as a tree, one level per component:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]

Page 8: Exchange and Consumption of Huge RDF Data

Adjacency Lists

Example: the lists [2, 3] [1, 2, 4] [3] (lists 1, 2 and 3) are stored as an array plus a bitmap that marks the first element of each list.

Position: 1 2 3 4 5 6
Array:    2 3 1 2 4 3
Bitmap:   1 0 1 0 0 1

Operations:
– access(g): given a global position, get the value. O(1)
– findList(g): given a global position, get the list number. O(1)
– first(l), last(l): given a list, find its first and last positions. O(log log n)
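The three operations can be illustrated with a short Python sketch over the slide's example (array 2 3 1 2 4 3, bitmap 1 0 1 0 0 1). The class and method names are hypothetical, and the bitmap is handled with a plain list of start positions and binary search, so the costs listed on the slide do not apply to this code; it only shows what each operation returns.

```python
from bisect import bisect_right


class AdjacencyList:
    """Lists stored as one array plus a bitmap marking the first element of each list."""

    def __init__(self, array, bitmap):
        self.array = array
        # 1-based global positions where a list begins (bit set to 1).
        self.starts = [i + 1 for i, bit in enumerate(bitmap) if bit == 1]

    def access(self, g):
        """Value stored at global position g (1-based)."""
        return self.array[g - 1]

    def find_list(self, g):
        """Number of the list containing position g (a rank on the bitmap)."""
        return bisect_right(self.starts, g)

    def first(self, l):
        """Global position of the first element of list l (a select on the bitmap)."""
        return self.starts[l - 1]

    def last(self, l):
        """Global position of the last element of list l."""
        if l < len(self.starts):
            return self.starts[l] - 1
        return len(self.array)


y = AdjacencyList([2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1])
print(y.access(4), y.find_list(4), y.first(2), y.last(2))  # prints: 2 2 3 5
```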

Page 9: Exchange and Consumption of Huge RDF Data

Triples Model and Coding

Triples (S P O): (1 2 6) (1 3 2) (2 1 3) (2 2 4) (2 2 5) (2 4 1) (3 3 2)

Tree:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]

Coding:
Array Y (predicates): 2 3 1 2 4 3
Bitmap Y:             1 0 1 0 0 1
Array Z (objects):    6 2 3 4 5 1 2
Bitmap Z:             1 1 1 1 0 1 1
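As a rough illustration of this coding, the following Python sketch derives the two arrays and two bitmaps from a list of ID triples. The function name `encode_bitmap_triples` is hypothetical. The sketch assumes subjects are numbered 1..n with no gaps (so the subject level stays implicit) and that a 1 bit marks the first element of each list, which is how the slide's bitmaps read.

```python
def encode_bitmap_triples(triples):
    """Encode (s, p, o) ID triples as Array Y / Bitmap Y / Array Z / Bitmap Z."""
    array_y, bitmap_y, array_z, bitmap_z = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(set(triples)):  # SPO order, duplicates dropped
        new_subject = s != prev_s
        if new_subject or p != prev_p:
            # A new (subject, predicate) pair opens a new object list.
            array_y.append(p)
            bitmap_y.append(1 if new_subject else 0)
            bitmap_z.append(1)
        else:
            bitmap_z.append(0)
        array_z.append(o)
        prev_s, prev_p = s, p
    return array_y, bitmap_y, array_z, bitmap_z


TRIPLES = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1), (3, 3, 2)]
array_y, bitmap_y, array_z, bitmap_z = encode_bitmap_triples(TRIPLES)
print(array_y, bitmap_y)
print(array_z, bitmap_z)
```

Run on the seven example triples, the output matches the arrays and bitmaps shown on the slide.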

Page 10: Exchange and Consumption of Huge RDF Data

Searching by Subject

Patterns: SPO, SP?, S??, S?O

Example: ( 2, 2, ? )

[Figure: the example triples with Array Y = 2 3 1 2 4 3, Bitmap Y = 1 0 1 0 0 1, Array Z = 6 2 3 4 5 1 2, Bitmap Z = 1 1 1 1 0 1 1.]
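A minimal Python sketch of how the example pattern ( 2, 2, ? ) could be resolved over these structures: locate the predicate list of the subject in Array Y, binary-search the predicate inside it, then read the matching object list from Array Z. The helper names `list_bounds` and `search_sp` are hypothetical, and the bitmaps are scanned linearly here instead of using constant-time rank/select.

```python
from bisect import bisect_left

ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def list_bounds(bitmap, l):
    """0-based half-open range [start, end) covered by the l-th list (l is 1-based)."""
    starts = [i for i, bit in enumerate(bitmap) if bit == 1]
    start = starts[l - 1]
    end = starts[l] if l < len(starts) else len(bitmap)
    return start, end


def search_sp(s, p):
    """Resolve the pattern (s, p, ?) and return the matching triples."""
    y_lo, y_hi = list_bounds(BITMAP_Y, s)
    # Predicates inside one subject's list are sorted, so binary search applies.
    i = bisect_left(ARRAY_Y, p, y_lo, y_hi)
    if i == y_hi or ARRAY_Y[i] != p:
        return []
    # The entry at 0-based index i of Array Y owns the (i + 1)-th list of Array Z.
    z_lo, z_hi = list_bounds(BITMAP_Z, i + 1)
    return [(s, p, o) for o in ARRAY_Z[z_lo:z_hi]]


print(search_sp(2, 2))  # prints: [(2, 2, 4), (2, 2, 5)]
```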

Page 11: Exchange and Consumption of Huge RDF Data

Searching by Predicate

Pattern: ?P?

Example: ( ?, 2, ? )

[Figure: the example triples with Array Y = 2 3 1 2 4 3, Bitmap Y = 1 0 1 0 0 1, Array Z = 6 2 3 4 5 1 2, Bitmap Z = 1 1 1 1 0 1 1.]

Page 12: Exchange and Consumption of Huge RDF Data

Wavelet Tree

Compact sequence of integers in {0..σ}.

– access(position): value at position. O(log σ)
– rank(entry, position): number of appearances of “entry” up to “position”. O(log σ)
– select(entry, i): position where “entry” appears for the i-th time. O(log σ)

Example:
Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sequence:  2  3  6  3  6  1  2  3  6  2  5  2  4  1  4  2

rank(3, 7) = 2
select(6, 3) = 9
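To make the three operations concrete, here is a small pointer-based wavelet tree in Python, run on the slide's 16-element sequence. It is a teaching sketch, not the succinct structure used in practice: the class name is hypothetical, the input is assumed non-empty, bits are kept as plain prefix-count lists, and `select` uses a binary search at each level, so it does not reach the O(log σ) bound quoted on the slide.

```python
from bisect import bisect_left


class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi]; positions are 1-based."""

    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.n = len(seq)
        self.leaf = self.lo == self.hi or not seq
        if self.leaf:
            return
        self.mid = (self.lo + self.hi) // 2
        bits = [1 if v > self.mid else 0 for v in seq]
        # ones[i] / zeros[i] = number of 1s / 0s among the first i bits.
        self.ones = [0]
        for b in bits:
            self.ones.append(self.ones[-1] + b)
        self.zeros = [i - c for i, c in enumerate(self.ones)]
        self.left = WaveletTree([v for v in seq if v <= self.mid], self.lo, self.mid)
        self.right = WaveletTree([v for v in seq if v > self.mid], self.mid + 1, self.hi)

    def access(self, pos):
        node = self
        while not node.leaf:
            if node.ones[pos] - node.ones[pos - 1]:  # the bit at pos is 1
                pos, node = node.ones[pos], node.right
            else:
                pos, node = node.zeros[pos], node.left
        return node.lo

    def rank(self, value, pos):
        if not self.lo <= value <= self.hi:
            return 0
        node = self
        while not node.leaf:
            if value > node.mid:
                pos, node = node.ones[pos], node.right
            else:
                pos, node = node.zeros[pos], node.left
        return pos

    def select(self, value, i):
        if i < 1 or not self.lo <= value <= self.hi:
            return None
        if self.leaf:
            return i if i <= self.n else None
        if value > self.mid:
            child, counts = self.right, self.ones
        else:
            child, counts = self.left, self.zeros
        p = child.select(value, i)
        # Map the child position back: where does the p-th 1 (or 0) sit at this level?
        return None if p is None else bisect_left(counts, p)


SEQ = [2, 3, 6, 3, 6, 1, 2, 3, 6, 2, 5, 2, 4, 1, 4, 2]
wt = WaveletTree(SEQ)
print(wt.rank(3, 7), wt.select(6, 3))  # prints: 2 9
```

Each level splits the value range in half, which is why the number of levels, and therefore the cost of `access` and `rank`, grows with log σ rather than with the sequence length.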

Page 13: Exchange and Consumption of Huge RDF Data

Searching by Predicate w/ Wavelet

Pattern: ?P?

Example: ( ?, 2, ? )

[Figure: the example triples with the predicate array stored as a wavelet tree: Wavelet Y = 2 3 1 2 4 3, Bitmap Y = 1 0 1 0 0 1, Array Z = 6 2 3 4 5 1 2, Bitmap Z = 1 1 1 1 0 1 1.]
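The example ( ?, 2, ? ) can be traced with the Python sketch below: each occurrence of the predicate in the Y sequence is located with a select-style lookup, its subject is recovered from Bitmap Y, and its objects are read from Array Z. As a loudly stated simplification, `POSITIONS` is a plain dictionary standing in for the wavelet tree's select operation; the names `POSITIONS`, `Y_STARTS`, `Z_STARTS` and `search_p` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]

# Stand-in for select(p, i) on the wavelet tree: 1-based positions of each predicate in Y.
POSITIONS = defaultdict(list)
for pos, pred in enumerate(ARRAY_Y, start=1):
    POSITIONS[pred].append(pos)

# 1-based positions where each list starts in Y and in Z.
Y_STARTS = [i + 1 for i, bit in enumerate(BITMAP_Y) if bit]
Z_STARTS = [i + 1 for i, bit in enumerate(BITMAP_Z) if bit]


def search_p(p):
    """Resolve the pattern (?, p, ?) and return the matching triples."""
    results = []
    for y_pos in POSITIONS.get(p, []):
        s = bisect_right(Y_STARTS, y_pos)  # findList on Bitmap Y gives the subject
        z_first = Z_STARTS[y_pos - 1]      # the y_pos-th object list
        z_last = Z_STARTS[y_pos] - 1 if y_pos < len(Z_STARTS) else len(ARRAY_Z)
        for z in range(z_first, z_last + 1):
            results.append((s, p, ARRAY_Z[z - 1]))
    return results


print(search_p(2))  # prints: [(1, 2, 6), (2, 2, 4), (2, 2, 5)]
```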

Page 14: Exchange and Consumption of Huge RDF Data

Triples: Object-Search

Patterns: ??O, ?PO

Example: ( ?, ?, 2 )

Array Z: 6 2 3 4 5 1 2

OP-Index (for each object, its positions in Array Z):
O1: [6]
O2: [2 7]
O3: [3]
O4: [4]
O5: [5]
O6: [1]
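The following Python sketch shows one way such an index could be built and used for the example ( ?, ?, 2 ). It assumes the OP-Index stores, for each object, its positions in Array Z, which is consistent with the values on the slide (6 | 2 7 | 3 | 4 | 5 | 1). The function names are hypothetical, and ?PO is answered here by simply filtering the ??O results.

```python
from bisect import bisect_right

ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]

Y_STARTS = [i + 1 for i, bit in enumerate(BITMAP_Y) if bit]
Z_STARTS = [i + 1 for i, bit in enumerate(BITMAP_Z) if bit]


def build_op_index(array_z):
    """Map each object ID to the 1-based positions where it occurs in Array Z."""
    index = {}
    for z_pos, obj in enumerate(array_z, start=1):
        index.setdefault(obj, []).append(z_pos)
    return index


OP_INDEX = build_op_index(ARRAY_Z)


def search_o(o):
    """Resolve the pattern (?, ?, o) by walking up from Array Z to the subject."""
    results = []
    for z_pos in OP_INDEX.get(o, []):
        y_pos = bisect_right(Z_STARTS, z_pos)  # findList on Bitmap Z
        p = ARRAY_Y[y_pos - 1]
        s = bisect_right(Y_STARTS, y_pos)      # findList on Bitmap Y
        results.append((s, p, o))
    return results


def search_po(p, o):
    """Resolve (?, p, o) by filtering the (?, ?, o) results on the predicate."""
    return [t for t in search_o(o) if t[1] == p]


print(search_o(2))  # prints: [(1, 3, 2), (3, 3, 2)]
```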

Page 15: Exchange and Consumption of Huge RDF Data

Data Structure Summary

From HDT to HDT-FoQ:
• Convert Array Y to a wavelet tree.
• Generate the OP-Index.

Triple patterns:
SPO, SP?, S??, S?O    Original HDT
?P?                   Wavelet Tree
?PO, ??O              OP-Index
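The pattern-to-structure table above amounts to a small dispatch rule, sketched here in Python. The function name `structure_for` is hypothetical, `None` stands for an unbound term, and the fully unbound pattern ??? returns `None` because the slide does not list it.

```python
def structure_for(s, p, o):
    """Name the structure that resolves a triple pattern; None marks an unbound term."""
    if s is not None:
        return "Original HDT"   # SPO, SP?, S??, S?O
    if o is not None:
        return "OP-Index"       # ?PO, ??O
    if p is not None:
        return "Wavelet Tree"   # ?P?
    return None                 # ???: not listed on the slide


print(structure_for(2, 2, None))     # prints: Original HDT
print(structure_for(None, 2, None))  # prints: Wavelet Tree
print(structure_for(None, None, 2))  # prints: OP-Index
```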

Page 16: Exchange and Consumption of Huge RDF Data

Evaluation Environment

Datasets:
Dataset     Triples   Size (N-Triples)
LinkedMDB   6.1M      850 MB
DBLP        73M       11.1 GB
Geonames    112M      12.3 GB
DBPedia     258M      37.3 GB

Producer: Xeon @ 2.4 GHz, 96 GB RAM
Consumer: Phenom-II @ 3.2 GHz, 8 GB RAM

RDF storage:
• Virtuoso
• RDF-3x
• Hexastore

Compressors:
• GZIP
• LZMA

Page 17: Exchange and Consumption of Huge RDF Data

Compression Ratio

[Figure: bar chart of compression ratio (% against plain N-Triples, axis 0 to 14) for LinkedMDB, DBLP, Geonames and DBPedia; series: hdt, gz, lzma, hdt.gz, hdt.lzma.]

Page 18: Exchange and Consumption of Huge RDF Data

Publication Times

Dataset     NT+GZIP    NT+LZMA    HDT        HDT+GZIP   HDT+LZMA
linkedMDB   11.3 sec   14.7 min   1.05 min   1.09 min   1.52 min
DBLP        2.72 min   103 min    12 min     13.5 min   21.9 min
Geonames    3.28 min   244 min    25 min     26.4 min   38.9 min
DBPedia     15.9 min   466 min    56 min     60 min     121 min

[Figure: bar chart of publication time expressed as times slower than N-Triples + GZIP (axis 0 to 80) for linkedMDB, dblp, geonames and dbpedia; series: gz, lzma, hdt, hdt.gz, hdt.lzma.]
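The "times slower than N-Triples + GZIP" values plotted on this slide can be recomputed from the table. The Python sketch below does that arithmetic; the table values are copied from the slide (with the linkedMDB NT+GZIP time converted from seconds to minutes), while the names `TIMES_MIN` and `times_slower` are hypothetical.

```python
# Publication times in minutes, copied from the slide's table.
TIMES_MIN = {
    "linkedMDB": {"NT+GZIP": 11.3 / 60, "NT+LZMA": 14.7, "HDT": 1.05, "HDT+GZIP": 1.09, "HDT+LZMA": 1.52},
    "DBLP":      {"NT+GZIP": 2.72, "NT+LZMA": 103, "HDT": 12, "HDT+GZIP": 13.5, "HDT+LZMA": 21.9},
    "Geonames":  {"NT+GZIP": 3.28, "NT+LZMA": 244, "HDT": 25, "HDT+GZIP": 26.4, "HDT+LZMA": 38.9},
    "DBPedia":   {"NT+GZIP": 15.9, "NT+LZMA": 466, "HDT": 56, "HDT+GZIP": 60, "HDT+LZMA": 121},
}


def times_slower(dataset, method):
    """Publication time of `method` relative to NT+GZIP for the same dataset."""
    row = TIMES_MIN[dataset]
    return row[method] / row["NT+GZIP"]


for dataset in TIMES_MIN:
    ratios = {m: round(times_slower(dataset, m), 1) for m in TIMES_MIN[dataset]}
    print(dataset, ratios)
```

For DBLP, for example, HDT takes 12 / 2.72, roughly 4.4 times as long as NT+GZIP, while NT+LZMA takes about 38 times as long.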

Page 19: Exchange and Consumption of Huge RDF Data

Publication Times (2)

[Figure: the same comparison without the lzma series, expressed as times slower than N-Triples + GZIP (axis 0 to 13) for linkedMDB, dblp, geonames and dbpedia; series: gz, hdt, hdt.gz, hdt.lzma. The slide repeats the publication times table of the previous slide.]

Page 20: Exchange and Consumption of Huge RDF Data

Exchange & Decompression Time

[Figure: stacked bar chart of exchange and decompression time in seconds (geometric mean of all datasets, axis 0 to 300) for HDT+LZMA, HDT+GZIP, LZMA and GZIP.]

*Assuming a network bandwidth of 2 MByte/s

Page 21: Exchange and Consumption of Huge RDF Data

Overall Client Time

[Figure: stacked bar chart of exchange, decompression and indexing time in seconds (geometric mean of all datasets, axis 0 to 3600) for HDT+GZIP+FOQ, HDT+LZMA+FOQ, GZ+RDF3x, LZMA+RDF3x, GZ+Virtuoso and LZMA+Virtuoso.]

Dataset     LZMA+RDF3x   HDT+LZMA
linkedMDB   2.1 min      9.21 sec
dblp        27 min       2.02 min
geonames    49.2 min     3.04 min
dbpedia     159 min      14.3 min
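The per-dataset speedup implied by this table is a one-line division, shown here as a small Python sketch. The times are copied from the slide and converted to seconds; the names `CLIENT_TIME_SEC` and `speedup` are hypothetical.

```python
# Overall client time in seconds: (LZMA+RDF3x, HDT+LZMA), copied from the slide.
CLIENT_TIME_SEC = {
    "linkedMDB": (2.1 * 60, 9.21),
    "dblp":      (27 * 60, 2.02 * 60),
    "geonames":  (49.2 * 60, 3.04 * 60),
    "dbpedia":   (159 * 60, 14.3 * 60),
}


def speedup(dataset):
    """How many times faster the HDT+LZMA client is than LZMA+RDF3x."""
    rdf3x, hdt = CLIENT_TIME_SEC[dataset]
    return rdf3x / hdt


for name in CLIENT_TIME_SEC:
    print(f"{name}: {speedup(name):.1f}x")
```

The resulting ratios fall between roughly 11x and 16x across the four datasets.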

Page 22: Exchange and Consumption of Huge RDF Data

In-memory Data Store

Smaller size = more data in memory = less I/O access!

Index size (MB):
Dataset     Triples   Virtuoso   Hexastore   RDF3x    HDT-FoQ
LinkedMDB   6.1M      518        6976        337      68
DBLP        46M       3982       -           3252     850
Geonames    112M      9216       -           6678     1435
DBPedia     258M      -          -           15802    5260

Page 23: Exchange and Consumption of Huge RDF Data

Query Performance, Triple Patterns

[Figure: two bar charts (LinkedMDB, Geonames) showing how many times faster HDT is than RDF-3x and Virtuoso (axis 0 to 16) for the patterns SP?, S?O, S??, ?PO, ?P? and ??O.]

Page 24: Exchange and Consumption of Huge RDF Data

Query Performance, Two-way Joins

[Figure: two bar charts (Geonames, LinkedMDB) showing how many times faster HDT is than RDF-3x and Virtuoso (axis 0 to 3) for the join types SSbig, SSsmall, SObig, SOsmall, OObig and OOsmall.]

Page 25: Exchange and Consumption of Huge RDF Data

Conclusions

• Data is ready to be consumed 10-15x faster.
  – Exchange time is reduced.
  – Indexing burden on the server = lightweight client processing.
• Competitive query performance.
  – Very fast on triple patterns.
  – Joins on the same scale as existing solutions.
• This is useful to you:
  – If you need a fast, compact, read-only in-memory RDF store.
  – If you want to share self-queryable RDF dumps.
  – If you need fast download & query.
• Addresses the volume issue of Big Data.

Page 26: Exchange and Consumption of Huge RDF Data

Future work

• Full SPARQL support.
  – UNION, OPTIONAL, multiple joins.
  – Optimized query evaluation.
• Integration: Jena, Any23…
• Dissemination: get more people to use it!
• Additional services on top of HDT.
  – SPARQL endpoint.
  – Distributed stream processing.
  – Mobile applications.

Page 27: Exchange and Consumption of Huge RDF Data

Thanks! http://www.rdf-hdt.org