Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Exchange and Consumption of Huge RDF Data

Miguel A. Martínez-Prieto (1,2)

Mario Arias (1,3)

Javier D. Fernández (1,2)

1. Department of Computer Science, Universidad de Valladolid (Spain)
2. Department of Computer Science, Universidad de Chile (Chile)
3. Digital Enterprise Research Institute, National University of Ireland Galway


Sharing RDF in the Web of Data.

[Diagram: applications (APP) sharing RDF in three stages]

1. PUBLICATION: data generation (e.g. sensors).
2. EXCHANGE: RDF dumps, SPARQL endpoints / APIs, dereferenceable URIs.
3. CONSUMPTION: parsing / indexing, reasoning.

• Dataset analysis.
• Set up a SPARQL server.
• Vocabulary interlinking / integration.
• Browsing and visualization.
• Exchange between servers.
• Data-intensive tasks.


Dataset Exchange Workflow

1. Publication: Convert, Serialize, Compress
2. Exchange: Transfer
3. Consumption: Decompress, Parse, Index
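The three workflow stages can be illustrated with a minimal sketch using plain-text serialization and general-purpose compressors. Everything here is an assumption for illustration: the toy triples, the `serialize` and `parse` helpers (which only handle IRIs, not literals or blank nodes), and the simulated transfer.

```python
import gzip
import lzma

# Hypothetical toy dataset of (subject, predicate, object) IRIs.
triples = [
    ("http://example.org/s1", "http://example.org/p", "http://example.org/o1"),
    ("http://example.org/s2", "http://example.org/p", "http://example.org/o2"),
]

def serialize(triples):
    """Publication: write one N-Triples-style line per triple (IRIs only)."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples).encode("utf-8")

def parse(data):
    """Consumption: read the lines back into tuples (IRIs only)."""
    out = []
    for line in data.decode("utf-8").splitlines():
        s, p, o = line.rstrip(" .").split(" ")
        out.append((s[1:-1], p[1:-1], o[1:-1]))  # strip the angle brackets
    return out

plain = serialize(triples)
for name, codec in (("gzip", gzip), ("lzma", lzma)):
    packed = codec.compress(plain)                 # 1. publication: compress
    received = packed                              # 2. exchange: transfer (simulated)
    restored = parse(codec.decompress(received))   # 3. consumption: decompress + parse
    assert restored == triples
    print(name, len(plain), "->", len(packed), "bytes")
```

Note that the consumer still has to parse and index the text after decompressing it, which is the cost the rest of the talk targets.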

If RDF is meant to be machine-processable, why are we using plain-text serialization formats?


HDT: RDF Binary Format

• Compact data structure for RDF.
• W3C Submission: http://www.w3.org/Submission/2011/03/
• Open-source C++/Java library.



HDT Focused on Querying (HDT-FoQ)

Contributions of this paper:
• A complementary index that makes HDT fully queryable.
• An analysis of how HDT reduces exchange and indexing time.
• An evaluation of in-memory query performance.


Dictionary

• Mapping of strings to correlative IDs {1..n}.
• Lexicographically sorted, no duplicates.
• Section compression explained in [8].
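A minimal sketch of such a dictionary, assuming a single sorted list of terms. The class name, method names and example terms are hypothetical, and the sections and compression mentioned on the slide are deliberately left out.

```python
from bisect import bisect_left

class Dictionary:
    """Maps strings to correlative IDs 1..n over a sorted, duplicate-free list."""

    def __init__(self, terms):
        self.terms = sorted(set(terms))  # lexicographic order, no duplicates

    def string_to_id(self, term):
        # Binary search works because the list is sorted.
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return i + 1                 # IDs start at 1
        return None                      # term not in the dictionary

    def id_to_string(self, ident):
        return self.terms[ident - 1]

d = Dictionary(["ex:bob", "ex:alice", "ex:knows", "ex:alice"])
print(d.string_to_id("ex:bob"))   # 2
print(d.id_to_string(3))          # ex:knows
```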


Triples Model

Triples (S P O):
1 2 6
1 3 2
2 1 3
2 2 4
2 2 5
2 4 1
3 3 2

Tree representation:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]


Adjacency Lists

Operations:
– access(g): given a global position, get the value. O(1)
– findList(g): given a global position, get the list number. O(1)
– first(l), last(l): given a list, find its first and last positions. O(log log n)

Example:
Lists:    1: [2, 3]   2: [1, 2, 4]   3: [3]
Position: 1 2 3 4 5 6
Array:    2 3 1 2 4 3
Bitmap:   1 0 1 0 0 1

In this example, a 1 in the bitmap marks the first element of each list.
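The three operations can be sketched over the slide's example lists. This is a naive illustration, not the succinct structure itself: the class name is hypothetical, plain Python lists stand in for the rank/select-enabled bitmap, so the costs quoted on the slide do not apply, and empty lists are assumed not to occur.

```python
class AdjacencyLists:
    """Lists flattened into one array plus a bitmap marking list starts."""

    def __init__(self, lists):
        self.array = [v for lst in lists for v in lst]
        self.bitmap = [1 if i == 0 else 0 for lst in lists for i in range(len(lst))]
        # 1-based start position of every list (stand-in for select1 on the bitmap).
        self.starts = [g for g, bit in enumerate(self.bitmap, start=1) if bit]

    def access(self, g):
        """Value stored at global position g (1-based)."""
        return self.array[g - 1]

    def find_list(self, g):
        """Number of the list that contains global position g (rank1 on the bitmap)."""
        return sum(self.bitmap[:g])

    def first(self, l):
        """Global position of the first element of list l."""
        return self.starts[l - 1]

    def last(self, l):
        """Global position of the last element of list l."""
        return self.starts[l] - 1 if l < len(self.starts) else len(self.array)

adj = AdjacencyLists([[2, 3], [1, 2, 4], [3]])
print(adj.array)                  # [2, 3, 1, 2, 4, 3]
print(adj.bitmap)                 # [1, 0, 1, 0, 0, 1]
print(adj.find_list(4))           # 2
print(adj.first(2), adj.last(2))  # 3 5
```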


Triples Model and Coding

Triples (S P O):
1 2 6
1 3 2
2 1 3
2 2 4
2 2 5
2 4 1
3 3 2

Tree representation:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]

Coding:
Array Y:  2 3 1 2 4 3
Bitmap Y: 1 0 1 0 0 1
Array Z:  6 2 3 4 5 1 2
Bitmap Z: 1 1 1 1 0 1 1
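The coding of the example can be reproduced with a short sketch. The function name `encode` is hypothetical, and the sketch assumes subject IDs are contiguous from 1 so that subjects stay implicit, and that a 1 bit marks the first entry of each list, as in the slide's bitmaps.

```python
def encode(triples):
    """Turn ID triples into (Array Y, Bitmap Y, Array Z, Bitmap Z)."""
    array_y, bitmap_y, array_z, bitmap_z = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(triples):           # SPO order
        new_s = s != prev_s
        new_p = new_s or p != prev_p
        if new_p:
            array_y.append(p)                  # one entry per distinct (S, P) pair
            bitmap_y.append(1 if new_s else 0) # 1 = first predicate of a subject
        array_z.append(o)                      # one entry per triple
        bitmap_z.append(1 if new_p else 0)     # 1 = first object of an (S, P) pair
        prev_s, prev_p = s, p
    return array_y, bitmap_y, array_z, bitmap_z

TRIPLES = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1), (3, 3, 2)]
array_y, bitmap_y, array_z, bitmap_z = encode(TRIPLES)
print(array_y, bitmap_y)   # [2, 3, 1, 2, 4, 3] [1, 0, 1, 0, 0, 1]
print(array_z, bitmap_z)   # [6, 2, 3, 4, 5, 1, 2] [1, 1, 1, 1, 0, 1, 1]
```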


Searching by Subject

[Figure: the example triples with Array Y / Bitmap Y and Array Z / Bitmap Z, as in "Triples Model and Coding"]

Patterns: SPO, SP?, S??, S?O

Example: ( 2, 2, ? )
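A sketch of subject-based search over the example arrays. The helper names are hypothetical, `None` stands for "?", and the bitmap scans are naive stand-ins for constant-time rank/select; a real implementation would also binary-search the sorted predicate list rather than scan it.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]

def list_range(bitmap, l):
    """1-based (first, last) positions of the l-th list; a 1 bit marks a list start."""
    starts = [g for g, bit in enumerate(bitmap, start=1) if bit]
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else len(bitmap)
    return first, last

def search_subject(s, p=None, o=None):
    """Resolve S??, SP?, S?O and SPO."""
    results = []
    y_first, y_last = list_range(BITMAP_Y, s)        # predicates of subject s
    for y in range(y_first, y_last + 1):
        pred = ARRAY_Y[y - 1]
        if p is not None and pred != p:
            continue
        z_first, z_last = list_range(BITMAP_Z, y)    # objects of the y-th (S, P) pair
        for z in range(z_first, z_last + 1):
            obj = ARRAY_Z[z - 1]
            if o is None or obj == o:
                results.append((s, pred, obj))
    return results

print(search_subject(2, 2))   # [(2, 2, 4), (2, 2, 5)]
```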


Searching by Predicate

[Figure: the example triples with Array Y / Bitmap Y and Array Z / Bitmap Z, as in "Triples Model and Coding"]

Pattern: ?P?

Example: ( ?, 2, ? )


Wavelet Tree

Compact sequence of integers {0..σ}.

– access(position): value at position. O(log σ)
– rank(entry, position): number of appearances of "entry" up to "position". O(log σ)
– select(entry, i): position where "entry" appears for the i-th time. O(log σ)

Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sequence:  2  3  6  3  6  1  2  3  6  2  5  2  4  1  4  2

rank(3, 7) = 2
select(6, 3) = 9
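A pointer-based sketch of a wavelet tree that reproduces the two results on the slide. It is an illustration under stated assumptions: the class layout is hypothetical, prefix-count lists stand in for rank-enabled bitmaps, and `select` uses a binary search per level, so it does not achieve the O(log σ) bound quoted above.

```python
class WaveletTree:
    """Wavelet tree over integers; positions are 1-based as on the slide."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi, self.n = lo, hi, len(seq)
        self.left = self.right = None
        if lo == hi or not seq:
            return                                   # leaf (or empty node)
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i symbols are <= mid (go left).
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def access(self, i):
        node = self
        while node.left is not None:
            went_left = node.prefix[i] - node.prefix[i - 1] == 1
            i = node.prefix[i] if went_left else i - node.prefix[i]
            node = node.left if went_left else node.right
        return node.lo

    def rank(self, c, i):
        if not self.lo <= c <= self.hi:
            return 0
        node = self
        while node.left is not None:
            if c <= (node.lo + node.hi) // 2:
                i, node = node.prefix[i], node.left
            else:
                i, node = i - node.prefix[i], node.right
        return i

    def select(self, c, j):
        if self.left is None:
            ok = self.lo == self.hi == c and 1 <= j <= self.n
            return j if ok else None
        go_left = c <= (self.lo + self.hi) // 2
        k = (self.left if go_left else self.right).select(c, j)
        if k is None:
            return None
        lo, hi = 1, self.n          # smallest position with k symbols on that side
        while lo < hi:
            m = (lo + hi) // 2
            count = self.prefix[m] if go_left else m - self.prefix[m]
            if count >= k:
                hi = m
            else:
                lo = m + 1
        return lo

SEQ = [2, 3, 6, 3, 6, 1, 2, 3, 6, 2, 5, 2, 4, 1, 4, 2]
wt = WaveletTree(SEQ)
print(wt.rank(3, 7))     # 2
print(wt.select(6, 3))   # 9
```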


Searching by Predicate w/ Wavelet

[Figure: the example triples with Wavelet Y / Bitmap Y and Array Z / Bitmap Z]

Pattern: ?P?

Example: ( ?, 2, ? )
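A sketch of how select on the predicate sequence can drive a ?P? search. The function names are hypothetical, and the linear-scan `select` and `rank1` are naive stand-ins for the wavelet tree and the rank-enabled bitmap.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]      # stored as a wavelet tree in HDT-FoQ
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]

def select(seq, value, i):
    """1-based position of the i-th occurrence of value, or None."""
    seen = 0
    for pos, v in enumerate(seq, start=1):
        if v == value:
            seen += 1
            if seen == i:
                return pos
    return None

def rank1(bitmap, g):
    """Number of 1 bits in positions 1..g."""
    return sum(bitmap[:g])

def list_range(bitmap, l):
    """1-based (first, last) positions of the l-th list."""
    starts = [g for g, bit in enumerate(bitmap, start=1) if bit]
    return starts[l - 1], (starts[l] - 1 if l < len(starts) else len(bitmap))

def search_predicate(p):
    """Resolve ?P? by jumping to each occurrence of p in the Y sequence."""
    results = []
    i = 1
    y = select(ARRAY_Y, p, i)
    while y is not None:
        s = rank1(BITMAP_Y, y)                     # subject owning position y
        z_first, z_last = list_range(BITMAP_Z, y)  # its objects
        results.extend((s, p, ARRAY_Z[z - 1]) for z in range(z_first, z_last + 1))
        i += 1
        y = select(ARRAY_Y, p, i)
    return results

print(search_predicate(2))   # [(1, 2, 6), (2, 2, 4), (2, 2, 5)]
```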


Triples: Object-Search

[Figure: the example triples tree (S, P, O levels) plus the OP-Index]

OP-Index (for each object, the positions in Array Z where it occurs):
O1: [6]   O2: [2 7]   O3: [3]   O4: [4]   O5: [5]   O6: [1]

Patterns: ??O, ?PO

Example: ( ?, ?, 2 )
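A sketch of object-based search through an OP-Index, assuming the index stores, for each object, the positions of Array Z where it occurs (which matches the values shown in the figure). The function names and the dict-of-lists layout are hypothetical, and `rank1` is a naive stand-in.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]

def rank1(bitmap, g):
    """Number of 1 bits in positions 1..g."""
    return sum(bitmap[:g])

def build_op_index(array_z, num_objects):
    """For each object ID, collect the 1-based positions where it appears in Array Z."""
    index = {o: [] for o in range(1, num_objects + 1)}
    for z, o in enumerate(array_z, start=1):
        index[o].append(z)
    return index

OP_INDEX = build_op_index(ARRAY_Z, 6)

def search_object(o, p=None):
    """Resolve ??O and ?PO; None stands for '?'."""
    results = []
    for z in OP_INDEX.get(o, []):
        y = rank1(BITMAP_Z, z)        # (S, P) pair that owns position z
        pred = ARRAY_Y[y - 1]
        if p is not None and pred != p:
            continue
        s = rank1(BITMAP_Y, y)        # subject that owns position y
        results.append((s, pred, o))
    return results

print(OP_INDEX)           # {1: [6], 2: [2, 7], 3: [3], 4: [4], 5: [5], 6: [1]}
print(search_object(2))   # [(1, 3, 2), (3, 3, 2)]
```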


Data Structure Summary.

From HDT to HDT-FoQ:
• Convert Array Y to a Wavelet Tree.
• Generate the OP-Index.

Triple patterns and the structure that resolves them:
• SPO, SP?, S??, S?O: Original HDT
• ?P?: Wavelet Tree
• ?PO, ??O: OP-Index
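The mapping from pattern to structure can be written as a small dispatch function. This is only a sketch: the function name is hypothetical, `None` marks an unbound position, and the fully unbound pattern ??? is not listed on the slide, so its handling here is an assumption.

```python
def structure_for(s, p, o):
    """Name the structure that resolves a triple pattern (None = unbound)."""
    if s is not None:
        return "Original HDT"   # SPO, SP?, S??, S?O
    if o is not None:
        return "OP-Index"       # ?PO, ??O
    if p is not None:
        return "Wavelet Tree"   # ?P?
    return "Full scan"          # ???: assumption, not on the slide

print(structure_for(2, 2, None))     # Original HDT
print(structure_for(None, 2, None))  # Wavelet Tree
print(structure_for(None, None, 2))  # OP-Index
```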


Evaluation Environment

Datasets:

Dataset    Triples  Size (NTriples)
LinkedMDB  6.1M     850 MB
DBLP       73M      11.1 GB
Geonames   112M     12.3 GB
DBPedia    258M     37.3 GB

Producer: Xeon @ 2.4 GHz, 96 GB RAM

Consumer: Phenom-II @ 3.2 GHz, 8 GB RAM

RDF storage:
• Virtuoso
• RDF-3x
• Hexastore

Compressors:
• GZIP
• LZMA


Compression Ratio

[Figure: bar chart of compression ratio (% against plain NTriples) for hdt, gz, lzma, hdt.gz and hdt.lzma on LinkedMDB, DBLP, Geonames and DBPedia; axis from 0 to 14]


Publication Times

Dataset    NT+GZIP   NT+LZMA   HDT       HDT+GZIP  HDT+LZMA
linkedMDB  11.3 sec  14.7 min  1.05 min  1.09 min  1.52 min
DBLP       2.72 min  103 min   12 min    13.5 min  21.9 min
Geonames   3.28 min  244 min   25 min    26.4 min  38.9 min
DBPedia    15.9 min  466 min   56 min    60 min    121 min

[Figure: bar chart of times slower than NTriples+GZIP for gz, lzma, hdt, hdt.gz and hdt.lzma on linkedMDB, dblp, geonames and dbpedia; axis from 0 to 80]


Publication Times (2)

Same table as on the previous slide.

[Figure: the same chart without lzma: times slower than NTriples+GZIP for gz, hdt, hdt.gz and hdt.lzma on linkedMDB, dblp, geonames and dbpedia; axis from 0 to 13]


Exchange & Decompression Time

[Figure: stacked bars of exchange and decompression time in seconds (geometric mean of all datasets) for HDT+LZMA, HDT+GZIP, LZMA and GZIP; axis from 0 to 300]

* Assuming a network bandwidth of 2 MByte/s
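The exchange time under this bandwidth assumption is simply size divided by bandwidth. The sketch below applies it to the uncompressed NTriples sizes from the evaluation environment as a baseline; these are not the values plotted in the chart, which uses compressed files whose sizes are not given here. The helper name is hypothetical, and the sizes are taken to be megabytes with 1 GB = 1000 MB.

```python
from statistics import geometric_mean

BANDWIDTH_MB_PER_S = 2.0  # bandwidth assumed on the slide

# Uncompressed NTriples sizes in MB (assumption: 1 GB = 1000 MB).
SIZES_MB = {"LinkedMDB": 850, "DBLP": 11_100, "Geonames": 12_300, "DBPedia": 37_300}

def transfer_seconds(size_mb, bandwidth=BANDWIDTH_MB_PER_S):
    """Time to move size_mb megabytes at the given bandwidth in MB/s."""
    return size_mb / bandwidth

times = {name: transfer_seconds(size) for name, size in SIZES_MB.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds:.0f} s")

# The charts aggregate per-dataset times with a geometric mean.
print(f"geometric mean: {geometric_mean(list(times.values())):.0f} s")
```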


Overall Client Time

[Figure: stacked bars of exchange, decompression and indexing time in seconds (geometric mean of all datasets) for HDT+GZIP+FOQ, HDT+LZMA+FOQ, GZ+RDF3x, LZMA+RDF3x, GZ+Virtuoso and LZMA+Virtuoso; axis from 0 to 3600]

Dataset    LZMA+RDF3x  HDT+LZMA
linkedMDB  2.1 min     9.21 sec
dblp       27 min      2.02 min
geonames   49.2 min    3.04 min
dbpedia    159 min     14.3 min
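The per-dataset speedup implied by the table can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are the table's values converted to seconds.

```python
# (LZMA+RDF3x, HDT+LZMA) overall client times in seconds, from the table.
TIMES_SECONDS = {
    "linkedMDB": (2.1 * 60, 9.21),
    "dblp": (27 * 60, 2.02 * 60),
    "geonames": (49.2 * 60, 3.04 * 60),
    "dbpedia": (159 * 60, 14.3 * 60),
}

speedups = {name: baseline / hdt for name, (baseline, hdt) in TIMES_SECONDS.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster")
```

The ratios come out between roughly 11x and 16x across the four datasets.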


In-memory Data Store.

Smaller size = more data in memory = fewer I/O accesses!

Index size (MB):

Dataset    Triples  Virtuoso  Hexastore  RDF3x  HDT-FoQ
LinkedMDB  6.1M     518       6976       337    68
DBLP       46M      3982      -          3252   850
Geonames   112M     9216      -          6678   1435
DBPedia    258M     -         -          15802  5260


Query Performance, Triple Patterns

[Figure: two bar charts (LinkedMDB and Geonames) showing how many times faster HDT is than RDF-3x and Virtuoso for the patterns SP?, S?O, S??, ?PO, ?P? and ??O; y-axis "Times HDT Faster", from 0 to 16]


Query Performance, Two-way Joins

[Figure: two bar charts (Geonames and LinkedMDB) showing how many times faster HDT is than RDF-3x and Virtuoso for the joins SSbig, SSsmall, SObig, SOsmall, OObig and OOsmall; y-axis "Times HDT Faster", from 0 to 3]


Conclusions

• Data is ready to be consumed 10-15x faster.
  – Exchange time reduced.
  – Indexing burden on the server = lightweight client processing.
• Competitive query performance.
  – Very fast on triple patterns.
  – Joins on the same scale as existing solutions.
• This is useful to you:
  – If you need a fast, compact, read-only in-memory RDF store.
  – If you want to share self-queryable RDF dumps.
  – If you need fast download & query.
• Addresses the volume issue of Big Data.


Future work

• Full SPARQL support.
  – UNION, OPTIONAL, multiple joins.
  – Optimized query evaluation.
• Integration: Jena, Any23…
• Diffusion. Get more people to use it!
• Additional services on top of HDT.
  – SPARQL endpoint.
  – Distributed stream processing.
  – Mobile applications.


Thanks! http://www.rdf-hdt.org
