2014-11 apacheconeu : lizard - clustering an rdf triplestore

21
Lizard Clustering an RDF Triplestore Andy Seaborne [email protected]

Upload: andyseaborne

Post on 10-Jul-2015

533 views

Category:

Internet


1 download

TRANSCRIPT

Page 1: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

LizardClustering an RDF Triplestore

Andy Seaborne

[email protected]

Page 2: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Why Cluster?➢ Service Resilience

○ Failures○ Server admin and security patches

➢ Performance / scale○ More hardware : CPU, RAM, system bus

Page 3: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Why Cluster?➢ Alternatives

○ Full clustering○ Master-slave (load scale only ; update issues)○ Application visible partitioning

Page 4: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Who am I?➢ Committer on Apache Jena

○ Deploy/operate Jena/TDB in ££job.

➢ W3C○ Co-editor on SPARQ 1.0 and 1.1 query lang

spaces○ RDF 1.1 (on syntax, inc. SPARQL alignment)○ ASF’s W3C AC representative

Page 5: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Acknowledgements➢ Apache➢ Partial funding : InnovateUK* ➢ Users

○ For the discussion and encouragement

* Used to be the Technology Strategy Board.UK Department for Business, Innovation & Skills

Page 6: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Outline➢ TDB Design➢ SPARQL Execution➢ Lizard Design➢ Back to SPARQL

Page 7: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

TDB➢ Custom RDF Database

○ Quads + Triples

➢ Custom index code○ Threaded B+Trees

➢ Dictionary terms○ NodeId – 8 bytes○ Inline terms (numbers, dates, dateTimes)

Page 8: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

TDB : Node Table

Page 9: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

TDB : Indexes➢ Indexes are covering

○ Range scans○ All key, no value○ No "triple table"

Page 10: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

SPARQL Execution{ ?x :p 123 . }

Convert to NodeIds

Look in POS to get all PO?, assign S to ?x

123 is an inline constant in TDB.

{ ?x :p 123 . ?x :q ?v . }

A database joinIndex join (Loop+substitution)

Index join (= loop) on :x1 :q ?vwhere :x1 is the value of ?x

Page 11: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Index Implementation➢ TDB uses threaded B+Trees for indexes

○ 8K blocks 100-way B+Tree○ Threaded : scan only touches data blocks

SPO SPO SPO ------ ------ ------

Ptr Ptr ------ ------ ------

SPO SPO SPO SPO ------ ------

Ptr Ptr Ptr ------ ------

SPO SPO SPO SPO SPO SPO SPO SPO SPO SPO ------ ------

Page 12: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Choices

Where to introduce distribution?

Query and Update

Indexes / B+Trees Node table / Objects

Blocks Key → Value Store

Page 13: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

This Does Not Work (very well)

➢ Impedance mismatch○ Too much data moving about○ Little parallelism○ Bad cold-start

Distribute the storageK->V store

Index access on query processor

Query and Update

B+Trees Objects

Blocks Key→Value

Page 14: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Distribute

➢ Distribute the indexes○ With modified index access

➢ Distribute the nodes➢ Comms : Apache Thrift

Query and Update

B+Trees Objects

Blocks Key→Value

Page 15: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Clustered Node Table➢ Node Table

○ N replicas; Read R / Write We.g. W=N and R =1 =>Complete copies of node table on each data server

○ ReplaceableRequirement: NodeId for naming

Page 16: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Clustered Indexes➢ Indexes

○ Shard by subject○ Replica shards○ Compound access operations

Page 17: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Clustered IndexesIndex

Shard 1 Shard 2 Shard 3

Machine 1 Machine 2

Page 18: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Configuration 1

Page 19: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Configuration 2

Page 20: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore

Modified SPARQL Execution➢ Different unit of index access

○ subject + several predicates(subj, pred1, pred2, pred3, …)

➢ Different join algorithms○ Merge join○ Parallel hash join