sparti scalable rdf data management using query-centric ... · conclusion •we presented sparti, a...

28
SPARTI Scalable RDF Data Management Using Query-Centric Semantic Partitioning Amgad Madkour (Purdue) Walid G. Aref (Purdue) Ahmed M. Aly (Google)

Upload: others

Post on 22-May-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

SPARTIScalable RDF Data Management Using Query-Centric Semantic Partitioning

Amgad Madkour (Purdue) Walid G. Aref (Purdue)Ahmed M. Aly (Google)

Page 2: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Motivation

• The astronomical growth of RDF data raises the need for scalable RDF management strategies

• Efficient RDF data partitioning can significantly improve the query performance over cloud platforms

• Cloud platforms provide shared memory, storage, and advanced data processing components that help manage web-scale RDF

Page 3: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Problem Definition

• Vertical Partitioning (VP) is a scalablepartitioning schema that can be used in cloud-based systems

• However, not all entries of a VP are part of the final result

• The non-matching entries cause computation and communication overhead

SELECT ?x ?y WHERE {?x :mention :Mary . ?x :tweet ?y . }

mention

SUB OBJ

:John :T1

:Mike :T2

Mike :T3

:Alex :T4

tweet

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Original Data

Reductions

Join(mention sub , tweet sub )

Join(tweet sub , mention sub )

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

SUB OBJ

:John :T1

:Alex :T4

Page 4: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Vertical Partitioning ModelVertical Partitioning [Abadi et al. – VLDB 2009]

• Create property-bound tables consisting of two columns

• Advantages­ Inspects the corresponding VP tables only­ Avoids the tuple-header reading overhead of row-stores when stored in a column store

• Drawbacks­ Some partitions can account for a large portion of the entire graph

­ (i.e. Massive I/O)

Subject:John:John:John:Mike:Mike:Alex:Alex

Propertymentionmentiontweettweettweetmentiontweet

Object:Mary:Mike:T1:T2:T3:Mary:T4

Triples Table

mention

Subject Object:John :Mary:Alex :Mary:John :Mike

tweet

Subject Object:John :T1:Mike :T2:Mike :T3:Alex :T4

Page 5: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ProposalOverview

• Three Phases

1. Property-based Partitioning2. Partition Reduction3. Query Processing

Page 6: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

SPA

ProposalHigh-Level Workflow

RDF Dataset

Partition based on “property”

Query Workload Partition Reduction

Query Processing

Input

OutputInput

(Incremental)

SPARTISemVP Tables

…Subject tweet mentio

nMETA

… … …

…Subject tweet mentio

nMETA

… … …

…Subject tweet mentio

nMETA

… … …

Row-Level Semantics

Subject Object

… …

Page 7: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ProposalProperty-Based Partitioning

• SemVP - A relational partitioning schema that extends VP’s­ Partition RDF datasets based on property­ Represent row-level semantics for every triple via semantic filters

• Create Bloom filters for every SemVP (subject and object)­ Used later to compute reduced row-sets for specific query join patterns

Page 8: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ProposalPartition Reduction

• Identify related properties from the query workload for every partition

• Identify join patterns in the query workload

• Compute a reduction using Bloom join between related properties based on the join patterns

• Store the result of the reduction as semantic filters

Page 9: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ProposalQuery Processing

• Identify the properties and join patterns in queries

• Match every property and query join pattern with a SemVPpartition and a semantic filter, respectively

• In case a match is found, a semantic filter (representing the reduced set of rows) is read to answer a query instead of reading the entire partition for a property.

Page 10: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Challenges

• How to represent row-level semantics (semantic filters)

• How to efficiently compute reductions ?

• How to select semantic filters that minimize network and disk IO ?

• How to evolve based on the query-workload

Page 11: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Semantic Vertical Partitioning (SemVP)

• A relational partitioning schema that extends vertical partitioning with row-level semantics

• Consists of 2 columns: Subject, Object

• A Semantic Data Layer (SDATA) structure that stores triple semantics

Page 12: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

SDATA

[1]

[1,2]

[1]

Key Count

mention_S_JN_tweet_S 3

mention_O_JN_tweet_S 1

SEMANTIC DATA LAYER

SemVP

StatisticsBloom

Filters

Key ID

mention_S_JN_tweet_S 1

mention_O_JN_tweet_S 2Sem

antic Filters

Semantic Vertical Partitioning (SemVP)Example

Dictionary

Key Filter

mention_S [B1]

mention_O [B2]

Subject Object

:John :Mary

:John :Mike

:Alex :Mary

Figure: SemVP representing the mention property

Page 13: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Semantic Vertical Partitioning (SemVP)Semantic Data Layer

• Semantic Filters­Used for determining triples that can are read to answer a specific query join pattern

• Statistics­Maintains statistics about the semantic filters

• Bloom Filters­Used for computing the semantic filters

Page 14: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Property ReductionExample

SUB OBJ

:John :Mary

:Alex :Mary

:John :Mike

SUB OBJ

:John :T1

:Mike :T2

:Mike :T3

:Alex :T4

:John, :Alex :John,

:Mike, :Alex

SELECT ?x ?y WHERE {?x :mention :John . ?x :tweet ?y . }

x=[:mention_S, :tweet_S]

SUB OBJ

:John :Mary

:Alex :Mary

:John :MikeBF: tweet_S

BF: mention_S

SUB OBJ

:John :T1

:Alex :T4

BGPJoin

:mention :tweetSDATA

[2]

[2]

SDATA

[1]

[1]

[1]

Original Partitions

SF (ID: 2) : tweet_S_JN_mention_S

SF (ID: 1) : mention_S_JN_tweet_S

Page 15: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Partition ReductionExample: Property Relatedness

SELECT ?X ?Y ?ZWHERE{

?X type GraduateStudent .?Z phdFrom ?Y .?X mscFrom ?Y .

}

type = [ mscFrom ]phdFrom = [mscFrom ]mscFrom = [ type, phdFrom]

• Properties are considered related if they appear in the same query join pattern

• SPARTI utilizes the co-occurrence to determine which semantic filters should be computed

Page 16: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Evolution

• Evolution relates to when semantic filters are:­Created – To reduce partitions of newly observed join patterns­Deleted – To reduce disk space

• The cost of computing additional semantic filters can be inferred from the query-workload and the SemVP statistics

Page 17: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Cost Model

• Each SemVP partition maybe associated with hundreds of semantic filters

• A cost-model is needed in order to determine the importance of a semantic filter

!"#$#"% = α ( + ß + +∞ (.)

• Where ­ S : Support (ie, frequency) of a join pattern within a query-workload­ R : The partition size­ P : Number of properties that the semantic filter includes­ α + ß + ∞ = 1

Page 18: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Experimental Setup• Computational Framework

­ Apache Spark

• Storage­ HDFS

• Datasets­ WatDiv Benchmark + Stress Workload­ 100 Million, 1 Billion

­ YAGO­ 200 M

• Systems­ S2RDF­ SPARTI­ Vertical Partitioning

Page 19: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsNumber of Rows

1.E-02

1.E+00

1.E+02

1.E+04

1.E+06

WatDiv (100M) WatDiv (1000M) YAGO2s

Num

. Row

s (C

ount

)

Datasets

Original file VP SPARTI S2RDF

Page 20: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsNumber of Files

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

WatDiv (100M) WatDiv (1000M) YAGO2s

Num

. File

s (C

ount

)

Datasets

VP SPARTI S2RDF

Page 21: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsHDFS Size

1.E+00

1.E+01

1.E+02

1.E+03

WatDiv (100M) WatDiv (1000M) YAGO2s

Stor

age

Size

(GB)

Dataset

VP SPARTI S2RDF

Page 22: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsLoad Time

0.E+00

5.E+03

1.E+04

2.E+04

2.E+04

3.E+04

3.E+04

WatDiv(100M) WatDiv (1000M) YAGO2s

Load

Tim

e (S

econ

ds)

Datasets

VP SPARTI S2RDF

0%

20%

40%

60%

80%

100%

WatDiv(100M) WatDiv(1000M) YAGO2s

Perc

enta

ge (T

ime)

Datasets

Property-Based Partitioning Partition Reduction

Page 23: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsExecution Time – WatDiv 1B

1.E+02

1.E+03

1.E+04

1.E+05

C1 C2 C3 F1 F2 F3 F4 F5 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 S6 S7

Exec

. Tim

e (m

illise

cond

s)

WatDiv Queries - Snowflake (F) , Complex (C) , Linear (L), Star (S)

SPARTI S2RDF VP

Page 24: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsExecution Time - YAGO

2.E+02

8.E+02

3.E+03

1.E+04

5.E+04

2.E+05

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15

Exec

. Tim

e (m

illise

cond

s)

YAGO Queries

SPARTI S2RDF VP

Page 25: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

ResultsExecution – WatDiv Stress Testing Workload

0

500

1000

1500

2000

2500

3000

P1 P2 P3 P4 P5

Exec

. Tim

e (m

illise

cond

s)

WatDiv Stress-Testing Workload Snapshots (P1-P5)

Snowflake Linear Star

Page 26: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Conclusion

• We presented SPARTI, a scalable RDF data management system that utilizes a relational scheme and provides row- level semantics for RDF data

• The row level semantics, represented as semantic filters, provide SPARTI with a mechanism to read a reduced set of rows when answering specific query join-patterns

• The cost-model for managing semantic filters prioritizes the creation of the important semantic filters to compute.

• The experimental study that compares SPARTI with the state-of-the-art Spark-based RDF system demonstrates that SPARTI achieves robust performance over synthetic and real datasets

Page 27: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

Questions ?

Page 28: SPARTI Scalable RDF Data Management Using Query-Centric ... · Conclusion •We presented SPARTI, a scalableRDF data management system that utilizes a relational scheme and provides

BACKUP SLIDES