Page 1: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Sherif Sakr (NICTA & UNSW, Sydney, Australia)
Sameh Elnikety (Microsoft Research, Redmond, WA)
Yuxiong He (Microsoft Research, Redmond, WA)

CIKM 2012

Page 2: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Example 1: Social Network

[Figure: social network graph, shown first with person nodes only and then with photo nodes added. Person nodes: Bob, Hillary, Alice, Chris, David, France, Ed, George. Photo nodes: Photo1 to Photo8.]

Page 3: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

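Example 2: Bibliographical Network

[Figure: bibliographical network graph with attributed nodes and edges]

Nodes and their attributes:
– John (age: 45)
– Alice (age: 42, location: Sydney)
– Smith (age: 28, office: 518)
– Paper 1 (keyword: graph)
– Paper 2 (keyword: XML, type: Demo)
– UNSW (country: Australia, established: 1949)
– Microsoft (country: USA, established: 1975)
– VLDB'12 (location: Istanbul)

Edge labels and attributes shown: citedBy; title: Professor; title: Senior Researcher; order: 1 and order: 2 (twice each); month: 1 and month: 3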

Page 4: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Contributions

1. G-SPARQL language
   – Pattern matching
   – Reachability
2. Hybrid execution engine
   – Graph topology in main memory
   – Graph data in relational database
3. Algebraic transformation
   – Operators
   – Optimizations
4. Experimental evaluation

Page 5: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

1. G-SPARQL Query Language

• Extends a subset of SPARQL
  – Based on triple pattern: (subject, predicate, object)
• Sub-graph matching patterns on
  – Graph structure
  – Node attribute
  – Edge attribute
• Reachability patterns on
  – Path
  – Shortest path

[Figure: a triple pattern drawn as a subject node connected to an object node]

Page 6: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

G-SPARQL Syntax

Page 7: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

G-SPARQL Pattern Matching

• Node attribute
  – ?Person @officeNumber "518"
• Edge attribute
  – ?E @Role "Programmer"
• Structural
  – ?Person worksAt Microsoft
  – ?Person ?E(worksAt) Microsoft

[Figure: node Alice (officeNumber = 518) connected to node Microsoft by an edge with Role = Programmer]

Page 8: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

G-SPARQL Reachability

• Path
  – Subject ??PathVar Object
• Shortest path
  – Subject ?*PathVar Object
• Path filters
  – Path length
  – All edges
  – All nodes
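The pattern forms on the last two slides can be told apart by the shape of the predicate alone. The sketch below is one possible way to do that and is not the parser used by the authors; the function names `split_triple` and `classify_all` and the kind labels are assumptions made for illustration.

```python
import re

# Hypothetical helper names; the kind labels are illustrative, not from the talk.
EDGE_VAR = re.compile(r"^(\?\w+)\((\w+)\)$")  # matches e.g. ?E(worksAt)


def split_triple(triple):
    """Split 'subject predicate object' and drop a trailing '.' and quotes."""
    subject, predicate, obj = triple.strip().rstrip(".").split(None, 2)
    return subject, predicate, obj.strip('"')


def classify_all(triples):
    """Label each triple pattern by the shape of its predicate."""
    parsed = [split_triple(t) for t in triples]

    # Pass 1: a variable written as ?E(label) in predicate position names an edge.
    edge_vars = set()
    for _, predicate, _ in parsed:
        match = EDGE_VAR.match(predicate)
        if match:
            edge_vars.add(match.group(1))

    # Pass 2: classify, using the edge variables to split the attribute patterns.
    result = []
    for subject, predicate, obj in parsed:
        if predicate.startswith("??"):
            kind = "path"
        elif predicate.startswith("?*"):
            kind = "shortest_path"
        elif predicate.startswith("@"):
            kind = "edge_attribute" if subject in edge_vars else "node_attribute"
        else:
            kind = "structural"
        result.append((kind, subject, predicate, obj))
    return result


patterns = [
    '?Person @officeNumber "518"',
    '?E @Role "Programmer"',
    "?Person ?E(worksAt) Microsoft",
    "?X ??P ?Y.",
    "?X ?*P ?Y",
]
for row in classify_all(patterns):
    print(row)
```

Two passes are needed because a node attribute pattern and an edge attribute pattern look identical in isolation; only the `?E(worksAt)` pattern reveals that `?E` is bound to an edge.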

Page 9: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Example: G-SPARQL Query

SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @Label ?L1.
  ?Y @Label ?L2.
  ?X @Age ?Age1.
  ?Y @Age ?Age2.
  ?X Affiliated UNSW.
  ?Y ?E(Affiliated) Microsoft.
  ?X LivesIn Sydney.
  ?E @Title "Researcher".
  FILTER(?Age1 >= 40).
  FILTER(?Age2 >= 40).
  FILTERPATH( Length( ??P, <= 3) ).
}

Page 10: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Outline

1. G-SPARQL language
   – Pattern matching
   – Reachability
2. Hybrid execution engine
   – Graph topology in main memory
   – Graph data in relational database
3. Algebraic transformation
   – Operators
   – Optimizations
4. Experimental evaluation

Page 11: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

2. Hybrid Execution Engine

• Reachability queries
  – Main memory algorithms
  – Example: BFS and Dijkstra's algorithm
• Pattern matching queries
  – Relational database
  – Indexing
    » Example: B-tree
  – Query optimizations
    » Example: selectivity estimation and join ordering
  – Recursive queries
    » Not efficient: large intermediate results and multiple joins

[Figure: the social network graph from Example 1]
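To make the reachability side concrete, here is a minimal sketch of a breadth-first search over an in-memory adjacency list, with an optional bound on path length in the spirit of the path-length filter. It is an illustration only (the talk's main-memory algorithms are implemented in C++); the function name `shortest_path`, the `max_len` parameter, and the adjacency dictionary are assumptions.

```python
from collections import deque


def shortest_path(adj, source, target, max_len=None):
    """Return a fewest-edges path from source to target as a node list, or None.

    adj maps a node to the list of nodes it points to (directed edges).
    max_len, if given, bounds the number of edges on the path.
    """
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if max_len is not None and depth[node] >= max_len:
            continue  # do not expand beyond the length bound
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None


# Topology only: no attributes are needed to answer reachability.
topology = {
    "John": ["Paper 2", "Microsoft", "Alice"],
    "Alice": ["Paper 2", "Paper 1", "UNSW", "Smith"],
    "Smith": ["Paper 1", "UNSW"],
    "Paper 1": ["VLDB'12", "Paper 2"],
    "Paper 2": ["VLDB'12"],
}
print(shortest_path(topology, "John", "Smith"))
print(shortest_path(topology, "John", "Smith", max_len=1))
```

The search touches only node identifiers and edges, which is why keeping just the topology in main memory is enough for this class of query.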

Page 12: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Graph Representation

Node attribute tables (ID is the node ID):

Node Label
ID  Value
1   John
2   Paper 2
3   Alice
4   Microsoft
5   VLDB'12
6   Paper 1
7   UNSW
8   Smith

age
ID  Value
1   45
3   42
8   28

office
ID  Value
8   518

location
ID  Value
3   Sydney
5   Istanbul

keyword
ID  Value
2   XML
6   graph

type
ID  Value
2   Demo

country
ID  Value
4   USA
7   Australia

established
ID  Value
4   1975
7   1949

Edge tables (eID is the edge ID; sID and dID are the source and destination node IDs):

authorOf
eID  sID  dID
1    1    2
5    3    2
6    3    6
11   8    6

affiliated
eID  sID  dID
3    1    4
8    3    7
12   8    7

published
eID  sID  dID
4    2    5
10   6    5

citedBy
eID  sID  dID
9    6    2

supervise
eID  sID  dID
7    3    8

know
eID  sID  dID
2    1    3

Edge attribute tables (ID is the edge ID):

title
ID  Value
3   Senior Researcher
8   Professor

order
ID  Value
1   2
5   1
6   2
11  1

month
ID  Value
4   3
10  1
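As a rough illustration of how a pattern matching query runs against this layout, the sketch below loads a few of the tables above into SQLite and answers "who is affiliated with a given organization, is at least a given age, and with what title" using joins. SQLite stands in for the relational engine purely for convenience (the talk uses IBM DB2), and the SQL table names, column names, and the function `affiliated_with` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node_label (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE age        (id INTEGER PRIMARY KEY, value INTEGER);
CREATE TABLE affiliated (eid INTEGER PRIMARY KEY, sid INTEGER, did INTEGER);
CREATE TABLE title      (id INTEGER PRIMARY KEY, value TEXT);
""")
# Rows copied from the tables on this slide.
conn.executemany("INSERT INTO node_label VALUES (?, ?)",
                 [(1, "John"), (3, "Alice"), (4, "Microsoft"), (7, "UNSW"), (8, "Smith")])
conn.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (3, 42), (8, 28)])
conn.executemany("INSERT INTO affiliated VALUES (?, ?, ?)", [(3, 1, 4), (8, 3, 7), (12, 8, 7)])
conn.executemany("INSERT INTO title VALUES (?, ?)", [(3, "Senior Researcher"), (8, "Professor")])


def affiliated_with(conn, org_name, min_age):
    """People affiliated with org_name, aged at least min_age, with the edge title."""
    sql = """
        SELECT person.value, title.value
        FROM affiliated AS e
        JOIN node_label AS person ON person.id = e.sid   -- node label of the source
        JOIN node_label AS org    ON org.id = e.did      -- node label of the destination
        JOIN age                  ON age.id = e.sid      -- node attribute
        JOIN title                ON title.id = e.eid    -- edge attribute
        WHERE org.value = ? AND age.value >= ?
        ORDER BY person.value
    """
    return conn.execute(sql, (org_name, min_age)).fetchall()


print(affiliated_with(conn, "Microsoft", 40))
print(affiliated_with(conn, "UNSW", 40))
```

Each attribute mentioned in the query becomes one join against a narrow two-column table, which is the kind of work a relational optimizer (indexes, join ordering) handles well.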

Page 13: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Hybrid Execution Engine: interfaces

[Figure: a G-SPARQL query enters the engine, which issues SQL commands and traversal operations; the social network graph from Example 1 is shown as the data]

Page 14: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

3. Intermediate Language & Compilation

[Figure: compilation pipeline. G-SPARQL query -> Step 1: front-end compilation -> algebraic query plan -> Step 2: back-end compilation -> physical execution plan (SQL commands and traversal operations); the social network graph from Example 1 is shown as the data]

Page 15: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Intermediate Language

• Objective
  – Generate query plan and chop it
    » Reachability part -> main-memory algorithms on topology
    » Pattern matching part -> relational database
  – Optimizations
• Features
  – Independent of execution engine and graph representation
  – Algebraic query plan

Page 16: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

G-SPARQL Algebra

• Variant of "Tuple Algebra"
• Algebra details
  – Data: tuples
    » Sets of nodes, edges, paths
  – Operators
    » Relational: select, project, join
    » Graph specific: node and edge attributes, adjacency
    » Path operators

Page 17: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Relational

Page 18: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Relational

NOT Relational

Page 19: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Front-end Compilation (Step 1)

• Input
  – G-SPARQL query
• Output
  – Algebraic query plan
• Technique
  – Map
    » From triple patterns
    » To G-SPARQL operators
  – Use inference rules

Page 20: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Front-end Compilation: Inference Rules

Page 21: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Front-end Compilation: Optimizations

• Objective
  – Delay execution of traversal operations
• Technique
  – Order triple patterns, based on restrictiveness
• Heuristics
  – Triple pattern P1 is more restrictive than P2 if:
    1. P1 has fewer path variables than P2
    2. P1 has fewer variables than P2
    3. P1's variables have more filter statements than P2's variables
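The three heuristics can be read as a sort key. The sketch below orders triple patterns from most to least restrictive so that path patterns, which need traversal, end up last. Applying the heuristics as successive tie-breakers is an assumption of this sketch (the slide does not say how they combine), as are the function names and the `filter_counts` mapping.

```python
def variables_of(pattern):
    """Variables in a (subject, predicate, object) pattern; ?E(label) counts as ?E."""
    return [term.split("(")[0] for term in pattern if term.startswith("?")]


def is_path_var(term):
    """Path variables are written ??P (any path) or ?*P (shortest path)."""
    return term.startswith("??") or term.startswith("?*")


def restrictiveness_key(pattern, filter_counts):
    """Smaller key means more restrictive.

    Assumption: heuristics 1, 2, 3 are applied in that order as tie-breakers.
    filter_counts maps a variable to the number of FILTER statements on it.
    """
    variables = variables_of(pattern)
    n_path = sum(is_path_var(v) for v in variables)
    n_filters = sum(filter_counts.get(v, 0) for v in variables)
    return (n_path, len(variables), -n_filters)


def order_patterns(patterns, filter_counts):
    """Sort patterns so the most restrictive ones are evaluated first."""
    return sorted(patterns, key=lambda p: restrictiveness_key(p, filter_counts))


query_patterns = [
    ("?X", "??P", "?Y"),
    ("?X", "@Label", "?L1"),
    ("?X", "@Age", "?Age1"),
    ("?X", "Affiliated", "UNSW"),
    ("?Y", "?E(Affiliated)", "Microsoft"),
]
filters = {"?Age1": 1, "?Age2": 1}  # from FILTER(?Age1 >= 40) and FILTER(?Age2 >= 40)

for pattern in order_patterns(query_patterns, filters):
    print(pattern)
```

With this ordering the path pattern `?X ??P ?Y` is evaluated only after the cheaper patterns have narrowed the candidate bindings for `?X` and `?Y`, which matches the stated objective of delaying traversal.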

Page 22: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Back-end Compilation (Step 2)

• Input
  – G-SPARQL algebraic plan
• Output
  – SQL commands
  – Traversal operations
• Technique
  – Substitute G-SPARQL relational operators with SPJ
  – Traverse
    » Bottom up
    » Stop when reaching root or reaching non-relational operator
    » Transform relational algebra to SQL commands
  – Send non-relational commands to main memory algorithms
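The chopping step can be sketched as a bottom-up walk over the plan tree: a subtree made only of relational operators is emitted as one relational fragment, and anything else is emitted as a traversal operation. This is a simplified illustration, not the authors' compiler. The `Op` class, the operator names (`scan`, `shortest_path`), and the bracketed text standing in for generated SQL are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical operator names; the real algebra has its own operator set.
RELATIONAL = {"scan", "select", "project", "join"}


@dataclass
class Op:
    name: str
    children: list = field(default_factory=list)
    arg: str = ""


def is_relational(op):
    """True if op and everything below it is a relational operator."""
    return op.name in RELATIONAL and all(is_relational(c) for c in op.children)


def render(op):
    """Placeholder for SQL generation: print the fragment as an algebra expression."""
    head = f"{op.name}[{op.arg}]"
    if not op.children:
        return head
    return head + "(" + ", ".join(render(c) for c in op.children) + ")"


def chop(op, steps):
    """Append ('sql', fragment) or ('traversal', operator) steps in bottom-up order."""
    if is_relational(op):
        steps.append(("sql", render(op)))  # whole subtree goes to the database
        return
    for child in op.children:
        chop(child, steps)
    kind = "sql" if op.name in RELATIONAL else "traversal"
    steps.append((kind, op.name))


plan = Op("shortest_path", [
    Op("join", [
        Op("select", [Op("scan", arg="age")], arg="value >= 40"),
        Op("scan", arg="affiliated"),
    ], arg="id = sid"),
])
steps = []
chop(plan, steps)
for step in steps:
    print(step)
```

In this example the join and everything under it form a single fragment for the database, and only the path operator at the root is handed to the main-memory side.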

Page 23: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Back-end Compilation: Optimizations

• Optimize a fragment of query plan
  – Before generating SQL command
• All operators are Select/Project/Join
• Apply standard techniques
  – For example pushing selection
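Pushing selection means moving a filter below a join so that fewer rows reach the join. The sketch below shows the rewrite on a toy Select/Join/Scan tree. It is a textbook-style illustration, not the authors' optimizer; the class names, the single-column predicate representation, and the assumption that column names are unique across tables are all simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: frozenset


@dataclass(frozen=True)
class Select:
    column: str     # the one column the predicate reads (simplifying assumption)
    predicate: str  # e.g. "age >= 40", kept as text
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def columns(plan):
    """Columns produced by a plan node."""
    if isinstance(plan, Scan):
        return plan.columns
    if isinstance(plan, Select):
        return columns(plan.child)
    return columns(plan.left) | columns(plan.right)


def push_selection(plan):
    """Move each Select below a Join when one join input supplies its column."""
    if isinstance(plan, Select):
        child = push_selection(plan.child)
        if isinstance(child, Join):
            if plan.column in columns(child.left):
                pushed = push_selection(Select(plan.column, plan.predicate, child.left))
                return Join(pushed, child.right)
            if plan.column in columns(child.right):
                pushed = push_selection(Select(plan.column, plan.predicate, child.right))
                return Join(child.left, pushed)
        return Select(plan.column, plan.predicate, child)
    if isinstance(plan, Join):
        return Join(push_selection(plan.left), push_selection(plan.right))
    return plan


person = Scan("person", frozenset({"pid", "age"}))
affiliated = Scan("affiliated", frozenset({"sid", "did"}))
before = Select("age", "age >= 40", Join(person, affiliated))
after = push_selection(before)
print(after)
```

After the rewrite the age filter sits directly on the `person` scan, so the join only sees people who already satisfy it.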

Page 24: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Example: G-SPARQL Query

SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @label ?L1.
  ?Y @label ?L2.
  ?X @age ?Age1.
  ?Y @age ?Age2.
  ?X affiliated UNSW.
  ?Y ?E(affiliated) Microsoft.
  ?X livesIn Sydney.
  ?E @title "Researcher".
  FILTER(?Age1 >= 40).
  FILTER(?Age2 >= 40).
}

Page 25: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Example: Query Plan

Page 26: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

4. Experimental Evaluation

• Objective
  – Show that the hybrid approach is a good idea
  – Good performance from DBMS and main memory topology
• Data sets
  – Real ACM bibliographic network
  – Synthetic graphs
    » See technical report

Page 27: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Experimental Environment

• Workload
  – Created Q1 … Q12
• Process
  – Compare to Neo4j (non-optimized, optimized)
• Environment
  – Implementation
    » Main memory algorithms in C++
    » IBM DB2
  – PC Server

Page 28: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Results on Real Dataset

Page 29: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Response time on ACM Bibliographic Network

Page 30: G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Conclusions

• G-SPARQL language
  – Expresses pattern matching and reachability queries on attributed graphs
• Hybrid engine
  – Graph topology in main memory
  – Graph data in database
• Compilation into algebraic plan
  – Operators and optimizations
• Evaluation
  – Real and synthetic datasets
  – Good performance
    » Leveraging database engine and main memory topology