the ieee international conference on big data 2013

26
The IEEE International Conference on Big Data 2013 • Arash Fard • M. Usman Nisar • Lakshmish Ramaswamy • John A. Miller • Matthew Saltz Computer Science Department University of Georgia, Athens, GA October 7, 2013 A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs

Upload: avidan

Post on 23-Feb-2016

32 views

Category:

Documents


0 download

DESCRIPTION

A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs. Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz. Computer Science Department University of Georgia, Athens, GA. October 7, 2013. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: The IEEE International Conference on Big Data 2013

The IEEE International Conference on Big Data 2013

• Arash Fard• M. Usman Nisar• Lakshmish Ramaswamy• John A. Miller• Matthew Saltz

Computer Science DepartmentUniversity of Georgia, Athens, GA

October 7, 2013

A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs

Page 2: The IEEE International Conference on Big Data 2013

2

Outline

Introduction Subgraph Pattern Matching

Types of Subgraph Pattern Matching Strict Simulation Distributed Algorithms Experiments

Page 3: The IEEE International Conference on Big Data 2013

3

Introduction Labeled Directed Graph G(V, E, l) Versatile and expressive

Social networks, web search engines, genome sequencing, etc. Data sizes are growing rapidly

Facebook: 1 billion+ users, average degree of 140 Twitter: 200 million+ users, 400 million tweets each day

Bigger datasets mean bigger graphs Pattern (Query Graph): the interesting small graph Data Graph: the massive graph contains the

information Pattern Matching: finding the instances of the pattern in

the data graph

Page 4: The IEEE International Conference on Big Data 2013

4

Subgraph Pattern Matching There are different paradigms for pattern matching:

Exact Topological Matching Sub-graph Isomorphism

Simulation Matching (meaning of the relations) Graph Simulation Dual Simulation Strong Simulation Strict Simulation SD

Bio PM

PMSD: Software DeveloperBio: BiologistPM: Product Manager

SD

Bio PM

PM

John

Ann

Sara

SDMary

BioBobAlic

e

Pattern Graph Data Graph

Endorsed byA B

Page 5: The IEEE International Conference on Big Data 2013

5

Example: Graph Simulation

PMPM

SA

DB

SA

DB

PM

SD

AIAI

DBDB AI

AI DB AI

SA

Data Graph

DB

AI

PM

PM

SA

PMSA

1

2

3

45 6

7

8

9

10

11

12

13

14

151617

18 19 20

21

22

23 24

Pattern Graph

PM

SA

DBAI

PM: Product ManagerSA: Software AnalystDB: Database DesignerAI: AI specialistSD: Software Developer

Label and Children conditions

Quadratic time complexity

Page 6: The IEEE International Conference on Big Data 2013

6

Example: Dual Simulation

PMPM

SA

DB

SA

DB

PM

SD

AIAI

DBDB AI

AI DB AI

SA

Pattern Graph

Data Graph

PM

SA

DBAI

DB

AI

PM

PM

SA

PMSA

1

2

3

45 6

7

8

9

10

11

12

13

14

151617

18 19 20

21

22

23 24

PM: Product ManagerSA: Software AnalystDB: Database DesignerAI: AI specialistSD: Software Developer

Adds Parents condition Cubic time complexity

Page 7: The IEEE International Conference on Big Data 2013

7

Strong Simulation Extends dual simulation with a locality condition A ball Gb[v, r] is a subgraph of a graph G

containing: all vertices Vb within an undirected distance r of the

vertex v all the edges between the vertices in Vb

S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology in graph pattern matching,” Proc. VLDB Endow., vol. 5, no. 4, pp. 310–321, Dec. 2011.

X86

1432

57

X86

1432

57

Page 8: The IEEE International Conference on Big Data 2013

8

Example: Strong Simulation

PMPM

SA

DB

SA

DB

PM

SD

AIAI

DBDB AI

AI DB AI

SA

Data Graph

DB

AI

PM

PM

SA

PMSA

1

2

3

45 6

7

8

9

10

11

12

13

14

151617

18 19 20

21

22

23 24

Pattern Graph

PM

SA

DBAI

PM: Product ManagerSA: Software AnalystDB: Database DesignerAI: AI specialistSD: Software Developer

Adds Locality condition Cubic time complexity

Page 9: The IEEE International Conference on Big Data 2013

9

Example: Subgraph Isomorphism

PMPM

SA

DB

SA

DB

PM

SD

AIAI

DBDB AI

AI DB AI

SA

Data Graph

DB

AI

PM

PM

SA

PMSA

1

2

3

45 6

7

8

9

10

11

12

13

14

151617

18 19 20

21

22

23 24

2 Subgraph results:• {4,6,7,8}• {5,6,7,8}

Pattern Graph

PM

SA

DBAI

PM: Product ManagerSA: Software AnalystDB: Database DesignerAI: AI specialistSD: Software Developer

Exact Topological Match NP-complete in general

case

Page 10: The IEEE International Conference on Big Data 2013

10

Strict Simulation

Strict Simulation is a smart modification of Strong Simulation Smaller balls Faster computation The same quality of results or better

Cubic time complexity

Page 11: The IEEE International Conference on Big Data 2013

11

Strict vs Strong Simulation

Page 12: The IEEE International Conference on Big Data 2013

12

Example: Strict vs Strong Simulation

dq = 2 Strong Simulation: Gb[2,2] contains all vertices Strict Simulation: Gb[2,2] contains {2,1,10,3,4} Blue and white vertics in the result of Strong

Simulation White vertices in the result of Strict Simulation

Page 13: The IEEE International Conference on Big Data 2013

13

Vertex Centric BSP Bulk Synchronous Parallel Vertex Centric

Each vertex of the graph is a computing unit Each vertex initially knows only its id, label, and out going edges Algorithm terminates when every vertex votes to halt.

GPS, a free implementation of Pregel, developed in infoLab, Stanford University.

Page 14: The IEEE International Conference on Big Data 2013

14

Distributed Algorithm for Graph Simulation

Initialization:• Pattern is received from the Master• Set the match flag to true

match flag: indicating if the vertex is a candidate member of the result match set

matchSet: set of vertices in Q that are candidate match

Vertex

match flag

matchSet

G1 True Empty

G2 True Empty

G3 True Empty

G4 True Empty

G5 True Empty

G6 True Empty

G7 True Empty

G8 True Empty

Pattern Graph

a

b c

b

c

cb

ea

f

dG1

G2

G3

G4

G5

G6

G7

G8

Q3Q2

Q1

Page 15: The IEEE International Conference on Big Data 2013

15

Distributed Algorithm for Graph Simulation

Vertex

match flag

matchSet

G1 False Empty

G2 True Q2

G3 True Q1

G4 True Q3

G5 True Q2

G6 True Q3

G7 False Empty

G8 False Empty

Pattern Graph

a

b c

b

c

cb

ea

f

dG1

G2

G3

G4

G5

G6

G7

G8

Data Graph

1st Superstep

Q3Q2

Q1

Page 16: The IEEE International Conference on Big Data 2013

16

Distributed Algorithm for Graph Simulation

Vertex

match flag

matchSet

G1 False Empty

G2 True Q2

G3 True Q1

G4 True Q3

G5 True Q2

G6 True Q3

G7 False Empty

G8 False Empty

Pattern Graph

a

b c

b

c

cb

ea

f

dG1

G2

G3

G4

G5

G6

G7

G8

Data Graph

2nd Superstep

Q3Q2

Q1

Page 17: The IEEE International Conference on Big Data 2013

17

Distributed Algorithm for Graph Simulation

Vertex

match flag

matchSet

G1 False Empty

G2 True Q2

G3 True Q1

G4 True Q3

G5 False Empty

G6 True Q3

G7 False Empty

G8 False Empty

Pattern Graph

a

b c

b

c

cb

ea

f

dG1

G2

G3

G4

G5

G6

G7

G8

Data Graph

3rd and 4th Supersteps

Q3Q2

Q1

Page 18: The IEEE International Conference on Big Data 2013

18

Other Distributed Graph Algorithms

Dual simulation In addition to storing the match sets of its children, a vertex

also stores the match sets of its parents During match set evaluation, a vertex takes both of these

into account Strong simulation

Run dual simulation to obtain match relation Rd

For each matching vertex v in Rd : Create a ball centered at v with radius dQ containing any vertex

in G Perform dual simulation on the ball

Strict simulation Run dual simulation to obtain match relation Rd

For each matching vertex v in Rd : Create a ball centered at v with radius dQ containing only

vertices in Rd

Perform dual simulation on the ball

Page 19: The IEEE International Conference on Big Data 2013

19

Experiments System

Cluster of 12 machines 128 GB RAM Two 2GHz Intel Xeon E-2620 GPS (developed in infoLab, Stanford University)

Datasets Synthesized graphs: |V|= 100e6, |E|=|V| α , α = 1.2 Real world graphs

uk-2002: |V|= 18e6, |E|=298e6 ljournal: |V|= 5e6, |E|=79e6 Amazon-2008: |V|= 7e5, |E|=5e6http://law.di.unimi.it/datasets.php

Page 20: The IEEE International Conference on Big Data 2013

20

Experiment 1: Comparison of Strong and Strict Simulation

Synthesized (|V |= 1e6, α = 1.2)

Amazon-2008 (|V |= 7e5, |V| = 5e6)

Amazon-2008 (|V |= 7e5, |Vq| = 20)

Amazon-2008 (|V |= 7e5, |Vq| = 20)

Page 21: The IEEE International Conference on Big Data 2013

21

Experiment 2: Running time

uk-2002-hc (|V |= 18e6, |E|=298e6)

(l = 200 & |Vq| = 20)

Synthesized (|V |= 1e8, α = 1.2,|E|=|V| α )

Quality of result:Graph Simulation < Dual Simulation < Strict Simulation

Page 22: The IEEE International Conference on Big Data 2013

22

Experiment 3: Impact of dataset

Synthesized (|V |= 100e6)

(l = 200 & |Vq| = 20)

Synthesized (α = 1.2, |E|=|V| α )

Page 23: The IEEE International Conference on Big Data 2013

23

Experiment 4: Impact of pattern

uk-2002-hc (|V |= 18e6, |E|=298e6)

Synthesized (|V |= 100e6, α = 1.2 , |E|=|V| α )

Page 24: The IEEE International Conference on Big Data 2013

24

Related work Pattern matching models

P-homomorphism Graph Simulation, Dual and Strong Bounded simulation

Distributed algorithms for pattern matching S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed

graph pattern matching,” WWW 2012. Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li,

“Efficient subgraph matching on billion node graphs,”VLDB 2012.

Page 25: The IEEE International Conference on Big Data 2013

25

Introducing a new Pattern Matching model

Design, implementation, and test of 4 distributed algorithms for Pattern Matching Graph Simulation Dual Simulation Strong Simulation Strict Simulation

Conclusion

Page 26: The IEEE International Conference on Big Data 2013

26

Questions?

Thanks