techniques for graph analytics on big data
DESCRIPTION
Techniques for Graph Analytics on Big Data. Usman Nisar , Arash Fard , John A. Miller Computer Science Department University of Georgia, Athens, GA. Outline. Introduction Subgraph Pattern Matching Types of Subgraph Pattern Matching Models of Computation Distributed Algorithms - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/1.jpg)
Techniques for Graph Analytics on Big Data
Usman Nisar, Arash Fard, John A. Miller
Computer Science DepartmentUniversity of Georgia, Athens, GA
![Page 2: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/2.jpg)
Outline
Introduction Subgraph Pattern Matching
Types of Subgraph Pattern Matching Models of Computation Distributed Algorithms
Performance Evaluation Graph Partitioning
Results Analysis Conclusion
![Page 3: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/3.jpg)
Introduction
Directed graph G(V, E, l) V is a set of vertices E ⊆ V x V is a set of directed edges l: V → 𝓝 is a function mapping vertices to labels
Versatility and expressivity Social networks, web search engines, genome sequencing,
etc. Data sizes are growing rapidly
Facebook: 1 billion + users, average degree of 140 Twitter: 200 million + users, 400 million tweets each day
Bigger datasets mean bigger graphs
![Page 4: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/4.jpg)
Subgraph Pattern Matching
A graph G’(V’, E’, l’) is a subgraph of G(V, E, l) if:1. V’ ⊆ V2. E’ ⊆ V’ x V’ ⊆ E3. ∀v ∈ V’ , l’(v) = l(v)
A pattern (or query) is a graph that we want to find in a bigger graph
Subgraph pattern matching: Given a query graph, find all subgraphs of another graph (the data graph) that are similar to the query based on certain criteria
![Page 5: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/5.jpg)
Types of Subgraph Pattern Matching
Exact Subgraph Isomorphism
Heuristic Graph Simulation Dual Simulation Strong Simulation
![Page 6: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/6.jpg)
Subgraph Isomorphism
Exact matching from the pattern to data graph
Labels must be the same Ullmann’s algorithm, VF2 NP-hard problem in general case Current solutions not practical for large
graphs
![Page 7: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/7.jpg)
Example: Subgraph Isomorphism
Matches: {4, 8, 6, 7}, {5, 8, 6, 7}
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 8: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/8.jpg)
Graph Simulation
A vertex in a data graph G matches a vertex in query graph Q via graph simulation iff1. both have the same label2. a subset of its children match all the
children of its corresponding vertex in the query graph
HHK algorithm[1]
[1] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995, pp. 453–462.
![Page 9: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/9.jpg)
Graph Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 10: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/10.jpg)
Dual Simulation
Adds a duality condition to graph simulation A vertex in a data graph becomes a match with a vertex in
the query graph iff1. Both have the same label2. A subset of its children matches all the children of its
corresponding vertex in the data graph3. A subset of its parents matches all the parents of its
corresponding vertex in the data graph
O((|Vq| + |Eq|) (|V| + |E|)) More restrictive than graph simulation
![Page 11: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/11.jpg)
Example: Graph Dual Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 12: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/12.jpg)
Example: Graph Dual Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 13: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/13.jpg)
Strong Simulation
Extends dual simulation with a locality condition
A ball G[v, r] is a subset of a graph G containing: all vertices VB within an undirected distance r
of the vertex v all the edges between the vertices in VB
X86
1432
57
![Page 14: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/14.jpg)
Strong Simulation
Extends dual simulation with a locality condition
A ball G[v, r] is a subset of a graph G containing: all vertices VB within an undirected distance r
of the vertex v all the edges between the vertices in VB
X86
1432
57
X86
1432
57
![Page 15: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/15.jpg)
Strong Simulation
Extends dual simulation with a locality condition A ball G[v, r] is a subset of a graph G containing:
all vertices VB within an undirected distance r of the vertex v all the edges between the vertices in VB
O(|V|(|V| + (|Vq| + |Eq|) (|V| + |E|)) Query graph Q(Vq, Eq, l) matches data graph G(V, E, l) via
strong simulation if there exists a vertex v ∈ V s.t.1. Q matches G[v, dq] via dual simulation with match relation Rb 2. v is contained in Rb
![Page 16: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/16.jpg)
Example: Dual Strong Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 17: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/17.jpg)
Example: Dual Strong Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 18: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/18.jpg)
Example: Dual Strong Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 19: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/19.jpg)
Example: Dual Strong Simulation
PM
SA
SD
SA
PM
PM
DB
AIDB
AI
DB AI
DB
SA
PM
SA
PM
AI
DB
AIDB
SAPM
AI1
2
3
4
5
6
7
89
10
11
12
13 14
15
16
17
18
19 20
21
22
23
24
PM
SA DB AI
Query
Data Graph
PM: Product ManagerSD: Software DeveloperSA: System AnalystDB: Database DesignerAI: AI Specialist
![Page 20: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/20.jpg)
Models of Computation
MapReduce • Useful paradigm for large
batch processing of datasets
• Not ideal for many graph algorithms
Bulk Synchronous Parallel (BSP)• Computation performed in
a series of supersteps • Interleaved with
communication/synchronization phases
Vertex-Centric (VC) BSP• Each vertex treated as a
processing unit• Vertices can communicate
with each other to obtain information
Message Passing• Different threads
communicate with each other via messages
• Programmer has more control over communication and synchronization
![Page 21: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/21.jpg)
Distributed Graph Simulation
Superstep 1: If there is any query vertex with the same label: - Set match to true - Make a match set of potential vertices in the query - Ask children about their statusOtherwise: Vote to halt
Superstep 2: If match is true: Reply back to parent with label Otherwise: Vote to halt
Superstep 3: If match is true: - Evaluate match set based on children’s responses - If there are removals from the match set: - Inform parents - Set match flag accordingly - Otherwise: Vote to haltOtherwise: Vote to halt
Superstep 4: If there is any incoming removal message: - Evaluate match set based on children’s responses - If there are removals from the match set: - Inform parents - Set match flag accordingly - Otherwise: Vote to haltOtherwise: Vote to halt
![Page 22: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/22.jpg)
Distributed Graph Simulation
![Page 23: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/23.jpg)
Other Distributed Graph Algorithms
Dual simulation Similar to graph simulation In addition to storing the match sets of its children, a
vertex also stores the match sets of its parents During match set evaluation, a vertex takes both of these
into account Strong simulation (optimized)
First run dual simulation to obtain match relation R For each matching vertex v in R:
Create a ball centered at v with radius dq containing only vertices in R
Perform dual simulation on the ball
![Page 24: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/24.jpg)
Performance Evaluation
Experimental setup Cluster with 12 machines Each with two 2Ghz Intel Xeon E5-2620
CPUs, each with six cores Ethernet: 1 Gb/s Set up HDFS/GPS on all of the machines
Dataset |V| |E| |l|Synthesize
d100 M 4 B 200
uk-2005 39 M 940 M 200enwiki-2005
4.2 M 101 M 200
![Page 25: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/25.jpg)
Runtime Evaluation
1 2 4 8 1216202428323640440200400600800
100012001400
Graph Simula-tionDual Simula-tion
# Workers
Run
ning
Tim
e (s
ecs)
Synthesized, |V|= 108
![Page 26: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/26.jpg)
Runtime Evaluation
uk-2005, |V|= 3.9 x 107
1 2 4 8 121620242832364044050
100150200250300350
Graph Simu-lationDual Simula-tion
# Workers
Run
ning
Tim
e (s
ecs)
![Page 27: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/27.jpg)
1 2 4 8 12 16 2005
1015202530354045
Graph Simu-lation
# Workers
Run
ning
Tim
e (s
ecs)
Runtime Evaluation
enwiki-2013, |V|= 4.2 x 106
![Page 28: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/28.jpg)
1 2 4 8 12 16 20 24 28 32 36 40 4405
101520253035
Graph Simu-lation
# Workers
Spee
d-U
p
Speedup Evaluation
Synthesized, |V|= 108
![Page 29: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/29.jpg)
Speedup Evaluation
uk-2005, |V|= 3.9 x 107
1 2 4 8 1216202428323640440
5
10
15
20
25 Graph Simula-tionDual Simula-tion
# Workers
Spee
d-U
p
![Page 30: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/30.jpg)
1 2 4 8 12 16 200123456
Graph Simula-tion
# Workers
Spee
d-U
p
Speedup Evaluation
enwiki-2013, |V|= 4.2 x 106
![Page 31: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/31.jpg)
Conclusions
The three algorithms exhibit excellent scalable behavior in terms of speedup and efficiency as we increase the number of workers
Distributed implementations (GPS): Graph simulation Dual simulation Strong simulation
Ongoing work: Distributed implementations with message passing in Akka Our own sequential/distributed isomorphism algorithm
![Page 32: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/32.jpg)
Questions?
![Page 33: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/33.jpg)
Thanks
![Page 34: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/34.jpg)
Depth-First Ball
Algorithm works in a depth-first fashion. Message is generated at the center which is then propagated through the system for ballSize supersteps.
The approach results in an exponential number of messages that slows down the whole system and renders the approach impractical
![Page 35: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/35.jpg)
Breadth-First Ball
Works on a simple ping-reply model. Center vertex starts of by sending a ping message to all of
its adjacent nodes in the first superstep. In the second superstep, all the recipient nodes reply back
with their label and the ids of their children and parents. Center vertex upon receiving this information in the third
superstep, saves the returned labels and then ping the boundary nodes. This process is repeated till we have a ball of size dq.
![Page 36: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/36.jpg)
Breadth-First Ball
X86
1432
57
X8
6
14
32
57
p
p
r
r
X8
6
14
32
57
pX
86
1
43
2
57
r
(a) Ball:{1,7}, Labels: { } (b) Ball: {1,7,2,3,4}, Labels: {1,7}
(c) Ball:{1,7,2,3,4}, Labels:{1,7} (d) Ball:{1,7,2,3,4}, Labels:{1,7,2,3,4}
Ping:Reply:
pr
![Page 37: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/37.jpg)
Efficiency
Synthesized, |V|=108 uk-2005, |V|=3.9x107 enwiki-2013, |V|=4.2x106
1 4 12 20 28 36 440
20406080
100120
Graph Simula-tion
# Workers
Effici
ency
1 2 4 8 12 16 20020406080
100120
Graph Simula-tion
# Workers
Effici
ency
1 4 12 20 28 36 440
20406080
100120 Graph
Simula-tion
# Workers
Effici
ency
• Calculated as Efficiencyk = Speedupk/k
![Page 38: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/38.jpg)
Graph Partitioning
By default, the data graph is partitioned in a round-robin fashion among workers
The goals of min-cut partitioning are two-fold: to create well-balanced partitions to reduce the inter-partition edges
Number of other algorithms have been shown to use min-cut graph partitioning successfully for speed-ups
METIS1 – graph partitioning tool Written in C Takes an undirected graph and outputs the partitions
1 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
![Page 39: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/39.jpg)
Performance Evaluation
Experimental Setup Cluster with 5 machines Each with two 2Ghz Intel Xeon E5-2620 CPU, each with
six cores Ethernet: 1 Gb/sec
Setup HDFS/GPS on all the machines Datasets:
Labels (l) = 200Datasets |V| |E|Synthesized 107 251 x 106
uk-2002 1.8 x 107 298 x 106
![Page 40: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/40.jpg)
Results - Runtime
5 20 60 100
05
1015202530 Min-cut Round-robin
Query Size
Run
ning
Tim
e (s
ecs)
5 20 60 100
05
10152025303540
Min-cut Round-robin
Query Size
Run
ning
Tim
e (s
ecs)
5 20 60 100
010203040506070
Min-cut Round-robin
Query size
Run
ning
Tim
e (s
ecs)
Synthesized Dataset, |V|=107, = 1.2
Graph Simulation Dual Simulation Strict Simulation
![Page 41: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/41.jpg)
Results - Runtime
uk-2002, |V|=1.8x107
5 20 60 100
0
10
20
30
40 Min-cut Round-robin
Query Size
Run
ning
Tim
e (s
ecs)
5 20 60 100
0102030405060 Min-cut
Round-robin
Query Size
Run
ning
Tim
e (s
ecs)
5 20 60 100
020406080
100120140160 Min-cut Round-robin
Query Size
Run
ning
Tim
e (s
ecs)
Graph Simulation Dual Simulation Strict Simulation
![Page 42: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/42.jpg)
Results – Network I/O
Synthesized Dataset, |V|=107,
Graph Simulation Dual Simulation Strict Simulation
5 20 60 100
0.0E+001.0E+052.0E+053.0E+054.0E+055.0E+056.0E+057.0E+05
Min-cutRound-robin
Query size
Tota
l Net
wor
k I/O
(K
B)
5 20 60 100
0.0E+002.0E+054.0E+056.0E+058.0E+051.0E+061.2E+061.4E+061.6E+06
Min-cut Query Size
Tota
l Net
wor
k I/O
(K
B)
5 20 60 100
0.0E+002.0E+054.0E+056.0E+058.0E+051.0E+061.2E+061.4E+06
Min-cutRound-robin Query size
Tota
l Net
wor
k I/O
(K
B)
![Page 43: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/43.jpg)
Results – Network I/O
uk-2002, |V|=1.8x107
Graph Simulation Dual Simulation Strict Simulation
5 20 60 100
1.0E+021.0E+031.0E+041.0E+051.0E+061.0E+07
Min-cut Query Size
Tota
l Net
wor
k I/O
(K
B)
5 20 60 100
1.0E+03
1.0E+04
1.0E+05
1.0E+06
1.0E+07
Min-cutRound-robin
Query Size
Tota
l Net
wor
k I/O
(K
B)
5 20 60 100
1.0E+03
1.0E+04
1.0E+05
1.0E+06
1.0E+07
Min-cutRound-robin
Query Size
Tota
l Net
wor
k I/O
(K
B)
![Page 44: Techniques for Graph Analytics on Big Data](https://reader037.vdocuments.site/reader037/viewer/2022110111/5681690c550346895de02108/html5/thumbnails/44.jpg)
Planned Pattern Matching
ImplementationsSequent
ialDistribute
d (GPS)Distributed (Akka
VC)
Distributed
(Akka MP)Graph Simulation
0.581 s 2.924 s ---------- ----------
Dual Simulation
0.642 s 3.403 s ---------- ----------
Strong Simulation
---------- 8.453 s ---------- ----------
Subgraph Iso. (ours)
2.837 s ---------- ---------- ----------
Subgraph Iso. (VF2)
---------- ---------- ---------- ----------