a framework for processing large graphs in shared memory · graph processing systems •existing:...
TRANSCRIPT
A Framework for Processing Large Graphs in Shared Memory
Julian Shun
Based on joint work with Guy Blelloch and Laxman Dhulipala
What are graphs?
• Can contain up to billions of vertices and edges• Need simple, efficient, and scalable ways to analyze them
2
EdgeVertex Vertex
Graph Data is Everywhere!
Efficient Graph Processing• Use parallelism
• Design efficient algorithms
• Write/optimize code for each application• Build a general framework
3
Breadth-first searchBetweenness centralityConnected components…
Single-source shortest pathsEccentricity estimation(Personalized) PageRank…
Ligra Graph Processing Framework4
EdgeMap VertexMap
Breadth-first searchBetweenness centralityConnected componentsTriangle countingK-core decompositionMaximal independent setSet cover
Single-source shortest pathsEccentricity estimation(Personalized) PageRankLocal graph clusteringBiconnected componentsCollaborative filtering…
Simplicity, Performance, Scalability
Graph Processing Systems
• Existing: Pregel/Giraph/GPS, GraphLab, Pegasus, Knowledge Discovery Toolbox, GraphChi, etc.
• Our system: Ligra - Lightweight graph processing system for shared memory
5
Takes advantage of “frontier-based” nature of many algorithms
(active set is dynamic and often small)
Breadth-first Search (BFS)• Compute a BFS tree rooted at source r containing all vertices reachable from r
6
• Can process each frontier in parallel• Race conditions, load balancing
ApplicationsBetweenness centralityEccentricity estimationMaximum flowWeb crawlersNetwork broadcastingCycle detection…
Steps for Graph Traversal• Operate on a subset of vertices• Map computation over subset of edges in parallel• Return new subset of vertices• Map computation over subset of vertices in parallel
7
}
Graph
VertexSubset
EdgeMap
VertexMap
Think with flat data-parallel operators
We built the Ligra abstraction for these kinds of computations
Abstraction enables optimizations(hybrid traversal and graph compression)
Many graph traversal algorithms do this!
Breadth-first Search in Ligraparents = {-1, …, -1}; //-1 indicates “unexplored”
procedure UPDATE(s, d):return compare_and_swap(parents[d], -1, s);
procedure COND(v):return parents[v] == -1; //checks if “unexplored”
procedure BFS(G, r):parents[r] = r;
frontier = {r}; //VertexSubsetwhile (size(frontier) > 0):
frontier = EDGEMAP(G, frontier, UPDATE, COND);
8
frontier
TT TTFfrontier
Actual BFS code in Ligra9
Sparse or Dense EdgeMap?10
11
10
9
12
13
15
14
1
4
3
2
5
8
7
6
Frontier• Dense method better when
frontier is large and many vertices have been visited
• Sparse (traditional) method better for small frontiers
• Switch between the two methods based on frontier size [Beamer et al. SC ’12]
11
10
9
12
Limited to BFS?
EdgeMap11
Loop through outgoing edges of frontier vertices in parallel
procedure EDGEMAP(G, frontier, Update, Cond):if (size(frontier) + sum of out-degrees > threshold) then:
return EDGEMAP_DENSE(G, frontier, Update, Cond);else:
return EDGEMAP_SPARSE(G, frontier, Update, Cond);
Loop through incoming edges of “unexplored” vertices (in parallel), breaking early if possible
• More general than just BFS!• Generalized to many other problems
• For example, betweenness centrality, connected components, sparse PageRank, shortest paths, eccentricity estimation, graph clustering, k-core decomposition, set cover, etc.
• Users need not worry about this
Frontier-based approach enables hybrid traversal
12
0
2
4
6
8
10
12
14
BFS BetweennessCentrality
ConnectedComponents
EccentricityEstimation
40-c
ore
runn
ing
time
(sec
onds
)
Twitter graph (41M vertices, 1.5B edges)
Dense
Sparse
Hybrid
30.7 20.7
(switching between sparse and dense using default threshold of |E|/20)
PageRank13
VertexMap14
0 4 68VertexSubset
4
7
52
1
0
6
8
3
68VertexSubset
bool f(v){data[v] = data[v] + 1;return (data[v] == 1);
}
4
0
6
8
VertexMap
T F F T
PageRank in Ligrap_curr = {1/|V|, …, 1/|V|}; p_next = {0, …, 0}; diff = {}; error =∞;
procedure UPDATE(s, d):
atomic_increment(p_next[d], p_curr[s] / degree(s));
return 1;
procedure COMPUTE(i):
p_next[i] = α ∙ p_next[i] + (1- α) ∙ (1/|V|);
diff[i] = abs(p_next[i] – p_curr[i]);
p_curr[i] = 0;
return 1;
procedure PageRank(G, α, ε):
frontier = {0, …, |V|-1};
while (error > ε):
frontier = EDGEMAP(G, frontier, UPDATE, CONDtrue);frontier = VERTEXMAP(frontier, COMPUTE);
error = sum of diff entries;
swap(p_curr, p_next)
return p_curr;
15
PageRank• Sparse version?
• PageRank-Delta: Only update vertices whose PageRank value has changed by more than some Δ-fraction (discussed in PowerGraph and McSherry WWW ‘05)
16
PageRank-Delta in Ligra
PR[i] = {1/|V|, …, 1/|V|}; nghSum = {0, …, 0};Change = {}; //store changes in PageRank values
procedure UPDATE(s, d): //passed to EdgeMapatomic_increment(nghSum[d], Change[s] / degree(s));return 1;
procedure COMPUTE(i): //passed to VertexMapChange[i] = α ∙ nghSum[i]; PR[i] = PR[i] + Change[i];return (abs(Change[i]) > Δ); //check if absolute value of change is big enough
17
Performance of Ligra
18
Ligra BFS Performance19
0
0.05
0.1
0.15
0.2
0.25
0.3
BFS
Run
ning
tim
e (s
econ
ds)
Twitter graph (41M vertices, 1.5B edges)
Ligra (40-coremachine)
Hand-writtenCilk/OpenMP (40-coremachine)
020406080
100120140160180
BFS
Line
s of
cod
e• Comparing against hybrid traversal BFS code by Beamer et al.
Ligra PageRank Performance20
0
2
4
6
8
10
12
14
Page Rank (1 iteration)
Run
ning
tim
e (s
econ
ds)
Twitter graph (41M vertices, 1.5B edges)
PowerGraph (64 x 8-cores)
PowerGraph (40-coremachine)
Ligra (40-coremachine)
Hand-writtenCilk/OpenMP (40-coremachine) 0
10
20
30
40
50
60
70
80
Page Rank
Line
s of
cod
e
• Easy to implement “sparse” version of PageRank in Ligra
Connected Components Performance21
0
50
100
150
200
250
Connected Components
Run
ning
tim
e (s
econ
ds)
Twitter graph (41M vertices, 1.5B edges)
PowerGraph (16 x 8-cores)[Gonzalez et al.]PowerGraph (40-coremachine)
Ligra (40-coremachine)
Hand-writtenCilk/OpenMP (40-core machine) 0
50100150200250300350400
Connected Components
Line
s of
cod
e
• Ligra’s performance is close to hand-written code• Faster than best existing system • Subsequent systems have used Ligra’s abstraction and hybrid
traversal idea, e.g., Galois [SOSP ‘13], Polymer [PPoPP ’15], Gunrock [PPoPP ’16], Gemini [OSDI ’16], GraphGrind [ICS ‘17], Grazelle [PPoPP ‘18]
22.5
0
0.5
1
1.5
2
2.5
3
Connected Components
Run
ning
tim
e (s
econ
ds)
3.5 billion vertices128 billion edges
(540 GB)
Largest publicly available graph
72-core machine with 1TB RAM Ligra Running timeBFS 12sConnected components 42s1 iteration PageRank 28s
Large Graphs22
6.6 billion edges
128 billion edges
~1 trillion edges [VLDB 2015]
• Most can fit on commodity shared memory machine
Amazon EC2
ExampleDell PowerEdge R930:Up to 96 cores and 6 TB of RAM
What if you don’t have or can’t afford that much memory?
23R
unni
ng T
ime
Memory Required
Graph Compression
Ligra+: Adding Graph Compression to Ligra
24
• Same interface as Ligra• All changes hidden from the user!
25
Ligra+: Adding Graph Compression to Ligra
Graph
VertexSubset
EdgeMap
VertexMap
Interface
Use compressed representation
Decode edges on-the-fly
Same as before
Same as before
Graph representation26
0 4 5 11
2 7 9 16 0 1 6 9 12
...
...
Offsets
Edges
2 5 2 7 -1 -1 5 3 3 ... Compressed
Edges
• Graph reordering to improve locality
Vertex IDs 0 1 2 3
Sort edges and encode differences
2 - 0 = 2 7 - 2 = 5 1 - 2 = -1
0 0 0 0 0 1 10 0 1 0 0 0 1
• k-bit codes• Encode value in chunks of k bits• Use k-1 bits for data, and 1 bit as the “continue” bit
• Example: encode “401” using 8-bit (byte) code• In binary:
27
1 1 0 0 1 0 0 0 1
1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1
“continue” bit
7 bits for data
Variable-length codes
• Another idea: get rid of “continue” bits
28
0 1 0 1 1 0 0 1
Number of bytes per integer
Size of group(max 64)
Header
……Integers in group
encoded in byte chunks
• Increases space, but makes decoding cheaper (no branch misprediction from checking “continue” bit)
x1 x2 x3 x4 x5 x6 x7 x8 ……Number of bytes required to encode each integer
1 2 2 2 2 2 2 2
Use run-length encoding
……
Encoding optimization
• Same interface as Ligra• All changes hidden from the user!
29
Ligra+: Adding Graph Compression to Ligra
Graph
VertexSubset
EdgeMap
VertexMap
Interface
Use compressed representation
Decode edges on-the-fly
Same as before
Same as before
• Processes outgoing edges of a subset of vertices
30
Modifying EdgeMap
2 5 2 7 9 2 1 3 3VertexSubset
-4 6 3 1 3 5 6 2
5 10 2
30 5
-16 2 19 1 4 2 5 3
All vertices processed in parallel
16
25
44
0
7
What about high-degree vertices?
31
Handling high-degree vertices
-1 2 4 3 16 2 1 5 8 19 4 1 23 14 12 1 9 10 3 5
High-degree vertex
…
Chunks of size T
-1 2 4 3 16 2 27 5 8 19 4 1 87 14 12 1 9 10
…
…
Encode first entry relative to source vertex
All chunks can be decoded in parallel!
• We chose T=1000• Similar performance
and space usage for a wide range of T
Ligra+ Space Savings
32
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
soc-L
J
cit-P
atents
com-L
J
com-O
rkut
nlpkkt240
uk-u
nion
Yahoo
Space relative to Ligra using byte codes with run-length encoding
Ligra
Ligra+
• Space savings of about 1.3—3x
• Could use more sophisticated schemes to further
reduce space, but more expensive to decode
• Cost of decoding on-the-fly?
Ligra+ Performance33
0.60.70.80.9
11.11.21.31.4
BFS
Betwee
nnes
s
Eccen
tricit
y
Compo
nents
PageR
ank
Bellm
an-F
ord
40-core time relative to Ligra
• Cost of decoding on-the-fly?• Memory subsystem is a scalability bottleneck in
parallel as these graph algorithms are memory-bound• Ligra+ decoding gets better parallel speed up
0.60.70.80.9
11.11.21.31.4
BFS
Betwee
nnes
s…
Eccen
tricit
y
Compo
nents
PageR
ank
Bellm
an-F
ord
Single-thread time relative to Ligra
Ligra
Ligra+
05
10152025303540
BFS
Betwee
nnes
s
Eccen
tricit
y
Compo
nents
PageR
ank
Bellm
an-F
ord
Self-relative 40-core Speedup
Ligra
Ligra+
Ligra Summary35
EdgeMapVertexMap
Breadth-first searchBetweenness centralityConnected componentsTriangle countingK-core decompositionMaximal independent set…
Single-source shortest pathsEccentricity estimation(Personalized) PageRankLocal graph clusteringBiconnected componentsCollaborative filtering…
Simplicity, Performance, Scalability
VertexSubset
Optimizations: Hybrid traversal and graph compression