[IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC) - Salt Lake City, UT]
TRANSCRIPT
Parallel Algorithms for Counting Triangles and Computing Clustering Coefficients
S. M. Arifuzzaman1,2, Maleq Khan2 and Madhav V. Marathe1,2
1Dept. of Computer Science and 2Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061
Parallel Algorithm for Triangle Counting
With P processors, the graph is partitioned into P partitions. Each processor reads its own partition in parallel from the input file. Each processor performs local computation, and the results are then combined.
• In edge sparsification [1], unlike wedge sampling [2], each edge is selected independently with probability p.
• If the sparsified graph has Ts triangles, the estimated number of triangles in G is T = Ts / p^3, which is an unbiased estimator since E[T] = E[Ts] / p^3 = T(G).
• In our parallel algorithm, each processor i sparsifies its own partition Gi(Vi, Ei) independently.
• Consider an edge that overlaps two partitions (e.g., edge (2,3) in the figure below); the survival of this edge in the two partitions is independent. This independence improves the accuracy of the estimation.
Parallel Algorithms for Approximate Triangle Counting
References
[1] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, "DOULION: Counting triangles in massive graphs with a coin," in Proc. of the 15th KDD, 2009, pp. 837–846.
[2] C. Seshadhri, A. Pinar, and T. G. Kolda, "Fast triangle counting through wedge sampling," CoRR, abs/1202.5230, 2012.
[3] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," in Proc. of the 20th Intl. Conf. on World Wide Web (WWW), 2011.
Parallel Computation of Clustering Coefficients
• Partial counts of triangles incident on a node v ∈ V may reside at different processors and need to be aggregated.
• Solution: sum the counts with MPI_Reduce, maintaining an n-element count array for the n nodes at each processor (O(n) memory).

The Problem and Contributions
We present MPI-based parallel algorithms for two important problems in network
analysis: counting triangles and computing clustering coefficients in massive networks.
• A triangle in a graph G(V, E) is a set of three nodes u, v, w ∈ V such that there is an edge between each pair of nodes. The number of triangles incident on node v with adjacency list Nv is defined as Tv = |{(u, w) ∈ E : u, w ∈ Nv}|.
• Total number of triangles: T(G) = (1/3) Σv Tv.
• The clustering coefficient (CC) of a node v ∈ V with degree dv is defined as Cv = Tv / C(dv, 2) = 2Tv / (dv(dv − 1)).
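The clustering coefficient of a single node can be computed directly from adjacency sets; a minimal Python sketch (the function name is ours):

```python
def clustering_coefficient(adj, v):
    """Cv = Tv / C(dv, 2): fraction of neighbor pairs of v that are connected."""
    dv = len(adj[v])
    if dv < 2:
        return 0.0  # no neighbor pairs exist for degree < 2
    # Tv: number of edges among the neighbors of v
    tv = sum(1 for u in adj[v] for w in adj[v] if u < w and w in adj[u])
    return tv / (dv * (dv - 1) / 2)
```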
Improved Sequential Algorithm
• Many algorithms use an adjacency matrix representation, which is not suitable for large graphs as it takes O(n^2) memory.
• NodeIterator++ (NPP) [3] is a state-of-the-art algorithm that uses an adjacency list representation.
• NodeIterator++ uses an ordering, ≺, of the nodes to avoid duplicate counting of triangles. A degree-based ordering reduces the running time significantly.
Proposed Modified Algorithm: NodeIteratorN (NN)
• Performs the comparison u ≺ v for each edge (u, v) ∈ E in the preprocessing step rather than in the computation step.
• Stores only the neighbors u with higher order than node v in the adjacency list Nv.
p             0.1     0.2      0.3      0.4      0.5
Accuracy (%)  99.61   99.685   99.832   99.898   99.947
1/p^2         100     25       11.1     6.25     4
Speedup       35.3    17.6     8.1      5        3.16
Networks       Variance            Avg Error (%)       Max Error (%)       Actual Count
               Our Algo.   [1]     Our Algo.   [1]     Our Algo.   [1]
web-BerkStan   1.287       2.027   0.389       0.392   1.02        1.08    64.7M
LiveJournal    1.77        1.958   1.46        1.86    3.88        4.75    285.7M
{T: array of counts of local triangles}
for v ∈ Vi do
    Tv ← 0
for v ∈ Vi^c do
    for u ∈ Nv do
        S ← Nv ∩ Nu
        Tv ← Tv + |S|
        Tu ← Tu + |S|
        for w ∈ S do
            Tw ← Tw + 1
Communicate Tv for each v ∈ Vi − Vi^c
Aggregate Tv for each v ∈ Vi^c
Compute Cv for each v ∈ Vi^c
// Preprocessing Steps:
for each edge (u, v) ∈ E do
    if u ≺ v then
        store v in Nu
    else
        store u in Nv
// Computation Steps:
for v ∈ V do
    sort Nv in ascending order
T ← 0 // count of triangles
for v ∈ V do
    for u ∈ Nv do
        S ← Nv ∩ Nu
        T ← T + |S|
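The NN steps above can be sketched in Python. This is our own hedged sketch, not the authors' code; the degree-based ordering breaks ties by node id.

```python
from collections import defaultdict

def node_iterator_n(edges):
    """NodeIteratorN sketch: orient each edge toward its higher-ordered
    endpoint in a preprocessing pass, then count |Nv ∩ Nu| per stored edge."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    # degree-based ordering: u ≺ v iff (du, u) < (dv, v)
    def prec(u, v):
        return (deg[u], u) < (deg[v], v)

    # Preprocessing: Nv keeps only neighbors with higher order than v
    N = defaultdict(set)
    for u, v in edges:
        if prec(u, v):
            N[u].add(v)
        else:
            N[v].add(u)

    # Computation: each triangle is counted exactly once
    # (list(N) snapshots the keys, since N[u] lookups may insert new keys)
    return sum(len(N[v] & N[u]) for v in list(N) for u in N[v])
```

Because every edge is oriented once in preprocessing, the computation loop never repeats the u ≺ v comparison, which is the point of the NN modification.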
{T: count of triangles}
for each processor i, in parallel, do
    Gi(Vi, Ei) ← ReadGraph(G, i)
    T ← CountTriangles(Gi, i)
    MpiBarrier
    MpiReduce(T)

Procedure CountTriangles(Gi, i):
for v ∈ Vi do
    sort Nv in ascending order
T ← 0
for v ∈ Vi^c do
    for u ∈ Nv do
        S ← Nv ∩ Nu
        T ← T + |S|
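The parallel scheme can be illustrated with a serial simulation in Python, assuming an id-based ordering in place of the degree-based one; each loop iteration plays the role of one processor, and the final sum stands in for MpiReduce:

```python
from collections import defaultdict

def simulated_parallel_count(edges, P):
    """Partition the core vertices across P 'processors'; each counts the
    triangles charged to its own core vertices, and the partial counts are
    then summed (the role MPI_Reduce plays in the real algorithm)."""
    N = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if u < v else (v, u)  # id ordering stands in for ≺
        N[a].add(b)
    nodes = sorted({x for e in edges for x in e})
    cores = [nodes[i::P] for i in range(P)]  # disjoint core sets covering V
    partial = [sum(len(N[v] & N[u]) for v in core for u in N[v])
               for core in cores]
    return sum(partial)
```

Each partial count depends only on that processor's core vertices, so the per-processor loops are independent and the result is invariant to how the cores are assigned.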
Networks       Nodes    Edges
Email-Enron    37K      0.36M
web-BerkStan   0.69M    13M
Miami          2.1M     100M
LiveJournal    4.8M     86M
Twitter        42M      2.4B
Gnp(n, d)      n        0.5nd
Hcc(n, d)      n        0.5nd
Networks         Runtime (sec)          No. of triangles
                 NPP [3]     NN
Email-Enron      0.08        0.03       0.7M
web-BerkStan     3.3         0.4        64.7M
LiveJournal      40.35       13         285.7M
Miami            43.56       16         332M
Gnp(500K, 20)    1.8         0.6        1308
Hcc(15M, 90)     553         52.6       15B
Networks        Number of triangles    Runtime
                                       Our Algo.    [3]
web-BerkStan    65M                    0.03s        1.7m
LiveJournal     285.7M                 0.39s        5.33m
Twitter         34.8B                  15m          423m
Hcc(5M, 200)    24.7B                  3.86s        -
Hcc(2B, 50)     600B                   9m           -
Miami           332M                   0.5s         -
Load-Balancing
• We assign an equal number of core vertices to each processor.
• Load can be imbalanced if the network has a skewed degree distribution.
• However, degree-based ordering provides very good load balancing without additional overhead.
• In the example network shown on the right, v0 has degree n−1, but we have |Nv0| = 0 and |Nvi| ≤ 3.
Accuracy and speedup on LiveJournal graph
Variance and average error
Better Solution (external-memory aggregation):
• Processor i writes P intermediate disk files Fj, one for each distinct processor j, containing the triangle counts for v ∈ Vj^c.
• All files Fj are then aggregated by processor j.
• Processor j computes Cv for v ∈ Vj^c.
[Figure: example network with center node v0 connected to v1, v2, …, vn−1]
Performance
• Scales very well with the size of the network and the number of processors.
• Is significantly faster than the only known distributed-memory parallel algorithm [3].
• Is able to run on a 50B-edge network in 10 minutes; a 2.9B-edge network is the largest one processed by [3].
Acknowledgement
This work has been partially supported by the Grants NSF NetSE CNS-1011769, NSF PetaApps OCI-0904844, DTRA CNIMS HDTRA1-11-D-0016-0001, NSF SDCI OCI-1032677 and DTRA Grant HDTRA1-11-1-0016.
• Challenge: massive graphs that do not fit in main memory
• Contributions:
  – Exact parallel algorithm that counts the triangles in a graph with 50B edges in 10 minutes
  – Parallel sparsification algorithm for approximate counting with higher accuracy
[Figure: example graph with nodes 1–6 and adjacency lists after ordering: N1 = {4}, N2 = {1, 3}, N3 = {1}, N4 = {}, N5 = {4, 6}, N6 = {4}]
∆123 is counted by N2 ∩ N3 = {1}
∆456 is counted by N5 ∩ N6 = {4}
[Figure: the node set V partitioned; processor i works on partition Vi with core vertices Vi^c]
Partitioning
• Each processor i works on Gi(Vi, Ei), a subgraph of the original graph induced by Vi.
• V is partitioned into sets of core vertices: for processors i and j, Vi^c ∩ Vj^c = ∅, and ∪i Vi^c = V.
Counting Triangles
Each processor i counts the total triangles incident on v ∈ Vi^c as shown in the pseudocode below.
Degree-based ordering: u ≺ v ⟺ (du < dv) or (du = dv and u < v)
Weak Scaling Curves
Network Density vs Runtime
Strong Scaling Curves
Algorithm for Clustering Coefficient
Memory Scalability
Strong Scaling Curves