[IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)]



Parallel Algorithms for Counting Triangles and Computing Clustering Coefficients

S M Arifuzzaman (1,2), Maleq Khan (2), and Madhav V. Marathe (1,2)
(1) Dept. of Computer Science and (2) Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, VA 24061

Parallel Algorithm for Triangle Counting

With P processors, the graph is partitioned into P partitions. Each processor reads its own partition in parallel from the input file, performs local computation, and the results are then combined.

Parallel Algorithms for Approximate Triangle Counting

- In edge sparsification [1], unlike wedge sampling [2], each edge is kept independently with probability p.
- If the sparsified graph has T_s triangles, the estimated number of triangles in G is T_s / p^3. This estimator is unbiased: a triangle survives only if all three of its edges survive, which happens with probability p^3, so E[T_s / p^3] = T.
- In our parallel algorithm, each processor i sparsifies its own partition G_i(V_i, E_i) independently.
- Consider an edge that overlaps two partitions (e.g., edge (2,3) in the figure below); the survival of this edge in the two partitions is decided independently. This independence improves the accuracy of the estimate.
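A minimal sequential sketch of the sparsification step in Python (illustrative only; the edge-list representation and function names are assumptions, not the poster's code):

    import random

    def sparsify(edges, p, seed=0):
        # Keep each edge independently with probability p (DOULION-style [1]).
        rng = random.Random(seed)
        return [e for e in edges if rng.random() < p]

    def estimate_triangles(t_sparsified, p):
        # Scale the triangle count T_s of the sparsified graph by 1/p^3:
        # a triangle survives only if all 3 of its edges survive, which
        # happens with probability p^3, so E[T_s / p^3] equals the true T.
        return t_sparsified / p**3

Any exact counter (such as NodeIteratorN below) can supply t_sparsified for the sparsified edge list.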

References
[1] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, "DOULION: counting triangles in massive graphs with a coin," in Proc. of the 15th KDD, 2009, pp. 837–846.
[2] C. Seshadhri, A. Pinar, and T. G. Kolda, "Fast triangle counting through wedge sampling," CoRR, vol. abs/1202.5230, 2012.
[3] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," in Proc. of the 20th Intl. Conf. on World Wide Web (WWW), 2011.

Parallel Computation of Clustering Coefficients

- Partial counts of the triangles incident on a node v ∈ V may reside at different processors and need to be aggregated.
- Solution: sum them with MPI_Reduce, maintaining an n-element count array for the n nodes at each processor (O(n) memory); a sketch follows.
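A minimal sketch of this reduction with mpi4py and NumPy (an assumed toolchain; the poster says only that the algorithms are MPI-based):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n = 1000  # number of nodes (illustrative)

    # local_counts[v] holds this rank's partial triangle count of node v;
    # the local counting phase would fill it in.
    local_counts = np.zeros(n, dtype=np.int64)

    # Element-wise sum across ranks. Every rank keeps an n-element array,
    # which is exactly the O(n) memory cost noted above.
    total_counts = np.zeros(n, dtype=np.int64) if comm.Get_rank() == 0 else None
    comm.Reduce(local_counts, total_counts, op=MPI.SUM, root=0)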

The Problem and Contributions

We present MPI-based parallel algorithms for two important problems in network analysis: counting triangles and computing clustering coefficients in massive networks.

- A triangle in a graph G(V, E) is a set of three nodes u, v, w ∈ V such that there is an edge between each pair of them. The number of triangles incident on a node v with adjacency list N_v is
      T_v = |{(u, w) ∈ E : u, w ∈ N_v}|,
  and the total number of triangles is T(G) = (1/3) Σ_{v ∈ V} T_v, since each triangle is counted at its three nodes.
- The clustering coefficient (CC) of a node v ∈ V with degree d_v is
      C_v = T_v / C(d_v, 2) = 2 T_v / (d_v (d_v - 1)).
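These definitions translate directly into a brute-force reference implementation, useful for validating the faster algorithms on small graphs (a sketch; adj is assumed to map every node to its set of neighbors):

    from itertools import combinations

    def triangles_at(v, adj):
        # T_v: neighbor pairs of v that are themselves adjacent.
        return sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])

    def total_triangles(adj):
        # T(G) = (1/3) * sum over v of T_v.
        return sum(triangles_at(v, adj) for v in adj) // 3

    def clustering_coefficient(v, adj):
        # C_v = 2 * T_v / (d_v * (d_v - 1)), defined for degree >= 2.
        d = len(adj[v])
        return 2 * triangles_at(v, adj) / (d * (d - 1)) if d >= 2 else 0.0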

Improved Sequential Algorithm

- Many algorithms use an adjacency matrix representation, which is unsuitable for large graphs since it takes O(n^2) memory.
- NodeIterator++ (NPP) [3] is a state-of-the-art algorithm that uses an adjacency list representation.
- NodeIterator++ uses an ordering ≺ of the nodes to avoid counting any triangle more than once; a degree-based ordering reduces the running time significantly.

Proposed Modified Algorithm: NodeIteratorN (NN)

- Performs the comparison u ≺ v for each edge (u, v) ∈ E once in a preprocessing step, rather than repeatedly in the computation step.
- Stores in the adjacency list N_v only the neighbors u that are ordered higher than node v.

Accuracy and speedup of sparsification on the LiveJournal graph:

    p              0.1      0.2      0.3      0.4      0.5
    Accuracy (%)   99.61    99.685   99.832   99.898   99.947
    1/p^2          100      25       11.1     6.25     4
    Speedup        35.3     17.6     8.1      5        3.16

Variance and average error of the approximate counts:

    Networks       Variance          Avg Error (%)     Max Error (%)     Actual Count
                   Ours     [1]      Ours     [1]      Ours     [1]
    web-BerkStan   1.287    2.027    0.389    0.392    1.02     1.08     64.7M
    LiveJournal    1.77     1.958    1.46     1.86     3.88     4.75     285.7M

Algorithm for Clustering Coefficient (at each processor i):

    {T: array of counts of local triangles}
    for v ∈ V_i do
        T_v ← 0
    for v ∈ V_i^c do
        for u ∈ N_v do
            S ← N_v ∩ N_u
            T_v ← T_v + |S|
            T_u ← T_u + |S|
            for w ∈ S do
                T_w ← T_w + 1
    Communicate T_v for each v ∈ V_i - V_i^c
    Aggregate T_v for each v ∈ V_i^c
    Compute C_v for each v ∈ V_i^c

NodeIteratorN (NN):

    // Preprocessing steps:
    for each edge (u, v) ∈ E do
        if u ≺ v then
            store v in N_u
        else
            store u in N_v

    // Computation steps:
    for v ∈ V do
        sort N_v in ascending order
    T ← 0  // count of triangles
    for v ∈ V do
        for u ∈ N_v do
            S ← N_v ∩ N_u
            T ← T + |S|
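A runnable Python version of NodeIteratorN (a sketch: the degree-based ordering with node id as tie-breaker follows the definition of ≺ given later in this transcript, and Python sets replace the sorted-list intersection):

    from collections import defaultdict

    def node_iterator_n(nodes, edges):
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        # Degree-based ordering; node id breaks ties.
        rank = {v: (degree[v], v) for v in nodes}

        # Preprocessing: N[v] keeps only the neighbors ordered higher than v.
        N = {v: set() for v in nodes}
        for u, v in edges:
            if rank[u] < rank[v]:
                N[u].add(v)
            else:
                N[v].add(u)

        # Each triangle u ≺ v ≺ w is found exactly once, as w ∈ N[u] ∩ N[v].
        return sum(len(N[v] & N[u]) for v in nodes for u in N[v])

    # Two disjoint triangles -> prints 2
    print(node_iterator_n(range(1, 7),
                          [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]))

Sets make the |N_v ∩ N_u| step direct; the poster's sorted adjacency lists achieve the same intersection with a linear merge.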

Parallel exact triangle counting:

    {T: count of triangles}
    for each processor i, in parallel, do
        G_i(V_i, E_i) ← ReadGraph(G, i)
        T ← CountTriangles(G_i, i)
        MpiBarrier
        MpiReduce(T)

    Procedure CountTriangles(G_i, i):
        for v ∈ V_i do
            sort N_v in ascending order
        T ← 0
        for v ∈ V_i^c do
            for u ∈ N_v do
                S ← N_v ∩ N_u
                T ← T + |S|
        return T
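The same skeleton as a toy mpi4py program, runnable with e.g. mpiexec -n 2 (everything here is illustrative: the hard-coded preprocessed graph, taken from the 6-node example figure below, and the modulo assignment of core vertices stand in for ReadGraph and the real partitioner):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # The poster's 6-node example after preprocessing:
    # N[v] holds only the neighbors ordered higher than v.
    N = {1: {4}, 2: {1, 3}, 3: {1}, 4: set(), 5: {4, 6}, 6: {4}}

    # Core vertices of this rank: a simple static partition of V.
    core = [v for v in N if v % size == rank]

    # Each triangle is charged to exactly one core vertex, so the per-rank
    # counts add up to the global total without double counting.
    local = sum(len(N[v] & N[u]) for v in core for u in N[v])

    comm.Barrier()
    total = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print("triangles:", total)  # 2 for this toy graph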

Datasets:

    Networks       Nodes    Edges
    Email-Enron    37K      0.36M
    web-BerkStan   0.69M    13M
    Miami          2.1M     100M
    LiveJournal    4.8M     86M
    Twitter        42M      2.4B
    Gnp(n,d)       n        0.5nd
    Hcc(n,d)       n        0.5nd

Sequential runtime, NodeIterator++ (NPP) vs. NodeIteratorN (NN):

    Networks         Runtime (sec)        No. of triangles
                     NPP [3]    NN
    Email-Enron      0.08       0.03      0.7M
    web-BerkStan     3.3        0.4       64.7M
    LiveJournal      40.35      13        285.7M
    Miami            43.56      16        332M
    Gnp(500K, 20)    1.8        0.6       1308
    Hcc(15M, 90)     553        52.6      15B

Parallel runtime, our algorithm vs. [3]:

    Networks        No. of triangles    Our Algo.    [3]
    web-BerkStan    65M                 0.03s        1.7m
    LiveJournal     285.7M              0.39s        5.33m
    Twitter         34.8B               15m          423m
    Hcc(5M, 200)    24.7B               3.86s        -
    Hcc(2B, 50)     600B                9m           -
    Miami           332M                0.5s         -

Load-Balancing

- We assign an equal number of core vertices to each processor.
- Load can be imbalanced if the network has a skewed degree distribution.
- However, degree-based ordering provides very good load balancing without additional overhead.
- In the example network shown below, v_0 has degree n-1, yet |N_{v_0}| = 0, because every neighbor of v_0 is ordered lower than v_0, and |N_{v_i}| ≤ 3 for all other nodes.


Better solution (external-memory aggregation):

- Processor i writes P intermediate disk files F_j, one for each distinct processor j, containing its partial triangle counts for v ∈ V_j^c.
- All files F_j are then aggregated by processor j.
- Processor j computes C_v for v ∈ V_j^c. (A sketch follows.)
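A sketch of this file-based aggregation in Python (the file naming, pickle format, and owner() mapping are all assumptions for illustration):

    import os
    import pickle
    from collections import defaultdict

    def owner(v, P):
        # Hypothetical core-vertex assignment: which processor owns node v.
        return v % P

    def write_partial_counts(i, P, partial, outdir):
        # Processor i buckets its partial counts T_v by owning processor j
        # and writes one intermediate file per destination processor.
        buckets = defaultdict(dict)
        for v, t in partial.items():
            buckets[owner(v, P)][v] = t
        for j in range(P):
            with open(os.path.join(outdir, f"F_{i}_{j}.pkl"), "wb") as f:
                pickle.dump(dict(buckets[j]), f)

    def aggregate_counts(j, P, outdir):
        # Processor j reads the P files addressed to it and sums the
        # partial counts, yielding T_v for every v in V_j^c.
        total = defaultdict(int)
        for i in range(P):
            with open(os.path.join(outdir, f"F_{i}_{j}.pkl"), "rb") as f:
                for v, t in pickle.load(f).items():
                    total[v] += t
        return dict(total)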

[Figure: the example network for load balancing, with nodes v_0, v_1, ..., v_{n-1}, where v_0 is adjacent to all other nodes.]

Performance

Our algorithm:
- scales very well with the size of the network and the number of processors;
- is significantly faster than the only known distributed-memory parallel algorithm [3];
- runs on a 50B-edge network in 10 minutes, whereas a 2.9B-edge network is the largest processed by [3].

Acknowledgement: This work has been partially supported by NSF NetSE Grant CNS-1011769, NSF PetaApps Grant OCI-0904844, DTRA CNIMS Grant HDTRA1-11-D-0016-0001, NSF SDCI Grant OCI-1032677, and DTRA Grant HDTRA1-11-1-0016.

- Challenge: massive graphs that do not fit in main memory.
- Contributions:
  - an exact parallel algorithm that counts the triangles in a graph with 50B edges in 10 minutes;
  - a parallel sparsification algorithm for approximate counts with higher accuracy.

[Figure: example graph with nodes 1-6 and edges (1,2), (1,3), (2,3), (1,4), (4,5), (4,6), (5,6). With the degree-based ordering 2 ≺ 3 ≺ 5 ≺ 6 ≺ 1 ≺ 4, the preprocessing step stores
    N_1 = {4}, N_2 = {1,3}, N_3 = {1}, N_4 = {}, N_5 = {4,6}, N_6 = {4}.
Triangle 123 is counted once, by N_2 ∩ N_3 = {1}; triangle 456 is counted once, by N_5 ∩ N_6 = {4}.]

[Figure: partitioning of a small graph on nodes 1-4 into two overlapping subgraphs. Each V_i consists of the core vertices V_i^c plus their boundary neighbors, so an edge such as (2,3) can appear in two partitions.]

Partitioning

- Each processor i works on G_i, the subgraph of the original graph induced by V_i.
- V is partitioned into disjoint sets of core vertices: V_i^c ∩ V_j^c = ∅ for any two processors i ≠ j, and ∪_i V_i^c = V.
- V_i consists of V_i^c together with the neighbors of the vertices in V_i^c (see the sketch below).
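A sketch of this partitioning in Python (illustrative: integer node ids and the modulo assignment of core vertices stand in for whatever balanced partitioner is actually used):

    def partition(adj, P):
        # Disjoint core sets V_0^c, ..., V_{P-1}^c covering V.
        cores = [set() for _ in range(P)]
        for v in adj:
            cores[v % P].add(v)

        parts = []
        for c in cores:
            vi = set(c)
            for v in c:
                vi |= adj[v]  # V_i = core vertices plus their neighbors
            # Adjacency of the subgraph induced by V_i.
            parts.append({v: adj[v] & vi for v in vi})
        return cores, parts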

Counting Triangles

Each processor i counts the triangles incident on each v ∈ V_i^c, as shown in the parallel pseudocode above.

The degree-based ordering ≺ used above is defined by: u ≺ v if and only if d_u < d_v, or d_u = d_v and u < v.

[Figures: weak scaling curves; strong scaling curves; network density vs. runtime; memory scalability.]