[IEEE 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)]



Parallel Algorithms for Counting Triangles and Computing Clustering Coefficients

S M Arifuzzaman (1,2), Maleq Khan (2), and Madhav V. Marathe (1,2)
(1) Dept. of Computer Science and (2) Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, VA 24061

Parallel Algorithm for Triangle Counting

With P processors, the graph is partitioned into P partitions. Each processor reads its own partition in parallel from the input file, performs local computation, and the results are then combined.

Parallel Algorithms for Approximate Triangle Counting

- In edge sparsification [1], unlike wedge sampling [2], each edge is kept independently with probability p.
- If the sparsified graph has T_s triangles, the estimated number of triangles in G is T_s / p^3. This estimator is unbiased: a triangle survives only if all three of its edges survive, which happens with probability p^3, so E[T_s / p^3] = T.
- In our parallel algorithm, each processor i sparsifies its own partition G_i(V_i, E_i) independently.
- Consider an edge that overlaps two partitions (e.g., edge (2,3) in the figure below); the survival of this edge in the two partitions is decided independently. This independence improves the accuracy of the estimate.
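A minimal sequential sketch of the sparsification step in Python (illustrative only; the edge-list representation and function names are assumptions, not the poster's code):

    import random

    def sparsify(edges, p, seed=0):
        # Keep each edge independently with probability p (DOULION-style [1]).
        rng = random.Random(seed)
        return [e for e in edges if rng.random() < p]

    def estimate_triangles(t_sparsified, p):
        # Scale the triangle count T_s of the sparsified graph by 1/p^3:
        # a triangle survives only if all 3 of its edges survive, which
        # happens with probability p^3, so E[T_s / p^3] equals the true T.
        return t_sparsified / p**3

Any exact counter (such as NodeIteratorN below) can supply t_sparsified for the sparsified edge list.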

References
[1] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, "DOULION: counting triangles in massive graphs with a coin," in Proc. of the 15th KDD, 2009, pp. 837–846.
[2] C. Seshadhri, A. Pinar, and T. G. Kolda, "Fast triangle counting through wedge sampling," CoRR, vol. abs/1202.5230, 2012.
[3] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," in Proc. of the 20th Intl. Conf. on World Wide Web (WWW), 2011.

Parallel Computation of Clustering Coefficients

- Partial counts of the triangles incident on a node v ∈ V may reside at different processors and need to be aggregated.
- Solution: sum them with MPI_Reduce, maintaining an n-element count array for the n nodes at each processor (O(n) memory); a sketch follows.
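A minimal sketch of this reduction with mpi4py and NumPy (an assumed toolchain; the poster says only that the algorithms are MPI-based):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n = 1000  # number of nodes (illustrative)

    # local_counts[v] holds this rank's partial triangle count of node v;
    # the local counting phase would fill it in.
    local_counts = np.zeros(n, dtype=np.int64)

    # Element-wise sum across ranks. Every rank keeps an n-element array,
    # which is exactly the O(n) memory cost noted above.
    total_counts = np.zeros(n, dtype=np.int64) if comm.Get_rank() == 0 else None
    comm.Reduce(local_counts, total_counts, op=MPI.SUM, root=0)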

The Problem and Contributions

We present MPI-based parallel algorithms for two important problems in network analysis: counting triangles and computing clustering coefficients in massive networks.

- A triangle in a graph G(V, E) is a set of three nodes u, v, w ∈ V such that there is an edge between each pair of them. The number of triangles incident on a node v with adjacency list N_v is
      T_v = |{(u, w) ∈ E : u, w ∈ N_v}|,
  and the total number of triangles is T(G) = (1/3) Σ_{v ∈ V} T_v, since each triangle is counted at its three nodes.
- The clustering coefficient (CC) of a node v ∈ V with degree d_v is
      C_v = T_v / C(d_v, 2) = 2 T_v / (d_v (d_v - 1)).
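These definitions translate directly into a brute-force reference implementation, useful for validating the faster algorithms on small graphs (a sketch; adj is assumed to map every node to its set of neighbors):

    from itertools import combinations

    def triangles_at(v, adj):
        # T_v: neighbor pairs of v that are themselves adjacent.
        return sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])

    def total_triangles(adj):
        # T(G) = (1/3) * sum over v of T_v.
        return sum(triangles_at(v, adj) for v in adj) // 3

    def clustering_coefficient(v, adj):
        # C_v = 2 * T_v / (d_v * (d_v - 1)), defined for degree >= 2.
        d = len(adj[v])
        return 2 * triangles_at(v, adj) / (d * (d - 1)) if d >= 2 else 0.0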

Improved Sequential Algorithm

- Many algorithms use an adjacency matrix representation, which is unsuitable for large graphs since it takes O(n^2) memory.
- NodeIterator++ (NPP) [3] is a state-of-the-art algorithm that uses an adjacency list representation.
- NodeIterator++ uses an ordering ≺ of the nodes to avoid counting any triangle more than once; a degree-based ordering reduces the running time significantly.

Proposed Modified Algorithm: NodeIteratorN (NN)

- Performs the comparison u ≺ v for each edge (u, v) ∈ E once in a preprocessing step, rather than repeatedly in the computation step.
- Stores in the adjacency list N_v only the neighbors u that are ordered higher than node v.

Accuracy and speedup of sparsification on the LiveJournal graph:

    p              0.1      0.2      0.3      0.4      0.5
    Accuracy (%)   99.61    99.685   99.832   99.898   99.947
    1/p^2          100      25       11.1     6.25     4
    Speedup        35.3     17.6     8.1      5        3.16

Variance and average error of the approximate counts:

    Networks       Variance          Avg Error (%)     Max Error (%)     Actual Count
                   Ours     [1]      Ours     [1]      Ours     [1]
    web-BerkStan   1.287    2.027    0.389    0.392    1.02     1.08     64.7M
    LiveJournal    1.77     1.958    1.46     1.86     3.88     4.75     285.7M

Algorithm for Clustering Coefficient (at each processor i):

    {T: array of counts of local triangles}
    for v ∈ V_i do
        T_v ← 0
    for v ∈ V_i^c do
        for u ∈ N_v do
            S ← N_v ∩ N_u
            T_v ← T_v + |S|
            T_u ← T_u + |S|
            for w ∈ S do
                T_w ← T_w + 1
    Communicate T_v for each v ∈ V_i - V_i^c
    Aggregate T_v for each v ∈ V_i^c
    Compute C_v for each v ∈ V_i^c

NodeIteratorN (NN):

    // Preprocessing steps:
    for each edge (u, v) ∈ E do
        if u ≺ v then
            store v in N_u
        else
            store u in N_v

    // Computation steps:
    for v ∈ V do
        sort N_v in ascending order
    T ← 0  // count of triangles
    for v ∈ V do
        for u ∈ N_v do
            S ← N_v ∩ N_u
            T ← T + |S|
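A runnable Python version of NodeIteratorN (a sketch: the degree-based ordering with node id as tie-breaker follows the definition of ≺ given later in this transcript, and Python sets replace the sorted-list intersection):

    from collections import defaultdict

    def node_iterator_n(nodes, edges):
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        # Degree-based ordering; node id breaks ties.
        rank = {v: (degree[v], v) for v in nodes}

        # Preprocessing: N[v] keeps only the neighbors ordered higher than v.
        N = {v: set() for v in nodes}
        for u, v in edges:
            if rank[u] < rank[v]:
                N[u].add(v)
            else:
                N[v].add(u)

        # Each triangle u ≺ v ≺ w is found exactly once, as w ∈ N[u] ∩ N[v].
        return sum(len(N[v] & N[u]) for v in nodes for u in N[v])

    # Two disjoint triangles -> prints 2
    print(node_iterator_n(range(1, 7),
                          [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]))

Sets make the |N_v ∩ N_u| step direct; the poster's sorted adjacency lists achieve the same intersection with a linear merge.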

Parallel exact triangle counting:

    {T: count of triangles}
    for each processor i, in parallel, do
        G_i(V_i, E_i) ← ReadGraph(G, i)
        T ← CountTriangles(G_i, i)
        MpiBarrier
        MpiReduce(T)

    Procedure CountTriangles(G_i, i):
        for v ∈ V_i do
            sort N_v in ascending order
        T ← 0
        for v ∈ V_i^c do
            for u ∈ N_v do
                S ← N_v ∩ N_u
                T ← T + |S|
        return T
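The same skeleton as a toy mpi4py program, runnable with e.g. mpiexec -n 2 (everything here is illustrative: the hard-coded preprocessed graph, taken from the 6-node example figure below, and the modulo assignment of core vertices stand in for ReadGraph and the real partitioner):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # The poster's 6-node example after preprocessing:
    # N[v] holds only the neighbors ordered higher than v.
    N = {1: {4}, 2: {1, 3}, 3: {1}, 4: set(), 5: {4, 6}, 6: {4}}

    # Core vertices of this rank: a simple static partition of V.
    core = [v for v in N if v % size == rank]

    # Each triangle is charged to exactly one core vertex, so the per-rank
    # counts add up to the global total without double counting.
    local = sum(len(N[v] & N[u]) for v in core for u in N[v])

    comm.Barrier()
    total = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print("triangles:", total)  # 2 for this toy graph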

Datasets:

    Networks       Nodes    Edges
    Email-Enron    37K      0.36M
    web-BerkStan   0.69M    13M
    Miami          2.1M     100M
    LiveJournal    4.8M     86M
    Twitter        42M      2.4B
    Gnp(n,d)       n        0.5nd
    Hcc(n,d)       n        0.5nd

Sequential runtime, NodeIterator++ (NPP) vs. NodeIteratorN (NN):

    Networks         Runtime (sec)        No. of triangles
                     NPP [3]    NN
    Email-Enron      0.08       0.03      0.7M
    web-BerkStan     3.3        0.4       64.7M
    LiveJournal      40.35      13        285.7M
    Miami            43.56      16        332M
    Gnp(500K, 20)    1.8        0.6       1308
    Hcc(15M, 90)     553        52.6      15B

Parallel runtime, our algorithm vs. [3]:

    Networks        No. of triangles    Our Algo.    [3]
    web-BerkStan    65M                 0.03s        1.7m
    LiveJournal     285.7M              0.39s        5.33m
    Twitter         34.8B               15m          423m
    Hcc(5M, 200)    24.7B               3.86s        -
    Hcc(2B, 50)     600B                9m           -
    Miami           332M                0.5s         -

Load-Balancing

- We assign an equal number of core vertices to each processor.
- Load can be imbalanced if the network has a skewed degree distribution.
- However, degree-based ordering provides very good load balancing without additional overhead.
- In the example network shown below, v_0 has degree n-1, yet |N_{v_0}| = 0, because every neighbor of v_0 is ordered lower than v_0, and |N_{v_i}| ≤ 3 for all other nodes.


Better solution (external-memory aggregation):

- Processor i writes P intermediate disk files F_j, one for each distinct processor j, containing its partial triangle counts for v ∈ V_j^c.
- All files F_j are then aggregated by processor j.
- Processor j computes C_v for v ∈ V_j^c. (A sketch follows.)
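A sketch of this file-based aggregation in Python (the file naming, pickle format, and owner() mapping are all assumptions for illustration):

    import os
    import pickle
    from collections import defaultdict

    def owner(v, P):
        # Hypothetical core-vertex assignment: which processor owns node v.
        return v % P

    def write_partial_counts(i, P, partial, outdir):
        # Processor i buckets its partial counts T_v by owning processor j
        # and writes one intermediate file per destination processor.
        buckets = defaultdict(dict)
        for v, t in partial.items():
            buckets[owner(v, P)][v] = t
        for j in range(P):
            with open(os.path.join(outdir, f"F_{i}_{j}.pkl"), "wb") as f:
                pickle.dump(dict(buckets[j]), f)

    def aggregate_counts(j, P, outdir):
        # Processor j reads the P files addressed to it and sums the
        # partial counts, yielding T_v for every v in V_j^c.
        total = defaultdict(int)
        for i in range(P):
            with open(os.path.join(outdir, f"F_{i}_{j}.pkl"), "rb") as f:
                for v, t in pickle.load(f).items():
                    total[v] += t
        return dict(total)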

[Figure: the example network for load balancing, with nodes v_0, v_1, ..., v_{n-1}, where v_0 is adjacent to all other nodes.]

Performance

Our algorithm:
- scales very well with the size of the network and the number of processors;
- is significantly faster than the only known distributed-memory parallel algorithm [3];
- runs on a 50B-edge network in 10 minutes, whereas a 2.9B-edge network is the largest processed by [3].

Acknowledgement: This work has been partially supported by NSF NetSE Grant CNS-1011769, NSF PetaApps Grant OCI-0904844, DTRA CNIMS Grant HDTRA1-11-D-0016-0001, NSF SDCI Grant OCI-1032677, and DTRA Grant HDTRA1-11-1-0016.

- Challenge: massive graphs that do not fit in main memory.
- Contributions:
  - an exact parallel algorithm that counts the triangles in a graph with 50B edges in 10 minutes;
  - a parallel sparsification algorithm for approximate counts with higher accuracy.

[Figure: example graph with nodes 1-6 and edges (1,2), (1,3), (2,3), (1,4), (4,5), (4,6), (5,6). With the degree-based ordering 2 ≺ 3 ≺ 5 ≺ 6 ≺ 1 ≺ 4, the preprocessing step stores
    N_1 = {4}, N_2 = {1,3}, N_3 = {1}, N_4 = {}, N_5 = {4,6}, N_6 = {4}.
Triangle 123 is counted once, by N_2 ∩ N_3 = {1}; triangle 456 is counted once, by N_5 ∩ N_6 = {4}.]

[Figure: partitioning of a small graph on nodes 1-4 into two overlapping subgraphs. Each V_i consists of the core vertices V_i^c plus their boundary neighbors, so an edge such as (2,3) can appear in two partitions.]

Partitioning

- Each processor i works on G_i, the subgraph of the original graph induced by V_i.
- V is partitioned into disjoint sets of core vertices: V_i^c ∩ V_j^c = ∅ for any two processors i ≠ j, and ∪_i V_i^c = V.
- V_i consists of V_i^c together with the neighbors of the vertices in V_i^c (see the sketch below).
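A sketch of this partitioning in Python (illustrative: integer node ids and the modulo assignment of core vertices stand in for whatever balanced partitioner is actually used):

    def partition(adj, P):
        # Disjoint core sets V_0^c, ..., V_{P-1}^c covering V.
        cores = [set() for _ in range(P)]
        for v in adj:
            cores[v % P].add(v)

        parts = []
        for c in cores:
            vi = set(c)
            for v in c:
                vi |= adj[v]  # V_i = core vertices plus their neighbors
            # Adjacency of the subgraph induced by V_i.
            parts.append({v: adj[v] & vi for v in vi})
        return cores, parts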

Counting Triangles

Each processor i counts the triangles incident on each v ∈ V_i^c, as shown in the parallel pseudocode above.

The degree-based ordering ≺ used above is defined by: u ≺ v if and only if d_u < d_v, or d_u = d_v and u < v.

[Figures: weak scaling curves; strong scaling curves; network density vs. runtime; memory scalability.]