Fast Triangle Counting through Wedge Sampling

Ali Pinar, C. Seshadhri, and Tamara G. Kolda
Sandia National Laboratories

7/10/2012 Pinar ‐ SIAM Annual 12 1

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

U.S. Department of Energy, Office of Advanced Scientific Computing Research

U.S. Department of Defense, Defense Advanced Research Projects Agency

Presenter Notes

Title: The Block Two-Level Erdos-Renyi (BTER) Graph Model

Authors: C. Seshadhri, Tamara G. Kolda (speaker), Ali Pinar

Abstract: Graphs can be used to model a wide variety of interactions, ranging from the digital interconnectivity of the Internet to the "Friend" links on Facebook. The goal of graph models is to emulate the key properties of observed graphs; these models can then be used to test the scalability and robustness of algorithms and hardware architectures. One major problem is to find a simple and scalable model that has the right degree distribution and clustering properties of real graphs. We propose a new Block Two-Level Erdos-Renyi (BTER) graph model which builds a large number of small and dense Erdos-Renyi graphs in the first phase and then interconnects those blocks using a weighted Erdos-Renyi graph (also known as a configuration model) in the second phase. This model has several key advantages as compared to existing models:
(1) The BTER model specification requires only two parameters in addition to the parameters of the desired degree distribution.
(2) The BTER algorithm is easily parallelized and can scale to high dimensions, with only O(1) work per edge in phase 1 and O(log n) work per edge in phase 2.
(3) Any degree distribution can be matched perfectly by BTER. We note that the degree distribution may be some specified distribution like power law, which needs only two parameters to be completely specified, or an observed degree distribution from real data.
(4) Graphs produced by BTER naturally have community structure even for small degree nodes (i.e., as evidenced by high clustering coefficients) as a result of the small dense blocks inserted in phase 1, and small effective diameters as a result of the weighted Erdos-Renyi links inserted in phase 2.
(5) Finally, BTER graphs show exceptional similarities to real-world data in terms of clustering coefficients, effective diameter, and eigenvalue distribution.

Bio: Tamara (Tammy) Kolda is a Distinguished Member of Technical Staff in the Informatics and Systems Assessments department at Sandia National Laboratories in Livermore, California. Her research interests include multilinear algebra and tensor decompositions, data mining, optimization, nonlinear solvers, graph algorithms, parallel computing and the design of scientific software. Tamara has received a 2003 Presidential Early Career Award for Scientists and Engineers (PECASE). She currently serves as Section Editor for the Software and High Performance Computing Section of SISC and Associate Editor for SIMAX.

-- Tamara G. Kolda, [email protected] Sandia National Labs, Livermore, CA 94551-9159 phone: 925-294-4769, fax: 925-294-2234 http://csmr.ca.sandia.gov/~tgkolda/
Triangles are critical for graph analysis

• Interpreted in many different ways in social sciences.
  – Identifier for bridges between communities.
  – Likelihood to go against norms.
• Applied to spam detection.
• Used to compare graphs.
• Proposed as a guide for community structure.
• Stated as a core feature for graph models [Vivar&Banks11].
  – Cornerstone for Block Two-level Erdos-Renyi (BTER).
• Rich set of algorithmic results.
  – Algorithms, runtime analysis, streaming algorithms, MapReduce, …

Figure: Using graph assays to monitor network traffic.
Figure labels: open wedge; closed wedge (i.e., triangle).

It is not only how many, it is about where they are…

• We need algorithms that can compute the distributions of triangles over a given set of attributes.
  – For social networks, degree-wise clustering coefficients tend to decrease with degree.

BTER: A New Model with Explicit Community Structure

• Preprocessing: Generate communities
  – Determined by desired degree distribution
  – All nodes have (close to) the same degree
  – Size of cluster = min degree + 1
• Phase 1: Generate Erdős-Rényi (ER) graph on each community
  – User must specify connectivity coefficient for each community, ρ_k
  – We use a function of the min degree in the community, d_k
• Phase 2: Generate Chung-Lu (CL) graph on “excess” degree
  – e(i) = d(i) - ρ_k d_k, where vertex i is in community k

Figure panels: Preprocessing: create explicit communities; Phase 1: Erdős-Rényi graphs in each community; Phase 2: CL model on “excess” degree.

Seshadhri, Kolda, & Pinar, Phys. Rev. E, 2012

Hypothesis: Real-world interaction networks consist of a scale-free collection of dense Erdős-Rényi graphs.
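The preprocessing step and the excess-degree formula on this slide can be illustrated with a minimal Python sketch. The function names `bter_communities` and `excess_degrees` are hypothetical, the greedy grouping is a simplification (the published BTER code treats some cases, such as degree-1 vertices, separately), and the connectivity coefficients `rho` are supplied by the caller rather than derived from d_k.

```python
def bter_communities(degrees):
    """Greedily group vertices of similar degree (simplified sketch).

    Vertices are sorted by degree; each community takes (min degree + 1)
    vertices, so the last community may be smaller than intended.
    """
    order = sorted(range(len(degrees)), key=lambda v: degrees[v])
    communities = []
    i = 0
    while i < len(order):
        d_min = degrees[order[i]]      # smallest degree among unassigned vertices
        size = d_min + 1               # size of cluster = min degree + 1
        communities.append(order[i:i + size])
        i += size
    return communities


def excess_degrees(degrees, communities, rho):
    """Excess degree e(i) = d(i) - rho_k * d_k for vertex i in community k."""
    excess = {}
    for k, members in enumerate(communities):
        d_k = min(degrees[v] for v in members)   # min degree in community k
        for v in members:
            excess[v] = degrees[v] - rho[k] * d_k
    return excess


degrees = [2, 2, 2, 3, 3, 3, 3, 5]
communities = bter_communities(degrees)
excess = excess_degrees(degrees, communities, rho=[1.0, 0.5, 0.0])
```

The excess degrees would then feed the Phase 2 generator; that step is not shown here.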

BTER can match properties of real world graphs

• The code is available at http://www.sandia.gov/~tgkolda/bter_supplement/
• Hadoop and MPI implementations will be available soon.

It is not only how many and where they are, it is about what they comprise…

• Tell me about your friends, I will tell you who you are.
• We need algorithms that can reveal the structure of the triangles.
  – For social networks, vertices of a triangle are close in degree, but high-degree nodes are dominant in triangles of infrastructure networks.

Figure panels: amazon0312, ca-AstroPh, Soc_Epinions, cit-HepPh, as-caida20071105, web-Stanford, wiki-Talk, Oregon1_010331.

Durak, Pinar, Kolda, Seshadhri, 2012

Enumerating triangles

• Core idea: check whether each wedge is closed.
  – For each vertex v in the graph
    • For every pair of neighbors u, w of vertex v,
      – If there is an edge between u and w,
        » report the triangle.
• Runs in cubic time.
• Redundant work: each triangle is reported 3 times.

Figure: Example with 13 wedges and 1 triangle
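The wedge-checking loop on this slide can be sketched in Python as follows. This is a minimal illustration, assuming the graph is stored as a dictionary `adj` mapping each vertex to the set of its neighbors (undirected, no self-loops); the function name is hypothetical.

```python
from itertools import combinations


def enumerate_triangles_naive(adj):
    """Check every wedge; each triangle is reported once per center vertex."""
    wedges = 0
    reports = []
    for v, nbrs in adj.items():
        # every unordered pair of neighbors (u, w) forms a wedge centered at v
        for u, w in combinations(sorted(nbrs), 2):
            wedges += 1
            if w in adj[u]:            # the wedge is closed
                reports.append((u, v, w))
    return wedges, reports


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
wedges, reports = enumerate_triangles_naive(adj)
distinct = {frozenset(t) for t in reports}
```

On this small graph the loop visits 5 wedges and reports the single triangle 3 times, once from each of its vertices.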

Clever Enumeration

• By imposing an ordering on the vertices (e.g., order by degree), we can check only one wedge per triangle (the one centered on the vertex with min. degree).
• This can be achieved by assigning each edge to its vertex with lower degree.
• Discovered and rediscovered starting in 1985.

Figure: Total wedges: 24
Wedges that need to be checked: 4
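A minimal Python sketch of the degree-ordered variant, under the same assumed `adj` representation (vertex to neighbor set). Breaking degree ties by vertex id is an assumption made here so that the ordering is total; the function name is hypothetical.

```python
from itertools import combinations


def count_triangles_ordered(adj):
    """Assign each edge to its lower-ranked endpoint, then check wedges there."""
    def rank(v):
        return (len(adj[v]), v)        # order by degree, ties broken by vertex id

    # each vertex keeps only the edges that lead to higher-ranked neighbors
    higher = {v: [u for u in adj[v] if rank(u) > rank(v)] for v in adj}

    checked = 0
    triangles = 0
    for v, nbrs in higher.items():
        for u, w in combinations(nbrs, 2):
            checked += 1
            if w in adj[u]:            # closed wedge centered at the lowest-ranked vertex
                triangles += 1
    return checked, triangles


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
checked, triangles = count_triangles_ordered(adj)
```

Every triangle is found exactly once, at its lowest-ranked vertex, so no division by 3 is needed.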

Naïve vs. Clever enumeration

Figure: normalized wedge counts, naïve vs. clever enumeration (axis ticks 0 to 300).

• In practice, the clever approach is very effective in reducing the number of wedges that are checked.
• Recent work showed that the clever algorithm runs in linear time for graphs generated with the edge configuration model with a power-law degree distribution with exponent > 7/3. [Berry et al, SAND2010-4474C]

Triangle counting is amenable to sampling

• Clustering coefficient (CC) can be considered as the success rate of an experiment with a binary outcome.
  – Each wedge is an experiment, which succeeds if it is closed, and fails otherwise.
• This is an excellent setup for a sampling algorithm, because:
  – Many graphs of interest have a very large number of wedges.
    • Large enough space to benefit from sampling.
  – In many graphs of interest, a nontrivial fraction of the wedges are closed.
    • We are not looking for a needle in a haystack.

Wedge-sampling

Clustering coefficients can be considered as the success rate of experiments with binary outcomes.
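The sampling experiment described on this slide can be sketched in Python as below. This is an illustrative sketch, not the authors' code: it assumes an `adj` dictionary (vertex to neighbor set) with at least one wedge, draws the wedge center with probability proportional to its number of wedges so that wedges are uniform, and the function name and `seed` parameter are hypothetical. The triangle estimate uses the fact that every undirected triangle contains 3 closed wedges.

```python
import random


def wedge_sampling_cc(adj, k, seed=0):
    """Estimate the clustering coefficient from k uniformly sampled wedges."""
    rng = random.Random(seed)
    vertices = [v for v in adj if len(adj[v]) >= 2]
    # a vertex of degree d is the center of d*(d-1)/2 wedges
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in vertices]
    total_wedges = sum(weights)
    nbr_lists = {v: list(adj[v]) for v in vertices}

    closed = 0
    for v in rng.choices(vertices, weights=weights, k=k):
        u, w = rng.sample(nbr_lists[v], 2)   # two distinct neighbors of v
        if w in adj[u]:                      # experiment succeeds: wedge is closed
            closed += 1

    cc = closed / k                          # success rate
    triangles = cc * total_wedges / 3        # each triangle holds 3 closed wedges
    return cc, triangles


# Complete graph on 5 vertices: every wedge is closed.
k5 = {v: {u for u in range(5) if u != v} for v in range(5)}
cc, triangles = wedge_sampling_cc(k5, k=1000)
```

For the complete graph every sample succeeds, so the estimates are exact; on real graphs the result varies with the random seed.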

Wedge-sampling provides provably accurate estimations

• Theorem: For error = ε and confidence = 1-δ, the number of samples required is $k = \lceil 0.5\,\varepsilon^{-2} \ln(2/\delta) \rceil$.
• For 99.9% confidence and 1% error, we need only k = 38,005 samples.

The number of samples is independent of the graph size.
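The sample bound can be checked numerically. The short Python sketch below evaluates k = ⌈0.5 ε⁻² ln(2/δ)⌉; the function name is hypothetical.

```python
import math


def wedge_samples_needed(eps, delta):
    """Number of wedge samples for error eps and confidence 1 - delta."""
    return math.ceil(0.5 * eps ** -2 * math.log(2 / delta))


# 1% error, 99.9% confidence (delta = 0.001)
k = wedge_samples_needed(0.01, 0.001)   # 38005
```

Nothing about the graph enters the formula, which is why the sample count does not grow with graph size.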

Alternative: Doulion

An alternative to wedge sampling is edge-based sampling. [Tsourakakis et al, KDD09]
• Generate a smaller graph by removing each edge with probability 1-p.
• Count the number of triangles in the smaller graph.
• Multiply by $1/p^3$ to predict the number of triangles in the original graph.

Drawback:
• Expected value is correct, but the variance may be huge.
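A minimal Python sketch of edge-based sampling in the style of Doulion. It keeps each edge independently with probability p, counts triangles exactly in the sparsified graph, and rescales by 1/p³, since a triangle survives only if all three of its edges are kept. The function name, the edge-list input format, and the `seed` parameter are assumptions made for illustration.

```python
import random
from itertools import combinations


def doulion_estimate(edges, p, seed=0):
    """Estimate the triangle count from a randomly sparsified edge list."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]   # keep each edge with probability p

    adj = {}
    for u, v in kept:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    closed_wedges = 0
    for v, nbrs in adj.items():
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                closed_wedges += 1
    sampled_triangles = closed_wedges // 3          # each triangle seen from 3 centers

    return sampled_triangles / p ** 3               # a triangle survives with probability p^3


# Complete graph on 4 vertices has 4 triangles; p = 1 keeps every edge.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
estimate = doulion_estimate(k4_edges, p=1.0)
```

With p < 1 the estimate is unbiased but can swing widely between runs, which is the variance drawback noted on the slide.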

Wedge-sampling offers accurate estimations

Figure: relative error (axis ticks 0 to 0.3) for Wedge-sampling-13K, Doulion-10, and Doulion-25.

…with big savings in runtime

Figure: runtimes (axis ticks 0 to 0.5) for Enumeration, Wedge-sampling, Doulion-10, and Doulion-25. Times normalized with respect to the IO time.

Counting Directed Triangles

• We have
  – three edge types: in, out, bi-directional,
  – six wedge types,
  – seven triangle types.
• Sampling works as is for clustering coefficients.
• Estimating the number of triangles needs adjustments.

Counting Directed Triangles

Table: wedge types i-vi (columns) contained in each triangle type a-g (rows). Nonzero entries per row (column positions not recoverable):
a: 1, 1, 1
b: 3
c: 1, 2
d: 1, 1, 1
e: 1, 2
f: 1, 1, 1
g: 3

• Multiple occurrences of the same wedge type cause the same triangle to be counted multiple times.
• Algorithm
  – Pick a wedge-type for the triangle type
  – Compute the success rate
  – #triangles = success rate * |w| / wedge multiplicity
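The final scaling step of this algorithm can be written as a one-line Python helper. This is a sketch under an assumption: |w| is read here as the total number of wedges of the chosen type in the graph, which the slide does not state explicitly. The function and parameter names are hypothetical.

```python
def estimate_directed_triangles(success_rate, num_wedges, multiplicity):
    """#triangles = success rate * |w| / wedge multiplicity.

    num_wedges   -- assumed: count of wedges of the chosen wedge type
    multiplicity -- how many wedges of that type one triangle of the
                    target type contains (an entry of the table above)
    """
    return success_rate * num_wedges / multiplicity


# Illustrative numbers only: 25% of sampled wedges closed into the target
# triangle type, 1200 wedges of the chosen type, multiplicity 3.
estimate = estimate_directed_triangles(0.25, 1200, 3)
```

The undirected case is the special case with a single wedge type and multiplicity 3.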

Estimating triangles per degree

• Similar principles apply to counting triangles per degree.
• But we need to adjust the counts based on the number of vertices with the same degree in the sampled wedge.

Figure panels: ca-CondMat, cit-HepPh, soc-Epinions1.

Concluding Remarks

Figure: frequency of triangles, wedges (not in triangles), edges (not in wedges or triangles), and isolates.

• Triangles can reveal a lot of information about a graph.
• Wedge-sampling provides provably good estimations with big runtime savings.
  – The number of samples is independent of the graph size.
  – Directed triangles can be counted the same way as undirected ones.
  – Distribution of triangles with a given property can be estimated with the same algorithm.
• Current work:
  – A MapReduce implementation is on the way.
  – Enhancing the algorithm for streaming graphs.
  – Sampling for larger patterns efficiently is being investigated.
• Goal:
  – Build graph assays

Related Publications

• Modeling
  – C. Seshadhri, T. Kolda, and A. Pinar, “The Blocked Two-Level Erdos Renyi Graph Model,” Physical Review E.
  – C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth analysis of Stochastic Kronecker Graphs,” submitted.
  – A. Pinar, C. Seshadhri, and T. Kolda, “The Similarity of Stochastic Kronecker Graphs to Edge-Configuration Models,” SDM’12.
  – C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth study of Stochastic Kronecker Graphs,” ICDM’12.
• Generating a random graph
  – J. Ray, A. Pinar, and C. Seshadhri, “Are we there yet? When to stop a Markov chain while generating random graphs,” WAW 12.
  – I. Stanton and A. Pinar, “Constructing and uniform sampling graphs with prescribed joint degree distribution using Markov Chains,” to appear in ACM JEA.
  – I. Stanton and A. Pinar, “Sampling graphs with prescribed joint degree distribution using Markov Chains,” ALENEX’11.
• Community structure and triangles
  – C. Seshadhri, A. Pinar, and T. Kolda, “Fast Triangle Counting through Wedge Sampling,” submitted.
  – M. Rocklin and A. Pinar, “On Clustering on Graphs with Multiple Edge Types,” Internet Mathematics.
  – M. Rocklin and A. Pinar, “Latent Clustering on Graphs with Multiple Edge Types,” Proc. 8th Workshop on Algorithms and Models for the Web Graph, WAW’11.
  – M. Rocklin and A. Pinar, “Computing an Aggregate Edge-weight function for Clustering Graphs with Multiple Edge Types,” WAW’10.