The Block Two-Level Erdős-Rényi (BTER) Graph Model
Ali Pinar, C. Seshadhri, and Tamara G. Kolda
Sandia National Laboratories
July 13, 2012 Pinar - MMDS12 1
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration
under contract DE-AC04-94AL85000.
U.S. Department of Energy, Office of Advanced Scientific Computing Research
U.S. Department of Defense, Defense Advanced Research Projects Agency
Why model massive graphs?
• Enable sharing of surrogate data
– Computer network traffic
– Social networks
– Financial transactions
• Testing graph algorithms
– Scalability
– Versatility (e.g., vary degree distributions)
– Characterizing algorithm performance
– Verification & validation
• Insight into…
– Generative process
– Community structure
– Comparison
– Evolution
– Uncertainty
[Figure: Block Two-Level Erdős-Rényi (BTER) graph; image courtesy of Nurcan Durak.]
Model Desiderata
• Captures heavy-tailed degree distribution
– Not necessarily power law
• Captures community structure
– Measured indirectly through clustering coefficient and other measures
• Able to “fit” real-world data
– Reproduce degree distribution
– Reproduce community structure
• Scales up to ~$2^{40}$ nodes
– Motivated by the GRAPH500 benchmark (http://www.graph500.org/)
[Figure panels: Actor Collaboration, WWW, Power Grid]
A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113, 2004.
Preserving Degree Distribution
• Most models match degree distributions poorly
– Tend to focus on matching the “spirit” of the distribution
– May not even have the capability of matching some distributions
• Exception is the Chung-Lu (CL) model [Chung & Lu, 2002]
– Also known as configuration model [Newman, 2003] or weighted Erdős-Rényi (ER) model
• Premise: Probability of an edge is proportional to the degrees of its endpoints
– $\Pr(e_{ij}) = d_i d_j / (2m)$, where $d_i$ is the desired degree of node $i$ and $m$ is the number of edges
• Can be implemented in a scalable way by repeated edge insertions.
The CL model is nearly perfect at matching desired degree distributions.
[Figure: Degree Distribution, log-log plot of Count vs. Degree, for cit-HepPh and CL.]
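The "repeated edge insertions" idea for CL can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `chung_lu_edges` and the choice to drop self-loops and duplicate edges are assumptions made here. Each of the m insertions draws both endpoints with probability d_i/(2m), so the expected number of times the pair (i, j) is drawn is d_i d_j/(2m).

```python
import random

def chung_lu_edges(degrees, seed=0):
    """Sample a Chung-Lu style edge set by m independent edge insertions.

    degrees: desired degree of each node (hypothetical input format).
    Self-loops and repeated edges are discarded (an assumption of this sketch),
    so realized degrees can fall slightly below the desired ones.
    """
    rng = random.Random(seed)
    m = sum(degrees) // 2
    # Draw all 2m endpoints at once, each node i with weight d_i.
    endpoints = rng.choices(range(len(degrees)), weights=degrees, k=2 * m)
    edges = set()
    for i, j in zip(endpoints[0::2], endpoints[1::2]):
        if i != j:                                  # drop self-loops
            edges.add((min(i, j), max(i, j)))       # set drops duplicates
    return edges

if __name__ == "__main__":
    print(sorted(chung_lu_edges([5, 3, 3, 2, 2, 1])))
```

Because every insertion is independent of the others, the loop can be split across workers without coordination, which is what makes this style of generator attractive at scale.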
Community Structure in Graphs
• Numerous community finding algorithms exist
– Difficult to validate
– Trouble in finding full range of sizes
• Instead, use related measures like clustering coefficient
– Triangles arise because of community structure
• What “community” structure must be present to ensure a high clustering coefficient, especially for low-degree nodes?
– How many communities?
– What do they look like?
M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113, 2004.
Clustering Coefficients
$t_i$ = number of triangles at vertex $i$
$d_i$ = degree of vertex $i$
Clustering Coefficient: $C_i = t_i / \binom{d_i}{2}$
Global Clustering Coeff.: $C = \sum_i t_i / \sum_i \binom{d_i}{2}$
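A small sketch of how these quantities can be computed, assuming the standard definitions: the per-vertex coefficient is t_i divided by the number of wedges centered at i, which is d_i(d_i - 1)/2, and the global coefficient is the sum of t_i over the sum of wedges. The function name and the adjacency-dictionary input format are choices made for this illustration.

```python
from itertools import combinations

def clustering_coefficients(adj):
    """Return (per-vertex coefficients, global coefficient).

    adj: dict mapping each vertex to the set of its neighbours
         (undirected simple graph; hypothetical input format).
    """
    triangles, wedges = {}, {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        wedges[v] = d * (d - 1) // 2
        # A wedge a-v-b is closed when a and b are themselves adjacent.
        triangles[v] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    local = {v: (triangles[v] / wedges[v] if wedges[v] else 0.0) for v in adj}
    total_wedges = sum(wedges.values())
    global_cc = sum(triangles.values()) / total_wedges if total_wedges else 0.0
    return local, global_cc

if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(clustering_coefficients(adj))
```

Vertices with degree below 2 have no wedges; this sketch assigns them a coefficient of 0.0, which is one common convention.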
Clustering coefficients are higher for lower degree vertices
Model building approach
• Find features that restrict the space and identify the structure imposed by these features
• Given
– a skewed degree distribution
– high clustering coefficients
• What structure should we expect?
• Let’s take it to an extreme, i.e., clustering coefficients ~1.
– Implication 1: The graph must be a union of cliques.
– Implication 2: The sizes of the cliques (communities) will have the same distribution as the degree distribution.
• The challenge is in characterizing the gray area: clustering coefficients are high but not 1
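A quick check of the extreme case, assuming the standard per-vertex clustering coefficient (triangles at a vertex divided by wedges centered at it). The symbols $k$ (clique size) and $n_d$ (number of vertices of degree $d$) are introduced here for illustration only.

```latex
% Every vertex of a clique on k vertices has degree k-1, and every pair of
% its neighbours is adjacent, so all of its wedges are closed:
\[
d_i = k - 1, \qquad
t_i = \binom{k-1}{2}, \qquad
C_i = \frac{t_i}{\binom{d_i}{2}} = 1 .
\]
% Conversely, in a disjoint union of cliques a vertex of degree d lies in a
% clique of size d+1, so the number of cliques of size d+1 is
\[
\#\{\text{cliques of size } d+1\} = \frac{n_d}{d+1} .
\]
```

The second line is the sense in which clique sizes inherit the degree distribution: a heavy tail in $n_d$ carries over to the clique sizes.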
Building the basis for a model
We are not only trying to build a formal model, we are trying to formalize the model building process itself.
Empirical Observations
• Degree distributions are heavy tailed.
• Clustering coefficients are highest for small degree vertices.
Theoretical Analysis
Theorem: If a community has s edges, then there must be Ω(√s) vertices with degree Ω(√s).
Hypothesis: Real-world interaction networks consist of a scale-free collection of relatively dense Erdős-Rényi graphs.
Seshadhri, Kolda, & Pinar, Phys. Rev. E, 2012
Verifying the model
Hypothesis: Real-world interaction networks consist of a scale-free collection of dense Erdős-Rényi graphs.
Prediction: Within a community, the degree variance should be small.
Verification: Vertex degree is correlated with the average degree of its neighbors, and the degrees of triangle vertices are close.
Prediction: The number of communities grows with the number of nodes.
Verification: The sizes of the communities are small, and do not grow (or grow very slowly) for larger graphs (e.g., Dunbar number).
Matching the Clustering Coefficients
• Random pairings cannot close wedges!
– Models that achieve high clustering coefficients require history to guide future edge insertions
• To get a high clustering coefficient without history, we have to pre-group the nodes into affinity blocks
– Our theory: a CL graph with a high clustering coefficient must have a dense ER block at its core
– Each affinity block is a near-clique
• Simplifying assumption: Non-overlapping blocks
– Implies blocks must comprise nodes of similar degree
– Can tune the clustering coefficient to be appropriate for the degree by changing the connectivity of the block
If nodes of different degrees are in the same block, then the high-degree nodes will have low clustering coefficients.
Preprocessing: Determining Blocks
• Input: Desired degree distribution
– Assume nodes sorted by degree, least to greatest
– Ignore degree-1 vertices in this phase
• Algorithm:
– Group nodes in order
– Bulk assign, if possible
• Output: Compressed information about blocks
– Size of each block
– Starting index of each block
– “Weight” of each block (depends on connectivity)
– $O(d_{\text{uniq}})$ information, if compressed
• Observe: If the degree distribution is power law, then so is the distribution of the block sizes
– Evocative of Dunbar’s number: Max block size = 100 for γ=2
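The grouping step can be sketched as below. The slides do not state the block-size rule, so this sketch assumes that a block whose lowest-degree node has degree d holds d + 1 nodes (enough for that node to reach its degree inside a clique). The function name and output format are hypothetical, and the "bulk assign" optimization is not implemented: blocks are created one at a time.

```python
def determine_blocks(degrees):
    """Group nodes into affinity blocks by walking the sorted degree list.

    degrees: desired degree of each node (hypothetical input format).
    Returns (order, blocks): the sorted degrees of the nodes that take part
    (degree > 1), and a list of (start_index, size) pairs into that order.
    Assumption of this sketch: block size = lowest degree in block + 1.
    """
    order = sorted(d for d in degrees if d > 1)   # ignore degree-1 vertices
    blocks = []
    start = 0
    while start < len(order):
        # The last block may be truncated if too few nodes remain.
        size = min(order[start] + 1, len(order) - start)
        blocks.append((start, size))
        start += size
    return order, blocks

if __name__ == "__main__":
    order, blocks = determine_blocks([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 5])
    for start, size in blocks:
        print(order[start:start + size])
```

In this example the first block contains only degree-2 nodes, while the later blocks mix degrees, which is the homogeneous versus heterogeneous distinction.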
[Figure: nodes labeled by degree (2s and 3s) grouped into homogeneous and heterogeneous blocks.]
BTER: Block Two-Level Erdős-Rényi
Preprocessing: Create explicit affinity blocks of nodes with the same (or nearly the same) degree. The blocks are determined by the degree distribution.
Phase 1: Erdős-Rényi graphs in each block. User-specified connectivity determines the number of links within each block. A formula is given that depends on the lowest-degree node in the block.
Phase 2: CL (aka weighted ER) model on “excess” degree creates connections across blocks. Can run in tandem with Phase 1 by working with expected excess degree.
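One way to read "expected excess degree" is sketched below. In an ER graph on n nodes with connection probability ρ, each node has expected degree ρ(n - 1), so the degree left over for Phase 2 is the desired degree minus that amount. The connectivity values are taken as plain inputs here because the slides do not give the formula; the function name, the input format, and the clamping at zero are assumptions of this sketch.

```python
def expected_excess_degrees(order, blocks, rho):
    """Expected degree left for Phase 2 after Phase 1 fills each block.

    order:  desired degrees of the nodes, sorted least to greatest.
    blocks: list of (start_index, size) pairs into `order`.
    rho:    one ER connectivity value per block (user-specified input).
    """
    excess = []
    for (start, size), p in zip(blocks, rho):
        within = p * (size - 1)          # expected ER degree inside the block
        for d in order[start:start + size]:
            excess.append(max(d - within, 0.0))   # clamp: no negative excess
    return excess

if __name__ == "__main__":
    order = [2, 2, 2, 3, 3, 3, 3]
    blocks = [(0, 3), (3, 4)]
    print(expected_excess_degrees(order, blocks, rho=[1.0, 0.5]))
```

Because the excess is computed from expectations rather than from the realized Phase 1 edges, Phase 2 does not have to wait for Phase 1 to finish. Degree-1 nodes, which are left out of the blocks, would keep their full degree as excess; they are omitted here for brevity.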
Scalable BTER: Independent Edges
Choose phase 1 or 2?
• Phase 1: Choose block proportional to number of “samples” per block
– Create Block 1 edge per ER model with connectivity $\rho_1$: choose 1st endpoint, choose 2nd endpoint
– …
– Create Block K edge per ER model with connectivity $\rho_K$: choose 1st endpoint, choose 2nd endpoint
• Phase 2: Create Phase 2 edge using CL model on “excess degree”: choose 1st endpoint, choose 2nd endpoint
Requires O(n) data to determine the various probabilities, and can be compressed to $O(d_{\text{uniq}})$.
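The flow above can be sketched as a function that generates one edge independently of all others. The slides do not say how the phase or block weights are computed, so this sketch assumes each is proportional to an expected edge count: ρ_k times the number of node pairs in block k for Phase 1, and half the total excess degree for Phase 2. The function name and argument layout are hypothetical, and at least one weight must be positive.

```python
import random

def sample_bter_edge(blocks, rho, excess, rng):
    """Draw one edge; returns (phase, i, j) with node indices i and j.

    blocks: list of (start_index, size) pairs over the sorted node list.
    rho:    ER connectivity per block.
    excess: expected excess degree per node (same indexing as the blocks).
    Weights are assumed proportional to expected edge counts (see lead-in).
    """
    block_w = [p * size * (size - 1) / 2 for (_, size), p in zip(blocks, rho)]
    w1 = sum(block_w)            # assumed Phase 1 weight
    w2 = sum(excess) / 2         # assumed Phase 2 weight
    if rng.random() * (w1 + w2) < w1:
        # Phase 1: pick a block, then two distinct endpoints inside it.
        start, size = rng.choices(blocks, weights=block_w, k=1)[0]
        i, j = rng.sample(range(start, start + size), 2)
        return 1, i, j
    # Phase 2: CL on excess degree; i == j is possible and left to the caller.
    i, j = rng.choices(range(len(excess)), weights=excess, k=2)
    return 2, i, j

if __name__ == "__main__":
    rng = random.Random(0)
    blocks, rho = [(0, 3), (3, 4)], [1.0, 0.5]
    excess = [0.0, 0.0, 0.0, 1.5, 1.5, 1.5, 1.5]
    print([sample_bter_edge(blocks, rho, excess, rng) for _ in range(5)])
```

Since no call depends on previously generated edges, the calls can be distributed freely; duplicate edges and Phase 2 self-loops would need to be filtered afterwards.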
Visualization of BTER Adjacency Matrix
[Figure: BTER adjacency matrix. Red = Phase 1, Blue = Phase 2.]
Visualization of BTER Graph
[Figure: BTER graph. Nodes colored by degree (darker = higher); Phase 1 = Blue, Phase 2 = Green. Image courtesy of Nurcan Durak.]
Co-authorship (ca-AstroPh)
• 18,771 nodes, 396,100 edges; based on arXiv
• BTER Phase 1 edges: 290,268; Phase 2 edges: 112,808
• Global clustering coefficient: Original: 0.32, BTER: 0.31, CL: 0.01
• Normalized size of the largest connected component: Original: 0.95, BTER: 0.86, CL: 1.0
Eigenvalues are not determined by the degree distribution.
Trust Network
• 75,879 vertices, 811,480 edges; based on the Epinions web site; edges represent trust between two users
• BTER Phase 1 edges: 300,162; Phase 2 edges: 515,192
• Global clustering coefficient: Original: 0.07, BTER: 0.07, CL: 0.03
• Normalized size of the largest connected component: Original: 1.0, BTER: 0.96, CL: 0.98
BTER provides good matches to real data, even when the clustering coefficients are not too high.
Concluding Remarks
• Models are a critical challenge in network analysis
• The challenge is not in building a formal model, but in formalizing the modeling process itself
• We propose the Block Two-Level Erdős-Rényi (BTER) model
– New theory says there must be many dense subgraphs for high clustering coefficient
– New BTER model explicitly creates dense communities using ER
– Close match to real data in terms of clustering coefficients and eigenvalues
• The code is available at http://www.sandia.gov/~tgkolda/bter_supplement/.
• Scalable versions (Hadoop and MPI) will be available soon
• For more information:
– Ali Pinar, apinar@sandia.gov
A new workshop
• SIAM Workshop on Network Science
– Proposed; under review
• Dates: July 7-8, 2013
• Place: San Diego, CA
• Co-located with SIAM Annual Meeting
• Contact:
– Ali Pinar (apinar@sandia.gov), Sandia National Labs
– Madhav Marathe (mmarathe@vbi.vt.edu), Virginia Tech
Relevant Publications
• Modeling
– C. Seshadhri, T. Kolda, and A. Pinar, “The Blocked Two-Level Erdos Renyi Graph Model,” Phys. Rev. E 2012.
– C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth analysis of Stochastic Kronecker Graphs,” submitted.
– A. Pinar, C. Seshadhri, and T. Kolda, “The Similarity of Stochastic Kronecker Graphs to Edge-Configuration Models,” SDM’12.
– C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth study of Stochastic Kronecker Graphs,” ICDM’12.
• Generating a random graph
– J. Ray, A. Pinar, and C. Seshadhri, “Are we there yet? When to stop a Markov chain while generating random graphs,” submitted.
– I. Stanton and A. Pinar, “Constructing and uniform sampling graphs with prescribed joint degree distribution using Markov Chains,” to appear in ACM JEA.
– I. Stanton and A. Pinar, “Sampling graphs with prescribed joint degree distribution using Markov Chains,” ALENEX’11.
• Community structure
– C. Seshadhri, A. Pinar, and T. Kolda, “Fast Triangle Counting through Wedge Sampling,” submitted.
– M. Rocklin and A. Pinar, “On Clustering on Graphs with Multiple Edge Types,” Internet Math. 2012.
– M. Rocklin and A. Pinar, “Latent Clustering on Graphs with Multiple Edge Types,” Proc. 8th Workshop on Algorithms and Models for the Web Graph, WAW’11.
– M. Rocklin and A. Pinar, “Computing an Aggregate Edge-weight function for Clustering Graphs with Multiple Edge Types,” WAW’10.