Network Characterization via Random Walks
B. Ribeiro, D. Towsley, UMass-Amherst
Problem
Given a large, possibly dynamic network, how does one efficiently sample/crawl it to accurately characterize it?
• degree distribution
• centrality
• clustering
• …
Motivation
• understanding technological networks, social networks
  ◦ Internet, wireless networks
  ◦ online social networks such as Facebook, MySpace, Orkut, YouTube, …
• when a network dataset is not available
  ◦ size, lack of global view, dynamics
Outline
review of sampling
random walks (RWs)
multiple coupled RWs
results
Sampling methods
• random sampling
  ◦ uniform vertex sampling
    • θi: fraction of vertices with degree i
    • degree i vertex sampled with probability θi
  ◦ uniform edge sampling
    • πi: probability that a degree i vertex is sampled
    • πi = θi × i / <average degree>
• crawling
  ◦ snowball sampling: commonly used, highly biased
  ◦ random walk
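The relation between the two sampling distributions can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy degree sequence; the function names `degree_fractions` and `edge_sampling_probs` are illustrative and do not come from the talk.

```python
from collections import Counter

def degree_fractions(degrees):
    """theta[i]: fraction of vertices with degree i."""
    n = len(degrees)
    return {d: c / n for d, c in Counter(degrees).items()}

def edge_sampling_probs(theta):
    """pi[i] = theta[i] * i / <average degree>."""
    avg_degree = sum(i * t for i, t in theta.items())
    return {i: t * i / avg_degree for i, t in theta.items()}

# Hypothetical toy degree sequence: a star with one hub and four leaves.
degrees = [4, 1, 1, 1, 1]
theta = degree_fractions(degrees)   # what uniform vertex sampling sees
pi = edge_sampling_probs(theta)     # what uniform edge sampling sees
```

In this toy case the single hub makes up 20% of the vertices but accounts for half of the edge endpoints, which is why edge sampling reaches high-degree vertices more easily.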
Random sampling: accuracy of estimates
• estimate θi, the fraction of vertices with degree i
• budget: B samples
• accuracy: normalized root mean squared error (NMSE)
• uniform vertex
  ◦ head: GOOD
  ◦ tail: BAD
• uniform edge
  ◦ head: BAD
  ◦ tail: GOOD
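The head/tail behavior of uniform vertex sampling can be reproduced on a toy population. This sketch assumes the common definition NMSE = sqrt(E[(estimate − θi)²]) / θi, which the slide does not spell out; the population, budget, and number of runs are hypothetical.

```python
import math
import random

def nmse(estimates, truth):
    """sqrt(mean((estimate - truth)^2)) / truth over independent runs."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth

def uniform_vertex_estimate(degrees, i, budget, rng):
    """Estimate theta_i from `budget` uniform vertex samples (with replacement)."""
    hits = sum(1 for _ in range(budget) if rng.choice(degrees) == i)
    return hits / budget

rng = random.Random(0)
# Hypothetical population: 900 vertices of degree 1 (head), 100 of degree 50 (tail).
degrees = [1] * 900 + [50] * 100
runs = 200
head = nmse([uniform_vertex_estimate(degrees, 1, 100, rng) for _ in range(runs)], 0.9)
tail = nmse([uniform_vertex_estimate(degrees, 50, 100, rng) for _ in range(runs)], 0.1)
```

With the same budget, the relative error for the rare high-degree class (`tail`) comes out larger than for the common low-degree class (`head`).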
Uniform vertex vs. edge sampling
[Figure: NMSE vs. in-degree for uniform edge and uniform vertex sampling]
• vertex: head GOOD, tail BAD
• edge: head BAD, tail GOOD
• Flickr graph (1.7M vertices, 22M edges)
• budget: B = |V|/100
Pros & Cons
uniform vertex
Pros:
◦ independent sampling
◦ OSN needs numeric user IDs, e.g., Livejournal, Flickr, MySpace, Facebook, ...
Cons:
◦ resource intensive (sparse user ID space)
◦ difficult to sample large degree vertices
uniform edge
Pros:
◦ independent sampling
◦ easy to sample high degree vertices
Cons:
◦ no public OSN interface to sample edges
Random walk (RW)
• start at node v
• randomly select a neighbor of v and move to it
• repeat until B samples are collected
• sampling with replacement
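The walk described above can be sketched in a few lines. This is a minimal illustration, assuming the graph is held in memory as an adjacency dictionary; the name `random_walk_sample` and the toy graph are hypothetical.

```python
import random

def random_walk_sample(adj, start, budget, rng=random):
    """Return `budget` vertices visited by a simple random walk.

    adj: dict mapping each vertex to a list of its neighbors
    (undirected graph, every vertex has at least one neighbor).
    """
    samples = []
    v = start
    for _ in range(budget):
        v = rng.choice(adj[v])   # step to a uniformly chosen neighbor
        samples.append(v)        # vertices may repeat: sampling with replacement
    return samples

# Hypothetical toy graph.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walk = random_walk_sample(adj, "a", 10, random.Random(1))
```

When crawling a real network, `adj[v]` would be replaced by a query for the neighbors of v, so each step costs one lookup.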
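RW sampling
• random walk sampling produces a biased estimate $\hat{\theta}_i^{RW}$ of $\theta_i$
• easily corrected:
  $\theta_i^{RW} = \theta_i \, i / \langle\text{avg. degree}\rangle$
  $\hat{\theta}_i = \mathrm{Norm} \cdot \hat{\theta}_i^{RW} / i$
[Figure: CCDF]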
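The correction amounts to dividing each observed frequency by its degree and renormalizing. The sketch below assumes that "Norm" denotes the constant that makes the corrected estimates sum to 1; the function name and the sample data are hypothetical.

```python
from collections import Counter

def rw_degree_estimate(sampled_degrees):
    """Correct the degree bias of random-walk samples.

    sampled_degrees: degrees of the vertices visited by the walk.
    Returns theta_hat[i], an estimate of the fraction of vertices with degree i.
    """
    n = len(sampled_degrees)
    # Biased estimate: a degree-i vertex is visited in proportion to i.
    theta_rw = {i: c / n for i, c in Counter(sampled_degrees).items()}
    # Undo the bias by dividing by the degree.
    unnormalized = {i: t / i for i, t in theta_rw.items()}
    # Assumed meaning of "Norm": rescale so the estimates sum to 1.
    norm = 1.0 / sum(unnormalized.values())
    return {i: norm * u for i, u in unnormalized.items()}

# Hypothetical walk output: half of the visits hit degree-4 vertices,
# half hit degree-1 vertices.
theta_hat = rw_degree_estimate([4, 1, 4, 1, 1, 4, 4, 1])
```

For this input the corrected estimate assigns 0.2 to degree 4 and 0.8 to degree 1, even though both degrees were visited equally often.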
Pros & Cons
uniform vertex
Pros:
◦ independent sampling
◦ OSN needs numeric user IDs, e.g., Livejournal, Flickr, MySpace, Facebook, ...
Cons:
◦ resource intensive (sparse user ID space)
◦ difficult to sample large degree vertices
random walk
Pros:
◦ asymptotically unbiased
◦ easy to sample high degree vertices
◦ low cost resource-wise
Cons:
◦ graph must be connected
◦ large estimation errors when graph is loosely connected
◦ length of transient?
Hybrid sampling
• uniform vertex samples both the A and C subgraphs but is expensive
• RW samples A or C but is cheap
[Figure: graph with subgraphs A and C]
Combine advantages of uniform vertex & RWs?
Multiple random walks
• m independent, uniformly placed RWs
• split budget B among them
Pros
◦ cover all components with high probability (whp) as m increases
Cons
◦ bias due to transient
◦ difficult to combine estimates
Couple the RWs?
Frontier Sampling (FS)
• m coupled walkers
• B: sampling budget
• S = {v1, …, vm}: initial set of m vertices; E’ = ∅
(1) start from vr ∈ S, chosen with probability proportional to deg(vr)
(2) walk one step from vr
(3) add walked edge to E’ and update vr
(4) return to (1) (until m + |E’| = B)
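Steps (1) to (4) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an in-memory adjacency dictionary, draws the initial vertices uniformly with replacement, and reads "w.p. deg(vr)" as "with probability proportional to deg(vr)". The function name and toy graph are hypothetical.

```python
import random

def frontier_sampling(adj, m, budget, rng=random):
    """Sketch of Frontier Sampling on an undirected graph.

    adj: dict vertex -> list of neighbors (no isolated vertices).
    Returns the list of walked edges E'.
    """
    vertices = list(adj)
    walkers = [rng.choice(vertices) for _ in range(m)]  # S: assumed uniform start
    edges = []                                          # E' starts empty
    while m + len(edges) < budget:
        # (1) pick walker r with probability proportional to deg(v_r)
        weights = [len(adj[v]) for v in walkers]
        r = rng.choices(range(m), weights=weights)[0]
        # (2) walk one step from v_r
        u = walkers[r]
        v = rng.choice(adj[u])
        # (3) record the walked edge and move the walker
        edges.append((u, v))
        walkers[r] = v
        # (4) loop until m + |E'| reaches the budget
    return edges

# Hypothetical toy graph: two disconnected triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
edges = frontier_sampling(adj, m=3, budget=20, rng=random.Random(0))
```

The m initial vertices count against the budget, so the loop collects B − m edges; here that is 17.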
FS properties
• random walk on G^m
• at steady state, samples edges uniformly
• as m → ∞, walkers are uniformly distributed in the graph
• m coupled RWs start approximately in steady state
• short transient
Sample paths for θ1 estimate (Flickr graph)
[Figure: evolution of $\hat{\theta}_1(n)$, where n is the number of steps]
Sampling errors
• large connected component of Flickr graph
• accuracy metric: NMSE of CCDF
[Figure: NMSE vs. in-degree]
Sampling errors: GAB graph
• two Albert-Barabási graphs with average degrees 2 and 10, connected by one edge
[Figure: NMSE vs. in-degree]
Distributed FS
• m independent walkers
• walker i takes its next step after an exponentially distributed time with rate equal to the degree of its current node (mean 1/degree)
• walkers run for time T, then report to a central site
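A single-process simulation gives a feel for the distributed variant. This sketch assumes the waiting time has rate equal to the current vertex degree (mean 1/degree), so that walkers on high-degree vertices step more often and no central coordinator has to pick the next walker. The event-queue design, the function name, and the toy graph are hypothetical.

```python
import heapq
import random

def distributed_fs(adj, m, horizon, rng=random):
    """Simulate m independent walkers for `horizon` time units.

    Each walker waits an exponential time with rate deg(current vertex)
    before stepping. Returns the walked edges in order of event time.
    """
    vertices = list(adj)
    events = []  # min-heap of (next step time, walker id, current vertex)
    for w in range(m):
        v = rng.choice(vertices)
        heapq.heappush(events, (rng.expovariate(len(adj[v])), w, v))
    edges = []
    while events:
        t, w, u = heapq.heappop(events)
        if t > horizon:
            continue  # this walker has run out of time
        v = rng.choice(adj[u])
        edges.append((u, v))  # in a real deployment, reported to the central site
        heapq.heappush(events, (t + rng.expovariate(len(adj[v])), w, v))
    return edges

# Hypothetical toy graph: two disconnected triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
edges = distributed_fs(adj, m=3, horizon=5.0, rng=random.Random(0))
```

In an actual deployment each walker would run on its own machine with its own clock; the heap here only merges their timelines for simulation.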
Future work
• analyzing, speeding up convergence
• other forms of coupling
• other graph statistics
• study how graph structure affects sampling efficiency
  ◦ power law vs. exponential tail
  ◦ spatial correlation, independence vs. SRD vs. LRD
• application to different networks
  ◦ wireless, social, wireless/social