![Page 1: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/1.jpg)
Computer Science
Sampling Biases in IP Topology Measurements
John Byers
with Anukool Lakhina, Mark Crovella and Peng Xie
Department of Computer Science
Boston University
![Page 2: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/2.jpg)
Discovering the Internet topology
Goal: Discover the Internet Router Graph
• Vertices represent routers
• Edges connect routers that are one IP hop apart
Measurement Primitive: traceroute
Reports the IP path from A to B, i.e., how IP paths are overlaid on the router graph.
[Figure: a traceroute path from a source to a destination, with router interfaces labeled 212.12.5.77, 212.12.58.3, 163.55.221.98, 163.55.1.41, 163.55.1.10]
![Page 3: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/3.jpg)
Traceroute studies today
(k,m)-traceroute study
• k sources: Few active sources, strategically located.
• m destinations: Many passive destinations, globally dispersed.
• Union of many traceroute paths.
[Figure: traces from a few sources to many destinations]
![Page 4: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/4.jpg)
Heavy tails in Topology Measurements
A surprising finding: [FFF99]
Let $d$ be a given node degree. Let $f_d$ be the frequency of degree-$d$ vertices in a graph.
Power-law relationship: $f_d \propto d^{c}$
[Figure: Frequency vs. Degree; dataset from [PG98]]
Subsequent measurements show that the degree distribution is heavy-tailed [GT00, BC01, …]: $\Pr[X > x] \propto x^{c}$
[Figure: log(Pr[X>x]) vs. log(Degree)]
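The log(Pr[X>x]) curves used throughout these slides are empirical complementary CDFs of node degree. A minimal sketch of how such a curve could be computed from a list of degrees follows; the function name `degree_ccdf` and the sample degrees are illustrative assumptions, not part of the original study.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Return sorted (x, Pr[X > x]) pairs for a list of node degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n
    ccdf = []
    for x in sorted(counts):
        remaining -= counts[x]          # nodes with degree strictly greater than x
        ccdf.append((x, remaining / n))
    return ccdf

# Hypothetical degree sample, for illustration only.
sample_degrees = [1, 1, 1, 2, 2, 3, 5, 8]
for x, prob in degree_ccdf(sample_degrees):
    print(x, prob)
```

On a log-log plot the final point (probability 0 at the maximum degree) has to be dropped; an approximately straight line over the remaining points is what is usually read as a heavy tail.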
![Page 5: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/5.jpg)
We’re skeptical
We will argue that the evidence for power laws is at best insufficient.
Insufficient does not mean noisy or incomplete (which these datasets certainly are!).
For us, insufficient means that measurements are statistically biased.
We will show that (k,m)-traceroute studies exhibit significant sampling bias.
![Page 6: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/6.jpg)
A thought experiment
Idea: Simulate topology measurements on a random graph.
1. Generate a sparse Erdős-Rényi random graph, G=(V,E). Each edge is present independently with probability p. Assign weights: $w(e) = 1 + \epsilon$, where $\epsilon \in [-1/|V|,\ 1/|V|]$.
2. Pick k unique source nodes, uniformly at random.
3. Pick m unique destination nodes, uniformly at random.
4. Simulate traceroute from k sources to m destinations, i.e., learn shortest paths between k sources and m destinations.
5. Let Ĝ be the union of the shortest paths.
Ask: How does Ĝ compare with G?
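The five steps of the thought experiment can be sketched in a few dozen lines. This is a scaled-down illustration under stated assumptions, not the authors' simulator: the helper names (`er_graph`, `shortest_path_tree`, `measure`) are hypothetical, the perturbation ε is assumed to be drawn uniformly from [−1/|V|, 1/|V|], and the parameters are far smaller than those used on the slides.

```python
import heapq
import random

def er_graph(n, p, rng):
    """Erdős-Rényi graph as {node: {neighbor: weight}}, with w(e) = 1 + eps."""
    adj = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                # Assumption: eps is uniform on [-1/n, 1/n].
                w = 1 + rng.uniform(-1 / n, 1 / n)
                adj[u][v] = w
                adj[v][u] = w
    return adj

def shortest_path_tree(adj, src):
    """Dijkstra from src; returns {node: parent} for every reachable node."""
    dist = {src: 0.0}
    parent = {src: None}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def measure(adj, sources, destinations):
    """Union of source-to-destination shortest paths, as a set of edges."""
    edges = set()
    for s in sources:
        parent = shortest_path_tree(adj, s)
        for t in destinations:
            if t not in parent:
                continue                  # destination unreachable from s
            node = t
            while parent[node] is not None:
                edges.add(frozenset((node, parent[node])))
                node = parent[node]
    return edges

rng = random.Random(0)
n, p, k, m = 1000, 0.01, 3, 100           # scaled-down, illustrative parameters
adj = er_graph(n, p, rng)
sources = rng.sample(range(n), k)
destinations = rng.sample(range(n), m)
measured = measure(adj, sources, destinations)
true_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(len(measured), "of", true_edges, "edges discovered")
```

Comparing the degree distribution induced by `measured` with that of `adj` is then the comparison of Ĝ with G that the slide asks about.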
![Page 7: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/7.jpg)
Are heavy tails a measurement artifact?
Ĝ is a biased sample of G that looks heavy-tailed.
Underlying Graph: N=100000, p=0.00015
Measured Graph: k=3, m=1000
[Figure: log(Pr[X>x]) vs. log(Degree) for the measured graph Ĝ and the underlying random graph G]
![Page 8: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/8.jpg)
Outline
Motivation and Thought Experiments
Understanding Bias on Simulated Topologies: Where and Why
Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
Examining Internet Maps
![Page 9: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/9.jpg)
Understanding Bias
(k,m)-traceroute sampling of graphs is biased
An intuitive explanation: When traces are run from a few sources to many destinations, some portions of the underlying graph are explored more than others.
We now investigate the causes behind bias.
![Page 10: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/10.jpg)
Are nodes sampled unevenly?
• Conjecture: Shortest-path routing favors higher-degree nodes ⇒ nodes sampled unevenly.
• Validation: Examine true degrees of nodes in measured graph, Ĝ.
Expect true degrees of nodes in Ĝ to be higher than degrees of nodes in G, on average.
[Figure: true degrees of nodes in Ĝ compared with degrees of all nodes in G; measured graph: k=5, m=1000]
• Conclusion: Difference between true degrees of Ĝ and degrees of G is insignificant; dismiss conjecture.
![Page 11: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/11.jpg)
Are edges sampled unevenly?
• Conjecture: Edges selected incident to a node in Ĝ are not proportional to its true degree.
• Validation: For each node in Ĝ, plot true degree vs. measured degree.
If unbiased, ratio of true to measured degree should be constant. Points clustered around y=cx line (c<1).
• Conclusion: Edges incident to a node are sampled disproportionately; supports conjecture.
[Figure: Observed Degree vs. True Degree]
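The validation step pairs each measured node's true degree with its observed degree. A minimal sketch of that bookkeeping, assuming the true graph is available as an adjacency map and the measured graph as an edge list (the name `degree_pairs` and the toy graph are hypothetical):

```python
from collections import Counter

def degree_pairs(true_adj, measured_edges):
    """Map each node of the measured graph to (true degree, measured degree)."""
    measured_deg = Counter()
    for u, v in measured_edges:
        measured_deg[u] += 1
        measured_deg[v] += 1
    return {node: (len(true_adj[node]), d) for node, d in measured_deg.items()}

# Hypothetical toy graph: a star centered on node 0 plus the edge (1, 2).
true_adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
measured_edges = [(0, 1), (0, 3)]         # hypothetical traced edges
pairs = degree_pairs(true_adj, measured_edges)
for node, (true_d, meas_d) in sorted(pairs.items()):
    print(node, true_d, meas_d, meas_d / true_d)
```

If sampling were unbiased, the printed ratio would be roughly the same constant c for every node; a wide spread in the ratio is what supports the conjecture.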
![Page 12: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/12.jpg)
Why: Analyzing Bias
• Question: Given some vertex in Ĝ that is h hops from the source, what fraction of its true edges are contained in Ĝ?
• Messages:
• As h increases, number of edges discovered falls off sharply.*
* We can prove exponential fall-off analytically, in a simplified model.
[Figure: Fraction of node edges discovered vs. Distance from source, for 100, 600, and 1000 destinations]
Result of adding more destinations: most new nodes and edges closer to the source.
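The question posed on this slide can be answered empirically in simulation by grouping measured nodes by hop distance and averaging the fraction of their true edges that were seen. The sketch below assumes a single source and measures hop distance inside the measured graph; the function name and the toy graph are illustrative.

```python
from collections import defaultdict, deque

def fraction_discovered_by_hop(true_adj, measured_edges, source):
    """Average (measured degree / true degree) of nodes at each hop distance."""
    meas_adj = defaultdict(set)
    for u, v in measured_edges:
        meas_adj[u].add(v)
        meas_adj[v].add(u)
    # Breadth-first search over the measured graph gives hop counts.
    hops = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in meas_adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    by_hop = defaultdict(list)
    for node, h in hops.items():
        if true_adj[node]:                # skip isolated nodes (true degree 0)
            by_hop[h].append(len(meas_adj[node]) / len(true_adj[node]))
    return {h: sum(fr) / len(fr) for h, fr in sorted(by_hop.items())}

# Hypothetical toy graph and a single trace 0 -> 1 -> 2 -> 4.
true_adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3, 4, 5}, 3: {1, 2}, 4: {2}, 5: {2}}
measured_edges = [(0, 1), (1, 2), (2, 4)]
result = fraction_discovered_by_hop(true_adj, measured_edges, source=0)
print(result)
```

With a single trace the leaf destination trivially shows fraction 1; the fall-off described above only becomes visible when many traces are combined on a larger graph.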
![Page 13: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/13.jpg)
What does this suggest?
Summary:
Edges are sampled unevenly by (k,m)-traceroute methods.
Edges close to the source are sampled more often than edges further away.
Intuitive Picture:
The neighborhood near the sources is well explored, but visibility of edges declines sharply with hop distance from the sources.
[Figure: log(Pr[X>x]) vs. log(Degree) for nodes at Hop1, Hop2, Hop3, and Hop4, with the underlying graph and the measured graph]
![Page 14: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/14.jpg)
Outline
Motivation and Thought Experiments
Understanding Bias in Simulated Topologies: Where and Why
Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
Examining Internet Maps
![Page 15: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/15.jpg)
Inferring Bias
Goal: Given a measured Ĝ, does it appear to be biased?
Why this is difficult: Don’t have underlying graph. Don’t have formal criteria for checking bias.
General Approach: Examine statistical properties as a function of distance from nearest source.
• Unbiased sample ⇒ no change
• Change ⇒ bias
![Page 16: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/16.jpg)
Detecting Bias
Examine Pr[D=d|H=h], the conditional probability that a node has degree d, given that it is at distance h from the source.
Two observations:
1. Highest-degree nodes are near the source.
2. Degree distribution of nodes near the source is different from that of nodes far away.
[Figure: log(Pr[X>x]) vs. log(Degree) for Ĝ degrees | H=2, Ĝ degrees | H=3, and the underlying graph]
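Pr[D=d|H=h] can be estimated directly from a measured edge list: compute each node's distance to its nearest source, then normalize the degree counts within each distance class. A minimal sketch, with hypothetical names and a toy edge list:

```python
from collections import Counter, defaultdict, deque

def degree_given_distance(edges, sources):
    """Estimate Pr[D=d | H=h] as {h: {d: probability}} from a measured graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Multi-source BFS: distance from each node to its nearest source.
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = defaultdict(Counter)
    for node, h in dist.items():
        counts[h][len(adj[node])] += 1
    return {
        h: {d: c / sum(cnt.values()) for d, c in cnt.items()}
        for h, cnt in counts.items()
    }

# Hypothetical measured graph rooted at source 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5)]
cond = degree_given_distance(edges, sources=[0])
for h in sorted(cond):
    print(h, cond[h])
```

If the conditional distributions look alike across values of h, the sample shows no sign of distance-dependent bias; here, as in the slides, degree is the measured degree in Ĝ.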
![Page 17: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/17.jpg)
A Statistical Test for C1
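C1: Are the highest-degree nodes near the source? If so, then consistent with bias.
H0C1: The 1% highest-degree nodes occur at random with distance to nearest source.
Cut vertex set in half: N (near) and F (far), by distance from nearest source.
Let $v = 0.01\,|V|$ be the number of highest-degree nodes, and $\kappa$ the fraction of those $v$ nodes that lie in N.
Can bound the likelihood that $\kappa$ deviates from 1/2 using Chernoff bounds:
$$\Pr\left[\kappa > \frac{1+\delta}{2}\right] < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{v/2}$$
Reject hypothesis with confidence $1-\alpha$ if:
$$\left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{v/2} \le \alpha$$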
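The C1 test reduces to evaluating a Chernoff bound of the form $[e^{\delta}/(1+\delta)^{(1+\delta)}]^{v/2}$, where the observed fraction of top-degree nodes in N is written as $(1+\delta)/2$. The sketch below evaluates that bound in log space; the function names and the default significance level of 0.05 are assumptions made for illustration.

```python
import math

def chernoff_bound(kappa, v):
    """Upper bound on the probability, under the null hypothesis, that more
    than a fraction kappa of the v highest-degree nodes fall in the near set."""
    delta = 2 * kappa - 1                 # from kappa = (1 + delta) / 2
    if delta <= 0:
        return 1.0                        # bound is uninformative at or below 1/2
    log_bound = (v / 2) * (delta - (1 + delta) * math.log(1 + delta))
    return math.exp(log_bound)

def reject_h0_c1(kappa, v, alpha=0.05):
    """Reject the null hypothesis with confidence 1 - alpha if the bound <= alpha."""
    return chernoff_bound(kappa, v) <= alpha

# Hypothetical inputs: 100 top-degree nodes, 90% or 55% of them in the near set.
print(chernoff_bound(0.90, 100), reject_h0_c1(0.90, 100))
print(chernoff_bound(0.55, 100), reject_h0_c1(0.55, 100))
```

Working with the logarithm avoids underflow when v is large, which matters for maps with hundreds of thousands of nodes.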
![Page 18: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/18.jpg)
A Statistical Test for C2
C2: Is the degree distribution of nodes near the source different from that of nodes further away? If so, consistent with bias.
H0C2: Chi-square test succeeds on degree distribution for nodes near the source and far from the source.
Partition vertices across median distance: N (near) and F (far).
Compare degree distribution of nodes in N and F, using the chi-square test:
$$\chi^2 = \sum_{i=1}^{l} (O_i - E_i)^2 / E_i$$
where $O_i$ and $E_i$ are the observed and expected degree frequencies in bin $i$, and $l$ is the number of histogram bins.
Reject hypothesis with confidence $1-\alpha$ if:
$$\chi^2 > \chi^2_{[\alpha,\ l-1]}$$
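A minimal sketch of the chi-square comparison follows. The slide does not say how the observed and expected frequencies are assigned, so this sketch makes an assumption: the near set supplies the observed counts, and the far set's histogram, rescaled to the same total, supplies the expected counts. The function names, bin edges, and degree samples are hypothetical; the critical value 7.815 is the standard table entry for 3 degrees of freedom at α = 0.05.

```python
def binned_counts(degrees, bin_edges):
    """Histogram: bin i counts degrees d with bin_edges[i] <= d < bin_edges[i+1]."""
    counts = [0] * (len(bin_edges) - 1)
    for d in degrees:
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

def chi_square_near_vs_far(near_degrees, far_degrees, bin_edges):
    """Chi-square statistic with near counts as O_i and rescaled far counts as E_i."""
    observed = binned_counts(near_degrees, bin_edges)
    far = binned_counts(far_degrees, bin_edges)
    scale = sum(observed) / sum(far)      # assumption: match the totals
    expected = [f * scale for f in far]
    # Bins with zero expected count are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical degree samples and four bins (l = 4, so l - 1 = 3 degrees of freedom).
near = [1, 1, 2, 3, 5, 6, 8, 9]
far = [1, 1, 1, 1, 2, 2, 3, 4]
edges = [1, 2, 3, 4, float("inf")]
statistic = chi_square_near_vs_far(near, far, edges)
critical_value = 7.815                    # chi-square table: alpha = 0.05, 3 d.o.f.
print(statistic, statistic > critical_value)
```

The toy counts are too small for the chi-square approximation to be reliable; real use would choose bins so that each expected count is reasonably large.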
![Page 19: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/19.jpg)
Our Definition of Bias
• Bias (Definition): Failure of a sampled graph to meet statistical tests for randomness associated with C1 and C2.
• Disclaimers: Tests are not conclusive. Tests are binary and don’t tell us how biased datasets are.
• But a dataset that fails both tests is a poor basis for generalizing about the underlying graph.
![Page 20: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/20.jpg)
Introducing datasets
[Figure: log(Pr[X>x]) vs. log(Degree) for Pansiot-Grad, Mercator, and Skitter]

| Dataset Name | Date | # Nodes | # Links | # Srcs | # Dsts |
|---|---|---|---|---|---|
| Pansiot-Grad | 1995 | 3,888 | 4,857 | 12 | 1270 |
| Mercator | 1999 | 228,263 | 320,149 | 1 | NA |
| Skitter | 2000 | 7,202 | 11,575 | 8 | 1277 |
![Page 21: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/21.jpg)
Testing C1
H0C1: The 1% highest-degree nodes occur at random with distance to source.
• Pansiot-Grad: 93% of the highest-degree nodes are in N
• Mercator: 90% of the highest-degree nodes are in N
• Skitter: 84% of the highest-degree nodes are in N
![Page 22: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/22.jpg)
Testing C2
H0C2
[Figure: log(Pr[X>x]) vs. log(Degree) for Near, Far, and All nodes, one panel each for Pansiot-Grad, Mercator, and Skitter]
![Page 23: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/23.jpg)
Summary of Statistical Tests
All datasets test positive for bias on both statistical tests.
It is likely that the true degree distribution of the routers differs from that of these datasets.
![Page 24: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/24.jpg)
Final Remarks
• Using (k,m)-traceroute methods to discover Internet topology yields biased samples.
• Rocketfuel [SMW:02] is limited-scale but may avoid some pitfalls of (k,m)-traceroute studies.
• One open question: How to sample the degree of a router at random?
• Node degree distributions are especially sensitive to biased sampling, so they may not be a sufficiently robust metric for characterizing or comparing graphs.
![Page 25: Sampling Biases in IP Topology Measurements](https://reader035.vdocuments.site/reader035/viewer/2022062322/56814664550346895db38514/html5/thumbnails/25.jpg)
Sampling Power-Law Graphs
Even though the distributional shapes are similar, the different exponents matter for topology modeling. Again, Ĝ is a biased sample of G.
Underlying PLRG: N=100000
Measured Graph: k=3, m=1000
[Figure: log(Pr[X>x]) vs. log(Degree) for the measured graph and the underlying power-law graph]