Post on 20-Dec-2015
Network Resilience: Exploring Cascading Failures
Vishal Misra, Columbia University in the City of New York
Joint work with Ed Coffman, Zihui Ge and Don Towsley (UMass Amherst)
Prologue
On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed a BGP storm. This one came on faster, rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th.
Similar events were observed on July 19th, the day CODE RED spread
http://www.renesys.com/projects/bgp_instability
Conjecture
• The viruses started random IP port scanning.
• Most of these random IP addresses were not in the cached entries of the routing table, causing frequent cache misses and, in the case of invalid IP addresses, generation of ICMP (router error) messages.
• Both of the above led to router CPU overload, causing routers to crash.
• Router failure led to withdrawal announcements by the peers, generating a high level of advertisement traffic.
• When the router came back on, it required a full state update from its peers, creating a large spike in the load of the peers that provided the state dump.
• Once the restarted router obtained all the dumps, it dumped its full state to all its peers, creating another spike in the load.
• Frequent full state dumps led to more CPU overload, leading to more crashes, and the propagation of the cycle.
Cascading Failures?
Outline
• Background
• Modeling interactions
• A Fluid model: Phase transitions
• A Birth-Death model: More phase transitions
• Insights
• Future work
Studies in Cascading Failures
Cascading failures studied extensively in Power Networks (Zaborsky et al.)
Coupling in Power Networks between nodes well understood: e.g. differential equations describe voltage-phasor-load relationships
Coupling in data networks: Routing, Traffic engineering, policy routing, DNS…difficult to model!
Modeling interactions
• We model coupling at the BGP level
• Study the interaction of a clique of BGP routers
• Model three different kinds of phenomena: router crash, router repair and full state updates
• System essentially forms a mutual aid collective
Clique of routers
• Routers form a fully connected graph
• All routers are peers of each other
• At the AS level, BGP routers form a clique of the order of 540 nodes
A fluid model for interactions
• We consider a clique of N nodes
• Study the process of nodes that are down, D
• ks: rate at which a single up node brings up down nodes
• kl: rate at which full state updates bring down up nodes
• Typically, expect ks >> kl
Drift equations
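$\lambda(t)$ = number of arrivals (up nodes going down) in $[0,t)$:
$$d\lambda(t) = (N-D)\,D\,k_l\,dt$$
$\mu(t)$ = number of departures (down nodes brought back up) in $[0,t)$:
$$d\mu(t) = D\,\frac{N-D}{D}\,k_s\,dt = (N-D)\,k_s\,dt$$
Now, consider the drift in down nodes D:
$$dD(t) = d\lambda(t) - d\mu(t)$$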
Dynamics of D
$$\frac{dD}{dt} = -k_l D^2 + (k_s + N k_l)\,D - N k_s$$
System shows a phase transition. If $D(0) > k_s / k_l$,
$$\lim_{t \to \infty} D(t) = N$$
else
$$\lim_{t \to \infty} D(t) = 0$$
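The threshold behaviour can be checked numerically. The following is a minimal sketch, not the authors' code: it integrates the drift equation in its factored form, dD/dt = (N - D)(kl D - ks), with a simple forward-Euler scheme. The function name, step size, time horizon and the specific values ks = 1, kl = 0.05 (chosen so that ks / kl = 20 with N = 100) are assumptions made for illustration.

```python
def simulate_down_nodes(d0, n=100, ks=1.0, kl=0.05, dt=1e-3, t_end=50.0):
    """Forward-Euler integration of dD/dt = (N - D) * (kl * D - ks).

    D is clipped to [0, N] because it counts down nodes.
    All default parameter values are illustrative assumptions.
    """
    d = float(d0)
    for _ in range(int(t_end / dt)):
        drift = (n - d) * (kl * d - ks)
        d = min(max(d + drift * dt, 0.0), float(n))
    return d


# Threshold is ks / kl = 20 down nodes for these assumed values.
below = simulate_down_nodes(15)  # starts below the threshold
above = simulate_down_nodes(25)  # starts above the threshold
print(f"D(0)=15 -> D(end)={below:.3f}")
print(f"D(0)=25 -> D(end)={above:.3f}")
```

Starting below the threshold the number of down nodes drains to 0, while starting above it the whole clique collapses towards N, which is the dichotomy stated above.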
Phase transitions
(Figure: phase transition for N = 100, ks / kl = 20)
Properties of phase transition
Threshold is an absolute quantity rather than a fraction
Cliques with “powerful” (i.e., ks / kl high) nodes do not exhibit cascading failures
Smaller cliques more resistant to phase transitions
A Birth-Death model
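• Again consider a clique of N nodes
• The system state i is the number of down nodes
• Transition rates are state dependent
(State diagram: chain of states 0, 1, ..., i, i+1, ..., N-1, N, with birth rate $\lambda_i$ from state i to i+1 and death rate $\mu_i$ from state i to i-1)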
Transient model
• Since $\mu_N = 0$, state N is an absorbing state
• System ends up in N with probability 1
• Perform transient analysis, compute mean time to absorption, Wi, starting from state i
• Wi is a good indicator of the stability of the system; a low value indicates a propensity to collapse to state N (where all nodes are down)
• Physically, interpret Wi as the ability of the system to recover if it ends up in state i through some exogenous process (e.g. attacks)
Solution for Wi
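$$W_i = \frac{1}{\lambda_i + \mu_i} + \frac{\mu_i}{\lambda_i + \mu_i}\,W_{i-1} + \frac{\lambda_i}{\lambda_i + \mu_i}\,W_{i+1}$$
With boundary conditions
$$W_0 = \frac{1}{\lambda_0} + W_1$$
and
$$W_{N-1} = \frac{1}{\lambda_{N-1} + \mu_{N-1}} + \frac{\mu_{N-1}}{\lambda_{N-1} + \mu_{N-1}}\,W_{N-2}$$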
Solution (cont.)
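$$W_i = W_{i+1} + \sum_{j=0}^{i} \frac{1}{\lambda_j} \prod_{k=j+1}^{i} \frac{\mu_k}{\lambda_k}$$
and
$$W_N = 0$$
yield a way to compute Wi (an empty product, for j = i, equals 1).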
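As a rough illustration of how the mean times to absorption could be computed, the sketch below works with the differences W_i - W_{i+1}, which satisfy a one-step recursion obtained by rearranging the first-step equation: the difference equals 1/λ_0 for i = 0, and 1/λ_i plus (μ_i/λ_i) times the previous difference for i ≥ 1. The function name and the small test chains are hypothetical, and the code assumes every birth rate λ_i is strictly positive for i < N.

```python
def mean_absorption_times(lam, mu):
    """Mean time to reach state N from each state of a birth-death chain.

    lam[i], mu[i] are the birth and death rates in state i, for i = 0..N-1
    (mu[0] is ignored; state N is absorbing). Returns [W_0, ..., W_N].
    Assumes lam[i] > 0 for every i.
    """
    n = len(lam)
    # delta[i] = W_i - W_{i+1}
    delta = [0.0] * n
    delta[0] = 1.0 / lam[0]
    for i in range(1, n):
        delta[i] = 1.0 / lam[i] + (mu[i] / lam[i]) * delta[i - 1]
    # Back-substitute from the boundary condition W_N = 0.
    w = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        w[i] = w[i + 1] + delta[i]
    return w


# Toy chain with N = 2: both birth rates 1, death rate 1 in state 1.
toy = mean_absorption_times([1.0, 1.0], [0.0, 1.0])
print(toy)
```

For very long mean times the values can exceed floating-point range, in which case working with logarithms or exact rationals would be a reasonable alternative.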
Modeling transition rates
$$\lambda_i = (N-i)\,i\,k_l + k_a$$
ka = ambient traffic load, kl similar to fluid model
$$\mu_i = (N-i)\,k_s$$
ks similar to fluid model
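One way to get a feel for these rates is to look for the first state in which failures outpace repairs. The sketch below builds the two rate sequences and returns the smallest i with a birth rate above the death rate. The helper names are hypothetical, and the ambient load ka = 0.01 is an assumed value, since the slides do not give one.

```python
def transition_rates(n, ks, kl, ka):
    """Birth rates (N-i)*i*kl + ka and death rates (N-i)*ks for i = 0..N-1."""
    lam = [(n - i) * i * kl + ka for i in range(n)]
    mu = [(n - i) * ks for i in range(n)]
    return lam, mu


def first_unstable_state(n, ks, kl, ka):
    """Smallest state i whose birth rate exceeds its death rate, or None."""
    lam, mu = transition_rates(n, ks, kl, ka)
    for i in range(n):
        if lam[i] > mu[i]:
            return i
    return None


# ka = 0.01 is an illustrative assumption, not a value from the slides.
small_clique = first_unstable_state(20, ks=1.0, kl=0.01, ka=0.01)
large_clique = first_unstable_state(200, ks=1.0, kl=0.01, ka=0.01)
print(small_clique, large_clique)
```

With a small ka the crossover sits near i = ks / kl, the same absolute threshold as in the fluid model, so a clique with fewer than ks / kl nodes never reaches it.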
The mean time to absorption
N = 20, ks = 1, kl = 0.01
System stable, mean time to absorption of the order of $10^{26}$, even if only one node is up
A larger clique
N = 100, ks = 1, kl = 0.01
System still stable, mean time to absorption of the order of $10^{48}$, if only one node is up
The appearance of phase transitions
N = 200, ks = 1, kl = 0.01
Mean time to absorption goes down from $10^{47}$ to about 0 in a matter of a few states
Dependence on service rate/load
Transition point shifts right as ratio goes up
Dependence on clique size
Transition point remains roughly the same, relative stability goes down as N goes up
Early conclusions
Cascading failures possible in mutual support systems like a BGP clique
Presence of phase transitions depends on system parameters strongly
Clique size an important threshold, larger cliques more likely to undergo cascading failures
Future work
• Refine model, plug in numbers for parameters
• Look at different topologies
• Do more detailed modeling of a single router (fixed point solutions)