
Network Resilience: Exploring Cascading Failures

Vishal Misra

Columbia University in the City of New York

Joint work with Ed Coffman, Zihui Ge and Don Towsley (Umass-Amherst)

Prologue

On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed a BGP storm.

This one came on faster, rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th.

Similar events were observed on July 19th, the day CODE RED spread

http://www.renesys.com/projects/bgp_instability

Conjecture

The viruses started random IP port scanning.

Most of these random IP addresses were not in the cached entries of the routing table, causing frequent cache misses and, in the case of invalid IP addresses, generation of ICMP (router error) messages.

Both of the above causes led to router CPU overload, causing routers to crash.

Router failure led to withdrawal announcements by the peers, generating a high level of advertisement traffic.

When the router came back on, it required a full state update from its peers, creating a large spike in the load of the peers that provided the state dump.

Once the restarted router obtained all the dumps, it dumped its full state to all its peers, creating another spike in the load.

Frequent full state dumps led to more CPU overload, leading to more crashes, and the propagation of the cycle...

Cascading Failures?

Outline

Background

Modeling interactions

A Fluid model

Phase transitions

A Birth-Death model

More phase transitions

Insights

Future work

Studies in Cascading Failures

Cascading failures studied extensively in Power Networks (Zaborsky et al.)

Coupling in Power Networks between nodes well understood: e.g. differential equations describe voltage-phasor-load relationships

Coupling in data networks: Routing, Traffic engineering, policy routing, DNS…difficult to model!

Modeling interactions

We model coupling at BGP level

Study the interaction of a clique of BGP routers

Model three different kinds of phenomena: router crash, router repair and full state updates

System essentially forms a mutual aid collective

Clique of routers

• Routers form a fully connected graph
• All routers are peers of each other
• At the AS level, BGP routers form a clique of the order of 540 nodes

A fluid model for interactions

We consider a clique of N nodes

Study the process of nodes that are down, D

ks: Rate at which a single up node brings up down nodes

kl: Rate at which full state updates bring down up nodes

Typically, expect ks >> kl

Drift equations

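$\lambda(t)$ = number of arrivals (nodes going down) in $[0,t)$

$$d\lambda(t) = (N-D)\,D\,k_l\,dt$$

$\mu(t)$ = number of departures (nodes coming back up) in $[0,t)$

$$d\mu(t) = D\,\frac{N-D}{D}\,k_s\,dt = (N-D)\,k_s\,dt$$

Now, consider the drift in down nodes $D$

$$dD(t) = d\lambda(t) - d\mu(t)$$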

Dynamics of D

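With failures at rate $(N-D)\,D\,k_l$ and repairs at rate $(N-D)\,k_s$:

$$\frac{dD}{dt} = (N-D)(k_l D - k_s) = -k_l D^2 + (N k_l + k_s)\,D - N k_s$$

System shows a phase transition. If $D(0) > k_s / k_l$

$$\lim_{t \to \infty} D(t) = N$$

else

$$\lim_{t \to \infty} D(t) = 0$$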

Phase transitions

N = 100, ks / kl = 20
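The threshold behaviour can be checked numerically. The following is a minimal sketch, not the authors' code: it integrates dD/dt = (N - D)(kl D - ks) with forward Euler for N = 100 and ks / kl = 20. The absolute values ks = 20 and kl = 1, the step size, the time horizon and the function name `simulate_fluid` are all assumptions chosen for illustration; only the ratio and N come from the slide.

```python
def simulate_fluid(d0, n=100.0, ks=20.0, kl=1.0, dt=1e-4, t_end=5.0):
    """Integrate dD/dt = (N - D) * (kl * D - ks) with forward Euler.

    d0 is the initial number of down nodes. Only the ratio ks / kl = 20
    and N = 100 are taken from the slides; the absolute rates, dt and
    t_end are illustrative assumptions.
    """
    d = float(d0)
    for _ in range(int(t_end / dt)):
        d += dt * (n - d) * (kl * d - ks)
        # The fluid equation keeps pushing D down at D = 0, where there is
        # nothing left to repair, so clamp D to the physical range [0, N].
        d = min(max(d, 0.0), n)
    return d


if __name__ == "__main__":
    for start in (5, 19, 21, 50):
        print(f"D(0) = {start:3d}  ->  D(t_end) = {simulate_fluid(start):.3f}")
```

Starting points just below and just above ks / kl = 20 end at opposite extremes, which is the phase transition described above. Note that the threshold does not depend on N.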

Properties of phase transition

Threshold is an absolute quantity rather than a fraction

Cliques with “powerful” (i.e., ks / kl high) nodes do not exhibit cascading failures

Smaller cliques more resistant to phase transitions

A Birth-Death model

Again consider a clique of N nodes

The system state i is the number of down nodes

Transition rates are state dependent

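[State diagram: birth-death chain on states 0, 1, ..., i, i+1, ..., N-1, N, with failure rates λ_i and repair rates μ_i on the transitions]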

Transient model

Since μ_N = 0, state N is an absorbing state

System ends up in N with probability 1

Perform transient analysis, compute mean time to absorption, Wi, starting from state i

Wi is a good indicator of the stability of the system; a low value indicates a propensity to collapse to state N (where all nodes are down)

Physically, interpret Wi as the ability of the system to recover if it ends up in state i through some exogenous process (e.g. attacks)

Solution for Wi

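$$W_i = \frac{1}{\lambda_i + \mu_i} + \frac{\lambda_i}{\lambda_i + \mu_i}\,W_{i+1} + \frac{\mu_i}{\lambda_i + \mu_i}\,W_{i-1}$$

With boundary conditions

$$W_0 = \frac{1}{\lambda_0} + W_1$$

and

$$W_{N-1} = \frac{1}{\lambda_{N-1} + \mu_{N-1}} + \frac{\mu_{N-1}}{\lambda_{N-1} + \mu_{N-1}}\,W_{N-2}$$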

Solution (cont.)

$$W_i - W_{i+1} = \sum_{j=0}^{i} \frac{1}{\lambda_j} \prod_{k=j+1}^{i} \frac{\mu_k}{\lambda_k}$$

and

$$W_N = 0$$

Together these yield a way to compute Wi
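The step from the recursion for Wi to a difference form can be filled in as follows. This is a sketch that assumes the standard first-step recursion for a birth-death chain, with λ_i the rate from state i to i+1 and μ_i the rate from state i to i-1. The symbol Δ_i is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
(\lambda_i + \mu_i)\,W_i &= 1 + \lambda_i W_{i+1} + \mu_i W_{i-1} \\
\lambda_i\,(W_i - W_{i+1}) &= 1 + \mu_i\,(W_{i-1} - W_i) \\
\Delta_i &= \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i}\,\Delta_{i-1},
  \qquad \Delta_i := W_i - W_{i+1}, \quad \Delta_0 = \frac{1}{\lambda_0} \\
W_i &= \sum_{m=i}^{N-1} \Delta_m \qquad \text{(using } W_N = 0\text{)}
\end{align*}
```

Unrolling the recursion for Δ_i from Δ_0 gives a sum of products of the ratios μ_k / λ_k, and summing the differences from i up to N-1 recovers Wi.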

Modeling transition rates

$$\lambda_i = (N-i)\,i\,k_l + k_a$$

ka = ambient traffic load, kl similar to fluid model

$$\mu_i = (N-i)\,k_s$$

ks similar to fluid model
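With these rates, the mean time to absorption can be computed directly. The following is a minimal sketch under stated assumptions: it uses the difference recursion d_i = W_i - W_{i+1} = (1 + μ_i d_{i-1}) / λ_i with d_0 = 1 / λ_0 and W_N = 0, which follows from the standard first-step equations for a birth-death chain. The slides do not give a value for ka, so the `ka=1e-3` below is made up, and the function names are hypothetical.

```python
def transition_rates(n, ks, kl, ka):
    """Rates for states i = 0..N-1 (state N is absorbing, so it needs none)."""
    lam = [(n - i) * i * kl + ka for i in range(n)]  # failures: i -> i+1
    mu = [(n - i) * ks for i in range(n)]            # repairs:  i -> i-1
    return lam, mu


def mean_absorption_times(lam, mu):
    """Return [W_0, ..., W_N], the mean times to reach state N, with W_N = 0.

    Works on the differences d_i = W_i - W_{i+1}, which satisfy
    d_0 = 1 / lam[0] and d_i = (1 + mu[i] * d_{i-1}) / lam[i].
    Every lam[i] must be positive (so ka > 0 is required for state 0).
    """
    n = len(lam)
    diffs = []
    prev = 0.0  # there is no state below 0, so mu[0] never contributes
    for i in range(n):
        prev = (1.0 + mu[i] * prev) / lam[i]
        diffs.append(prev)
    w = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):  # accumulate differences back from W_N = 0
        w[i] = w[i + 1] + diffs[i]
    return w


if __name__ == "__main__":
    # ka is an assumed value; N, ks and kl match the first example in the slides.
    lam, mu = transition_rates(n=20, ks=1.0, kl=0.01, ka=1e-3)
    w = mean_absorption_times(lam, mu)
    print(f"W_0 = {w[0]:.3e}, W_19 = {w[19]:.3e}")
```

Because the differences grow as products of μ_k / λ_k, floating-point values can become very large for big cliques; exact rational arithmetic would be an alternative if precision matters.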

The mean time to absorption

N = 20, ks = 1, kl = 0.01

System stable, mean time to absorption of the order of $10^{26}$, even if only one node is up

A larger clique

N = 100, ks = 1, kl = 0.01

System still stable, mean time to absorption of the order of $10^{48}$, if only one node is up

The appearance of phase transitions

N = 200, ks = 1, kl = 0.01

Mean time to absorption goes down from $10^{47}$ to about 0 in a matter of a few states

Dependence on service rate/load

Transition point shifts right as ratio goes up

Dependence on clique size

Transition point remains roughly the same, relative stability goes down as N goes up

Early conclusions

Cascading failures possible in mutual support systems like a BGP clique

Presence of phase transitions depends strongly on system parameters

Clique size an important threshold, larger cliques more likely to undergo cascading failures

Future work

Refine model, plug in numbers for parameters

Look at different topologies

Do more detailed modeling of single router (fixed point solutions)
