
RON: Resilient Overlay Networks

David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris

MIT Laboratory for Computer Science

http://nms.lcs.mit.edu/ron/

Fault-tolerant networking

[Figure: a network connecting hosts A, B, C, and D]

• Packet switching routes around failures

Internet: network of networks

• ISPs peer to forward packets
• ISPs exchange routing information using BGP

[Figure: ISP1, ISP2, and ISP3 interconnecting Sites 1 through 5]

The Internet is ill suited to mission-critical applications

• Commercial peer architecture
– Performance bottlenecks at peering points
– Ignores many existing alternate paths
– Directly conflicts with robustness

• Internet’s global scale
– Prevents sophisticated algorithms
– Route selection uses fixed, simple metrics
– Routing isn’t sensitive to path quality

How robust is Internet routing?

Paxson 95-97
• 3.3% of all routes had serious problems

Labovitz 97-00
• 10% of routes available < 95% of the time
• 65% of routes available < 99.9% of the time
• 3-min minimum detection + recovery time; often 15 mins
• 40% of outages took 30+ mins to repair

Chandra 01
• 5% of faults last more than 2.75 hours

Our goal

To improve communication availability for small groups by at least a factor of 10

• Many applications
– Collaboration and conferencing

– Virtual Private Networks (VPNs) across public Internet

– Overlay Internet Service

Overlay routes around Internet failures

[Figure: overlay nodes labeled Utah, Utah Company, MIT, and Cable Modem]

• Failures:

– Outages: configuration/operational errors, backhoes, etc.

– Performance failures: severe congestion, denial-of-service attacks, etc.

Scalability versus recovery

• Internet scalability pays a price:
– Slow recovery

• RON recovers fast by
– Limiting size of overlay
– Exploiting redundancy in underlying Internet

Redundant links

• Multiple paths between all sites

[Figure: diagram with labels Utah Company, Cable Modem, Utah, MIT, and Internet 2]

Redundant links

• But many of them are hidden

[Figure: diagram with labels Utah Company, Cable Modem, Utah, and MIT]

Resilient overlay networks

• Measure all links between nodes

• Compute path properties

• Determine best route

• Forward traffic over that path
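The four steps above can be sketched in a few lines. This is a minimal illustration, assuming latency is the only path property and that a route uses at most one intermediate node; the node names, the `links` table, and the `best_route` helper are hypothetical and are not taken from the RON code.

```python
from typing import Dict, List, Tuple

# Hypothetical measured link latencies (milliseconds) between overlay nodes.
Latency = Dict[Tuple[str, str], float]


def best_route(src: str, dst: str, nodes: List[str],
               latency: Latency) -> Tuple[List[str], float]:
    """Return the lowest-latency path using at most one intermediate node."""
    inf = float("inf")
    # Start from the direct path; a missing measurement counts as unreachable.
    best_path = [src, dst]
    best_cost = latency.get((src, dst), inf)
    for mid in nodes:
        if mid in (src, dst):
            continue
        # Path property for an indirect route: sum of the two link latencies.
        cost = latency.get((src, mid), inf) + latency.get((mid, dst), inf)
        if cost < best_cost:
            best_path, best_cost = [src, mid, dst], cost
    return best_path, best_cost


# Illustrative numbers only, not measurements from the talk.
links = {
    ("utah", "mit"): 120.0,
    ("utah", "cable"): 40.0,
    ("cable", "mit"): 30.0,
}
path, cost = best_route("utah", "mit", ["utah", "mit", "cable"], links)
print(path, cost)
```

With these made-up numbers the indirect route through the hypothetical `cable` node wins because its two links together are faster than the direct link.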

RON design

[Figure: two RON nodes, each with a Conduit, Forwarder, Router, and Prober, plus a Performance Database. Other labels: application-specific routing tables, policy routing module, RON library, nodes in different routing domains (ASes).]

Routing and path selection

• Path selection at the entry node
– Specialized for routing through one intermediate node

• Router computes the forwarding tables
– Link-state dissemination through RON

• Path evaluation and selection
– Latency minimizer: EWMA of round-trip samples
– Loss-rate minimizer: average of the last k samples
– Throughput optimizer: TCP throughput equation

• Select when estimated throughput improves by 2x

• 5% hysteresis to avoid flapping
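The three path metrics and the two switching rules might look roughly like the sketch below. Several details are assumptions rather than facts from the slides: the EWMA weight `alpha`, the window size `k`, the loss floor, the use of the simplified TCP throughput model (throughput proportional to sqrt(1.5) / (RTT * sqrt(loss))), and the reading of "5% hysteresis" as "switch only when the candidate is more than 5% better" on a lower-is-better metric. All class and function names are hypothetical.

```python
import math
from collections import deque
from typing import Optional


class LatencyEstimator:
    """EWMA of round-trip samples; alpha is an assumed smoothing weight."""

    def __init__(self, alpha: float = 0.9) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, rtt: float) -> float:
        if self.value is None:
            self.value = rtt  # first sample seeds the average
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * rtt
        return self.value


class LossEstimator:
    """Average loss over the last k probes; k is an assumed window size."""

    def __init__(self, k: int = 100) -> None:
        self.samples = deque(maxlen=k)

    def update(self, lost: bool) -> float:
        self.samples.append(1.0 if lost else 0.0)
        return sum(self.samples) / len(self.samples)


def tcp_throughput_score(rtt_s: float, loss: float) -> float:
    """Simplified TCP throughput model, in units of segments per second."""
    loss = max(loss, 1e-4)  # assumed floor so a loss-free path does not divide by zero
    return math.sqrt(1.5) / (rtt_s * math.sqrt(loss))


def should_switch_throughput(current: float, candidate: float,
                             factor: float = 2.0) -> bool:
    """Switch only when the estimated throughput improves by the given factor."""
    return candidate >= factor * current


def should_switch_lower_is_better(current: float, candidate: float,
                                  hysteresis: float = 0.05) -> bool:
    """Switch only when the candidate beats the current path by more than 5%."""
    return candidate < current * (1 - hysteresis)
```

The hysteresis check is what keeps two nearly equal paths from trading places on every probe: a candidate at 96 ms does not displace a current path at 100 ms, while one at 94 ms does.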

Policy routing

• Router computes a forwarding table for each policy

• Two ways of describing policies:
– Exclusive cliques (e.g., educational only)
– General policies
   • BPF-like packet matcher, which returns a policy
   • Links that are denied by a policy

• Entry node classifies packet with a policy tag
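One way to picture the exclusive-clique case is sketched below. It assumes one reading of the policy: links between two clique members may carry only traffic whose source and destination are both in the clique. The `EDU` membership set, the tag names, and both helper functions are hypothetical; the BPF-like matcher for general policies is not modeled.

```python
from typing import FrozenSet, Set, Tuple

Link = Tuple[str, str]

# Hypothetical clique membership (e.g., educational sites).
EDU: FrozenSet[str] = frozenset({"mit", "utah", "cmu"})


def classify(src: str, dst: str, clique: FrozenSet[str] = EDU) -> str:
    """Entry-node classification: tag the packet with the policy it falls under."""
    return "clique" if src in clique and dst in clique else "default"


def permitted_links(tag: str, links: Set[Link],
                    clique: FrozenSet[str] = EDU) -> Set[Link]:
    """Links a packet with this policy tag may use.

    Clique traffic may use every link; all other traffic is denied the
    links that run between two clique members.
    """
    if tag == "clique":
        return set(links)
    return {(a, b) for (a, b) in links
            if not (a in clique and b in clique)}


links = {("mit", "utah"), ("mit", "cable"), ("cable", "utah")}
print(classify("cable", "utah"), sorted(permitted_links("default", links)))
```

A router following this scheme would compute one forwarding table per tag from the corresponding permitted link set, which matches the "forwarding table for each policy" bullet above.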

Responding to failure

• Probe interval: 12 seconds
• Probe timeout: 3 seconds
• Routing update interval: 14 seconds

RON overhead

• Probe overhead: 69 bytes
• RON routing overhead: 60 + 20 (N-1)
• 50: allows recovery times between 12 and 25 s

10 nodes   20 nodes   30 nodes   40 nodes   50 nodes
1.8 Kbps   5.9 Kbps   12 Kbps    21 Kbps    32 Kbps
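A back-of-envelope estimate shows how the overhead grows with the number of nodes N. The sketch assumes that the routing overhead expression is in bytes, that each node sends one 69-byte probe to each of its N-1 peers every 12 seconds, and that it sends one routing update to each peer every 14 seconds (the intervals from the "Responding to failure" slide). The slides do not spell out this accounting, so the result is close to, but not identical to, the figures in the table.

```python
PROBE_BYTES = 69          # from the slide
PROBE_INTERVAL_S = 12     # from the "Responding to failure" slide
ROUTING_INTERVAL_S = 14   # from the "Responding to failure" slide


def routing_packet_bytes(n: int) -> int:
    """Routing update size for an N-node RON: 60 + 20 (N-1), assumed to be bytes."""
    return 60 + 20 * (n - 1)


def estimated_overhead_kbps(n: int) -> float:
    """Assumed per-node traffic: one probe and one routing update per peer per interval."""
    peers = n - 1
    probe_bps = PROBE_BYTES * peers * 8 / PROBE_INTERVAL_S
    routing_bps = routing_packet_bytes(n) * peers * 8 / ROUTING_INTERVAL_S
    return (probe_bps + routing_bps) / 1000


for n in (10, 20, 30, 40, 50):
    print(n, round(estimated_overhead_kbps(n), 1))
```

The routing term grows roughly with N squared because both the update size and the number of peers grow with N, which is one reason the overlay is kept small.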

Many research questions

• Does the RON approach work at all?
• Each RON is small in size, no more than 50 or 100 nodes
– How fast can failure detection & recovery happen?

• Policy routing
– Doesn’t RON violate AUPs and other policies?

• Routing behavior
– Can stable routing be achieved?
– Implementing efficient multi-criteria routing

• Is it safe to deploy a large number of (small) interacting RONs on the Internet?

IP forwarder

• A RON application
• Transparently forwards IP traffic over RON

• Allows comparisons of IP traffic over RON versus over direct Internet

RON deployment (19 sites)

[Figure: map of the deployment sites, with links out to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]

Sites: .com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela

AS view

Experiments

• Measure loss, latency, and throughput with and without RON

• RON1: 12 hosts in the US and Europe
– 64 hours of measurements in March 2001

• RON2: 16 hosts
– 85 hours of measurements in May 2001

• 30-minute average loss rates
– A 30-minute outage is very serious!

• Note: Experiments done with “No-Internet2-for-commercial-use” policy

Take home messages

1. RON reduced outages by a factor of 5 to 10, and routed around all major outages

2. RON takes 18s (average) to route around a failure, and can do so in the face of flooding attacks

3. Single route indirection delivers the majority of RON's benefits

RON improves loss-rate

[Figure: scatter plot ("loss.jit") of 13,000 samples. x-axis: 30-min average loss rate with RON; y-axis: 30-min average loss rate on Internet; both axes run from 0 to 1.]

RON loss rate never more than 30%

An order-of-magnitude fewer failures

Loss Rate   RON Better   No Change   RON Worse
10%         526 [517]    58 [51]     47 [45]
20%         142 [140]    4 [3]       15 [15]
30%         32 [32]      0           0
50%         20 [20]      0           0
80%         14 [14]      0           0
100%        10           0           0

30-minute average loss rates

6,825 “path hours” represented here
12 “path hours” of essentially complete outage
72 “path hours” of TCP outage

RON routed around all of these!
One indirection hop provides almost all the benefit!


Why does one hop work?

In RON testbed:
– P(direct path is good) is 48.8%
– P(intermediate path is good) is 51%

[Figure: source and target connected through R RON nodes; each link is good with probability p and bad with probability 1-p]

P(good path) = (1 – (1-p)^2)^(R+1)

Resilience Against DoS Attacks

Latency using RON

What’s next for RON?

• Data mining of collected samples

• Applications

• Routing policies (e.g., rate control)

Other progress: Chord

• Chord: a peer-to-peer lookup system

• CFS: a peer-to-peer file sharing application

www.pdos.lcs.mit.edu/chord

Conclusion

• Improved availability of Internet communication paths using small overlays
– Layered above scalable IP substrate
– RON provides a set of libraries and programs to facilitate this application-specific routing

• Experimental data suggest that the approach works
– Over 10X availability
– Outage detection and recovery in about 15 seconds
– Able to route around certain denial-of-service attacks

• Many interesting questions remain…

http://nms.lcs.mit.edu/ron/
