Infrastructure-based Resilient Routing
Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz
University of California, Berkeley
ICSI Lunch Seminar, January 2004
Motivation
- Network connectivity is not reliable
  - Disconnections are frequent in the Internet (UMich TR 98, IMC 02)
  - 50% of backbone links have MTBF < 10 days
  - 20% of faults last longer than 10 minutes
- IP-level repair is relatively slow
  - Wide-area: BGP ~3 minutes
  - Local-area: IS-IS ~5 seconds
- Next-generation wide-area network applications
  - Streaming media, VoIP, B2B transactions
  - Low tolerance for delay, jitter, and faults
The Challenge
- Routing failures are diverse
  - Many causes: misconfigurations, cut fiber, planned downtime, software bugs
  - Occur anywhere, with local or global impact: a single fiber cut can disconnect AS pairs
  - One event can lead to complex protocol interactions
- Isolating failures is difficult
  - End-user symptoms are often dynamic or intermittent
  - WAN measurement research is ongoing (Rocketfuel, etc.)
- Observations:
  - Fault detection benefits from multiple distributed vantage points
  - In-network decision making is necessary for timely responses
Talk Overview
- Motivation
- A structured overlay approach
- Mechanisms and policy
- Evaluation
- Some questions
An Infrastructure Approach
Our goals:
- A resilient overlay to route around failures
- Respond in milliseconds (not seconds)
Our approach (data & control plane):
- Nodes are observation points (similar to Plato's NEWS service)
- Nodes are also points of traffic redirection (forwarding-path determination and data forwarding)
- No edge-node involvement
  - Fast response time; security focused on the infrastructure
  - Fully transparent; no application awareness necessary
Why Structured Overlays
Resilient Overlay Networks (MIT):
- Fully connected mesh; each node has full knowledge of the network
- Fast, independent calculation of routes
- Nodes can construct any path: maximum flexibility
Cost of flexibility:
- The protocol needs to choose the "right" route/nodes
- Per-node O(n) state; each node monitors n − 1 paths
- O(n²) total path monitoring is expensive
[Figure: full mesh with source S and destination D]
The Big Picture
[Figure: a structured overlay layered above the Internet]
- Locate a nearby overlay proxy
- Establish an overlay path to the destination host
- The overlay routes traffic resiliently
Traffic Tunneling
[Figure: legacy nodes A and B register with nearby proxies on a structured peer-to-peer overlay]
- Each proxy registers the legacy node it serves: put(hash(A), P'(A)), put(hash(B), P'(B))
- A's proxy resolves B's proxy: get(hash(B)) → P'(B)
- A, B are IP addresses; P'(A), P'(B) are the proxies' overlay IDs
- Store the mapping from end-host IP to its proxy's overlay ID
- Similar to the approach in Internet Indirection Infrastructure (I3)
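The register/lookup exchange can be sketched as follows. A plain dictionary stands in for the overlay's distributed hash table (real put/get would be routed through the structured overlay), and the IP addresses and proxy IDs are hypothetical:

```python
import hashlib

# Toy stand-in for the overlay DHT; in the real system put/get
# are routed through the structured peer-to-peer overlay.
dht = {}

def overlay_hash(ip):
    """Hash an end-host IP address into the overlay's ID space."""
    return hashlib.sha1(ip.encode()).hexdigest()

def register(host_ip, proxy_id):
    # Proxy publishes the mapping: put(hash(B), P'(B))
    dht[overlay_hash(host_ip)] = proxy_id

def lookup(host_ip):
    # A's proxy resolves B's proxy: get(hash(B)) -> P'(B)
    return dht.get(overlay_hash(host_ip))

# Proxies register the legacy nodes they serve.
register("10.0.0.1", "proxy-A-overlay-id")   # P'(A)
register("10.0.0.2", "proxy-B-overlay-id")   # P'(B)

# To tunnel traffic toward B, A's proxy looks up B's proxy.
print(lookup("10.0.0.2"))   # proxy-B-overlay-id
```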
Pros and Cons
Pros:
- Leverage small neighbor sets: fewer neighbor paths to monitor, O(n) → O(log n)
  - Reduced probing bandwidth; faster fault detection
- Actively maintain static route redundancy
  - Manageable for a "small" number of paths
  - Redirect traffic immediately when a failure is detected
  - Eliminates on-the-fly calculation of new routes; redundancy is restored in the background after a failure
- Fast fault detection + precomputed paths = more responsiveness
Cons:
- The overlay imposes routing stretch (mostly < 2)
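The monitoring gap can be illustrated numerically. The base-16 digit size and the `base × log_base(n)` neighbor estimate below are assumptions for a Tapestry-style prefix-routing overlay, not figures from the talk:

```python
import math

def mesh_paths(n):
    # Fully connected mesh (RON-style): each node probes every other node.
    return n - 1

def overlay_paths(n, base=16):
    # Tapestry-style prefix routing (base-16 is an assumption here):
    # roughly `base` neighbors per routing level, log_base(n) levels.
    return math.ceil(base * math.log(n, base))

# Per-node monitored paths at several overlay sizes.
for n in (100, 1000, 5000):
    print(n, mesh_paths(n), overlay_paths(n))
```

Even at 5000 nodes, the structured overlay monitors a few dozen paths per node instead of thousands, which is where the probing-bandwidth savings come from.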
In-network Resiliency Details
- Active periodic probes for fault detection
- Exponentially weighted moving average (EWMA) link-quality estimation
  - Avoids route flapping due to short-term loss artifacts
  - Loss rate: Ln = (1 − α) · Ln−1 + α · p, where p is the latest probe's loss observation
- Simple approach taken; much ongoing research
  - Smart fault detection / propagation (Zhuang 04)
  - Intelligent and cooperative path selection (Seshadri 04)
- Maintaining backup paths
  - Create and store backup routes at node insertion
  - Query neighbors after failures to restore redundancy
  - Ask any neighbor at or above the routing level of the faulty node, e.g. node ABCD sees ABDE fail and can ask any AB?? node for info
- Simple policies to choose among redundant paths
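The EWMA loss estimator can be sketched in a few lines; α = 0.2 matches one of the filter constants measured later, and the three-probe trace is illustrative:

```python
def update_loss_rate(prev, probe_lost, alpha=0.2):
    """EWMA of link loss rate: L_n = (1 - alpha) * L_{n-1} + alpha * p,
    where p is 1 if the latest probe was lost and 0 otherwise."""
    p = 1.0 if probe_lost else 0.0
    return (1 - alpha) * prev + alpha * p

# A single lost probe nudges the estimate up, then it decays smoothly;
# this damping is what avoids route flapping on transient loss.
loss = 0.0
for lost in [True, False, False]:
    loss = update_loss_rate(loss, lost)
print(round(loss, 3))   # 0.128
```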
First Reachable Link Selection (FRLS)
- Use link-quality estimation to choose the shortest "usable" path
  - i.e. the shortest path whose minimal link quality exceeds a threshold T
- Correlated failures: reduce them with intelligent topology construction
- Goal: leverage the available redundancy
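FRLS can be sketched as follows. The candidate paths, per-link quality values, and the threshold T = 0.95 are hypothetical; the selection rule (shortest path whose worst link still clears T) is the one described above:

```python
def frls(paths, threshold=0.95):
    """First Reachable Link Selection (sketch): `paths` is ordered from
    shortest to longest; each path is a list of per-link EWMA quality
    estimates. Return the first path whose worst link exceeds T."""
    for path in paths:
        if min(path) > threshold:
            return path
    return None   # every path is below threshold

# Hypothetical candidate paths, shortest first.
primary = [0.99, 0.50, 0.98]          # one badly degraded link
backup1 = [0.97, 0.96, 0.99]          # slightly longer, fully usable
backup2 = [0.99, 0.99, 0.99, 0.99]    # longest
print(frls([primary, backup1, backup2]))   # [0.97, 0.96, 0.99]
```

Returning `None` is the case where constrained multicast (see the backup slides) takes over.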
Evaluation
Metrics for evaluation:
- How much routing resiliency can we exploit?
- How fast can we adapt to faults (responsiveness)?
Experimental platforms:
- Event-based simulations on transit-stub topologies
  - Data collected over multiple 5000-node topologies
- PlanetLab measurements
  - Microbenchmarks on responsiveness
Exploiting Route Redundancy (Sim)
[Figure: % of all pairs reachable vs. proportion of IP links broken (0–0.2), comparing instantaneous IP with Tapestry / FRLS]
- Simulation of Tapestry, 2 backup paths per routing entry
- Transit-stub topology shown; results from TIER and AS graphs are similar
Responsiveness to Faults (PlanetLab)
[Figure: time to switch routes (ms) vs. link probe period (ms), for filter constants α = 0.2 and α = 0.4; annotated switch times of ~300 ms and ~660 ms respectively]
- Two reasonable values for the filter constant α
- Response time scales linearly with the probe period
Link Probing Bandwidth (PlanetLab)
[Figure: bandwidth per node (KB/s) vs. size of overlay (log scale, 1–1000 nodes), for probe periods PR = 300 ms and PR = 600 ms]
- Bandwidth increases logarithmically with overlay size
- Medium-sized routing overlays incur low probing bandwidth
Conclusion
- Trading flexibility for scalability and responsiveness
  - Structured routing has low path-maintenance costs
  - Allows "caching" of backup paths for quick failover
  - Can no longer construct arbitrary paths, but a simple policy exploits the available redundancy well
- Fast enough for most interactive applications
  - 300 ms beacon period → response time < 700 ms
  - ~300 nodes, bandwidth cost ≈ 7 KB/s
Ongoing Questions
- Is this the right approach?
- Is there a lower bound on desired responsiveness? Is this responsive enough for VoIP?
  - If not, is multipath routing the solution?
- What about deployment issues? How does inter-domain deployment happen?
  - A third-party approach? ("Akamai for routing")
Related Work
- Redirection overlays: Detour (IEEE Micro 99), Resilient Overlay Networks (SOSP 01), Internet Indirection Infrastructure (SIGCOMM 02), Secure Overlay Services (SIGCOMM 02)
- Topology estimation techniques: adaptive probing (IPTPS 03), Internet tomography (IMC 03), routing underlay (SIGCOMM 03)
- Many, many other structured peer-to-peer overlays
Thanks to Dennis Geels / Sean Rhea for their work on BMark
Backup Slides
Another Perspective on Reachability
[Figure: proportion of all pairwise paths vs. proportion of IP links broken (0–0.2), partitioned into four regions:]
- paths where no failure-free path remains
- paths where IP and FRLS both route successfully
- paths where a path exists but neither IP nor FRLS can locate it
- paths where FRLS finds a path while short-term IP routing fails
Constrained Multicast
- Used only when all paths are below the quality threshold
- Send duplicate messages on multiple paths; leverage route convergence
- Assign unique message IDs to mark duplicates
  - Keep a moving window of recently seen IDs
  - Recognize and drop duplicates
- Limitations
  - Assumes loss is not from congestion
  - Ideal for local-area routing
[Figure: duplicate messages converging through overlay nodes toward the destination]
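The moving-window duplicate filter can be sketched as below. The window size (1024) and the integer ID scheme are assumptions; the slides only specify unique message IDs plus a moving window:

```python
from collections import deque

class DuplicateFilter:
    """Moving window of recently seen message IDs (sketch)."""
    def __init__(self, window=1024):
        self.order = deque()   # IDs in arrival order, for eviction
        self.seen = set()      # IDs currently inside the window
        self.window = window

    def accept(self, msg_id):
        """Return True for the first copy of a message, False for duplicates."""
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.order.append(msg_id)
        if len(self.order) > self.window:
            # Evict the oldest ID so state stays bounded.
            self.seen.discard(self.order.popleft())
        return True

f = DuplicateFilter()
print([f.accept(m) for m in [2281, 2530, 2281]])   # [True, True, False]
```

Duplicates from the redundant paths converge at (or near) the destination and are dropped there, so the application sees each message once.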
Latency Overhead of Misrouting
[Figure: proportional latency overhead vs. latency from source to destination (20–180 ms), plotted for duplication starting at hops 0 through 4]
Bandwidth Cost of Constrained Multicast
[Figure: proportional bandwidth overhead vs. latency from source to destination (20–160 ms), plotted for duplication starting at hops 0 through 4]