
Upload: lawrence-allison

Post on 28-Dec-2015


Week 7 Routing Protocols

2

Intra-AS Routing

• Also known as Interior Gateway Protocols (IGP)
• Most common intra-AS routing protocols:
    • RIP: Routing Information Protocol
    • OSPF: Open Shortest Path First
    • IGRP: Interior Gateway Routing Protocol (Cisco proprietary)

3

RIP (Routing Information Protocol)

• Distance vector algorithm
• Included in BSD-UNIX distribution in 1982
• Distance metric: number of hops (max = 15 hops)

[Figure: routers A, B, C, D interconnecting subnets u, v, w, x, y, z]

destination   hops
u             1
v             2
w             2
x             3
y             3
z             2

4

RIP advertisements

Distance vectors: exchanged among neighbors every 30 sec via Response Message (also called advertisement)

Each advertisement: list of up to 25 destination nets within AS

5

RIP: Example

[Figure: routers A, B, C, D and networks w, x, y, z]

Routing table in D:

Destination network   Next router   Hops to dest.
w                     A             2
y                     B             2
z                     B             7
x                     --            1
...

6

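RIP: Example

[Figure: routers A, B, C, D and networks w, x, y, z; A sends an advertisement to D]

Advertisement from A to D:

Dest   Next   Hops
w      -      -
x      -      -
z      C      4
...

Routing table in D after the advertisement (the z entry changes from next router B, 7 hops, to next router A, 5 hops):

Destination network   Next router   Hops to dest.
w                     A             2
y                     B             2
z                     A             5
x                     --            1
...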
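The update of D's table can be sketched in a few lines of Python. This is a minimal illustration, not RIP itself: the function name `apply_advertisement` and the dictionary layout are assumptions made for the example, and only the z entry of A's advertisement is used because it is the only one legible in the slide.

```python
# D's routing table before the advertisement: destination -> (next router, hops).
# "--" marks a directly attached network.
table_d = {"w": ("A", 2), "y": ("B", 2), "z": ("B", 7), "x": ("--", 1)}

# From the slide: A advertises that it reaches z in 4 hops (via C).
advert_from_a = {"z": 4}


def apply_advertisement(table, neighbor, advert, infinity=16):
    """Distance-vector update: adopt a route via `neighbor` when it is shorter.

    Hypothetical helper for illustration; 16 is RIP's "infinite" distance.
    """
    updated = dict(table)
    for dest, hops in advert.items():
        new_cost = min(hops + 1, infinity)  # one extra hop to reach the neighbor
        current = updated.get(dest)
        # Accept if the destination is new, the route is shorter, or the
        # current route already goes through this neighbor (its cost changed).
        if current is None or new_cost < current[1] or current[0] == neighbor:
            updated[dest] = (neighbor, new_cost)
    return updated


new_table = apply_advertisement(table_d, "A", advert_from_a)
print(new_table["z"])  # ('A', 5)
```

The z entry moves from (B, 7) to (A, 5) because 4 + 1 < 7; the other entries are untouched.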

7

RIP: Link Failure and Recovery

If no advertisement is heard after 180 sec, the neighbor/link is declared dead:
• routes via that neighbor are invalidated
• new advertisements are sent to neighbors
• neighbors in turn send out new advertisements (if their tables changed)
• link failure info quickly propagates to the entire net
• poison reverse is used to prevent ping-pong loops (infinite distance = 16 hops)

8

OSPF (Open Shortest Path First)

• “Open”: publicly available
• Uses link-state algorithm
    • LS packet dissemination
    • Topology map at each node
    • Route computation using Dijkstra’s algorithm
• OSPF advertisement carries one entry per neighbor router
• Advertisements disseminated to entire AS (via flooding)
    • Carried in OSPF messages directly over IP (rather than TCP or UDP)
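The route-computation step can be sketched as follows. This is a generic Dijkstra shortest-path computation over a topology map, not OSPF's actual data structures; the four-router topology and its link costs are invented for the example.

```python
import heapq


def dijkstra(graph, source):
    """Return (distance, predecessor) maps for shortest paths from `source`.

    `graph` maps each node to a dict of {neighbor: link cost}.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


# Hypothetical topology map, as each node would hold it after flooding.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

dist, prev = dijkstra(topology, "A")
print(dist)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

Because every node holds the same map, every node can run this computation independently and arrive at consistent routes.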

9

OSPF “advanced” features (not in RIP)

Security: all OSPF messages authenticated (to prevent malicious intrusion)

Multiple same-cost paths allowed (only one path in RIP)

For each link, multiple cost metrics for different TOS (e.g., satellite link cost set “low” for best effort; high for real time)

Integrated uni- and multicast support: Multicast OSPF (MOSPF) uses the same topology database as OSPF

Hierarchical OSPF in large domains.

10

Hierarchical OSPF

11

Hierarchical OSPF

• Two-level hierarchy: local area, backbone.
    • Link-state advertisements only within an area
    • Each node has detailed area topology; it only knows the direction (shortest path) to nets in other areas.
• Area border routers: “summarize” distances to nets in their own area and advertise them to other area border routers.
• Backbone routers: run OSPF routing limited to the backbone.
• Boundary routers: connect to other AS’s.

12

Internet inter-AS routing: BGP

BGP (Border Gateway Protocol): the de facto standard

BGP provides each AS a means to:
1. Obtain subnet reachability information from neighboring ASs.
2. Propagate the reachability information to all routers internal to the AS.
3. Determine “good” routes to subnets based on reachability information and policy.

Allows a subnet to advertise its existence to the rest of the Internet: “I am here”

13

BGP basics

• Pairs of routers (BGP peers) exchange routing info over semi-permanent TCP connections: BGP sessions
• Note that BGP sessions do not correspond to physical links.
• When AS2 advertises a prefix to AS1, AS2 is promising it will forward any datagrams destined to that prefix towards the prefix.
    • AS2 can aggregate prefixes in its advertisement

[Figure: AS1 (routers 1a, 1b, 1c, 1d), AS2 (routers 2a, 2b, 2c) and AS3 (routers 3a, 3b, 3c), with eBGP sessions between ASs and iBGP sessions inside each AS]

14

Distributing reachability info

• With the eBGP session between 3a and 1c, AS3 sends prefix reachability info to AS1.
• 1c can then use iBGP to distribute this new prefix reachability info to all routers in AS1
• 1b can then re-advertise the new reachability info to AS2 over the 1b-to-2a eBGP session
• When a router learns about a new prefix, it creates an entry for the prefix in its forwarding table.

[Figure: AS1, AS2 and AS3 topology with eBGP and iBGP sessions]

15

Outline

1. Highlights
2. Addressing and CIDR
3. BGP Messages and Prefix Attributes
4. BGP Decision and Filtering Processes
5. I-BGP
6. Route Reflectors
7. Multihoming
8. Aggregation
9. Routing Instability
10. BGP Table Growth

16

Internet Topology

• AS (Autonomous System) - a collection of routers under the same technical and administrative domain.
• EGP (Exterior Gateway Protocol) - used between two AS’s to allow them to exchange routing information so that traffic can be forwarded across AS borders. Example: BGP

[Figure: large ISPs, a medium ISP, a dialup ISP and a dedicated-access ISP interconnected via EGP]

17

Purpose: to share connectivity information

[Figure: routers R1, R2 and R3 in AS1 and AS2; over BGP, R2 tells R1 “you can reach net A via me”, and traffic to A then flows via R2. Legend: border router, internal router.]

Table at R1:
dest   next hop
A      R2

18

BGP Sessions

• One router can participate in many BGP sessions.
• Initially … node advertises ALL routes it wants neighbor to know (could be >50K routes)
• Ongoing … only inform neighbor of changes

[Figure: BGP sessions among AS1, AS2 and AS3]

19

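Routing Protocols

[Figure: three ASs (AS1, AS2, AS3) with routers R1 to R5 and networks A and B; E-BGP runs between border routers of different ASs, I-BGP within an AS, and an “announce B” message crosses an AS border. Legend: border router, internal router.]

IGP: Interior Gateway Protocol. Examples: IS-IS, OSPF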

20

Configuration and Policy

A BGP node has a notion of which routes to share with its neighbor. It may only advertise a portion of its routing table to a neighbor.

A BGP node does not have to accept every route that it learns from its neighbor. It can selectively accept and reject messages.

What to share with neighbors and what to accept from neighbors is determined by the routing policy, which is specified in a router’s configuration file.

21

History

Popularity:
• until a number of years ago BGP was fairly unknown; it was used by a small number of large ISPs.
• in 1995 (beginning of web popularity) the number of organizations using BGP grew tremendously.

Two reasons for growth of usage and interest:
• significant growth in the number of ISPs; organizations were born that had mission-critical dependence upon their connectivity
• CIDR (Classless Inter-Domain Routing) introduced in 1995

22

Assigning IP addresses and AS numbers (ideally)

• A host gets its IP address from the IP address block of its organization
• An organization gets an IP address block from its ISP’s address block
• An ISP gets its address block from its own provider OR from one of the 3 routing registries:
    • ARIN: American Registry for Internet Numbers
    • RIPE: Reseaux IP Europeens
    • APNIC: Asia Pacific Network Information Center
• Each AS is assigned a 16-bit number (65536 total)
    • Currently 10,000 AS’s in use

23

Addressing Schemes

Original addressing schemes (class-based): 32 bits divided into 2 parts:

Class     Network bits   Host bits
Class A   0-7            8-31
Class B   0-15           16-31
Class C   0-23           24-31     (~2 million nets, 256 hosts each)

CIDR introduced to solve 2 problems:
• exhaustion of IP address space
• size and growth rate of routing table

24

Problem #1: Lifetime of Address Space

Example: an organization needs 500 addresses.
• A single class C address is not enough (256 hosts).
• Instead a class B address is allocated (~64K hosts).
• That’s overkill: a huge waste.

CIDR allows networks to be assigned on arbitrary bit boundaries.
• permits arbitrary sized masks: 178.24.14.0/23 is valid
• requires explicit masks to be passed in routing protocols

CIDR solution for the example above: the organization is allocated a single /23 address block (equivalent of 2 class C’s).
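The /23 figure can be checked with a short calculation. The helper name `prefix_length_for` is made up for this sketch; it simply finds the smallest power-of-two block that holds the requested number of addresses, and the block 178.24.14.0 is reused from the slide's own example.

```python
import ipaddress


def prefix_length_for(addresses_needed):
    """Smallest IPv4 prefix length whose block holds `addresses_needed` addresses."""
    host_bits = (addresses_needed - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return 32 - host_bits


plen = prefix_length_for(500)
block = ipaddress.ip_network(f"178.24.14.0/{plen}")
print(plen, block.num_addresses)  # 23 512
```

A /23 holds 512 addresses, exactly two class C networks, compared with 65,536 for a class B.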

25

Problem #2: Routing Table Size

Without CIDR: the service provider announces each of its networks to the global Internet separately:

232.71.0.0, 232.71.1.0, 232.71.2.0, ….., 232.71.255.0

With CIDR: the service provider still serves 232.71.0.0, 232.71.1.0, 232.71.2.0, ….., 232.71.255.0, but announces a single aggregate to the global Internet:

232.71.0.0/16

26

CIDR: Classless Inter-Domain Routing

• Address format <IP address/prefix P>. The prefix denotes the upper P bits of the IP address.
• Idea: use aggregation. Provide routing for a large number of customers by advertising one common prefix.
    • This is possible because the nature of addressing is hierarchical
• Summarizing routing information reduces the size of routing tables while still maintaining connectivity.
• Aggregation is critical to the scalability and survivability of the Internet

27

Address Arithmetic: Address Blocks

The <address/prefix> pair defines an address block.

Examples:
• 128.15.0.0/16 => [ 128.15.0.0 - 128.15.255.255 ]
• 188.24.0.0/13 => [ 188.24.0.0 - 188.31.255.255 ]
    • consider the 2nd octet in binary: 188.00011000.0.0; the 13th bit is the last fixed bit and the remaining bits are settable

Address block sizes:
• a /13 address block has 2^(32-13) addresses (a /16 has 2^(32-16))
• a /13 address block is 8 times as big as a /16 address block because 2^(32-13) = 2^(32-16) * 2^3
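Both examples and the size comparison can be verified with Python's standard `ipaddress` module. This is only a check of the arithmetic above, using the two blocks from the slide.

```python
import ipaddress

block16 = ipaddress.ip_network("128.15.0.0/16")
block13 = ipaddress.ip_network("188.24.0.0/13")

# First and last address of each block.
print(block16[0], block16[-1])  # 128.15.0.0 128.15.255.255
print(block13[0], block13[-1])  # 188.24.0.0 188.31.255.255

# Block sizes: 2^(32-13) versus 2^(32-16).
print(block13.num_addresses == 2 ** (32 - 13))  # True
print(block13.num_addresses // block16.num_addresses)  # 8
```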

28

CIDR: longest prefix match

• Because prefixes of arbitrary length are allowed, overlapping prefixes can exist.
• Example: a router hears 124.39.0.0/16 from one neighbor and 124.39.11.0/24 from another neighbor
• The router forwards a packet according to the most specific forwarding information, called longest prefix match
    • A packet with destination 124.39.11.32 will be forwarded using the /24 entry.
    • A packet with destination 124.39.22.45 will be forwarded using the /16 entry
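A naive longest-prefix match over this two-entry table might look like the sketch below. It scans every entry linearly, which is fine for illustration but not how real routers do it; the neighbor labels `neighbor-1` and `neighbor-2` are placeholders, not names from the slide.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return (network, next_hop) for the most specific matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best


# Prefixes from the slide; the next-hop names are hypothetical.
table = {
    "124.39.0.0/16": "neighbor-1",
    "124.39.11.0/24": "neighbor-2",
}

print(longest_prefix_match(table, "124.39.11.32")[1])  # neighbor-2 (the /24 entry)
print(longest_prefix_match(table, "124.39.22.45")[1])  # neighbor-1 (the /16 entry)
```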

29

Will CIDR work?

For CIDR to be successful:
• address registries must assign addresses using a CIDR strategy
• providers and subscribers should configure their networks and allocate addresses to allow for a maximum amount of aggregation
• BGP must be configured to do aggregation as much as possible

Factors that complicate achieving aggregation:
• multihoming, proxy aggregation, changing providers

30

Four Basic Messages

Open: Establishes BGP session (uses TCP port #179)

Notification: Report unusual conditions

Update:
• Inform neighbor of new routes that become active
• Inform neighbor of old routes that become inactive

Keepalive: Inform neighbor that connection is still viable

33

UPDATE Message

• used to advertise and/or withdraw prefixes
• path attributes: list of attributes that pertain to ALL the prefixes in the Reachability Info field

FORMAT:
    Withdrawn routes length (2 octets)
    Withdrawn routes (variable length)
    Total path attributes length (2 octets)
    Path Attributes (variable length)
    Reachability Information (variable length)

35

ATTRIBUTES

Prefix            Next hop      AS Path
128.73.4.21/21    232.14.63.4   1239 701 3985 631

ORIGIN:
• Who originated the announcement? Where was a prefix injected into BGP?
• IGP, EGP or Incomplete (often used for static routes)

AS-PATH:
• a list of AS’s through which the announcement for a prefix has passed
• each AS prepends its AS # to the AS-PATH attribute when forwarding an announcement
• useful to detect and prevent loops

36

AS-Path Attribute

• Sequence of ASes a route has traversed
• Loop detection
• Apply policy

[Figure: AS 100 (180.10.0.0/16), AS 200 (170.10.0.0/16), AS 300, AS 400 (150.10.0.0/16) and AS 500]

Network          Path
180.10.0.0/16    300 200 100
170.10.0.0/16    300 200
150.10.0.0/16    300 400

A second table in the figure lists only two of these routes:

Network          Path
180.10.0.0/16    300 200 100
170.10.0.0/16    300 200

37

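Loop Handling

[Figure: AS1 originates 172.16.10.0/24; the announcement passes through AS2, AS3 and AS4 and back towards AS1, with the AS path growing at each hop]

172.16.10.0/24 -- 1
172.16.10.0/24 -- 2 1
172.16.10.0/24 -- 3 2 1
172.16.10.0/24 -- 4 3 2 1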
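The loop check in this figure amounts to a membership test on the AS path. The sketch below is a simplification (real BGP distinguishes receiving from re-advertising, and AS paths carry segment types); `process_announcement` is a name invented for the example.

```python
def process_announcement(local_as, as_path):
    """Return the path this AS would re-advertise, or None if it detects a loop."""
    if local_as in as_path:
        return None  # our own AS number is already in the path: drop it
    return [local_as] + as_path  # prepend our AS number before forwarding


# AS1 originates 172.16.10.0/24 with path [1]; AS2, AS3 and AS4 forward it in turn.
path = [1]
for asn in (2, 3, 4):
    path = process_announcement(asn, path)

print(path)  # [4, 3, 2, 1]
print(process_announcement(1, path))  # None: AS1 sees itself and rejects the route
```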

38

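AS_path Manipulation

[Figure: AS50 originates 192.213.1.0/24 and announces it to AS100 and AS200; AS200 passes it on to AS300]

Announcements shown in the figure:
192.213.1.0/24 -- 50
192.213.1.0/24 -- 50 50 50
192.213.1.0/24 -- 200 50
192.213.1.0/24 -- 100 50 50 50
192.213.1.0/24 -- 300 200 50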

39

Attribute: NEXT HOP

• For an EBGP session, NEXT HOP = IP address of the neighbor that announced the route.
• For IBGP sessions, if the route originated inside the AS, NEXT HOP = IP address of the neighbor that announced the route
• For routes originated outside the AS, the NEXT HOP of the EBGP node that learned of the route is carried unaltered into IBGP.

[Figure: routers A, B, C and D, next-hop addresses 1.1.1.1, 2.2.2.2 and 3.3.3.3, networks 140.20.1.0/24 and 209.15.1.0/24, IBGP and EBGP sessions]

BGP Table at Router C:

Destination      Next Hop
140.20.1.0/24    2.2.2.2
209.15.1.0/24    1.1.1.1

IP Routing Table at Router C:

Destination      Next Hop
140.20.1.0/24    2.2.2.2
2.2.2.0/24       3.3.3.3
3.3.3.0/24       Connected
209.15.1.0/24    1.1.1.1
1.1.1.0/24       3.3.3.3

40

Attribute: Multi-Exit Discriminator (MED)

• Used when AS’s are interconnected via 2 or more links
• Hint to external neighbors about the preferred path into an AS that has multiple entry points
• AS announcing prefix sets MED
    • enables AS2 to indicate its preference
• AS receiving prefix uses MED to select link
• a way to specify how close a prefix is to the link it is announced on
• a lower value is preferred
• called an external metric

[Figure: AS1 and AS2 joined by Link A and Link B, with MED=10 on one link and MED=50 on the other; AS3 and AS4 also shown]

41

Attribute: Local Preference

• Used to indicate preference among multiple paths for the same prefix anywhere in the Internet.
• The higher the value, the more preferred
• Exchanged between IBGP peers only. Local to the AS.
• Often used to select a specific exit point for a particular destination

[Figure: AS4 reaches 140.20.1.0/24 in AS1 via AS2 or via AS3]

BGP table at AS4:

Destination      AS Path    Local Pref
140.20.1.0/24    AS3 AS1    300
140.20.1.0/24    AS2 AS1    100

42

Example

[Figure: networks ANET, ZNET, XNET, YNET and WINET; 128.213.0.0/16 is learned at two sites, SJ and LA, over a T1 link and a T3 link. One site sets local preference to 200, the other to 300.]

Affects the traffic going out of the AS

43

Routing Process Overview

Routes received from peers
  -> Input policy engine (accept, deny, set preferences)
  -> Decision process (choose best route; BGP table)
  -> Routes used by router (IP routing table)
  -> Output policy engine (forward, not forward, set MEDs)
  -> Routes sent to peers

Page 145 of Halabi

44

Input Policy Engine

Inbound filtering controls outbound traffic
• filters route updates received from other peers
• filtering based on IP prefixes, AS_PATH, community
• denying a prefix means BGP does not want to reach that prefix via the peer that sent the announcement
• accepting a prefix means traffic towards that prefix may be forwarded to the peer that sent the announcement

Attribute manipulation
• sets attributes on accepted routes
• example: specify LOCAL_PREF to set priorities among multiple peers that can reach a given destination

45

BGP Decision Process

1. Choose route with highest LOCAL-PREF

2. If have more than 1 route, select route with shortest AS-PATH

3. If have more than 1 route, select according to lowest ORIGIN type where IGP < EGP < INCOMPLETE

4. If have more than 1 route, select route with lowest MED value

5. Select min cost path to NEXT HOP using IGP metrics

6. If have multiple internal paths, use BGP Router ID to break tie.
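The six steps can be expressed as a single sort key, compared field by field. This is a simplified sketch rather than a vendor implementation: the `Route` fields are invented for the example, the Router ID is modeled as a plain integer with the lowest value winning (an assumption, since the slide does not say which direction breaks the tie), and MED is compared across all routes even though real implementations usually compare it only between routes from the same neighboring AS.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP < EGP < INCOMPLETE.
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass
class Route:
    local_pref: int
    as_path: tuple
    origin: str
    med: int
    igp_cost: int   # IGP metric to the NEXT HOP
    router_id: int  # hypothetical numeric Router ID of the advertising peer


def preference_key(route):
    """Tuple that sorts the most preferred route first."""
    return (
        -route.local_pref,          # 1. highest LOCAL-PREF
        len(route.as_path),         # 2. shortest AS-PATH
        ORIGIN_RANK[route.origin],  # 3. lowest ORIGIN type
        route.med,                  # 4. lowest MED
        route.igp_cost,             # 5. min-cost path to NEXT HOP
        route.router_id,            # 6. Router ID tie-break (lowest, by assumption)
    )


def best_route(routes):
    return min(routes, key=preference_key)


# The two routes to 140.20.1.0/24 from the Local Preference slide.
via_as3 = Route(300, ("AS3", "AS1"), "IGP", 0, 10, 1)
via_as2 = Route(100, ("AS2", "AS1"), "IGP", 0, 5, 2)
print(best_route([via_as3, via_as2]).as_path)  # ('AS3', 'AS1')
```

Because tuples compare left to right, a later criterion is consulted only when all earlier ones tie, which matches the "if have more than 1 route" wording of each step.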

46

Output Policy Engine

Outbound filtering controls inbound traffic
• forwarding a route means others may choose to reach the prefix through you
• not forwarding a route means others must use another router to reach the prefix
• may depend upon whether the peer is E-BGP or I-BGP
• example: if ORIGIN=EGP and you are a non-transit AS and the BGP peer is a non-customer, then don’t forward

Attribute manipulation
• sets attributes such as AS_PATH and MEDs

47

Transit vs. Nontransit AS

Transit traffic = traffic whose source and destination are outside the AS

Nontransit AS: does not carry transit traffic
• Advertise own routes only
• Do not propagate routes learned from other AS’s
• case 1: [Figure: AS1 (route r2) connected to ISP1 (route r1) and ISP2 (route r3); AS1 learns r1 and r3 but advertises only r2 to each ISP]
• case 2: [Figure: AS1 and AS2 with routes r1, r2 and r3; AS1 again advertises only its own route r2]

Transit AS: does carry transit traffic
• Advertises its own routes PLUS routes learned from other AS’s

[Figure: AS1 connected to ISP1 and ISP2; AS1 advertises r2,r3 to ISP1 and r2,r1 to ISP2]

48

Clients, Providers and Peers

• AS has customers, providers and peers
• Relationships between AS pairs:
    • customer-provider
    • peer-to-peer
• Type of relationship influences policies

Exporting to provider: AS exports its routes & its customers’ routes, but not routes learned from other providers or peers

Exporting to peer: (same as above)

Exporting to customer: AS exports its routes plus routes learned from its providers and peers
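These export rules reduce to a small predicate. The function name and the string labels below are made up for the sketch. One assumption goes beyond the slide: routes learned from one customer are also exported to other customers, which is common practice but not stated above.

```python
def should_export(learned_from, exporting_to):
    """Decide whether a route is exported, given where it was learned.

    learned_from: "self", "customer", "provider" or "peer"
    exporting_to: "customer", "provider" or "peer"
    """
    if exporting_to == "customer":
        # Customers receive everything (assumption: including other customers' routes).
        return True
    # To providers and peers: only own routes and customers' routes.
    return learned_from in ("self", "customer")


print(should_export("customer", "provider"))  # True
print(should_export("provider", "peer"))      # False
print(should_export("peer", "customer"))      # True
```

The effect is that an AS carries traffic between two of its neighbors only when at least one of them is its customer.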

49

Internal BGP (I-BGP)

• Used to distribute routes learned via EBGP to all the routers within an AS
• I-BGP and E-BGP are the same protocol in that
    • same message types used
    • same attributes used
    • same state machine
    • BUT use different rules for readvertising prefixes
• Rule #1: prefixes learned from an E-BGP neighbor can be readvertised to an I-BGP neighbor, and vice versa
• Rule #2: prefixes learned from an I-BGP neighbor cannot be readvertised to another I-BGP neighbor

50

I-BGP: Preventing Loops and Setting Attributes

• Why rule #2? To prevent announcements from looping.
    • In E-BGP loops can be detected via AS-PATH.
    • AS-PATH is not changed in I-BGP
• Implication of rule: a full mesh of I-BGP sessions between each pair of routers in an AS is required
• Setting attributes: the router that injects the route into the I-BGP mesh is responsible for
    • setting the LOCAL-PREF attribute
    • prepending AS # to AS-PATH

51

Route Reflectors

• Problem: requiring a full mesh of I-BGP sessions between all pairs of routers is hard to manage for large AS’s.
• Solution: group routers into clusters. Assign a leader to each cluster, called a route reflector (RR).
• Members of a cluster are called clients of the RR
• I-BGP peering
    • clients peer only with their RR
    • RR’s must be fully meshed

[Figure: three clusters A, B and C, each with a route reflector (RR) and its clients]

52

Route Reflectors: Rule on Announcements

• If received from RR, reflect to clients
• If received from a client, reflect to RRs and clients
• If received from E-BGP, reflect to all: RRs and clients
• RR’s reflect only the best route to a given prefix, not all announcements they receive.
    • helps size of routing table
    • sometimes clients don’t need to carry full table

53

Default Routes

• If you don’t have a routing entry in the table for a destination, send the packet along the default route
• Can be statically configured
• Can be learned via BGP
• Can have multiple defaults
    • use LOCAL_PREF to distinguish primary and backup default routes

[Figure: AS1 and AS2; two default routes, 0.0.0.0/0 next_hop=1.1.1.1 and 0.0.0.0/0 next_hop=2.2.2.2, one with local pref=100 and the other with local pref=300]

54

Multihoming

55

Single-homed vs. Multi-homed subscribers

• A single-homed network has one connection to the Internet (i.e., to networks outside its domain)
• A multi-homed network has multiple connections to the Internet. Two scenarios:
    • can be multi-homed to a single provider
    • can be multi-homed to multiple providers
• Why multi-home?
    • Reliability
    • Performance: a site’s bandwidth to the Internet is the sum of the bandwidth on all links

56

Single-homed AS

• Subscriber called a “stub AS”
• Provider-subscriber communication for route advertisement:
    • static configuration (most common)
        • Provider’s router is configured with subscriber’s prefix.
        • good if customer’s routes can be represented by a small set of aggregate routes
        • bad if customer has many noncontiguous subnets
    • can use BGP between provider and stub AS

[Figure: routers R1 and R2 linking the subscriber and the provider]

57

Multihoming to a Single Provider: 4 scenarios

[Figure: four topologies connecting a customer to one ISP over multiple links, using routers R1, R2; R1, R2; R1, R2, R3; and R1, R2, R3, R4]

58

Multihoming to Multiple Providers

Customer

ISP 1 ISP 2

ISP 3

59

Multihoming Issues

• Load sharing
    • how to distribute the traffic over the multiple links?
• Reliability
    • if load sharing leads to preferring certain links for certain subnets, is reliability reduced?
• Address/Aggregation
    • which subnet addresses should the multihomed customer use?
    • how will this affect its provider’s ability to aggregate routes?

60

Load sharing from Customer to ISP using policy

• Goal: send traffic to the ISP’s customers on one link; send traffic to the rest of the Internet on the 2nd link
• Implement using policy to control announcements

[Figure: customer connected to the ISP over two links via routers R1, R2 and R3. On one link the ISP advertises customer routes only; on the other it advertises the default route 0/0. Announcements are drawn in blue, traffic in red.]

61

Load sharing from ISP to customer using attributes

• Goal: provider splits traffic across 2 links according to prefix
• Implement this strategy using attributes
    • customer sets MEDs
    • provider sets LOCAL_PREF

62

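Example

[Figure: sites SF and NY connected by IBGP; customers C2, C3, C4 and C5; prefixes (X,Y) at SF and (Z,W) at NY]

Attribute values shown in the figure:
• C5:300, C4:300, Rest:250
• C3:300, C2:300, Rest:200
• X:200, Y:200, Rest:300
• W:200, Z:200, Rest:250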

63

Example

[Figure: customer connected to the ISP via routers R1, R2 and R3; prefixes 140.35/16 and 208.22/16 are labeled on the links]

64

Address/Aggregation Issue

Where should the customer get its address block from?
1. From ISP1
2. From ISP2
3. From both ISP1 and ISP2
4. Independently from address registry

(cases 1 and 2 are equivalent)

[Figure: customer multihomed to ISP 1 and ISP 2, which both connect to ISP 3]

65

Case 1 & 2: Get address block from one ISP

[Figure: ISP 1 (140.20/16) and ISP 2 (200.50/16) both connect to ISP 3. The customer (140.20.6/24) announces 140.20.6/24 to both ISPs. ISP 1 announces 140.20/16 to ISP 3; ISP 2 announces 200.50/16 and 140.20.6/24.]

• example: customer gets address from ISP 1
• ISP 1’s aggregation is not broken
• customer’s prefix not aggregatable at ISP 2
• longer prefix becomes a traffic magnet

How good is load sharing? If all ISPs generate the same amount of traffic for the customer, then the ISP2-customer link is twice as loaded as the ISP1-customer link

66

Case 3: Get address block from both ISPs

• announcement policy: announce each prefix only to its “parent”
• advantage: both ISPs can aggregate the prefix they receive
• disadvantage: lose reliability
• load balancing good? depends upon how much traffic is sent to each prefix

[Figure: ISP 1 (140.20/16) and ISP 2 (200.50/16) both connect to ISP 3. The customer has 140.20.1/24 and 200.50.1/24 and announces 140.20.1/24 to ISP 1 and 200.50.1/24 to ISP 2.]

67

Case 4: obtain address block from registry

• no aggregation possible
• no traffic magnets created
• good reliability
• how to achieve load sharing?
    • customer breaks address block into 2 /25 blocks, and announces one per link (but may lose reliability)
    • OR use method of AS-PATH manipulation

[Figure: ISP 1 (100.20/16) and ISP 2 (200.50/16) both connect to ISP 3. The customer announces 150.55.10/24 to both ISPs, and each ISP passes 150.55.10/24 on to ISP 3 alongside its own block.]

68

AS-PATH manipulation

• Idea: prepend your AS number in AS-PATH multiple times to discourage use of a link
    • makes AS-PATH seem longer than it is
    • recall BGP decision process uses shortest AS-PATH length as a criterion for selecting best path
• Example: ISP 3 will choose the path through AS 2 because its AS-PATH appears shorter

[Figure: customer AS 33 (150.55.10/24) connects to AS 1 and AS 2, which both connect to ISP 3]

Announcements:
• customer to AS 1: 150.55.10/24 - 33 33
• customer to AS 2: 150.55.10/24 - 33
• AS 1 to ISP 3: 150.55.10/24 - 1 33 33
• AS 2 to ISP 3: 150.55.10/24 - 2 33

69

Aggregation

70

Address Arithmetic: When is aggregation possible? Case 1

• Possible when one prefix is contained in another.
• Example: two AS’s having a customer-provider relationship. Provider does the aggregation.
    • Provider has address block 18.0.0.0/8
    • Its customers have address blocks 18.6.0.0/15 and 18.9.0.0/15
    • Provider announces its address block only

Rule: prefix p1 contains prefix p2 iff
    length(p2) > length(p1) AND
    address(p2) / 2^(32-length(p1)) = address(p1) / 2^(32-length(p1))
(where / is integer division)
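The containment rule translates directly into integer arithmetic, with a right shift playing the role of the division by 2^(32-length(p1)). The function name `contains` is chosen for this sketch; the example uses the provider block and the first customer block from the slide.

```python
import ipaddress


def contains(p1, p2):
    """True iff prefix p1 contains prefix p2, following the rule above."""
    n1, n2 = ipaddress.ip_network(p1), ipaddress.ip_network(p2)
    shift = 32 - n1.prefixlen  # dividing by 2^(32-length(p1)) is a right shift
    same_upper_bits = (
        int(n2.network_address) >> shift == int(n1.network_address) >> shift
    )
    return n2.prefixlen > n1.prefixlen and same_upper_bits


print(contains("18.0.0.0/8", "18.6.0.0/15"))  # True: provider block covers the customer
print(contains("18.6.0.0/15", "18.0.0.0/8"))  # False: a longer prefix cannot contain a shorter one
```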

71

Address Arithmetic: When is aggregation possible? Case 2

• Some pairs of consecutive prefixes
• Example: routes within the same AS. AS has 2 address blocks:
    1.2.2.0/24 = 00000001.00000010.00000010.00000000/24
    1.2.3.0/24 = 00000001.00000010.00000011.00000000/24
  can announce instead 1.2.2.0/23

Rule: consecutive prefixes p1 and p2 are aggregatable iff
    length(p1) = length(p2) AND
    address(p1) / 2^(32-length(p1)) + 1 = address(p2) / 2^(32-length(p2)) AND
    address(p1) % 2^(33-length(p1)) = 0
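The three conditions of this rule can be checked mechanically. The sketch below follows the rule term by term; the function names are illustrative, and `supernet()` from the standard `ipaddress` module is used only to produce the merged prefix once the rule says merging is allowed.

```python
import ipaddress


def aggregatable(p1, p2):
    """True iff consecutive prefixes p1 and p2 can be merged into one shorter prefix."""
    n1, n2 = ipaddress.ip_network(p1), ipaddress.ip_network(p2)
    if n1.prefixlen != n2.prefixlen or n1.prefixlen == 0:
        return False
    size = 2 ** (32 - n1.prefixlen)  # addresses per block
    a1, a2 = int(n1.network_address), int(n2.network_address)
    consecutive = a1 // size + 1 == a2 // size
    aligned = a1 % (2 * size) == 0   # 2 * size = 2^(33 - length(p1))
    return consecutive and aligned


def aggregate(p1, p2):
    """Return the merged prefix as a string, or None if the pair is not aggregatable."""
    if not aggregatable(p1, p2):
        return None
    return str(ipaddress.ip_network(p1).supernet())


print(aggregate("1.2.2.0/24", "1.2.3.0/24"))  # 1.2.2.0/23
print(aggregate("1.2.3.0/24", "1.2.4.0/24"))  # None: consecutive, but 1.2.3.0 is not aligned
```

The alignment condition is what rules out the second pair: 1.2.3.0/24 and 1.2.4.0/24 are adjacent, yet no single /23 covers exactly those two blocks.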

73

Black holes and cardinal sins

“The cardinal sin of BGP routing is advertising routes that you don't know how to get to; called "black-holing" someone”

• If you announce part of the IP space owned by someone else, using a more-specific prefix, all their traffic flows to you. This disconnects that address block from the Internet.
• Example: black holes can happen inadvertently through careless aggregation

[Figure: ISP 1 (100.24/16) and ISP 2 (222.2/16) meet at a NAP and announce 100.24/16 and 222.2/16. Three networks are shown: Net A, 100.24.16.0/21 [100.24.16.0-100.24.23.255]; Net B, 100.24.0.0/20 [100.24.0.0-100.24.15.255]; Net C, 100.24.56.0/21 [100.24.56.0-100.24.63.255]. An aggregate announcement of 100.24.0.0/18 is marked “wrong !!”.]

74

Limitations to Aggregation

• Lack of hierarchical allocation of address space prior to CIDR (before 1995)
• A single AS can have noncontiguous address blocks
• Customer AS and provider AS can have noncontiguous address blocks
• Reluctance of customers to renumber their address space when they switch providers
• Multi-homing
    • multi-homed prefixes require global visibility
    • may choose not to: load sharing

75

Rules of Thumb for Aggregation

• To avoid black holes: an ISP is not allowed to aggregate routes unless it is a supernet of the address block it wants to aggregate
• In other words, an ISP has to specifically announce each IP address range of its downstream customers that is not part of its own address space.

76

Routing Instability

77

Route Stability

• Routing instability: rapid fluctuation of network reachability information
• Route flapping: when a route is withdrawn and re-announced repeatedly in a short period of time
    • happens via UPDATE messages
    • because messages propagate to the global Internet, route flapping behavior can cascade and deteriorate routing performance in many places
• Effects: increased packet loss, increased network latency, CPU overhead, loss of connectivity

Route Stability Studies by C. Labovitz, R. Malan & F. Jahanian

78

Types of Routing Updates

• Forwarding instability
    • reflects legitimate topology changes
    • e.g., changes in Prefix, NEXT_HOP and/or ASPATH
    • affects forwarding paths used
• Policy fluctuation
    • reflects changes in policy
    • e.g., changes in MED, LOCAL_PREF, etc.
    • may not necessarily affect forwarding paths used
• Pathological
    • redundant messages
    • reflect neither topology nor policy changes

79

Anecdotes of Route Flap Storms

• April 25, 1997 - a small Virginia ISP injected an incorrect map into the global Internet. The map said the Virginia ISP had optimal connectivity to all destinations. Everyone sent their traffic to this ISP. Result: shutdown of Tier-1 ISPs for 2 hours.
• August 14, 1998 - a misconfigured database server forwarded all queries to “.net” to the wrong server. Result: loss of connectivity to all .net servers for a few hours.
• Nov. 8, 1998 - a router software bug led to a malformed routing control message. Caused an interoperability problem between Tier-1 ISPs. Result: persistent pathological oscillations and connectivity loss for several hours.

80

Taxonomy (as per Labovitz et al.)

Name     Type                                                                                Character
WADiff   Explicit withdrawal followed by announcement. Replace route with different path.   Legitimate
AADiff   Announced twice (implicit withdrawal). Replace route with different path.          Legitimate
WADup    Explicit withdrawal followed by announcement. Replace route with same path.        Legitimate or pathological
AADup    Announced twice (implicit withdrawal). Replace route with same path.               Policy change or pathological
WWDup    Repeated duplicate withdrawals                                                      Pathological

81

General Statistics

1996: 3-5 million updates per day in Internet core

1998: 300K-700K updates per day in Internet core

1996: average number of announcements per day was ~275K.

1998: average number of announcements per day was ~400K

Correlation of instability and usage:
• instability highest during business hours
• instability lowest during nights, on weekends and in summer

82

Per Event Type Statistics

1996 relative impact (approximately): WWDup (96%), AADup (2%), WADup (1%), AADiff(1/2%), WADiff (1/2%)

1998 relative impact (approximately): AADup (30-40%), WWDup(25-30%), AADiff (~20%) other (rest)

83

Who’s Responsible?

AS’s
• No single AS dominates instability statistics
• No correlation between the size of an AS and its share of updates generated.

Prefixes
• Instability is evenly distributed across routes.
• Example of measurements:
    • 75% of AADiff events come from prefixes that change less than 10 times a day.
    • 80-90% of instability comes from prefixes that are announced less than 50 times/day.

84

Sources of Instabilities in General

• router configuration errors
• transient physical and data link problems
• software bugs
• problems with leased lines (electrical timing issues that cause false alarms of disconnect)
• router failures
• network upgrades and maintenance

85

Instability Problem and Cause. Example 1

• Problem: 3-5 million duplicate withdrawals
• Cause: stateless BGP implementation
    • time-space tradeoff: no state maintained on what was advertised to peers
    • on receiving any change, transmit a withdrawal to all peers regardless of whether they were previously notified or not
    • sent updates for both explicit and implicit withdrawals
• By 1998, most vendors had BGP implementations with partial state.
• Result: number of WWDups reduced by an order of magnitude

86

Instability Problem and Cause. Example 2

• Problem: duplicate announcements
• Cause: min-advertisement timer & stateless BGP
    • min-adv timer: wait 30 seconds. Combine all updates received in the last 30 seconds into a single outbound update message (if possible).
    • within 30 seconds a route can be withdrawn and re-announced so that there is no net change to the original announcement
• Solution: have BGP keep some state about messages recently sent to peers. Avoid sending duplicate messages

87

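Instability Problem and Cause. Example 3

Example: interaction IGP/BGP
Policy: set MED using IGP metrics, such as shortest path

[Figure: AS 1 with routers R1 to R6 running an IGP with labeled link costs, connected over E-BGP to routers R7 and R8 in AS2 and AS3. Net X is announced with MED values MED=15, MED=5 and MED=24. Legend: border router, internal router.]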

88

Controlling route instability: Route Dampening

• track the number of times a route has flapped over a period of time
• routes that exhibit a high level of instability in a short period of time should be suppressed (not advertised)
• penalize ill-behaved routes proportionally to their expected future instability
• if a suppressed route stops flapping for a long enough period of time, unsuppress it (readvertise)

89

Route Dampening

[Figure: penalty plotted against time, with horizontal lines marking the suppress limit and the reuse limit]

90

Route Dampening Algorithm

Each time a route flaps, increase the penalty

If the route has not flapped in the last ‘half-life’ time units, then cut penalty in half.

If the penalty > suppress limit, then suppress the route

If the penalty < reuse limit, then free a suppressed route
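The four rules can be modeled with a small state machine. This is a sketch of the algorithm as stated on the slide, not of any router's implementation: the class name, the discrete time steps and all numeric parameters (penalty per flap, limits, half-life) are illustrative values chosen for the example.

```python
class DampenedRoute:
    """Tracks the dampening penalty for one route (illustrative parameters)."""

    def __init__(self, penalty_per_flap=1000, suppress_limit=2000,
                 reuse_limit=750, half_life=15):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life = half_life
        self.penalty = 0.0
        self.suppressed = False
        self.quiet_time = 0  # time units since the last flap

    def flap(self):
        """Each time the route flaps, increase the penalty."""
        self.penalty += self.penalty_per_flap
        self.quiet_time = 0
        if self.penalty > self.suppress_limit:
            self.suppressed = True

    def tick(self, units=1):
        """Advance time; halve the penalty after each flap-free half-life."""
        for _ in range(units):
            self.quiet_time += 1
            if self.quiet_time % self.half_life == 0:
                self.penalty /= 2
            if self.suppressed and self.penalty < self.reuse_limit:
                self.suppressed = False  # free the suppressed route


route = DampenedRoute()
for _ in range(3):
    route.flap()
print(route.penalty, route.suppressed)  # 3000.0 True

route.tick(45)  # three quiet half-lives: 3000 -> 1500 -> 750 -> 375
print(route.penalty, route.suppressed)  # 375.0 False
```

Keeping the reuse limit well below the suppress limit gives hysteresis, so a route is not released the moment its penalty dips under the suppression threshold.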

91

Side Effect of Route Dampening

• A legitimate update may arrive and be ignored because that route has been suppressed and not yet released.
• The modification needed for the legitimate announcement is delayed

92

Aggregation can help route flapping

• If a more-specific route is flapping, but the provider only announces the aggregated prefix, then other networks don’t see the route flap.
• Hence aggregation can mask route flapping.
• Aggregation helps combat instability because it reduces the number of networks visible in the core Internet.

[Figure: customer 140.40.4/24 (flapping) is announced to its Tier 2 provider, which announces only 140.40/16 (not flapping) to the Tier 1 provider]

93

Growth of BGP Tables

94

Long Term Growth Trends in Internet Routing

• Will this routing system be able to scale and meet the growth of the Internet and its ever-expanding level of demands?
    • Are there any inherent limitations?
• As more devices connect to the Internet and consume addresses, the need to maintain reachability to these addresses implies larger routing tables
• What is the ability of the system to produce a stable view of the overall network topology?

95

BGP Table Growth (1989-2001)

97

Reasons for Exponential Growth

Measures                                Growth
Number of AS’s                          Growing exponentially - 50% yearly
Range of addresses spanned by table     Growing 7% yearly
Average span of route table entry       Growing 1 bit per 29 months
Holes in the table                      Currently at 37%
Number of announcements                 Growing 42% yearly

Data in last 3 slides from Geoff Huston’s INET publications

98

Increasingly fine levels of routing detail in BGP table

• AS space growing at 50% & addresses spanned by the table growing at 7% => each AS is advertising smaller address ranges
• /24 is the fastest-growing prefix in the table (in absolute terms).
• /24 - /31 is the area with fastest relative growth
• 1999: average span of prefix 16,000 addresses (mean prefix length 18.03)
• 2000: average span of prefix 12,000 addresses (mean prefix length 18.44)

99

Holes

• When both an aggregated prefix and a more-specific prefix exist in the table, the more-specific prefix is called a “hole”.
• Why are holes created?
    • Punch a hole in the policy of the larger aggregated announcement to create a different policy for the finer announcement.
    • Load sharing in a multihoming scenario
    • reliability via multihoming
• 37% of BGP table is holes.

100

Multihoming vs. Resiliency

• Multiple peering relationships can be cheaper than using a single upstream provider
    • implies: multihoming is seen as a substitute for upstream service resiliency
• Impact
    • providers can’t command a price for reliability, and thus don’t spend money to engineer it into the network.
    • resiliency is becoming the responsibility of the customer, not the provider
• Can BGP scale adequately to continue to undertake this role?

101

Conclusions

Things are getting better (stability)
  router software and configuration management are maturing
  increased emphasis on aggregation and route dampening is helping

Things are getting worse (scalability)
  multihoming is still growing
  Internet topology is growing less hierarchical
  ‘noise’ in the BGP table is growing

102

Longest Prefix Match Algorithms

103

Forwarding Engine

[Figure: a packet (header, payload) enters the router; the destination address from the header is looked up in the routing lookup data structure, which returns the outgoing port.]

Forwarding Table

Dest-network      Port
65.0.0.0/8        3
128.9.0.0/16      1
149.12.0.0/19     7

104

The Search Operation is not a Direct Lookup

[Figure: direct lookup, where (incoming port, label) is used as the address into a memory whose data is (outgoing port, label).]

IP addresses: 32 bits long => 4G entries

105

The Search Operation is also not an Exact Match Search

Exact match search: search for a key in a collection of keys of the same length.

Relatively well studied data structures:
  Hashing
  Balanced binary search trees

106

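Example Forwarding Table

Destination IP Prefix    Outgoing Port
65.0.0.0/8               3
128.9.0.0/16             1
142.12.0.0/19            7

IP prefix: 0-32 bits (the number after the slash is the prefix length)

[Figure: the address line from 0 to 2^32-1, with a marker at 2^24, showing the ranges covered by 65.0.0.0/8 (65.0.0.0 to 65.255.255.255), 128.9.0.0/16 and 142.12.0.0/19; the address 128.9.16.14 falls inside 128.9.0.0/16.]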

107

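Prefixes can Overlap

Routing lookup: Find the longest matching prefix (aka the most specific route) among all prefixes that match the destination address.

[Figure: the address line from 0 to 2^32-1 showing 65.0.0.0/8, 128.9.0.0/16 and 142.12.0.0/19, with the more-specific prefixes 128.9.16.0/21, 128.9.172.0/21 and 128.9.176.0/24 inside 128.9.0.0/16. The address 128.9.16.14 is marked together with its longest matching prefix.]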
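The definition of longest prefix match can be written as a naive linear scan. This is a sketch for clarity only, not how a router would implement it. The ports for 128.9.16.0/21 and 128.9.176.0/24 are hypothetical (the slides give ports only for the first three prefixes), and `longest_prefix_match` is a name chosen for this example.

```python
import ipaddress

TABLE = {
    "65.0.0.0/8": 3,
    "128.9.0.0/16": 1,
    "142.12.0.0/19": 7,
    "128.9.16.0/21": 2,    # hypothetical port, not from the slides
    "128.9.176.0/24": 5,   # hypothetical port, not from the slides
}

def longest_prefix_match(table, dest):
    """Scan every prefix and keep the matching one with the greatest length."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, port in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    if best is None:
        return None
    return str(best[0]), best[1]

print(longest_prefix_match(TABLE, "128.9.16.14"))  # matches both the /16 and the /21
print(longest_prefix_match(TABLE, "128.9.200.1"))  # matches only the /16
```

The scan costs one comparison per table entry, which is why the rest of the deck looks at data structures that avoid touching every prefix.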

108

Difficulty of Longest Prefix Match

[Figure: prefixes plotted by prefix length (axis marked 8, 24, 32) against the address space: 65.0.0.0/8, 128.9.0.0/16, 142.12.0.0/19, 128.9.16.0/21, 128.9.172.0/21 and 128.9.176.0/24, with the address 128.9.16.14 marked.]

109

Lookup Rate Required

Year       Line      Line-rate (Gbps)   40B packets (Mpps)
1998-99    OC12c     0.622              1.94
1999-00    OC48c     2.5                7.81
2000-01    OC192c    10.0               31.25
2002-03    OC768c    40.0               125

DRAM: 50-80 ns, SRAM: 5-10 ns

31.25 Mpps => about 32 ns per lookup
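The packet rates follow directly from the line rate and the 40-byte minimum packet size. A short sketch of the arithmetic, assuming back-to-back 40-byte packets with no framing overhead (`lookups_per_second` is a hypothetical helper name):

```python
def lookups_per_second(line_rate_gbps, packet_bytes=40):
    """Packets (and therefore lookups) per second at full line rate."""
    return line_rate_gbps * 1e9 / (packet_bytes * 8)

for line, gbps in [("OC12c", 0.622), ("OC48c", 2.5), ("OC192c", 10.0), ("OC768c", 40.0)]:
    pps = lookups_per_second(gbps)
    # Time budget per lookup in nanoseconds is the inverse of the packet rate.
    print(f"{line}: {pps / 1e6:.2f} Mpps, {1e9 / pps:.1f} ns per lookup")
```

At OC192c the budget is about 32 ns per lookup, which is less than a single DRAM access in the quoted 50-80 ns range.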

110

Size of the Forwarding Table

[Figure: number of prefixes (0 to 100,000) by year, 1995 to 2000; growth of roughly 10,000 prefixes/year.]

Source: http://www.telstra.net/ops/bgptable.html

111

Longest Prefix Match is Harder than Exact Match

The destination address of an arriving packet does not carry with it the information to determine the length of the longest matching prefix

Hence, one needs to search among the space of all prefix lengths, as well as the space of all prefixes of a given length

112

LPM in IPv4

Use 32 exact match algorithms for LPM!

[Figure: the network address is fed to 32 exact-match units, one each for prefixes of length 1, length 2, ..., length 32; a priority encoder picks the longest matching result and outputs the port.]
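One way to picture this scheme in software is one hash table per prefix length, probed from the longest length down. This is a sketch under assumptions: hardware would probe all lengths in parallel and use a priority encoder, whereas this version probes sequentially; the class name `HashLPM` is made up; and a table for length 0 is added so a default route could be stored.

```python
import ipaddress

class HashLPM:
    """Longest prefix match built from one exact-match table per prefix length."""

    def __init__(self):
        # Index i holds the exact-match table for prefixes of length i (0..32).
        self.tables = [dict() for _ in range(33)]

    def add(self, prefix, port):
        net = ipaddress.ip_network(prefix)
        # The key is the top `prefixlen` bits of the network address.
        key = int(net.network_address) >> (32 - net.prefixlen)
        self.tables[net.prefixlen][key] = port

    def lookup(self, dest):
        addr = int(ipaddress.ip_address(dest))
        # Probing longest-first plays the role of the priority encoder.
        for length in range(32, -1, -1):
            key = addr >> (32 - length)
            if key in self.tables[length]:
                return self.tables[length][key]
        return None

lpm = HashLPM()
lpm.add("65.0.0.0/8", 3)
lpm.add("128.9.0.0/16", 1)
lpm.add("142.12.0.0/19", 7)
print(lpm.lookup("128.9.16.14"))
```

The worst case is one probe per prefix length, which is what the later trie and binary-search schemes try to improve on.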

113

Metrics for Lookup Algorithms

Speed (= number of memory accesses)
Storage requirements (= amount of memory)
Low update time (support >10K updates/s)
Scalability
  With length of prefix: IPv4 unicast (32b), Ethernet (48b), IPv4 multicast (64b), IPv6 unicast (128b)
  With size of routing table: (sweet spot for today’s designs = 1 million)
Flexibility in implementation
Low preprocessing time

114

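Radix Trie

Prefix   Bits     Next hop
P1       111*     H1
P2       10*      H2
P3       1010*    H3
P4       10101    H4

Trie node: next-hop-ptr (if prefix), left-ptr, right-ptr

[Figure: binary trie with nodes A to H and branches labeled 0 and 1; P1, P2, P3 and P4 are marked at the nodes where those prefixes end. Lookup 10111 is traced through the trie. Adding P5 = 1110* creates a new node I on a 0 branch.]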
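A minimal sketch of such a trie in Python, using the node layout named on the slide (next hop if a prefix ends at the node, plus left and right pointers). Prefixes and addresses are passed as bit strings for readability; the class names and the next hop "H5" for P5 are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.next_hop = None  # set only if a prefix ends at this node
        self.left = None      # child for bit 0
        self.right = None     # child for bit 1

class RadixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, next_hop):
        """Walk or create one node per prefix bit, then mark the last node."""
        node = self.root
        for bit in prefix_bits:
            attr = "right" if bit == "1" else "left"
            child = getattr(node, attr)
            if child is None:
                child = TrieNode()
                setattr(node, attr, child)
            node = child
        node.next_hop = next_hop

    def lookup(self, addr_bits):
        """Follow the address bits, remembering the last prefix seen on the way."""
        node = self.root
        best = node.next_hop
        for bit in addr_bits:
            node = node.right if bit == "1" else node.left
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop
        return best

trie = RadixTrie()
for bits, hop in [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]:
    trie.insert(bits, hop)

print(trie.lookup("10111"))  # walk stops below 101; the last prefix seen is P2
trie.insert("1110", "H5")    # Add P5 = 1110* (next hop name is hypothetical)
print(trie.lookup("11100"))
```

Remembering the best match during the walk means no backtracking is needed when the walk falls off the trie.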

115

Radix Trie

W-bit prefixes: O(W) lookup, O(NW) storage and O(W) update complexity

Advantages
  Simplicity
  Extensible to wider fields

Disadvantages
  Worst case lookup slow
  Wastage of storage space in chains

116

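Leaf-pushed Binary Trie

Prefix   Bits     Next hop
P1       111*     H1
P2       10*      H2
P3       1010*    H3
P4       10101    H4

Trie node: left-ptr or next-hop, right-ptr or next-hop

[Figure: leaf-pushed binary trie with internal nodes A, B, C, D, E and G and branches labeled 0 and 1; prefixes appear only at the leaves, with P2 appearing at two leaves alongside P1, P3 and P4.]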
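Leaf pushing can be sketched as a transformation of a plain binary trie: each next hop is pushed down until it sits at a leaf, so every slot holds either a pointer or a next hop, never both. The representation below (nested dicts for the plain trie, two-element lists for the leaf-pushed trie) and the function names are assumptions chosen to keep the example short.

```python
def build(prefixes):
    """Plain binary trie as nested dicts: {'hop': ..., '0': child, '1': child}."""
    root = {"hop": None}
    for bits, hop in prefixes:
        node = root
        for b in bits:
            node = node.setdefault(b, {"hop": None})
        node["hop"] = hop
    return root

def leaf_push(node, inherited=None):
    """Return a leaf-pushed copy: each slot is a next hop or a [left, right] pair."""
    hop = node["hop"] if node["hop"] is not None else inherited
    if "0" not in node and "1" not in node:
        return hop
    # A missing child becomes a leaf carrying the closest enclosing prefix.
    left = leaf_push(node["0"], hop) if "0" in node else hop
    right = leaf_push(node["1"], hop) if "1" in node else hop
    return [left, right]

def lookup(pushed, addr_bits):
    """Descend until a leaf is reached; no 'best so far' bookkeeping is needed."""
    node = pushed
    for b in addr_bits:
        if not isinstance(node, list):
            break
        node = node[int(b)]
    return None if isinstance(node, list) else node

PREFIXES = [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]
pushed = leaf_push(build(PREFIXES))
print(lookup(pushed, "10111"))
```

In this sketch H2 ends up at two leaves, which matches P2 appearing twice in the leaf-pushed figure.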

117

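PATRICIA

Practical Algorithm To Retrieve Information Coded In Alphanumeric

Prefix   Bits     Next hop
P1       111*     H1
P2       10*      H2
P3       1010*    H3
P4       10101    H4

Bit positions are numbered 1 2 3 4 5 from the left.

Patricia tree internal node: bit-position, left-ptr, right-ptr

[Figure: Patricia tree with nodes A to G; internal nodes carry the bit position to test (2, 3 and 5 appear), branches are labeled 0 and 1, and P1 to P4 sit at the leaves. Lookup 10111 is traced through the tree.]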

118

PATRICIA

W-bit prefixes: O(W^2) lookup, O(N) storage and O(W) update complexity

Advantages
  Decreased storage
  Extensible to wider fields

Disadvantages
  Worst case lookup slow
  Backtracking makes implementation complex

119

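Path-compressed Tree

Prefix   Bits     Next hop
P1       111*     H1
P2       10*      H2
P3       1010*    H3
P4       10101    H4

Path-compressed tree node structure: variable-length bitstring, next-hop (if prefix present), bit-position, left-ptr, right-ptr

[Figure: path-compressed tree with nodes A to E. Node labels read (bitstring, prefix, bit-position): A = (1, none, 2), B = (10, P2, 4), D = (1010, P3, 5); P1 and P4 are at leaf nodes. Lookup 10111 is traced through the tree.]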
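A simplified sketch of path compression: build a plain binary trie, then collapse every chain of one-child nodes that carry no prefix. Each compressed node stores the full bitstring leading to it, so the lookup can verify the skipped bits as it goes. This is an illustrative variant, not the exact node layout on the slide: the branch bit is taken to be the one right after the stored bitstring rather than a separately stored bit-position, and all names are made up for this example.

```python
def build(prefixes):
    """Plain binary trie as nested dicts: {'hop': ..., '0': child, '1': child}."""
    root = {"hop": None}
    for bits, hop in prefixes:
        node = root
        for b in bits:
            node = node.setdefault(b, {"hop": None})
        node["hop"] = hop
    return root

def compress(node, bits=""):
    """Skip chains of one-child nodes without a prefix; keep the path as 'bits'."""
    while node["hop"] is None:
        kids = [b for b in "01" if b in node]
        if len(kids) != 1:
            break
        bits += kids[0]
        node = node[kids[0]]
    out = {"bits": bits, "hop": node["hop"]}
    for b in "01":
        if b in node:
            out[b] = compress(node[b], bits + b)
    return out

def lookup(node, addr_bits):
    """Descend while the stored bitstring still matches the address."""
    best = None
    while node is not None and addr_bits.startswith(node["bits"]):
        if node["hop"] is not None:
            best = node["hop"]
        pos = len(node["bits"])
        if pos >= len(addr_bits):
            break
        node = node.get(addr_bits[pos])
    return best

PREFIXES = [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]
tree = compress(build(PREFIXES))
print(tree["bits"], lookup(tree, "10111"))
```

For the four example prefixes the compressed tree has five nodes, against eight in the uncompressed trie, and its root carries the bitstring 1 as node A does in the figure.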

120

Path-compressed Trie

W-bit prefixes: O(W) lookup, O(N) storage and O(W) update complexity

Advantages
  Decreased storage

Disadvantages
  Worst case lookup slow

121

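Binary Search by Prefix Intervals

Prefixes: {}, 1, 10, 100, 101, 1110

Intervals (5-bit addresses):
{}:   00000-11111
1:    10000-11111
10:   10000-10111
100:  10000-10011
101:  10100-10111
1110: 11100-11101

Sorted endpoints, with ( marking the start and ) marking the end of a prefix interval:
(     (     (     (     )     (     )     )     (     )     )     )
00000 10000 10000 10000 10011 10100 10111 10111 11100 11101 11111 11111

Address to look up: 1100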

122

Rules

If the destination address we search for fits to the right of a left parenthesis, the prefix represented by that left parenthesis is the longest match

Otherwise, scan left, incrementing a counter for each right parenthesis and decrementing it for each left parenthesis, until the count reaches -1; the left parenthesis reached at that point represents the longest match

123

Example revisited

(     (     (     (     )     (     )     )     (     )     )     )
00000 10000 10000 10000 10011 10100 10111 10111 11100 11101 11111 11111

Address to look up: 1100

Count while scanning left from the position of 1100: 1, 2, 1, 2, 1, 0, -1

1100 is in the interval 10000-11111, so the longest prefix match is the prefix 1
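The parenthesis rule can be turned into a table that is built once and then searched with a plain binary search. The sketch below precomputes, for every endpoint, which prefix is innermost just after it (using a stack in place of the counter), so a lookup is a single `bisect` call. This precomputation is an assumption about how one might implement the idea, not something shown on the slides; the address is written as the 5-bit string 11000 on the assumption that the slide's 1100 is meant to fall between 10111 and 11100.

```python
import bisect

def build(prefixes, width=5):
    """Sort interval endpoints and record the innermost open prefix after each one."""
    points = []
    for p in prefixes:
        low = int(p.ljust(width, "0"), 2)
        high = int(p.ljust(width, "1"), 2)
        points.append((low, 0, len(p), p))    # left parenthesis: outer prefixes first
        points.append((high, 1, -len(p), p))  # right parenthesis: inner prefixes first
    points.sort()
    keys, answers, stack = [], [], []
    for value, kind, _, p in points:
        if kind == 0:
            stack.append(p)   # '(' opens a prefix interval
        else:
            stack.pop()       # ')' closes the innermost open interval
        keys.append((value, kind))
        answers.append(stack[-1] if stack else None)
    return keys, answers

def lookup(keys, answers, addr_bits):
    addr = int(addr_bits, 2)
    # Everything before pos is a '(' at or below addr, or a ')' strictly below addr.
    pos = bisect.bisect_left(keys, (addr, 1))
    return answers[pos - 1] if pos > 0 else None

keys, answers = build(["", "1", "10", "100", "101", "1110"])
print(repr(lookup(keys, answers, "11000")))
```

The empty string stands for the prefix {} (the default route), so an address below 10000 returns "" rather than None.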

124

Binary Search on Intervals

N W-bit prefixes: O(log N) lookup, O(N) storage

Advantages
  Storage is linear
  Can be ‘balanced’
  Lookup time independent of W

Disadvantages
  But, lookup time is dependent on N
  Incremental updates more complex than tries
  Each node is big in size: requires higher memory bandwidth