High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical Engineering and Computer Science [email protected] [email protected]


High Performance Switches and Routers: Theory and Practice

Sigcomm 99

August 30, 1999

Harvard University

High Performance Switching and Routing, Telecom Center Workshop: Sept 4, 1997.

Nick McKeown, Balaji Prabhakar

Departments of Electrical Engineering and Computer Science


Copyright 1999. All Rights Reserved 2

Tutorial Outline

• Introduction: What is a Packet Switch?

• Packet Lookup and Classification: Where does a packet go next?

• Switching Fabrics: How does the packet get there?

• Output Scheduling: When should the packet leave?


Introduction
What is a Packet Switch?

• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers


Basic Architectural Components

[Figure: block diagram. Control: Routing, Admission Control, Reservation, Congestion Control. Datapath (per-packet processing): Policing, Switching, Output Scheduling.]


Basic Architectural Components
Datapath: per-packet processing

[Figure: three stages. 1. Forwarding Decision, shown three times, each with its own Forwarding Table. 2. Interconnect. 3. Output Scheduling.]


Where high performance packet switches are used

Enterprise WAN access & Enterprise Campus Switch

- Carrier Class Core Router
- ATM Switch
- Frame Relay Switch

The Internet Core

Edge Router


Introduction
What is a Packet Switch?

• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers


ATM Switch

• Lookup cell VCI/VPI in VC table.

• Replace old VCI/VPI with new.

• Forward cell to outgoing interface.

• Transmit cell onto link.


Ethernet Switch

• Lookup frame DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, broadcast to all ports.

• Learn SA of incoming frame.

• Forward frame to outgoing interface.

• Transmit frame onto link.
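
The Ethernet switch steps above can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: the class name `LearningSwitch`, the use of a Python dict as the forwarding table, and the choice to exclude the arrival port when flooding are all assumptions made for the example.

```python
class LearningSwitch:
    """Toy learning bridge: look up DA, flood if unknown, learn SA."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # MAC address -> port (hypothetical forwarding table)

    def receive(self, in_port, src, dst):
        """Return the list of ports the frame is sent out on."""
        out = self.table.get(dst)            # look up frame DA
        self.table[src] = in_port            # learn SA of incoming frame
        if out is None:
            # Unknown DA: flood (here: every port except the arrival port).
            return [p for p in range(self.num_ports) if p != in_port]
        # Known DA: forward to that port, unless it is the arrival port.
        return [] if out == in_port else [out]
```

The first frame toward an unknown address is flooded; once the destination has sent a frame of its own, later frames go to a single port.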


IP Router

• Lookup packet DA in forwarding table.
  – If known, forward to correct port.
  – If unknown, drop packet.

• Decrement TTL, update header Cksum.

• Forward packet to outgoing interface.

• Transmit packet onto link.
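
The "decrement TTL, update header checksum" step can be illustrated as follows. This is a sketch under stated assumptions: it handles only a 20-byte IPv4 header without options, it recomputes the checksum from scratch rather than updating it incrementally as a fast path might, and the function names are made up for the example.

```python
import struct

def ip_checksum(header: bytes) -> int:
    """One's-complement of the one's-complement sum of the 16-bit words."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward_header(header: bytes) -> bytes:
    """Decrement TTL (byte 8) and rewrite the checksum (bytes 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("TTL expired; packet should be dropped")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                 # checksum field is zero while summing
    hdr[10:12] = struct.pack("!H", ip_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header with a valid checksum sums to zero under `ip_checksum`, which gives a quick self-check after rewriting.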


Introduction
What is a Packet Switch?

• Basic Architectural Components
• Some Example Packet Switches
• The Evolution of IP Routers


First-Generation IP Routers

[Figure: a CPU with buffer memory and several line interfaces (each with DMA and MAC) attached to a shared backplane.]


Second-Generation IP Routers

[Figure: a CPU with buffer memory and several line cards, each with DMA, MAC, and local buffer memory.]


Third-Generation Switches/Routers

[Figure: line cards (each with MAC and local buffer memory) and a CPU card (CPU and memory) connected by a switched backplane; line interfaces are shown on the cards.]


Fourth-Generation Switches/Routers
Clustering and Multistage

[Figure: ports numbered 1 to 32, arranged in groups.]


Packet Switches
References

• J. Giacopelli, M. Littlewood, W.D. Sincoskie “Sunshine: A high performance self-routing broadband packet switch architecture”, ISS ‘90.

• J. S. Turner “Design of a Broadcast packet switching network”, IEEE Trans Comm, June 1988, pp. 734-743.

• C. Partridge et al. “A Fifty Gigabit per second IP Router”, IEEE Trans Networking, 1998.

• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro Magazine, Jan-Feb 1997.


Tutorial Outline

• Introduction: What is a Packet Switch?

• Packet Lookup and Classification: Where does a packet go next?

• Switching Fabrics: How does the packet get there?

• Output Scheduling: When should the packet leave?


Basic Architectural Components
Datapath: per-packet processing

[Figure: three stages. 1. Forwarding Decision, shown three times, each with its own Forwarding Table. 2. Interconnect. 3. Output Scheduling.]


Forwarding Decisions

• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification


ATM and MPLS Switches
Direct Lookup

[Figure: the VCI is used directly as the address into a memory; the data read out is the (Port, VCI) pair.]
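
Direct lookup means the incoming VCI is itself the memory address, so a lookup is a single read with no search. A minimal sketch, assuming a Python list stands in for the memory and using hypothetical function names:

```python
def make_vc_table(size):
    """Memory with one slot per possible VCI; None marks an unused VC."""
    return [None] * size

def forward_cell(vc_table, vci):
    """Read the (port, new VCI) pair stored at address `vci`."""
    entry = vc_table[vci]          # one memory read, no comparison or search
    if entry is None:
        return None                # no such virtual circuit
    out_port, new_vci = entry
    return out_port, new_vci
```

The cost is that the table needs one slot for every possible label, which is practical for short labels such as a VCI but not for 48-bit or 32-bit addresses.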


Forwarding Decisions

• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification


Bridges and Ethernet Switches
Associative Lookups

[Figure: a 48-bit network address is presented as search data to an associative memory (CAM) holding (network address, associated data) entries. Outputs: Hit?, a log2 N-bit address, and the associated data.]

Advantages:
• Simple

Disadvantages:
• Slow
• High Power
• Small
• Expensive


Bridges and Ethernet Switches
Hashing

[Figure: the 48-bit search data passes through a hashing function to produce a 16-bit memory address; the memory returns Hit?, a log2 N-bit address, and the associated data.]


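Lookups Using Hashing
An example

[Figure: the 48-bit search data is hashed with CRC-16 to a 16-bit index into memory; each index heads a linked list of entries (lists of length 4, 2, and 3 are shown). Outputs: Hit?, a log2 N-bit address, and the associated data.]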
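
The hashing example can be sketched as below. Assumptions to note: `binascii.crc_hqx` computes the CRC-CCITT polynomial, which is one 16-bit CRC and not necessarily the one the slide intends; Python lists stand in for the linked lists; and `HashLookup` is a name invented here.

```python
import binascii

class HashLookup:
    """48-bit address -> 16-bit CRC -> chained bucket, as in the example."""

    def __init__(self):
        self.buckets = {}   # 16-bit hash -> list of (mac, data) entries

    def _hash(self, mac: bytes) -> int:
        return binascii.crc_hqx(mac, 0)     # 16-bit CRC of the 6-byte address

    def insert(self, mac: bytes, data):
        self.buckets.setdefault(self._hash(mac), []).append((mac, data))

    def lookup(self, mac: bytes):
        """Return (data, memory_references); data is None on a miss."""
        refs = 0
        for key, data in self.buckets.get(self._hash(mac), []):
            refs += 1                       # one reference per list element visited
            if key == mac:
                return data, refs
        return None, refs
```

The `refs` counter makes the non-deterministic lookup time visible: it equals the position of the entry in its chain.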


Lookups Using Hashing
Performance of simple example

$$E_R = \frac{1}{2}\left(1 + \frac{1}{1 - \left(1 - \frac{1}{N}\right)^{M}}\right)$$

Where:

E_R = Expected number of memory references

M = Number of memory addresses in table

N = Number of linked lists

M = N
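
To get a feel for the numbers, the expression can be evaluated directly. The sketch below assumes the slide's formula reads E_R = 1/2 (1 + 1 / (1 - (1 - 1/N)^M)) with M = N; that reading is a reconstruction from the extracted text, and `expected_refs` is a hypothetical helper name.

```python
def expected_refs(m_entries: int, n_lists: int) -> float:
    """Evaluate E_R = 1/2 * (1 + 1 / (1 - (1 - 1/N)**M))."""
    non_empty = 1.0 - (1.0 - 1.0 / n_lists) ** m_entries   # P(a given list is non-empty)
    return 0.5 * (1.0 + 1.0 / non_empty)
```

Under this reading, a table with as many entries as lists needs about 1.3 memory references per lookup on average when the table is large.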


Lookups Using Hashing

Advantages:

• Simple

• Expected lookup time can be small

Disadvantages

• Non-deterministic lookup time

• Inefficient use of memory


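Trees and Tries

Binary Search Tree
[Figure: each node branches on < or >; N entries give a depth of log2 N.]

Binary Search Trie
[Figure: each node branches on the next address bit, 0 or 1; leaf labels extracted as "111010".]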


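Trees and Tries
Multiway tries

16-ary Search Trie
[Figure: each node holds entries indexed by 4 address bits, 0000 through 1111; an entry is either a pointer to a child node ("ptr") or "0". Example 12-bit keys: 000011110000 and 111111111111.]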


Trees and Tries
Multiway tries

Degree of Tree   # Mem References   # Nodes (x10^6)   Total Memory (Mbytes)   Fraction Wasted (%)
2                48                 1.09              4.3                     49
4                24                 0.53              4.3                     73
8                16                 0.35              5.6                     86
16               12                 0.25              8.3                     93
64               8                  0.17              21                      98
256              6                  0.12              64                      99.5

[Equations: closed-form expressions for E_w and E_n in terms of D, L, and N, each containing a sum over i = 1 to L-1; the exact terms are not recoverable from the extracted text.]

Where:

D = Degree of tree

L = Number of layers/references

N = Number of entries in table

E_n = Expected number of nodes

E_w = Expected amount of wasted memory

Table produced from 2^15 randomly generated 48-bit addresses
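
The trade-off in the table (fewer memory references against more wasted memory as the degree grows) can be explored with a small counting model. This is a sketch under assumptions that are not from the slides: a node exists for every distinct prefix at each level, every node has a full array of D slots, a slot counts as wasted if no address passes through it, and `bits` is a multiple of `stride`.

```python
def trie_stats(addresses, bits, stride):
    """Count nodes and wasted slots in a multiway trie of degree 2**stride."""
    degree = 1 << stride
    levels = bits // stride              # memory references per lookup
    nodes = used = 0
    for level in range(levels):
        # A node exists for every distinct prefix of length level*stride.
        parents = {a >> (bits - level * stride) for a in addresses}
        # A slot is used for every distinct prefix one stride longer.
        children = {a >> (bits - (level + 1) * stride) for a in addresses}
        nodes += len(parents)
        used += len(children)
    wasted = 1 - used / (nodes * degree)
    return levels, nodes, wasted
```

Feeding it 2^15 random 48-bit integers (for example from `random.getrandbits(48)`) with `stride=4` gives 12 levels and should land near the 16-ary row of the table, although the exact memory accounting used for the slide is not known.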


Forwarding Decisions

• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification


Caching Addresses

[Figure: a CPU with buffer memory and several line cards (each with DMA, MAC, and local buffer memory). Two paths are marked: Fast Path and Slow Path.]


Caching Addresses

LAN: Average flow < 40 packets

WAN: Huge number of flows

[Chart: cache hit rate, axis from 0% to 100%, for a cache holding 10% of the full table.]


IP Routers
Class-based addresses

[Figure: the IP address space divided into Class A, Class B, Class C, and D. Example: address 212.17.9.4 falls in Class C and is found by exact match on the routing table entry 212.17.9.0 -> Port 4.]


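IP Routers
CIDR

[Figure: the address space from 0 to 2^32-1, shown twice. Class-based: divided into A, B, C, D. Classless: prefixes 65/8, 128.9/16, and 142.12/19 marked as ranges; 128.9/16 starts at 128.9.0.0, spans 2^16 addresses, and contains 128.9.16.14.]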


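IP Routers
CIDR

[Figure: within the address space from 0 to 2^32-1, the range 128.9/16 is shown together with the more specific prefixes 128.9.16/20, 128.9.176/20, 128.9.19/24, and 128.9.25/24. The address 128.9.16.14 is marked.]

Most specific route = “longest matching prefix”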
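
Longest-prefix matching can be stated very compactly if speed is ignored. The sketch below is a plain linear scan using Python's standard `ipaddress` module, meant only to pin down the semantics; the prefixes are the ones on the slide written out in full dotted form, and the function name is hypothetical.

```python
import ipaddress

PREFIXES = [
    "128.9.0.0/16", "128.9.16.0/20", "128.9.176.0/20",
    "128.9.19.0/24", "128.9.25.0/24", "142.12.0.0/19", "65.0.0.0/8",
]

def longest_prefix_match(prefixes, address):
    """Return the most specific prefix containing `address`, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best
```

For the marked address 128.9.16.14 both 128.9/16 and 128.9.16/20 match, and the /20 wins because it is longer. The lookup structures in the following slides aim to produce the same answer in far fewer steps than this O(N) scan.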


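IP Routers
Metrics for Lookups

[Figure: lookup of 128.9.16.14 in a prefix/port table containing 128.9/16, 128.9.16/20, 128.9.176/20, 128.9.19/24, 128.9.25/24, 142.12/19, and 65/8. The port column was extracted as "35271013"; its alignment with the prefixes is not recoverable.]

• Lookup time
• Storage space
• Update time
• Preprocessing time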


IP Router Lookup

IPv4 unicast destination address based lookup

[Figure: the header of an incoming packet enters the forwarding engine; the next hop computation takes the destination address, consults the forwarding table of (Destination, Next Hop) entries, and outputs the next hop.]


Need more than IPv4 unicast lookups

• Multicast
  • PIM SM
    – Longest Prefix Matching on the source and group address
    – Try (S,G) followed by (*,G) followed by (*,*,RP)
    – Check Incoming Interface
  • DVMRP:
    – Incoming Interface Check followed by (S,G) lookup

• IPv6
  • 128 bit destination address field
  • Exact address architecture not yet known


Lookup Performance Required

Line     Line Rate   Pkt size=40B   Pkt size=240B
T1       1.5Mbps     4.68 Kpps      0.78 Kpps
OC3      155Mbps     480 Kpps       80 Kpps
OC12     622Mbps     1.94 Mpps      323 Kpps
OC48     2.5Gbps     7.81 Mpps      1.3 Mpps
OC192    10 Gbps     31.25 Mpps     5.21 Mpps

Gigabit Ethernet (84B packets): 1.49 Mpps
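
The figures in this table follow from dividing the line rate by the packet size in bits. A minimal sketch, assuming back-to-back packets with no framing overhead beyond the stated size (the function name is made up):

```python
def packets_per_second(line_rate_bps: float, packet_bytes: int) -> float:
    """Lookups per second needed to keep up with back-to-back packets."""
    return line_rate_bps / (8 * packet_bytes)
```

For example, 10 Gbps with 40-byte packets gives 31.25 million packets per second, which leaves 32 ns per lookup.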


Size of the Routing Table

Source: http://www.telstra.net/ops/bgptable.html


Ternary CAMs

Associative Memory

Value       Mask              Next Hop
10.0.0.0    255.0.0.0         R1
10.1.0.0    255.255.0.0       R2
10.1.1.0    255.255.255.0     R3
10.1.3.0    255.255.255.0     R4
10.1.3.1    255.255.255.255   R4

[Figure: the match lines feed a priority encoder, which selects the next hop.]
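
A ternary CAM compares the search key against every (value, mask) pair at once and lets a priority encoder pick among the hits. The sketch below emulates that in software with the entries from the slide. Two assumptions: the priority encoder is taken to prefer the entry with the longest mask, and a sequential scan stands in for the parallel hardware compare.

```python
import ipaddress

def ip(s):
    return int(ipaddress.ip_address(s))

ENTRIES = [  # (value, mask, next hop)
    ("10.0.0.0", "255.0.0.0", "R1"),
    ("10.1.0.0", "255.255.0.0", "R2"),
    ("10.1.1.0", "255.255.255.0", "R3"),
    ("10.1.3.0", "255.255.255.0", "R4"),
    ("10.1.3.1", "255.255.255.255", "R4"),
]

def tcam_lookup(entries, address):
    addr = ip(address)
    # Hardware compares all entries in parallel; this comprehension stands in for that.
    hits = [(ip(mask), hop) for value, mask, hop in entries
            if addr & ip(mask) == ip(value)]
    if not hits:
        return None
    # Priority encoder (assumed): prefer the hit whose mask has the most 1 bits.
    return max(hits, key=lambda h: bin(h[0]).count("1"))[1]
```

In a real device the same effect is usually obtained by storing entries in priority order, which is part of why updates to a TCAM can be costly.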


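Binary Tries

Example Prefixes

a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000

[Figure: binary trie with branches labeled 0 and 1; prefixes a through j sit at the nodes reached by their bit strings.]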
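
A binary trie for the example prefixes can be built and searched as follows. This is a minimal sketch: addresses and prefixes are given as strings of '0' and '1' for readability, and the class and function names are invented for the example.

```python
class TrieNode:
    __slots__ = ("children", "label")

    def __init__(self):
        self.children = {}   # '0' or '1' -> TrieNode
        self.label = None    # set if a prefix ends at this node

PREFIXES = {
    "a": "00001", "b": "00010", "c": "00011", "d": "001", "e": "0101",
    "f": "011", "g": "100", "h": "1010", "i": "1100", "j": "11110000",
}

def build_trie(prefixes):
    root = TrieNode()
    for label, bits in prefixes.items():
        node = root
        for b in bits:
            node = node.children.setdefault(b, TrieNode())
        node.label = label
    return root

def longest_match(root, address_bits):
    """Walk one bit at a time, remembering the last prefix seen."""
    node, best = root, None
    for b in address_bits:
        node = node.children.get(b)
        if node is None:
            break
        if node.label is not None:
            best = node.label
    return best
```

Each step down the trie is one memory access, so a lookup costs up to W accesses for W-bit addresses; prefix j alone forces a chain of eight nodes.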


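Patricia Tree

Example Prefixes

a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000

[Figure: the same prefixes stored in a Patricia tree with branches labeled 0 and 1; the path to j carries the label Skip=5.]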


Patricia Tree

Disadvantages
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space

Advantages
• General Solution
• Extensible to wider fields

Avoid backtracking by storing the intermediate-best matched prefix. (Dynamic Prefix Tries)

40K entries: 2MB data structure with 0.3-0.5 Mpps [O(W)]


Binary search on trie levels

[Figure: a trie containing a prefix P, with Level 0, Level 8, and Level 29 marked.]


Binary search on trie levels

Store a hash table for each prefix length to aid search at a particular trie level.

Example Prefixes
10.0.0.0/8
10.1.0.0/16
10.1.1.0/24
10.1.2.0/24
10.2.3.0/24

Example Addrs
10.1.1.4
10.4.4.3
10.2.3.9
10.2.4.8

Length   Hash
8        10
12
16       10.1, 10.2
24       10.1.1, 10.1.2, 10.2.3
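
The idea can be sketched with one Python dict per prefix length and a binary search over the lengths. Several simplifications are assumptions of this sketch and not taken from the slides: a marker is placed at every shorter length (more than a tuned implementation needs), each entry stores its best matching real prefix so that no backtracking is required, only lengths that actually occur get a table, and the function names are hypothetical.

```python
import ipaddress

def build(prefixes):
    nets = [ipaddress.ip_network(p) for p in prefixes]
    lengths = sorted({n.prefixlen for n in nets})
    real = {(int(n.network_address) >> (32 - n.prefixlen), n.prefixlen): n
            for n in nets}
    tables = {m: {} for m in lengths}
    for n in nets:
        addr = int(n.network_address)
        for m in lengths:
            if m > n.prefixlen:
                break
            # Entry at length m is either the prefix itself or a marker for it.
            bmp = None
            for k in lengths:            # best real prefix of length <= m
                if k > m:
                    break
                bmp = real.get((addr >> (32 - k), k), bmp)
            tables[m][addr >> (32 - m)] = bmp
    return lengths, tables

def lookup(lengths, tables, address):
    addr = int(ipaddress.ip_address(address))
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        m = lengths[mid]
        key = addr >> (32 - m)
        if key in tables[m]:             # hit: remember it, try longer lengths
            if tables[m][key] is not None:
                best = tables[m][key]
            lo = mid + 1
        else:                            # miss: only shorter lengths can match
            hi = mid - 1
    return best
```

With the example prefixes, 10.2 appears in the length-16 table only as a marker for 10.2.3.0/24; an address such as 10.2.4.8 hits that marker, misses at length 24, and falls back to the marker's stored best match, 10.0.0.0/8.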


Binary search on trie levels

Disadvantages
• Multiple hashed memory accesses.
• Updates are complex.

Advantages
• Scaleable to IPv6.

33K entries: 1.4MB data structure with 1.2-2.2 Mpps [O(log W)]


Compacting Forwarding Tables

[Figure: a trie whose leaves are represented as a bit vector: 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1]


Compacting Forwarding Tables

Index              0          1          2          3          4
Bit mask           10001010   11100010   10000010   10110100   11000000
Codeword array     R1, 0      R2, 3      R3, 7      R4, 9      R5, 0

Index              0    1
Base index array   0    13
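
The numbers in the codeword and base index arrays can be reproduced by counting set bits. The sketch below assumes that the second value in each codeword is a running count of 1 bits since the last base index entry, and that one base index entry covers four codewords; both assumptions are inferred from the slide's numbers. The function names are invented.

```python
MASKS = ["10001010", "11100010", "10000010", "10110100", "11000000"]
GROUP = 4   # codewords per base index entry (inferred from the slide)

def build_arrays(masks, group=GROUP):
    """Return (per-codeword offsets, base indices)."""
    offsets, bases, total, running = [], [], 0, 0
    for i, mask in enumerate(masks):
        if i % group == 0:
            bases.append(total)      # ones seen before this group
            running = 0              # offsets restart within each group
        offsets.append(running)
        ones = mask.count("1")
        running += ones
        total += ones
    return offsets, bases

def pointer_index(masks, offsets, bases, bit_pos, group=GROUP):
    """Index of the pointer that covers bit position `bit_pos`."""
    chunk, within = divmod(bit_pos, len(masks[0]))
    ones_so_far = masks[chunk][: within + 1].count("1")
    return bases[chunk // group] + offsets[chunk] + ones_so_far - 1
```

Keeping the offsets small (they restart every four codewords) is what lets each codeword fit in a few bits, with the larger running total held in the sparser base index array.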


Compacting Forwarding Tables

Disadvantages
• Scalability to larger tables?
• Updates are complex.

Advantages
• Extremely small data structure - can fit in cache.

33K entries: 160KB data structure with average 2Mpps [O(W/k)]


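Multi-bit Tries

16-ary Search Trie
[Figure: each node holds entries indexed by 4 address bits, 0000 through 1111; an entry is either a pointer to a child node ("ptr") or "0". Example 12-bit keys: 000011110000 and 111111111111.]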


Compressed Tries

[Figure: a trie with three levels marked L8, L16, and L24.]

Only 3 memory accesses


Routing Lookups in Hardware

[Chart: number of prefixes versus prefix length.]

Most prefixes are 24-bits or shorter


Routing Lookups in Hardware

Prefixes up to 24-bits

[Figure: the first 24 bits of the address 142.19.6.14 (142.19.6) index a table of 2^24 = 16M entries; the selected entry has a flag bit of 1 and holds the next hop.]


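Routing Lookups in Hardware

Prefixes up to 24-bits

[Figure: the first 24 bits of the address 128.3.72.44 (128.3.72) index the first table. An entry with flag bit 1 holds a next hop; an entry with flag bit 0 holds a pointer.]

Prefixes above 24-bits

[Figure: the pointer gives the base and the last 8 bits of the address (44) give the offset into a second table of next hops.]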
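
The two-table lookup in these figures can be sketched as below. This is an illustration under assumptions, not the hardware design: a dict stands in for the 2^24-entry first table so the example stays small, the table is built once from a full route list, and the class name, route list, and next-hop labels are hypothetical.

```python
import ipaddress

class Dir24_8:
    def __init__(self):
        self.tbl24 = {}      # sparse stand-in for the 2**24-entry first table
        self.tbllong = []    # second table, allocated in blocks of 256 entries

    def build(self, routes):
        # Insert shorter prefixes first so that longer ones overwrite them.
        ordered = sorted(routes, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
        for prefix, hop in ordered:
            net = ipaddress.ip_network(prefix)
            first = int(net.network_address)
            if net.prefixlen <= 24:
                start = first >> 8
                for key in range(start, start + (1 << (24 - net.prefixlen))):
                    self.tbl24[key] = (1, hop)           # flag 1: next hop
            else:
                key = first >> 8
                flag, value = self.tbl24.get(key, (1, None))
                if flag == 1:                            # allocate a 256-entry block
                    base = len(self.tbllong)
                    self.tbllong.extend([value] * 256)   # inherit the shorter match
                    self.tbl24[key] = (0, base)          # flag 0: pointer
                else:
                    base = value
                low = first & 0xFF
                for off in range(low, low + (1 << (32 - net.prefixlen))):
                    self.tbllong[base + off] = hop

    def lookup(self, address):
        addr = int(ipaddress.ip_address(address))
        flag, value = self.tbl24.get(addr >> 8, (1, None))
        if flag == 1:
            return value                                 # one memory access
        return self.tbllong[value + (addr & 0xFF)]       # second access: base + offset
```

Every lookup takes one or two memory reads regardless of table size, which is the property that makes the scheme attractive for hardware; the price is the prefix expansion done at build time.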


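Routing Lookups in Hardware

Prefixes up to n-bits

[Figure: a first table of 2^n entries indexed by address bits 0 to N, second-level tables of 2^m entries indexed by bits N to N+M, and a further level for prefixes longer than N+M bits, ending in the next hop. Labels i and j mark entries.]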


Routing Lookups in Hardware

Disadvantages
• Large memory required (9-33MB)
• Depends on prefix-length distribution.

Advantages
• 20Mpps with 50ns DRAM
• Easy to implement in hardware

Various compression schemes can be employed to decrease the storage requirements: e.g. employ carefully chosen variable length strides, bitmap compression etc.


IP Router Lookups
References

• A. Brodnik, S. Carlsson, M. Degermark, S. Pink. “Small Forwarding Tables for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.

• B. Lampson, V. Srinivasan, G. Varghese. “ IP lookups using multiway and multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.

• M. Waldvogel, G. Varghese, J. Turner, B. Plattner. “Scalable high speed IP routing lookups”, Sigcomm 1997, pp 25-36.

• P. Gupta, S. Lin, N.McKeown. “Routing lookups in hardware at memory access speeds”, Infocom 1998, pp 1241-1248, vol. 3.

• S. Nilsson, G. Karlsson. “Fast address lookup for Internet routers”, IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.

• V. Srinivasan, G.Varghese. “Fast IP lookups using controlled prefix expansion”, Sigmetrics, June 1998.


Forwarding Decisions

• ATM and MPLS switches
  – Direct Lookup
• Bridges and Ethernet switches
  – Associative Lookup
  – Hashing
  – Trees and tries
• IP Routers
  – Caching
  – CIDR
  – Patricia trees/tries
  – Other methods
• Packet Classification


Providing Value Added Services
Some examples

• Differentiated services
  – Regard traffic from Autonomous System #33 as ‘platinum grade’
• Access Control Lists
  – Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
  – Rate limit WWW traffic from sub interface #739 to 10Mbps
• Policy based Routing
  – Route all voice traffic through the ATM network


Packet Classification

[Figure: the header of an incoming packet enters the forwarding engine; packet classification consults the classifier (policy database) of (Predicate, Action) entries and outputs the action.]


Multi-field Packet Classification

Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.

         Field 1             Field 2            …   Field k   Action
Rule 1   152.163.190.69/21   152.163.80.11/32   …   UDP       A1
Rule 2   152.168.3.0/24      152.163.0.0/16     …   TCP       A2
…        …                   …                  …   …         …
Rule N   152.168.0.0/16      152.0.0.0/8        …   ANY       An
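
The problem statement can be made concrete with the simplest possible classifier: check the rules in priority order and return the first match. The sketch uses only the three rules and three fields visible in the table (the elided fields and rules are left out), treats list order as priority, and the function name is an assumption.

```python
import ipaddress

RULES = [  # (field 1 prefix, field 2 prefix, protocol, action), highest priority first
    ("152.163.190.69/21", "152.163.80.11/32", "UDP", "A1"),
    ("152.168.3.0/24", "152.163.0.0/16", "TCP", "A2"),
    ("152.168.0.0/16", "152.0.0.0/8", "ANY", "An"),
]

def classify(rules, field1, field2, proto):
    """Return the action of the first (highest priority) matching rule."""
    a1 = ipaddress.ip_address(field1)
    a2 = ipaddress.ip_address(field2)
    for p1, p2, rule_proto, action in rules:
        # strict=False masks off host bits, e.g. 152.163.190.69/21 -> 152.163.184.0/21
        if (a1 in ipaddress.ip_network(p1, strict=False)
                and a2 in ipaddress.ip_network(p2, strict=False)
                and rule_proto in ("ANY", proto)):
            return action
    return None
```

This linear scan needs little storage and handles any number of fields, but its time grows with N, which is what motivates the faster schemes compared in the slides that follow.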


Geometric Interpretation in 2D

Field #1   Field #2   Data

[Figure: rules R1 through R7 drawn as regions in the plane of Field #1 versus Field #2, e.g. (144.24/16, 64/24) and (128.16.46.23, *); packets P1 and P2 are points.]


Proposed Schemes

Sequential Evaluation
  Pros: Small storage, scales well with number of fields
  Cons: Slow classification rates

Ternary CAMs
  Pros: Single cycle classification
  Cons: Cost, density, power consumption

Grid of Tries (Srinivasan et al [Sigcomm 98])
  Pros: Small storage requirements and fast lookup rates for two fields. Suitable for big classifiers
  Cons: Not easily extendible to more than two fields.


Proposed Schemes (Contd.)

Crossproducting (Srinivasan et al [Sigcomm 98])
  Pros: Fast accesses. Suitable for multiple fields.
  Cons: Large memory requirements. Suitable without caching for classifiers with fewer than 50 rules.

Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
  Pros: Suitable for multiple fields.
  Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.


Proposed Schemes (Contd.)

Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
  Pros: Suitable for multiple fields. Small memory requirements. Good update time.
  Cons: Large preprocessing time.

Tuple Space Search (Srinivasan et al [Sigcomm 99])
  Pros: Suitable for multiple fields. The basic scheme has good update times and memory requirements.
  Cons: Classification rate can be low. Requires perfect hashing for determinism.

Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
  Pros: Fast accesses. Suitable for multiple fields. Reasonable memory requirements for real-life classifiers.
  Cons: Large preprocessing time and memory requirements for large classifiers.


Grid of Tries

[Figure: a trie on Dimension 1 linked to tries on Dimension 2; rules R1 through R7 are stored in the Dimension 2 tries.]


Grid of Tries

Disadvantages
• Static solution
• Not easy to extend to higher dimensions

Advantages
• Good solution for two dimensions

20K entries: 2MB data structure with 9 memory accesses [at most 2W]


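Classification using Bit Parallelism

[Figure: rules R1 through R4 drawn as regions, with per-interval bit vectors holding one bit per rule; bit values shown: 1 1 0 0 1 0 1 1.]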


Classification using Bit Parallelism

Disadvantages
• Large memory bandwidth
• Hardware optimized

Advantages
• Good solution for multiple dimensions for small classifiers

512 rules: 1Mpps with single FPGA and 5 128KB SRAM chips.
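
The bit-parallel idea can be sketched in software: for each dimension, precompute one bitmap of matching rules per elementary interval, then AND the per-dimension bitmaps for a packet and take the lowest set bit. The rule ranges in the usage below are made up for illustration, rule 0 is assumed to have the highest priority, and a Python integer stands in for the wide bit vector that hardware would read in one memory access.

```python
from bisect import bisect_right

def build_dimension(ranges):
    """ranges[i] = (lo, hi), the inclusive range of rule i in this dimension."""
    # Rule membership can only change at a range start or just past a range end.
    points = sorted({lo for lo, _ in ranges} | {hi + 1 for _, hi in ranges})
    bitmaps = []
    for start in points:
        bits = 0
        for i, (lo, hi) in enumerate(ranges):
            if lo <= start <= hi:
                bits |= 1 << i           # bit i set: rule i covers this interval
        bitmaps.append(bits)
    return points, bitmaps

def dimension_bitmap(dim, value):
    points, bitmaps = dim
    idx = bisect_right(points, value) - 1    # interval containing `value`
    return bitmaps[idx] if idx >= 0 else 0

def classify(dims, packet):
    bits = -1                                # all ones
    for dim, value in zip(dims, packet):
        bits &= dimension_bitmap(dim, value) # rules matching in every dimension
    if bits == 0:
        return None
    return (bits & -bits).bit_length() - 1   # lowest set bit = highest priority
```

The per-dimension searches are independent, so hardware can run them in parallel; the bitmaps are N bits wide, which is where the memory bandwidth cost comes from.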


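Classification Using Multiple Fields
Recursive Flow Classification

[Figure: packet header fields F1, F2, F3, F4, ..., Fn index a sequence of memories whose final output is the action. Two mappings are shown: 2^S = 2^128 to 2^T = 2^12 in one step, and 2^S = 2^128 to 2^64 to 2^24 to 2^T = 2^12 in stages.]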


Packet Classification
References

• T.V. Lakshman. D. Stiliadis. “High speed policy based packet forwarding using efficient multi-dimensional range matching”, Sigcomm 1998, pp 191-202.

• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel. “Fast and scalable layer 4 switching”, Sigcomm 1998, pp 203-214.

• V. Srinivasan, G. Varghese, S. Suri. “Fast packet classification using tuple space search”, to be presented at Sigcomm 1999.

• P. Gupta, N. McKeown, “Packet classification using hierarchical intelligent cuttings”, Hot Interconnects VII, 1999.

• P. Gupta, N. McKeown, “Packet classification on multiple fields”, Sigcomm 1999.


Tutorial Outline

• Introduction: What is a Packet Switch?

• Packet Lookup and Classification: Where does a packet go next?

• Switching Fabrics: How does the packet get there?

• Output Scheduling: When should the packet leave?

Page 74: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Switching Fabrics

• Output and Input Queueing

• Output Queueing

• Input Queueing
    – Scheduling algorithms
    – Combining input and output queues
    – Other non-blocking fabrics
    – Multicast traffic

Page 75: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Basic Architectural Components
Datapath: per-packet processing

[Figure: the datapath in three numbered stages: 1. Forwarding Decision (with a Forwarding Table) on each input, 2. Interconnect, 3. Output Scheduling.]

Page 76: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Interconnects
Two basic techniques

Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)

Output Queueing: usually a fast bus

Page 77: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Interconnects
Output Queueing

Individual Output Queues: memory b/w = (N+1).R

Centralized Shared Memory: memory b/w = 2N.R

[Figure: both organizations drawn with ports 1, 2, ..., N.]

Page 78: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Output Queueing
The “ideal”

[Figure: an output-queued switch with cells labelled 1 and 2.]

Page 79: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Output Queueing
How fast can we make centralized shared memory?

[Figure: ports 1, 2, ..., N connected to a shared memory built from 5ns SRAM over a 200 byte bus.]

• 5ns per memory operation
• Two memory operations per packet
• Therefore, up to 160Gb/s
• In practice, closer to 80Gb/s
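The 160Gb/s figure follows directly from the bus width and the access time. A short check of the arithmetic (the function name and parameter names are hypothetical; the numbers are the ones on the slide):

```python
def shared_memory_throughput_bps(bus_bytes, access_time_s, ops_per_packet=2):
    """Peak switch throughput when every packet is written once and read once."""
    bits_per_op = bus_bytes * 8
    raw_bandwidth = bits_per_op / access_time_s   # 1600 bits every 5 ns = 320 Gb/s
    return raw_bandwidth / ops_per_packet         # halved by the write + read


peak = shared_memory_throughput_bps(bus_bytes=200, access_time_s=5e-9)
print(f"{peak / 1e9:.0f} Gb/s")
```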

Page 80: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Switching Fabrics

• Output and Input Queueing

• Output Queueing

• Input Queueing
    – Scheduling algorithms
    – Other non-blocking fabrics
    – Combining input and output queues
    – Multicast traffic

Page 81: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Interconnects
Input Queueing with Crossbar

[Figure: a crossbar between Data In and Data Out; a scheduler sets the crossbar configuration.]

Memory b/w = 2R

Page 82: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Head of Line Blocking

[Figure: delay versus load. Load axis marked at 58.6% and 100%.]
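The 58.6% mark is the saturation throughput of FIFO input queueing under uniform traffic, 2 - sqrt(2) for a large switch (Karol et al., listed in the input queueing references). A rough simulation sketch, assuming every input is always backlogged, destinations are uniform, and each output picks one contending head-of-line cell at random; the function name and parameters are hypothetical:

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Per-port throughput of a saturated FIFO input-queued switch."""
    rng = random.Random(seed)
    # Destination of the head-of-line cell at each (always backlogged) input.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        for inputs in contenders.values():
            winner = rng.choice(inputs)            # output serves one input
            hol[winner] = rng.randrange(n_ports)   # next cell moves to the head
            served += 1
    return served / (n_ports * n_slots)


throughput = hol_saturation_throughput(n_ports=32, n_slots=2000)
limit = 2 - 2 ** 0.5
```

For a finite number of ports the simulated value sits slightly above the asymptotic limit, so treat the comparison as approximate.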

Page 83: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Head of Line Blocking

Page 86: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Virtual output queues

Page 87: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queues
Virtual Output Queues

[Figure: delay versus load. Load axis marked at 100%.]

Page 88: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing

[Figure: input-queued switch with a scheduler.]

Memory b/w = 2R

Scheduler: can be quite complex!

Page 89: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Scheduling

[Figure: inputs 1 ... m with arrivals A1(t) ... Am(t) held in virtual output queues Q(1,1) ... Q(m,n); a matching M connects them to outputs 1 ... n with departures D1(t) ... Dn(t). Which matching?]

Page 90: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Scheduling

[Figure: a weighted request graph between inputs 1-4 and outputs 1-4, and a bipartite matching on it (Weight = 18).]

Question: Maximum weight or maximum size?

Page 91: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Scheduling

• Maximum Size
    – Maximizes instantaneous throughput
    – Does it maximize long-term throughput?

• Maximum Weight
    – Can clear most backlogged queues
    – But does it sacrifice long-term throughput?
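The difference between the two objectives shows up even on a 2x2 switch. The sketch below brute-forces both matchings over a hypothetical VOQ occupancy matrix (the matrix and the helper names are assumptions for illustration; exhaustive search over permutations is only practical for tiny switches):

```python
from itertools import permutations


def best_matching(occupancy, score):
    """Exhaustively search input->output permutations for the best matching."""
    n = len(occupancy)
    best, best_score = None, None
    for perm in permutations(range(n)):
        # Keep only edges backed by a non-empty virtual output queue.
        edges = [(i, j) for i, j in enumerate(perm) if occupancy[i][j] > 0]
        s = score(edges, occupancy)
        if best_score is None or s > best_score:
            best, best_score = edges, s
    return best


def size(edges, occupancy):
    return len(edges)


def weight(edges, occupancy):
    return sum(occupancy[i][j] for i, j in edges)


# occupancy[i][j] = cells queued at input i for output j (hypothetical values).
Q = [[10, 1],
     [1, 0]]

max_size = best_matching(Q, size)      # serves two short queues
max_weight = best_matching(Q, weight)  # serves the one long queue
```

Maximum size moves two cells this slot but leaves the backlog of 10 untouched; maximum weight moves one cell from the longest queue.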

Page 92: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Scheduling

[Figure: scheduling example on a 2x2 switch with ports labelled 1 and 2.]

Page 93: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

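Input Queueing
Longest Queue First or Oldest Cell First

[Figure: request graph on ports 1-4 with edge weights 10 and 1, and its maximum weight matching.]

Maximum weight matching with Weight = { Queue Length | Waiting Time } gives 100% throughput.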

Page 94: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Why is serving long/old queues better than serving maximum number of queues?

• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.

[Figure: average occupancy per VOQ # under uniform traffic and under non-uniform traffic.]

Page 95: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Practical Algorithms

• Maximal Size Algorithms
    – Wave Front Arbiter (WFA)
    – Parallel Iterative Matching (PIM)
    – iSLIP

• Maximal Weight Algorithms
    – Fair Access Round Robin (FARR)
    – Longest Port First (LPF)

Page 96: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Wave Front Arbiter

[Figure: a 4x4 request graph (Requests) and the resulting match (Match).]

Page 97: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Wave Front Arbiter

Requests Match

Page 98: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Wave Front Arbiter
Implementation

1,1 1,2 1,3 1,4

2,1 2,2 2,3 2,4

3,1 3,2 3,3 3,4

4,1 4,2 4,3 4,4

Combinational Logic Blocks
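A software sketch of the arbiter, assuming the unwrapped form in which a wave sweeps the array of (input, output) blocks one anti-diagonal at a time. Blocks on the same anti-diagonal share neither a row nor a column, so hardware can decide them in parallel; the loop below just visits them in order. The function name and the example request matrix are hypothetical.

```python
def wave_front_arbiter(requests):
    """requests[i][j] is truthy when input i has a cell for output j."""
    n = len(requests)
    row_free = [True] * n   # input i not yet matched
    col_free = [True] * n   # output j not yet matched
    match = []
    for wave in range(2 * n - 1):        # one anti-diagonal per step
        for i in range(n):
            j = wave - i
            if 0 <= j < n and requests[i][j] and row_free[i] and col_free[j]:
                row_free[i] = col_free[j] = False
                match.append((i, j))
    return match


REQUESTS = [[1, 1, 0],
            [1, 0, 0],
            [0, 1, 1]]
result = wave_front_arbiter(REQUESTS)
```

The result is maximal rather than maximum: no further request can be added, but a different choice on the first wave could have matched all three inputs.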

Page 99: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Wave Front Arbiter
Wrapped WFA (WWFA)

Requests Match

N steps instead of 2N-1

Page 100: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Practical Algorithms

• Maximal Size Algorithms
    – Wave Front Arbiter (WFA)
    – Parallel Iterative Matching (PIM)
    – iSLIP

• Maximal Weight Algorithms
    – Fair Access Round Robin (FARR)
    – Longest Port First (LPF)

Page 101: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching

[Figure: two iterations (#1 and #2) of PIM on a 4x4 switch. Each iteration has three phases: Requests, Grant, and Accept/Match. Grants and accepts are made by random selection.]
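The request, grant, and accept phases can be sketched as follows. This is an illustrative sketch, not the published implementation: the function signature, the explicit `rng` argument, and the all-ones request matrix are assumptions.

```python
import random


def pim(requests, iterations, rng):
    """Parallel Iterative Matching on an n x n request matrix."""
    n = len(requests)
    in_match = [None] * n    # in_match[i] = output matched to input i
    out_match = [None] * n   # out_match[j] = input matched to output j
    for _ in range(iterations):
        # Grant: each unmatched output picks one requesting unmatched input at random.
        grants = {}
        for j in range(n):
            if out_match[j] is None:
                candidates = [i for i in range(n)
                              if requests[i][j] and in_match[i] is None]
                if candidates:
                    grants.setdefault(rng.choice(candidates), []).append(j)
        if not grants:
            break            # nothing left to resolve: the match is maximal
        # Accept: each input that received grants picks one at random.
        for i, outputs in grants.items():
            j = rng.choice(outputs)
            in_match[i], out_match[j] = j, i
    return [(i, j) for i, j in enumerate(in_match) if j is not None]


FULL = [[1] * 4 for _ in range(4)]
matching = pim(FULL, iterations=4, rng=random.Random(1))
```

Every iteration that issues a grant adds at least one pair, so N iterations always suffice; the point of the analysis is that far fewer are needed on average.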

Page 102: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching
Maximal is not Maximum

[Figure: a 4x4 request graph (Requests) and an Accept/Match result that is maximal but not maximum.]

Page 103: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching
Analytical Results

Number of iterations to converge:

$$E[U_i] \le \frac{N^2}{4^i}, \qquad E[C] \le \log N$$

C = # of iterations required to resolve connections
N = # of ports
U_i = # of unresolved connections after iteration i

Page 104: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching

Page 105: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching

Page 106: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Parallel Iterative Matching

Page 107: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
Practical Algorithms

• Maximal Size Algorithms
    – Wave Front Arbiter (WFA)
    – Parallel Iterative Matching (PIM)
    – iSLIP

• Maximal Weight Algorithms
    – Fair Access Round Robin (FARR)
    – Longest Port First (LPF)

Page 108: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

iSLIP

[Figure: two iterations (#1 and #2) of iSLIP on a 4x4 switch. Each iteration has three phases: Requests, Grant, and Accept/Match. Grants and accepts are made by round-robin selection.]

Page 109: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

iSLIP
Properties

• Random under low load
• TDM under high load
• Lowest priority to MRU
• 1 iteration: fair to outputs
• Converges in at most N iterations. On average <= log2 N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
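A compact sketch of one cell time of iSLIP, assuming round-robin grant and accept pointers that advance one past the matched port only when a grant is accepted in the first iteration (the rule given in the iSLIP paper listed in the references). The function and variable names are hypothetical.

```python
def islip(requests, grant_ptr, accept_ptr, iterations):
    """One cell time of iSLIP; grant_ptr and accept_ptr are updated in place."""
    n = len(requests)
    in_match = [None] * n
    out_match = [None] * n
    for it in range(iterations):
        # Grant: each unmatched output takes the first requesting, unmatched
        # input at or after its round-robin pointer.
        grants = {}
        for j in range(n):
            if out_match[j] is not None:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if requests[i][j] and in_match[i] is None:
                    grants.setdefault(i, []).append(j)
                    break
        if not grants:
            break
        # Accept: each input takes the first granting output at or after its pointer.
        for i, outputs in grants.items():
            j = min(outputs, key=lambda o: (o - accept_ptr[i]) % n)
            in_match[i], out_match[j] = j, i
            if it == 0:   # pointers move only for first-iteration matches
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
    return [(i, j) for i, j in enumerate(in_match) if j is not None]


FULL = [[1, 1], [1, 1]]
gp, ap = [0, 0], [0, 0]
slots = [islip(FULL, gp, ap, iterations=1) for _ in range(3)]
```

With both inputs fully loaded, the first slot matches only one pair because the pointers start synchronized; after that the pointers desynchronize and every slot matches both ports, which is the TDM behaviour listed above.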

Page 110: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

iSLIP

Page 111: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

iSLIP

Page 112: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

iSLIP
Implementation

[Figure: grant arbiters 1, 2, ..., N feeding accept arbiters 1, 2, ..., N, with shared state. Each arbiter is a programmable priority encoder with an N-bit input and a log2 N-bit decision.]

Page 113: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input Queueing
References

• M. Karol et al. “Input vs Output Queueing on a Space-Division Packet Switch”, IEEE Trans Comm., Dec 1987, pp. 1347-1356.

• Y. Tamir, “Symmetric Crossbar arbiters for VLSI communication switches”, IEEE Trans Parallel and Dist Sys., Jan 1993, pp.13-27.

• T. Anderson et al. “High-Speed Switch Scheduling for Local Area Networks”, ACM Trans Comp Sys., Nov 1993, pp. 319-352.

• N. McKeown, “The iSLIP scheduling algorithm for Input-Queued Switches”, IEEE Trans Networking, April 1999, pp. 188-201.

• C. Lund et al. “Fair prioritized scheduling in an input-buffered switch”, Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.

• A. Mekkittikul et al. “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, IEEE Infocom 98, April 1998.

Page 114: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Switching Fabrics

• Output and Input Queueing

• Output Queueing

• Input Queueing
    – Scheduling algorithms
    – Other non-blocking fabrics
    – Combining input and output queues
    – Multicast traffic

Page 115: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other Non-Blocking Fabrics
Clos Network

Page 116: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other Non-Blocking Fabrics
Clos Network

Expansion factor required = 2-1/N (but still blocking for multicast)

Page 117: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other Non-Blocking Fabrics
Self-Routing Networks

[Figure: self-routing network with inputs and outputs labelled 000 to 111.]

Page 118: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other Non-Blocking Fabrics
Self-Routing Networks

The Non-blocking Batcher Banyan Network

[Figure: a Batcher sorter followed by a self-routing network. Cells addressed to outputs 0 to 7 are sorted stage by stage, then routed to outputs 000 to 111.]

• Fabric can be used as scheduler.
• Batcher-Banyan network is blocking for multicast.
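Why sorting first makes the self-routing stage non-blocking can be seen in a small model. The sketch below makes two labelled substitutions: an omega (shuffle-exchange) network stands in for the banyan, and Python's `sorted` stands in for the Batcher sorter. Function names are hypothetical. At each stage a cell is steered by one bit of its destination address, most significant bit first; two cells needing the same switch output is an internal conflict.

```python
def omega_route(cells, n_bits):
    """cells maps input port -> destination port.
    Returns True if every cell reaches its output without an internal conflict."""
    mask = (1 << n_bits) - 1
    position = dict(cells)               # current position -> destination
    for stage in range(n_bits):
        nxt = {}
        for pos, dest in position.items():
            # Perfect shuffle: rotate the position bits left by one.
            shuffled = ((pos << 1) | (pos >> (n_bits - 1))) & mask
            # Self-routing: the 2x2 switch output is chosen by one address bit.
            bit = (dest >> (n_bits - 1 - stage)) & 1
            out = (shuffled & ~1) | bit
            if out in nxt:
                return False             # two cells need the same link: blocked
            nxt[out] = dest
        position = nxt
    return all(pos == dest for pos, dest in position.items())


def batcher_banyan(dests, n_bits):
    """Sort destinations, present them on consecutive inputs, then self-route."""
    return omega_route(dict(enumerate(sorted(dests))), n_bits)


unsorted_ok = omega_route({0: 0, 4: 1}, n_bits=3)   # arbitrary placement
sorted_ok = batcher_banyan([6, 1, 4, 3], n_bits=3)  # sorted and compacted
```

The unsorted pair collides in the first stage even though the two destinations differ, while the sorted, compacted cells pass through cleanly.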

Page 119: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Switching Fabrics

• Output and Input Queueing

• Output Queueing

• Input Queueing
    – Scheduling algorithms
    – Other non-blocking fabrics
    – Combining input and output queues
    – Multicast traffic

Page 120: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Speedup

• Context
    – input-queued switches
    – output-queued switches
    – the speedup problem

• Early approaches

• Algorithms

• Implementation considerations

Page 121: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Speedup: Context

[Figure: a generic switch, with memory on the input side and on the output side.]

The placement of memory gives
- Output-queued switches
- Input-queued switches
- Combined input- and output-queued switches

Page 122: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Output-queued switches

Best delay and throughput performance
- Possible to erect “bandwidth firewalls” between sessions

Main problem
- Requires high fabric speedup (S = N)

Unsuitable for high-speed switching

Page 123: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Input-queued switches

Big advantage
- Speedup of one is sufficient

Main problem
- Can’t guarantee delay due to input contention

Overcoming input contention: use higher speedup

Page 124: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

A Comparison

Memory speeds for 32x32 switch

            Output-queued                        Input-queued
Line Rate   Memory BW    Access Time Per Cell    Memory BW   Access Time
100 Mb/s    3.3 Gb/s     128 ns                  200 Mb/s    2.12 µs
1 Gb/s      33 Gb/s      12.8 ns                 2 Gb/s      212 ns
2.5 Gb/s    82.5 Gb/s    5.12 ns                 5 Gb/s      84.8 ns
10 Gb/s     330 Gb/s     1.28 ns                 20 Gb/s     21.2 ns
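The table entries can be reproduced from the memory bandwidth expressions given earlier, (N+1).R for an output queue and 2R for an input queue. The 424-bit (53-byte) cell size below is an assumption inferred from the access times in the table, not something stated on the slide, and the function name is hypothetical.

```python
CELL_BITS = 53 * 8   # assumed cell size: 424 bits


def memory_requirements(line_rate_bps, n_ports):
    """Memory bandwidth and per-cell access time for OQ and IQ switches."""
    oq_bw = (n_ports + 1) * line_rate_bps   # N writes + 1 read per cell time
    iq_bw = 2 * line_rate_bps               # 1 write + 1 read per cell time
    return {
        "oq_bw_bps": oq_bw,
        "oq_access_s": CELL_BITS / oq_bw,
        "iq_bw_bps": iq_bw,
        "iq_access_s": CELL_BITS / iq_bw,
    }


row = memory_requirements(line_rate_bps=100e6, n_ports=32)
```

At 10 Gb/s the output-queued memory would need an access time of roughly 1.3 ns, which is why the slide set treats output queueing as impractical at high line rates.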

Page 125: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

The Speedup Problem

Find a compromise: 1 < Speedup << N

- to get the performance of an OQ switch
- close to the cost of an IQ switch

Essential for high speed QoS switching

Page 126: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Some Early Approaches

Probabilistic Analyses

- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, “friendly correlated”)
- obtain mean throughput and delays, bounds on tails
- analyze different fabrics (crossbar, multistage, etc)

Numerical Methods

- use actual and simulated traffic traces
- run different algorithms
- set the “speedup dial” at various values

Page 127: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

The findings

Very tantalizing ...

- under different settings (traffic, loading, algorithm, etc)

- and even for varying switch sizes

A speedup of between 2 and 5 was sufficient!

Page 128: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Using Speedup

[Figure: a switch using speedup, with cells labelled 1 and 2.]

Page 129: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Intuition

Speedup = 1
- Bernoulli IID inputs
- Fabric throughput = .58

Speedup = 2
- Bernoulli IID inputs
- Fabric throughput = 1.16
- I/p efficiency = 1/1.16
- Ave I/p queue = 6.25

Page 130: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Intuition (continued)

Speedup = 3
- Bernoulli IID inputs
- Fabric throughput = 1.74
- Input efficiency = 1/1.74
- Ave I/p queue = 1.35

Speedup = 4
- Bernoulli IID inputs
- Fabric throughput = 2.32
- Input efficiency = 1/2.32
- Ave I/p queue = 0.75

Page 131: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Issues

Need hard guarantees
- exact, not average

Robustness
- realistic, even adversarial, traffic, not friendly Bernoulli IID

Page 132: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

The Ideal Solution

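[Figure: a switch with Speedup = N next to a switch with Speedup << N, each between Inputs and Outputs. Can the second behave like the first?]

Question: Can we find
- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch sizes and traffic patterns?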

Page 133: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

What is exact mimicking?

Apply same inputs to an OQ and a CIOQ switch
- packet by packet

Obtain same outputs
- packet by packet

Page 134: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Algorithm - MUCF

Key concept: urgency value
- urgency = departure time - present time

Page 135: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

MUCF

The algorithm

- Outputs try to get their most urgent packets
- Inputs grant to the output whose packet is most urgent, ties broken by port number
- Losing outputs try for their next most urgent packet
- Algorithm terminates when no more matchings are possible
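The rounds of MUCF can be sketched as below. This is an illustrative reading of the four steps, not the authors' implementation: the function name, the dictionary encoding of urgencies, and the example values are assumptions. A smaller urgency value means the cell is due to depart sooner, so it is more urgent.

```python
def mucf(urgency):
    """urgency[(i, j)] = urgency of the most urgent cell at input i for output j.
    Returns a dict mapping each matched output to its input."""
    match = {}
    free_inputs = {i for i, _ in urgency}
    free_outputs = {j for _, j in urgency}
    while True:
        # Each unmatched output requests the free input holding its most urgent cell.
        requests = {}
        for j in sorted(free_outputs):
            candidates = [(u, i) for (i, jj), u in urgency.items()
                          if jj == j and i in free_inputs]
            if candidates:
                u, i = min(candidates)
                requests.setdefault(i, []).append((u, j))
        if not requests:
            return match      # no more matchings are possible
        # Each input grants to the most urgent request, ties broken by port number.
        for i, reqs in requests.items():
            _, j = min(reqs)
            match[j] = i
            free_inputs.discard(i)
            free_outputs.discard(j)


# Hypothetical urgencies on a 2x2 switch: both outputs first ask for input 0.
URGENCY = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 3}
result = mucf(URGENCY)
```

In the example, output 1 loses the tie at input 0 and falls back to its next most urgent cell at input 1 in the second round.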

Page 136: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Stable Marriage Problem

Women = Inputs: Maria, Hillary, Monica

Men = Outputs: Pedro, John, Bill

Page 137: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

An example

Observation: Only two reasons a packet doesn’t get to its output

- Input contention, Output contention

- This is why speedup of 2 works!!

Page 138: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

What does this get us?

Speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF

What about non-FIFO OQ switches?
E.g. WFQ, Strict priority

Page 139: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other results

To exactly emulate an NxN OQ switch

- Speedup of 2 - 1/N is necessary and sufficient
  (Hence a speedup of 2 is sufficient for all N)

- Input traffic patterns can be absolutely arbitrary

- Emulated OQ switch may use any “monotone” scheduling policy

- E.g.: FIFO, LIFO, strict priority, WFQ, etc

Page 140: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

What gives?

Complexity of the algorithms

- Extra hardware for processing

- Extra run time (time complexity)

What is the benefit?

- Reduced memory bandwidth requirements

Tradeoff: Memory for processing

- Moore’s Law supports this tradeoff

Page 141: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Implementation - a closer look

Main sources of difficulty

- Estimating urgency, etc - info is distributed
  (and communicating this info among I/ps and O/ps)

- Matching process - too many iterations?

Estimating urgency depends on what is being emulated

- Like taking a ticket to hold a place in a queue

- FIFO, Strict priorities - no problem

- WFQ, etc - problems

Page 142: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Implementation (contd)

Matching process

- A variant of the stable marriage problem (SMP)

- Worst-case number of iterations for SMP = N^2

- Worst-case number of iterations in switching = N

- High probability and average approximately log(N)

Page 143: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Other Work

Relax stringent requirement of exact emulation

- Least Occupied O/p First Algorithm (LOOFA)
  Keeps outputs always busy if there are packets
  By time-stamping packets, it also exactly mimics

- Disallow arbitrary inputs
  E.g. leaky bucket constrained
  Obtain worst-case delay bounds

Page 144: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

References for speedup

- Y. Oie et al, “Effect of speedup in nonblocking packet switch”, ICC 89.

- A.L Gupta, N.D. Georganas, “Analysis of a packet switch with input and output buffers and speed constraints”, Infocom 91.

- S-T. Chuang et al, “Matching output queueing with a combined input and output queued switch”, IEEE JSAC, vol 17, no 6, 1999.

- B. Prabhakar, N. McKeown, “On the speedup required for combined input and output queued switching”, Automatica, vol 35, 1999.

- P. Krishna et al, “On the speedup required for work-conserving crossbar switches”, IEEE JSAC, vol 17, no 6, 1999.

- A. Charny, “Providing QoS guarantees in input buffered crossbar switches with speedup”, PhD Thesis, MIT, 1998.

Page 145: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Switching Fabrics

• Output and Input Queueing

• Output Queueing

• Input Queueing
    – Scheduling algorithms
    – Other non-blocking fabrics
    – Combining input and output queues
    – Multicast traffic

Page 146: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Multicast Switching

• The problem

• Switching with crossbar fabrics

• Switching with other fabrics

Page 147: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Multicasting

[Figure: multicast example with nodes labelled 1 to 6.]

Page 148: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Crossbar fabrics: Method 1

Copy networks

Copy network + unicast switching

Increased hardware, increased input contention

Page 149: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Method 2
Use copying properties of crossbar fabric

No fanout-splitting: Easy, but low throughput

Fanout-splitting: higher throughput, but not as simple. Leaves “residue”.
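A one-slot sketch of the two disciplines, assuming each input's head-of-line multicast cell is described by the set of outputs it still needs, and that inputs are considered in port order (a hypothetical policy chosen only to keep the example deterministic). The function name is also an assumption.

```python
def serve_slot(hol_fanouts, fanout_splitting):
    """Return the residue (outputs still owed) of each input after one slot."""
    claimed = set()                  # outputs already used in this slot
    residues = []
    for fanout in hol_fanouts:       # inputs considered in port order
        free = fanout - claimed
        if fanout_splitting:
            sent = free              # deliver to whichever outputs are free
        else:
            sent = fanout if free == fanout else set()   # all or nothing
        claimed |= sent
        residues.append(fanout - sent)
    return residues


CELLS = [{0, 1}, {1, 2}]
without_split = serve_slot(CELLS, fanout_splitting=False)
with_split = serve_slot(CELLS, fanout_splitting=True)
```

Without splitting, the second cell waits with its whole fanout and output 2 idles; with splitting, output 2 is used and only a residue of one copy carries over to the next slot.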

Page 150: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

The effect of fanout-splitting

Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic

Page 151: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Placement of residue

Key question: How should outputs grant requests?

(and hence decide placement of residue)

Page 152: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Residue and throughput

Result: Concentrating residue brings more new work forward. Hence leads to higher throughput.

But, there are fairness problems to deal with.

This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.

Page 153: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Multicasting and Tetris

[Figure: Tetris-style grid of multicast cells from input ports 1 to 5 stacked over output ports 1 to 5, with the residue marked.]

Page 154: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Multicasting and Tetris

[Figure: Tetris-style grid of multicast cells from input ports 1 to 5 stacked over output ports 1 to 5, with the residue concentrated.]

Page 155: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Replication by recycling

Main idea: Make two copies at a time using a binary tree with input at root and all possible destination outputs at the leaves.

[Figure: binary replication tree with nodes labelled a, b, c, d, e, x, y.]

Page 156: High Performance Switches and Routers: Theory and Practice Sigcomm 99 August 30, 1999 Harvard University Nick McKeown Balaji Prabhakar Departments of Electrical

Replication by recycling (cont’d)

[Figure: recycling datapath with blocks Receive, Recycle, Network, Reseq, Transmit, and Output Table.]

Scalable to large fanouts. Needs resequencing at outputs and introduces variable delays.


References for Multicasting

• J. Hayes et al. “Performance analysis of a multicast switch”, IEEE Trans. on Communications, vol. 39, April 1991.

• B. Prabhakar et al. “Tetris models for multicast switches”, Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996

• B. Prabhakar et al. “Multicast scheduling for input-queued switches”, IEEE JSAC, 1997

• J. Turner, “An optimal nonblocking multicast virtual circuit switch”, INFOCOM, 1994


Tutorial Outline

• Introduction: What is a Packet Switch?

• Packet Lookup and Classification: Where does a packet go next?

• Switching Fabrics: How does the packet get there?

• Output Scheduling: When should the packet leave?


Output Scheduling

• What is output scheduling?

• How is it done?

• Practical Considerations


Output Scheduling

[Figure: scheduler]

• Allocating output bandwidth

• Controlling packet delay


Output Scheduling

FIFO

Fair Queueing


Motivation

• FIFO is natural but gives poor QoS

– bursty flows increase delays for others

– hence cannot guarantee delays

Need round robin scheduling of packets

– Fair Queueing

– Weighted Fair Queueing, Generalized Processor Sharing


Fair queueing: Main issues

• Level of granularity

– packet-by-packet? (favors long packets)

– bit-by-bit? (ideal, but very complicated)

• Packet Generalized Processor Sharing (PGPS)

– serves packet-by-packet

– and imitates bit-by-bit schedule within a tolerance


How does WFQ work?

WR = 1, WG = 5, WP = 2
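A minimal sketch of how weights of 1, 5, and 2 translate into service order. It assumes the simplest case: all three flows are backlogged from time zero with equal-length packets, so each packet's finish tag is just the previous tag plus length/weight and no virtual clock is needed. The flow names R, G, P and the function name are illustrative assumptions.

```python
from fractions import Fraction
import heapq


def wfq_order(weights, packets_per_flow, packet_len=1):
    """Serve packets in increasing finish-tag order.

    Assumes every flow is backlogged from time 0, so the finish tag of
    a flow's k-th packet is k * packet_len / weight. Weights and
    packet_len must be integers (exact Fraction arithmetic avoids
    floating-point ties).
    """
    heap = []
    for idx, (flow, weight) in enumerate(weights.items()):
        finish = Fraction(0)
        for seq in range(packets_per_flow):
            finish += Fraction(packet_len, weight)
            # idx breaks ties between flows deterministically.
            heapq.heappush(heap, (finish, idx, seq, flow))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[3])
    return order


if __name__ == "__main__":
    order = wfq_order({"R": 1, "G": 5, "P": 2}, packets_per_flow=10)
    print(order[:8])
```

Among the first eight packets served, G gets five, P gets two, and R gets one, matching the 5:2:1 weight ratio. A full WFQ implementation also has to track virtual time as flows become idle and busy, which this sketch deliberately leaves out.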


Delay guarantees

• Theorem

If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
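For the single-node case the bound can be written out explicitly. The notation here is assumed, not taken from the slides: session i is leaky-bucket constrained with burst size σ_i and average rate ρ_i, φ_i is its GPS weight, and r is the link rate. If the guaranteed rate g_i is at least ρ_i, the fluid GPS delay of session i is bounded as follows:

```latex
g_i = \frac{\phi_i}{\sum_j \phi_j}\, r ,
\qquad
g_i \ge \rho_i
\;\Longrightarrow\;
D_i^{*} \le \frac{\sigma_i}{g_i}
```

For example, with weights 1, 5, and 2 the weight-5 session has g = 5r/8, so a burst of σ bits is cleared within 8σ/(5r) seconds. The packet-by-packet version adds a term on the order of one maximum-size packet transmission time.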


Practical considerations

• For every packet, the scheduler needs to

– classify it into the right flow queue and maintain a linked-list for each flow

– schedule it for departure

• Complexities of both are O(log [# of flows])

– first is hard to overcome

– second can be overcome by DRR


Deficit Round Robin

[Figure: DRR example. Queued packet sizes: 50, 700, 250; 400, 600; 200, 600, 100; 500. Quantum size: 500. Other values shown: 250, 500, 500, 400, 750, 1000.]

Good approximation of FQ

Much simpler to implement
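A minimal sketch of Deficit Round Robin, following the usual description: each backlogged flow receives one quantum per round, sends head-of-line packets while its deficit counter covers them, and loses any leftover deficit when its queue empties. The flow names and the example queue contents are assumptions chosen for illustration (the packet sizes are borrowed from the figure, but the head-of-line order is a guess).

```python
from collections import deque


def deficit_round_robin(queues, quantum):
    """Return the departure order as (flow, packet_size) pairs.

    queues maps a flow name to its packet sizes, head of line first.
    """
    backlog = {flow: deque(sizes) for flow, sizes in queues.items()}
    deficit = {flow: 0 for flow in backlog}
    # Active list: flows that currently have packets queued.
    active = deque(flow for flow in backlog if backlog[flow])
    order = []
    while active:
        flow = active.popleft()
        deficit[flow] += quantum
        q = backlog[flow]
        # Send as long as the head packet fits in the deficit.
        while q and q[0] <= deficit[flow]:
            size = q.popleft()
            deficit[flow] -= size
            order.append((flow, size))
        if q:
            active.append(flow)   # still backlogged: keep the deficit
        else:
            deficit[flow] = 0     # idle flows do not accumulate credit
    return order


if __name__ == "__main__":
    example = {"A": [50, 700, 250], "B": [400, 600], "C": [200, 600, 100]}
    for flow, size in deficit_round_robin(example, quantum=500):
        print(flow, size)
```

Each visit to a flow costs constant work per packet sent, provided the quantum is at least the maximum packet size, which is why the scheduling step avoids the O(log [# of flows]) sort needed by WFQ.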


But...

• WFQ is still very hard to implement

– classification is a problem

– needs to maintain too much state information

– doesn’t scale well


Strict Priorities and Diff Serv

• Classify flows into priority classes

– maintain only per-class queues

– perform FIFO within each class

– avoid “curse of dimensionality”
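The per-class structure above can be sketched in a few lines. This is an illustrative sketch, not a description of any particular router: the class name, the convention that class 0 is the highest priority, and the method names are all assumptions.

```python
from collections import deque


class StrictPriorityScheduler:
    """One FIFO per priority class; class 0 is served first."""

    def __init__(self, num_classes):
        self.queues = [deque() for _ in range(num_classes)]

    def enqueue(self, packet, priority_class):
        # State is per class, not per flow.
        self.queues[priority_class].append(packet)

    def dequeue(self):
        # Serve the highest-priority non-empty class, FIFO within it.
        for q in self.queues:
            if q:
                return q.popleft()
        return None  # nothing queued


if __name__ == "__main__":
    sched = StrictPriorityScheduler(num_classes=3)
    sched.enqueue("bulk-1", 2)
    sched.enqueue("voice-1", 0)
    sched.enqueue("bulk-2", 2)
    sched.enqueue("voice-2", 0)
    print([sched.dequeue() for _ in range(4)])
```

The scheduler's state grows with the number of classes rather than the number of flows. The cost is that a lower class can be starved while any higher class stays backlogged.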


Diff Serv

• A framework for providing differentiated QoS

– set Type of Service (ToS) bits in packet headers

– this classifies packets into classes

– routers maintain per-class queues

– condition traffic at network edges to conform to class requirements

May still need queue management inside the network


References for O/p Scheduling

- A. Demers et al., “Analysis and simulation of a fair queueing algorithm”, ACM SIGCOMM, 1989.

- A. Parekh, R. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: the single node case”, IEEE/ACM Trans. on Networking, June 1993.

- A. Parekh, R. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: the multiple node case”, IEEE/ACM Trans. on Networking, April 1994.

- M. Shreedhar, G. Varghese, “Efficient Fair Queueing using Deficit Round Robin”, ACM SIGCOMM, 1995.

- K. Nichols, S. Blake (eds), “Differentiated Services: Operational Model and Definitions”, Internet Draft, 1998.


Active Queue Management

• Problems with traditional queue management

– tail drop

• Active Queue Management

– goals

– an example

– effectiveness


Tail Drop Queue Management: Lock-Out

[Figure: queue with Max Queue Length marked.]


Tail Drop Queue Management

• Drop packets only when queue is full

– long steady-state delay

– global synchronization

– bias against bursty traffic


Global Synchronization

[Figure: queue occupancy, with Max Queue Length marked.]


Bias Against Bursty Traffic

[Figure: queue occupancy, with Max Queue Length marked.]


Alternative Queue Management Schemes

• Drop from front on full queue

• Drop at random on full queue

Both solve the lock-out problem.

Both have the full-queues problem.
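The three full-queue policies differ only in which packet is sacrificed when an arrival finds the buffer full. A minimal sketch, assuming a capacity of at least one packet; the function name and the policy strings are illustrative assumptions.

```python
import random
from collections import deque


def enqueue_with_policy(queue, packet, capacity, policy, rng=random):
    """Add packet to queue; return the dropped packet, or None.

    policy is one of "tail", "front", "random" (hypothetical labels).
    """
    if len(queue) < capacity:
        queue.append(packet)
        return None
    if policy == "tail":
        return packet                      # arriving packet is dropped
    if policy == "front":
        dropped = queue.popleft()          # oldest packet is dropped
        queue.append(packet)
        return dropped
    if policy == "random":
        victim = rng.randrange(len(queue)) # uniformly chosen queued packet
        dropped = queue[victim]
        del queue[victim]
        queue.append(packet)
        return dropped
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    q = deque(["p1", "p2", "p3"])
    print(enqueue_with_policy(q, "p4", 3, "front"), list(q))
```

In all three cases nothing is dropped until the buffer is completely full, so the queue still sits near its maximum length in steady state. That is the full-queues problem that active queue management targets.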


Active Queue Management: Goals

• Solve lock-out and full-queue problems

– no lock-out behavior

– no global synchronization

– no bias against bursty flow

• Provide better QoS at a router

– low steady-state delay

– lower packet dropping


Active Queue Management

• Problems with traditional queue management

– tail drop

• Active Queue Management

– goals

– an example

– effectiveness


Random Early Detection (RED)

if qavg < minth: admit every packet

else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth)/(maxth - minth)

else if qavg > maxth: drop every incoming packet

[Figure: queue with thresholds minth and maxth and average length qavg; packets labeled P1, Pk, P2.]
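The three-way rule above can be sketched directly. This follows the slide's simplified formula, in which the drop probability rises linearly from 0 at minth to 1 at maxth; the RED paper additionally scales the ramp by a maximum probability parameter, which is omitted here. The function names and the default averaging weight are assumptions for illustration.

```python
import random


def red_drop_probability(q_avg, min_th, max_th):
    """Drop probability as a function of the average queue length."""
    if q_avg < min_th:
        return 0.0                                   # admit every packet
    if q_avg <= max_th:
        return (q_avg - min_th) / (max_th - min_th)  # linear ramp
    return 1.0                                       # drop every packet


def update_average(q_avg, q_len, weight=0.002):
    """Exponentially weighted moving average of the queue length.

    The weight value is an illustrative assumption.
    """
    return (1.0 - weight) * q_avg + weight * q_len


def red_admit(q_avg, min_th, max_th, rng=random):
    """Return True if the arriving packet is admitted."""
    return rng.random() >= red_drop_probability(q_avg, min_th, max_th)


if __name__ == "__main__":
    for q in (4, 10, 15, 20):
        print(q, red_drop_probability(q, min_th=5, max_th=15))
```

Because the decision uses the averaged length rather than the instantaneous one, a short burst raises qavg only slightly and is mostly admitted, while sustained load pushes qavg up the ramp.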


Effectiveness of RED: Lock-Out

• Packets are randomly dropped

• Each flow has the same probability of being discarded


Effectiveness of RED: Full-Queue

• Drop packets probabilistically in anticipation of congestion (not when queue is full)

• Use qavg to decide packet dropping probability: allow instantaneous bursts

• Randomness avoids global synchronization


What QoS does RED Provide?

• Lower buffer delay: good interactive service

– qavg is controlled to be small

• Given responsive flows: packet dropping is reduced

– early congestion indication allows traffic to throttle back before congestion

• Given responsive flows: fair bandwidth allocation


Unresponsive or aggressive flows

• Don’t properly back off during congestion

• Take away bandwidth from TCP compatible flows

• Monopolize buffer space


Control Unresponsive Flows

• Some active queue management schemes

– RED with penalty box

– Flow RED (FRED)

– Stabilized RED (SRED)

identify and penalize unresponsive flows with a bit of extra work


Active Queue Management: References

• B. Braden et al. “Recommendations on queue management and congestion avoidance in the internet”, RFC2309, 1998.

• S. Floyd, V. Jacobson, “Random early detection gateways for congestion avoidance”, IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.

• D. Lin, R. Morris, “Dynamics of random early detection”, ACM SIGCOMM, 1997

• T. Ott et al. “SRED: Stabilized RED”, INFOCOM 1999

• S. Floyd, K. Fall, “Router mechanisms to support end-to-end congestion control”, LBL technical report, 1997


Tutorial Outline

• Introduction: What is a Packet Switch?

• Packet Lookup and Classification: Where does a packet go next?

• Switching Fabrics: How does the packet get there?

• Output Scheduling: When should the packet leave?


Basic Architectural Components

[Figure: blocks labeled Policing, Output Scheduling, Switching, Routing, Congestion Control, Reservation, and Admission Control; regions labeled Control and Datapath (per-packet processing).]


Basic Architectural Components

Datapath: per-packet processing

[Figure: three Forwarding Decision blocks, each with its own Forwarding Table, feeding an Interconnect followed by Output Scheduling; stages numbered 1 to 3.]