1
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
IP routers with memory that runs slower than the line rate
Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
[email protected]
http://www.stanford.edu/~nickm
2
Outline
• Trends in packet switch design
• Additional problem: “Data rates may soon exceed memory bandwidth”
• The Fork-Join Router & Parallel Packet Switches
3
First Packet Switches: Shared Memory
Large, single, dynamically allocated memory buffer:
– N writes per “cell” time
– N reads per “cell” time
Limited by memory bandwidth.
[Figure: shared-memory switch with Input 1, Input 2, …, Input N and Output 1, Output 2, …, Output N]
A large body of work has proven and made possible:
– Fairness
– Delay Guarantees
– Delay Variation Control
– Loss Guarantees
– Statistical Guarantees
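The “N writes and N reads per cell time” constraint can be turned into a rough access-time budget. The sketch below is only an illustration: the function names, the 64-byte cell size, and the 16-port, 2.5Gb/s example are assumptions, not figures from the talk.

```python
def required_memory_bandwidth_gbps(num_ports, line_rate_gbps):
    """Aggregate bandwidth the shared memory must sustain:
    N writes plus N reads per cell time, each at the line rate."""
    return 2 * num_ports * line_rate_gbps


def access_budget_ns(num_ports, line_rate_gbps, cell_bytes=64):
    """Time available for one memory operation (assumed cell size: 64 bytes)."""
    cell_time_ns = cell_bytes * 8 / line_rate_gbps  # bits / (Gb/s) gives ns
    operations_per_cell_time = 2 * num_ports        # N writes + N reads
    return cell_time_ns / operations_per_cell_time


# Hypothetical example: 16 ports at 2.5 Gb/s each.
print(required_memory_bandwidth_gbps(16, 2.5))  # 80.0
print(access_budget_ns(16, 2.5))
```

The budget shrinks linearly with both the port count and the line rate, which is why this design is limited by memory bandwidth.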
4
Later Packet Switches: Single-stage crossbar with CIOQ and VOQs
– 1 write per “cell” time
– 1 read per “cell” time
– Rate of writes/reads determined by switch fabric speedup
[Figure: linecards, each with Lookup & Drop Policy, Virtual Output Queues, and Output Scheduling, connected through a bufferless switch core (Switch Fabric and Switch Arbitration)]
5
Myths about CIOQ-based crossbar switches
1. “Input-queued crossbars have low throughput”
– An input-queued crossbar can have as high throughput as any switch.
2. “Crossbars don’t support multicast traffic well”
– A crossbar inherently supports multicast efficiently.
3. “Crossbars don’t scale well”
– Today, it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5Tb/s crossbar switches.
6
Myths about CIOQ-based crossbar switches (2)
4. “Crossbar switches can’t support delay/QoS guarantees”
– With an internal speedup of 2, a CIOQ switch can (in theory) precisely emulate a shared memory switch for all traffic.
7
What makes sense today?
                  Shared Memory   Input Queued   CIOQ    Multistage
Blocking          No              No             No      Yes
Speedup           High            High           Small   High
Emulation of SM   Yes             No             Yes     No
Multicast         Good            Good           Good    Poor
Resequencing      No              No             No      Yes
Power             Low             OK             OK      High
Packaging         -               OK             OK      Complex
8
Summary of trend
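[Figure: shared-memory switch (Input 1, Input 2, …, Input N; Output 1, Output 2, …, Output N), then a Switch Fabric with Switch Arbitration, then multistage fabrics; trend arrows numbered 1 and 2]
• Shared memory. Limited by: memory bandwidth, ~50Gb/s
• Switch fabric with arbitration. Limited by: per-cell arbitration, power, ~5Tb/s
• Higher capacity. Multistage: Clos, Banyan, Toroidal, …; less frequent arbitration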
9
Buffer Memory: How Fast Can I Make a Packet Buffer?
[Figure: Buffer Memory (10ns on-chip DRAM) between an External Line (e.g. OC768c) and the Switch Fabric, connected on each side by a 64-byte wide bus]
Rough Estimate:
– 10ns per memory operation.
– Two memory operations per packet.
– Therefore, maximum ~26Gb/s.
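The rough estimate can be reproduced with a few lines of arithmetic. This sketch assumes that each memory operation moves one 64-byte bus word and that a packet costs one write plus one read; the slide gives only the 10ns, two-operation, and 64-byte figures, and the function name is illustrative.

```python
def max_buffer_rate_gbps(bus_bytes=64, op_time_ns=10.0, ops_per_packet=2):
    """Upper bound on packet buffer throughput.

    Assumes one bus-wide word per memory operation and
    ops_per_packet operations (one write, one read) per word.
    """
    bits_per_word = bus_bytes * 8
    time_per_word_ns = op_time_ns * ops_per_packet
    return bits_per_word / time_per_word_ns  # bits per ns equals Gb/s


print(max_buffer_rate_gbps())  # 25.6
```

Under these assumptions the bound is 512 bits every 20ns, i.e. 25.6Gb/s, which matches the slide's ~26Gb/s and falls short of a 40Gb/s line.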
10
How can we make routers with 40Gb/s, 160Gb/s, … interfaces?
11
Higher capacity and higher line rates
[Figure: trend diagram from the “Summary of trend” slide (Inputs 1…N, Outputs 1…N, Switch Fabric, Switch Arbitration), extended with a third arrow; axes: Higher capacity, Higher line rates]
• Shared memory. Limited by: memory bandwidth, ~50Gb/s
• Switch fabric with arbitration. Limited by: per-cell arbitration, power, ~5Tb/s
• Multistage: less frequent arbitration
• More parallelism: Fork-Join Router
12
Fork-Join Router
How can we:
– Increase capacity.
– Reduce power per subsystem.
While at the same time…
– Keep the system simple.
– Support line rates faster than memory bandwidth.
– Provide delay guarantees.
Increase parallelism.
Multiple racks.
Single-stage buffering.
Pkt-by-pkt load balancing.
Hmmm….?
13
The Fork-Join Router
[Figure: ports 1…N on each side, each at rate R, connected through bufferless fork and join stages to k parallel routers (1, 2, …, k)]
14
The Fork-Join Router
• Advantages
– Single stage of buffering
– k× lower power per subsystem
– k× lower memory bandwidth per subsystem
– k× lower forwarding table lookup rate per subsystem
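As a rough illustration of the memory-bandwidth advantage, the sketch below estimates how many parallel subsystems k a given line rate would need if each subsystem's buffer were capped at the ~26Gb/s figure from the packet buffer estimate. It assumes perfectly even load spreading and no internal speedup; the function name and default limit are illustrative.

```python
import math


def min_parallel_subsystems(line_rate_gbps, per_subsystem_limit_gbps=25.6):
    """Smallest k such that line_rate / k fits within one subsystem's
    memory bandwidth (assumes even spreading and speedup S = 1)."""
    return math.ceil(line_rate_gbps / per_subsystem_limit_gbps)


for rate in (40, 160):
    k = min_parallel_subsystems(rate)
    print(f"{rate} Gb/s line -> k = {k}, {rate / k:.1f} Gb/s per subsystem")
```

With these assumptions a 40Gb/s line needs k = 2 and a 160Gb/s line needs k = 7; any internal speedup would raise these numbers.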
15
The Fork-Join Router
• Questions
– Switching: What is the performance?
– Forwarding Lookups: How do they work?
16
A Parallel Packet Switch
[Figure: ports 1…N on each side, each at rate R, connected to k parallel Output Queued Switches (1, 2, …, k)]
Arriving packet tagged with egress port
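A minimal sketch of packet-by-packet spreading across the k switches follows. The slides do not say how a layer is chosen for each packet; plain round-robin is used here purely as an assumption to show the idea, and the packet tuples and function name are hypothetical.

```python
from itertools import cycle


def spread_packets(packets, k):
    """Assign each (packet_id, egress_port) tuple to one of k layers.

    Round-robin is an illustrative choice, not the dispatch rule
    from the talk.
    """
    layers = {layer: [] for layer in range(1, k + 1)}
    next_layer = cycle(range(1, k + 1))
    for packet in packets:
        layers[next(next_layer)].append(packet)
    return layers


# Six packets arriving on one input, each already tagged with an egress port.
arrivals = [(1, 3), (2, 1), (3, 3), (4, 2), (5, 1), (6, 3)]
for layer, assigned in spread_packets(arrivals, k=3).items():
    print(layer, assigned)
```

Each layer sees one packet in every k from this input, so its ingress link can run at roughly R/k. Note that packets for the same egress port (port 3 here) end up in different layers, which is what raises the work-conservation question.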
17
Performance Questions
1. Can it be work-conserving?
2. Can it emulate a single big output queued switch?
3. Can it support delay guarantees, strict-priorities, WFQ, …?
18
Work Conservation
[Figure: input 1 and output 1, each at rate R, connected to k Output Queued Switches (1, 2, …, k) by internal links of rate R/k; labels: Input Link Constraint, Output Link Constraint]
19
Work Conservation
[Figure: input 1 and output 1, each at rate R, connected to k layers (1, 2, …, k) by internal links of rate R/k; numbered packets 1 to 5 illustrate the Output Link Constraint]
20
Work Conservation
[Figure: ports 1…N on each side, each at rate R, connected to k Output Queued Switches (1, 2, …, k) by internal links of rate S(R/k)]
21
Precise Emulation of an Output Queued Switch
[Figure: N×N Output Queued Switch (ports 1…N) = ? N×N Parallel Packet Switch (ports 1…N)]
22
Parallel Packet Switch Theorems
1. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can be work-conserving for all traffic.
2. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.
23
Parallel Packet Switch Theorems
3. If S > 3k/(k+3) ≈ 3, then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
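The speedup thresholds 2k/(k+2) and 3k/(k+3) stay below 2 and 3 respectively and approach them as k grows. A short sketch to tabulate them (the function names are illustrative, and the chosen values of k are arbitrary):

```python
from fractions import Fraction


def fcfs_speedup_threshold(k):
    """Threshold 2k/(k+2) from Theorems 1 and 2."""
    return Fraction(2 * k, k + 2)


def qos_speedup_threshold(k):
    """Threshold 3k/(k+3) from Theorem 3."""
    return Fraction(3 * k, k + 3)


for k in (2, 4, 10, 100):
    print(k,
          round(float(fcfs_speedup_threshold(k)), 3),
          round(float(qos_speedup_threshold(k)), 3))
```

For k = 2 the thresholds are 1 and 1.2; for k = 100 they are already about 1.96 and 2.91, so for large k the requirement is effectively S > 2 and S > 3.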
24
Parallel Packet Switch Theorems
4. If S >= 1, then a parallel packet switch with a small co-ordination buffer at rate R can precisely emulate a FCFS switch for all traffic.
25
Co-ordination buffers
[Figure: co-ordination buffers of size Nk on the input side and the output side, on links of rate R, connected to k Output Queued Switches (1, 2, …, k) by internal links of rate R/k]
26
Parallel Packet Switch Theorems
5. If S > 2, then a parallel packet switch with a small co-ordination buffer at rate R can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
27
The Fork-Join Router
• Questions
– Switching: What is the performance?
– Forwarding Lookups: How do they work?
28
The Fork-Join Router: Lookahead Forwarding Table Lookups
– Packet tagged with egress port at next router
– Lookup performed in parallel at rate R/k
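To give a feel for what “lookup at rate R/k” means in lookups per second, the sketch below assumes back-to-back 40-byte packets as a worst case. The packet size, the 160Gb/s line rate, and k = 8 are illustrative assumptions, not figures from the talk.

```python
def lookups_per_second(line_rate_gbps, k=1, packet_bytes=40):
    """Lookup rate each of k parallel subsystems must sustain,
    assuming back-to-back packets of packet_bytes (an assumed worst case)."""
    packets_per_second = line_rate_gbps * 1e9 / (packet_bytes * 8)
    return packets_per_second / k


single = lookups_per_second(160)         # one lookup engine at full rate R
parallel = lookups_per_second(160, k=8)  # each of 8 engines at rate R/k
print(f"{single / 1e6:.1f} M lookups/s vs {parallel / 1e6:.1f} M lookups/s")
```

Under these assumptions a single engine would need 500 million lookups per second, while each of 8 parallel engines needs 62.5 million.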
29
The Fork-Join Router
[Figure: ports 1…N on each side, each at rate R, connected to k parallel routers (1, 2, …, k)]
• Possibly >100Tb/s aggregate capacity
• Line rates in excess of 100Gb/s