Slide 1
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$
(ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Slide 2
Alpha 21364 Network
[Figure: a network of 21364 chips (each including a router), every chip connected to Rambus memory (M) and I/O (IO). Chip diagram labels: 21264 core, L2 cache tags, L2 cache data, router, MC1, MC2.]
Slide 3
The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed Arbitration Algorithm Controls the Crossbar
• 8 Input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 Output ports: 4 network, 2 memory/cache, 1 I/O
• Router Pipeline Length = 13/14 cycles
• Virtual Cut-Through
Slide 4
Problem: Maximize # Matches
Input Port 0   1 2 3
Input Port 1   1 2 3
Input Port 2   1 2 3
Input Port 3   1 2 3
Input Port 4   1 6 3
Input Port 5   0 2 3
Input Port 6   4 2 3
Input Port 7   5 2 3
(numbers in table cells: destination output port; an arrow in the figure marks the older packet at each input port)
• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
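The gap between the two policies on this slide can be checked with a short sketch. Several things here are assumptions rather than statements from the slides: the last entry in each row is taken to be the oldest packet, Input Port 0's row is read as 1 2 3, and the function names are illustrative. The "smarter algorithm" is modelled as a textbook maximum bipartite matching (augmenting paths), which is only a stand-in for a perfect matcher and not something the 21364 implements.

```python
# Destination output port of each queued packet, one row per input port.
# Assumption: the last entry in each row is the oldest packet.
QUEUES = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 6, 3],
    [0, 2, 3],
    [4, 2, 3],
    [5, 2, 3],
]


def oldest_first_matches(queues):
    """Each input offers only its oldest packet; each output accepts one."""
    return len({queue[-1] for queue in queues})


def max_matches(queues):
    """Size of a maximum input/output matching, found via augmenting paths."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(queues)))


print(oldest_first_matches(QUEUES), max_matches(QUEUES))
```

Under these assumptions every oldest packet targets output port 3, so only one can win, while a full matching uses all seven output ports.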
Slide 5
Simpler Algorithms Have Fewer Matches
[Figure: # arbitration matches per cycle (0 to 7) versus % occupied input packet buffers in a 21364 router (0 to 30). Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (SPAA). An arrow labelled "complexity" runs alongside the legend. Assumes all output ports are free.]
Slide 6
Complexity may not pay off
[Figure: # arbitration matches per cycle (0 to 7) versus fraction of output ports occupied (0 to 0.75), at 30% input buffer occupancy. Curves: Perfect, Complex (WFA), Complex (PIM), Complex (PIM1), Simple (21364). An arrow labelled "complexity" runs alongside the legend.]
Slide 7
Key Results
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load
Slide 8
Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch
Implement via “connection” matrix
[Figure: connection matrix with one row per input port and one column per output port. Cell (i,j) takes Request, N, and W as inputs and produces Grant, S, and E.]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
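The three cell equations can be exercised with a small software sketch. This is a simplification under stated assumptions: the wave always starts at cell (0, 0), whereas a real wave front arbiter changes the initial cell over time, and the function name and list-of-lists request format are illustrative only.

```python
def wavefront_arbitrate(requests):
    """Apply the WFA cell equations to a request matrix.

    requests[i][j] is truthy if input port i has a packet for output port j.
    Assumption: fixed priority starting at cell (0, 0); no rotation.
    """
    rows, cols = len(requests), len(requests[0])
    north = [True] * cols  # N input of the next cell down in each column
    grants = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        west = True  # W input of the leftmost cell in this row
        for j in range(cols):
            grant = bool(requests[i][j]) and north[j] and west
            grants[i][j] = grant
            north[j] = north[j] and not grant  # S = N & NOT(Grant)
            west = west and not grant          # E = W & NOT(Grant)
    return grants


print(wavefront_arbitrate([[1, 1], [1, 1]]))
```

A grant in a cell blocks every cell to its east and south, so each input port and each output port receives at most one grant.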
Slide 9
WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stage (1) 1.5 cycles, stage (2) 1.5 cycles, stage (3) 1 cycle]
Slide 10
WFA Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – micropipelining waves (2) is difficult because initial cell changes every cycle
  – restarting (1) before (2) completes is complex
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]
Slide 11
Parallel Iterative Matching (PIM)
Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: unmatched output port selects an input port packet randomly
– Accept: unselected input port selects a grant randomly
[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0 and 1 and output ports 0 and 1. Output Port 0 unused in this arbitration round.]
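The three steps of one iteration can be sketched in a few lines. This is an illustration under assumptions, not the evaluated implementation: requests are reduced to a set of wanted output ports per input port, all ports start unmatched, and the function and parameter names are hypothetical.

```python
import random


def pim1(requests, rng=random):
    """One PIM iteration; returns {input port: accepted output port}.

    requests[i] is the set of output ports that input port i has packets for.
    Assumption: every port is unmatched at the start of the iteration.
    """
    # Nominate: each input nominates to every output it has a packet for.
    nominations = {}
    for inp, outs in enumerate(requests):
        for out in sorted(outs):
            nominations.setdefault(out, []).append(inp)

    # Grant: each output picks one of its nominating inputs at random.
    grants = {}
    for out, inps in nominations.items():
        grants.setdefault(rng.choice(inps), []).append(out)

    # Accept: each input that received grants keeps one at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Input 0 wants outputs 0 and 1; input 1 wants only output 1.
print(pim1([{0, 1}, {1}], random.Random(0)))
```

With this request pattern, both output ports may grant input port 0, which can accept only one of them; the other output port then goes unused for the round.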
Slide 12
PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stage (1) 1.5 cycles, stage (2) 1.5 cycles, stage (3) 1 cycle]
Slide 13
PIM1 Limitations
- Higher number of estimated cycles
  – 4 cycles in 0.18 micron
- Harder to pipeline effectively
  – restarting (1) before (2) completes is complex
  – same packet can be nominated multiple times requiring the “Accept” step (part of stage 2)
  – large in-flight packet table due to large number of nominations (up to 54)
  – may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: stages (1), (2), (3) take 1.5, 1.5, and 1 cycles; the next arbitration starts 3 cycles after the previous one]
Slide 14
Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha 21364 Router
Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects an input port packet based on the least-recently selected one
– Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: panels showing input ports 0 and 1 and output ports 0 and 1. Panel labels: Nominate, Grant, Accept, Reset.]
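The Nominate, Grant, and Reset steps can be sketched as a single arbitration round. The data layout is an assumption made for illustration: nominations arrive as one output port per input port, each output port keeps a timestamp of when it last selected each input port, and ties are broken by the lower port number. None of these names come from the slides.

```python
def spaa_round(nominations, last_selected, now):
    """One SPAA-style round.

    nominations: {input port: the single output port it nominates to}
    last_selected: {output port: {input port: time last selected}}, updated
        in place (hypothetical bookkeeping for "least-recently selected")
    Returns (grants, losers): grants maps output port -> winning input port;
    losers are the inputs that must reset and renominate later.
    """
    # Nominate: group the single nomination of each input by output port.
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    # Grant: each output picks its least-recently selected nominating input.
    grants = {}
    for out, inps in by_output.items():
        history = last_selected.setdefault(out, {})
        winner = min(inps, key=lambda i: (history.get(i, -1), i))
        history[winner] = now
        grants[out] = winner

    # Reset: unselected inputs renominate in a subsequent cycle.
    losers = sorted(set(nominations) - set(grants.values()))
    return grants, losers


state = {}
print(spaa_round({0: 1, 1: 1}, state, now=0))
print(spaa_round({0: 1, 1: 1}, state, now=1))
```

Because each input port nominates to exactly one output port, no Accept step is needed: a grant is final as soon as the output port issues it.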
Slide 15
SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3) take 1 cycle each]
Slide 16
SPAA’s Advantages
+ Fewer cycles
  – 3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration because only one packet is nominated to one output port
+ Easier to pipeline
  – restart (1) for free input ports before (2) completes
  – only one packet nominated to one output port
  – small number (16) of in-flight packets
  – avoids any centralized matrix
  – speculative read allows data flits to follow header flits
[Pipeline diagram: stages (1), (2), (3) take 1 cycle each; the next arbitration starts 1 cycle after the previous one]
Slide 17
Summary: Simpler is Better
                          WFA              PIM1             SPAA (Alpha 21364)
# Matches Per Cycle       High             Medium           Lower
# cycles (0.18 microns)   4                4                3
Restart Rate              Every 3 cycles   Every 3 cycles   Every cycle
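The effect of the restart rate can be made concrete with simple pipeline arithmetic. The sketch below counts how many arbitrations finish within a window of cycles, given the latency and restart interval from the table. It is a simplification: it treats each algorithm as one pipeline that restarts at a fixed interval, and the function name and the 30-cycle window are illustrative choices, not figures from the slides.

```python
def completed_arbitrations(window, latency, restart_interval):
    """Arbitrations that both start and finish within `window` cycles.

    Assumes the first arbitration starts at cycle 0 and a new one starts
    every `restart_interval` cycles, each taking `latency` cycles.
    """
    if window < latency:
        return 0
    return (window - latency) // restart_interval + 1


# Latency and restart interval taken from the summary table.
print("WFA/PIM1:", completed_arbitrations(30, latency=4, restart_interval=3))
print("SPAA:    ", completed_arbitrations(30, latency=3, restart_interval=1))
```

Under these assumptions SPAA completes roughly three times as many arbitrations in the same window, which is how an algorithm with fewer matches per arbitration can still come out ahead.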
Slide 18
Saturation Behavior
• Reasons: Hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
[Figure: 64 Node Network, Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curve: SPAA-base, with the saturation point marked.]
Slide 19
Rotary Rule
21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
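One way to picture "change output port priority to the Rotary Rule" is as a different sort key in an output port's grant decision. The sketch below is a guess at the shape of that change, not the 21364's logic: the port numbering (inputs 0 to 3 as network ports, 4 to 7 as local ports) is hypothetical, least-recently-selected order is assumed to remain the tie-breaker, and the anti-starvation mechanism the slide mentions is left out.

```python
# Hypothetical numbering: input ports 0-3 are network ports; 4-7 are local
# (memory, cache, I/O).
NETWORK_INPUTS = frozenset(range(4))


def grant(nominating_inputs, last_selected, rotary=False):
    """Pick the input port an output port grants to.

    last_selected: {input port: time this output last selected it}
    Base policy: least-recently selected input wins.
    Rotary policy (assumed form): network inputs beat local inputs, with
    least-recently selected as the tie-breaker. No anti-starvation here.
    """
    def key(inp):
        age = last_selected.get(inp, -1)
        if rotary:
            return (inp not in NETWORK_INPUTS, age, inp)
        return (age, inp)

    return min(nominating_inputs, key=key)


history = {0: 5, 4: 1}  # local input 4 was selected less recently than 0
print(grant([0, 4], history), grant([0, 4], history, rotary=True))
```

Favoring packets already in the network over newly injected local traffic acts as a throttle on input load, which matches the slide's description of the rule as "more throttling".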
Slide 20
Simulation Methodology
Asim modeling infrastructure
– detailed timing model of 21364 network
– selected design points validated against RTL
Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
Slide 21
64 Node Network: Base Case
[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, SPAA-base, with the knee marked.]
• SPAA outperforms WFA & PIM1
  – 24% higher throughput at knee
Slide 22
64 Node Network: With Rotary Rule
[Figure: Random Traffic. Average packet latency (nanoseconds, 0 to 300) versus delivered flits/router/nanosecond (0 to 0.8). Curves: PIM1, WFA-base, WFA-rotary, SPAA-base, SPAA-rotary.]
• Rotary Rule helps both SPAA & WFA
Slide 23
Summary & Conclusions
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load