Highly Fault-tolerant NoC Routing with Application-aware Congestion Management
Doowon Lee, Ritesh Parikh and Valeria Bertacco
University of Michigan
2
Wide Range of Applications
(picture sources) 1. N-body simulation: https://www.astro.rug.nl/~weygaert 2. semiconductor: http://spectrum.ieee.org 3. computational biology: http://csbio.cs.umn.edu/ 4. molecular structure: http://nanotechnologyuniverse.com
everyday applications
cloud computing
physical simulation
scientific applications
computational chemistry
computational biology
semiconductor simulation
varying computation characteristics, user requirements, etc.
3
Application Running on Network-on-Chip
(picture sources) 1. Video encoder: Gary Sullivan et al., Standardized Extensions of High Efficiency Video Coding (HEVC) 2. Tilera TILE-Gx8072: http://www.tilera.com
application example: video encoder
chip multiprocessor with network-on-chip (NoC)
mapping
analysis
communication frequency
[Figure: communication frequency (number of flits) by source and destination, from a 64-thread simulation of SPLASH-2 (ocean); example nodes A and B are marked]
some pairs communicate more frequently
4
Fragile Networks-on-Chip
increasing transistor density → transistor reliability ↓
network-on-chip… possible single point of failure
tail of transistor scaling: 22 nm (Intel), 14 nm (Intel), 7 nm (IBM)
permanent faults → solution: network-on-chip routing reconfiguration
5
How to reduce NoC degradation from faults?
[Figure: saturation throughput (flits/cycle, 0 to 10) vs. number of faults affecting the NoC (0 to 60) for state-of-the-art routing reconfiguration [Aisopos 11]; annotated with the minimum throughput requirement and our goal]
motivating experiment: fault vs. performance degradation
KEY IDEA: application-aware routing optimized to application’s communication patterns
Network-on-chip reconfiguration entails performance degradation
6
Application-Aware Routing (1/2)
problem: various route options (no restriction)
[Figure: mesh from source S to destination D, nodes labeled with path counts; path diversity = 6, deadlock possible]
solution 1: avoid deadlock by restricting turns
[Figure: the same mesh with turn restrictions; path diversity = 3, deadlock-free]
How do we find adaptive routing optimized to communication patterns?
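A minimal sketch of how the path-diversity numbers on this slide can be counted, assuming minimal (shortest-path) routing on a 2-D mesh. The function name, the `(x, y)` node coordinates, and the `forbidden_turns` encoding are assumptions made for illustration; the two restricted turns in the example are hypothetical and are not the ones drawn on the slide.

```python
from functools import lru_cache

EAST, SOUTH = (1, 0), (0, 1)  # y grows downward in this sketch

def path_diversity(src, dst, forbidden_turns=frozenset()):
    """Count minimal paths from src to dst on a 2-D mesh.

    A turn is (node, incoming_dir, outgoing_dir). forbidden_turns is a
    hypothetical encoding of turn restrictions, not the slides' notation.
    """
    sx = (dst[0] > src[0]) - (dst[0] < src[0])
    sy = (dst[1] > src[1]) - (dst[1] < src[1])
    steps = [d for d in ((sx, 0), (0, sy)) if d != (0, 0)]

    @lru_cache(maxsize=None)
    def count(node, incoming):
        if node == dst:
            return 1
        total = 0
        for d in steps:
            nxt = (node[0] + d[0], node[1] + d[1])
            # Stay inside the bounding box spanned by src and dst.
            if (dst[0] - nxt[0]) * sx < 0 or (dst[1] - nxt[1]) * sy < 0:
                continue
            if incoming is not None and (node, incoming, d) in forbidden_turns:
                continue
            total += count(nxt, d)
        return total

    return count(src, None)

unrestricted = path_diversity((0, 0), (2, 2))
restricted = path_diversity(
    (0, 0), (2, 2),
    frozenset({((1, 0), EAST, SOUTH), ((1, 1), EAST, SOUTH)}),
)
```

On a 3×3 mesh, corner to corner, the unrestricted count is 6; with the two hypothetical east-to-south restrictions it drops to 3.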
7
Application-Aware Routing (2/2)
problem: various route options (no restriction)
[Figure: mesh from source S to destination D, nodes labeled with path counts; path diversity = 6, deadlock possible]
solution 2: avoid deadlock by restricting turns
Where to best place turn restrictions? NP-complete problem
[Figure: the same mesh with a different turn-restriction placement; path diversity = 6]
How do we find adaptive routing optimized to communication patterns?
OUR CONTRIBUTION: turn-restriction placement heuristic
8
Presentation Outline
• FATE (Fault- and Application-aware Turn-model Extension)
– (1) Turn-enabling rules: “How to reduce search?”
– (2) Load estimation: “Which is the most valuable turn?”
– (3) Overall routing computation algorithm
• Experimental evaluation
• Conclusions
9
How to reduce turn-restriction search?
To avoid unfruitful turn-restriction patterns…
pattern 1. network disconnection
pattern 2. non-minimal restriction
pattern 3. possible deadlock
[Figures: one small mesh example per pattern]
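One common way to test for pattern 3 is to look for a cycle in the channel dependency graph: each directed link is a vertex, and an edge joins two consecutive links when going straight or taking an enabled turn connects them. The sketch below assumes that criterion. The slides do not say how FATE detects deadlock, and `turn_allowed` is a hypothetical callback introduced here.

```python
def mesh_channels(width, height):
    """All directed links of a width x height mesh as (src, dst) node pairs."""
    chans = []
    for x in range(width):
        for y in range(height):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    chans.append(((x, y), (nx, ny)))
    return chans

def direction(ch):
    (x0, y0), (x1, y1) = ch
    return (x1 - x0, y1 - y0)

def has_dependency_cycle(width, height, turn_allowed):
    """True if the channel dependency graph contains a cycle.

    turn_allowed(node, in_dir, out_dir) -> bool says whether a turn is
    enabled (hypothetical interface). Straight moves are always allowed,
    U-turns never.
    """
    chans = mesh_channels(width, height)
    succ = {c: [] for c in chans}
    for a in chans:
        for b in chans:
            if a[1] != b[0] or b[1] == a[0]:
                continue  # not consecutive, or a U-turn
            din, dout = direction(a), direction(b)
            if din == dout or turn_allowed(a[1], din, dout):
                succ[a].append(b)

    # Iterative depth-first search with white/grey/black colouring.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in chans}
    for root in chans:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(succ[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if colour[nxt] == GREY:
                    return True  # back edge: dependency cycle
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                colour[node] = BLACK
                stack.pop()
    return False

all_turns = lambda node, din, dout: True
xy_only = lambda node, din, dout: din[1] == 0  # only horizontal-to-vertical turns
```

With every turn enabled, a 2×2 mesh already has a cycle; restricting turns to horizontal-then-vertical (as in dimension-order routing) removes all cycles.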
10
Turn-Enabling Rules
To avoid unfruitful turn-restriction patterns…
… each time a turn is disabled, several others should be enabled
basic rules: enable adjacent turns (cycle, node, link)
advanced rules: enable remote turns (horizontal, vertical, diagonal)
[Figures: a 3×3 mesh (nodes 0 to 8) for the basic rules and a 4×4 mesh (nodes 0 to 15) for the advanced rules]
11
Traffic-Load Estimation
Which is the most valuable turn? Use traffic-load estimation to decide.
specific goals: (1) balancing link utilization (2) prioritizing turns that are critical
load calculation steps: path diversity → link load → turn load → cycle load → weight scaling
take into account hop-by-hop route-decisions
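A minimal sketch of the link-load step for one source-destination pair on a fault-free mesh. It assumes that traffic is split equally among the minimal next hops at every node, which is one reading of "hop-by-hop route-decisions"; the slides do not spell out the split rule. Turn load and cycle load are not covered, and all names are illustrative.

```python
from collections import defaultdict

def link_loads(src, dst, frequency=1.0):
    """Per-link load of one flow, returned as {(node, next_node): load}.

    Assumption: at each hop the load is divided equally among the
    minimal next hops toward dst.
    """
    sx = (dst[0] > src[0]) - (dst[0] < src[0])
    sy = (dst[1] > src[1]) - (dst[1] < src[1])
    width, height = abs(dst[0] - src[0]), abs(dst[1] - src[1])

    node_load = defaultdict(float)
    node_load[src] = frequency
    loads = {}
    # Visit nodes in order of hop distance from src so that a node's
    # incoming load is complete before it is split.
    for dist in range(width + height):
        for i in range(max(0, dist - height), min(width, dist) + 1):
            node = (src[0] + sx * i, src[1] + sy * (dist - i))
            options = []
            if node[0] != dst[0]:
                options.append((node[0] + sx, node[1]))
            if node[1] != dst[1]:
                options.append((node[0], node[1] + sy))
            share = node_load[node] / len(options)
            for nxt in options:
                loads[(node, nxt)] = share
                node_load[nxt] += share
    return loads

loads = link_loads((0, 0), (2, 2))
```

For a corner-to-corner flow on a 3×3 region, the two links leaving the source each carry 0.5 and the links leaving the next nodes carry 0.25. Passing a communication frequency as `frequency` scales every load by that factor.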
12
Traffic-Load Estimation Step by Step
[Figure: 4×4 mesh (nodes 0 to 15) with one source and one destination, annotated with path counts and stepped through the five stages: path diversity, link load, turn load, cycle load, weight scaling]
weight scaling: multiply by communication frequency
legend: high traffic, medium traffic, low traffic
13
Example: Link, Turn, Cycle Load (1/2)
traffic-load estimation, 5 steps: diversity → link → turn → cycle → scale
[Figure: 4×4 mesh (nodes 0 to 15) with source and destination]
path diversity: counts 1, 1, 1, 2, 1, 0 (out of 6) give fractions 0.17, 0.17, 0.17, 0.33, 0.17, 0
link load (from path diversity): 1/2 = 0.5 and 0.5 on the first links, 0.25 on later links
turn load: 0.125, 0.125, 0, 0.25
cycle load: sum of turn loads around the cycle through nodes 9, 10, 13, 14
14
Example: Link, Turn, Cycle Load (2/2)
traffic-load estimation, 5 steps: diversity → link → turn → cycle → scale
[Figure: the same 4×4 mesh (nodes 0 to 15) with source and destination]
path diversity fractions: 0.17, 0.17, 0.17, 0.33, 0.17, 0
link load (from path diversity): 1/2 = 0.5 and 0.5 on the first links, 0.25 on later links
turn load: 0.25, 0.125, 0 (no path)
cycle load for the cycle through nodes 9, 10, 13, 14: sum: 0.375
15
Example: Weight Scaling
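[Figure, before scaling: mesh with flow S1 → D1; loads shown 0.125, 0.25, 0.38, 0.38; the most congested cycle is highlighted]
scaling
source → destination: S1 → D1, S2 → D2
communication frequency: 20 (S1 → D1), 8 (S2 → D2)
[Figure, after scaling: the same mesh with both flows S1 → D1 and S2 → D2; loads shown 2.5, 5, 3, 3, 9.8, 8, 13.2, 12.5, 9.2, 9, 13.5]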
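The scaling step is a multiply-and-accumulate: each flow's load is multiplied by its communication frequency, and the contributions of all flows are summed. The sketch below uses the frequencies 20 and 8 from the slide. The turn names and the assignment of individual load values to flows are hypothetical, chosen only to make the arithmetic concrete.

```python
from collections import defaultdict

def scale_and_sum(per_flow_loads, frequency):
    """Weight each flow's loads by its frequency and sum over flows.

    per_flow_loads: {flow: {item: unscaled load}}
    frequency:      {flow: communication frequency}
    """
    total = defaultdict(float)
    for flow, loads in per_flow_loads.items():
        for item, load in loads.items():
            total[item] += load * frequency[flow]
    return dict(total)

# Hypothetical per-turn loads; only the frequencies come from the slide.
per_flow_loads = {
    "S1->D1": {"turn_a": 0.125, "turn_b": 0.25},
    "S2->D2": {"turn_b": 0.375},
}
frequency = {"S1->D1": 20, "S2->D2": 8}
weighted = scale_and_sum(per_flow_loads, frequency)
```

Here `turn_a` ends up at 0.125 × 20 = 2.5 and `turn_b` at 0.25 × 20 + 0.375 × 8 = 8.0.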
16
Putting it all together
[Figure: the two-flow mesh example (S1 → D1, S2 → D2), shown before and after a turn decision]
1) evaluate turns, one at a time (choose the one leading to least congestion)
2) apply turn-enabling rules
iterate this process until no undecided turn is left
17
Backtracking
deadlock possible due to greedy turn-restriction selections: turn-enabling rules do not resolve all deadlock-causing patterns
→ backtrack to the last decision
example placement
[Figure: 3×4 mesh (nodes 0 to 11) with the example turn-restriction placement]
decision tree: node 5, turn NW → node 6, turn NE → deadlock detected → backtrack → node 3, turn SW → …
18
FATE Route-Computation Procedure
procedure flowchart: start (trigger) → estimate traffic load → choose turn to be disabled → deadlock? disconnect? (if so, backtrack) → apply turn-enabling rules → no undecided turn? (if not, loop) → end
trigger: (1) new application launch (2) fault occurrence
network example legend: disabled turn, enabled turn, undecided turn; high traffic, medium traffic, low traffic
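The loop in the flowchart can be sketched as a greedy search with backtracking. This is a toy abstraction and not the FATE implementation: turns are plain labels, a "cycle" is a set of turns that deadlocks when all of them are enabled, the load is a fixed number per turn, and the enabling rule (enable the other turns of every cycle that contains the disabled turn) is a stand-in loosely modeled on the cycle rule. The sketch checks acceptability after enabling, which is one possible ordering. All names are assumptions.

```python
def place_restrictions(turns, load, is_acceptable, enable_after):
    """Greedy turn-restriction placement with backtracking (toy sketch).

    turns:          list of turn labels, all initially undecided
    load(turn):     estimated traffic load; lower is cheaper to disable
    is_acceptable:  state -> bool, False on deadlock or disconnection
    enable_after:   turn -> turns to enable once `turn` is disabled
    A state maps turn -> "disabled" | "enabled"; missing means undecided.
    """
    def search(state):
        undecided = [t for t in turns if t not in state]
        if not undecided:
            return state
        for t in sorted(undecided, key=load):
            trial = dict(state)
            trial[t] = "disabled"
            for e in enable_after(t):
                trial.setdefault(e, "enabled")  # never overrides a decision
            if not is_acceptable(trial):
                continue  # reject this choice and try the next candidate
            result = search(trial)
            if result is not None:
                return result
            # Deeper search failed: backtrack and try the next candidate.
        return None

    return search({})

# Hypothetical toy instance.
CYCLES = [{"a", "b"}, {"c", "d"}, {"b", "d"}]
LOAD = {"a": 1, "c": 2, "b": 5, "d": 9}

def enable_after(turn):
    return sorted(set().union(*(c for c in CYCLES if turn in c)) - {turn})

def is_acceptable(state):
    # Deadlock in this toy model: some cycle has all of its turns enabled.
    return not any(all(state.get(t) == "enabled" for t in c) for c in CYCLES)

placement = place_restrictions(list(LOAD), LOAD.get, is_acceptable, enable_after)
disabled = {t for t, s in placement.items() if s == "disabled"}
```

In this instance the search disables `a` first, then tries `c`, finds that cycle {b, d} would be fully enabled, rejects that choice, and settles on disabling `d` instead.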
19
Presentation Outline
• FATE routing
• Experimental evaluation
– Experimental setup
– Evaluation on faulty topologies
– Evaluation on fault-free topologies
– Overheads
• Conclusions
20
Experimental Setup
• BookSim simulation with 8×8 mesh networks
– 3-stage router pipeline, 2 VCs/protocol class, 5 flits/VC
• Fault injection
– faults in bidirectional links
– 5 fault rates: 1 faulty link, 3%, 5%, 10%, and 15% faulty links
– 10 random fault patterns for each fault rate
• Traffic benchmarks
– 5 synthetic patterns: bit complement, bit reversal, shuffle, transpose, uniform random
– 11 traces from SPLASH-2 multi-threaded workloads
  • generated from gem5 simulation with MESI cache coherence
  • 4 memory controllers at mesh corners
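For reference, the four permutation patterns named above are commonly defined on the bits of the node index. The sketch below follows those common textbook definitions for a network with 2^bits nodes (bits = 6 for an 8×8 mesh); it assumes row-major node numbering and is not taken from BookSim's source, whose details may differ.

```python
def bit_complement(src, bits):
    """Destination is the bitwise complement of the source index."""
    return ~src & ((1 << bits) - 1)

def bit_reversal(src, bits):
    """Destination is the source index with its bit order reversed."""
    return int(format(src, f"0{bits}b")[::-1], 2)

def shuffle(src, bits):
    """Destination is the source index rotated left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) | (src >> (bits - 1))) & mask

def transpose(src, bits):
    """Destination swaps the upper and lower halves of the index bits."""
    half = bits // 2
    mask = (1 << bits) - 1
    return ((src << half) | (src >> half)) & mask

BITS = 6  # 64 nodes
destinations = {
    name: [fn(s, BITS) for s in range(1 << BITS)]
    for name, fn in (("bitcomp", bit_complement), ("bitrev", bit_reversal),
                     ("shuffle", shuffle), ("transpose", transpose))
}
```

With row-major numbering (index = 8·y + x), `transpose` sends node (x, y) to node (y, x), so node 1 maps to node 8.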
21
Prior Routing Solutions
• Fault-tolerant routing
– Breadth-First Search (BFS) [Schroeder 91, Aisopos 11]
– Depth-First Search (DFS) [Sancho 04]
• Application-aware routing
– Bandwidth-Sensitive Oblivious Routing (BSOR) [Kinsy 09, Kinsy 13]
– Application-Specific Routing Algorithms (APSRA) [Palesi 08]
• Fully-adaptive routing on 2D mesh (congestion management)
– Dynamic XY (DyXY) [Li 06]
– Neighbor on Path (NoP) [Ascia 08]
– Regional Congestion Awareness (RCA) [Gratz 08]
22
Saturation Throughput for Synthetic Patterns
legend: BFS, DFS (fault-tolerant); BSOR, APSRA (application-aware); FATE (our solution)
[Figure, left: saturation throughput (packet/cycle/router, 0 to 0.05) vs. number of faulty links for BFS, DFS, BSOR, APSRA, and FATE]
data labels: 9.5%, 10.6%, 17.7%, 23.3%, 33.3% (over fault-tolerant routing); 5.5%, -0.5%, 0.1%, 2.9%, 9.3% (over application-aware routing)
less performance degradation as faults increase
33.3% ↑ over fault-tolerant routing
9.3% ↑ over app.-aware routing
[Figure, right: saturation throughput (packet/cycle/router, 0 to 0.05) per traffic pattern (bitcomp, bitrev, shuffle, transpose, uniform) at 15% fault rate for BFS, DFS, BSOR, APSRA, and FATE]
gains maximized with unbalanced load
still provide gain with uniform load
23
Packet Latency for SPLASH-2 Traces
[Figure, left: average packet latency (cycles, 0 to 120) vs. number of faulty links (1 fault, 3%, 5%, 10%, 15% faults) for BFS, DFS, BSOR, APSRA, and FATE]
minimal increase until 5% faults
[Figure, right: average packet latency (cycles, 0 to 120) per benchmark program at 15% fault rate; one bar is labeled 228 cycles]
up to 59% (13%) latency reduction over BFS (APSRA)
significantly lower latency in 5 programs
24
Performance on Fault-Free Meshes
[Figure: saturation throughput (packet/cycle/router, 0 to 0.1) vs. number of VCs (3, 4, 6) for DOR (deterministic); DyXY, NoP, RCA1D (fully-adaptive); BFS, DFS (fault-tolerant); BSOR, APSRA (application-aware); FATE (our solution)]
Compared to DOR, fault-tolerant and application-aware routing, FATE always provides higher saturation throughput (→ better traffic-load estimation)
Compared to fully-adaptive, FATE outperforms at small number of VCs (→ more VCs for normal transfer)
25
Overheads
• Software computation
– 2-4 sec for 8×8 meshes on Intel Xeon® processor (two orders of magnitude faster than APSRA)
– ~110 turn-placement attempts (little dependence on fault rate)
• Hardware overheads
– Area: 6% increase (routing table, route-computation logic)
– Power consumption not measured
  • Better power-efficiency than APSRA
  • Can be more power-efficient than application-agnostic solutions when reusing same routing multiple times
26
Conclusions
• FATE provides highly fault-tolerant routing with graceful performance degradation by leveraging application traffic patterns
• Performance improvement over existing fault-tolerant routing
– 33% improvement in saturation throughput (synthetic traffic patterns)
– 59% improvement in packet latency (SPLASH-2 traces)
• Two orders of magnitude faster route computation than APSRA
27
Thank you! Questions?
28
Backup Slides
29
Various Turn-Restriction Choices
exponential increase of turn-restriction choices as network size increases
example 1: 4 nodes, 4 possibilities
example 2: 6 nodes, 16 possibilities (other 8 cases not shown)
2-D mesh with M nodes contains $4^{(\sqrt{M}-1)\times(\sqrt{M}-1)}$ possibilities
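A quick check of how fast this count grows. The slide states the formula for a square mesh with M nodes; writing it for a rows × cols mesh as 4^((rows−1)(cols−1)) is an assumption made here, chosen because it reproduces both examples (4 nodes and 6 nodes).

```python
def turn_restriction_choices(rows, cols):
    """Number of turn-restriction choices, assuming 4 per unit square of the mesh."""
    return 4 ** ((rows - 1) * (cols - 1))

examples = {
    "2x2 (4 nodes)": turn_restriction_choices(2, 2),
    "2x3 (6 nodes)": turn_restriction_choices(2, 3),
    "8x8 (64 nodes)": turn_restriction_choices(8, 8),
}
```

For the 8×8 meshes used in the evaluation this gives 4^49, far too many placements to enumerate, which is why a heuristic is needed.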
30
Basic Turn-Enabling Rules (Cycle, Node, Link)
Which turns should be enabled upon a turn-restriction decision? (1) to minimize the number of restrictions (2) to guarantee deadlock-freedom
turn types: undecided, enabled, disabled
rule 1 (cycle) [Figure: 2×2 mesh, nodes 0 to 3]
rule 2 (node) [Figure: 3×3 mesh, nodes 0 to 8]
rule 3 (link) [Figure: 3×3 mesh, nodes 0 to 8]
What happens if we break the rules? deadlock happens
[Figure: 2×3 mesh, nodes 0 to 5, with the violated turn marked]
31
Advanced Turn-Enabling Rules (Common Link, Opposite-corner Turn)
turn types: undecided, enabled (basic), enabled (advanced), disabled, candidate
rule 4: common link [Figure: 4×4 mesh, nodes 0 to 15]
Why rule 4? Let’s apply basic rules… [Figure: 2×4 mesh, nodes 0 to 7]
should be enabled for both candidates
rule 5: opposite-corner turn [Figure: 4×4 mesh, nodes 0 to 15]
horizontal enabling, vertical enabling, diagonal enabling
see paper for details
32
Applying Basic Turn-Enabling Rules to Faulty Topologies
rule 1: cycle
special case (no double count): counted only for one cycle
[Figure: 3×3 mesh, nodes 0 to 8, with the mutual turn marked]
deadlock when disabling only mutual turn
rules 2 & 3: node & link
no special change
[Figure: 3×3 mesh, nodes 0 to 8]
33
Applying Advanced Turn-Enabling Rules to Faulty Topologies
rule 4: common link
apply only towards fault-free directions
[Figure: 4×4 mesh, nodes 0 to 15]
rule 5: opposite-corner turn
apply as if fault-free
[Figure: 4×4 mesh, nodes 0 to 15]