Reliability trade-offs against power and performance
Nenad Stanković
Mentor: Michael Imhof
Reliable NoC in the Many Core Era
5/19/2009
2
Overview
Introduction to reliability in NoCs
Reliability issues
Solutions for high-reliability design
Conclusion
4
Introduction to Reliability
Technology scaling
System Design
Performance
Area
Power
[Source: www.seed.slb.com]
5
Introduction to Reliability
Problems that lead to NoC
Systems-on-chip to Networks-on-chip
Fault tolerance design
High reliability, but…
less power consumption
on a smaller area
with more performance
7
Reliability Issues
Number of problems increasing with technology scaling
New problems emerging
No concrete overall solutions to NoC design available
[Figure: design trade-off diagram; labels: Errors, Faults, Reliability, Power, Area, Performance, Architectures, Design, Technology]
8
Main Factors
Wire loads
Core temperature and aging
Power management
Reconfiguration circuits
Environmental effects and industrial application
Quality of service
Throughput, Latency
Best-effort design
9
Scaling Problems
Process variability
Transient faults
Crosstalk, EMI
Other noises
Leakage, Interconnect
Supply Voltage errors
[Source: http://www.eecs.berkeley.edu/]
11
Solutions in NoC Designs
Countermeasures:
Avoidance
Detection
Containment
Isolation
Recovery
Domain:
Functional view
Design space
Hardware view
[Figure: countermeasures (avoidance, detection, containment, isolation, recovery) set against domains (hardware view, design space, functional view)]
12
Countermeasures
Avoidance
Detection
Containment
Isolation
Recovery
18
Domain: Functional View
Retransmission schemes
E2E, HBH, FEC, HE2E, HFEC
Buffers
Latency and Power
Coding
Hamming code
Single bit errors
Small overhead in area and power
Message, Router IDs
Checker circuits
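Note on the Hamming code bullet: a minimal sketch of single-bit error correction, assuming the common Hamming(7,4) layout (parity bits at positions 1, 2 and 4). The slide does not say which code length is used, and the function names are illustrative only.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data_bits, syndrome); syndrome is the 1-based error position, 0 if clean."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the single faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome
```

The three extra bits per four data bits are the area and power overhead. A double-bit error would be miscorrected by this code, which is why the slide limits the claim to single bit errors.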
20
Domain: Design Space
Estimation
Topology, traffic, communication methods
Analytical models vs. simulation
Power vs. performance
Software
Tools, Frameworks
Simulation and synthesis
21
Estimation Problems
Topology, traffic, communication methods
Analytical models vs. simulation
Power vs. performance
[Source: Ge Fen, Wu Ning, Wang Qi, “Simulation and Performance Evaluation for Network on Chip design Using OPNET”]
24
Estimation Problems
[Figure: analytical latency model; labels: Total Latency, Message transfer time, Blocking time, Average Blocking Length, Average distance in Hops, Average Waiting time, Switch arbiter conflict probability, Virtual channel conflict probability, Injection rate]
25
Estimation Problems
[Figure: power estimation flow; labels: Utilization, Power Profile, Power Estimation, Average Buffer Power Consumption, Average Routing Computation Power Consumption, Average Crossbar Power Consumption, Average Link Power Consumption]
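Note on power estimation: a rough sketch of how the four component averages named on this slide could be combined into one router estimate. It assumes a simple additive model scaled linearly by utilization. That scaling, the class and function names, and the milliwatt unit are all assumptions and are not taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class RouterPowerProfile:
    """Average active power per component, in milliwatts (hypothetical unit)."""
    buffer_mw: float
    routing_mw: float
    crossbar_mw: float
    link_mw: float


def estimate_power_mw(profile, utilization):
    """Estimate router power as utilization times the sum of component averages."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    active = (profile.buffer_mw + profile.routing_mw
              + profile.crossbar_mw + profile.link_mw)
    return utilization * active  # assumed linear scaling, leakage ignored
```

A real estimator would also account for leakage and for traffic-dependent switching activity. The sketch only shows how a power profile and a utilization figure feed one number.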
26
Estimation Problems
[Source: Jongman Kim, Dongkook Park, Chrysostomos Nicopoulos, N. Vijaykrishnan, C. R. Das, “Design and Analysis of an NoC Architecture from Performance, Reliability and Energy Perspective” ]
27
Software
Tools, Frameworks
Configurability
Mapping
Simulation and Synthesis
PIRATE, SMAP, OPNET
Design space exploration
Power and Performance assessment
[Source: Gianluca Palermo, Cristina Silvano, “PIRATE: A Framework for Power/Performance Exploration of Networks-On-Chip Architectures”]
29
Domain: Hardware View
Multiple problems to consider
System level
Fault resistance
Variability tolerant
30
System Level
Topology, Mapping
Pipelines and buses
Wireless communication
Power management
System, Component
Links
Estimator circuits
[Figure: processing elements PE1 to PE4 connected by a pipelined link; each of Stage 1 to Stage 4 has an error checking circuit]
31
System Level
Power management
System, Component
Links
Estimator circuits
[Figure: power management; labels: PE1 to PE4, Control Policy, Estimator, Power Manager, Router, Core, PE]
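Note on the estimator and control policy blocks: a minimal sketch of how an estimator could feed a control policy that picks a power level. The moving-average estimator, the thresholds (0.25 and 0.75) and the three level names are hypothetical. The slide only names the blocks.

```python
from collections import deque


class UtilizationEstimator:
    """Moving average of link or router utilization over the last few intervals."""

    def __init__(self, window=4):
        self.samples = deque(maxlen=window)

    def update(self, busy_cycles, total_cycles):
        """Record one observation interval and return the averaged utilization."""
        self.samples.append(busy_cycles / total_cycles)
        return sum(self.samples) / len(self.samples)


def control_policy(utilization):
    """Map estimated utilization to a power level (thresholds are assumptions)."""
    if utilization < 0.25:
        return "low"
    if utilization < 0.75:
        return "medium"
    return "high"
```

The power manager would then apply the chosen level to a core, router or link. Averaging over a window keeps the policy from reacting to a single bursty interval.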
33
Fault Resistance Design
Memory
Error detection and correction
Circuit level
[Figure: fault-resistant router components; labels: Memory, Routing unit, Switch Arbiter, Virtual Channel, Handshaking, Routing Algorithm, Hamming Code, Retransmission Buffers, IDs, Voting system]
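Note on the voting system: a minimal sketch assuming triple modular redundancy with a bitwise majority vote. The slide does not say which voting scheme is meant, and the function name is illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same word."""
    return (a & b) | (a & c) | (b & c)


def vote_with_flag(a, b, c):
    """Return the voted word and whether any copy disagreed with it."""
    voted = majority_vote(a, b, c)
    disagreement = (a != voted) or (b != voted) or (c != voted)
    return voted, disagreement
```

Each output bit follows at least two of the three copies, so a fault confined to one copy is masked. The price is roughly three times the area and power of the protected unit, plus the voter itself.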
34
Fault Resistance Design
Circuit level
[Figure: error-detecting flip-flop; labels: Main FF, Delayed FF, MUX, Error Control Circuit, XOR, INPUT, OUTPUT, CLK, CLK_D, SEL]
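Note on the Main FF / Delayed FF circuit: a behavioural sketch of double sampling, assuming the usual arrangement in which the delayed flip-flop samples on CLK_D, the XOR flags a mismatch, and the error control circuit drives SEL so the MUX forwards the delayed sample. The signal model (a single transition at `settle_time`) and the function name are assumptions for illustration.

```python
def double_sample(settle_time, old_value, new_value, t_clk, t_clk_d):
    """Model one capture of an input that switches from old_value to new_value.

    Returns (output, error_detected).
    """
    def signal(t):
        # Input holds old_value until settle_time, then new_value.
        return new_value if t >= settle_time else old_value

    main_ff = signal(t_clk)          # sampled on CLK
    delayed_ff = signal(t_clk_d)     # sampled on the delayed clock CLK_D
    error = main_ff ^ delayed_ff     # XOR comparison
    sel = error                      # error control circuit drives SEL
    output = delayed_ff if sel else main_ff  # MUX
    return output, bool(error)
```

With CLK at 1.0 and CLK_D at 1.3, a transition settling at 1.1 is caught and corrected, while one settling at 1.5 misses both samples and goes undetected. The delay between CLK and CLK_D therefore bounds the timing faults this circuit can cover.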
35
Fault Resistance Design
Circuit level
Wire length
Supply Voltage
[Source: Atul Maheshwari, Wayne Burleson, Russell Tessier, “Trading Off Transient Fault Tolerance and Power Consumption in Deep Submicron (DSM) VLSI Circuits” ]
36
Fault Resistance Design
Circuit level
Transistor sizes
Threshold voltage
[Source: Atul Maheshwari, Wayne Burleson, Russell Tessier, “Trading Off Transient Fault Tolerance and Power Consumption in Deep Submicron (DSM) VLSI Circuits” ]
38
Variation Tolerant Design
Voltage swing and clock skewing
Self calibrating and reconfigurable circuits
[Figure: PE1 to PE4 connected by a four-stage link (Stage 1 to Stage 4); labels: Voltage Swing, Clock Skewing]
39
Variation Tolerant Design
Self calibrating and reconfigurable circuits
[Figure: self-calibrating flip-flop circuit; labels: Adaptive V. Swing, Main FF, MUX, Error Checker and Configurability Circuit, INPUT, OUTPUT, CLK, SEL, FF1, FF2, FF3, 16%, 32%, 48%]
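Note on FF1 to FF3: a sketch of how a configurability circuit might pick among the flip-flops. It assumes that 16%, 32% and 48% are clock delays expressed as fractions of the clock period and that the earliest tap capturing the data correctly is selected. Both points are interpretations of the figure labels, not statements from the slide.

```python
# Tap 0 is the main flip-flop; taps 1 to 3 stand for FF1, FF2, FF3 (assumed meaning).
TAP_DELAYS = (0.0, 0.16, 0.32, 0.48)


def select_tap(arrival_time, clock_period=1.0, taps=TAP_DELAYS):
    """Return the index of the earliest tap that samples after the data arrives.

    Returns None when even the most delayed tap samples too early.
    """
    for index, fraction in enumerate(taps):
        sample_time = clock_period * (1.0 + fraction)
        if arrival_time <= sample_time:
            return index
    return None
```

Choosing the earliest working tap keeps the added latency as small as the observed variation allows. A `None` result stands for a delay the circuit cannot absorb on its own.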
41
Conclusion
Reliability
Power
Performance
Area
Fault tolerance
Technology
Errors, faults
Solutions:
Various Possibilities
Levels
Design Space Exploration
Trade-offs
[Figure: pipelined link with an error checking circuit at each of Stage 1 to Stage 4]
Thank you for your attention!