Online Timing Variation Tolerance for Digital Integrated Circuits
Guihai Yan & Xiaowei Li
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS)
Sources of timing variation
• PVT variation
  – Dynamic: voltage and temperature fluctuations
  – Static: process variation
• Aging degradation
  – NBTI, PBTI, TDDB
• Soft errors (in non-regular logic)
  – SEU and SET
Process variation
• Sub-wavelength lithography: "What you get is not what you want" (systematic)
• Random dopant fluctuations: Vth variation (random)
[Figure: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) versus technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm) over 1980-2020, on a 10nm to 1µm scale, with the gap between the two curves marked]
Maximum frequency differs by 20%! [Teodorescu, ISCA’08]
P variation is time-independent, “DC component”
Temperature variation
• Application-specific
• Slow-varying: milliseconds
• Typical thermal constant: 2 ms [Donald, ISCA’06]
T variation is slow-varying, “Low-frequency components”
[Figure: quad-core processor (Core1 to Core4), each core with an EL Synthesizer, coordinated by a TM Agent]
Voltage variation
• Fast-changing
• Inductive noise, a.k.a. the L(di/dt) problem
• IR-drop
Why is it harder to keep a constant voltage level?
Example: with a power budget of 100 W at a working voltage of 1 V, the current is 100 A. To keep the voltage fluctuation within ±5% (50 mV), the PDN resistance must satisfy R_PDN < 0.5 mOhm.
[Figure: PDN hierarchy model]
V variation is fast-changing: "High-frequency components"
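The PDN resistance bound in the example follows from Ohm's law alone. A minimal sketch of that arithmetic, where the function name and parameter names are illustrative and not taken from the slides:

```python
def max_pdn_resistance(power_w, vdd_v, tolerance):
    """Largest PDN resistance (ohms) that keeps the IR drop within tolerance * vdd_v."""
    current_a = power_w / vdd_v          # I = P / V
    allowed_drop_v = tolerance * vdd_v   # e.g. 5% of 1 V = 50 mV
    return allowed_drop_v / current_a    # R = dV / I


# Numbers from the slide: 100 W, 1 V, +/-5% tolerance.
r_max = max_pdn_resistance(power_w=100.0, vdd_v=1.0, tolerance=0.05)
print(f"R_PDN must stay below {r_max * 1e3:.2f} mOhm")
```

This covers only the resistive (IR-drop) part; the inductive L(di/dt) noise is not modeled here.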
Aging degradation
• Aging mechanisms: NBTI (PMOS), PBTI (NMOS), TDDB
• 20% degradation in 10 years
[Figure: failure rate versus lifetime, with infant mortality, useful time, and aging phases]
Soft errors
• SEU (Single Event Upset): unintentional bit-flip in storage cells
• SET (Single Event Transient): transient voltage pulse propagating in combinational logic
[Figure: combinational logic between flip-flops (signals Si and So, clock clk), annotated with SEU and SET]
Outline
• TEA-TM: timing emergency-aware thread migration
  – PVT variations co-optimization
• SVFD: stability violation based fault detection
  – On-line fault detection via timing sensing
  – Delay faults, aging delay, soft errors
• MicroFix: margin reduction with timing sensing
  – Application to DVFS
• ReviveNet
  – Aging-delay tolerance
TEA-TM: Timing Emergency-Aware Thread Migration
Focus on the essential timing issue
The effects of P, V, and T variations do not necessarily add up; in some cases they cancel each other out. Hence, "complementary".
[Figure: process, voltage, and temperature variation, each contributing (+, -), combine into timing variation]
Some terms
• Timing emergency (TE)
• Emergency level (EL): the "density" of TEs
  – Definition: EL = number of TEs per 100 million cycles
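As an illustration of the EL definition, here is a minimal sketch that counts timing emergencies in a delay trace and normalizes the count to 100 million cycles. It assumes one delay sample per cycle and treats any sample above the threshold as a TE; the function name and trace format are hypothetical.

```python
def emergency_level(delays, threshold, window=100_000_000):
    """EL: number of timing emergencies per `window` cycles.

    Assumes `delays` holds one path-delay sample per clock cycle and that
    a timing emergency is any sample exceeding `threshold`.
    """
    if not delays:
        raise ValueError("delay trace is empty")
    emergencies = sum(1 for d in delays if d > threshold)
    return emergencies * window / len(delays)


# Hypothetical 1000-cycle trace with two samples above the threshold.
trace = [0.80] * 998 + [1.05, 1.10]
print(emergency_level(trace, threshold=1.0))
```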
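[Figure: delay versus time; a timing emergency occurs where the delay crosses the threshold. Variation ranges from mild to violent: process from fast corner to slow corner, voltage from small to large fluctuation, temperature from cool to hot]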
How do PVT variations complement each other?
• Observation in the time domain
Core1: T mild, V mild: large margin, low EL
Core2: T violent, V violent: little margin, high EL
What if we exchange the threads on Core1 and Core2?
[Figure: delay versus time against the threshold for four cases: T violent with V violent (emergency), T mild with V mild (excessive headroom), T mild with V violent, and T violent with V mild (mild + violent)]
Frequency domain analysis
Migrating threads = "grafting" the V component
[Figure: delay versus time (with threshold DTH) for Core1 and Core2 before and after thread migration (TM), alongside the corresponding spectrum deviation versus frequency plots showing the P, T, and V components]
Frequency domain analysis (cont.)
Relative frequency spectrum deviations on a 2GHz quad-core processor. P: 0-100Hz, T: 100Hz-1MHz, V: 1MHz-250MHz.
• Potential: Core3 and Core4 are mild
• Strategy: exchange threads between Core1 and Core4, and between Core2 and Core3
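One simple way to arrive at a pairing like the one above is to rank cores by EL and pair the most violent core with the mildest one. The sketch below is an assumed greedy heuristic for illustration only, not the FFT-like heuristic from the TEA-TM paper; the EL readings are made up.

```python
def pair_cores_for_migration(el_by_core):
    """Pair the highest-EL core with the lowest-EL core, then work inward."""
    ranked = sorted(el_by_core, key=el_by_core.get, reverse=True)
    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        pairs.append((ranked[lo], ranked[hi]))
        lo += 1
        hi -= 1
    return pairs


# Hypothetical EL readings: Core1 and Core2 violent, Core3 and Core4 mild.
els = {"Core1": 900, "Core2": 700, "Core3": 120, "Core4": 60}
print(pair_cores_for_migration(els))
```

With these readings the sketch proposes exchanging threads between Core1 and Core4 and between Core2 and Core3.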
[Figure: TEA-TM on a quad-core processor (Core1 to Core4): one EL Synthesizer per core, coordinated by a TM Agent]
TEA-TM Summary
• Analyzing the complementary effect in both the time and frequency domains
• Presenting a delay sensor-based scheme (TEA-TM) to exploit the complementary effect
  – Simple, cost-efficient FFT-like heuristic
Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors. Guihai Yan, Xiaoyao Liang, Yinhe Han, Xiaowei Li, in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France, pp. 485-496, Jun. 2010.
• Throughput: 30%
• Fairness: 80%
Stability Violation
Stable Period vs. Variable Period
[Figure: timing diagram of Si and So over one clock cycle, from (n-1)T to nT, divided at t1 and t2 into a Stable Period and a Variable Period]
Stability Violation: signal transitions occur in the Stable Period.
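The definition can be expressed as a simple check over transition timestamps. This is a behavioral sketch only, assuming the stable period is given as a fixed window of offsets within each clock cycle; the function and parameter names are hypothetical, and the actual detector in SVFD is a circuit, not software.

```python
def has_stability_violation(transition_times, period, stable_start, stable_end):
    """Return True if any signal transition falls inside the stable period.

    Times are integers (e.g. picoseconds). The stable period is assumed to
    span [stable_start, stable_end) within every cycle of length `period`.
    """
    for t in transition_times:
        phase = t % period  # position of the transition within its cycle
        if stable_start <= phase < stable_end:
            return True
    return False


# Hypothetical 1000 ps cycle whose last 300 ps form the stable period.
print(has_stability_violation([120, 1350], 1000, 700, 1000))  # transitions in the variable period
print(has_stability_violation([120, 1820], 1000, 700, 1000))  # one transition in the stable period
```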
[Figure: combinational logic between two flip-flops clocked by clk, with signals Si and So]
In what situations would SVs occur?
• Delay faults resulting from
  – Delay defects (introduced in manufacturing processes)
  – Aging (wearout) induced performance degradation
[Figure: setup time within cycle T, and a setup time violation due to a delay fault]
Thus, stability violations caused by delay faults do not differ much from "setup time violations".
• But can soft errors be modeled by SVs? YES!
How do Soft Errors cause SV?
• SEU: Si violates the stability requirement!
• SET: So violates the stability requirement!
[Figure: combinational logic between flip-flops (signals Si and So, clock clk), annotated with SEU and SET]
Notice: ONLY the SVs occurring in the "vulnerable window", within which the flip-flops are updated, can cause failures.
[Figure: timing diagrams of Si and So from (n-1)T to nT, divided at t1 and t2 into a Stable Period and a Variable Period]
Delay faults and soft errors can be modeled as stability violations.
The next problem: how to detect stability violations?
Low-cost stability checker
[Figure: SVFD circuit schematic: a transistor-level stability checker (clocked by CLKS) feeding a compressor and an output latch, attached to a flip-flop with XOR protection; the outputs flag "soft error/delay fault detected" and "aging delay detected"]
Some results
• Implementation: SVFD-protected FPU, using 65nm PTM, HSPICE simulation
• A Unified Online Fault Detection Scheme via Checking of Stability Violation. Guihai Yan, Yinhe Han, Xiaowei Li, IEEE/ACM Design, Automation and Test in Europe (DATE’09), pp. 496-501, 2009.
• SVFD: A Versatile Online Fault Detection Scheme via Checking of Stability Violation. Guihai Yan, Yinhe Han, Xiaowei Li, IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), 19(9), Sep. 2011.
Besides fault detection, what else can we do with SVFD?
• Dynamic margin reduction
  – MicroFix: an application to DVFS
• Aging tolerance
  – ReviveNet: fine-grained aging delay tolerance
Dynamic margin reduction
[Figure: target pipeline with timing sensors on the flip-flops of the (K-1)th and Kth stages; delay error prediction signals feed the voltage/frequency control. Timing diagrams compare normal and conservative voltage supply using CLK, FCLK, and BCLK, with intervals labeled T×TH. Flip-flop types shown: UAFF, FAFF, GFF, BAFF]
Timing sensors setup
Operational Principles
(a) Traditional DVFS: transitions between (V, F) operating points
• Reducing power: reduce the frequency, reduce the voltage
• Increasing performance: increase the voltage, increase the frequency
(b) MicroFix enhanced DVFS: the same transitions, followed by monitoring that restores a tight margin
• Voltage: V → V - v while no error is predicted; V → V + v when an error is predicted
• Frequency: F → F + f while no error is predicted; F → F - f when an error is predicted
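The voltage side of the monitoring loop can be sketched as follows. This is a behavioral illustration under stated assumptions, not the MicroFix controller itself: voltages are integer millivolts, `predicts_error` is a hypothetical stand-in for the timing sensors, and `max_steps` is an added safety bound.

```python
def settle_voltage(v_start_mv, step_mv, predicts_error, max_steps=100):
    """Lower V one step at a time while no error is predicted.

    When the sensors predict an error at the current voltage, raise V by
    one step to restore the margin and stop.
    """
    v = v_start_mv
    for _ in range(max_steps):
        if predicts_error(v):
            return v + step_mv  # error predicted: V -> V + v
        v -= step_mv            # no error predicted: V -> V - v
    return v


# Hypothetical sensor model: errors are predicted below 870 mV.
print(settle_voltage(1000, 10, lambda v_mv: v_mv < 870))
```

The frequency side is symmetric: raise F by f while no error is predicted, and lower it by f once one is.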
Fine-grained margin exploited
[Figure: paths P1 and P2 between flip-flops across the (K-1)th and Kth stages, comparing a critical path and a non-critical path against the cycle period]
Flip-flop types:
• Generous Flip-flop (GFF)
• Forward Adaptable Flip-flop (FAFF)
• Backward Adaptable Flip-flop (BAFF)
• Unadaptable Flip-flop (UAFF)
Localized timing imbalance
Case study results
• Applied to an FPU, 32nm PTM models
• TH = 0.2~0.3 is an optimal choice!
• Efficiency improvement: 35% EDP
MicroFix: Using Timing Interpolation and Delay Sensors for Power Reduction. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, ACM Transactions on Design Automation of Electronic Systems (TODAES), 16(2), 1-21, 2011.
MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED’09), pp. 395-400, 2009.
Localized Aging Tolerance
[Figure: path delay within cycle T, fresh and with aging delay, showing the guard band and the detection slack]
• A stability violation in the guard band is NOT a "timing violation": aging delay
• A stability violation in the detection slack IS a "timing violation": delay fault
The chance for aging adaptation: we have a chance to "act before it's too late"
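The distinction between the two windows can be sketched as a classifier over signal arrival times. The window layout here is an assumption made for illustration: the guard band is taken as the interval just before the clock edge at T, and the detection slack as the interval just after it. The names and labels are hypothetical.

```python
def classify_arrival(arrival, period, guard_band, detection_slack):
    """Classify a signal's last transition time, measured from the launching edge.

    Assumed layout (not confirmed by the slides):
      [0, T - guard_band)       normal operation
      [T - guard_band, T)       guard band: aging delay, no timing violation yet
      [T, T + detection_slack)  detection slack: timing violation (delay fault)
    """
    if arrival < period - guard_band:
        return "ok"
    if arrival < period:
        return "aging"
    if arrival < period + detection_slack:
        return "delay_fault"
    return "outside_window"


# Hypothetical 1000 ps cycle with a 100 ps guard band and 150 ps detection slack.
for t in (850, 950, 1050, 1200):
    print(t, classify_arrival(t, 1000, 100, 150))
```

An "aging" result is the early warning that leaves time to adapt before the path actually fails.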
Nudge for timing margin
• Dynamic time borrowing
• Path-grained, NOT stage-grained
[Figure: ReviveNet architecture: aging sensors on the flip-flops of the (K-1)th and Kth stages raise aging alarms to per-stage adaptation agents, which are linked to the prior and next agents]
Aging sensors setup
Coarse-grained detection
[Figure: sensors Sensor1 to Sensorn placed on the logic between the upstream and downstream flip-flops; within a sensor, several stability checkers are ORed into an output latch that raises the aging alarm. Includes the transistor-level stability checker schematic, clocked by CLK]
Trial-based adaptation
[Figure: adaptable flip-flop design: MUXes controlled by the agent select among CLK, FCLK, and BCLK (offsets labeled TH/2) to configure a flip-flop as UAFF, FAFF, GFF, or BAFF, between data-in and data-out]
Round-Robin Trial Adaptation (K)
  The Kth Agent receives an aging emergency
  FOR each adaptation state candidate
    Conduct a trial adaptation
    IF the emergency is eliminated
      THEN break (Adaptation succeeded!)
    ELSE
      Recover this trial adaptation
    IF all the adaptation states have been reached
      THEN break (Adaptation failed!)
  END FOR
Adaptation latency is non-critical
Trial until success
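The round-robin trial procedure can be written as a short runnable sketch. The callbacks `apply_state`, `revert_state`, and `emergency_cleared` are hypothetical stand-ins for the agent's hardware actions and the aging sensor, and the candidate names in the example are only placeholders for adaptation states.

```python
def round_robin_trial_adaptation(candidates, apply_state, revert_state, emergency_cleared):
    """Try each adaptation state in turn until the aging emergency disappears.

    Returns the state that succeeded, or None if every candidate failed.
    """
    for state in candidates:
        apply_state(state)        # conduct a trial adaptation
        if emergency_cleared():   # emergency eliminated: adaptation succeeded
            return state
        revert_state(state)       # recover this trial adaptation
    return None                   # all states tried: adaptation failed


# Hypothetical agent where only the second candidate removes the emergency.
active = []
result = round_robin_trial_adaptation(
    ["FAFF", "BAFF", "GFF"],
    apply_state=active.append,
    revert_state=active.remove,
    emergency_cleared=lambda: "BAFF" in active,
)
print(result, active)
```

Because adaptation latency is non-critical, a plain sequential search like this is acceptable; failed trials are undone before the next one starts.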
Fine-grained adaptation
Implementation
• False-alarm filter
• Sharing filters to reduce overhead
ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation. Guihai Yan, Yinhe Han, Xiaowei Li, IEEE Transactions on Computers (TC), 60(9), Sep. 2011.
Conclusion
• Dynamic timing variation is increasingly critical
• Online timing variation detection and tolerance is a promising approach to dynamic variation
• Application-specific timing variation
  – MicroFix for DVFS
  – ReviveNet for aging tolerance
• A holistic solution can be more cost-effective
  – TEA-TM: architectural optimization for circuit symptoms
Publications (reverse chronological order)
1. Guihai Yan, Yinhe Han, Xiaowei Li, ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation, IEEE Transactions on Computers (TC), Vol.60, No.9, pp.1219-1232, Sep. 2011.
2. Guihai Yan, Yinhe Han, Xiaowei Li, SVFD: A Versatile Online Fault Detection Scheme via Checking of Stability Violation, IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), Vol.19, No.9, pp.1627-1640, Sep. 2011.
3. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Using Timing Interpolation and Delay Sensors for Power Reduction, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol.16, No.2, pp.1-21, Mar. 2011.
4. Jianbo Dong, Lei Zhang, Yinhe Han, Guihai Yan, Xiaowei Li, Performance-asymmetry-aware Scheduling for Chip Multiprocessors with Static Core Coupling, Journal of Systems Architecture, Vol.56, pp.534-542, 2010.
5. Guihai Yan, Xiaoyao Liang, Yinhe Han, Xiaowei Li, Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors, In the Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France. pp.485-496, Jun. 2010.
6. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'09), pp.395-400, 2009.
7. Song Jin, Yinhe Han, Lei Zhang, Huawei Li, Xiaowei Li and Guihai Yan, M-IVC: Using Multiple Input Vectors to Minimize Aging-induced Delay, Proc. of IEEE Asian Test Symposium (ATS'09), 2009.
8. Guihai Yan, Yinhe Han, Xiaowei Li, A Unified Online Fault Detection Scheme via Checking of Stability Violation, IEEE/ACM Design, Automation and Test in Europe (DATE'09), pp.496-501, 2009.
9. Guihai Yan, Yinhe Han, Xiaowei Li, Hui Liu, BAT: Performance-Driven Crosstalk Mitigation Based on Bus-grouping Asynchronous Transmission, IEICE Transactions On Electronics, Vol.E91-C, No.10, pp.1690-1697, Oct, 2008.
Book Chapters
Fault Tolerance Designs for Digital Integrated Circuits: Tolerating defects/faults, parameter variations, and soft errors (in Chinese), Beijing, Science Press, 2011. ISBN 978-7-03-030576-3.