MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency
Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li
Key Laboratory of Computer System and Architecture, ICT (Institute of Computing Technology), CAS, Beijing, P.R. China
NVIDIA Corporation, USA
Outline
• What’s Path-grained Timing Adaptability (PTA)
• Potential of PTA for Efficiency Improvement
• How to Exploit PTA
• Case Study Results
• Conclusions
Impact of DVFS on Path Delay
[Figure: the (K-1)th and Kth pipeline stages, bounded by flip-flops (FF), each with cycle period T; labels: P1, P2, critical path, non-critical path.]
• Suppose scaling down the voltage makes P1 and P2 timing-critical. Then what?
• Traditionally: scale down the frequency of all pipeline stages
Question:
• Can these emerging critical paths be salvaged to allow further voltage scaling?
• Maybe yes, by fine-grained time stealing!
Timing Imbalance
[Figure: a chain of flip-flops (FF) separated by cycle period T; labels: PCP, NCP.]
• Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH
• Backward Adaptable Flip-flop (BAFF): Slack_up > TH, Slack_dn ≤ TH
• Forward Adaptable Flip-flop (FAFF): Slack_up ≤ TH, Slack_dn > TH
• Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH
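The four flip-flop categories can be expressed as a small classifier. The following is a minimal sketch in Python, assuming the category names and the slack conditions pair up in the order they are listed on the slide (GFF, BAFF, FAFF, UAFF). The function name and the sample slack values are hypothetical and only illustrate the rule.

```python
def classify_flip_flop(slack_up: float, slack_dn: float, th: float) -> str:
    """Classify a flip-flop from its upstream and downstream slack.

    All three arguments use the same unit, e.g. a fraction of the cycle period.
    """
    up_has_room = slack_up > th   # slack on the upstream path exceeds TH
    dn_has_room = slack_dn > th   # slack on the downstream path exceeds TH
    if up_has_room and dn_has_room:
        return "GFF"
    if up_has_room:
        return "BAFF"   # assumed pairing: Slack_up > TH, Slack_dn <= TH
    if dn_has_room:
        return "FAFF"   # assumed pairing: Slack_up <= TH, Slack_dn > TH
    return "UAFF"


# Hypothetical slacks, as fractions of the cycle period, with TH = 0.2.
samples = [(0.3, 0.3), (0.3, 0.1), (0.1, 0.3), (0.2, 0.2)]
labels = [classify_flip_flop(up, dn, 0.2) for up, dn in samples]
print(labels)
```

Note that the comparison is strict, so a slack exactly equal to TH counts as insufficient, matching the "≤ TH" conditions.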
Intrinsic Timing Imbalance
• Case study
  • FPU, adopted by OpenSPARC T1
  • Supports all IEEE 754 floating-point data types
  • Synthesized by Synopsys Design Compiler with UMC 0.18um technology
  • Cycle period: (1+10%) × T_critical
[Figure: number of flip-flops (log scale, 1 to 10000) in each category (GFF, FAFF, BAFF, UAFF) for TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 cycle.]
The GFFs, FAFFs, and BAFFs account for a considerable, even dominant, proportion of the flip-flops!
Attractive Potential
DVFS Exacerbating Imbalance
• Generally, the timing margin of longer paths diminishes much faster than that of shorter ones
• Assume that the path delay is the sum of the delays of the gates on the path
• TG: the gate delay
• Delta: the change in gate delay caused by scaling down the voltage
[Figure: a flip-flop (FF) between a path of n gates and a path of m gates; labels: Slack_dn, Slack_up.]
Define: ΔS = |Slack_dn - Slack_up|
• Before voltage scaling down: ΔS1 = (n - m) × TG
• After voltage scaling down: ΔS2 = (n - m) × (TG + Delta)
ΔS1 < ΔS2
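A quick numeric check of how the slack imbalance |Slack_dn - Slack_up| grows under voltage scaling. This is a sketch under the slide's assumption that path delay is the sum of gate delays. The gate counts and delay values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def slack_imbalance(n: int, m: int, gate_delay: float) -> float:
    """Slack difference between two paths of n and m gates.

    Both paths share the same cycle period, so the difference in slack
    equals the difference in path delay: |n - m| * gate_delay.
    """
    return abs(n - m) * gate_delay


# Hypothetical values: n = 20 gates, m = 12 gates, TG = 1.0, Delta = 0.5
# (arbitrary time units).
n, m, tg, delta = 20, 12, 1.0, 0.5
before = slack_imbalance(n, m, tg)           # imbalance before scaling
after = slack_imbalance(n, m, tg + delta)    # imbalance after scaling
print(before, after)
```

With these numbers the imbalance grows from 8.0 to 12.0 time units, because every extra gate on the longer path contributes the additional Delta.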
Example
If the imbalance is utilized…
• Check the lower bound of the cycle period T
• Traditionally:
T1 = n × (TG + Delta)
• From MicroFix’s perspective:
T2 = (m + n)/2 × (TG + Delta) ≤ T1 - TH
[Figure: (left) T versus δ, with lines labeled n (without MicroFix) and (m+n)/2 (with MicroFix); (right) F versus 1/V, with lines labeled 1/n (without MicroFix) and 2/(m+n) (with MicroFix), where F = 1/T and δ = δ(V).]
Note: UAFFs are excluded
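The two lower bounds on the cycle period can be compared numerically. The sketch below evaluates T1 = n × (TG + Delta) and T2 = (m + n)/2 × (TG + Delta) for hypothetical gate counts and delays; the function name and the numbers are illustrative assumptions, not values from the case study.

```python
def cycle_period_bounds(n: int, m: int, gate_delay: float, delta: float):
    """Return (T1, T2): the cycle-period lower bounds without and with MicroFix."""
    per_gate = gate_delay + delta        # gate delay after voltage scaling
    t1 = n * per_gate                    # bounded by the longer path alone
    t2 = (m + n) / 2 * per_gate          # the two paths share the time evenly
    return t1, t2


# Hypothetical values: n = 20, m = 12, TG = 1.0, Delta = 0.5.
t1, t2 = cycle_period_bounds(20, 12, 1.0, 0.5)
frequency_gain = t1 / t2                 # equals 2n / (m + n)
print(t1, t2, frequency_gain)
```

For these values T1 = 30.0 and T2 = 24.0, so the achievable frequency is 1.25 times higher at the same voltage. The gain depends only on the ratio 2n/(m + n), which is why a larger imbalance between n and m leaves more room to exploit.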
How to deal with UAFFs?
• Two-supply voltage scheme [Usami, JSSC’98] [Ghosh, TCAD’07]
• Critical Isolation: the critical paths resulting in UAFFs
• The supply voltage of the Critical Isolation is more conservative than that of the logic outside the Critical Isolation.
[Figure: the design is partitioned into the Critical Isolation, powered by the conservative voltage, and the exploitable scope of MicroFix, powered by the aggressive voltage.]
How to “Fix”?
• Two-supply voltage scheme
• Timing sensors [Yan, DATE’09][Agarwal, VTS’07]
• Multiple-phase clocks (generated by a DLL)
[Figure: MicroFix architecture. Timing sensors on the flip-flops of the (K-1)th and Kth stages of the target pipeline send delay error prediction signals to the voltage/frequency control. The design has a normal and a conservative voltage supply. The clock phases CLK, FCLK, and BCLK, offset by T×TH, are shown alongside the GFF, FAFF, BAFF, and UAFF flip-flops.]
Operational Principles
[Figure (a), Traditional DVFS: transitions between operating points (V, F). Reducing power: reduce the frequency and reduce the voltage. Increasing performance: increase the voltage and increase the frequency.]
[Figure (b), MicroFix enhanced DVFS: the same transitions, each followed by monitoring. Voltage: V → V - v while no error is predicted; V → V + v when an error is predicted (restore a tight margin). Frequency: F → F + f while no error is predicted; F → F - f when an error is predicted (restore a tight margin).]
Ensure that the restored margins ‘v’ and ‘f’ can guard safe voltage and frequency tuning.
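The monitoring loop of the MicroFix enhanced DVFS in (b) can be sketched in software for the voltage case. This is a simplified illustration, not the hardware controller: the function name, the step size, the voltage floor, and the toy sensor model are all hypothetical. Voltages are integers in millivolts to keep the arithmetic exact.

```python
from typing import Callable


def lower_voltage(v_start_mv: int, step_mv: int, v_floor_mv: int,
                  error_predicted: Callable[[int], bool]) -> int:
    """Step the voltage down until the sensors predict a timing error.

    Mirrors the slide: V -> V - v while no error is predicted, then
    V -> V + v once an error is predicted, restoring a tight margin.
    """
    v = v_start_mv
    while v - step_mv >= v_floor_mv:
        v -= step_mv                  # V -> V - v
        if error_predicted(v):
            v += step_mv              # V -> V + v: restore the margin
            break
    return v


def toy_sensor(v_mv: int) -> bool:
    """Hypothetical sensor model: predicts an error below 800 mV."""
    return v_mv < 800


settled_mv = lower_voltage(1000, 25, 600, toy_sensor)
print(settled_mv)
```

The frequency case is symmetric: raise F by f while no error is predicted and drop it by f once one is. The step size plays the role of the restored margin, so it has to be large enough that one step back is always safe.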
Experimental Setup
• Gate-level
  • Study the adaptability and overhead with a synthesized FPU
    – Timing info. -> PrimeTime
• Transistor-level
  • Investigate the power-performance tradeoffs with Hspice simulations
    – 32nm PTM models dedicated to HP and LP applications, respectively
Exploring Design Tradeoffs
• ‘TH’ plays a critical role in determining the ultimate efficiency
[Figure: two partitions of the design into the Critical Isolation and the exploitable scope of MicroFix.]
Smaller ‘TH’, smaller CI, but less aggressive voltage reduction!
[Figure: number of flip-flops (log scale, 1 to 10000) in each category (GFF, FAFF, BAFF, UAFF) for TH = 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4 cycle.]
Larger ‘TH’, larger CI, but more aggressive voltage reduction!
Which ‘TH’ is optimal?
Exploring Design Tradeoffs /2
• Percentage of Cells in Critical Isolation
TH      Percentage of cells
0.1     0.00%
0.15    0.00%
0.2     0.16%
0.25    2.04%
0.3     10.82%
0.35    22.70%
0.4     33.52%
Exploring Design Tradeoffs /3
• Sensor Area Overhead
  • The area of a sensor is about 8x that of a pipeline flip-flop (based on the number of transistors) [Yan, DATE09]
• The paths in the critical isolation and those with overly large slack (i.e. slack > T × TH + tmargin) do not need to be monitored by sensors
TH      Sensor area overhead
0.1     0.00%
0.15    2.10%
0.2     3.75%
0.25    9.20%
0.3     12.34%
0.35    10.95%
0.4     9.97%
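The rule for which paths need a timing sensor (paths in the critical isolation and paths with slack > T × TH + tmargin are exempt) can be written as a one-line predicate. A minimal sketch follows; the function name, argument names, and sample values are hypothetical.

```python
def needs_sensor(slack: float, cycle: float, th: float, t_margin: float,
                 in_critical_isolation: bool) -> bool:
    """Decide whether a path has to be monitored by a timing sensor.

    slack, cycle and t_margin share one time unit; th is a fraction of the cycle.
    """
    if in_critical_isolation:
        return False                      # powered by the conservative voltage
    return slack <= cycle * th + t_margin  # larger slack needs no monitoring


# Hypothetical values: T = 1.0, TH = 0.2, tmargin = 0.05.
checks = [
    needs_sensor(0.50, 1.0, 0.2, 0.05, False),  # ample slack
    needs_sensor(0.10, 1.0, 0.2, 0.05, False),  # tight slack
    needs_sensor(0.10, 1.0, 0.2, 0.05, True),   # inside the critical isolation
]
print(checks)
```

Only the second path is monitored here. This rule is consistent with the table above: beyond TH = 0.3 more paths fall into the critical isolation, so the number of sensors, and with it the area overhead, can drop again.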
Exploring Design Tradeoffs /4
• Sensor Power Overhead
  • In the most pessimistic case (TH=0.3, all sensors simultaneously flag timing errors): 14%
• However, such a worst-case power overhead is unlikely to occur, for three reasons:
1) Sensors do not need to be always on
2) It is almost impossible for all sensors to flag impending timing errors simultaneously
3) TH=0.3 is actually not an optimal configuration
Therefore, the pessimistic power overhead won’t offset much of MicroFix’s efficiency gain!
Hspice Simulations
• Objective: investigate the detailed delay-power relation of the target pipeline
• Ideally, the transistor-level model of the target pipeline would be simulated directly with Hspice; however, that is very labor-intensive and time-consuming.
• So we took an indirect approach to the Hspice simulations:
Ptotal(V,F) = Pcomb(V,F) + Pff(V,F)
1/F = T = tc + tsetup + tc-to-q
Combinational Component
• ISCAS85 (c432, c499, c880, c1355, c1908, c2670)
• 32nm PTM models (HP and LP versions)
[Figure: (a) normalized power versus voltage and (b) normalized delay versus voltage, from 1 V down to 0.6 V, each for the High Perf. and Low Power models.]
The normalized V-D and V-P relations hold well across all of the simulated benchmarks!
Sequential Component
• V-D
• V-P
[Figure: (a) normalized delay, t_setup + t_c-to-q, versus voltage (1 V down to 0.6 V) for the Low Power and High Perf. models; (b) normalized power versus voltage for α = 1, 0.5, and 0.25, for the Low Power and High Perf. models.]
Efficiency Comparison
TH = 0.2 is an optimal choice!
Efficiency Improvement: 35% EDP, 28% PDP
Conclusion
• MicroFix can improve DVFS efficiency by exploiting the path-grained adaptability
• The timing imbalance threshold, TH, implies a critical design tradeoff
• Efficiency improves by up to 35% in EDP for HP applications and by up to 28% in PDP for LP applications, at the expense of only 7% area overhead
Thanks!
Q&A