Optimization for Leakage Power Reduction using Multi-Threshold Voltages for High Performance
Microprocessors
Jeegar Shah, Marius Evers, Jeff Trull, Alper Halbutogullari
AMD, Sunnyvale, CA
March 19, 2007
ISPD 2007, Austin
ISPD 2007, March 19, 2007
Agenda
• Justification for threshold voltage selection for leakage power reduction and multi-corner cycle time adjustments
• Multi-Threshold voltage selection flow
• Heuristic VTH selection algorithm
• Dynamic Forward traversal VTH selection algorithm
• Results
• Conclusions
• Q & A
Motivation
• Reduce leakage power by increasing the threshold voltages of non-critical gates.
• Meet aggressive timing constraints
• Support the above constraints for multiple process corners
• Optimize extremely rigid designs at post-route step to handle process variability
• Support multi-VTH flows (scalable as more VTH libraries are made available)
• Generate design variants with power-performance tradeoff
METHODOLOGY & OPTIMIZATION FLOW
Methodology Flow
1. Start with unoptimized design
2. Read in constraints for multiple corners
3. Run Static Timing Analysis for each of these corners
4. Optimize first to meet aggressive timing constraints for each corner by down-swapping (selecting lower VTH cells for critical path gates)
5. Then optimize to reduce leakage power by up-swapping (selecting higher VTH cells for non-critical path gates)
6. Let multiple corners interact
7. Iterate 3-6
8. Static Timing Analysis check
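The loop in steps 3 to 7 can be sketched in Python on a toy model. This is a minimal illustration, not the authors' tool: the summed-delay "STA", the per-corner delay tables, the clock periods, and all function names are assumptions made for the example, and the up-swap pass of step 5 is left out for brevity.

```python
# Toy model: every name and number here is an illustrative assumption.
DOWN = {"HVT": "MVT", "MVT": "LVT"}  # one-level down-swap (faster, leakier)

def path_delay(path, vth, cell_delay):
    """Stand-in for STA: sum the per-gate delays along one path."""
    return sum(cell_delay[vth[g]] for g in path)

def propose_down_swaps(paths, vth, cell_delay, period):
    """One corner's step: pick one swappable gate on each failing path."""
    swaps = set()
    for path in paths:
        if path_delay(path, vth, cell_delay) > period:
            for g in path:
                if vth[g] in DOWN:
                    swaps.add(g)
                    break
    return swaps

def optimize_all_corners(paths, vth, corners, max_iters=10):
    vth = dict(vth)
    for _ in range(max_iters):
        # Steps 3-4: each corner analyses the same design snapshot.
        swaps = set()
        for cell_delay, period in corners.values():
            swaps |= propose_down_swaps(paths, vth, cell_delay, period)
        if not swaps:
            break  # every corner meets timing
        # Step 6: corners interact by applying the union of their swaps.
        for g in swaps:
            vth[g] = DOWN[vth[g]]
    return vth

paths = [["g1", "g2"], ["g2", "g3"]]
start = {"g1": "MVT", "g2": "MVT", "g3": "MVT"}
corners = {
    "slow": ({"LVT": 10, "MVT": 12, "HVT": 15}, 23),
    "fast": ({"LVT": 8, "MVT": 10, "HVT": 12}, 20),
}
final = optimize_all_corners(paths, start, corners)
print(final)  # {'g1': 'LVT', 'g2': 'LVT', 'g3': 'MVT'}
```

In this toy run only the slow corner fails, and its swaps are enough for both corners. Swapping g2 alone would also have fixed both paths, which is the kind of shared-path instance the heuristic algorithm is meant to find.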
Simultaneous optimizations across multiple corners
[Figure: Corner 1 and Corner 2 each run STA 1 followed by optimization iteration 1, exchanging swaps as they are computed. Both corners then run STA 2 on the resulting new design.]
Multi-Threshold VTH selection flow
1. Start with MVT cell design with few protected user defined cells
2. Determine which cells to change to LVT based on heuristic and smart swap algorithms
3. Done swapping MVT cells to LVT?
   N: loop through the following boxes and return to step 2
      • Run Static Timing Analysis
      • Run Optimization engine on design
      • Swap selected MVT cells to LVT cells
   Y: continue to step 4
4. Swap remaining MVT cells to HVT cells
5. Determine which HVT cells to change to MVT cells using the 2 algorithms
6. Done swapping HVT cells to MVT?
   N: loop through the following boxes and return to step 5
      • Run Static Timing Analysis
      • Run Optimization Engine on design
      • Swap selected HVT cells to MVT cells
   Y: Finish
Optimization flow – Multi corner + design variant
[Figure: the un-optimized design is optimized for Corner 1, Corner 2, Corner 3 and Corner 4, each with its own library. Inputs also include Mobile constraints and Desktop constraints. Outputs: designs optimized for corners 1 to 4, leading to an optimized Mobile design and an optimized Desktop design.]
Multi VTH scalable – 3 VTH example
Step 1: Meet timing constraints: down-swap
Fix critical paths by changing to LVT (MVT to LVT)
Step 2: Reduce leakage power: up-swap
[Figure: block diagram with the labels Un-optimized MVT Design, Un-optimized HVT Design, Extract HVT and Final Design, showing the mix of LVT, MVT and HVT cells after each step.]
Heuristic VTH Selection Algorithm
Heuristic Algorithm
• Sensitivity analysis based heuristic approach
• Picks instances that have the most impact on performance with reasonable leakage costs
• Instances picked affect multiple paths
• Circuit topology aware
• Works best for the first few optimization iterations
• Flexibility to choose an instance selection window size for fine-grain optimization
Heuristic algorithm – Pros and Cons
Pros
• Extremely fast
• Efficiently selects instances that affect multiple critical paths.
• Changing only these instances to low-VTH cells helps meet aggressive timing constraints at a very low leakage power cost.
• Parametrizable instance selection windows
• Topology aware algorithm
Cons
• Effective only in the first few iterations.
• Not well suited to cases where fine-grain optimization is required.
• No timing update or analysis is done within a single iteration to improve results.
• Each iteration picks a window of instances for VTH selection. Timing information is not updated after each swap within the same selection group.
list all launching flops
foreach flop f
  do depth first recursive forward traversal
    calculate time benefit if swapped from libraries
    determine total VTH layout width (cost)
    calculate benefit/cost score
    for each immediate o/p pin
      prorate each score based on o/p pin
        criticality with other relatively critical pins
      register capture flop
      [recursively get downstream scores]
    add downstream scores to current inst score
for each flop from list of capture flops
  do depth first recursive reverse traversal
    calculate time benefit if swapped from libraries
    determine total VTH layout width (cost)
    calculate benefit/cost score
    for each immediate i/p pin
      prorate each score based on i/p pin
        criticality with other relatively critical pins
      [recursively get upstream scores]
    add upstream scores to current inst score
list all instances in decreasing final scores
pick top x% of instances and swap them to lower VTH
update database and perform STA
repeat
Pseudocode for heuristic algorithm
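The scoring pass can be sketched in Python on a small netlist graph. This is a simplified illustration under stated assumptions: the `benefit`, `cost` and `fanout` dictionaries and the function name `rank_instances` are hypothetical, the slack-based proration step is omitted, and the cone scores are plain sums over the fanout and fanin cones.

```python
def rank_instances(benefit, cost, fanout, top_fraction):
    """Rank instances by own benefit/cost plus downstream and upstream cone scores."""
    # Derive the fanin map from the fanout map (reverse edges).
    fanin = {g: [] for g in benefit}
    for g, outs in fanout.items():
        for o in outs:
            fanin[o].append(g)

    own = {g: benefit[g] / cost[g] for g in benefit}

    def cone(g, edges, memo):
        # Depth-first recursive traversal with memoisation:
        # a neighbour contributes its own score plus its own cone score.
        if g not in memo:
            memo[g] = sum(own[n] + cone(n, edges, memo) for n in edges.get(g, []))
        return memo[g]

    down, up = {}, {}
    score = {g: own[g] + cone(g, fanout, down) + cone(g, fanin, up) for g in benefit}
    ranked = sorted(score.items(), key=lambda kv: -kv[1])
    count = max(1, int(len(ranked) * top_fraction))  # instance selection window
    return ranked, [g for g, _ in ranked[:count]]

# Toy netlist (assumed): a and b drive c, which drives d and e.
fanout = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
benefit = {"a": 4.0, "b": 2.0, "c": 6.0, "d": 2.0, "e": 2.0}  # delay gain, ps
cost = {g: 2.0 for g in benefit}                              # VTH layout width
ranked, picked = rank_instances(benefit, cost, fanout, top_fraction=0.2)
print(picked)  # ['c']
```

Instance c scores highest (8.0) because it sits on every path through the toy netlist, so its cone scores collect contributions from both sides.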
Definition of Instance score
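[Equation: Score_m, the benefit/cost score of instance m, written in terms of the cell delays delay_a and delay_b and the summed transistor widths Width_p^a(m) and Width_p^b(m) taken over the transistors p at each Vt. The arrangement of these terms did not survive extraction.]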
a : Original Cell
b : Potential Cell selection
m : Instance under consideration
p : Each transistor within cell ‘a’ or cell ‘b’
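$$\mathrm{Score}_{inst} = \mathrm{prorate}\big((\mathrm{benefit}/\mathrm{cost})_{inst}\big) + \mathrm{downScore}_{inst} + \mathrm{upScore}_{inst}$$
Score_inst: updated topological instance score
prorate((benefit/cost)_inst): individual score from sensitivity analysis
downScore_inst: scores of instances downstream
upScore_inst: scores of instances upstream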
Computing DownCone scores
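$$\mathrm{downScore}_m = \sum_{n=\mathrm{FO}(m)} \; \sum_{o=\mathrm{FI}(\mathrm{Gate}(n))} \; \sum_{p=Vt} \left( \mathrm{downScore}_n \times C \, \mathrm{Width}_p(m) \left( \frac{slk_n}{slk_o} \right) \right) \quad \text{s.t. } slk_o - slk_n < q$$
m: instance being considered for selection
n: fanout gate of m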
Computing UpCone scores
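$$\mathrm{upScore}_m = \sum_{n=\mathrm{Gate}(\mathrm{FI}(m))} \; \sum_{o=\mathrm{FI}(n)} \; \sum_{p=Vt} \left( \mathrm{upScore}_n \times C \, \mathrm{Width}_p(m) \left( \frac{slk_{\mathrm{FI}(m)}}{slk_o} \right) \right) \quad \text{s.t. } slk_{\mathrm{FI}(m)} - slk_n < q$$
m: instance being considered for selection
n: fanin gate of m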
Upscore proration
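With proration:
$$\mathrm{upScore}_m = \sum_{n=\mathrm{Gate}(\mathrm{FI}(m))} \; \sum_{o=\mathrm{FI}(n)} \; \sum_{p=Vt} \left( \mathrm{upScore}_n \times C \, \mathrm{Width}_p(m) \left( \frac{slk_{\mathrm{FI}(m)}}{slk_o} \right) \right) \quad \text{s.t. } slk_{\mathrm{FI}(m)} - slk_n < q$$
Without proration:
$$\mathrm{upScore}_m = \sum_{n=\mathrm{Gate}(\mathrm{FI}(m))} \; \sum_{p=Vt} \mathrm{upScore}_n \times \big( C \, \mathrm{Width}_p(m) \big)$$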
Downscore proration
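With proration:
$$\mathrm{downScore}_m = \sum_{n=\mathrm{FO}(m)} \; \sum_{o=\mathrm{FI}(\mathrm{Gate}(n))} \; \sum_{p=Vt} \left( \mathrm{downScore}_n \times C \, \mathrm{Width}_p(m) \left( \frac{slk_n}{slk_o} \right) \right) \quad \text{s.t. } slk_o - slk_n < q$$
Without proration:
$$\mathrm{downScore}_m = \sum_{n=\mathrm{FO}(m)} \; \sum_{p=Vt} \mathrm{downScore}_n \times \big( C \, \mathrm{Width}_p(m) \big)$$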
Advantage of proration
[Figure: normalized leakage power (y-axis, 0 to 1.2) versus timing slack considered for optimization in ps (x-axis, -10 to 0), comparing cones without proration and cones with proration. Leakage power is normalized with respect to the non-prorated cones.]
Dynamic Path Traversing VTH Swap Algorithm
Dynamic Path Traversing
• Regular Forward traversal algorithm
• Breadth-first search from flop to flop
• Works with a power and timing budget to do VTH selection
• Only forward traversal, though backward traversal could be implemented
• Stops optimizing when either power or timing budget is exhausted
• Budgets scaled for every path based on a linear formulation of combinational logic depth and effective fanout
• Works best for the last few iterations where fine-grain optimization is required
Pros and Cons
Pros
• Simple implementation
• Constantly works with a power and timing budget
• After every VTH selection, the budgets are updated
• Timing between swaps is more up-to-date as compared to the Heuristic algorithm
• Timing paths can be differentiated based on combinational depth and fanout
Cons
• Not as fast as the Heuristic algorithm
• Complementary to the Heuristic algorithm
• Works best for fine-grain selection. Not good at selecting the most ‘influential’ instances.
• Since it traverses forward and is budget limited, it ends up selecting instances closer to the launching flop
• No circuit topology information
Pseudocode for Dynamic algorithm
list all launching flops
decide worst slack to consider (e.g. wslk = -40 ps)
foreach launching flop f
  start with worst slack at o/p pin (path slack)
  start with an approximate swap cost budget
  do breadth first recursive forward traversal
    for each instance failing timing
      calculate time benefit if swapped from libraries
      determine leakage delta (cost)
      swap this instance to its lower VTH version
      new timing budget = slack of path - time benefit of inst
      new power budget = budget - delta power of this inst
      update design database for new VTH cells
      exit loop if timing met (wslk)
      exit loop if path is unconstrained
      exit if receiving flop reached
      exit loop if budget exhausted
update design database
perform STA and repeat with new wslk
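The budgeted traversal for a single launching flop can be sketched in Python as below. It is a rough illustration rather than the authors' implementation: the function name `dynamic_swap`, the dictionary-based netlist, and the numbers are all assumptions, slacks are not re-timed between swaps (only the two budgets are updated), and a receiving flop ends its branch rather than the whole traversal.

```python
from collections import deque

def dynamic_swap(launch, fanout, slack, benefit, leak_delta,
                 capture_flops, wslk, power_budget):
    """Breadth-first forward walk from one launching flop with two budgets."""
    deficit = wslk - slack[launch]  # ps still to recover on this path
    swapped = []
    queue = deque(fanout.get(launch, []))
    seen = set(queue)
    while queue and deficit > 0:          # stop once timing is met (wslk)
        inst = queue.popleft()
        if inst in capture_flops:
            continue                      # receiving flop reached on this branch
        if slack[inst] < wslk:            # instance failing timing
            if leak_delta[inst] > power_budget:
                break                     # power budget exhausted
            swapped.append(inst)          # swap to the lower-VTH version
            deficit -= benefit[inst]      # new timing budget
            power_budget -= leak_delta[inst]  # new power budget
        for nxt in fanout.get(inst, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return swapped, deficit, power_budget

# Toy path (assumed): flop f -> g1 -> g2 -> g3 -> capture flop c, 30 ps short.
fanout = {"f": ["g1"], "g1": ["g2"], "g2": ["g3"], "g3": ["c"]}
slack = {name: -30 for name in ("f", "g1", "g2", "g3", "c")}
benefit = {"g1": 12, "g2": 12, "g3": 12}   # ps gained per swap
leak_delta = {"g1": 5, "g2": 5, "g3": 5}   # leakage cost per swap

print(dynamic_swap("f", fanout, slack, benefit, leak_delta, {"c"}, 0, 20))
# (['g1', 'g2', 'g3'], -6, 5)
print(dynamic_swap("f", fanout, slack, benefit, leak_delta, {"c"}, 0, 8))
# (['g1'], 18, 3)
```

The second call shows the behaviour listed under Cons: with a tight power budget, only the instance nearest the launching flop is swapped.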
Flow iteration (scalable)
Swap from MVT to LVT (11 iterations)
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0
Swap from HVT to MVT with LVT swaps included (11 iterations)
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0
Swap from VHVT to HVT with LVT and MVT swaps included (11 iterations)
H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0
H-4 => Heuristic flow with 4% instance window
D-40 => Dynamic algorithm with worst slack of -40 ps
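The schedule notation can be decoded mechanically. The Python sketch below assumes only the legend above (H-n is a heuristic pass with an n% instance window, D-n is a dynamic pass with a worst slack of -n ps); the function name and the output field names are made up for the example.

```python
def parse_schedule(text):
    """Turn 'H-2, D-60, ...' into a list of (algorithm, parameters) steps."""
    steps = []
    for token in text.split(","):
        kind, value = token.strip().split("-", 1)
        if kind == "H":
            steps.append(("heuristic", {"window_pct": int(value)}))
        elif kind == "D":
            # D-40 means a worst slack of -40 ps, so the sign is restored here.
            steps.append(("dynamic", {"wslk_ps": -int(value)}))
        else:
            raise ValueError(f"unknown step type: {token!r}")
    return steps

schedule = "H-2, H-4, H-8, D-60, H-15, D-40, H-20, D-20, H-8, D-10, D-0"
steps = parse_schedule(schedule)
print(len(steps))  # 11
print(steps[1])    # ('heuristic', {'window_pct': 4})
print(steps[5])    # ('dynamic', {'wslk_ps': -40})
```

Read this way, the schedule interleaves coarse heuristic passes with dynamic passes whose slack target tightens from -60 ps to 0 ps.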
Slack Distribution after optimization
Experiments
Ex 1: Initial unoptimized design not meeting timing constraints
Ex 2: Quick implementation of backward followed by forward (Front-based technique [12] *)
Ex 3: 6 step iteration using only the Dynamic swapper algorithm
Ex 4: 6 step iteration using only the Heuristic swapper algorithm
Ex 5: 6 step iteration using alternating combinations of the Dynamic and Heuristic swapper algorithms
*[12] Srivastava, "Minimizing total power by simultaneous Vdd/VTH assignment," IEEE Transactions on Computer-Aided Design, 2004
Results
Experiment                 Ex 1    Ex 2    Ex 3    Ex 4    Ex 5
HVT (%)                     8.5    22.9    31.2    39.9    47.1
MVT (%)                    90.4    37.1    52.3    45.7    40.2
LVT (%)                     0.3    39.2    15.7    13.6    12
Total Leakage Power (W)   2.278   6.560   3.554   3.122   2.834
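To put the leakage figures on a common footing, the short Python calculation below expresses Ex 3 to Ex 5 as a percentage reduction relative to Ex 2. Using Ex 2 (the front-based technique) as the baseline is a choice made for this illustration. Ex 1 is left out because that design does not meet timing.

```python
# Total leakage power in watts, copied from the results table.
leakage_w = {"Ex 1": 2.278, "Ex 2": 6.560, "Ex 3": 3.554, "Ex 4": 3.122, "Ex 5": 2.834}

def reduction_pct(baseline, value):
    """Percentage reduction of value relative to baseline, to one decimal place."""
    return round(100.0 * (baseline - value) / baseline, 1)

baseline = leakage_w["Ex 2"]
savings = {ex: reduction_pct(baseline, leakage_w[ex]) for ex in ("Ex 3", "Ex 4", "Ex 5")}
print(savings)  # {'Ex 3': 45.8, 'Ex 4': 52.4, 'Ex 5': 56.8}
```

On these numbers the alternating Dynamic and Heuristic flow (Ex 5) gives the largest reduction, ahead of either algorithm used alone.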
Conclusions
• Described here is a post-route optimization flow for VTH selection that supports multiple corners
• This iterative flow uses 2 complementary instance selection techniques: the Heuristic algorithm and a budget-based forward traversal algorithm
• The flow is not limited to 2-3 VTH levels but is scalable to any number of levels
• The Heuristic algorithm is a unique, non-solver-based, topologically aware heuristic that optimizes over multiple paths simultaneously by including the effects of the upstream and downstream logic cones
• Can handle huge full chip microprocessor designs with more than 5 million stdcell gates
• No extensive probabilistic stdcell characterization is required.
• Process corners can simulate inter-chip variations that are not currently handled by statistical methods.
• Multiple process corner optimizations occur in parallel and optimization results are shared between different servers in real-time. This reduces the number of iterations and improves the quality of the optimization.
• Solver-based techniques failed to handle full-chip, industrial-size designs. These designs were handled by this flow.
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2006 Advanced Micro Devices, Inc. All rights reserved.
Thanks
Backup Slides
Solver based statistical tools
• Inaccurate sensitivity models based on delta VTH variation of transistor widths
• Difficulty in translating transistor model sensitivities of power based on variational parameters to huge libraries
• No modeling of inter-chip variation; only intra-chip variations are considered
• Virtual memory constraints for linear solvers on industrial size designs and modeling approximations involved in non-linear solvers
• No topological information taken into consideration in path based heuristic approaches
• Inappropriate consideration of logic fanouts
• In statistical methods, the optimization step is usually decoupled from the library characterization step
Downstream Score
Upstream Score