Directions in Low-Power CAD

Dennis Sylvester

University of Michigan

http://vlsida.eecs.umich.edu

With acknowledgements to: Prof. David Blaauw, Dr. Sarvesh Kulkarni, Saumil Shah, Kavi Chopra


Topics
A new dual-Vth assignment formulation
Dual-Vdd power distribution
Approaches to parametric yield optimization: statistical leakage + delay

Motivation
We require high-performance yet low-power circuits
Leakage power contributes significantly to total power
An all-high-Vth implementation is too slow
An all-low-Vth implementation is too leaky
Dual-Vth processes are popular

Problem Definition
Minimize: total circuit power
Subject to: circuit delay constraint, sizing constraints
Optimization variables: gate sizes, gate threshold voltages
[Figure: power components, switching and subthreshold leakage. Source: S. Narendra et al. [ICCAD ’03]]

Gate Sizing + Vth Assignment Problem: Prior Work
Traditionally a discrete problem
Previous approaches:
  Separate sizing and Vth assignment
  Mixed-integer nonlinear programming
  Sensitivity-based methods (DUET, etc.)
  Continuous formulation [Chen, ASP-DAC ’05]
    Very reliant on discretization heuristic

Proposed Approach – Self-snapping formulation
Continuous formulation: a large variety of algorithms and powerful non-linear optimizers can be used
Solution has almost all gates assigned to one of the two available threshold voltages
Small fraction of gates with intermediate Vth values, which can be handled heuristically
Discretization algorithm has negligible power impact and can be very simple

Proposed Approach – Mixed-Vth Gates
Consider each gate to be a parallel combination of a high-Vth (HVt) gate and a low-Vth (LVt) gate

RC delay model:
$$D = R_{eff} C_l$$
$$R_{eff} = \frac{(R_l / W_l)(R_h / W_h)}{R_l / W_l + R_h / W_h} = \frac{R_l R_h}{R_l W_h + R_h W_l}$$
$$D = \frac{C_l R_l R_h}{R_l W_h + R_h W_l}$$
$$C_l = C_{Load} + K_{SL} (W_l + W_h)$$

Mixed gate linear power model:
$$P = P_{LVt} + P_{HVt} = P_l W_l + P_h W_h$$
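The mixed-gate model above can be evaluated numerically. The following is a minimal sketch, assuming unit-free illustrative values; the function name `mixed_gate` and all numbers are hypothetical and do not come from the slides.

```python
def mixed_gate(w_l, w_h, r_l, r_h, p_l, p_h, c_load, k_sl):
    """Return (delay, power) of a gate modeled as LVt and HVt parts in parallel.

    w_l, w_h : widths of the low-Vth and high-Vth parts
    r_l, r_h : unit-width resistances of the two parts
    p_l, p_h : power per unit width of the two parts
    c_load   : external load capacitance
    k_sl     : self-loading capacitance per unit width
    """
    if w_l < 0 or w_h < 0 or w_l + w_h == 0:
        raise ValueError("widths must be non-negative and not both zero")
    # Conductances add in parallel: 1/R_eff = W_l/R_l + W_h/R_h
    r_eff = 1.0 / (w_l / r_l + w_h / r_h)
    # Load seen by the gate: external load plus self-loading
    c_l = c_load + k_sl * (w_l + w_h)
    delay = r_eff * c_l
    power = p_l * w_l + p_h * w_h
    return delay, power


# Illustrative (assumed) parameters: the LVt part is faster but leakier.
params = dict(r_l=1.0, r_h=1.5, p_l=10.0, p_h=1.0, c_load=4.0, k_sl=1.0)
all_low = mixed_gate(2.0, 0.0, **params)
all_high = mixed_gate(0.0, 2.0, **params)
mixed = mixed_gate(1.0, 1.0, **params)
```

With these assumed values the mixed gate sits between the all-LVt and all-HVt gates in both delay and power, which is the trade-off the continuous formulation explores.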

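Complete Dual-Vth Problem Formulation
Similar to the single-Vth gate sizing problem, with simple gate delays replaced by high-Vth/low-Vth parallel combinations

Minimize
$$\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$$
Subject to:
$$a_j \le A_0$$
$$a_j + D_i \le a_i, \quad i \in (\{1, \dots, n\} \setminus \{\text{inputs}\}), \quad j \in \{\text{input}(i)\}$$
$$D_i \le a_i, \quad i \in \{\text{inputs}\}$$
$$0 \le W_{l,i}, \quad i = 1, \dots, n$$
$$0 \le W_{h,i}, \quad i = 1, \dots, n$$
$$L_i \le W_{l,i} + W_{h,i} \le U_i, \quad i = 1, \dots, n$$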
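The arrival-time constraints in this formulation can be checked for a given set of gate delays. The sketch below computes the tightest arrival times that satisfy them on a small, made-up timing graph; the graph, the delay values, and the function names are assumptions for illustration only.

```python
def arrival_times(fanin, delay):
    """Tightest arrival times satisfying a_j + D_i <= a_i and D_i <= a_i.

    fanin : dict gate -> list of fanin gates ([] if fed only by primary inputs)
    delay : dict gate -> gate delay D_i
    """
    arrival = {}

    def visit(gate):
        if gate not in arrival:
            latest_input = max((visit(j) for j in fanin[gate]), default=0.0)
            arrival[gate] = latest_input + delay[gate]
        return arrival[gate]

    for gate in fanin:
        visit(gate)
    return arrival


def meets_timing(fanin, delay, outputs, a0):
    """True if every output arrival time is within the delay target a0."""
    arrival = arrival_times(fanin, delay)
    return all(arrival[j] <= a0 for j in outputs)


# Hypothetical 4-gate circuit: g1 and g2 feed g3, which feeds g4.
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3"]}
delay = {"g1": 1.0, "g2": 2.0, "g3": 1.5, "g4": 1.0}
arrival = arrival_times(fanin, delay)
```

In the optimization itself the arrival times are free variables and the delays depend on the widths; this sketch only shows what the constraints require once the delays are known.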

Proof of Discretized Solution
Conceptually separate the optimization process into two distinct phases:
  D-Phase: fix the delays of all gates
  W-Phase: find the minimum-power sizing solution that satisfies the chosen D vector
Hypothetical separation for the proof; not implemented in the actual optimization procedure

W-Phase
Proof of a discrete optimal solution under an arbitrary D-vector is sufficient
W-Phase formulation:

Minimize
$$\sum_{i \in G} \left( P_{l,i} W_{l,i} + P_{h,i} W_{h,i} \right)$$
Subject to:
$$\frac{R_{l,i} R_{h,i} \left( \sum_{j \in fanout(i)} C_{inp,j} (W_{l,j} + W_{h,j}) + K_{SL} (W_{l,i} + W_{h,i}) \right)}{R_{l,i} W_{h,i} + R_{h,i} W_{l,i}} = D_i, \quad i = 1, \dots, n$$
$$W_{l,i} \ge 0, \quad i = 1, \dots, n$$
$$W_{h,i} \ge 0, \quad i = 1, \dots, n$$

W-Phase
Linear programming problem
n basic variables, n non-basic variables
Therefore, only n non-zero variables
Every gate snapped to either high-Vth or low-Vth
Addition of upper and lower bounds on total size leads to some non-snapped gates
Number extremely small; a simple heuristic achieves good results
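The snapping argument can be seen on a single gate with a fixed external load. With the delay fixed, the delay equation is linear in the two widths, so the feasible set is a line segment and the linear power objective is minimized at one of its two ends (all-LVt or all-HVt). The sketch below assumes illustrative parameter values and hypothetical function names; it is not the optimizer used in the work.

```python
def gate_delay(w_l, w_h, r_l, r_h, c_load, k_sl):
    """Delay of the mixed gate: R_eff * (C_load + K_SL * (W_l + W_h))."""
    return (c_load + k_sl * (w_l + w_h)) / (w_l / r_l + w_h / r_h)


def segment_endpoints(d, r_l, r_h, c_load, k_sl):
    """Widths of the all-LVt and all-HVt gates that hit delay d exactly."""
    if d <= k_sl * max(r_l, r_h):
        raise ValueError("delay target unreachable for one of the Vth options")
    w_low = r_l * c_load / (d - k_sl * r_l)    # solution with W_h = 0
    w_high = r_h * c_load / (d - k_sl * r_h)   # solution with W_l = 0
    return w_low, w_high


def w_phase_single_gate(d, r_l, r_h, p_l, p_h, c_load, k_sl):
    """Return (w_l, w_h, power) of the cheaper vertex of the feasible segment."""
    w_low, w_high = segment_endpoints(d, r_l, r_h, c_load, k_sl)
    candidates = [(w_low, 0.0, p_l * w_low), (0.0, w_high, p_h * w_high)]
    return min(candidates, key=lambda c: c[2])


def blend(d, r_l, r_h, p_l, p_h, c_load, k_sl, t):
    """Feasible point a fraction t of the way from all-LVt to all-HVt."""
    w_low, w_high = segment_endpoints(d, r_l, r_h, c_load, k_sl)
    w_l, w_h = (1.0 - t) * w_low, t * w_high
    return w_l, w_h, p_l * w_l + p_h * w_h


# Assumed values: relaxed delay favors HVt, a tight delay favors LVt.
relaxed = w_phase_single_gate(3.0, 1.0, 1.5, 10.0, 1.0, 4.0, 1.0)
tight = w_phase_single_gate(1.55, 1.0, 1.5, 10.0, 1.0, 4.0, 1.0)
halfway = blend(3.0, 1.0, 1.5, 10.0, 1.0, 4.0, 1.0, 0.5)
```

Any blended point meets the same delay but its power lies strictly between the two end points, so a mixed gate is never the unique optimum in this one-gate case.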

Practical Constraint – Fixed-Width Input Drivers
Sequential elements drive the combinational circuit
Delay of these elements is affected by primary input widths
Modeled as fixed-width drivers

Extension of Discretization Analysis
m+n constraints in the optimization problem
n+m basic variables, n-m non-basic variables
Therefore, n+m positive variables
Total number of non-snapped gates is bounded by the number of inputs
Once again, small in number; can be handled heuristically
In practice, the number of non-snapped gates is found to be much less than the number of inputs

Discretization Heuristics
Iterative snapping
  Round gates to the closer Vth and re-optimize until a fully snapped solution is achieved
Single-pass Vth assignment
  Fix all gates to the closer Vth and re-optimize only for gate sizes
Second heuristic is faster with negligible power impact
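A minimal sketch of the single-pass assignment step follows. It assumes that "closer Vth" can be read as the flavor holding the larger share of a gate's width, and it keeps the total width as the starting size; both choices are assumptions made here, and the subsequent size re-optimization is not shown.

```python
def non_snapped(widths, tol=1e-9):
    """Gates that still have both an LVt and an HVt part."""
    return [g for g, (w_l, w_h) in widths.items() if w_l > tol and w_h > tol]


def snap_single_pass(widths):
    """Fix every gate to one Vth.

    widths : dict gate -> (w_l, w_h) from the continuous solution
    Returns dict gate -> (vth, total_width); the widths would then be
    re-optimized by a sizing-only run, which is outside this sketch.
    """
    assignment = {}
    for gate, (w_l, w_h) in widths.items():
        vth = "LVt" if w_l >= w_h else "HVt"   # assumed reading of "closer Vth"
        assignment[gate] = (vth, w_l + w_h)
    return assignment


# Hypothetical continuous solution: only g3 is not already snapped.
solution = {"g1": (2.0, 0.0), "g2": (0.0, 3.0), "g3": (0.4, 1.6)}
leftover = non_snapped(solution)
snapped = snap_single_pass(solution)
```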

Results
Snapping properties of some circuits
Number of non-snapped gates is very small
Dominated by gates at upper and lower size bounds
Approach is easily extendable to multi-Vth AND multi-Lgate
[Figure: % of total non-snapped gates, and % of total non-snapped gates due to input drivers, per circuit (c2670, c3540, c5315, c6288, c7552, i8, i9, i10)]

Results
Power and runtime comparisons between the proposed approach and a sensitivity-based algorithm (SBA) at 2% timing backoff (results shown for larger circuits only)
Average: 31% leakage reduction vs. previous approaches

Ckt     SBA             Continuous formulation   % Improvement    Runtime (s)
        Static  Dyn.    Static  Dyn.    Total    Static  Total    SBA    Cont.
C3540   0.26    0.74    0.16    0.78    0.94     38.14   6.46     28     51
C5315   0.22    0.78    0.15    0.80    0.95     30.53   5.11     52     133
C6288   0.35    0.65    0.26    0.65    0.91     24.69   9.04     136    443
C7552   0.31    0.69    0.24    0.68    0.91     23.93   8.87     94     171
i8      0.24    0.76    0.19    0.75    0.94     21.57   5.87     24     35
i9      0.20    0.80    0.16    0.77    0.94     17.65   6.47     9      21
i10     0.31    0.69    0.23    0.69    0.92     24.8    7.69     287    373

Topics
A new dual-Vth assignment formulation
Dual-Vdd power distribution
Approaches to parametric yield optimization: statistical leakage + delay

Multiple supply design
Relies on applying a lower supply (VDDL) to gates along non-critical paths, thus reducing power while meeting timing
A flexible fine-grained VDD assignment scheme promises the best power reduction
  Gate-level Extended Clustered Voltage Scaling
However, physical design and power delivery are complicated
[Figure: need for level conversion; an input (IN) with VDDL swing driving a VDDH gate causes DC current. Logic shown between flip-flops (FF)]

Implications of using multiple supplies
Coupled issues:
  Algorithms: VDD assignment
  Circuits: level shifting
  Physical design: VDD granularity
  Power delivery: distribution, generation
[Figure: critical and non-critical gates under CVS and ECVS assignment; fine-grained vs. islanding granularity; level shifter with IN and OUT]

Power delivery for dual-VDD circuits
Power grid integrity is vital for circuit performance
Dual-VDD circuits require two supply voltages for operation
Fine-grained dual-VDD can place VDDL/VDDH gates arbitrarily on the die
Implications at the board, package, and die level
  Fixed resources need to be split between VDDL and VDDH
However, the load on each supply is lower than on the original single supply:
Power supply current demanded by a dual-VDD circuit is significantly lower than that of the corresponding single-VDD circuit, allowing robust power delivery within available resources (decap, C4, wiring)

Reduced current load on VDDL/VDDH
Gate level comparison
  Avg. 54% (33%) for VDDL = 0.8V (0.6V)

Gate      Single-VDD            Dual-VDD: VDDL=0.8V   Dual-VDD: VDDL=0.6V
          Low-VTH   High-VTH    Low-VTH   High-VTH    Low-VTH   High-VTH
INVX10    1.00      0.90        0.57      0.49        0.36      0.27
NAND2X2   1.00      0.85        0.54      0.45        0.34      0.23
NAND3X6   1.00      0.88        0.55      0.47        0.35      0.24
NOR2X1    1.00      0.86        0.52      0.39        0.30      0.19
NOR3X4    1.00      0.85        0.50      0.37        0.29      0.18
AVERAGE   1.00      0.88        0.54      0.44        0.33      0.23

Circuit level comparison (ECVS)
  Avg. 49% (51%) and 28% (14%) for VDDH and VDDL for 0.8V (0.6V)

Circuit     Single VDD   Dual VDD: VDDL=0.8V   Dual VDD: VDDL=0.6V
            VDD          VDDH      VDDL        VDDH      VDDL
c880        9.7          5.6       2.2         5.9       1.3
c2670       23.6         11.9      6.5         10.1      3.0
c5315       36.7         20.9      7.2         20.9      3.6
c7552       47.9         13.9      19.4        20.4      8.5
AVERAGE %   100.0        48.5      27.7        50.7      13.5

Package level results
Two VRMs on board to supply VDDL and VDDH
Ground path can be shared by VDDL and VDDH
Decoupling capacitance divided in the ratio of current loads
Similar power supply noise with same resources as single-VDD case (decoupling capacitance, C4s)
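Dividing a fixed decoupling-capacitance budget in the ratio of current loads is simple arithmetic. The sketch below uses the c880 currents from the circuit-level table (VDDL = 0.8V case); the budget of 100 units and the function names are assumptions for illustration.

```python
def current_ratios(i_vdd, i_vddh, i_vddl):
    """Dual-VDD supply currents as fractions of the single-VDD current."""
    return i_vddh / i_vdd, i_vddl / i_vdd


def split_decap(total_decap, i_vddh, i_vddl):
    """Split a decap budget between VDDH and VDDL in proportion to current."""
    total_current = i_vddh + i_vddl
    decap_h = total_decap * i_vddh / total_current
    decap_l = total_decap * i_vddl / total_current
    return decap_h, decap_l


# c880, VDDL = 0.8V: single-VDD 9.7, VDDH 5.6, VDDL 2.2 (table units).
frac_h, frac_l = current_ratios(9.7, 5.6, 2.2)
decap_h, decap_l = split_decap(100.0, 5.6, 2.2)   # 100 = assumed budget
```

For c880 this gives roughly 58% and 23% of the single-VDD current on VDDH and VDDL, and about a 72/28 split of the decap budget.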

[Figure: board, socket, package, and die power-delivery model with separate VDDH and VDDL paths (Lmb/Rmb, Lskt/Rskt, Lpkg/Rpkg; bulk, high-frequency, and package capacitor branches; Rdie/Cdie) feeding the VDDH load I(VDDH) and the VDDL load I(VDDL)]
Intel, “Intel Pentium 4 processor in the 432 pin/Intel 850 Chipset Platform,” 2002.

Dual-VDD physical design alternatives
Segregated placement constrains the placer, leading to higher core area and wirelength
[Figure: placement styles: single-VDD; dual-VDD segregated; dual-VDD fine-grained, where every row carries VDDH, VDDL, and GND rails (VDDH + VDDL rows)]
C. Yeh, et al., “Layout techniques supporting the use of dual supply voltages for cell-based designs,” Proc. DAC, 1999.
M. Igarashi, et al., “A low-power design method using multiple supply voltages,” Proc. ISLPED, 1997.

Dual-VDD power grid alternatives
Routing the power supply rails
Dual-VDD Dual-GND requires two separate grounds off-chip and complicates timing analysis and the design of the board itself
Multi-rail standard cells can be used to realize the dual-VDD grids, which allows the placer to operate with no constraints
[Figure: grid alternatives: single-VDD (VDD, GND); dual-VDD shared-GND (VDDH, VDDL, shared GND); dual-VDD dual-GND (VDDH/GNDH, VDDL/GNDL)]
[Figure: dual-VDD standard cell topologies: 3-rail cell (VDDH, VDDL, shared GND) and 4-rail cell (VDDH/GNDH, VDDL/GNDL)]

Dual-VDD on-chip power grid design
Guidelines while designing the dual-VDD grid:
  Scale wires with respect to the single-VDD grid, considering how the current demand has scaled
  VDDL gates are more sensitive to grid noise; this is important since ground is shared
    120mV noise is 10% for a 1.2V gate, but 20% for a 0.6V gate
  Placement of VDDL and VDDH gates: assign more wiring resources to the VDDL grid in areas where there is more demand for VDDL current
  Consider effects that arise from the board and package level, such as shared C4s
    Fewer C4s lead to higher effective package R, L

Proposed technique: D-Place
Let α = I(VDDH)/I(VDD) and β = I(VDDL)/I(VDD)
Scale wires as follows
[Equations: VDDH, VDDL, and GND wire widths in terms of the single-VDD wire width W]
Partition the chip floorplan
Obtain effective α and β as follows
[Equation: effective α and β as a function of the local, regional, and global values and areas]

D-Place flow
Inputs: placement database (Cadence); original single-VDD design with single-VDD Lib file; dual-VDD design obtained with dual-VDD Lib file
Obtain current consumption of single/dual-VDD designs (SPICE)
Break down die into “local” and “regional” areas
Calculate local, regional, global, and effective α and β for each wire segment
Size each wire segment in each local area using effective α, β and simulate grid
Measure voltage droop/bounce
Measure wire congestion

Peak voltage drop comparisons
D-Place grids are better than single-VDD grids in AVG cases
Inferior by < 2.6% (≈15mV) in some MAX cases
0.6V VDDL is as robust as 0.8V
0.6V also provides higher power savings
Proposed approach is better by 2-7% (AVG) and 7-12% (MAX) compared to prior approaches

Two tables, labeled VDDL = 0.6V and VDDL = 0.8V in the source:

Circuit  Case  Single VDD  DVDG   D-Vanilla  D-Place
c880     MAX   16.9%       30.9%  16.4%      18.6%
c880     AVG   9.5%        14.7%  9.6%       9.5%
c2670    MAX   25.6%       35.5%  32.2%      25.5%
c2670    AVG   15.9%       19.8%  15.2%      14.5%
c5315    MAX   29.6%       38.2%  37.4%      32.0%
c5315    AVG   21.6%       23.4%  20.2%      19.8%
c7552    MAX   26.8%       34.2%  34.5%      29.4%
c7552    AVG   22.2%       21.0%  21.1%      18.7%

Circuit  Case  Single VDD  DVDG   D-Vanilla  D-Place
c880     MAX   16.9%       30.3%  16.3%      19.5%
c880     AVG   9.5%        15.9%  9.7%       9.8%
c2670    MAX   25.6%       36.1%  27.6%      27.0%
c2670    AVG   15.9%       22.1%  15.8%      15.3%
c5315    MAX   29.6%       38.1%  33.0%      31.8%
c5315    AVG   21.6%       25.4%  20.1%      20.3%
c7552    MAX   26.8%       31.4%  31.6%      28.7%
c7552    AVG   22.2%       24.9%  22.3%      20.1%

Voltage variation across die
[Figure: voltage drop contours]
Wiring congestion is similar for dual-Vdd vs. single-Vdd grids
Lower current demands can lead to smaller amounts of decoupling cap and lower leakage (or use the same decap for better performance)
Dual-VDD grid is no less robust than single-VDD grid

Topics
A new dual-Vth assignment formulation
Dual-Vdd power distribution
Approaches to parametric yield optimization: statistical leakage + delay

Introduction
Sources of variation: optical proximity effects, chemical mechanical polishing
[Figure: process parameter-space (Leff, Vth) mapped to chip performance-space (power vs. delay, with limits Pconst and Tconst). Good timing with high leakage gives power yield loss; low leakage with poor timing gives timing yield loss]
This work: optimize the timing and power yield using gate sizing

Problem Description
Nonlinear continuous optimization
Objective: maximize timing and power yield
  Yield: a utility function defined w.r.t. the JPDF of leakage and timing
Decision variables: gate size
Efficient implementation requires:
  Computing yield as a function of the decision variables (gate size)
  Fast and accurate gradient computation

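Power and Timing Yield Analysis (see DAC05 for more detail)
Timing analysis [Sapatnekar03, Chandu05] gives the delay distribution $(\mu_d, \sigma_d)$:
$$Delay = d_0 + \sum_{i=1}^{n} d_i X_i + d_{n+1} R$$
Power analysis gives the log-leakage distribution $(\mu_l, \sigma_l)$:
$$\log(Leakage) = l_0 + \sum_{i=1}^{n} l_i X_i + l_{n+1} R$$
Correlation (1 parameter):
$$\rho = \frac{\sum_{i=1}^{n} l_i d_i}{\sigma_d \sigma_l}$$
Delay and power bivariate JPDF: $(\mu_d, \sigma_d, \mu_l, \sigma_l, \rho)$
[Figure: joint distribution over Delay and Log(Leakage)]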
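Given the five parameters of the bivariate distribution of delay and log-leakage, yield is the probability mass inside the delay and leakage limits. The sketch below treats both quantities as jointly Gaussian (as the analysis above does for delay and log-leakage) and estimates yield by Monte Carlo sampling. For the special case where both limits sit at the means, it also uses the known closed form for the bivariate normal orthant probability, 1/4 + arcsin(ρ)/(2π). Function names and numeric values are assumptions for illustration, not the estimator used in the work.

```python
import math
import random


def yield_at_means(rho):
    """P(delay <= mu_d and log-leakage <= mu_l) for correlation rho."""
    return 0.25 + math.asin(rho) / (2.0 * math.pi)


def yield_monte_carlo(mu_d, sigma_d, mu_l, sigma_l, rho,
                      d_max, log_leak_max, samples=100_000, seed=0):
    """Sampled estimate of P(delay <= d_max and log-leakage <= log_leak_max)."""
    rng = random.Random(seed)
    independent_part = math.sqrt(1.0 - rho * rho)
    passed = 0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        delay = mu_d + sigma_d * z1
        # Correlate log-leakage with delay through the shared variate z1.
        log_leak = mu_l + sigma_l * (rho * z1 + independent_part * z2)
        if delay <= d_max and log_leak <= log_leak_max:
            passed += 1
    return passed / samples


# Assumed example: strongly inverse-correlated delay and leakage.
rho = -0.8
analytic = yield_at_means(rho)
sampled = yield_monte_carlo(1.0, 0.1, 0.0, 0.5, rho, d_max=1.0, log_leak_max=0.0)
```

As ρ approaches -1 the closed form goes to zero, which illustrates how inverse correlation between delay and leakage depresses yield when both limits are tight.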

Cut Set SSTA: Intuition
Consider the timing graph (nodes 1-10)
Traditional incremental timing: size up gate 7
[Figure: timing graph with the unperturbed left and right subgraphs, showing Arrival Time (AT), Required Arrival Time (RT), Cut Edge Time (CT), and Max Cut Edge Time]

If Forward SSTA Reverse SSTA then Cut Set SSTA will give exact same sensitivities as naïve approach that recomputes yield relating to all nodes, most being unchanged

Statistical Yield Optimization Results
Yield constraints: D < Dμ,initial, P < Pμ,initial

Circuit   Yield without L (%)   Yield with L (%)
c432      45.4                  80.2
c499      39.2                  59.0
c880      49.3                  83.2
c1908     47.9                  82.8
c2670     51.1                  85.3
c3540     51.2                  87.1
c5315     50.0                  87.3
c6288     50.3                  86.5
c7552     51.2                  80.8

Initial yield ~0-2% due to inverse correlation
Gate sizing alone provides good improvements
Combined with Lgate biasing, provides outstanding results
Chopra, et al., ICCAD05

Another approach to statistical optimization
General statistical optimization
Method relies on efficient deterministic formulations and variation space sampling to drive statistical optimization
Applicable to many mainstream VLSI design problems: gate sizing, Vth assignment, Leff biasing, as well as potential new levers

Statistically Optimized Body Bias Clustering for Post-Silicon Tuning
Concept: speed up critical gates using forward body bias (FBB) and slow down non-critical gates using reverse body bias (RBB) to meet timing and power constraints
Traditional view: centralized body bias generator controlling different die regions
  Ineffective for compensating intra-die variations
  Highly suboptimal power
[Figure: critical and non-critical gates driven by a BB controller]
$$V_{th} = V_{th0} + \gamma \left( \sqrt{2\Phi_F + V_{sb}} - \sqrt{2\Phi_F} \right)$$
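The body-effect relation behind FBB and RBB can be evaluated directly. The sketch below uses the standard form of the equation with an NMOS sign convention (positive source-to-body voltage is reverse bias); the device parameters are assumed round numbers, not values from the slides.

```python
import math


def vth_with_body_bias(vth0, gamma, phi_f, v_sb):
    """Threshold voltage under body bias (body-effect equation).

    vth0  : zero-bias threshold voltage (V)
    gamma : body-effect coefficient (V^0.5)
    phi_f : Fermi potential (V)
    v_sb  : source-to-body voltage (V); > 0 is reverse bias, < 0 forward bias
    """
    surface_term = 2.0 * phi_f + v_sb
    if surface_term < 0.0:
        raise ValueError("forward bias too large for this simple model")
    return vth0 + gamma * (math.sqrt(surface_term) - math.sqrt(2.0 * phi_f))


# Assumed parameters for illustration only.
nominal = vth_with_body_bias(0.30, 0.2, 0.4, 0.0)
reverse = vth_with_body_bias(0.30, 0.2, 0.4, 0.5)    # RBB: higher Vth, less leakage
forward = vth_with_body_bias(0.30, 0.2, 0.4, -0.5)   # FBB: lower Vth, faster gate
```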

Coarse Body Bias Assignment
Simplified assignment minimizing routing overheads
Biasing dictated by placement instead of gate criticality
Disregards the complex dependence of gate criticality on:
  Circuit topology
  Correlations in process variations
Effective in tightening delay but leads to high power
[Figure: one bias for all gates; delay and power distributions; critical and correlated gates]
Important to cluster gates to leverage ABB effectively

Proposed New Optimization Framework
Generate sample scenarios: scenario x assigns a sampled Leff to each gate (Leff_1.x ... Leff_7.x for the 7-gate example; scenarios ‘1’, ‘2’, ..., ‘x’)
DETERMINISTICALLY optimize each scenario (i.e., tune each gate for each die scenario): solve BB assignment for each scenario
Generate PDFs of optimal actions: per-gate BB-PDF and correlations ρi,j
Clustering
Post-silicon tuning
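The framework can be sketched end to end on a toy example. Everything below is a stand-in made up for illustration: the variation model, the rule used in place of the deterministic optimizer, and the clustering by mean bias (which ignores the correlations ρi,j used in the actual method). It only shows the data flow from sampled scenarios to per-gate bias distributions to clusters.

```python
import random
from statistics import mean


def sample_scenarios(n_gates, n_scenarios, sigma_die=1.0, sigma_gate=0.5, seed=0):
    """Sample normalized Leff deviations: one die-level term plus a per-gate term."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n_scenarios):
        die = rng.gauss(0.0, sigma_die)
        scenarios.append([die + rng.gauss(0.0, sigma_gate) for _ in range(n_gates)])
    return scenarios


def optimize_scenario(leff_dev, criticality, v_max=0.5):
    """Stand-in for the deterministic optimizer (assumed rule, not the real one).

    Positive bias = forward (speed up), negative = reverse (slow down).
    Slow devices (longer Leff) and critical gates receive more forward bias.
    """
    biases = []
    for dev, crit in zip(leff_dev, criticality):
        raw = 0.3 * dev + 0.4 * (crit - 0.5)
        biases.append(max(-v_max, min(v_max, raw)))
    return biases


def cluster_by_mean_bias(bias_samples, k):
    """Group gates into at most k clusters by the mean of their bias samples."""
    n_gates = len(bias_samples[0])
    means = [mean(s[g] for s in bias_samples) for g in range(n_gates)]
    order = sorted(range(n_gates), key=lambda g: means[g])
    size = -(-n_gates // k)   # ceiling division
    return [order[i:i + size] for i in range(0, n_gates, size)]


# Hypothetical 7-gate circuit with assumed criticalities in [0, 1].
criticality = [0.9, 0.1, 0.8, 0.2, 0.5, 0.15, 0.95]
scenarios = sample_scenarios(n_gates=7, n_scenarios=500)
bias_samples = [optimize_scenario(s, criticality) for s in scenarios]
clusters = cluster_by_mean_bias(bias_samples, k=2)
```

Each column of `bias_samples` plays the role of a per-gate BB-PDF; after fabrication, one bias value per cluster would be tuned for the measured die.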

Results vs. Traditional Dual-Vth
Leakage power
  Dual-Vth vs. 2-4 ABB clusters
  Avg. 28-38% (51-59%) lower μ (95th percentile)
Area
  Capo generates contiguous regions of similarly clustered cells while minimally displacing cells
  5-8% increase in wirelength and area
Delay
  3-9X tighter σ

A few conclusions
Parametric yield is a critical design objective going forward
  Requires accurate estimation of, and fast optimization approaches to, this key metric
  Envision all tools in 4-6 years being yield-driven, rather than driven by timing or power alone
Lots of room for improvement in many ‘well-studied’ CAD problems today
  Recent examples: dual-Vth + sizing, placement (Cong, et al.)
