
Architectural and Circuit-Levels Design Techniques for Power and Temperature Optimizations in On-Chip SRAM Memories

Houman Homayoun

PhD Candidate

Dept. of Computer Science, UC Irvine

April 2010 – Houman Homayoun, University of California, Irvine

Outline

Past Research

Low Power Design

Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010)

Clock Tree Leakage Power Management (ISQED-2010)

Thermal-Aware Design

Thermal Management in Register File (HiPEAC-2010)

Reliability-Aware Design

Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)

Performance Evaluation and Improvement

Adaptive Resource Resizing for Improving Performance in Embedded Processor (DAC-2008, LCTES-2008)

RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor

Houman Homayoun, Aseem Gupta, Alexander V. Veidenbaum

Avesta Sasan, Fadi J. Kurdahi, Nikil Dutt


Outline

Motivation

Background study

Study of register file underutilization

Study of register file default access patterns

Access concentration and activity redistribution to relocate register file access patterns

Results


Why Temperature?

Higher power densities (Watt per mm2) lead to higher operating temperatures, which

(i) Increase the probability of timing violations

(ii) Reduce IC lifetime

(iii) Lower operating frequency

(iv) Increase leakage power

(v) Require expensive cooling mechanisms

(vi) Overall increase in design effort and cost


Why Register File?

RF is one of the hottest units in a processor

A small, heavily multi-ported SRAM

Accessed very frequently

Example: IBM PowerPC 750FX, AMD Athlon 64

[Figure: AMD Athlon 64 core floorplan blocks, and a thermal image of the AMD Athlon 64 core floorplan blocks taken with infrared cameras. Courtesy of Renau et al., ISCA 2007]


Prior Work: Activity Migration

Reduces temperature by migrating the activity to a replicated unit.

Requires a replicated unit: large area overhead

Leads to a large performance degradation

[Figure: temperature vs. time across alternating active and idle periods, marking T ambient, T init, T crisis, and T final, with curves for AM and AM+PG.]


Conventional Register Renaming

[Figure: register renamer (register allocation-release) built from a free list and an active list, with head and tail pointers.]

Instruction #   Original code    Renamed code
1               RA <- ...        PR1 <- ...
2               .... <- RA       .... <- PR1
3               branch to _L     branch to _L
4               RA <- ...        PR4 <- ...
5               ...              ...
                ...              ...
6               _L:              _L:
7               .... <- RA       .... <- PR1

• Physical registers are allocated/released in a somewhat random order


Analysis of Register File Operation: Register File Occupancy

[Figure: fraction of execution time (0% to 100%) spent in each occupancy range (RF_occupancy < 16, 16 < RF_occupancy < 32, 32 < RF_occupancy < 48, 48 < RF_occupancy < 64), panels (a) and (b), for MiBench and SPECint2K.]

Performance Degradation with a Smaller Register File

[Figure: % performance degradation with 48-entry, 32-entry, and 16-entry register files, panels (a) and (b).]


Analysis of Register File Operation

Register File Access Distribution

Coefficient of variation (CV) shows a “deviation” from the average number of accesses for individual physical registers.

$$CV_{access} = \frac{1}{\overline{na}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( na_i - \overline{na} \right)^2}$$

$na_i$ is the number of accesses to physical register $i$ during a specific period (10K cycles), $\overline{na}$ is the average number of accesses, and $N$ is the total number of physical registers.
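As an illustration of the CV definition on this slide, the following minimal Python sketch computes the coefficient of variation from per-register access counts collected over one sampling window. The function name `access_cv` and the sample counts are hypothetical, and dividing by N (population standard deviation) is an assumption of the sketch.

```python
import math

def access_cv(access_counts):
    """Coefficient of variation of per-register access counts.

    access_counts[i] plays the role of na_i for one sampling window.
    Dividing by N (population form) is an assumption of this sketch.
    """
    n_regs = len(access_counts)
    mean = sum(access_counts) / n_regs
    if mean == 0:
        return 0.0  # no accesses in the window: report no variation
    variance = sum((c - mean) ** 2 for c in access_counts) / n_regs
    return math.sqrt(variance) / mean

# Hypothetical windows for a 64-entry register file.
uniform = [100] * 64             # every register accessed equally often
skewed = [400] * 16 + [0] * 48   # all accesses concentrated in 16 registers
```

A low value means accesses are spread evenly over the physical registers; a high value means they are concentrated in a few of them.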


Coefficient of Variation

[Figure: % coefficient of variation of register file accesses for (a) MiBench and (b) SPEC2K benchmarks.]


Register File Operation

Underutilization that is distributed uniformly

While only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution


RELOCATE: Access Redistribution within a Register File

The goal is to “concentrate” accesses within a partition of a RF (region)

Some regions will be idle (for 10K cycles)

Idle regions can be power-gated and allowed to cool down

[Figure: register activity for (a) baseline, (b) in-order, and (c) distant patterns.]


An Architectural Mechanism to Support Access Redistribution

Active partition: a register renamer partition currently used in register renaming

Idle partition: a register renamer partition which does not participate in renaming

Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers

Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers


Activity Migration without Replication

An access concentration mechanism allocates registers from only one partition

This default active partition (DAP) may run out of free registers before the 10K cycle “convergence period” is over

Another partition (chosen according to some algorithm) is then activated; such partitions are referred to as additional active partitions (AAP)

To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated.
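The allocation rule described on this slide can be sketched as a toy Python model. The class name `PartitionedRenamer`, the equal-size partitions, and the policy of activating the lowest-numbered unused partition are assumptions made for illustration; the slides only say the next partition is chosen "according to some algorithm".

```python
from collections import deque

class PartitionedRenamer:
    """Toy free-list model with physical registers split into partitions."""

    def __init__(self, num_regs=64, num_partitions=4, default=0):
        self.size = num_regs // num_partitions
        self.free = [deque(range(p * self.size, (p + 1) * self.size))
                     for p in range(num_partitions)]
        # DAP first, then AAPs in the order they were activated.
        self.active_order = [default]

    def _next_partition(self):
        # Hypothetical policy: lowest-numbered partition not yet active.
        for p in range(len(self.free)):
            if p not in self.active_order and self.free[p]:
                return p
        return None

    def allocate(self):
        # Prefer active partitions in activation order (concentration).
        for p in self.active_order:
            if self.free[p]:
                return self.free[p].popleft()
        p = self._next_partition()
        if p is None:
            return None  # no free register anywhere: renaming must stall
        self.active_order.append(p)
        return self.free[p].popleft()

    def release(self, reg):
        # A released register returns to the free list of its own partition.
        self.free[reg // self.size].append(reg)
```

Because released registers in the DAP are reused before any register in a later-activated partition, accesses drift back toward the DAP whenever it has room.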


The Access Concentration Mechanism

Partition activation order is 1-3-2-4

[Figure: four renamer partitions P1 to P4, each with its own free list and active list. Transitions between partitions are labeled with the conditions "Free-list 1 full", "Free-list 2 full", "Free-list 3 full", "Free-list 4 full" and "Active List 1 empty", "Active List 2 empty", "Active List 3 empty", "Active List 4 empty".]


The Redistribution Mechanism

The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)

Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle.

The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)

A physical register in an idle partition may be live

An idle RF region is power gated when its active list becomes empty.
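A minimal Python sketch of the bookkeeping implied by this slide follows. It tracks live registers per region, switches the default partition at an epoch boundary, and gates a region only once its partition is idle and it holds no live register. The class name `RegionPowerState` and the round-robin choice of the new default partition are assumptions; the slides leave the selection algorithm open.

```python
class RegionPowerState:
    """Toy model of RF regions under periodic activity redistribution."""

    def __init__(self, num_partitions=4):
        self.live = [0] * num_partitions   # live registers per region
        self.default = 0                   # current default active partition
        self.active = {0}                  # partitions used for renaming
        self.gated = set(range(1, num_partitions))

    def on_allocate(self, p):
        self.active.add(p)
        self.gated.discard(p)              # region must be powered to hold data
        self.live[p] += 1

    def on_release(self, p):
        self.live[p] -= 1
        self._maybe_gate(p)

    def _maybe_gate(self, p):
        # Gate only an idle partition whose region has no live register.
        if p not in self.active and self.live[p] == 0:
            self.gated.add(p)

    def new_epoch(self):
        # Hypothetical policy: round-robin selection of the new default.
        old_active = self.active
        self.default = (self.default + 1) % len(self.live)
        self.active = {self.default}
        self.gated.discard(self.default)
        for p in old_active:
            self._maybe_gate(p)
```

In this model a region that still holds live values stays powered after its partition goes idle, and is gated on the release that empties it.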


Performance Impact?

There is a two-cycle delay to wake up a power-gated physical register region

Register renaming occurs in the front end of the microprocessor pipeline, whereas register access occurs in the back end.

There is a delay of at least two pipeline stages between renaming and accessing a physical register

The requested region can therefore be woken up in time, without incurring a performance penalty at the time of access


Experimental Setup

Table 1. Processor Architecture

L1 I-cache                8KB, 4 way, 2 cycles
L1 D-cache                8KB, 4 way, 2 cycles
L2-cache                  128KB, 15 cycles
Fetch, dispatch           2 wide
Register file             64 entry
Memory                    50 cycles
Instruction fetch queue   2
Load/store queue          16 entry
Arithmetic units          2 integer
Complex unit              2 INT
Pipeline                  12 stages
Processor speed           800 MHz
Issue                     Out-of-order

Table 2. RF Design Specification

Process                                                45nm CMOS, 9 metal layers
Register file layout area                              0.009 mm2
Operating modes                                        Active: R/W; Sleep: no data retention
Operating voltage                                      0.6V~1.1V
Read access cycle                                      200MHz to 1.1GHz
Access time, typical corner (0.9V, 45 °C)              0.32ns
Active power (total), typical corner (0.9V, 45 °C)     66mW @ 800MHz
Active leakage power, typical corner (0.9V, 45 °C)     15mW
Sleep leakage power, typical corner (0.9V, 45 °C)      2mW
Wakeup delay                                           0.42ns
Wakeup energy per register file row (64 bits)          0.42nJ

MASE (SimpleScalar 4.0)

Model MIPS-74K processor, 800 MHz

MiBench and SPECint2K benchmarks compiled with Compaq compiler, -O4 flag

Industrial memory compiler used

64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology

HotSpot to estimate thermal profiles
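To relate the Table 2 figures to the power-gating decision, the following back-of-the-envelope Python sketch estimates how long a region must stay asleep before the saved leakage repays its wakeup energy. It is not part of the original evaluation. It assumes leakage is spread evenly over the 64 rows and ignores the energy needed to enter sleep; the function name `break_even_cycles` is hypothetical.

```python
# Values taken from Table 2 (typical corner).
ACTIVE_LEAK_W = 15e-3              # active leakage power
SLEEP_LEAK_W = 2e-3                # sleep leakage power
WAKEUP_ENERGY_PER_ROW_J = 0.42e-9  # wakeup energy per 64-bit row
ROWS = 64                          # 64-entry register file
CLOCK_HZ = 800e6                   # processor speed

def break_even_cycles(region_rows):
    """Cycles a region must sleep for the leakage saving to repay wakeup.

    Assumes leakage scales linearly with the number of rows in the region
    and ignores the cost of entering sleep (both are assumptions).
    """
    saved_power = (ACTIVE_LEAK_W - SLEEP_LEAK_W) * region_rows / ROWS
    wake_energy = WAKEUP_ENERGY_PER_ROW_J * region_rows
    return wake_energy / saved_power * CLOCK_HZ
```

Under these assumptions the estimate does not depend on the region size and comes out at roughly 1,650 cycles, well inside the 10K-cycle idle periods mentioned earlier.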


Results: Power Reduction

[Figure: RF power reduction (%) with num_partition=2, num_partition=4, and num_partition=8 for (a) MiBench and (b) SPEC2K benchmarks.]


Analysis of Power Reduction

Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition

This indicates that the wakeup overhead is amortized for a larger number of partitions

Some exceptions:

The overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases

Power gating becomes frequent but ineffective, and its overhead grows, as the number of partitions increases


Peak Temperature Reduction

Table 1. Peak temperature reduction for MiBench benchmarks

Columns: benchmark, base temperature (°C), and temperature reduction (°C) with 2, 4, and 8 partitions (2P, 4P, 8P)

basicMath 94.3 3.6 4.8 5.0

bc 95.4 3.8 4.4 5.2

crc 92.8 5.3 6.0 6.0

dijkstra 98.4 6.3 6.8 6.4

djpeg 96.3 2.8 3.5 2.4

fft 94.5 6.8 7.4 7.6

gs 89.8 6.5 7.4 9.7

gsm 92.3 5.8 6.7 6.9

lame 90.6 6.2 8.5 11.3

mad 93.3 3.8 4.3 2.2

patricia 79.2 11.0 12.4 13.2

qsort 88.3 10.1 11.6 11.9

search 93.8 8.7 9.3 9.1

sha 90.1 5.1 5.4 4.5

susan_corners 92.7 4.7 5.3 5.1

susan_edges 91.9 3.7 5.8 6.3

tiff2bw 98.5 4.5 5.9 4.1

average 92.5 5.6 6.8 6.9

Table 2. Peak temperature reduction for SPEC2K integer benchmarks

Columns: benchmark, base temperature (°C), and temperature reduction (°C) with 2, 4, and 8 partitions (2P, 4P, 8P)

bzip2 92.7 4.8 3.9 3.1

crafty 83.6 9.5 11 10.4

eon 77.3 10.6 12.4 12.5

galgel 89.4 6.9 7.2 5.8

gap 86.7 4.8 5.9 7.1

gcc 79.8 7.9 9.4 10.1

gzip 95.4 3.2 3.8 3.9

mcf 85.8 6.9 8.7 9.4

parser 97.8 4.3 5.8 4.8

perlbmk 85.8 10.6 12.3 12.6

twolf 86.2 8.8 10.2 10.5

vortex 81.7 11.3 12.5 12.9

vpr 94.6 4.9 5.2 4.4

average 87.4 7.2 8.3 8.2


Analysis of Temperature Reduction

Increasing the number of partitions results in larger power density in each partition because RF access activity is concentrated in a smaller partition

While capturing more idle partitions and power gating them may potentially result in higher power reduction, larger power density due to smaller partition size results in overall higher temperature


Conclusions

Showed Register File Underutilization

Studied Register file default access patterns

Proposed access concentration and activity redistribution to relocate register file accesses

Results show a noticeable power and temperature reduction in the RF

The RELOCATE technique can be applied when units are underutilized, as opposed to activity migration, which requires replication


Current and Future Work Extension

Formulate the Best partition selection out of available partitions for activity redistribution.

Apply activity concentration and redistribution mechanism to other hot units; example: L1 cache.

Apply Proactive NBTI Recovery to the idle partitions to improve lifetime reliability.

Trade-off NBTI recovery and power gating to simultaneously reduce power and improve lifetime reliability.

Tackle the temperature barrier in 3D stack processor design using similar activity concentration and redistribution.

Multiple Sleep Modes Leakage Control for Cache Peripherals

Houman Homayoun, Avesta Sasan, Alexander V. Veidenbaum


On-chip Caches and Power

On-chip caches in high-performance processors are large: more than 60% of chip budget

Dissipate a significant portion of power via leakage

Much of it was in the SRAM cells

Many architectural techniques proposed to remedy this

Today, there is also significant leakage in the peripheral circuits of an SRAM (cache)

In part because cell design has been optimized

[Figure: Pentium M processor die photo. Courtesy of intel.com]


Peripherals?

Data Input/Output Driver

Address Input/Output Driver

Row Pre-decoder

Wordline Driver

Row Decoder

[Figure: SRAM organization showing address input global drivers, predecoder and global wordline drivers, decoder, global and local wordlines, bitlines, sense amplifiers, and global output drivers.]

[Figure: leakage power (pW, log scale from 1 to 100000) of a memory cell compared with inverters from INVX to INV32X; annotations mark 200X and 6300X.]

Cells use minimum-sized transistors for area considerations, while peripherals use larger, faster, and accordingly more leaky transistors to satisfy timing requirements.

Cells use high-Vt transistors, while peripherals use typical threshold voltage transistors.


Power Components of L2 Cache

SRAM peripheral circuits dissipate more than 90% of the total leakage power

[Figure: leakage breakdown: local row decoders 33%, global data output drivers 25%, global data input drivers 14%, global address input drivers 11%, local data output drivers 8%, others 8%, global row predecoder 1%.]

[Figure: share of leakage and dynamic power (0% to 100%) in the L2 cache per benchmark, from ammp to wupwise, plus the average.]

L2 cache leakage power dominates its dynamic power: above 87% of the total


Techniques Address Leakage in SRAM Cell

Circuit

Gated-Vdd, Gated-Vss

Voltage Scaling (DVFS)

ABB-MTCMOS

Forward Body Biasing (FBB), RBB

Sleepy Stack

Sleepy Keeper

Architecture

Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, read tag first

Drowsy Cache: keeps cache lines in low-power state, with data retention

Cache Decay: evict lines not used for a while, then power them down

Applying DVS, Gated Vdd, Gated Vss to memory cell: much architectural support exists to do that

All of these target the SRAM memory cell


Sleep Transistor Stacking Effect

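Subthreshold current: inverse exponential function of threshold voltage

$$V_T = V_{T0} + \gamma \left( \sqrt{\left| 2\phi_F + V_{SB} \right|} - \sqrt{\left| 2\phi_F \right|} \right)$$

Stacking transistor N with slpN: the source-to-body voltage ($V_M$) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off

[Figure: NMOS transistor N stacked on footer sleep transistor slpN between vdd and vss, with gate voltages $V_{gn}$ and $V_{gslpn}$, intermediate node voltage $V_M$, and load capacitance $C_L$.]

Drawback: rise time, fall time, wakeup delay, area, dynamic power, instability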
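To make the stacking argument concrete, the following Python sketch evaluates the body-effect expression for the threshold voltage and the resulting relative subthreshold leakage. All device parameters (`v_t0`, `gamma`, `phi_f`, the slope factor `n`) are illustrative assumptions rather than values from the slides. The sketch models only the body effect; the negative gate-to-source voltage and reduced drain-to-source voltage that stacking also produces are left out.

```python
import math

def threshold_voltage(v_sb, v_t0=0.3, gamma=0.4, phi_f=0.4):
    """V_T = V_T0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F)).

    Parameter values are illustrative assumptions, not process data.
    """
    return v_t0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))

def relative_subthreshold_leakage(v_sb, n=1.5, v_thermal=0.026):
    """Leakage relative to V_SB = 0, with I_sub ~ exp(-V_T / (n * kT/q))."""
    delta_vt = threshold_voltage(v_sb) - threshold_voltage(0.0)
    return math.exp(-delta_vt / (n * v_thermal))

# Hypothetical sweep of the intermediate node voltage V_M (in volts).
sweep = [(v_m, relative_subthreshold_leakage(v_m)) for v_m in (0.0, 0.05, 0.1, 0.2)]
```

The sweep shows leakage falling monotonically as the intermediate node voltage rises, which is the trend the slide relies on.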


Impact on Rise Time and Fall Time

The rise time and fall time of the output of an inverter are proportional to Rpeq * CL and Rneq * CL, respectively

Inserting the sleep transistors increases both Rneq and Rpeq

[Figure: two cascaded inverters (P1/N1 and P2/N2) between vdd and vss, with header sleep transistors slpP1 and slpP2 and footer sleep transistors slpN1 and slpN2; node values 0, 1, 0 and the leakage current paths are marked.]

Increase in rise time

Increase in fall time

Impact on performance

Impact on memory functionality


A Zig-Zag Circuit

Rpeq for the first and third inverters and Rneq for the second and fourth inverters do not change.

Fall time of the circuit does not change

To improve the leakage reduction and area efficiency of the zig-zag scheme, use one set of sleep transistors shared between multiple stages of inverters

Zig-Zag Horizontal Sharing

Zig-Zag Horizontal and Vertical Sharing


Zig-Zag Horizontal and Vertical Sharing

[Figure: two wordline driver chains (line K with P11 to P14 and N11 to N14, line K+1 with P21 to P24 and N21 to N24) sharing one footer sleep transistor slpN and one header sleep transistor slpP, each driven by a sleep signal; the shared virtual node is at voltage VM.]

To improve the leakage reduction and area efficiency of the zig-zag scheme, use one set of sleep transistors shared between multiple stages of inverters

Zig-Zag Horizontal Sharing

Minimize impact on rise time

Minimize area overhead

Zig-Zag Horizontal and Vertical Sharing

Maximize leakage power saving

Minimize the area overhead


ZZ-HVS Evaluation : Power Result

Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead

Leakage power reduction varies from 10X to 100X when 1 to 10 wordlines share the same sleep transistors

2~10X more leakage reduction compared to the zig-zag scheme

[Figure: leakage power (nW, log scale from 1 to 1000) vs. number of wordline rows (1 to 10) for baseline, redundant, zigzag, zz-hs, and zz-hvs; annotations mark x2, x10, x12, and x100.]


Wakeup Latency

To benefit the most from the leakage savings of stacking sleep transistors, keep the bias voltage of the NMOS sleep transistor as low as possible (and for PMOS as high as possible)

Drawback: impact on the wakeup latency of wordline drivers

Control the gate voltage of the sleep transistors

Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (VM)

Reduction in the circuit wakeup delay overhead

Reduction in leakage power savings


Wakeup Delay vs. Leakage Power Reduction

[Figure: normalized leakage power (1.0 to 5.0) and normalized wake-up delay (0.2 to 1) plotted against the (footer, header) gate bias voltage pair.]

Trade-off between the wakeup overhead and the leakage power saving

Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead


Multiple Sleep Modes

power mode   wakeup delay (cycles)   leakage reduction (%)
basic-lp     1                       42%
lp           2                       75%
aggr-lp      3                       81%
ultra-lp     4                       90%

Power overhead of waking up peripheral circuits

Almost equivalent to the switching power of the sleep transistors

Sharing a set of sleep transistors horizontally and vertically for multiple stages of a (wordline) driver makes the power overhead even smaller
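The sleep-mode table can be captured as a small lookup, sketched here in Python. The helper `deepest_mode` and its selection rule (pick the largest leakage reduction whose wakeup delay fits a cycle budget) are assumptions added for illustration, not a policy stated on the slides.

```python
# (wakeup delay in cycles, fractional leakage reduction), from the table above.
SLEEP_MODES = {
    "basic-lp": (1, 0.42),
    "lp": (2, 0.75),
    "aggr-lp": (3, 0.81),
    "ultra-lp": (4, 0.90),
}

def deepest_mode(max_wakeup_cycles):
    """Return the mode with the largest leakage reduction that can wake up
    within max_wakeup_cycles, or None if no mode fits (hypothetical rule)."""
    candidates = [(reduction, name)
                  for name, (delay, reduction) in SLEEP_MODES.items()
                  if delay <= max_wakeup_cycles]
    return max(candidates)[1] if candidates else None
```

For example, a budget of three cycles selects aggr-lp, while a budget of zero cycles leaves the peripherals fully powered.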


Reducing Leakage in L1 Data Cache

Maximize the leakage reduction in DL1 cache

Put DL1 peripherals into ultra low power mode

Adds 4 cycles to the DL1 latency

Significantly reduces performance

Minimize performance degradation

Put DL1 peripherals into the basic low power mode

Requires only one cycle to wake up, and this latency can be hidden during the address computation stage, thus not degrading performance

No noticeable leakage power reduction


Motivation for Dynamically Controlling Sleep Mode

Large leakage reduction benefit: ultra and aggressive low power modes

Low performance impact benefit: basic-lp mode

Periods of frequent access: basic-lp mode

Periods of infrequent access: ultra and aggressive low power modes

Dynamically adjust the peripheral circuit sleep power mode


Reducing DL1 Wakeup Delay

Can determine whether an instruction is a load or a store at least one cycle prior to the cache access

Accessing DL1 while its peripherals are in basic-lp mode doesn’t require an extra cycle

wake up DL1 peripherals one cycle prior to access

One cycle of wakeup delay can be hidden for all other low-power modes

Reducing the wakeup delay by one cycle

Put DL1 in basic-lp mode by default
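The effect of the early wakeup can be expressed as simple arithmetic, sketched below in Python. The wakeup delays are those of the sleep-mode table; the constant `HIDDEN_CYCLES` and the function name are hypothetical names for the one cycle that the slide says can be hidden.

```python
# Wakeup delay of each mode in cycles, from the sleep-mode table.
WAKEUP_CYCLES = {"basic-lp": 1, "lp": 2, "aggr-lp": 3, "ultra-lp": 4}

# Loads and stores are identified at least one cycle before the access,
# so one wakeup cycle overlaps with earlier pipeline work.
HIDDEN_CYCLES = 1

def exposed_wakeup_cycles(mode):
    """Extra DL1 latency seen by an access that finds the cache in `mode`."""
    return max(0, WAKEUP_CYCLES[mode] - HIDDEN_CYCLES)
```

With this accounting, basic-lp adds no visible latency, which is why it is a reasonable default state.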


Architectural Motivation

A load miss in L1/L2 caches takes a long time to service

Prevents dependent instructions from being issued

When dependent instructions cannot issue, performance is lost

At the same time, energy is lost as well!

This is an opportunity to save energy


Low-end Architecture

Given the miss service time of 30 cycles, it is likely that the processor stalls during the miss service period

Occurrence of additional cache misses while one DL1 cache miss is already pending further increases the chance of a pipeline stall

[Figure: state diagram over the basic-lp, lp, aggr-lp, and ultra-lp modes, with transitions labeled "DL1 miss", "DL1 miss++", "Pending DL1 miss", "Pending DL1 miss/es", "Processor stall", "DL1 miss serviced", and "Processor continue".]
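One possible reading of the low-end state diagram is sketched below in Python. The exact transition conditions are not recoverable from the slide, so the mapping used here (first miss to lp, further misses to aggr-lp, stall to ultra-lp, return to basic-lp when no miss is pending) is an assumption, as is the class name `LowEndDL1Controller`.

```python
class LowEndDL1Controller:
    """Hypothetical sleep-mode controller for DL1 peripherals."""

    def __init__(self):
        self.mode = "basic-lp"   # default mode between accesses
        self.pending = 0         # outstanding DL1 misses

    def on_dl1_miss(self):
        self.pending += 1
        if self.mode != "ultra-lp":
            # Assumed: first miss -> lp, additional misses -> aggr-lp.
            self.mode = "lp" if self.pending == 1 else "aggr-lp"

    def on_processor_stall(self):
        # Assumed: a stall while misses are pending allows the deepest mode.
        if self.pending > 0:
            self.mode = "ultra-lp"

    def on_miss_serviced(self):
        self.pending = max(0, self.pending - 1)
        if self.pending == 0:
            # Processor continues: return to the default mode.
            self.mode = "basic-lp"
```

The deeper modes are entered only when a stall becomes more likely, so their longer wakeup delay overlaps with time the processor would spend waiting anyway.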


Low Power Modes in a 2KB DL1 Cache

Fraction of total execution time the DL1 cache spends in each of the power modes

85% of the time the DL1 peripherals are put into low power modes

Most of the time is spent in the basic-lp mode (58% of total execution time)

[Figure: fraction of execution time (0% to 100%) in the hp, trivial-lp, lp, aggr-lp, and ultra-lp modes per benchmark, from basicmath to tiff2bw, plus the average.]


Low Power Modes in Low-End Architecture

[Figure (a): frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB, 4KB, 8KB, and 16KB DL1 caches.]

Increasing the cache size reduces the DL1 cache miss rate

Reduces opportunities to put the cache into more aggressive low power modes

Reduces performance degradation for a larger DL1 cache

[Figure: performance degradation (0% to 10%) per benchmark, from basicmath to tiff2bw, plus the average, for 2KB, 4KB, 8KB, and 16KB DL1 caches.]


High-end Architecture

[Figure: state diagram over the basic-lp, lp, and ultra-lp modes, with transitions labeled "DL1 miss", "Pending DL1 miss/es", "DL1 miss serviced", "L2 miss", and "L2 miss serviced".]

DL1 transitions to ultra-lp mode right after an L2 miss occurs

Given a long L2 cache miss service time (80 cycles) the processor will stall waiting for memory

DL1 returns to the basic-lp mode once the L2 miss is serviced


Leakage Power Reduction

DL1 leakage is reduced by 50%

While ultra-lp mode occurs much less frequently than basic-lp mode, its leakage reduction is comparable to that of basic-lp mode.

In ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode.

[Figure: leakage reduction (0% to 80%) contributed by the trivial-lp, lp, aggr-lp, and ultra-lp modes per benchmark, from basicmath to tiff2bw, plus the average.]

[Figure: leakage reduction (0% to 90%) contributed by the trivial-lp, lp, and ultra-lp modes per benchmark, from ammp to wupwise, plus the average.]

The average leakage reduction is almost 50%


Conclusion

Highlighted the large leakage power dissipation in SRAM peripheral circuits.

Proposed zig-zag share to reduce leakage in SRAM peripheral circuits.

Extended zig-zag share with multiple sleep modes, which trade off leakage power reduction against wakeup delay overhead.

Applied multiple sleep modes technique in L1 cache of an embedded processor.

Presented Leakage power reduction.
