architectural and circuit-levels design techniques for power and temperature optimizations in on-...
TRANSCRIPT
Architectural and Circuit-Level Design Techniques for Power and Temperature Optimizations in On-Chip SRAM Memories
Houman Homayoun
PhD Candidate
Dept. of Computer Science, UC Irvine
April 2010 – Houman Homayoun University of California Irvine 2
Outline
Past Research
Low Power Design: Power Management in Cache Peripheral Circuits (CASES-2008, ICCD-2008, ICCD-2007, TVLSI, CF-2010); Clock Tree Leakage Power Management (ISQED-2010)
Thermal-Aware Design: Thermal Management in Register File (HiPEAC-2010)
Reliability-Aware Design: Process Variation Aware Cache Architecture for Aggressive Voltage-Frequency Scaling (DATE-2009, CASES-2009)
Performance Evaluation and Improvement: Adaptive Resource Resizing for Improving Performance in Embedded Processors (DAC-2008, LCTES-2008)
RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processors
Houman Homayoun, Aseem Gupta, Alexander V. Veidenbaum
Avesta Sasan, Fadi J. Kurdahi, Nikil Dutt
Outline
Motivation
Background study
Study of register file underutilization
Study of register file default access patterns
Access concentration and activity redistribution to relocate register file access patterns
Results
Why Temperature?
Higher power densities (watts per mm²) lead to higher operating temperatures, which:
(i) increase the probability of timing violations
(ii) reduce IC lifetime
(iii) lower operating frequency
(iv) increase leakage power
(v) require expensive cooling mechanisms
(vi) overall, increase design effort and cost
Why Register File?
The RF is one of the hottest units in a processor: a small, heavily multi-ported SRAM that is accessed very frequently.
Example: IBM PowerPC 750FX, AMD Athlon 64
[Figure: thermal image of AMD Athlon 64 core floorplan blocks, taken with infrared cameras. Courtesy of Renau et al., ISCA 2007]
Prior Work: Activity Migration
Reduces temperature by migrating the activity to a replicated unit:
- requires a replicated unit (large area overhead)
- leads to a large performance degradation
[Figure: temperature vs. time under activity migration (AM) and AM with power gating (AM+PG); temperature oscillates between T_init and T_final, above T_ambient and below T_crisis, across active and idle periods]
Conventional Register Renaming
Free list and active list, with head and tail pointers.

Instruction # | Original code | Renamed code
1 | RA <- ... | PR1 <- ...
2 | .... <- RA | .... <- PR1
3 | branch to _L | branch to _L
4 | RA <- ... | PR4 <- ...
5 | ... | ...
6 | _L: | _L:
7 | .... <- RA | .... <- PR1

Register renamer: register allocation and release.
• Physical registers are allocated/released in a somewhat random order
Analysis of Register File Operation: Register File Occupancy
[Figure: fraction of time spent at each register file occupancy level, binned as RF_occupancy < 16, 16–32, 32–48, and 48–64 entries, for (a) MiBench and (b) SPECint2K]
[Figure: % performance degradation with a smaller register file (48-entry, 32-entry, 16-entry) for (a) MiBench and (b) SPECint2K]
Performance Degradation with a Smaller Register File
Analysis of Register File Operation: Register File Access Distribution
The coefficient of variation (CV) shows the deviation from the average number of accesses across individual physical registers:

CV_access = sqrt( (1/N) * Σ_i (na_i − n̄a)² ) / n̄a

where na_i is the number of accesses to physical register i during a specific period (10K cycles), n̄a is the average number of accesses per register, and N is the total number of physical registers.
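As a sketch, the CV above can be computed directly from a vector of per-register access counts (the function name and sample data are illustrative, not from the slides):

```python
import math

def cv_access(accesses):
    """Coefficient of variation of per-register access counts.

    accesses: list where accesses[i] is the number of accesses (na_i)
    to physical register i during the sampling period (e.g. 10K cycles).
    Returns sigma / mean, the CV_access defined on the slide.
    """
    n = len(accesses)
    mean = sum(accesses) / n
    variance = sum((na_i - mean) ** 2 for na_i in accesses) / n
    return math.sqrt(variance) / mean

# A perfectly uniform access pattern yields CV = 0; the more skewed the
# access distribution, the larger the CV.
print(cv_access([100, 100, 100, 100]))  # 0.0
print(cv_access([400, 0, 0, 0]))        # all accesses hit one register
```

The small measured CV values on the next slide (under ~14%) are what indicate that accesses are spread nearly uniformly over the physical registers.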
Coefficient of Variation
[Figure: % coefficient of variation of register file accesses for (a) MiBench and (b) SPEC2K benchmarks]
Register File Operation
Underutilization that is uniformly distributed: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution.
RELOCATE: Access Redistribution within a Register File
The goal is to "concentrate" accesses within a partition of the RF (a region). Some regions will then be idle (for 10K cycles), so we can power-gate them and allow them to cool down.
[Figure: register activity under (a) the baseline, (b) in-order, and (c) distant redistribution patterns]
An Architectural Mechanism to Support Access Redistribution
Active partition: a register renamer partition currently used in register renaming.
Idle partition: a register renamer partition which does not participate in renaming.
Active region: a region of the register file, corresponding to a register renamer partition (whether active or idle), which has live registers.
Idle region: a region of the register file, corresponding to a register renamer partition (whether active or idle), which has no live registers.
Activity Migration without Replication
The access concentration mechanism allocates registers from only one partition. This default active partition (DAP) may run out of free registers before the 10K-cycle "convergence period" is over; another partition (chosen according to some algorithm) is then activated (referred to as an additional active partition, or AAP). To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated.
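A minimal sketch of this allocation policy (class name, sizes, and the particular activation order are illustrative assumptions, not from the slides):

```python
class ConcentratingAllocator:
    """Sketch of the access concentration mechanism.

    Registers are allocated only from the default active partition (DAP);
    when every currently active partition is out of free registers, the
    next partition in a fixed activation order is activated (an AAP).
    Allocation always scans active partitions in activation order, so
    accesses stay concentrated in the earliest-activated partitions.
    """

    def __init__(self, num_partitions=4, regs_per_partition=16,
                 activation_order=(0, 2, 1, 3)):
        self.free = {p: list(range(p * regs_per_partition,
                                   (p + 1) * regs_per_partition))
                     for p in range(num_partitions)}
        self.order = activation_order
        self.active = [activation_order[0]]  # start with the DAP only

    def allocate(self):
        # Scan active partitions in activation order: oldest first.
        for p in self.active:
            if self.free[p]:
                return self.free[p].pop()
        # All active partitions exhausted: activate the next one (AAP).
        for p in self.order:
            if p not in self.active:
                self.active.append(p)
                return self.free[p].pop()
        raise RuntimeError("register file full")
```

With 16 registers per partition, the first 16 allocations all come from the DAP (partition 0 here); the 17th allocation activates the next partition in the activation order, mirroring the slide's "run out of free registers" case.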
The Access Concentration Mechanism
Example: the partition activation order is 1-3-2-4.
[Figure: four register renamer partitions P1–P4, each with its own free list and active list. P1 is activated first (free-list 1 full, active list 1 empty); P3, P2, and P4 are activated in turn as the earlier partitions run out of free registers]
The Redistribution Mechanism
The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm). Once a new default partition (NDP) is selected, all active partitions (DAP + AAPs) become idle. The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up), because a physical register in an idle partition may still be live. An idle RF region is power-gated when its active list becomes empty.
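A minimal sketch of this redistribution step (the data layout and the round-robin NDP choice are illustrative assumptions; the slides only say "according to some algorithm"):

```python
def update_power_gating(partitions, cycle, period=10_000):
    """Sketch of the redistribution mechanism.

    partitions: list of dicts with keys 'renaming' (partition currently
    participates in register renaming) and 'live' (count of live
    registers in the corresponding RF region).
    Every `period` cycles a new default partition (NDP) is selected and
    every other partition stops renaming; an idle region is power-gated
    only once it holds no live registers (its active list is empty).
    """
    if cycle % period == 0:
        ndp = (cycle // period) % len(partitions)  # round-robin NDP choice
        for i, p in enumerate(partitions):
            p['renaming'] = (i == ndp)
    for p in partitions:
        # Idle region with an empty active list -> safe to power-gate.
        p['gated'] = (not p['renaming']) and p['live'] == 0

parts = [{'renaming': True, 'live': 5},
         {'renaming': False, 'live': 0},
         {'renaming': False, 'live': 2},
         {'renaming': False, 'live': 0}]
update_power_gating(parts, cycle=10_000)
```

In this run the rotation makes partition 1 the NDP; partition 3 is idle with no live registers and gets gated, while partitions 0 and 2 stay powered because they still hold live registers.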
Performance Impact?
There is a two-cycle delay to wake up a power-gated physical register region. However, register renaming occurs in the front end of the microprocessor pipeline, whereas the register access occurs in the back end: there is a delay of at least two pipeline stages between renaming and accessing a physical register. The mechanism can therefore wake up the required register file region in time, without incurring a performance penalty at the time of access.
Experimental Setup
Table 1. Processor architecture
L1 I-cache: 8KB, 4-way, 2 cycles
L1 D-cache: 8KB, 4-way, 2 cycles
L2 cache: 128KB, 15 cycles
Fetch, dispatch: 2 wide
Register file: 64 entries
Memory: 50 cycles
Instruction fetch queue: 2
Load/store queue: 16 entries
Arithmetic units: 2 integer
Complex unit: 2 INT
Pipeline: 12 stages
Processor speed: 800 MHz
Issue: out-of-order

Table 2. RF design specification
Process: 45nm CMOS, 9 metal layers
Register file layout area: 0.009 mm²
Operating modes: Active (R/W); Sleep (no data retention)
Operating voltage: 0.6V~1.1V
Read access cycle: 200 MHz to 1.1 GHz
Access time, typical corner (0.9V, 45°C): 0.32ns
Active power (total), typical corner (0.9V, 45°C): 66mW @ 800 MHz
Active leakage power, typical corner (0.9V, 45°C): 15mW
Sleep leakage power, typical corner (0.9V, 45°C): 2mW
Wakeup delay: 0.42ns
Wakeup energy per register file row (64 bits): 0.42nJ

Simulation: MASE (SimpleScalar 4.0) modeling a MIPS-74K processor at 800 MHz; MiBench and SPECint2K benchmarks compiled with the Compaq compiler, -O4 flag. An industrial memory compiler was used for the 64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology; HotSpot was used to estimate thermal profiles.
Results: Power Reduction
[Figure: RF power reduction (%) for num_partition = 2, 4, and 8, for (a) MiBench and (b) SPEC2K]
Analysis of Power Reduction
Increasing the number of RF partitions provides more opportunity to capture unmapped registers and cluster them into a partition; it also indicates that the wakeup overhead is amortized over a larger number of partitions.
Some exceptions: the overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases, due to frequent but ineffective power gating and its overhead.
Peak Temperature Reduction
Table 1. Peak temperature reduction for MiBench benchmarks
benchmark | base temperature (°C) | temperature reduction (°C) for 2P | 4P | 8P partitions
basicMath 94.3 3.6 4.8 5.0
bc 95.4 3.8 4.4 5.2
crc 92.8 5.3 6.0 6.0
dijkstra 98.4 6.3 6.8 6.4
djpeg 96.3 2.8 3.5 2.4
fft 94.5 6.8 7.4 7.6
gs 89.8 6.5 7.4 9.7
gsm 92.3 5.8 6.7 6.9
lame 90.6 6.2 8.5 11.3
mad 93.3 3.8 4.3 2.2
patricia 79.2 11.0 12.4 13.2
qsort 88.3 10.1 11.6 11.9
search 93.8 8.7 9.3 9.1
sha 90.1 5.1 5.4 4.5
susan_corners 92.7 4.7 5.3 5.1
susan_edges 91.9 3.7 5.8 6.3
tiff2bw 98.5 4.5 5.9 4.1
average 92.5 5.6 6.8 6.9
Table 2. Peak temperature reduction for SPEC2K integer benchmarks
benchmark | base temperature (°C) | temperature reduction (°C) for 2P | 4P | 8P partitions
bzip2 92.7 4.8 3.9 3.1
crafty 83.6 9.5 11 10.4
eon 77.3 10.6 12.4 12.5
galgel 89.4 6.9 7.2 5.8
gap 86.7 4.8 5.9 7.1
gcc 79.8 7.9 9.4 10.1
gzip 95.4 3.2 3.8 3.9
mcf 85.8 6.9 8.7 9.4
parser 97.8 4.3 5.8 4.8
perlbmk 85.8 10.6 12.3 12.6
twolf 86.2 8.8 10.2 10.5
vortex 81.7 11.3 12.5 12.9
vpr 94.6 4.9 5.2 4.4
average 87.4 7.2 8.3 8.2
Analysis of Temperature Reduction
Increasing the number of partitions results in larger power density in each partition, because RF access activity is concentrated in a smaller partition. While capturing more idle partitions and power-gating them may potentially result in higher power reduction, the larger power density due to the smaller partition size results in an overall higher temperature.
Conclusions
Showed register file underutilization.
Studied register file default access patterns.
Proposed access concentration and activity redistribution to relocate register file accesses.
Results show a noticeable power and temperature reduction in the RF.
The RELOCATE technique can be applied when units are underutilized, as opposed to activity migration, which requires replication.
Current and Future Work
Formulate the best partition selection out of the available partitions for activity redistribution.
Apply the activity concentration and redistribution mechanism to other hot units, for example the L1 cache.
Apply proactive NBTI recovery to the idle partitions to improve lifetime reliability.
Trade off NBTI recovery and power gating to simultaneously reduce power and improve lifetime reliability.
Tackle the temperature barrier in 3D-stacked processor design using similar activity concentration and redistribution.
Multiple Sleep Modes Leakage Control for Cache Peripherals
Houman Homayoun, Avesta Sasan, Alexander V. Veidenbaum
On-chip Caches and Power
On-chip caches in high-performance processors are large: more than 60% of the chip budget. They dissipate a significant portion of power via leakage. Much of that leakage was in the SRAM cells, and many architectural techniques have been proposed to remedy this. Today, there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because cell design has been optimized.
[Pentium M processor die photo, courtesy of intel.com]
Peripherals?
Data input/output drivers, address input/output drivers, row pre-decoder, wordline drivers, row decoder.
[Figure: SRAM organization, showing the address input global drivers, predecoder and global wordline drivers, row decoder, global and local wordlines, bitlines, sense amps, and global output drivers]
[Figure: leakage power (pW, log scale) of a memory cell vs. inverters from INVX to INV32X; peripheral inverters leak roughly 200X to 6300X more than a memory cell]
Why peripherals leak more than cells:
- Minimum-sized transistors are used in cells for area considerations, while larger, faster, and accordingly leakier transistors are used in peripherals to satisfy timing requirements.
- High-Vt transistors are used in cells, compared with typical-threshold-voltage transistors in peripherals.
Power Components of L2 Cache
SRAM peripheral circuits dissipate more than 90% of the total leakage power:
- global address input drivers: 11%
- global data input drivers: 14%
- global row predecoder: 1%
- local row decoders: 33%
- local data output drivers: 8%
- global data output drivers: 25%
- others: 8%
[Figure: leakage vs. dynamic power breakdown of the L2 cache across SPEC2K benchmarks (ammp through wupwise, plus average)]
L2 cache leakage power dominates its dynamic power: above 87% of the total.
Techniques to Address Leakage in the SRAM Cell
Circuit techniques (target the SRAM memory cell): Gated-Vdd, Gated-Vss; voltage scaling (DVFS); ABB-MTCMOS; forward body biasing (FBB) and RBB; sleepy stack; sleepy keeper.
Architectural techniques:
- Way prediction, way caching, phased access: predict or cache recently accessed ways, or read the tag first.
- Drowsy cache: keeps cache lines in a low-power state, with data retention.
- Cache decay: evicts lines not used for a while, then powers them down.
- Applying DVS, gated-Vdd, or gated-Vss to the memory cell, with much architectural support to do so.
Sleep Transistor Stacking Effect
Subthreshold current is an inverse exponential function of threshold voltage. Stacking transistor N with sleep transistor slpN increases the source-to-body voltage (V_M) of transistor N, which reduces its subthreshold leakage current when both transistors are off:

V_T = V_T0 + γ(√(2Φ_F + V_SB) − √(2Φ_F))

[Figure: inverter whose NMOS transistor N is stacked on footer sleep transistor slpN; V_M is the virtual ground node between them, with load C_L on the output]
Drawbacks: increased rise time, fall time, wakeup delay, area, and dynamic power; instability.
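The stacking benefit can be made explicit by combining the standard subthreshold-current model with the body-effect threshold equation (textbook device-physics relations, consistent with the slide, not taken from it):

```latex
I_{sub} \propto e^{-V_T/(n\,kT/q)}, \qquad
V_T = V_{T0} + \gamma\left(\sqrt{2\Phi_F + V_{SB}} - \sqrt{2\Phi_F}\right)
```

With both transistors off, the virtual node settles at V_M > 0, so transistor N sees V_SB = V_M > 0, raising V_T and shrinking I_sub exponentially; in addition, V_GS of N becomes −V_M < 0, further reducing its leakage.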
Impact on Rise Time and Fall Time
The rise time and fall time of the output of an inverter are proportional to R_peq·C_L and R_neq·C_L, respectively. Inserting the sleep transistors increases both R_neq and R_peq:
- increase in rise time
- increase in fall time
- impact on performance
- impact on memory functionality
[Figure: two-stage inverter chain (P1/N1, P2/N2) with header sleep transistors slpP1/slpP2 and footer sleep transistors slpN1/slpN2; leakage current paths shown]
A Zig-Zag Circuit
R_peq for the first and third inverters and R_neq for the second and fourth inverters do not change, so the fall time of the circuit does not change.
To improve the leakage reduction and area-efficiency of the zig-zag scheme, one set of sleep transistors can be shared between multiple stages of inverters:
- Zig-Zag Horizontal Sharing
- Zig-Zag Horizontal and Vertical Sharing
Zig-Zag Horizontal and Vertical Sharing
[Figure: wordline driver lines K and K+1, each a chain of four inverter stages (P11/N11 through P24/N24), sharing one footer sleep transistor slpN and one header sleep transistor slpP driven by the sleep signal; V_M is the shared virtual rail]
One set of sleep transistors is shared between multiple stages of inverters:
- Zig-zag horizontal sharing: minimizes the impact on rise time and minimizes the area overhead.
- Zig-zag horizontal and vertical sharing: maximizes the leakage power saving and minimizes the area overhead.
ZZ-HVS Evaluation: Power Results
Increasing the number of wordline rows that share sleep transistors increases the leakage reduction and reduces the area overhead. Leakage power reduction varies from 10X to 100X as 1 to 10 wordlines share the same sleep transistors: 2~10X more leakage reduction compared to the zig-zag scheme.
[Figure (a): leakage power (nW, log scale) vs. number of wordline rows (1 to 10) for the baseline, redundant, zigzag, zz-hs, and zz-hvs schemes]
Wakeup Latency
To benefit the most from the leakage savings of stacked sleep transistors, keep the gate bias voltage of the NMOS sleep transistor as low as possible (and of the PMOS as high as possible). Drawback: this impacts the wakeup latency of the wordline drivers.
Solution: control the gate voltage of the sleep transistors. Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (V_M), which reduces the circuit wakeup delay overhead but also reduces the leakage power savings.
Wakeup Delay vs. Leakage Power Reduction
[Figure: normalized leakage power and normalized wake-up delay as a function of the (footer, header) gate bias voltage pair]
There is a trade-off between the wakeup overhead and the leakage power saving: increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead.
Multiple Sleep Modes

power mode | wakeup delay (cycles) | leakage reduction (%)
basic-lp | 1 | 42%
lp | 2 | 75%
aggr-lp | 3 | 81%
ultra-lp | 4 | 90%

The power overhead of waking up the peripheral circuits is almost equivalent to the switching power of the sleep transistors. Sharing one set of sleep transistors horizontally and vertically across multiple stages of a (wordline) driver makes this power overhead even smaller.
Reducing Leakage in L1 Data Cache
To maximize the leakage reduction in the DL1 cache, put the DL1 peripherals into ultra low power mode: this adds 4 cycles to the DL1 latency and significantly reduces performance.
To minimize performance degradation, put the DL1 peripherals into the basic low power mode: it requires only one cycle to wake up, and this latency can be hidden during the address computation stage, thus not degrading performance. But the leakage power reduction is not noticeable.
Motivation for Dynamically Controlling the Sleep Mode
- Large leakage reduction benefit: ultra and aggressive low power modes.
- Low performance impact benefit: basic-lp mode.
- Periods of frequent access: basic-lp mode.
- Periods of infrequent access: ultra and aggressive low power modes.
Hence: dynamically adjust the peripheral circuits' sleep power mode.
Reducing DL1 Wakeup Delay
Whether an instruction is a load or a store can be determined at least one cycle prior to cache access, and accessing the DL1 while its peripherals are in basic-lp mode doesn't require an extra cycle. Therefore:
- Wake up the DL1 peripherals one cycle prior to access: one cycle of wakeup delay can be hidden for all other low-power modes, reducing the wakeup delay by one cycle.
- Put the DL1 in basic-lp mode by default.
Architectural Motivation
A load miss in the L1/L2 caches takes a long time to service and prevents dependent instructions from being issued. When dependent instructions cannot issue, performance is lost; at the same time, energy is lost as well! This is an opportunity to save energy.
Low-end Architecture
Given the miss service time of 30 cycles, it is likely that the processor stalls during the miss service period. The occurrence of additional cache misses while one DL1 cache miss is already pending further increases the chance of a pipeline stall.
[State diagram: basic-lp → lp on a DL1 miss; lp → aggr-lp on a further DL1 miss (DL1 miss++) while one is pending; aggr-lp → ultra-lp while DL1 misses remain pending; return to basic-lp once the DL1 miss is serviced and the processor continues]
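This low-end policy can be sketched as a small state machine (the state names come from the slides; the exact transition conditions are my reading of the diagram and should be treated as assumptions):

```python
# Sleep modes of the DL1 peripherals, shallowest to deepest.
MODES = ["basic-lp", "lp", "aggr-lp", "ultra-lp"]

def next_mode(mode, event, pending_misses):
    """One step of the DL1 sleep-mode controller.

    Deepen the sleep mode (saturating at ultra-lp) on each DL1 miss
    event while misses are pending; return to basic-lp only once the
    last pending DL1 miss has been serviced.
    """
    if event == "dl1_miss_serviced" and pending_misses == 0:
        return "basic-lp"          # processor continues: default mode
    if event == "dl1_miss":
        i = MODES.index(mode)
        return MODES[min(i + 1, len(MODES) - 1)]  # deepen, saturating
    return mode                    # otherwise hold the current mode

mode = "basic-lp"
mode = next_mode(mode, "dl1_miss", pending_misses=1)         # -> lp
mode = next_mode(mode, "dl1_miss", pending_misses=2)         # -> aggr-lp
mode = next_mode(mode, "dl1_miss_serviced", pending_misses=1)  # hold
mode = next_mode(mode, "dl1_miss_serviced", pending_misses=0)  # -> basic-lp
```

Because basic-lp is the default and its one-cycle wakeup can be hidden (previous slide), this controller only pays a visible wakeup penalty when returning from the deeper modes.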
Low Power Modes in a 2KB DL1 Cache
Fraction of the total execution time the DL1 cache spends in each power mode: the DL1 peripherals are put into low power modes 85% of the time, and most of that time is spent in the basic-lp mode (58% of total execution time).
[Figure: per-benchmark (MiBench) breakdown of execution time across the hp, trivial-lp, lp, aggr-lp, and ultra-lp modes]
Low Power Modes in Low-End Architecture
[Figure (a): frequency of the different low power modes (hp, basic-lp, lp, aggr-lp, ultra-lp) for 2KB, 4KB, 8KB, and 16KB DL1 caches. Figure (b): % performance degradation per MiBench benchmark for the same cache sizes]
Increasing the cache size reduces the DL1 cache miss rate. This reduces the opportunities to put the cache into the more aggressive low power modes, and reduces the performance degradation for larger DL1 caches.
High-end Architecture
[State diagram: basic-lp → lp on a DL1 miss; lp → ultra-lp on an L2 miss; back to basic-lp once the misses are serviced]
The DL1 transitions to ultra-lp mode right after an L2 miss occurs: given the long L2 cache miss service time (80 cycles), the processor will stall waiting for memory. The DL1 returns to the basic-lp mode once the L2 miss is serviced.
Leakage Power Reduction
DL1 leakage is reduced by 50%. While the ultra-lp mode occurs much less frequently than the basic-lp mode, its leakage reduction is comparable to the basic-lp mode's: in ultra-lp mode the peripheral leakage is reduced by 90%, almost twice that of basic-lp mode.
[Figure: leakage reduction per benchmark, broken down by mode (trivial-lp, lp, aggr-lp, ultra-lp), for MiBench (top) and SPEC2K (bottom)]
The average leakage reduction is almost 50%.
Conclusion
Highlighted the large leakage power dissipation in SRAM peripheral circuits.
Proposed zig-zag share to reduce leakage in SRAM peripheral circuits.
Extended zig-zag share with multiple sleep modes, which trade off leakage power reduction vs. wakeup delay overhead.
Applied the multiple sleep modes technique in the L1 cache of an embedded processor.
Presented the resulting leakage power reduction.