Loop-Directed Mothballing: Power Gating Execution Units Using Runtime Loop Analysis



Power gating reduces static power by either disabling whole units or dynamically resizing units to meet application demands. The Loop-Directed Mothballing technique lets users power gate execution units by recording utilization of individual units in loops, and by power gating units according to two utilization thresholds. LDM offers on average 10.3 percent total power savings with low performance loss.

Power dissipation will be a key issue for microprocessor technology scaling in the near future.1 Increasing clock frequency and transistor count drive power consumption higher, but related problems of heat dissipation, energy costs, and battery life could make such future technologies less practical. Many existing approaches address this problem by power gating parts of a processor to reduce static power dissipation, which is present in all logic that’s powered, even if it exhibits no dynamic switching activity.2-11 This article, which expands on an extended abstract presented at Cool Chips XIV, adds to the arsenal of power-saving techniques by presenting a simple utilization-based approach to power gating execution units that also attempts to maintain a low impact on performance.

Static power dissipation refers to energy leaked through transistors, regardless of any switching they experience, whereas dynamic power dissipation is the energy used to switch transistor states. Although dynamic power dissipation remains dominant in overall processor power consumption, it is static power that’s predicted to limit future microprocessors as it scales more quickly than dynamic power.1 Policies to power-manage idle system devices such as disk drives already exist.10 However, as the processor has become more power hungry, attention has shifted to power consumption during processor activity, whereby processor resources are resized or disabled to match the computational requirements of executing applications. Cache resizing,9 register file resizing,7 issue width scaling,6 switching of out-of-order policies,4 and execution-unit power gating5,8,11 achieve this goal. A comprehensive explanation of power dissipation is available elsewhere.3 For more information on related projects, see the “Related Work in Power Gating of Execution Units” sidebar.

This article focuses on execution-unit static power dissipation, because the units are among the most power-hungry devices in microprocessors,5,6 and power gating of execution units is a nontrivial problem. Heterogeneity of units requires more analysis to match application requirements to resources, and a poor match could result in a costly performance loss owing to contention for units or stalls while required units are powered up.

Craig A. Court
Paul H.J. Kelly
Imperial College London

IEEE Micro, November/December 2011
0272-1732/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society

We offer Loop-Directed Mothballing (LDM), a method to analyze innermost loop bodies at runtime, which lets us make resource requirement predictions and apply them by power gating execution units. We present results showing that LDM reduces the energy-delay product (EDP) by 10.3 percent on average, whereas previous methods show either EDP degradation or no difference in EDP. We also demonstrate the importance of limiting performance degradation to improve EDP.

Related Work in Power Gating of Execution Units

Several projects have addressed issues involved in power gating of execution units. Ikebuchi et al. fabricated a fine-grained power-gated MIPS R3000 CPU called Geyser-1 to switch off unused execution units.1 Units are power gated off after completing computation, and the unit required for the next instruction is power gated on during the instruction fetch stage. The break-even time (BET) for power gating a unit is known, and software directives keep a unit on if it will be used again before the BET. Geyser-1 operates at 60 MHz with a 10-ns wake-up delay (less than a clock cycle). As a result, instructions in the fetch stage can trigger power-up of units that will be ready without delay.

Assuming a similar power-gating delay (10 ns) at the 1200-MHz operating frequency of the DEC Alpha, this delay is 12 cycles (in keeping with work by Agarwal et al.2). Because of Geyser-1’s short pipeline, instructions could not be detected early enough to hide these increased power-up delays, making the design infeasible for such a clock frequency. Loop-Directed Mothballing (LDM) can tolerate increased delay because predictions anticipate required resources before instructions enter the pipeline.

Maro, Bai, and Bahar apply power gating to execution units in the DEC Alpha architecture by power gating entire clusters of units.3 They partition units into one floating-point and two identical integer clusters, each with dedicated issue and register files. They propose three techniques to monitor instruction-level parallelism (ILP) and to power gate an integer cluster:

• Functional-unit usage is monitored by recording the number of units used per cluster. Shift registers record each cluster’s utilization history, power gating off a consistently underused cluster and switching the cluster back on if the remaining cluster is overly used.
• Instructions committed per clock are monitored over 512-cycle intervals to measure the trend in ILP. Threshold values indicate whether a cluster should be power gated off or on.
• Input dependencies are counted in the instruction window, and a cluster is power gated off when there are many dependencies. A cluster is power gated back on, as in the functional-unit usage technique.

For all integer techniques, half of the floating-point cluster is power gated off when an integer cluster is turned off, and the entire floating-point cluster is power gated off if no floating-point instruction is fetched for three cycles. The floating-point cluster is switched on if new floating-point instructions are fetched.

Whereas LDM has fine-grained control over execution units, clustering limits the ability to match available units to execution requirements. Another restriction is that control flow isn’t considered, resulting in observed usage that might not represent the characteristics of the intervals where power gating is actually applied. LDM overcomes this by making predictions only for loops. As long as execution remains inside a loop, observed utilization over the iteration should represent the characteristics of the rest of the loop.

Rosner et al. overcome this same restriction by building traces of frequently executed instructions and storing decoded micro-operations in a trace cache.4 They then optimize the most frequent traces dynamically at runtime to improve performance. Energy is consequently reduced, although they don’t implement power gating. This method could be used to power gate units by making predictions for entire traces (including loops) after online analysis. However, as static power becomes a more dominant contribution to power dissipation, this highly complex design’s power overhead might reduce the technique’s efficacy.

References

1. D. Ikebuchi et al., “Geyser-1: A MIPS R3000 CPU Core with Fine Grain Runtime Power Gating,” Proc. IEEE Int’l Solid-State Circuits Conf. (ISSCC 09), IEEE Press, 2009, pp. 281-284.
2. K. Agarwal et al., “Power Gating with Multiple Sleep Modes,” Proc. 7th Int’l Symp. Quality Electronic Design (ISQED 06), IEEE Press, 2006, pp. 633-637.
3. R. Maro, Y. Bai, and R. Bahar, “Dynamically Reconfiguring Processor Resources to Reduce Power Consumption in High-Performance Processors,” Proc. 1st Int’l Workshop Power-Aware Computer Systems, Springer, 2001, pp. 97-111.
4. R. Rosner et al., “Power Awareness through Selective Dynamically Optimized Traces,” Proc. 31st Ann. Int’l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 162-173.

Loop-Directed Mothballing

LDM attempts to increase the accuracy of predicting execution-unit requirements so that units can be effectively power gated, reducing performance degradation and increasing power savings. At runtime, LDM detects and analyzes the innermost loops to determine unit requirements, and it power gates the superfluous units for the remaining loop iterations. LDM targets the innermost loops because successive iterations will likely have similar, if not identical, execution characteristics. Moreover, fast characterization (over a single iteration) permits power savings sooner than techniques requiring large sampling intervals.

Scope for power savings

The potential for power savings is limited by the innermost loop coverage of program code, so 16 traces of 1 million instructions from SPEC CPU2006 benchmarks were simulated and the coverage was measured. Figure 1 shows more than 75 percent coverage for 10 benchmarks, which indicates high potential for power savings. Figure 2 shows execution-unit utilization for the most-visited loop in the samples. No loops use all execution units, and most can potentially move computation to different units without performance loss.

Figure 1. Coverage of executed instructions that are in the innermost loops. We measured more than 75 percent coverage for 10 benchmarks, indicating high potential for power savings. (Bar chart; y-axis: coverage (%), 0 to 100; x-axis: benchmarks 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 433.milc, 435.gromacs, 436.cactusADM, 437.leslie3d, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 482.sphinx3, and 998.specrand.)

Figure 2. Execution-unit utilization during the most-executed innermost loop. Most loops can potentially move computation to other units without compromising performance. (ALU: arithmetic logic unit; DIV: division unit; FP: floating-point unit; MEM: memory unit; MUL: multiplication unit.) (Bar chart; y-axis: utilization (%), 0 to 100; x-axis: the same 16 benchmarks; one bar per unit: four ALU, two MUL/DIV, two MEM, FP, and FP MUL/DIV.)

Detecting loops and loop exits

Each time a backward branch is visited, loop detection logic stores its address. Before it is stored, the new branch address is compared to the existing address, and if they match, we assume the branch address is from an innermost loop. Between innermost loop branches, loop detection logic records the outcome of intermediate forward branches in a bitstream (the forward branch pattern), with 1 representing a branch taken and 0 otherwise. If subsequent branches follow this pattern, we are still in the loop. Otherwise, we assume the loop has exited to prevent predictions for the innermost loop being applied outside the loop body.
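The detection rule above can be sketched in software. The following Python model is only an illustration, not the hardware logic used in the article: the class and method names are hypothetical, and it assumes that a backward branch that is not taken marks a loop exit, which the text does not state explicitly.

```python
class LoopDetector:
    """Illustrative software model of the loop detection logic (hypothetical names)."""

    def __init__(self):
        self.last_backward = None  # address of the most recent backward branch
        self.pattern = []          # forward-branch outcomes of the last full iteration
        self.observed = []         # outcomes seen since the last backward branch
        self.in_loop = False

    def on_branch(self, address, target, taken):
        """Feed one executed branch; return True while an innermost loop is assumed."""
        if target < address:  # backward branch
            # Compare with the stored address before overwriting it.
            # Assumption: a not-taken backward branch means the loop has exited.
            if taken and address == self.last_backward:
                self.pattern = self.observed
                self.in_loop = True
            else:
                self.in_loop = False
            self.last_backward = address
            self.observed = []
        else:  # forward branch: record 1 (taken) or 0 (not taken)
            self.observed.append(1 if taken else 0)
            if self.in_loop:
                i = len(self.observed) - 1
                if i >= len(self.pattern) or self.pattern[i] != self.observed[i]:
                    self.in_loop = False  # pattern broken: assume loop exit
        return self.in_loop


# Loop branch at 0x40 jumping back to 0x10, with one forward branch at 0x20.
detector = LoopDetector()
trace = [
    (0x20, 0x30, False), (0x40, 0x10, True),   # first pass: address stored
    (0x20, 0x30, False), (0x40, 0x10, True),   # second pass: loop detected
    (0x20, 0x30, False), (0x40, 0x10, True),   # pattern still matches
    (0x20, 0x30, True),                        # forward branch deviates
]
states = [detector.on_branch(a, t, taken) for a, t, taken in trace]
```

In this trace the loop is recognized on the second visit to the backward branch, and the deviating forward branch at the end is treated as a loop exit.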

Predicting resource requirements

Figure 3 shows an example of innermost loop code and average unit utilization during loop execution. Each unit is associated with a counter that increments each cycle if the unit is used. When a loop branch instruction is first visited, all counters reset. The second time we see the branch, we have valid counts for one loop iteration, and a prediction is made using the following rules:

• If all units of a particular type are not used, they should all be switched off.
• Otherwise, one unit of this type must remain on, and two thresholds (switch_on_threshold and switch_off_threshold) are used to switch on or off the remaining units.

The prediction mechanism applies thresholds to the least-used unit of a type that’s currently on, switching off an underutilized unit and switching on another unit if the least-used unit is highly utilized. A highly utilized least-used unit could indicate that parallelism isn’t being exploited, and therefore performance is degraded, which could harm EDP.

In Figure 3, switch_on_threshold and switch_off_threshold are 70 percent and 5 percent, respectively. Floating-point units (FPUs) are not used, and are both switched off. At least one arithmetic logic unit (ALU), multiplication unit (MUL), and memory unit (MEM) must remain on, because the loop contains instructions for execution on these units. After one iteration, the least-used ALU, MUL, and MEM are used less than switch_off_threshold and can be turned off. After another iteration, of the units that are on, the least-used ALU is the only unit that the prediction mechanism can switch off. The prediction after two iterations is therefore 1100101000, where each bit describes the power status of one of the 10 units (four ALUs, two MULs, two MEMs, and two FPUs).

After each iteration, the prediction mechanism applies the current prediction by power gating units as required, and all utilization counters are reset. For subsequent iterations, the mechanism uses the thresholds to modify the prediction, and applies changes by power gating appropriate units.
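The prediction rules and the Figure 3 walk-through can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the article's implementation: the function names are hypothetical, ties between equally used units are broken toward the highest-numbered unit, and the per-unit utilization figures are invented to be consistent with a four-cycle iteration in which two ALUs are busy half the time and one MUL and one MEM are busy a quarter of the time.

```python
SWITCH_ON_THRESHOLD = 70.0   # percent, as in the Figure 3 example
SWITCH_OFF_THRESHOLD = 5.0   # percent, as in the Figure 3 example
UNIT_TYPES = ["ALU", "MUL", "MEM", "FPU"]


def update_type(on, utilization):
    """Return the new on/off list for the units of one type."""
    if all(u == 0 for u in utilization):
        return [False] * len(on)          # type unused: switch every unit off
    on = list(on)
    active = [i for i, is_on in enumerate(on) if is_on]
    if not active:
        return on                         # nothing powered, nothing to refine
    # Least-used unit that is currently on (ties go to the highest index).
    least = min(active, key=lambda i: (utilization[i], -i))
    if utilization[least] < SWITCH_OFF_THRESHOLD and len(active) > 1:
        on[least] = False                 # underutilized: gate it off
    elif utilization[least] > SWITCH_ON_THRESHOLD and not all(on):
        on[on.index(False)] = True        # even the least-used unit is busy
    return on


def refine(prediction, utilization):
    """Apply the rules to every unit type after one loop iteration."""
    return {t: update_type(prediction[t], utilization[t]) for t in UNIT_TYPES}


def as_bits(prediction):
    """Render a prediction as the 10-bit string used in the text."""
    return "".join("1" if on else "0" for t in UNIT_TYPES for on in prediction[t])


# Hypothetical per-unit utilization (percent of a four-cycle iteration).
utilization = {"ALU": [50, 50, 0, 0], "MUL": [25, 0], "MEM": [25, 0], "FPU": [0, 0]}
prediction = {"ALU": [True] * 4, "MUL": [True] * 2, "MEM": [True] * 2, "FPU": [True] * 2}

after_one = refine(prediction, utilization)
after_two = refine(after_one, utilization)
```

With these inputs, `as_bits(after_two)` gives 1100101000, the prediction quoted in the text, and a further call to `refine` leaves it unchanged because no remaining unit crosses either threshold.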

Figure 4 shows EDP savings for different combinations of thresholds, averaged for 16 SPEC CPU2006 benchmarks. Although it shows that 70 percent and 5 percent produce the best EDP savings for these benchmarks, there are also high savings near these thresholds. This demonstrates that the thresholds don’t need to be highly tuned to achieve good EDP savings.

Figure 3. Example loop code and average execution-unit utilization for one iteration. Brackets show the cycle each instruction is executed. Dashed lines represent power-gating thresholds.

    LD  [r6+r2] r1   (1)
    ADD r1 4 r1      (2)
    MUL r1 6 r1      (3)
    ADD r1 r3 r3     (4)
    ADD r2 4 r2      (2)
    BLE r2 40 −6     (4)

(Bar chart; y-axis: utilization (%), 0 to 100, with dashed threshold lines at 70% and 5%; x-axis: ALU, MUL, MEM, FPU.)

Nonloop code

When executing noninnermost loop code, all execution units are power gated on and the prediction is stored for future use in a table. This is normally triggered by a loop exit, and some of the delay of power gating can be hidden while the branch misprediction (to continue looping) is resolved. Returning all execution units to a fully operational state highlights LDM’s contribution during loop code, and nonloop code with poor predictions could be detrimental to performance, even negating some of the savings from LDM.

One complication of the technique is conditional branches within the loop from if-then-else or case statements. This can cause false detection of loop exit, resulting in execution units being restored to the on state even though the loop is still executing. As long as the change in control flow persists for several iterations, a new prediction will be made for this particular control flow, which will be beneficial if resource requirements for different control flows are significantly different. In Figure 5, two control flows are effectively seen as different loops, each with their own pattern of forward branches and tailored predictions.

Energy-delay product

We chose EDP as the measurement of fitness for power-saving techniques.12 It combines both requirements of general-purpose processors: energy and performance. Total energy for computation is preferred because power is only the energy-dissipation rate and doesn’t account for a technique’s increased computation time. Total energy is also more closely related to battery life for mobile applications. Performance is included in the measurement because power savings often come at the expense of performance, but this is undesirable for many systems.
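To make the trade-off concrete, the short Python sketch below uses the standard definition of EDP as energy multiplied by execution time, with energy taken as average power multiplied by execution time. The function name and the example ratios are hypothetical and are not results from the article.

```python
def normalized_edp(power_ratio, time_ratio):
    """EDP relative to a baseline, given average-power and execution-time ratios."""
    energy_ratio = power_ratio * time_ratio   # energy = average power x time
    return energy_ratio * time_ratio          # EDP = energy x time


# Hypothetical technique A: 10 percent less power, 20 percent longer runtime.
edp_a = normalized_edp(0.90, 1.20)
# Hypothetical technique B: 5 percent less power, no slowdown.
edp_b = normalized_edp(0.95, 1.00)
```

Because execution time enters the product twice, technique A ends up with a worse EDP than the baseline (about 1.30) despite saving more power, whereas technique B improves EDP to 0.95.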

Methodology

Here, we describe the framework used to simulate LDM and measure energy savings.

Simulation and power estimation

We conduct performance simulations using SimpleScalar/Alpha,13 which we modified to facilitate execution-unit power gating. To simulate power gating, we added an attribute to the resource objects to indicate if each is on. When a resource is requested during instruction issue, only those that are on can be considered.

Figure 4. Percentage decrease in energy-delay product (EDP) averaged over all benchmarks for combinations of thresholds. We see high savings across a range of thresholds, showing that they don’t need to be highly tuned to achieve good EDP savings. (Surface plot; axes: switch on (%), switch off (%), and average decrease in EDP (%).)

Figure 5. Example control flow graph, showing conditional branches inside a loop. Unit utilization is shown for two paths. (Control flow graph with nodes A, B, C, D, and E; two bar charts of utilization (%) for ALU, MUL, MEM, and FPU, one for path A, C, D, E and one for path A, B, E.)

Finite-state machines with states pending_on, on, pending_off, and off control each execution unit’s power state. Units to be power gated off enter the pending_off state, which prevents further issue to the unit. When in-flight instructions complete, the off state is entered, and the unit can be switched off. To simulate the delay of returning a unit to the operational state, units enter the pending_on state for a period before the units become available for issue (on). The power-gating delay depends on many transistor-level design parameters, including maximum supply current and current starvation affecting neighboring units, which can be adjusted at a low level and managed through staged power-up to adjust delay. Detailed analysis to select the most appropriate delay is outside the scope of this work, so we selected a conservative wake-up delay of eight cycles, in line with previous research.2,3 We use this same delay for all execution units in the simulator.
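The four power states can be modeled as a small state machine. The Python sketch below is an illustration, not the SimpleScalar modification itself: the class and method names are hypothetical, and it assumes that a pending switch-off can be cancelled by a new switch-on request, which the text does not specify.

```python
from enum import Enum


class PowerState(Enum):
    PENDING_ON = "pending_on"
    ON = "on"
    PENDING_OFF = "pending_off"
    OFF = "off"


class ExecutionUnit:
    """Illustrative model of one power-gated execution unit (hypothetical names)."""

    def __init__(self, wake_up_delay=8):
        self.state = PowerState.ON
        self.wake_up_delay = wake_up_delay  # cycles spent in pending_on
        self.cycles_until_on = 0
        self.in_flight = 0                  # instructions still executing

    def can_issue(self):
        return self.state is PowerState.ON

    def issue(self):
        if not self.can_issue():
            raise RuntimeError("unit is not available for issue")
        self.in_flight += 1

    def complete(self):
        self.in_flight -= 1

    def request_off(self):
        if self.state is PowerState.ON:
            self.state = PowerState.PENDING_OFF   # blocks further issue

    def request_on(self):
        if self.state is PowerState.OFF:
            self.state = PowerState.PENDING_ON
            self.cycles_until_on = self.wake_up_delay
        elif self.state is PowerState.PENDING_OFF:
            self.state = PowerState.ON            # assumption: switch-off can be cancelled

    def tick(self):
        """Advance the state machine by one clock cycle."""
        if self.state is PowerState.PENDING_OFF and self.in_flight == 0:
            self.state = PowerState.OFF           # in-flight work has drained
        elif self.state is PowerState.PENDING_ON:
            self.cycles_until_on -= 1
            if self.cycles_until_on <= 0:
                self.state = PowerState.ON
```

A unit that is asked to switch off while an instruction is in flight stays in pending_off until `complete()` has been called, and a unit that is switched back on becomes available for issue eight calls to `tick()` later with the default delay.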

An increased power-gating delay would affect only the units being power gated on at loop exit and during prediction refinement, when units would remain unavailable longer. Other units that are already on would remain available for issue, however, so the pipeline wouldn’t stall. Also, if a loop exits because of a branch misprediction and causes a pipeline flush, a longer delay would have no effect if the delay is shorter than the time to fetch new instructions.

We obtained power dissipation data using McPAT,14 which uses SimpleScalar statistics with data from the 2010 International Technology Roadmap for Semiconductors (ITRS) to calculate static and dynamic power dissipation.1 We obtained all data using 45-nm technology data at 1,200 MHz. The specification we used for both simulation and power estimation is based on the DEC Alpha.15 We power gate the ALUs, MULs, and FPUs by weighting total static power according to the proportion of execution time that they’re on. The MEM units perform address calculation and communicate with the load/store queue, but are combined with the memory subsystem in McPAT, so we couldn’t estimate the effects of power gating these units. To account for additional power consumption during execution-unit power-up, we assume a linear increase of static and dynamic power over an eight-cycle delay.
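The weighting of static power by on-time can be written as a one-line calculation. The Python sketch below is a rough illustration with a hypothetical function name and invented example numbers. It assumes that the linear ramp during power-up starts at zero and ends at full power, so that a waking cycle costs on average half a fully-on cycle; the article does not give the ramp's end points.

```python
def average_static_power(static_power_w, cycles_on, cycles_waking, total_cycles):
    """Static power of one gated unit, weighted by the time it is powered."""
    # Assumption: power ramps linearly from zero to full during wake-up,
    # so each waking cycle counts as half a fully-on cycle on average.
    effective_on = cycles_on + 0.5 * cycles_waking
    return static_power_w * effective_on / total_cycles


# Invented example: a 2 W unit that is on for 500 of 1,000 cycles and
# spends 16 cycles (two eight-cycle wake-ups) powering up.
gated = average_static_power(2.0, cycles_on=500, cycles_waking=16, total_cycles=1000)
```

Under these assumptions the unit's average static power drops from 2 W to roughly 1.02 W.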

LDM’s prediction table (discussed later) should dissipate the most power, so trade-offs that can be made to reduce its power consumption are discussed in the “LDM prediction table” section. Because we can reduce the size considerably with only a small impact on EDP savings, we don’t include the cost in power estimations. Many processors include counters for monitoring performance, and the utilization counters required for LDM would be similar. The counters must count the number of cycles required to execute an iteration of the loop, so 16-bit counters suffice and the power dissipated is negligible. To avoid the costly division in calculating a utilization percentage, the threshold number of cycles can be precalculated for different iteration lengths and stored in a look-up table.
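The look-up table idea can be illustrated as follows. This Python sketch is one possible arrangement, not the article's design: the function names are hypothetical, the thresholds are taken as whole percentages, and the table is indexed directly by iteration length in cycles.

```python
def build_threshold_tables(max_cycles, switch_on_pct=70, switch_off_pct=5):
    """Precompute cycle-count thresholds for every iteration length up to max_cycles."""
    # busy/n > pct/100  is equivalent to  busy > floor(pct * n / 100)
    on_table = [(switch_on_pct * n) // 100 for n in range(max_cycles + 1)]
    # busy/n < pct/100  is equivalent to  busy < ceil(pct * n / 100)
    off_table = [-((-switch_off_pct * n) // 100) for n in range(max_cycles + 1)]
    return on_table, off_table


def exceeds_switch_on(busy_cycles, iteration_cycles, on_table):
    """True if utilization is above switch_on_threshold, without dividing."""
    return busy_cycles > on_table[iteration_cycles]


def below_switch_off(busy_cycles, iteration_cycles, off_table):
    """True if utilization is below switch_off_threshold, without dividing."""
    return busy_cycles < off_table[iteration_cycles]


on_table, off_table = build_threshold_tables(max_cycles=64)
```

At run time the counter value is compared directly against the table entry for the measured iteration length, so no division or multiplication is needed on the critical path.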

Benchmarks

We used benchmarks from the SPEC CPU2006 suite. Only 16 programs could be cross compiled for SimpleScalar/Alpha and successfully executed on the simulator.

Because of time constraints and the need to repeat all benchmarks using different combinations of thresholds, we limited samples to 1 million instructions. Although the samples contained only a single loop for some benchmarks, we detected up to 148 loops for others. Each benchmark was fast-forwarded to warm up the caches, but to produce a range of loops, benchmarks weren’t fast-forwarded to specific kernels that dominate execution. Instead, benchmarks were fast-forwarded by 5 million cycles, and produced more than 280 different loops across the benchmarks, ranging in size and scope. For each benchmark, all test data sets were used, and the average over the data sets is given in the figures.

For comparison, we implemented two execution-unit power-gating techniques from previous work.8 Both techniques divide execution units into two equal clusters. For the first, a bit is shifted into a 16-bit register every cycle to record whether 50 percent of a cluster’s execution units aren’t being utilized. When a threshold number of bits is set, one cluster is power gated on or off. In the second technique, the authors monitor the number of instructions committed per cycle (IPC) over a 512-cycle interval and use threshold values to power gate clusters. We chose these two techniques because they showed the most power savings and least performance degradation, respectively.
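The shift-register history used by the first comparison technique can be sketched in a few lines of Python. This is a loose illustration only: the class name, the threshold of 12 set bits, and the reading of "50 percent of units aren't being utilized" as "at most half of the units were used this cycle" are assumptions, since the text gives neither the threshold value nor the exact comparison.

```python
class UsageHistory:
    """Illustrative 16-bit shift register of per-cycle underuse flags (hypothetical)."""

    WIDTH = 16

    def __init__(self, threshold):
        self.bits = 0
        self.threshold = threshold  # number of set bits that triggers power gating

    def record(self, units_used, units_in_cluster):
        """Shift in one cycle's flag; return True once enough recent cycles were underused."""
        underused = 2 * units_used <= units_in_cluster   # at least half the units idle
        self.bits = ((self.bits << 1) | int(underused)) & ((1 << self.WIDTH) - 1)
        return bin(self.bits).count("1") >= self.threshold


history = UsageHistory(threshold=12)
quiet = [history.record(units_used=1, units_in_cluster=5) for _ in range(12)]
busy = [history.record(units_used=5, units_in_cluster=5) for _ in range(5)]
```

Here the gating signal first fires on the twelfth underused cycle and is withdrawn once enough busy cycles have pushed old flags out of the 16-bit window.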

LDM prediction table

For LDM, a prediction table stores resource requirement predictions for future use when loops exit (see Table 1). Branch address and branch pattern (forward branch pattern) uniquely identify an execution-unit requirement prediction. At the end of the iteration, LDM uses prediction and unit-utilization counters to determine whether it should make changes to the prediction, and will power gate the respective units to the new mode.

We also maintain two additional registers, containing the current backward branch address and predicted control-flow pattern. The loop detection logic uses the backward branch address to detect the innermost loops, as we described earlier. The loop detection logic compares the observed forward branch pattern to the current loop’s forward branch pattern to detect loop exits (or changes in control flow).

Owing to out-of-order execution, some instructions move past loop branches. This can cause all units of a particular type to be power gated off, even though they’ll be needed in the next iteration. To catch this event, instructions requiring disabled units are detected during instruction issue, and required units are powered up. Any unit power gated back on, either because of the above event or because switch_on_threshold is exceeded, is flagged to be “guaranteed on” in the prediction table entry. This prevents it from being removed from the prediction again and avoids the delay of restarting the unit in the future.
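The effect of the guaranteed-on flag can be shown with simple bit operations. In the Python sketch below, the function names are hypothetical and the 10-bit layout (four ALUs, two MULs, two MEMs, two FPUs, leftmost bit first) follows the prediction string used earlier in the article.

```python
NUM_UNITS = 10  # four ALUs, two MULs, two MEMs, two FPUs (leftmost bit first)


def unit_mask(index):
    """Bit mask for the unit at position index, counted from the left."""
    return 1 << (NUM_UNITS - 1 - index)


def wake_unit(prediction, guaranteed_on, index):
    """Power a unit back on and flag it so later refinements keep it on."""
    mask = unit_mask(index)
    return prediction | mask, guaranteed_on | mask


def apply_refinement(candidate, guaranteed_on):
    """Merge a threshold-based candidate prediction with the guaranteed-on units."""
    return candidate | guaranteed_on


# The second MUL (index 5) was gated off but is needed during issue.
prediction, guaranteed = wake_unit(0b1110100000, 0b0000000000, index=5)
# A later refinement that would drop the second MUL again is overridden.
refined = apply_refinement(0b1110100000, guaranteed)
```

After the wake-up the prediction and guaranteed-on fields are 1110110000 and 0000010000, which matches the first row of the article's example table, and `refined` keeps the second MUL on.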

We obtained our results using a 64-entry prediction table, which is addressed using a hash of the branch address and branch pattern. To accommodate the complex control flow of the gcc benchmark, the branch pattern contains 128 bits, and the storage requirement for the table is 1,728 bytes (equivalent to 2.6 percent of the Level-1 cache). Because this would contribute significantly to power consumption, we iteratively halved the number of entries to determine the effects of a smaller table on EDP. A two-entry table (requiring only 54 bytes of storage) reduced EDP savings by less than 1 percent (of total EDP) for 15 benchmarks, and sjeng had the worst EDP increase of 2.3 percent, resulting in 0.2 percent EDP degradation on average over the 64-entry case. Although we use the 64-entry table in this article to demonstrate potential EDP savings, it’s clear that we can make a trade-off by reducing the table size. Additionally, without the control flow of gcc, the largest pattern we would require for these benchmarks is 20 bits, halving the table storage requirement.
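The storage figures quoted above can be checked with a few lines of arithmetic, and the hashed addressing can be illustrated alongside them. In the Python sketch below, only the byte counts, entry counts, and pattern widths come from the text. The XOR-and-modulo hash is a hypothetical stand-in, since the article does not say which hash function is used.

```python
TABLE_BYTES = 1728      # 64-entry table, from the text
ENTRIES = 64
PATTERN_BITS = 128      # branch pattern width needed for gcc
SMALL_PATTERN_BITS = 20 # largest pattern needed without gcc

entry_bits = TABLE_BYTES * 8 // ENTRIES                              # 216 bits, or 27 bytes
two_entry_bytes = 2 * entry_bits // 8                                # 54 bytes
small_entry_bits = entry_bits - PATTERN_BITS + SMALL_PATTERN_BITS    # 108 bits


def table_index(branch_address, branch_pattern, entries=ENTRIES):
    """Hypothetical hash: XOR the two fields and keep the low-order bits."""
    return (branch_address ^ branch_pattern) % entries


# Two control flows through the same loop branch, written as plain integers.
index_a = table_index(0x0FF4, 0x0011)
index_b = table_index(0x0FF4, 0x101)
```

Each entry therefore occupies 216 bits, a two-entry table needs the 54 bytes stated in the text, and shrinking the pattern from 128 to 20 bits leaves 108 bits per entry, exactly half. With this particular hash the two control flows land in different entries, although any hash can collide for other address and pattern combinations.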

Results

We compared LDM, two previous techniques described earlier, and the baseline case in which no power gating is used. For brevity, we will refer to these as LDM, Shift, IPC, and Base.

Figure 6 shows the EDP calculated over 1 million committed instructions, after the program is fast-forwarded for 5 million instructions. Results for LDM are clearly related to coverage of the innermost loop code shown in Figure 1, because the benchmarks with low innermost loop coverage have little opportunity for power savings. No benchmarks executed using LDM exceed Base’s EDP, however, suggesting that the technique might be robust for a larger set of applications. On average, LDM’s low penalty reduces EDP by 10.3 percent.

Table 1. Example of a Loop-Directed Mothballing prediction table.

Branch address    Branch pattern    Prediction      Guaranteed on
0x0FF4            0x0011            0x1110110000    0x0000010000
0x0FF4            0x101             0x1100101000    0x0000000000
0x5F20            0x10011           0x1000101100    0x0000001100
...               ...               ...             ...

IPC’s EDP varies between savings of 14.0 percent and degradation of 42.9 percent. Shift clearly shows the worst EDP measurement, ranging from savings of 5.9 percent to degradation of up to 78.7 percent. IPC on average is almost identical to Base, and Shift is on average 29.3 percent worse than Base. Both the IPC and Shift averages are affected detrimentally by poor EDP from benchmarks such as hmmer and sphinx3, but LDM avoids these losses.

Figure 7 shows IPC and Shift exhibiting power-saving capabilities superior to both Base and LDM, reducing power by a minimum of about 7 percent and 8 percent, respectively. Figure 8 demonstrates why, despite large and universal power savings, IPC and Shift have poor EDP. IPC’s performance loss (as a percentage of total execution time) is low for many benchmarks, but some benchmarks experience up to 26.7 percent degradation. Shift causes the worst delay, of up to 40.8 percent, and has more than 18 percent degradation for 11 benchmarks.

In sphinx3, we see a significant performance gain using LDM. Statistics provided by SimpleScalar show that using LDM, we reduce load/store queue occupancy by about 20 percent and decrease load/store queue latency by 25 percent. Although memory units are not power gated for sphinx3, power gating other units will change instruction ordering in the schedule and could result in fewer speculatively executed memory operations.

mmi2011060029.3d 14/11/011 14:27 Page 36

Figure 6. Normalized EDP for three power-gating techniques (LDM, Shift, and IPC). Results are normalized against simulation without any power gating (Base). [Bar chart. x-axis: benchmarks 400.perlbench through 998.specrand, plus Average; y-axis: 0.0 to 2.0; series: Loop-Directed Mothballing, IPC method, Shift method.]

Figure 7. Power dissipation for three power-gating techniques (LDM, Shift, and IPC). Results are normalized against simulation without any power gating (Base). [Bar chart. x-axis: benchmarks 400.perlbench through 998.specrand, plus Average; y-axis: 0.6 to 1.0; series: Loop-Directed Mothballing, IPC method, Shift method.]


Avoiding performance loss is clearly important to improving EDP, and we achieve this in LDM using two features. First, power gating is performed on regular innermost loop structures, so we can make accurate predictions. Second, when we can’t make accurate predictions, either because conditional statements inside the loop weren’t predicted or because noninnermost loop code is executing, all units return to a full power state. Although this potentially wastes power, it won’t waste more power than Base, and performance shouldn’t degrade, either.
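These two features can be sketched as a table lookup with a safe fallback. The sketch below assumes a prediction table shaped like Table 1, reads each 10-digit field as one bit per execution unit (1 meaning the unit stays powered), and treats the branch pattern as a string of bits. These readings, the 10-unit count, and all names are assumptions made for illustration, not a description of the hardware.

```python
from typing import Dict, NamedTuple, Tuple

NUM_UNITS = 10                 # assumption: one mask bit per execution unit
ALL_ON = (1 << NUM_UNITS) - 1  # every unit powered, as in Base


class Entry(NamedTuple):
    prediction: int     # assumed meaning: units predicted to be needed
    guaranteed_on: int  # assumed meaning: units that are never gated


# Rows copied from Table 1; keys are (branch address, branch pattern bits).
TABLE: Dict[Tuple[int, str], Entry] = {
    (0x0FF4, "0011"): Entry(0b1110110000, 0b0000010000),
    (0x0FF4, "101"): Entry(0b1100101000, 0b0000000000),
    (0x5F20, "10011"): Entry(0b1000101100, 0b0000001100),
}


def powered_units(address: int, pattern: str, in_innermost_loop: bool) -> int:
    """Return a bit mask of the units to keep powered."""
    if not in_innermost_loop:
        return ALL_ON  # noninnermost loop code: full power
    entry = TABLE.get((address, pattern))
    if entry is None:
        return ALL_ON  # no prediction available: full power
    return entry.prediction | entry.guaranteed_on


print(format(powered_units(0x0FF4, "0011", True), "010b"))   # prints 1110110000
print(format(powered_units(0x0FF4, "0011", False), "010b"))  # prints 1111111111
```

Every fallback path returns `ALL_ON`, so an unpredicted case in this sketch can give up a power saving but never leaves fewer units available than Base.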

Modern applications demand that processor performance continues to improve, but increasingly, power is becoming a barrier to performance gains. Mobile devices require longer battery lives, high-performance machines face challenges in cooling the processor, and high-volume data centers incur high energy costs. With the shift toward static power becoming the dominant contributor to overall processor power consumption, power gating provides an effective solution. LDM applies power gating to the execution units in the processor to save power, but also prevents excessive performance loss that is undesirable for the user. Our future work will extend LDM so that the technique can be applied to all loops in the executing code, in an attempt to increase the power savings. Beyond LDM, the challenge will be to power gate other units in the processor pipeline as effectively, to further increase the energy efficiency of processors, and to enable designs that can meet the demands of future applications.

Acknowledgments

An Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Accounts studentship funded this work. We thank Jeremy Cohen for his assistance in compiling the benchmarks on DEC Alpha machines, and Konstantinos Glaros for his advice on CMOS and power gating.

References

1. International Technology Roadmap for Semiconductors, 2010, http://www.itrs.net.

2. K. Agarwal et al., “Power Gating with Multiple Sleep Modes,” Proc. 7th Int’l Symp. Quality Electronic Design (ISQED 06), IEEE Press, 2006, pp. 633-637.

3. J.A. Butts and G.S. Sohi, “A Static Power Model for Architects,” Proc. 33rd Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE Press, 2002, pp. 191-201.

4. S. Ghiasi, J. Casmira, and D. Grunwald, “Using IPC Variation in Workloads with Externally Specified Rates to Reduce Power Consumption,” Proc. Workshop Complexity Effective Design, 2000.


Figure 8. Execution time for three power-gating techniques (LDM, Shift, and IPC). Results are normalized against simulation without any power gating (Base). [Bar chart. x-axis: benchmarks 400.perlbench through 998.specrand, plus Average; y-axis: 0.6 to 1.4; series: Loop-Directed Mothballing, IPC method, Shift method.]


5. D. Ikebuchi et al., “Geyser-1: A MIPS R3000 CPU Core with Fine Grain Runtime Power Gating,” Proc. IEEE Int’l Solid-State Circuits Conf. (ISSCC 09), IEEE Press, 2009, pp. 281-284.

6. A. Iyer and D. Marculescu, “Runtime Scaling of Microarchitecture Resources in a Processor for Energy Savings,” Kool Chips Workshop, IEEE Press, 2000, pp. 82-85.

7. T.M. Jones et al., “Compiler Directed Early Register Release,” Proc. 14th Int’l Conf. Parallel Architectures and Compilation Techniques, IEEE Press, 2005, pp. 110-122.

8. R. Maro, Y. Bai, and R. Bahar, “Dynamically Reconfiguring Processor Resources to Reduce Power Consumption in High-Performance Processors,” Proc. 1st Int’l Workshop Power-Aware Computer Systems, Springer, 2001, pp. 97-111.

9. M. Powell et al., “Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories,” Proc. Int’l Symp. Low Power Electronics and Design, ACM Press, 2000, pp. 90-95.

10. A.R. Rawson, “PowerPC Reference Platform: Architectural Aspects of Power Management,” white paper, IBM, Feb. 1995.

11. R. Rosner et al., “Power Awareness through Selective Dynamically Optimized Traces,” Proc. 31st Ann. Int’l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 162-173.

12. R. Gonzalez and M. Horowitz, “Energy Dissipation in General Purpose Microprocessors,” IEEE J. Solid-State Circuits, vol. 31, no. 9, 2002, pp. 1277-1284.

13. D. Burger and T.M. Austin, “The SimpleScalar Tool Set, Version 2.0,” ACM SIGARCH Computer Architecture News, vol. 25, no. 3, 1997, pp. 13-25.

14. S. Li et al., “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Many-Core Architectures,” Proc. 42nd Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE Press, 2009, pp. 469-480.

15. 21264/EV68CB and 21264/EV68DC Hardware Reference Manual, Compaq, Shrewsbury, Mass., 2001.

Craig A. Court is a PhD student in the Department of Computing at Imperial College London. His research interests include computer architecture, low-power system design, and custom computing using field-programmable gate array technology. Court has an MEng in computer science from the University of Bristol.

Paul H.J. Kelly is a professor of software technology in the Department of Computing at Imperial College London, where he heads the Software Performance Optimization Research Group. His research interests include compilers, architectures, domain-specific languages, and end-to-end application optimization, particularly in computational-science applications. Kelly has a PhD in computer science from the University of London.

Direct questions and comments about this article to Craig A. Court, Department of Computing, Imperial College London, London, United Kingdom, SW7 2AZ; [email protected].
