
Power Reduction Techniques For Microprocessor Systems

VASANTH VENKATACHALAM AND MICHAEL FRANZ

University of California, Irvine

Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the “power wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem.

Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design—Methodologies; I.m [Computing Methodologies]: Miscellaneous

General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance

Additional Key Words and Phrases: Energy dissipation, power reduction

1. INTRODUCTION

Computer scientists have always tried to improve the performance of computers. But although today’s computers are much faster and far more versatile than their predecessors, they also consume a lot of power; so much power, in fact, that their power densities and concomitant heat generation are rapidly approaching levels comparable to nuclear reactors (Figure 1). These high power densities impair chip reliability and life expectancy, increase cooling costs, and, for large

Parts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712 and by the Office of Naval Research under grant N00014-01-1-0854. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as necessarily representing the official views, policies or endorsements, either expressed or implied, of the National Science Foundation (NSF), the Office of Naval Research (ONR), or any other agency of the U.S. Government. The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems that partially supported this work.

Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: [email protected]; Michael Franz, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

©2005 ACM 0360-0300/05/0900-0195 $5.00

ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.


196 V. Venkatachalam and M. Franz

Fig. 1. Power densities rising. Figure adapted from Pollack 1999.

data centers, even raise environmental concerns.

At the other end of the performance spectrum, power issues also pose problems for smaller mobile devices with limited battery capacities. Although one could give these devices faster processors and larger memories, this would diminish their battery life even further.

Without cost effective solutions to the power problem, improvements in microprocessor technology will eventually reach a standstill. Power management is a multidisciplinary field that involves many aspects (i.e., energy, temperature, reliability), each of which is complex enough to merit a survey of its own. The focus of our survey will be on techniques that reduce the total power consumed by typical microprocessor systems.

We will follow the high-level taxonomy illustrated in Figure 2. First, we will define power and energy and explain the complex parameters that dynamic and static power depend on (Section 2). Next, we will introduce techniques that reduce power and energy (Section 3), starting with circuit (Section 3.1) and architectural techniques (Section 3.2, Section 3.3, and Section 3.4), and then moving on to two techniques that are widely applied in hardware and software, dynamic voltage scaling (Section 3.5) and resource hibernation (Section 3.6). Third, we will examine what compilers can do to manage power (Section 3.7). We will then discuss recent work in application level power management (Section 3.8), and recent efforts (Section 3.9) to develop a holistic solution to

the power problem. Finally, we will discuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technologies that are emerging (Section 3.11).

2. DEFINING POWER

Power and energy are commonly defined in terms of the work that a system performs. Energy is the total amount of work a system performs over a period of time, while power is the rate at which the system performs that work. In formal terms,

P = W/T (1)

E = P ∗ T, (2)

where P is power, E is energy, T is a specific time interval, and W is the total work performed in that interval. Energy is measured in joules, while power is measured in watts.

These concepts of work, power, and energy are used differently in different contexts. In the context of computers, work involves activities associated with running programs (e.g., addition, subtraction, memory operations), power is the rate at which the computer consumes electrical energy (or dissipates it in the form of heat) while performing these activities, and energy is the total electrical energy the computer consumes (or dissipates as heat) over time.

This distinction between power and energy is important because techniques that reduce power do not necessarily reduce energy. For example, the power consumed by a computer can be reduced by halving the clock frequency, but if the computer then takes twice as long to run the same programs, the total energy consumed will be similar. Whether one should reduce power or energy depends on the context. In mobile applications, reducing energy is often more important because it increases the battery lifetime. However, for other systems (e.g., servers), temperature is a larger issue. To keep the temperature within acceptable limits, one would need to reduce instantaneous power regardless of the impact on total energy.
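To make the distinction concrete, here is a toy Python sketch of Equations (1) and (2); the workload and timing numbers are invented for illustration, not measurements from any real system.

```python
def power(work, time):
    """P = W / T (Equation 1)."""
    return work / time

def energy(p, time):
    """E = P * T (Equation 2)."""
    return p * time

work = 100.0    # joules of computational work for a fixed program
t_full = 10.0   # seconds at full clock frequency
t_half = 20.0   # the same program takes twice as long at half frequency

p_full = power(work, t_full)     # 10 W
p_half = power(work, t_half)     # 5 W: halving frequency halves power...

e_full = energy(p_full, t_full)  # 100 J
e_half = energy(p_half, t_half)  # 100 J: ...but total energy is unchanged
```

This idealized model assumes power scales only with frequency (no voltage scaling), which is why the peak power and temperature drop while total energy stays the same.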


Fig. 2. Organization of this survey.

2.1. Dynamic Power Consumption

There are two forms of power consumption, dynamic power consumption and static power consumption. Dynamic power consumption arises from circuit activity such as the changes of inputs in an adder or values in a register. It has two sources, switched capacitance and short-circuit current.

Switched capacitance is the primary source of dynamic power consumption and arises from the charging and discharging of capacitors at the outputs of circuits.

Short-circuit current is a secondary source of dynamic power consumption and accounts for only 10-15% of the total power consumption. It arises because circuits are composed of transistors having opposite polarity, negative or NMOS and positive or PMOS. When these two types of transistors switch current, there is an instant when they are simultaneously on, creating a short circuit. We will not deal further with the power dissipation caused by this short circuit because it is a smaller percentage of total power, and researchers have not found a way to reduce it without sacrificing performance.

As the following equation shows, the more dominant component of dynamic power, switched capacitance (Pdynamic), depends on four parameters, namely supply voltage (V), clock frequency (f), physical capacitance (C) and an activity factor (a) that relates to how many 0 → 1 or 1 → 0 transitions occur in a chip:

Pdynamic ∼ aCV²f. (3)

Accordingly, there are four ways to reduce dynamic power consumption, though they each have different tradeoffs and not all of them reduce the total energy consumed. The first way is to reduce the physical capacitance or stored electrical charge of a circuit. The physical capacitance depends on low level design parameters such as transistor sizes and wire lengths. One can reduce the capacitance by reducing transistor sizes, but this worsens performance.

The second way to lower dynamic power is to reduce the switching activity. As computer chips get packed with increasingly complex functionalities, their switching activity increases [De and Borkar 1999], making it more important to develop techniques that fall into this category. One popular technique, clock gating, gates the clock signal from reaching idle functional units. Because the clock network accounts for a large fraction of a chip’s total energy consumption, this is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems


Fig. 3. ITRS trends for leakage power dissipation. Figure adapted from Meng et al., 2005.

including the Pentium 4, Pentium M, Intel XScale and Tensilica Xtensa, to mention but a few.

The third way to reduce dynamic power consumption is to reduce the clock frequency. But as we have just mentioned, this worsens performance and does not always reduce the total energy consumed. One would use this technique only if the target system does not support voltage scaling and if the goal is to reduce the peak or average power dissipation and indirectly reduce the chip’s temperature.

The fourth way to reduce dynamic power consumption is to reduce the supply voltage. Because reducing the supply voltage increases gate delays, it also requires reducing the clock frequency to allow the circuit to work properly.

The combination of scaling the supply voltage and clock frequency in tandem is called dynamic voltage scaling (DVS). This technique should ideally reduce dynamic power dissipation cubically because dynamic power is quadratic in voltage and linear in clock frequency. This is the most widely adopted technique. A growing number of processors, including the Pentium M, mobile Pentium 4, AMD’s Athlon, and Transmeta’s Crusoe and Efficeon processors allow software to adjust clock frequencies and voltage settings in tandem. However, DVS has limitations and cannot always be applied, and even when it can be applied, it is nontrivial to apply as we will see in Section 3.5.
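The cubic effect of DVS can be sketched directly from Equation (3); the constants below are arbitrary placeholders, and the sketch assumes the circuit still meets timing at the reduced voltage.

```python
def dynamic_power(a, C, V, f):
    """Switched-capacitance power, P ~ a*C*V^2*f (Equation 3)."""
    return a * C * V**2 * f

a, C = 0.5, 1e-9     # illustrative activity factor and capacitance (F)
V, f = 1.2, 1e9      # illustrative supply voltage (V) and frequency (Hz)

p_base = dynamic_power(a, C, V, f)
p_dvs = dynamic_power(a, C, V / 2, f / 2)  # scale V and f in tandem

ratio = p_base / p_dvs  # 2^2 * 2 = 8x: cubic in the scaling factor
```

Halving only the frequency would give a 2x power reduction (with little energy benefit, as noted in Section 2); halving voltage and frequency together gives 8x.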

2.2. Understanding Leakage Power Consumption

In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most recently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active or idle. Because its causes are different from those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power.

As the equation that follows illustrates, leakage power consumption is the product of the supply voltage (V) and leakage current (Ileak), or parasitic current, that flows through transistors even when the transistors are turned off.

Pleak = V Ileak. (4)

To understand how leakage current arises, one must understand how transistors work. A transistor regulates the flow of current between two terminals called the source and the drain. Between these two terminals is an insulator, called the channel, that resists current. As the voltage at a third terminal, the gate, is increased, electrical charge accumulates in the channel, reducing the channel’s resistance and creating a path along which electricity can flow. Once the gate voltage is high enough, the channel’s polarity changes, allowing the normal flow of current between the source and the drain. The threshold at which the gate’s voltage is high enough for the path to open is called the threshold voltage.

According to this model, a transistor is similar to a water dam. It is supposed to allow current to flow when the gate voltage exceeds the threshold voltage but should otherwise prevent current from flowing. However, transistors are imperfect. They leak current even when the gate voltage is below the threshold voltage. In fact, there are six different types of current that leak through a transistor. These


include reverse-biased-junction leakage, gate-induced-drain leakage, subthreshold leakage, gate-oxide leakage, gate-current leakage, and punch-through leakage. Of these six, subthreshold leakage and gate-oxide leakage dominate the total leakage current.

Gate-oxide leakage flows from the gate of a transistor into the substrate. This type of leakage current depends on the thickness of the oxide material that insulates the gate:

Iox = K2 W (V/Tox)² e^(−α·Tox/V). (5)

According to this equation, the gate-oxide leakage Iox increases exponentially as the thickness Tox of the gate’s oxide material decreases. This is a problem because future chip designs will require the thickness to be reduced along with other scaled parameters such as transistor length and supply voltage. One way of solving this problem would be to insulate the gate using a high-k dielectric material instead of the oxide materials that are currently used. This solution is likely to emerge over the next decade.

Subthreshold leakage current flows between the drain and source of a transistor. It is the dominant source of leakage and depends on a number of parameters that are related through the following equation:

Isub = K1 W e^(−Vth/(n·T)) (1 − e^(−V/T)). (6)

In this equation, W is the gate width and K1 and n are constants. The important parameters are the supply voltage V, the threshold voltage Vth, and the temperature T. The subthreshold leakage current Isub increases exponentially as the threshold voltage Vth decreases. This again raises a problem for future chip designs, because as technology scales, threshold voltages will have to scale along with supply voltages.

The increase in subthreshold leakage current causes another problem. When the leakage current increases, the temperature increases. But as Equation (6) shows, this increases leakage further, causing yet higher temperatures. This vicious cycle is known as thermal runaway. It is the chip designer’s worst nightmare.
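The feedback loop behind thermal runaway can be illustrated with a toy fixed-point iteration; the constants and the linear leakage-to-temperature model below are all invented for illustration and are not calibrated to any real chip.

```python
import math

THERMAL_VOLTAGE = 8.6e-5  # volts per kelvin, used to scale T in the exponent

def leakage(temp, vth=0.3, n=1.5):
    """Relative subthreshold leakage: grows as temperature weakens the
    e^(-Vth/(n*T)) barrier of Equation (6)."""
    return math.exp(-vth / (n * THERMAL_VOLTAGE * temp))

def settle(ambient=300.0, theta=500.0, steps=50):
    """Iterate temperature = ambient + theta * leakage(temperature),
    where theta converts leakage power into a temperature rise."""
    t = ambient
    for _ in range(steps):
        t = ambient + theta * leakage(t)
    return t

t_mild = settle(theta=500.0)   # weak feedback: settles just above ambient
t_runaway = settle(theta=1e6)  # strong feedback: heat and leakage amplify each other
```

With a small thermal coefficient the loop converges just above ambient; past a critical coefficient, each pass heats the chip and raises leakage further, which is the runaway regime described above.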

How can these problems be solved? Equations (4) and (6) indicate four ways to reduce leakage power. The first way is to reduce the supply voltage. As we will see, supply voltage reduction is a very common technique that has been applied to components throughout a system (e.g., processor, buses, cache memories).

The second way to reduce leakage power is to reduce the size of a circuit because the total leakage is proportional to the leakage dissipated in all of a circuit’s transistors. One way of doing this is to design a circuit with fewer transistors by omitting redundant hardware and using smaller caches, but this may limit performance and versatility. Another idea is to reduce the effective transistor count dynamically by cutting the power supplies to idle components. Here, too, there are challenges such as how to predict when different components will be idle and how to minimize the overhead of switching them on or off. This is also a common approach of which we will see examples in the sections to follow.
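Deciding when to cut power to an idle component is often framed as a break-even analysis: gating saves energy only if the idle period outlasts the point where the saved idle power repays the energy cost of switching the supplies off and on again. The numbers below are hypothetical; only the decision rule itself is standard.

```python
def breakeven_time(overhead_energy_j, idle_power_w, gated_power_w):
    """Idle duration beyond which power gating saves net energy."""
    return overhead_energy_j / (idle_power_w - gated_power_w)

def should_gate(predicted_idle_s, overhead_energy_j=0.002,
                idle_power_w=0.5, gated_power_w=0.05):
    """Gate only if the predicted idle period exceeds break-even."""
    return predicted_idle_s > breakeven_time(
        overhead_energy_j, idle_power_w, gated_power_w)

# Break-even with these invented numbers is 0.002 / 0.45, about 4.4 ms:
decisions = [should_gate(t) for t in (0.001, 0.010)]  # short vs. long idle period
```

The hard part in practice, as noted above, is predicting the idle period; mispredicting a short idle as a long one pays the switching overhead without recouping it.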

The third way to reduce leakage power is to cool the computer. Several cooling techniques have been developed since the 1960s. Some blow cold air into the circuit, while others refrigerate the processor [Schmidt and Notohardjono 2002], sometimes even by costly means such as circulating cryogenic fluids like liquid nitrogen [Krane et al. 1988]. These techniques have three advantages. First, they significantly reduce subthreshold leakage. In fact, a recent study [Schmidt and Notohardjono 2002] showed that cooling a memory cell by 50 degrees Celsius reduces the leakage energy by five times. Second, these techniques allow a circuit to work faster because electricity encounters less resistance at lower temperatures. Third, cooling eliminates some negative effects of high temperatures, namely the degradation of a chip’s reliability and life expectancy. Despite these advantages,


Fig. 4. The transistor stacking effect. The circuits in both (a) and (b) are leaking current between Vdd and Ground. However, Ileak1, the leakage in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000.

there are issues to consider such as the costs of the hardware used to cool the circuit. Moreover, cooling techniques are insufficient if they result in wide temperature variations in different parts of a circuit. Rather, one needs to prevent hotspots by distributing heat evenly throughout a chip.

2.2.1. Reducing Threshold Voltage. The fourth way of reducing leakage power is to increase the threshold voltage. As Equation (6) shows, this reduces the subthreshold leakage exponentially. However, it also reduces the circuit’s performance, as is apparent in the following equation that relates frequency (f), supply voltage (V), and threshold voltage (Vth) and where α is a constant:

f ∝ (V − Vth)^α / V. (7)
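The tension between Equations (6) and (7) can be quantified with a small sketch; the constants (α, the n·T term, and the two candidate threshold voltages) are illustrative assumptions, not device data.

```python
import math

V = 1.0        # supply voltage
ALPHA = 1.3    # the constant alpha in Equation (7)
N_T = 0.04     # the n*T term of Equation (6), in volts

def rel_leakage(vth):
    """Relative subthreshold leakage, per Equation (6)."""
    return math.exp(-vth / N_T)

def rel_frequency(vth):
    """Relative achievable frequency, per Equation (7)."""
    return (V - vth) ** ALPHA / V

low, high = 0.2, 0.3   # two candidate threshold voltages

leak_ratio = rel_leakage(low) / rel_leakage(high)      # ~12x more leakage at low Vth
freq_ratio = rel_frequency(low) / rel_frequency(high)  # but only ~1.2x more speed
```

The asymmetry (exponential in leakage, roughly linear in frequency) is why modestly raising Vth on transistors that are not speed-critical is so attractive, as the dual-threshold techniques discussed in this section exploit.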

One of the less intuitive ways of increasing the threshold voltage is to exploit what is called the stacking effect (refer to Figure 4). When two or more transistors that are switched off are stacked on top of each other (a), then they dissipate less leakage than a single transistor that is turned off (b). This is because each transistor in the stack induces a slight reverse bias between the gate and source of the transistor right below it,

Fig. 5. Multiple threshold circuits with sleep transistors.

and this increases the threshold voltage of the bottom transistor, making it more resistant to leakage. As a result, in Figure 4(a), in which all transistors are in the Off position, transistor B leaks less current than transistor A, and transistor C leaks less current than transistor B. Hence, the total leakage current is attenuated as it flows from Vdd to the ground through transistors A, B, and C. This is not the case in the circuit shown in Figure 4(b), which contains only a single off transistor.

Another way to increase the threshold voltage is to use Multiple Threshold Circuits With Sleep Transistors (MTCMOS) [Calhoun et al. 2003; Won et al. 2003]. This involves isolating a leaky circuit element by connecting it to a pair of virtual power supplies that are linked to its actual power supplies through sleep transistors (Figure 5). When the circuit is active, the sleep transistors are activated, connecting the circuit to its power supplies. But when the circuit is inactive, the sleep transistors are deactivated, thus disconnecting the circuit from its power supplies. In this inactive state, almost no leakage passes through the circuit because the sleep transistors have high threshold voltages. (Recall that subthreshold leakage drops exponentially with a rise in threshold voltage, according to Equation (6).)


Fig. 6. Adaptive body biasing.

This technique effectively confines the leakage to one part of the circuit, but is tricky to implement for several reasons. The sleep transistors must be sized properly to minimize the overhead of activating them. They cannot be turned on and off too frequently. Moreover, this technique does not readily apply to memories because memories lose data when their power supplies are cut.

A third way to increase the threshold is to employ dual threshold circuits. Dual threshold circuits [Liu et al. 2004; Wei et al. 1998; Ho and Hwang 2004] reduce leakage by using high threshold (low leakage) transistors on noncritical paths and low threshold transistors on critical paths, the idea being that noncritical paths can execute instructions more slowly without impairing performance. This is a difficult technique to implement because it requires choosing the right combination of transistors for high-threshold voltages. If too many transistors are assigned high threshold voltages, the noncritical paths in the circuit can slow down too much.

A fourth way to increase the threshold voltage is to apply a technique known as adaptive body biasing [Seta et al. 1995; Kobayashi and Sakurai 1994; Kim and Roy 2002]. Adaptive body biasing is a runtime technique that reduces leakage power by dynamically adjusting the threshold voltages of circuits depending on whether the circuits are active. When a circuit is not active, the technique increases its threshold voltage, thus saving leakage power exponentially, although at the expense of a delay in circuit operation. When the circuit is active, the technique decreases the threshold voltage to avoid slowing it down.

To adjust the threshold voltage, adaptive body biasing applies a voltage to the transistor’s body known as a body bias voltage (Figure 6). This voltage changes the polarity of a transistor’s channel, decreasing or increasing its resistance to current flow. When the body bias voltage is chosen to fill the transistor’s channel with positive ions (b), the threshold voltage increases and reduces leakage currents. However, when the voltage is chosen to fill the channel with negative ions, the threshold voltage decreases, allowing higher performance, though at the cost of more leakage.

3. REDUCING POWER

3.1. Circuit And Logic Level Techniques

3.1.1. Transistor Sizing. Transistor sizing [Penzes et al. 2002; Ebergen et al. 2004] reduces the width of transistors to reduce their dynamic power consumption, using low-level models that relate the power consumption to the width. According to these models, reducing the width also increases the transistor’s delay and thus the transistors that lie away from the critical paths of a circuit are usually the best candidates for this technique. Algorithms for applying this technique usually associate with each transistor a tolerable delay which varies depending on how close that transistor is to the critical path. These algorithms then try to scale each transistor to be as small as possible without violating its tolerable delay.
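The sizing pass described above can be sketched greedily, under a deliberately simple model in which a transistor's delay is inversely proportional to its width; real sizers use far more detailed delay and power models.

```python
def size_transistors(transistors):
    """Shrink each transistor as far as its delay budget allows.

    transistors: list of (width, delay, tolerable_delay) tuples, where
    delay is the nominal delay at the nominal width and delay ~ k/width.
    """
    sized = []
    for width, delay, tolerable in transistors:
        # With delay = k/width, k = width*delay, so the smallest width
        # still meeting the budget is k/tolerable:
        sized.append(width * delay / tolerable)
    return sized

widths = size_transistors([
    (1.0, 10.0, 10.0),  # on the critical path: no slack, width stays 1.0
    (1.0, 10.0, 20.0),  # off the critical path: 2x slack, width halves
])
```

Since switched-capacitance power grows with width, the off-path device's halved width directly cuts its dynamic power contribution without violating any delay budget.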

3.1.2. Transistor Reordering. The arrangement of transistors in a circuit affects energy consumption. Figure 7


Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996.

shows two possible implementations of the same circuit that differ only in their placement of the transistors marked A and B. Suppose that the input to transistor A is 1, the input to transistor B is 1, and the input to transistor C is 0. Then transistors A and B will be on, allowing current from Vdd to flow through them and charge the capacitors C1 and C2.

Now suppose that the inputs change and that A’s input becomes 0, and C’s input becomes 1. Then A will be off while B and C will be on. Now the implementations in (a) and (b) will differ in the amounts of switching activity. In (a), current from ground will flow past transistors B and C, discharging both the capacitors C1 and C2. However, in (b), the current from ground will only flow past transistor C; it will not get past transistor A since A is turned off. Thus it will only discharge the capacitor C2, rather than both C1 and C2 as in part (a). Thus the implementation in (b) will consume less power than that in (a).

Transistor reordering [Kursun et al. 2004; Sultania et al. 2004] rearranges transistors to minimize their switching activity. One of its guiding principles is to place transistors closer to the circuit’s outputs if they switch frequently in order to prevent a domino effect where the switching activity from one transistor trickles into many other transistors causing widespread power dissipation. This requires profiling techniques to determine how frequently different transistors are likely to switch.

3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks reduce frequency and voltage, respectively. Traditionally, hardware events such as register file writes occur on a rising clock edge. Half-frequency clocks synchronize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks also often use a lower voltage signal and thus reduce power quadratically.

3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power consumption is how to arrange the gates and their input signals.

For example, consider two implementations of a four-input AND gate (Figure 8), a chain implementation (a), and a tree implementation (b). Knowing the signal probabilities (1 or 0) at each of the primary inputs (A, B, C, D), one can easily calculate the transition probabilities (0→1) for each output (W, X, F, Y, Z). If each input has an equal probability of being a 1 or a 0, then the calculation shows that the chain implementation (a) is likely to switch less than the tree implementation (b). This is because each gate in a chain has a lower probability of having a 0→1 transition than its predecessor; its transition probability depends on those of all its predecessors. In the tree implementation, on the other hand, some gates may share a parent (in the tree topology) instead of being directly connected together. These gates could have the same transition probabilities.
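The probability calculation sketched above is easy to reproduce; assuming each primary input is independently 1 with probability 1/2, the expected switching activity of the chain comes out lower than the tree's.

```python
def p_one(n_inputs):
    """P(output = 1) for an AND of n independent inputs, each 1 w.p. 1/2."""
    return 0.5 ** n_inputs

def p_transition(p1):
    """0 -> 1 transition probability for a node with P(output = 1) = p1."""
    return (1 - p1) * p1

# Chain (a): W = A&B, X = W&C, F = X&D
chain = [p_transition(p_one(2)),   # W: 3/16
         p_transition(p_one(3)),   # X: 7/64
         p_transition(p_one(4))]   # F: 15/256

# Tree (b): Y = A&B, Z = C&D, F = Y&Z
tree = [p_transition(p_one(2)),    # Y: 3/16
        p_transition(p_one(2)),    # Z: 3/16
        p_transition(p_one(4))]    # F: 15/256

# Expected transitions per cycle: chain 91/256 < tree 111/256
```

Each internal AND output is 1 only when every input feeding it is 1, so the deeper chain nodes switch progressively less often, while the tree's two first-level gates each keep the higher 3/16 activity.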

Nevertheless, chain implementations do not necessarily save more energy than tree implementations. There are other issues to consider when choosing a topology. One is the issue of glitches or spurious transitions that occur when a gate does not receive all of its inputs at the same time. These glitches are more common in chain implementations where signals can travel along different paths having widely varying delays. One solution to reduce glitches is to change the topology so that the different paths in the circuit have similar

ACM Computing Surveys, Vol. 37, No. 3, September 2005.


Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory's tutorial on low power design.

delays. This solution, known as path balancing, often transforms chain implementations into tree implementations. Another solution, called retiming, involves inserting flip-flops or registers to slow down, and thereby synchronize, the signals that pass along different paths but reconverge at the same gate. Because flip-flops and registers are in sync with the processor clock, they sample their inputs less frequently than logic gates and are thus more immune to glitches.

3.1.5. Technology Mapping. Because of the huge number of possibilities and tradeoffs at the gate level, designers rely on tools to determine the most energy-optimal way of arranging gates and signals. Technology mapping [Chen et al. 2004; Li et al. 2004; Rutenbar et al. 2001] is the automated process of constructing a gate-level representation of a circuit subject to constraints such as area, delay, and power. Technology mapping for power relies on gate-level power models and a library that describes the available gates and their design constraints. Before a circuit can be described in terms of gates, it is initially represented at the logic level. The problem is to design the circuit out of logic gates in a way that minimizes the total power consumption under delay and cost constraints. This is an NP-hard Directed Acyclic Graph (DAG) covering problem, and a common heuristic for solving it is to break the DAG representation of a circuit into a set of trees and find the optimal mapping for each subtree using standard tree-covering algorithms.

3.1.6. Low Power Flip-Flops. Flip-flops are the building blocks of small memories

such as register files. A typical master-slave flip-flop consists of two latches, a master latch and a slave latch. The inputs of the master latch are the clock signal and data. The inputs of the slave latch are the inverse of the clock signal and the data output by the master latch. Thus when the clock signal is high, the master latch is turned on and the slave latch is turned off. In this phase, the master samples whatever inputs it receives and outputs them. The slave, however, does not sample its inputs but merely outputs whatever it has most recently stored. On the falling edge of the clock, the master turns off and the slave turns on. Thus the master saves its most recent input and stops sampling any further inputs, while the slave samples the new input it receives from the master and outputs it.
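The latch handoff just described can be mimicked with a tiny behavioral model (a hypothetical Python sketch; real flip-flops are edge-sensitive circuits, not objects with a `tick` method):

```python
# Behavioral model of a master-slave flip-flop: the master latch is
# transparent while the clock is high, the slave while it is low, so
# the output Q only changes on a falling clock edge.

class MasterSlaveFF:
    def __init__(self):
        self.master = 0   # value held by the master latch
        self.slave = 0    # value held by the slave latch (the Q output)

    def tick(self, clk, d):
        if clk:                    # clock high: master samples, slave holds
            self.master = d
        else:                      # clock low: slave copies the master
            self.slave = self.master
        return self.slave

ff = MasterSlaveFF()
outputs = [ff.tick(clk, d) for clk, d in [(1, 1), (0, 1), (1, 0), (0, 0)]]
print(outputs)  # -> [0, 1, 1, 0]: Q picks up each data bit on clock fall
```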

Besides this master-slave design there are other common designs such as the pulse-triggered flip-flop and the sense-amplifier flip-flop. All these designs share some common sources of power consumption, namely power dissipated from the clock signal, power dissipated in internal switching activity (caused by the clock signal and by changes in data), and power dissipated when the outputs change.

Researchers have proposed several alternative low power designs for flip-flops. Most of these approaches reduce the switching activity or the power dissipated by the clock signal. One alternative is the self-gating flip-flop. This design inhibits the clock signal to the flip-flop when the inputs will produce no change in the outputs. Strollo et al. [2000] have proposed two versions of this design. In the double-gated flip-flop, the master and slave latches each have their own clock-gating circuit.


The circuit consists of a comparator that checks whether the current and previous inputs are different and some logic that inhibits the clock signal based on the output of the comparator. In the sequential-gated flip-flop, the master and the slave share the same clock-gating circuit instead of having their own individual circuits as in the double-gated flip-flop.

Another low-power flip-flop is the conditional capture flip-flop [Kong et al. 2001; Nedovic et al. 2001]. This flip-flop detects when its inputs produce no change in the output and stops these redundant inputs from causing any spurious internal current switches. There are many variations on this idea, such as the conditional discharge flip-flop and the conditional precharge flip-flop [Zhao et al. 2004], which eliminate unnecessary precharging and discharging of internal elements.

A third type of low-power flip-flop is the dual edge triggered flip-flop [Llopis and Sachdev 1996]. These flip-flops are sensitive to both the rising and falling edges of the clock and can do the same work as a regular flip-flop at half the clock frequency. By cutting the clock frequency in half, these flip-flops reduce power but may not save total energy, although they could save energy by reducing the voltage along with the frequency. Another drawback is that these flip-flops require more area than regular flip-flops. This increase in area may cause them to consume more power on the whole.

3.1.7. Low-Power Control Logic Design. One can view the control logic of a processor as a Finite State Machine (FSM). It specifies the possible processor states and the conditions for switching between the states, and generates the signals that activate the appropriate circuitry for each state. For example, when an instruction is in the execution stage of the pipeline, the control logic may generate signals to activate the correct execution unit.

Although control logic optimizations have traditionally targeted performance, they are now also targeting power. One

way of reducing power is to encode the FSM states in a way that minimizes the switching activity throughout the processor. Another common approach involves decomposing the FSM into sub-FSMs and activating only the circuitry needed for the currently executing sub-FSM. Some researchers [Gao and Hayes 2003] have developed tools that automate this process of decomposition, subject to quality and correctness constraints.
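The effect of state encoding on switching can be seen in a toy example. In the sketch below the FSM, its transition frequencies, and both encodings are invented for illustration; an encoding is scored by the Hamming distance between state codes, weighted by how often each transition fires:

```python
# Sketch: scoring FSM state encodings by expected bit switching.
# States, transition frequencies, and encodings are invented.

def hamming(a, b):
    return bin(a ^ b).count("1")

def switching_cost(encoding, transitions):
    """Sum of (frequency x Hamming distance) over all FSM transitions."""
    return sum(freq * hamming(encoding[src], encoding[dst])
               for src, dst, freq in transitions)

# A four-state sequencer S0 -> S1 -> S2 -> S3 -> S0, 25 visits each.
transitions = [(0, 1, 25), (1, 2, 25), (2, 3, 25), (3, 0, 25)]

binary = {0: 0b00, 1: 0b01, 2: 0b10, 3: 0b11}   # naive binary order
gray   = {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}   # neighbors differ by 1 bit

print(switching_cost(binary, transitions))  # -> 150
print(switching_cost(gray, transitions))    # -> 100
```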

3.1.8. Delay-Based Dynamic-Supply Voltage Adjustment. Commercial processors that can run at multiple clock speeds normally use a lookup table to decide what supply voltage to select for a given clock speed. This table is built ahead of time through a worst-case analysis. Because worst-case analyses may not reflect reality, ARM Inc. has been developing a more efficient runtime solution known as the Razor pipeline [Ernst et al. 2003].

Instead of using a lookup table, Razor adjusts the supply voltage based on the delay in the circuit. The idea is that whenever the voltage is reduced, the circuit slows down, causing timing errors if the clock frequency chosen for that voltage is too high. Because of these errors, some instructions produce inconsistent results or fail altogether. The Razor pipeline periodically monitors how many such errors occur. If the number of errors exceeds a threshold, it increases the supply voltage. If the number of errors is below a threshold, it scales the voltage down further to save more energy.

This solution requires extra hardware to detect and correct circuit errors. To detect errors, it augments the flip-flops in delay-critical regions of the circuit with shadow latches. These latches receive the same inputs as the flip-flops but are clocked more slowly to adapt to the reduced supply voltage. If the output of a flip-flop differs from that of its shadow latch, this signifies an error. In the event of such an error, the circuit propagates the output of the shadow latch instead of the flip-flop, delaying the pipeline for a cycle if necessary.
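The error-driven adjustment can be caricatured as a simple threshold rule. In this sketch the error-rate thresholds, the voltage step, and the voltage bounds are all invented, not values from Ernst et al.:

```python
# Sketch of a Razor-style feedback rule: raise Vdd when timing errors
# are frequent, lower it when they are rare. All numbers are assumed.

def adjust_voltage(vdd, errors, samples,
                   high=0.01, low=0.001, step=0.05, vmin=0.8, vmax=1.2):
    """Return the supply voltage for the next monitoring window."""
    rate = errors / samples
    if rate > high:                     # too many errors: back off
        return min(vmax, vdd + step)
    if rate < low:                      # almost error-free: push lower
        return max(vmin, vdd - step)
    return vdd                          # otherwise hold steady

v = 1.0
v = adjust_voltage(v, errors=0, samples=10000)    # error-free window
print(v)  # -> 0.95 (scale down to save energy)
v = adjust_voltage(v, errors=500, samples=10000)  # 5% error rate
print(v)  # -> 1.0 (scale back up)
```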


Fig. 9. Crosstalk prevention through shielding.

3.2. Low-Power Interconnect

Interconnect heavily affects power consumption since it is the medium of most electrical activity. Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents. The wires in a chip often use materials with poor thermal conductivity [Banerjee and Mehrotra 2001].

3.2.1. Bus Encoding and Crosstalk. A popular way of reducing the power consumed in buses is to reduce the switching activity through intelligent bus encoding schemes, such as bus-inversion [Stan and Burleson 1995]. Buses consist of wires that transmit bits (logical 1 or 0). For every data transmission on a bus, the number of wires that switch depends on the current and previous values transmitted. If the Hamming distance between these values is more than half the number of wires, then most of the wires on the bus will switch current. To prevent this from happening, bus-inversion transmits the inverse of the intended value and asserts a control signal alerting recipients of the inversion. For example, if the current binary value to transmit is 110 and the previous was 000, the bus instead transmits 001, the inverse of 110.

Bus-inversion ensures that at most half of the bus wires switch during a bus transaction. However, because of the cost of the logic required to invert the bus lines, this technique is mainly used in external buses rather than the internal chip interconnect.

Moreover, some researchers have argued that this technique makes the simplistic assumption that the amount of power an interconnect wire consumes depends only on the number of bit transitions on that wire. As chips become

smaller, additional sources of power consumption arise. One of these sources is crosstalk [Sylvester and Keutzer 1998]. Crosstalk is spurious activity on a wire caused by activity in neighboring wires. As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption.

One way of reducing crosstalk [Taylor et al. 2001] is to insert a shield wire (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no adjacent wires switch in opposite directions. This solution doubles the number of wires. However, the principle behind it has sparked efforts to develop self-shielding codes [Victor and Keutzer 2001; Patel and Markov 2003] resistant to crosstalk. As in traditional bus encoding, a value is encoded and then transmitted. However, the code chosen avoids opposing transitions on adjacent bus wires.

3.2.2. Low Swing Buses. Bus-encoding schemes attempt to reduce transitions on the bus. Alternatively, a bus can transmit the same information at a lower voltage. This is the principle behind low swing buses [Zhang and Rabaey 1998]. Traditionally, logical one is represented by 5 volts and logical zero by −5 volts. In a low-swing system (Figure 10), logical one and zero are encoded using lower voltages, such as +300mV and −300mV. Typically, these systems are implemented with differential signaling. An input signal is split into two signals of opposite polarity bounded by a smaller voltage range. The receiver sees the difference between the two transmitted signals as the actual signal and amplifies it back to normal voltage. Low swing differential signaling has several advantages in addition to reduced power consumption. It is immune to


Fig. 10. Low voltage differential signaling.

Fig. 11. Bus segmentation.

crosstalk and electromagnetic radiation effects. Since the two transmitted signals are close together, any spurious activity will affect both equally without affecting the difference between them. However, when implementing this technique, one needs to consider the costs of increased hardware at the encoder and decoder.

3.2.3. Bus Segmentation. Another effective technique is bus segmentation. In a traditional shared bus architecture, the entire bus is charged and discharged upon every access. Segmentation (Figure 11) splits a bus into multiple segments connected by links that regulate the traffic between adjacent segments. Only the links connecting the paths essential to a communication are activated, allowing most of the bus to remain powered down. Ideally, devices that communicate frequently should be in the same or nearby segments to avoid powering many links. Jone et al. [2003] have examined how to partition a bus to ensure this property. Their algorithm begins with an undirected graph whose nodes are devices, whose edges connect communicating devices, and whose edge weights reflect communication frequency. Applying the Gomory and Hu algorithm [1961], they find a spanning tree in which the product of communication frequency and number of edges between any two devices is minimal.

3.2.4. Adiabatic Buses. The preceding techniques work by reducing activity or voltage. In contrast, adiabatic circuits [Lyuboslavsky et al. 2000] reduce total capacitance. These circuits reuse existing electrical charge to avoid creating new charge. In a traditional bus, when a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus recycles the charge for wires about to be asserted.

Figure 12 illustrates a simple design for a two-bit adiabatic bus [Bishop and Irwin 1999]. A comparator in each bitline tests whether the current bit is equal to the previously sent bit. Inequality signifies a transition and connects the bitline to a shorting wire used for sharing charge across bitlines. In the sample scenario, Bitline1 has a falling transition while Bitline0 has a rising transition. As Bitline1 falls, its positive charge transfers to Bitline0, allowing Bitline0 to rise. The power saved depends on transition patterns. No energy is saved when all lines rise. The most energy is saved when an equal number of lines rise and fall simultaneously.

The biggest drawback of adiabatic circuits is the delay incurred in transferring shared charge.

3.2.5. Network-On-Chip. All of the above techniques assume an architecture in


Fig. 12. Two bit charge recovery bus.

which multiple functional units share one or more buses. This shared-bus paradigm has several drawbacks with respect to power and performance. The inherent bus bandwidth limits the speed and volume of data transfers and does not suit the varying requirements of different execution units. Only one unit can access the bus at a time, though several may be requesting bus access simultaneously. Bus arbitration mechanisms are energy intensive. Unlike simple data transfers, every bus transaction involves multiple clock cycles of handshaking protocols that increase power and delay.

As a consequence, researchers have been investigating the use of generic interconnection networks to replace buses. Compared to buses, networks offer higher bandwidth and support concurrent connections. This design makes it easier to troubleshoot energy-intensive areas of the network. Many networks have sophisticated mechanisms for adapting to varying traffic patterns and quality-of-service requirements. Several authors [Zhang et al. 2001; Dally and Towles 2001; Sgroi et al. 2001] propose applying these techniques at the chip level. These claims have yet to be verified, and interconnect power reduction remains a promising area for future research.

Wang et al. [2003] have developed hardware-level optimizations for different sources of energy consumption in an on-chip network. One of the alternative network topologies they propose is the segmented crossbar topology, which aims to reduce the energy consumption of data transfers. When data is transferred over a regular network grid, all rows

and columns corresponding to the intersection points end up switching current even though only parts of these rows and columns are actually traversed. To eliminate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments demarcated by tristate buffers. The different segments are selectively activated as the data traverses them, hence confining current switches to only those segments along which the data actually passes.

3.3. Low-Power Memories and Memory Hierarchies

3.3.1. Types of Memory. One can classify memory structures into two categories: Random Access Memories (RAM) and Read-Only Memories (ROM). There are two kinds of RAM, static RAM (SRAM) and dynamic RAM (DRAM), which differ in how they store data. SRAMs store data using flip-flops, while DRAMs store each bit of data as a charge on a capacitor; thus DRAMs need to refresh their data periodically. SRAMs allow faster accesses than DRAMs but require more area and are more expensive. As a result, normally only register files, caches, and high bandwidth parts of the system are made up of SRAM cells, while main memory is made up of DRAM cells, which have slower access times but contain fewer transistors and are less expensive.

The techniques we will introduce are not confined to any specific type of RAM or ROM. Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available. They attempt to reduce the energy dissipation of memory accesses in two ways: by reducing the energy dissipated in each memory access, or by reducing the number of memory accesses.

3.3.2. Splitting Memories Into Smaller Subsystems. An effective way to reduce the energy that is dissipated in a memory


access is to activate only the needed memory circuits in each access. One way to expose these circuits is to partition memories into smaller, independently accessible components. This can be done at different granularities.

At the coarsest granularity, one can design a main memory system that is split into multiple banks, each of which can independently transition into different power modes. Compilers and operating systems can cluster data into a minimal set of banks, allowing the other, unused banks to power down and thereby save energy [Luz et al. 2002; Lebeck et al. 2000].

At a finer granularity, one can partition each bank of a partitioned memory into subbanks and activate only the relevant subbank in every memory access. This is a technique that has been applied to caches [Ghose and Kamble 1999]. In a normal cache access, the set-selection step involves transferring all blocks in all banks onto tag-comparison latches. Since the requested word is at a known offset in these blocks, energy can be saved by transferring only words at that offset. Cache subbanking splits the block array of each cache line into multiple banks. During set-selection, only the relevant bank from each line is active. Since a single subbank is active at a time, all subbanks of a line share the same output latches and sense amplifiers to reduce hardware complexity. Subbanking saves energy without increasing memory access time. Moreover, it is independent of program locality patterns.
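A back-of-the-envelope model illustrates the savings. All numbers below (a 4-way set, 8-word blocks, one energy unit per word driven) are invented for illustration only:

```python
# Rough model: energy to drive words onto the tag-comparison latches
# during set-selection, with and without subbanking.

def read_energy(ways, words_driven, e_word=1.0):
    """Energy units to drive words_driven words from each of `ways` blocks."""
    return ways * words_driven * e_word

WAYS, BLOCK_WORDS = 4, 8
conventional = read_energy(WAYS, words_driven=BLOCK_WORDS)  # whole blocks
subbanked = read_energy(WAYS, words_driven=1)               # one subbank each
print(conventional, subbanked)  # -> 32.0 4.0
```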

3.3.3. Augmenting the Memory Hierarchy With Specialized Cache Structures. The second way to reduce energy dissipation in the memory hierarchy is to reduce the number of memory hierarchy accesses. One group of researchers [Kin et al. 1997] developed a simple but effective technique to do this. Their idea was to integrate a specialized cache into the typical memory hierarchy of a modern processor, a hierarchy that may already contain one or more levels of caches. The new cache they proposed would sit between the

processor and the first level cache. It would be smaller than the first level cache and hence dissipate less energy. Yet it would be large enough to store an application's working set and filter out many memory references; only when a data request misses this cache would it require searching higher levels of cache, and the penalty for a miss would be offset by redirecting more memory references to this smaller, more energy efficient cache. This was the principle behind the filter cache. Though deceptively simple, this idea lies at the heart of many of the low-power cache designs used today, ranging from simple extensions of the cache hierarchy (e.g., block buffering) to the scratch pad memories used in embedded systems and the complex trace caches used in recent high-end processors.
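The filtering effect is easy to demonstrate with a toy simulation; the cache geometry and per-access energy numbers below are invented, not from Kin et al.:

```python
# Toy model of a filter cache: a tiny direct-mapped cache probed first;
# only misses pay the (much larger) L1 access energy. Numbers invented.

def filter_cache_energy(trace, lines=16, e_filter=1.0, e_l1=10.0):
    tags = [None] * lines                  # one tag per filter-cache line
    energy = 0.0
    for addr in trace:
        index, tag = addr % lines, addr // lines
        energy += e_filter                 # the filter cache is always probed
        if tags[index] != tag:             # miss: access L1 and refill
            energy += e_l1
            tags[index] = tag
    return energy

trace = list(range(8)) * 100               # a small, loop-heavy working set
print(filter_cache_energy(trace))          # -> 880.0 (8 misses, 792 hits)
print(len(trace) * 10.0)                   # -> 8000.0 if every access went to L1
```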

One simple extension is the technique of block buffering. This technique augments caches with small buffers that store the most recently accessed cache set. If a requested data item belongs to a cache set that is already in the buffer, then one does not have to search for it in the rest of the cache.

In embedded systems, scratch pad memories [Panda et al. 2000; Banakar et al. 2002; Kandemir et al. 2002] have been the extension of choice. These are software controlled memories that reside on-chip. Unlike caches, the data they should contain is determined ahead of time and remains fixed while programs execute. These memories are less volatile than conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose applications are often hand tuned.

General purpose processors of the latest generation sometimes rely on more complex caching techniques such as the trace cache. Trace caches were originally developed for performance but have recently been studied for their power benefits. Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order. If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache


but can be decoded directly from the trace cache. This saves power by reducing the number of instruction cache accesses.

In the original trace cache design, both the trace cache and the instruction cache are accessed in parallel on every clock cycle. Thus instructions are fetched from the instruction cache while the trace cache is searched for the next trace that is predicted to execute. If the trace is in the trace cache, the instructions fetched from the instruction cache are discarded. Otherwise, program execution proceeds out of the instruction cache and, simultaneously, new traces are created from the instructions that are fetched.

Low-power designs for trace caches typically aim to reduce the number of instruction cache accesses. One alternative, the dynamic direction prediction-based trace cache [Hu et al. 2003], uses branch prediction to decide where to fetch instructions from. If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache.

The selective trace cache [Hu et al. 2002] extends this technique with compiler support. The idea is to identify frequently executed, or hot, traces and bring these traces into the trace cache. The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace. At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache.

3.4. Low-Power Processor Architecture Adaptations

So far we have described the energy saving features of hardware as though they were a fixed foundation upon which programs execute. However, programs exhibit wide variations in behavior. Researchers have therefore been developing hardware structures whose parameters can be adjusted on demand, so that one can save energy by activating just the minimum hardware resources needed for the code that is executing.

Fig. 13. Drowsy cache line.

3.4.1. Adaptive Caches. There is a wealth of literature on adaptive caches, caches whose storage elements (lines, blocks, or sets) can be selectively activated based on the application workload. One example of such a cache is the Dynamically Resizable Instruction (DRI) cache [Powell et al. 2001]. This cache permits one to deactivate its individual sets on demand by gating their supply voltages. To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application's cache-miss patterns. Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets. Likewise, whenever the miss rate falls below a threshold, it deactivates some of these sets by inhibiting their supply voltages.
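The DRI cache's miss-rate feedback can be sketched as a simple control rule; the thresholds, factor-of-two resizing, and size bounds below are assumptions, not the values studied by Powell et al.:

```python
# Sketch of a DRI-style resizing rule: grow the active sets when the
# miss rate crosses a threshold, shrink when it falls well below it.

def resize(active_sets, misses, accesses,
           grow_at=0.05, shrink_at=0.01, min_sets=64, max_sets=1024):
    rate = misses / accesses
    if rate > grow_at and active_sets < max_sets:
        return active_sets * 2        # reactivate previously gated sets
    if rate < shrink_at and active_sets > min_sets:
        return active_sets // 2       # gate the supply voltage of half the sets
    return active_sets

sets = 512
sets = resize(sets, misses=2, accesses=1000)   # 0.2% miss rate
print(sets)  # -> 256
sets = resize(sets, misses=80, accesses=1000)  # 8% miss rate
print(sets)  # -> 512
```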

A problem with this approach is that dormant memory cells lose data and need more time to be reactivated for their next use. An alternative to inhibiting their supply voltages is therefore to reduce their voltages as low as possible without losing data. This is the aim of the drowsy cache, a cache whose lines can be placed in a drowsy mode where they dissipate minimal power but retain their data and can be reactivated faster.

Figure 13 illustrates a typical drowsy cache line [Kim et al. 2002]. The wakeup signal chooses between voltages for the active, drowsy, and off states. It also regulates the word decoder's output to prevent data in a drowsy line from being accessed. Before a cache access, the line circuitry must be precharged. In a


Fig. 14. Dead block elimination.

traditional cache, the precharger is continuously on. To save power, the drowsy circuitry regulates the wakeup signal with a precharge signal. The precharger becomes active only when the line is woken up.

Many heuristics have been developed for controlling these drowsy circuits. The simplest policy is cache decay [Kaxiras et al. 2001], which simply turns off unused cache lines after a fixed interval. More sophisticated policies [Zhang et al. 2002; 2003; Hu et al. 2003] consider information about runtime program behavior. For example, Hu et al. [2003] have proposed hardware that detects when an execution moves into a hotspot and activates only those cache lines that lie within that hotspot. The main idea of their approach is to count how often different branches are taken. When a branch is taken sufficiently many times, its target address is considered a hotspot, and the hardware sets special bits to indicate that cache lines within that hotspot should not be shut down. Periodically, a runtime routine disables cache lines that do not have these bits set and decays the profiling data. Whenever it reactivates a cache line before its next use, it also reactivates the line corresponding to the next subsequent memory access.
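Cache decay, the simplest of these policies, amounts to a per-line idle counter; the sketch below uses an artificially small decay interval for illustration:

```python
# Sketch of cache decay: each line keeps an idle counter; a line idle
# for DECAY_INTERVAL cycles is powered down, and a touched line is
# reactivated on demand.

DECAY_INTERVAL = 4

class DecayCache:
    def __init__(self, n_lines):
        self.idle = [0] * n_lines
        self.on = [True] * n_lines

    def cycle(self, accessed):
        """Advance one cycle; accessed is the set of line indices touched."""
        for i in range(len(self.idle)):
            if i in accessed:
                self.idle[i] = 0
                self.on[i] = True              # wake the line on demand
            else:
                self.idle[i] += 1
                if self.idle[i] >= DECAY_INTERVAL:
                    self.on[i] = False         # decayed: switch the line off

cache = DecayCache(4)
for _ in range(5):
    cache.cycle({0})        # only line 0 stays in use
print(cache.on)             # -> [True, False, False, False]
```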

There are many other heuristics for controlling cache lines. Dead-block elimination [Kabadi et al. 2002] powers down cache lines containing basic blocks that have reached their final use. It is a compiler-directed technique that requires a static control flow analysis to identify these blocks. Consider the control flow graph in Figure 14. Since block B1 executes only once, its cache lines can power down once control reaches B2. Similarly, the lines containing B2 and B3 can power down once control reaches B4.

3.4.2. Adaptive Instruction Queues. Buyuktosunoglu et al. [2001] developed one of

the first adaptive instruction issue queues, a 32-entry queue consisting of four equal size partitions that can be independently activated. Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline. At any time, only the partitions that contain the currently executing instructions are activated.

There have been many heuristics developed for configuring these queues. Some require measuring the rate at which instructions are executed per processor clock cycle, or IPC. For example, Buyuktosunoglu et al.'s heuristic [2001] periodically measures the IPC over fixed length intervals. If the IPC of the current interval is a factor smaller than that of the previous interval (indicating worse performance), then the heuristic increases the instruction queue size to increase throughput. Otherwise it may decrease the queue size.
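This interval-based rule can be sketched as follows; the 0.9 drop factor and the size steps are invented, though the four sizes mirror the four independently activated partitions described above:

```python
# Sketch of interval-based issue-queue resizing: compare this
# interval's IPC against the last; a large drop means the smaller
# queue hurt throughput, so a partition is re-enabled.

SIZES = (8, 16, 24, 32)

def resize_queue(size, ipc_now, ipc_prev, factor=0.9):
    i = SIZES.index(size)
    if ipc_now < factor * ipc_prev and i < len(SIZES) - 1:
        return SIZES[i + 1]       # performance dropped: enable a partition
    if ipc_now >= ipc_prev and i > 0:
        return SIZES[i - 1]       # no loss observed: try a smaller queue
    return size

q = 32
q = resize_queue(q, ipc_now=1.5, ipc_prev=1.5)  # steady IPC
print(q)  # -> 24
q = resize_queue(q, ipc_now=1.0, ipc_prev=1.5)  # one-third drop
print(q)  # -> 32
```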

Bahar and Manne [2001] propose a similar heuristic targeting a simulated 8-wide issue processor that can be reconfigured as a 6-wide or 4-wide issue processor. They also measure IPC over fixed intervals but measure the floating point IPC separately from the integer IPC, since the former is more critical for floating point applications. Their heuristic, like Buyuktosunoglu et al.'s [2001], compares the measured IPC with specific thresholds to determine when to downsize the issue queue.

Folegnani and Gonzalez [2001], on the other hand, use a different heuristic. They find that when a program has little parallelism, the youngest part of the issue queue (which contains the most recent instructions) contributes very little to the overall IPC, in that instructions in this part are often committed late. Thus their heuristic monitors the contribution of the youngest part of the queue and deactivates this part when its contribution is minimal.

3.4.3. Algorithms for Reconfiguring Multiple Structures. In addition to this work,


researchers have also been tackling the problem of reconfiguring multiple hardware units simultaneously.

One group [Hughes et al. 2001; Sasanka et al. 2002; Hughes and Adve 2004] has been developing heuristics for combining hardware adaptations with dynamic voltage scaling during multimedia applications. Their strategy is to run two algorithms while individual video frames are being processed. A global algorithm chooses the initial DVS setting and baseline hardware configuration for each frame, while a local algorithm tunes various hardware parameters (e.g., instruction window sizes) while the frame is executing to use up any remaining slack periods.

Another group [Iyer and Marculescu 2001, 2002a] has developed heuristics that adjust the pipeline width and register update unit (RUU) size for different hotspots, or frequently executed program regions. To detect these hotspots, their technique counts the number of times each branch instruction is taken. When a branch is taken enough times, it is marked as a candidate branch.

To detect how often candidate branches are taken, the heuristic uses a hotspot detection counter. Initially the hotspot counter contains a positive integer. Each time a candidate branch is taken, the counter is decremented, and each time a noncandidate branch is taken, the counter is incremented. If the counter reaches zero, the program execution is inside a hotspot.
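The counter mechanism is easy to sketch; the initial value and the clamping at zero below are assumptions about details the survey leaves unspecified:

```python
# Sketch of the hotspot detection counter: candidate-branch executions
# drive the counter toward zero; reaching zero signals that execution
# is inside a hotspot.

class HotspotDetector:
    def __init__(self, start=4):
        self.counter = start

    def branch(self, is_candidate):
        if is_candidate:
            self.counter = max(0, self.counter - 1)
        else:
            self.counter += 1
        return self.counter == 0       # True once inside a hotspot

d = HotspotDetector(start=4)
history = [d.branch(c) for c in [True, True, False, True, True, True]]
print(history)  # -> [False, False, False, False, False, True]
```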

Once the program execution is in a hotspot, this heuristic tests the energy consumption of different processor configurations at fixed-length intervals. It finally chooses the most energy-saving configuration as the one to use the next time the same hotspot is entered. To measure energy at runtime, the heuristic relies on hardware that monitors usage statistics of the most power-hungry units and calculates the total power based on energy-per-access values.
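The counter mechanism above lends itself to a compact sketch. Everything here (the start value, the boolean encoding of the branch stream, the function name) is illustrative rather than taken from Iyer and Marculescu's design:

```python
def detect_hotspot(branch_stream, start=8):
    """Return the index at which execution enters a hotspot, or None.

    branch_stream: iterable of booleans, True if the taken branch is a
    marked candidate branch, False otherwise (illustrative encoding).
    """
    counter = start  # hotspot detection counter, initially positive
    for i, is_candidate in enumerate(branch_stream):
        counter += -1 if is_candidate else 1  # candidate branches drain it
        if counter == 0:
            return i  # counter reached zero: execution is inside a hotspot
    return None
```

A stream dominated by candidate branches drives the counter to zero quickly, while mixed streams keep it hovering above zero, which is exactly the behavior the text describes.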

Huang et al. [2000; 2003] apply hardware adaptations at the granularity of subroutines. To decide what adaptations to apply, they use offline profiling to plot an energy-delay tradeoff curve. Each point in the curve represents the energy and delay tradeoff of applying an adaptation to a subroutine. There are three regions in the curve. The first includes adaptations that save energy without impairing performance; these are the adaptations that will always be applied. The third represents adaptations that worsen both performance and energy; these are the adaptations that will never be applied. Between these two extremes are adaptations that trade performance for energy savings; some of these may be applied depending on the tolerable performance loss. The main idea is to trace the curve from the origin and apply every adaptation that one encounters until the cumulative performance loss from applying all the encountered adaptations reaches the maximum tolerable performance loss.
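The curve-tracing step can be sketched as a simple greedy walk. The representation of curve points as (energy saved, performance loss) pairs is our own simplification of Huang et al.'s profiling data:

```python
def select_adaptations(curve_points, max_perf_loss):
    """Walk an energy-delay tradeoff curve from the origin and pick every
    adaptation encountered until the cumulative performance loss would
    exceed the tolerable budget.

    curve_points: list of (energy_saved, perf_loss) pairs, ordered as they
    appear along the curve (a hypothetical representation).
    """
    chosen, loss = [], 0.0
    for point in curve_points:
        _saved, cost = point
        if loss + cost > max_perf_loss:
            break  # the next adaptation would exceed the tolerable loss
        chosen.append(point)
        loss += cost
    return chosen
```

With a zero budget only the free adaptations (the first region of the curve) are taken; a larger budget admits some of the middle region as well.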

Ponomarev et al. [2001] develop heuristics for controlling a configurable issue queue, a configurable reorder buffer, and a configurable load/store queue. Unlike the IPC-based heuristics we have seen, this heuristic is occupancy-based, because the occupancy of hardware structures indicates how congested these structures are, and congestion is a leading cause of performance loss. Whenever a hardware structure (e.g., instruction queue) has low occupancy, the heuristic reduces its size. In contrast, if the structure is completely full for a long time, the heuristic increases its size to avoid congestion.
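The occupancy-based policy can be sketched as follows; the sampling interval, step sizes, and thresholds are assumptions for illustration, not Ponomarev et al.'s actual parameters:

```python
def resize(size, occupancy_samples, full_count, low_frac=0.5,
           grow=8, shrink=8, min_size=16, max_size=128):
    """Occupancy-based resizing sketch for one structure (e.g. issue queue).

    occupancy_samples: occupancies observed over the last interval.
    full_count: how many sampled cycles the structure was completely full.
    """
    avg = sum(occupancy_samples) / len(occupancy_samples)
    if full_count > len(occupancy_samples) // 2:
        return min(size + grow, max_size)    # congested for long: upsize
    if avg < low_frac * size:
        return max(size - shrink, min_size)  # underused: downsize
    return size
```

Each structure is tuned independently from its own occupancy statistics, which is what makes the approach cheap compared to searching the joint configuration space.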

Another group of researchers [Albonesi et al. 2003; Dropsho et al. 2002] explore the complex problem of configuring a processor whose adaptable components include two levels of caches, integer and floating point issue queues, load/store queues, register files, as well as a reorder buffer. To dodge the sheer explosion of possible configurations, they tune each component based only on its local usage statistics. They propose two heuristics for doing this, one for tuning caches and another for tuning buffers and register files.

The caches they consider are selective-way caches; each of their multiple ways can independently be activated or disabled. Each way has a most recently used (MRU) counter that is incremented every time a cache search hits the way. The cache tuning heuristic samples this counter at fixed intervals to determine the number of hits in each way and the number of overall misses. Using these access statistics, the heuristic computes the energy and performance overheads for all possible cache configurations and chooses the best configuration dynamically.

The heuristic for controlling other structures such as issue queues is occupancy based. It measures how often different components of these structures fill up with data and uses this information to decide whether to upsize or downsize the structure.

3.5. Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) addresses the problem of how to modulate a processor's clock frequency and supply voltage in lockstep as programs execute. The premise is that a processor's workloads vary and that when the processor has less work, it can be slowed down without affecting performance adversely. For example, if the system has only one task and it has a workload that requires 10 cycles to execute and a deadline of 100 seconds, the processor can slow down to 1/10 cycles/sec, saving power and meeting the deadline right on time. This is presumably more efficient than running the task at full speed and idling for the remainder of the period. Though it appears straightforward, this naive picture of how DVS works is highly simplified and hides serious real-world complexities.
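The arithmetic behind this example can be made concrete under the usual CMOS approximation that dynamic energy per cycle scales as CV² and that supply voltage scales roughly linearly with frequency. The sketch below is a back-of-the-envelope illustration under those assumptions, not a model of any particular processor:

```python
def dynamic_energy(cycles, freq, f_max, v_max, c=1.0):
    """Dynamic energy to run `cycles` at `freq`, assuming voltage scales
    linearly with frequency (a common simplification): E = C * V^2 * cycles."""
    v = v_max * (freq / f_max)
    return c * v * v * cycles

# The task from the text: 10 cycles of work, 100-second deadline.
e_full = dynamic_energy(10, freq=1.0, f_max=1.0, v_max=1.0)  # run fast, then idle
e_slow = dynamic_energy(10, freq=0.1, f_max=1.0, v_max=1.0)  # stretch to the deadline
```

Under this idealized model, stretching the task to its deadline at one tenth of full frequency cuts dynamic energy by a factor of one hundred, which is why the naive picture of DVS is so appealing; the rest of this section explains why real systems do not behave this cleanly.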

3.5.1. Unpredictable Nature Of Workloads. The first problem is how to predict workloads with reasonable accuracy. This requires knowing what tasks will execute at any given time and the work required for these tasks. Two issues complicate this problem. First, tasks can be preempted at arbitrary times due to user and I/O device requests. Second, it is not always possible to accurately predict the future runtime of an arbitrary algorithm (cf. the Halting Problem [Turing 1937]). There have been ongoing efforts [Nilsen and Rygg 1995; Lim et al. 1995; Diniz 2003; Ishihara and Yasuura 1998] in the real-time systems community to accurately estimate the worst case execution times of programs. These attempts use either measurement-based prediction, static analysis, or a mixture of both. Measurement involves a learning lag whereby the execution times of various tasks are sampled and future execution times are extrapolated. The problem is that, when workloads are irregular, future execution times may not resemble past execution times. Microarchitectural innovations such as pipelining, hyperthreading, and out-of-order execution make it difficult to predict execution times in real systems statically. Among other things, a compiler needs to know how instructions are interleaved in the pipeline, what the probabilities are that different branches are taken, and when cache misses and pipeline hazards are likely to occur. This requires modeling the pipeline and memory hierarchy. A number of researchers [Ferdinand 1997; Clements 1996] have attempted to integrate complete pipeline and cache models into compilers. It remains nontrivial to develop models that can be used efficiently in a compiler and that capture all the complexities inherent in current systems.

3.5.2. Indeterminism And Anomalies In Real Systems. Even if the DVS algorithm predicts correctly what the processor's workloads will be, determining how fast to run the processor is nontrivial. The nondeterminism in real systems removes any strict relationships between clock frequency, execution time, and power consumption. Thus most theoretical studies on voltage scaling are based on assumptions that may sound reasonable but are not guaranteed to hold in real systems.

One misconception is that total microprocessor system power is quadratic in supply voltage. Under the CMOS transistor model, the power dissipation of individual transistors is quadratic in their supply voltages, but there remains no precise way of estimating the power dissipation of an entire system. One can construct scenarios where total system power is not quadratic in supply voltage. Martin and Siewiorek [2001] found that the StrongARM SA-1100 has two supply voltages, a 1.5V supply that powers the CPU and a 3.3V supply that powers the I/O pins. The total power dissipation is dominated by the larger supply voltage; hence, even if the smaller supply voltage were allowed to vary, it would not reduce power quadratically. Fan et al. [2003] found that, in some architectures where active memory banks dominate total system power consumption, reducing the supply voltage dramatically reduces CPU power dissipation but has a minuscule effect on total power. Thus, for full benefits, dynamic voltage scaling may need to be combined with resource hibernation to power down other energy-hungry parts of the chip.

Another misconception is that it is most power efficient to run a task at the slowest constant speed that allows it to exactly meet its deadline. Several theoretical studies [Ishihara and Yasuura 1998] attempt to prove this claim. These proofs rest on idealistic assumptions such as "power is a convex linearly increasing function of frequency", assumptions that ignore how DVS affects the system as a whole. When the processor slows down, peripheral devices may remain activated longer, consuming more power. An important study of this issue was done by Fan et al. [2003], who show that, for specific DRAM architectures, the energy versus slowdown curve is "U" shaped. As the processor slows down, CPU energy decreases but the cumulative energy consumed in active memory banks increases. Thus the optimal speed is actually higher than the processor's lowest speed; any speed lower than this causes memory energy dissipation to overshadow the effects of DVS.
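A toy model reproduces this "U" shape. We assume, purely for illustration, that CPU power falls cubically with slowdown (so CPU energy falls quadratically) while active memory banks draw constant power for the whole, stretched execution; all constants are made up and do not come from Fan et al.'s study:

```python
def system_energy(slowdown, cpu_e_full=10.0, mem_p=0.5, base_time=1.0):
    """Total energy versus slowdown factor, assuming CPU energy falls
    quadratically with slowdown while active-memory power is constant."""
    t = base_time * slowdown
    cpu = cpu_e_full / slowdown ** 2  # P ~ 1/s^3 and t ~ s, so E ~ 1/s^2
    mem = mem_p * t                   # memory stays active the whole time
    return cpu + mem

# Sample the curve at a few slowdown factors.
curve = [(s, system_energy(s)) for s in (1, 2, 4, 8, 16)]
```

With these constants the minimum falls at an intermediate slowdown: slowing past it makes the constant memory power dominate, exactly the effect described in the text.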

A third misconception is that execution time is inversely proportional to clock frequency. It is actually an open problem how slowing down the processor affects the execution time of an arbitrary application. DVS may result in nonlinearities [Buttazzo 2002]. Consider a system with

Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its critical section, and task two is able to preempt it right before it does. When DVS is not applied (b), task one enters its critical section sooner and thus prevents task two from preempting it until it finishes its critical section. This figure has been adapted from Buttazzo [2002].

two tasks (Figure 15). When the processor slows down (a), Task One takes longer to enter its critical section and is thus preempted by Task Two as soon as Task Two is released into the system. When the processor instead runs at maximum speed (b), Task One enters its critical section earlier and blocks Task Two until it finishes this section. This hypothetical example suggests that DVS can affect the order in which tasks are executed. But the order in which tasks execute affects the state of the cache which, in turn, affects performance. The exact relationship between the cache state and execution time is not clear. Thus, it is too simplistic to assume that execution time is inversely proportional to clock frequency.

All of these issues show that theoretical studies are insufficient for understanding how DVS will affect system state. One needs to develop an experimental approach driven by heuristics and evaluate the tradeoffs of these heuristics empirically.


Most of the existing DVS approaches can be classified as interval-based approaches, intertask approaches, or intratask approaches.

3.5.3. Interval-Based Approaches. Interval-based DVS algorithms measure how busy the processor is over some interval or intervals, estimate how busy it will be in the next interval, and adjust the processor speed accordingly. These algorithms differ in how they estimate future processor utilization. Weiser et al. [1994] developed one of the earliest interval-based algorithms. This algorithm, Past, periodically measures how long the processor idles. If it idles longer than a threshold, the algorithm slows it down; if it remains busy longer than a threshold, it speeds it up. Since this approach makes decisions based only on the most recent interval, it is error prone. It is also prone to thrashing since it computes a CPU speed for every interval. Govil et al. [1995] examine a number of algorithms that extend Past by considering measurements collected over larger windows of time. The Aged Averages algorithm, for example, estimates the CPU usage for the upcoming interval as a weighted average of usages measured over all previous intervals. These algorithms save more energy than Past since they base their decisions on more information. However, they all assume that workloads are regular. As we have mentioned, it is very difficult, if not impossible, to predict irregular workloads using history information alone.
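A weighted-average predictor in the spirit of Aged Averages can be sketched as follows. The geometric weighting scheme, the decay constant, and the discrete speed levels are all illustrative assumptions, not Govil et al.'s exact formulation:

```python
def aged_average(utilizations, decay=0.5):
    """Predict next-interval CPU utilization as a geometrically weighted
    average of all past intervals, with the newest weighted most heavily.

    utilizations: per-interval utilizations in [0, 1], oldest first.
    """
    estimate = 0.0
    for u in utilizations:
        estimate = decay * estimate + (1 - decay) * u
    return estimate

def pick_speed(estimate, speeds=(0.25, 0.5, 0.75, 1.0)):
    """Choose the slowest available speed that covers the predicted load."""
    return next((s for s in speeds if s >= estimate), speeds[-1])
```

Because every past interval contributes, a single anomalous interval perturbs the prediction less than in Past, which looks only at the most recent one.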

3.5.4. Intertask Approaches. Intertask DVS algorithms assign different speeds for different tasks. These speeds remain fixed for the duration of each task's execution. Weissel and Bellosa [2002] have developed an intertask algorithm that uses hardware events as the basis for choosing the clock frequencies for different processes. The motivation is that, as processes execute, they generate various hardware events (e.g., cache misses, IPC) at varying rates that relate to their performance and energy dissipation. Weissel and Bellosa's technique involves monitoring these event rates using hardware performance counters and attributing them to the processes that are executing. At each context switch, the heuristic adjusts the clock frequency for the process being activated based on its previously measured event rates. To do this, the heuristic refers to a previously constructed table of event rate combinations, divided into frequency domains. Weissel and Bellosa construct this table through a detailed offline simulation where, for different combinations of event rates, they determine the lowest clock frequency that does not violate a user-specified threshold for performance loss.

Intertask DVS has two drawbacks. First, task workloads are usually unknown until tasks finish running. Thus traditional algorithms either assume perfect knowledge of these workloads or estimate future workloads based on prior workloads. When workloads are irregular, estimating them is nontrivial. Flautner et al. [2001] address this problem by classifying all workloads as interactive episodes and producer-consumer episodes and using separate DVS algorithms for each. For interactive episodes, the clock frequency used depends not only on the predicted workload but also on a perception threshold that estimates how fast the episode would have to run to be acceptable to a user. To recover from prediction error, the algorithm also includes a panic threshold. If a session lasts longer than this threshold, the processor immediately switches to full speed.

The second drawback of intertask approaches is that they are unaware of program structure. A program's structure may provide insight into the work required for it, insight that, in turn, may provide opportunities for techniques such as voltage scaling. Within specific program regions, for example, the processor may spend significant time waiting for memory or disk accesses to complete. During the execution of these regions, the processor can be slowed down to meet pipeline stall delays.


3.5.5. Intratask Approaches. Intratask approaches [Lee and Sakurai 2000; Lorch and Smith 2001; Gruian 2001; Yuan and Nahrstedt 2003] adjust the processor speed and voltage within tasks. Lee and Sakurai [2000] developed one of the earliest approaches. Their approach, run-time voltage hopping, splits each task into fixed-length timeslots. The algorithm assigns each timeslot the lowest speed that allows it to complete within its preferred execution time, which is measured as the worst case execution time minus the elapsed execution time up to the current timeslot. An issue not considered in this work is how to divide a task into timeslots. Moreover, the algorithm is pessimistic since tasks could finish before their worst case execution time. It is also insensitive to program structure.
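The per-timeslot speed choice can be sketched as follows; the discrete speed levels and the simple cycles-over-time-budget formulation are our own illustrative rendering of the run-time voltage hopping idea:

```python
def slot_speed(slot_wcec, elapsed, wcet, f_max=1.0,
               levels=(0.25, 0.5, 0.75, 1.0)):
    """Lowest discrete speed letting the next timeslot finish within its
    preferred execution time (worst case execution time minus time
    already elapsed).

    slot_wcec: worst-case cycle count of the next timeslot.
    elapsed:   time already spent on the task.
    wcet:      the task's worst case execution time.
    """
    budget = wcet - elapsed  # preferred execution time for this slot
    if budget <= 0:
        return levels[-1]    # out of slack: run at full speed
    required = slot_wcec / (budget * f_max)  # fraction of full speed needed
    for lvl in levels:
        if lvl >= required:
            return lvl
    return levels[-1]
```

Early slots see a large budget and get low speeds; as elapsed time eats into the worst-case allowance, later slots are pushed toward full speed, which reflects the algorithm's built-in pessimism.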

There are many variations on this idea. Two operating system-level intratask algorithms are PACE [Lorch and Smith 2001] and Stochastic DVS [Gruian 2001]. These algorithms choose a speed for every cycle of a task's execution based on the probability distribution of the task workload measured over previous cycles. The difference between PACE and Stochastic DVS lies in their cost functions. Stochastic DVS assumes that energy is proportional to the square of the supply voltage, while PACE assumes it is proportional to the square of the clock frequency. These simplifying assumptions are not guaranteed to hold in real systems.

Dudani et al. [2002] have also developed intratask policies at the operating system level. Their work aims to combine intratask voltage scaling with the earliest-deadline-first scheduling algorithm. Dudani et al. split each program that the scheduler chooses to execute into two subprograms. The first subprogram runs at the maximum clock frequency, while the second runs slow enough to keep the combined execution time of both subprograms below the average execution time for the whole program.

In addition to these schemes, there have been a number of intratask policies implemented at the compiler level. Shin and Kim [2001] developed one of the first of these algorithms. They noticed that programs have multiple execution paths, some more time consuming than others, and that whenever control flows away from a critical path, there are opportunities for the processor to slow down and still finish the programs within their deadlines. Based on this observation, Shin and Kim developed a tool that profiles a program offline to determine worst case and average cycles for different execution paths, and then inserts instructions to change the processor frequency at the start of different paths based on this information.

Another approach, program checkpointing [Azevedo et al. 2002], annotates a program with checkpoints and assigns timing constraints for executing code between checkpoints. It then profiles the program offline to determine the average number of cycles between different checkpoints. Based on this profiling information and timing constraints, it adjusts the CPU speed at each checkpoint.

Perhaps the most well known approach is by Hsu and Kremer [2003]. They profile a program offline with respect to all possible combinations of clock frequencies assigned to different regions. They then build a table describing how each combination affects execution time and power consumption. Using this table, they select the combination of regions and frequencies that saves the most power without increasing runtime beyond a threshold.

3.5.6. The Implications of Memory Bounded Code. Memory bounded applications have become an attractive target for DVS algorithms because the time for a memory access, on most commercial systems, is independent of how fast the processor is running. When frequent memory accesses dominate the program execution time, they limit how fast the program can finish executing. Because of this "memory wall", the processor can actually run slower and save a lot of energy without losing as much performance as it would if it were slowing down a compute-intensive code.


Fig. 16. Making the best of the memory wall.

Figure 16 illustrates this idea. For ease of exposition, we call a sequence of executed instructions a task. Part (a) depicts a processor executing three tasks. Task 1 is followed by a memory load instruction that causes a main memory access. While this access is taking place, task 2 is able to execute since it does not depend on the results of task 1. However task 3, which comes after task 2, depends on the result of task 1, and thus can execute only after the memory access completes. So immediately after executing task 2, the processor sits idle until the memory access finishes. What the processor could have done is illustrated in part (b). Here, instead of running task 2 as fast as possible and idling until the memory access completes, the processor slows down the execution of task 2 to overlap exactly with the main memory access. This, of course, saves energy without impairing performance.

But there are many idealized assumptions at work here, such as the assumption that a DVS algorithm can predict with complete accuracy a program's future behavior and switch the clock frequency without any hidden costs. In reality, things are far more complex. A fundamental problem is how to detect program regions during which the processor stalls, waiting for a memory, disk, or network access to complete. Modern processors contain counters that measure events such as cache misses, but it is difficult to extrapolate the nature of these stalls from observing these events.

Nevertheless, a growing number of researchers are addressing the DVS problem from this angle; that is, they are developing techniques for modulating the processor frequency based on memory access patterns. Because it is difficult to reason abstractly about the complex events occurring in modern processors, the research in this area has a strong experimental flavor; many of the techniques used are heuristics that are validated through detailed simulations and experiments on real systems.

One recent work is by Kondo and Nakamura [2004]. It is an interval-based approach driven by cache misses. The heuristic periodically calculates the number of outstanding cache misses and increments one of three counters depending on whether the number of outstanding misses is zero, one, or greater than one. The cost function that Kondo and Nakamura use expresses the memory boundedness of the code as a weighted sum of the three counters. In particular, the third counter, which is incremented each time there are multiple outstanding misses, receives the heaviest weight. At fixed-length intervals, the heuristic compares the memory boundedness, so computed, with respect to an upper and lower threshold. If the memory boundedness is greater than the upper threshold, then the heuristic decreases the frequency and voltage by one setting; otherwise, if the memory boundedness is below the lower threshold, then it increases the frequency and voltage settings by one unit.
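This counter-and-threshold scheme can be sketched as follows. The specific weights and thresholds are illustrative assumptions; the text specifies only that the multiple-outstanding-misses counter receives the heaviest weight:

```python
def memory_boundedness(samples, weights=(0.0, 1.0, 4.0)):
    """Weighted sum of three counters tracking sampled cycles with zero,
    one, or multiple outstanding cache misses."""
    counts = [0, 0, 0]
    for outstanding in samples:
        counts[min(outstanding, 2)] += 1  # bucket: 0, 1, or >1 misses
    return sum(w * c for w, c in zip(weights, counts))

def adjust_setting(setting, boundedness, low, high, n_settings=8):
    """Step the frequency/voltage setting down by one when memory bound,
    up by one when CPU bound (higher setting = faster)."""
    if boundedness > high:
        return max(setting - 1, 0)               # memory bound: slow down
    if boundedness < low:
        return min(setting + 1, n_settings - 1)  # CPU bound: speed up
    return setting
```

Moving by only one setting per interval keeps the controller stable: a brief burst of misses cannot drag the frequency straight to the bottom.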


Stanley-Marbell et al. [2002] propose a similar approach, the Power Adaptation Unit (PAU). This is a finite state machine that detects the program counter values for which memory stalls often occur. When a stall is initially detected, the unit enters a transient state where it counts the number of stall cycles; once the stall cycles exceed a threshold, it marks the initial program counter value as hot. Thus, the next time the program counter assumes the same value, the PAU automatically reduces the clock frequency and voltage.

Another recent work is by Choi et al. [2004]. Their work is based on an execution time model where a program's total runtime is split into two parts: time that the program spends on-chip, and time that it spends in main memory accesses. The on-chip time can be calculated from the number of on-chip instructions, the average cycles for each of these instructions, and the processor clock frequency. The off-chip time is similarly a function of the number of off-chip (memory reference) instructions, the average cycles per off-chip instruction, and the memory clock frequency.

Using this model, Choi et al. [2004] derive a formula for the lowest clock frequency that would keep the performance loss within a tolerable threshold. They find that the key to computing this formula is determining the average number of cycles for an on-chip instruction (CPI_avg^on); this essential information determines the ratio of on-chip to off-chip instructions. They also found that the average cycles per instruction (CPI_avg) is a linear function of the average stall cycles per instruction (SPI_avg). This reduced their problem to calculating SPI_avg. To calculate this, they used data-cache miss statistics provided by hardware counters. Reasoning that the cycles spent in stalls increase with respect to the number of cache misses, Choi et al. built a table that associated different cache miss ranges with different values of SPI_avg; this allows their heuristic to calculate in real time the workload decomposition and appropriate clock frequency.
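The logic of such a formula can be sketched under the two-part time model T(f) = W_on/f + T_off, where W_on is the on-chip work in cycles and T_off is frequency-independent memory time. This is our own simplified rendering of the model's reasoning, not Choi et al.'s exact derivation:

```python
def lowest_frequency(w_on, t_off, f_max, delta):
    """Lowest CPU frequency keeping execution time within (1 + delta)
    of the full-speed time, under T(f) = W_on/f + T_off.

    w_on:  on-chip work in cycles (instruction count times on-chip CPI).
    t_off: time spent in main memory accesses, unaffected by CPU frequency.
    delta: tolerable fractional performance loss.
    """
    t_full = w_on / f_max + t_off            # execution time at full speed
    budget = (1 + delta) * t_full - t_off    # time left for on-chip work
    return min(f_max, w_on / budget)
```

The key property falls out directly: the larger T_off is relative to the on-chip time, the lower the frequency can drop for the same performance-loss budget, which is why memory-bound code is such an attractive DVS target.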

Unlike the previous approaches, which attempt to monitor memory boundedness, a recent technique by Hsu and Feng [2004] attempts to monitor CPU boundedness. Their algorithm periodically measures the rate at which instructions are executed (expressed in millions of instructions per second) to determine how compute-intensive the workload is. The authors find that this approach is as accurate as other approaches that measure memory boundedness, but may be easier to implement because of its simpler cost model.

3.5.7. Dynamic Voltage Scaling In Multiple Clock Domain Architectures. A Globally Asynchronous, Locally Synchronous (GALS) chip is split into multiple domains, each of which has its own local clock. Each domain is synchronous with respect to its clock, but the different domains are mutually asynchronous in that they may run at different clock frequencies. This design has three advantages. First, the clocks that power different domains are able to distribute their signals over smaller areas, thus reducing clock skew. Second, the effects of changing the clock frequency are felt less outside the given domain. This is an important advantage that GALS has over conventional CPUs. When a conventional CPU scales its clock frequency, all of the hardware structures that receive the clock signal slow down, causing widespread performance loss. In GALS, on the other hand, one can choose to slow down some parts of the circuit while allowing others to operate at their maximum frequencies. This creates more opportunities for saving energy. In compute-bound applications, for example, a GALS system can keep the critical paths of the circuit running as fast as possible but slow down other parts of the circuit.

Nevertheless, it is nontrivial to create voltage scaling algorithms for GALS processors. First, these processors require special mechanisms allowing the different domains to communicate and synchronize. Second, it is unclear how to split the processor into domains. Because communication energy can be large, domains must be chosen so that there is minimal communication between them. One possibility is to cluster similar functionalities into the same domain. In a GALS processor developed by Semeraro et al. [2002], all the integer execution units are in the same domain. If instead the integer issue queue were in a different domain than the integer ALU, these two units would waste more energy communicating since they are often accessed together.

Another factor that makes it hard to design algorithms is that one needs to consider possible interdependencies between domains. For example, if one domain is slowed down, data may take more time to move through it and reach other domains. This may impact the overall congestion in the circuit and affect performance. A DVS algorithm would need to be aware of these global effects.

Numerous researchers have tackled the complex problem of controlling GALS systems [Magklis et al. 2003; Marculescu 2004; Dropsho et al. 2004; Semeraro et al. 2002; Iyer and Marculescu 2002b]. We follow with two recent examples, the first being an interval-based heuristic and the second being a compiler-based heuristic.

Semeraro et al. [2002] developed one of the first runtime algorithms for dynamic voltage scaling in a GALS system. They observed that each domain has an instruction issue queue that indicates how fast instructions are flowing through the pipeline, thus giving insight into the workload. They propose to monitor the issue queue utilization over fixed-length intervals and to respond swiftly to sudden changes by rapidly increasing or decreasing the domain's clock frequency, but to decrease the clock frequency very gradually when there are few observable changes in the workload.
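This fast-react, slow-decay behavior can be sketched as a per-domain controller. All constants (step sizes, the change threshold, the frequency bounds) are illustrative assumptions, not values from Semeraro et al.'s design:

```python
def next_frequency(freq, util, prev_util, f_min=0.2, f_max=1.0,
                   attack=0.1, decay=0.01, change_thresh=0.15):
    """One control step for a GALS domain driven by issue-queue utilization:
    react sharply to sudden workload changes, otherwise let the frequency
    drift slowly downward."""
    delta = util - prev_util
    if delta > change_thresh:
        freq += attack   # sudden rise in utilization: speed up quickly
    elif delta < -change_thresh:
        freq -= attack   # sudden drop: slow down quickly
    else:
        freq -= decay    # quiet period: decay gradually to probe for slack
    return min(f_max, max(f_min, freq))
```

The gradual decay during quiet periods is what harvests energy: the domain creeps downward until the workload pushes back with a sudden utilization change.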

One of the first compiler-based algorithms for controlling GALS systems is by Magklis et al. [2003]. Their main algorithm is called the shaker algorithm due to its resemblance to a salt shaker. It repeatedly traverses the Directed Acyclic Graph (DAG) representation of a program from root to leaves and from leaves to root, searching for operations whose energy dissipation exceeds a threshold. It stretches out the workload of such operations until the energy dissipation is below a given threshold, repeating this process until either no more operations dissipate excessive energy or all of the operations have used up their slack. The information provided by this algorithm is then used by a compiler to select clock frequencies for different GALS domains in different program regions.

3.6. Resource Hibernation

Though switching activity causes power consumption, computer components consume power even when idle. Resource hibernation techniques power down components during idle periods. A variety of heuristics have been developed to manage the power settings of different devices such as disk drives, network interfaces, and displays. We now give examples of these techniques.

3.6.1. Disk Drives. Disk drives can account for a large percentage of a system's total energy dissipation, and many techniques have been developed to attack this problem [Gurumurthi et al. 2003; Li et al. 2004]. Most of the power loss stems from the rotating platter during disk activity and idle periods. To save energy, operating systems tend to pause disk rotation after a fixed period of inactivity and restart it during the next disk access. The decision of when to spin down an idle disk involves tradeoffs between power and performance. High idle-time thresholds lead to more energy loss but better performance, since the disk need not restart to process requests. Low thresholds allow a disk to spin down sooner when idle but increase the probability of an immediate restart to process a spontaneous request. These immediate restarts, called bumps, waste energy and cause delays.

Fig. 17. Predictive disk management.

One of the techniques used at the operating system level is Predictive Dynamic Threshold Adjustment. This technique adjusts at runtime the idleness threshold, or acceptable window of time before an idle disk can be spun down. If a disk immediately spins up after spinning down, this technique increases the idleness threshold. Otherwise, it decreases the threshold. This is an example of a technique that uses past disk usage information to predict future usage patterns. Such techniques are less effective at dealing with irregular workloads, as Figure 17 illustrates. Part (a) depicts a disk usage pattern in which active phases are abruptly followed by idle phases, and the idle phases keep getting longer. Dynamic threshold adjustment (b) abruptly restarts the disk every time it spins it down and keeps the disk activated during long idle phases. To improve on this technique, an operating system can cluster disk requests (c), thereby lengthening the idle periods (d) in which it can spin down the disk. To create opportunities for clustering, the algorithm of Papathanasiou and Scott [2002] delays processing nonurgent disk requests and reduces the number of remaining requests by aggressively prefetching disk data into memory. The delayed requests accumulate in a queue for later processing in a more voluminous transaction. Thus, the

disk may remain uninterrupted in a low-power, inactive state for longer periods.However, dependencies between disk re-quests can sometimes make it impossibleto apply clustering.
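As a concrete illustration, the threshold adaptation described above can be sketched in a few lines. This is our own simplified model, not the actual algorithm from the literature; the class name, the multiplicative adjustment factors, and the "bump window" are all illustrative assumptions.

```python
# Sketch of predictive dynamic threshold adjustment: the idleness
# threshold grows after a bump (an immediate restart) and shrinks
# after a successful spin-down. All constants are illustrative.

class ThresholdAdjuster:
    def __init__(self, threshold=2.0, grow=2.0, shrink=0.5,
                 min_t=0.5, max_t=60.0):
        self.threshold = threshold   # seconds of idleness before spin-down
        self.grow = grow             # multiplicative increase after a bump
        self.shrink = shrink         # multiplicative decrease otherwise
        self.min_t, self.max_t = min_t, max_t

    def observe_spin_down(self, seconds_until_next_request, bump_window=1.0):
        """Adjust the threshold after each spin-down, given how soon
        the next request arrived."""
        if seconds_until_next_request < bump_window:
            # Immediate restart ("bump"): spinning down was premature,
            # so require a longer idle period next time.
            self.threshold = min(self.threshold * self.grow, self.max_t)
        else:
            # The disk stayed down long enough to save energy, so try
            # spinning down sooner next time.
            self.threshold = max(self.threshold * self.shrink, self.min_t)
        return self.threshold
```

The same skeleton, with different adjustment rules, underlies many of the adaptive-threshold schemes discussed in this section.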

Moreover, idle periods for spinning down the disk are not always available; large-scale servers, for example, are continuously processing heavy workloads. Gurumurthi et al. [2003] have recently investigated an alternative strategy for server environments called Dynamic RPM control (DRPM). Instead of spinning down the disk fully, DRPM modulates the rotational speed (in rotations per minute) according to workload demands, thereby eliminating the overheads of transitioning between low-power and fully active states. As part of their study, Gurumurthi et al. develop power models for a disk drive that supports 15 different RPM levels. Using these models, they propose a simple heuristic for RPM modulation. The heuristic involves some cooperation between the disks and a controller that oversees the disks. The controller initially defines a range of RPM levels for each disk; the disks modulate their RPMs within this range based on their demands. They periodically monitor their input queues and, whenever the number of requests exceeds a threshold, increase their RPM settings by 1. Meanwhile, the controller oversees the response times in the system. When the response time is above a threshold, this is a sign of performance degradation, and the controller immediately requests the disks to switch to full power. When, on the other hand, the response times are sufficiently low, this is a sign that the disks could be slowed down further, and the controller increases the minimum RPM threshold setting.
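A minimal sketch of this disk/controller cooperation, with our own illustrative names and constants: each disk steps its RPM level up when its request queue grows too long, while the controller either forces full speed (on degradation) or widens the permitted slow-down range (on slack). Stepping down one level under light load is our assumption; the description above specifies only the step-up rule.

```python
# Hypothetical sketch of a DRPM-style heuristic. Levels, thresholds,
# and the light-load step-down rule are illustrative assumptions.

N_LEVELS = 15                          # the study models 15 RPM levels

class Controller:
    def __init__(self):
        self.max_steps_down = 0        # how far below full speed disks may go

    def on_response_time(self, disks, resp_ms, hi_ms=20.0, lo_ms=5.0):
        if resp_ms > hi_ms:            # performance degradation:
            self.max_steps_down = 0    # revoke the slow-down allowance
            for d in disks:
                d.level = N_LEVELS - 1 # and force every disk to full speed
        elif resp_ms < lo_ms:          # ample slack:
            self.max_steps_down = min(self.max_steps_down + 1, N_LEVELS - 1)

class Disk:
    def __init__(self):
        self.level = N_LEVELS - 1      # start at full speed

    def on_queue_check(self, queue_len, controller, q_threshold=8):
        floor = N_LEVELS - 1 - controller.max_steps_down
        if queue_len > q_threshold:    # backlog: one RPM level faster
            self.level = min(self.level + 1, N_LEVELS - 1)
        else:                          # light load: drift one level slower
            self.level = max(self.level - 1, floor)
```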

3.6.2. Network Interfaces. Network interfaces such as wireless cards present a unique challenge for resource hibernation techniques because the naive solution of shutting them off would disconnect hosts from other devices with which they may have to communicate. Thus power management strategies also include protocols for synchronizing the communication of these devices with other devices in the network. Kravets et al. [1998] have developed a heuristic for doing this at the operating system level. One of its key features is that it uses timers to track the idleness of devices. If a device should be transmitting but its idle timer expires, it enters a listening mode or goes to sleep. To decide how long devices should remain idle before shifting into listening mode, Kravets et al. apply an adaptive threshold algorithm. When the algorithm detects communication activity, it decreases the acceptable hibernation time; when it detects idle periods, it increases the time.

Threshold-based schemes such as this one allow a network card to remain idle for an extended length of time before shutting down. Hom and Kremer [2003] have argued that compilers can shut down a card sooner using a model of future program behavior. Their compiler-based hibernation technique targets a simplified scenario in which a mobile device's virtual memory resides on a server. Page faults require activating the device's wireless card so that pages can be sent from the server to the device. The technique identifies program regions where the card should hibernate based on the nature of array accesses in each region. If an accessed array is not in local memory, the card is activated to transfer it from the remote server. To determine which arrays are in memory during each region's execution, the compiler simulates a least-recently-used memory bank.

3.6.3. Displays. Displays are a major source of power consumption. Normally, an operating system dims the display after a sufficiently long interval with no user response. This heuristic may not coincide with a user's intentions.

Two examples of more advanced techniques that manage displays based on user activity are face-off and zoned backlighting. They have yet to be implemented in commercial systems.

Face-off [Dalton and Ellis 2003] is an experimental face recognition system on an IBM T21 Thinkpad laptop running Red Hat Linux. It periodically photographs the monitor's perspective and detects a face through large regions of skin color. It then controls the display using the ACPI interface. Its developers tested it on a long network transfer during which the user looked away from the screen. Face-off saved more power than the default display manager. A drawback is the repetitive polling for face detection. For future work, the authors suggest detecting a user's presence through low-power heat microsensors. These sensors could trigger the face recognition algorithm when they detect a user.

Zoned backlighting [Flinn and Satyanarayanan 1999] targets future display systems that will allow software to selectively adjust the brightness of different regions of the screen based on usage patterns and battery drain. Researchers at Carnegie Mellon have proposed ideas for exploiting these displays. The underlying assumption is that users are interested in only a small part of the screen. One idea is to make foreground windows with keyboard and mouse focus brighter than background windows. Another is to scale images to allow most of the screen to dim when the battery is low.

Fig. 18. Performance versus power.

3.7. Compiler-Level Power Management

3.7.1. What Compilers Can Do. There are many ways a compiler can help reduce power, regardless of whether a processor explicitly supports software-controlled power reduction. Aside from generating code that reconfigures hardware units or activates the power reduction mechanisms we have seen, compilers can apply common performance-oriented optimizations that also save energy by reducing execution time, such as Common Subexpression Elimination, Partial Redundancy Elimination, and Strength Reduction. However, some performance optimizations increase code size or parallelism, sometimes increasing resource usage and peak power dissipation. Examples include Loop Unrolling and Software Pipelining. Researchers [Givargis et al. 2001] have developed models for relating performance and power, but these models are relative to specific architectures; there is no fixed relationship between performance and power across all architectures and applications. Figure 18 compares the power dissipation and execution time of doing two additions in parallel (as in a VLIW architecture) (a) and doing them in succession (b). Doing two additions in parallel activates twice as many adders over a shorter period. It induces higher peak resource usage but fewer execution cycles and is better for performance. To determine which scenario saves more energy, one needs more information, such as the peak power increase in scenario (a), the execution cycle increase in scenario (b), and the total energy dissipation per cycle for (a) and (b). For ease of illustration, we have depicted the peak power in (a) to be twice that of (b), but all of these parameters may vary with respect to the architecture.
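To see why the extra parameters are needed, consider an idealized worked example (our own, not taken from Figure 18's actual numbers):

```latex
% Suppose the sequential version (b) draws power $P$ for time $T$,
% and the parallel version (a) draws peak power $2P$ (twice the
% adders) for time $T/2$ (half the cycles). Then
\begin{align*}
E_{(a)} &= 2P \cdot \tfrac{T}{2} = PT, \\
E_{(b)} &= P \cdot T = PT .
\end{align*}
```

In this idealized model the two energies coincide and only peak power and latency differ. In practice, the peak-power increase in (a) and the cycle increase in (b) rarely scale so exactly, which is precisely why the per-cycle energy figures mentioned above must be measured before one scenario can be declared more energy efficient.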

Compilers can also reduce the energy dissipated in memory accesses by reducing the number of accesses. One way is to eliminate redundant load and store operations. Another is to keep data as close as possible to the processor, preferably in the registers and lower-level caches, using aggressive register allocation techniques and optimizations that improve cache usage. Several loop transformations alter data traversal patterns to make better use of the cache. Examples include Loop Interchange, Loop Tiling, and Loop Fusion. The assignment of data to memory locations also influences how long data remains in the cache. Though all these techniques reduce power consumption along one or more dimensions, they might be suboptimal along others. For example, to exploit the full bandwidth a banked memory architecture provides, one may need to disperse data across multiple memory banks at the expense of cache locality and energy.
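Loop interchange, one of the transformations named above, can be illustrated briefly. The sketch below (function names are our own) shows the same reduction before and after interchange; for a row-major array, the interchanged version's inner loop walks one row contiguously, which is what improves cache behavior. In Python the effect is mainly pedagogical, since compilers apply this transformation to C/Fortran-style loop nests.

```python
# Loop interchange: same result, cache-friendlier traversal order.

def column_major_sum(a):
    """Original loop nest: the inner loop jumps between rows,
    touching one element per row (large stride)."""
    rows, cols = len(a), len(a[0])
    total = 0
    for j in range(cols):          # outer loop over columns
        for i in range(rows):      # inner loop strides across rows
            total += a[i][j]
    return total

def interchanged_sum(a):
    """After loop interchange: the inner loop visits one row's
    elements consecutively (unit stride in row-major layout)."""
    rows, cols = len(a), len(a[0])
    total = 0
    for i in range(rows):
        for j in range(cols):
            total += a[i][j]
    return total
```

Both functions compute the same value; only the traversal pattern, and hence the expected cache (and energy) behavior on a real machine, differs.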

3.7.2. Remote Compilation and Remote Execution. A new area of research for compilers involves compiling and executing applications jointly between mobile devices and more powerful servers to reduce execution time and increase battery life. This area is fraught with challenges, such as deciding which program sections to compile or execute remotely. Other issues include partitioning applications between multiple servers and handhelds, application migration, and fault tolerance. Though researchers have yet to address all these issues, a number of studies provide valuable insights into the benefits of partitioning.

Recent studies of low-power Java application partitioning are by Palm et al. [2002] and Chen et al. [2003]. Palm et al. compare the energy costs of remote and local compilation and optimization. They examine four scenarios in which a handheld downloads code from a server and executes it locally. In the first two scenarios, the handheld downloads bytecode, compiles it, and executes it. In the remaining scenarios, the server compiles the bytecode and the handheld downloads the native code stream. Palm et al. find that when a method's compilation time exceeds its execution time, remote compilation is preferable to local compilation. However, if no optimizations are used, remote compilation can be more expensive due to the cost of downloading code from the server. Overall, the largest energy savings are obtained by compiling and optimizing code remotely but sending results to the handheld exactly when needed to minimize idle time.

While Palm et al. [2002] restrict their studies to local and remote compilation, Chen et al. [2003] also examine the benefits of executing compiled Java code remotely. Their testbed consists of a 750 MHz SPARC workstation server and a client handheld running a microSPARC-IIep embedded processor. The programmer initially annotates a set of candidate methods that the client may execute remotely. Before calling any of these methods, the client decides where to compile and execute the method based on computation and communication costs. It also selects optimization levels for compilation. If remote execution is favorable, the client transfers all method data to the server via the object serialization interface. The server uses reflection to invoke the method and serialization to send back the results. Chen et al. evaluate different compilation and execution strategies in this framework through simulation and analytical models. They consider seven partitioning strategies, five of which are static and two of which are dynamic. While the static strategies fix the partitioning decision for all methods of a program, the dynamic ones decide the best strategy for each method at call time. Chen et al. find that the best static strategy varies with input size and communication channel quality. When channel quality is low, the client's network interface must send data at a higher transmission power to reach the server. For large inputs and good channel quality, static remote execution prevails over the other static strategies. Overall, Chen et al. find that the highest energy savings are achieved by a dynamic strategy that decides where to compile and execute each method.
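The flavor of such a call-time decision can be sketched with a toy cost model. This is our own illustration, not the authors' actual equations: the client compares the energy of local execution against the energy of shipping arguments to the server, waiting, and receiving results, with the per-byte transmit cost rising on a poor channel.

```python
# Hypothetical energy-based partitioning decision. All parameters
# (joule costs, names) are illustrative assumptions.

def should_execute_remotely(local_energy_j,
                            bytes_to_send, bytes_to_recv,
                            tx_j_per_byte, rx_j_per_byte,
                            idle_wait_j):
    """Return True if remote execution is estimated to cost less energy.

    tx_j_per_byte rises on a poor channel, since the network interface
    must transmit at higher power to reach the server.
    """
    remote_energy_j = (bytes_to_send * tx_j_per_byte
                       + bytes_to_recv * rx_j_per_byte
                       + idle_wait_j)        # energy spent awaiting results
    return remote_energy_j < local_energy_j
```

Even this toy model reproduces the qualitative finding above: as channel quality degrades (transmit energy per byte rises), the decision flips from remote back to local execution.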

The Proxy Virtual Machine [Venkatachalam et al. 2003] reduces the energy consumption of mobile devices by combining application partitioning with dynamic optimization. The framework (Figure 19) positions a powerful server infrastructure, the proxy, between mobile devices and the Internet. The proxy includes a just-in-time compiler and bytecode translator. A high-bandwidth, low-latency secure wireless connection mediates communication between the proxy and mobile devices in the vicinity. Users can request Internet applications as they normally would through a Web browser. The proxy intercepts these requests and reissues them to a remote Internet server. The server sends the proxy the application in mobile code format. The proxy verifies the application, compiles it to native code, and sends part of the generated code to the target. The remainder of the code executes on the proxy itself, significantly reducing energy consumption on the mobile device.

Fig. 19. The proxy VM framework.

3.7.3. The Limitations of Statically Optimizing Compilers. The drawback of compiler-based approaches is that a compiler's view is usually limited to the programs it is compiling. This raises two problems. First, compilers have incomplete information about how a program will actually behave, such as which control flow paths and loop iterations will be executed. Hence, statically optimizing compilers often rely on profiling data collected from a program prior to execution to determine what optimizations should be applied in different program regions. A program's actual runtime behavior may diverge from its behavior in these “simulation runs”, leading to suboptimal decisions.

The second problem is that compiler optimizations tend to treat programs as if they exist in a vacuum. While this may be ideal for specialized embedded systems, where the set of programs that will execute is determined ahead of time and where execution behavior is mostly predictable, real systems tend to be more complex. The events occurring within them continuously change and compete for processor and memory resources. For example, context switches abruptly shift execution from a region of one program to a completely different region of another program. If a compiler had generated code so that the two regions put a device into different power mode settings, there is an overhead to transition between these settings when execution flows from one region to the other.

3.7.4. Dynamic Compilation. Dynamic compilation addresses some of these problems by introducing a feedback loop. A program is compiled but is then also monitored as it executes. As the behavior of the program changes, possibly along with other changes in the runtime environment (e.g., resource levels), the program is recompiled to adapt to these changes. Since the program is continuously recompiled in response to runtime feedback, its code can be of significantly higher quality than code generated by a static compiler.

There are a number of scenarios where continuous compilation can be effective.


For example, as battery capacity decreases, a continuous compiler can apply more aggressive transformations that trade the quality of data output for reduced power consumption. (For example, a compiler can replace expensive floating point operations with integer operations, or generate code that dims the display.) However, there are tradeoffs. The cost function for a dynamic compiler needs to weigh the overhead of recompiling a program against the energy that can be saved. Moreover, there is the issue of how to profile an executing program in a minimally invasive, low-overhead way.

While there is a large body of work in dynamic compilation for performance [Kistler and Franz 2001; Kistler and Franz 2003], dynamic recompilation for power is still a young field with many open problems. One of the earliest commercial dynamic compilation systems is Transmeta's Crusoe processor [Transmeta Corporation 2003; Transmeta Corporation 2001], which we will discuss in Section 3.10. One of the earliest academic studies of power-aware dynamic recompilation was done by Unnikrishnan et al. [2002]. Their idea is to instrument critical program regions with sensitivity lists that detect changes in various energy budgets and to hot-swap optimized versions of these regions with the original versions based on these changes. Each optimized version is precomputed ahead of time using offline energy estimation techniques. The authors target a banked memory architecture and focus on loop transformations such as tiling, unrolling, and interchange. Their implementation is based on Dyninst, a tool that allows program regions to be patched at runtime.

Though dynamic compilation for power is still in its infancy, there is a growing body of work in runtime power monitoring. The work most relevant for dynamic compilation is that of Isci and Martonosi [2003]. These researchers have developed techniques for profiling the power behavior of programs at runtime using information from hardware performance counters. They target a Pentium 4 processor, which allows one to sample 18 different performance counter events simultaneously. For each of the core components of this processor, Isci and Martonosi have developed power models that relate its power consumption to its usage statistics. In their actual setup, they take two kinds of power readings in parallel, one based on current measurements and the other based on the performance counters. These readings give the total power as well as a breakdown of the power consumed among the various components. Isci and Martonosi find that their performance counter-based approach is nearly as accurate as the current measurements and is also able to distinguish between different phases in program execution.
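The shape of such a counter-driven model can be sketched as follows. This is a simplified illustration in the spirit of that approach, not the authors' fitted model: each component's power is estimated as a utilization-scaled maximum (with utilization derived from counter readings) plus a non-gated baseline. Component names and constants are our own assumptions.

```python
# Hypothetical performance-counter power model: per-component
# utilization in [0, 1] maps to watts. All constants are illustrative.

MAX_POWER_W = {"alu": 6.0, "l1_cache": 4.0, "l2_cache": 8.0, "fetch": 3.0}
BASELINE_W = 14.0    # clock tree, leakage, and other non-gated power

def component_power(utilization):
    """Map per-component utilization (from counter readings) to watts."""
    return {name: utilization[name] * MAX_POWER_W[name]
            for name in MAX_POWER_W}

def total_power(utilization):
    """Return estimated total watts plus the per-component breakdown,
    mirroring the 'total power and breakdown' readings described above."""
    breakdown = component_power(utilization)
    return BASELINE_W + sum(breakdown.values()), breakdown
```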

3.8. Application-Level Power Management

Researchers have been exploring how to give applications a larger role in power management decisions. Most of the recent work has two goals. The first is to develop techniques that enable applications to adapt to their runtime environment. The second is to develop interfaces allowing applications to provide hints to lower layers of the stack (i.e., operating systems, hardware) and, likewise, exposing useful information from the lower layers to applications. Although these are two separate goals, the techniques for achieving them can overlap.

3.8.1. Application Transformations and Adaptations. As a starting point, some researchers [Tan et al. 2003] have adopted an “architecture-centric” view of applications that allows for some high-level transformations. These researchers claim that an application's architecture consists of fundamental elements common to all applications, namely processes, event handlers, and communication mechanisms. Power is consumed when these elements interact during activities such as context switches and interprocess communication.

They propose a low-power application design methodology. Initially, an application is represented as a software architecture graph (SAG) [Tan et al. 2003], which captures how it has been built out of processes and events. It is then run through a simulator to measure its base energy consumption, which is compared to its energy consumption when transformations are applied to the SAG. These transformations include merging processes to reduce interprocess communication, replacing expensive IPC mechanisms with cheaper ones, and migrating computations between processes.

To determine the optimal set of transformations, the researchers use a greedy approach: they keep applying transformations that reduce energy until no such transformations remain. This results in an optimized version of the same application with a more energy-efficient mix of processes, events, and communication mechanisms.

Sachs et al. [2003] have explored a different kind of adaptation that trades the accuracy of computations for reduced energy consumption. They propose a video encoder whose compression efficiency can be varied by selectively skipping the motion search and discrete cosine transform phases of the encoding algorithm. The extent to which it skips these phases depends on parameters chosen by the adaptation algorithm. The target architecture is a processor with support for DVS and architectural adaptations (e.g., configurable caches). Two algorithms work side by side: one tunes the hardware parameters at the start of each frame, while the other tunes the parameters of the video encoder while a frame is being processed.

Yet another form of adaptation reduces energy consumption by trading off fidelity, the quality of the data presented to users. This can take many forms. One example of a fidelity reduction system is Odyssey [Flinn and Satyanarayanan 1999], an operating system and execution framework for multimedia and Web applications. It continuously monitors the resources used by applications and alerts applications whose resources fall below requested levels. In turn, the applications lower their quality of service until resources are plentiful. Originally, Odyssey supported four applications: a video player, a speech recognizer, a map viewer, and a Web browser. The video player, for instance, would download images from a server storing multiple copies at different compression levels. When bandwidth was low, it would download the compressed versions of images to transfer less data over the network. As battery resources became scarce, it would also reduce the size of the images to conserve energy.

One example of an operating system that assists application adaptation is ECOSystem [Zeng et al. 2002]. It treats energy as a commodity assigned to the different resources in a system; the idea is to allow resources to operate as long as they have enough of the commodity. Resource management in ECOSystem thus relies on the notion of “currentcy”, an abstraction that models the power resource as a monetary unit. Essentially, the different resources in a computer (e.g., ALU, memory) have prices for using them, and applications pay these prices with currentcy. Periodically, the operating system distributes a fixed amount of currentcy among applications, using the desired battery drain rate as the basis for deciding how much to distribute. Applications can use specific resources only if they can pay for them. As an application executes, it expends its currentcy, and it is interrupted once its currentcy is depleted.
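The currentcy mechanic can be sketched in a few lines. This is our own toy model of the accounting idea, not ECOSystem's implementation; the prices, epoch policy, and even splitting of the allowance are illustrative assumptions.

```python
# Hypothetical currentcy accounting: resources charge per use, tasks
# receive a periodic allowance derived from the target drain rate,
# and a task that cannot pay is blocked until the next epoch.

PRICE = {"cpu_ms": 0.002, "disk_io": 0.5, "net_kb": 0.1}   # currentcy units

class Task:
    def __init__(self):
        self.currentcy = 0.0

    def charge(self, resource, amount):
        """Pay for resource use; return False (task blocked) if funds
        are insufficient."""
        cost = PRICE[resource] * amount
        if cost > self.currentcy:
            return False          # interrupted until the next distribution
        self.currentcy -= cost
        return True

def distribute(tasks, target_drain_rate, epoch_s):
    """Each epoch, split the permitted energy outlay among tasks
    (evenly here; priorities could weight the split)."""
    allowance = target_drain_rate * epoch_s / len(tasks)
    for t in tasks:
        t.currentcy += allowance
```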

3.8.2. Application Hints. In one of the earliest works on application hints, Pereira et al. [2002] propose a software architecture containing two layers of API calls, one allowing applications to communicate with the operating system and the other allowing the operating system to communicate with the underlying hardware. The API layer exposed to applications allows them to drop hints to the operating system, such as the start times, deadlines, and expected execution times of tasks. With these hints, the operating system has a better understanding of the possible future behavior of programs and is thus able to make better DVS decisions.

Similar to this is the work by Anand et al. [2004], which considers a setting where applications adapt by deciding when to access different I/O devices based on the relative costs of accessing these devices. For example, it may be expensive to access data resident on a disk if the disk is powered down and would take a long time to restart. In such cases, an application may choose instead to access the data through the network. To aid in these adaptations, Anand et al. propose an API layer that gives applications control over all I/O devices but abstracts away the details of these devices. The API functions as a giant regulating dial that applications can tune to specify their desired power/performance tradeoffs and that transparently manages all the devices to achieve these tradeoffs. The API also gives applications more fine-grained control over devices they may want to access. One of its unique features is that it allows applications to issue ghost hints: if an application chose not to access a device that was deactivated, but would have saved more energy by accessing that device had it been activated, it can send the device a message to this effect. After a few of these ghost hints, the device automatically becomes activated, anticipating that the application will need to use it.

Heath et al. [2004] propose similar work to increase opportunities for spinning down idle hard disks and saving energy. In their framework, a compiler performs various code transformations to cluster disk accesses and then inserts hints that tell the operating system how long these clustered disk accesses will take. The operating system is then able to serve these disk requests in a single batch and power down the disk for longer periods of time.

3.9. Cross-Layer Adaptations

Because power consumption depends on decisions spanning the entire stack from transistors to applications, an increasingly popular research question is how to develop holistic approaches that integrate information from multiple levels (e.g., compiler, OS, hardware) into power management decisions. A number of systems are already being developed to explore cross-layer adaptations. We give four examples of such systems.

Forge [Mohapatra et al. 2003] is an integrated power management framework for networked multimedia applications. Forge targets a typical scenario where users request video streams for their handheld devices, and these requests are filtered by a remote proxy that transcodes the streams and transmits them to the users at the most energy-efficient QoS levels.

Forge integrates multiple levels of adaptations. The hardware level contains a frequency and voltage scaling interface, a network card interface (allowing the network card to be powered down when idle), and interfaces allowing various architectural parameters (e.g., cache ways, register file sizes) to be tuned. A level above the hardware are the operating system and compiler, which control these architectural knobs, and a level above the operating system is a distributed middleware framework, part of which resides on the handheld devices and part of which resides on the remote proxy. The middleware on each handheld device monitors the energy statistics of the device (through the OS) and sends feedback to the middleware on the proxy. The middleware on the proxy uses this feedback to decide how to regulate network traffic and at what QoS levels to stream requested video feeds.

Different heuristics are used for different adaptations. To determine the optimal cache parameters for a video feed, one simulates the feed offline at each QoS level for all possible variations of cache size and associativity and determines the most energy-optimal configuration. This configuration is automatically chosen when the feed executes. The cache configuration, in turn, affects how long it takes to decode each video frame. If the decoding time is less than the frame's deadline, the DVS algorithm slows down the processor so that the frame exactly meets its deadline. Meanwhile, on the proxy, the middleware layer continually receives requests for video feeds and information about the resource levels on various devices. When it receives a request from a user on a handheld device, it checks whether there is enough remaining energy on the device for it to receive the video feed at an acceptable quality. If there is, it transmits the feed at the highest quality that does not exceed the energy constraints of the device; otherwise, it rejects the user's request or sends the feed at a lower quality. When the proxy sends a video feed back to the users, it transmits the feed in bursts of packets, allowing the users' network cards to be shut down over longer idle periods to save energy.

Grace [Yuan and Nahrstedt 2003] is another cross-layer adaptation framework; it attempts to integrate dynamic voltage scaling, power-aware task scheduling, and QoS setting. Its target applications are real-time multimedia tasks with fixed periods and deadlines. Grace applies two levels of adaptations, global and local. Global adaptations respond to larger resource variations and are applied less frequently because they are more expensive. These adaptations are controlled by a central coordinator that continuously monitors battery levels and application workloads and responds to widespread changes in these variables. Local adaptations, on the other hand, respond to smaller workload variations within a task and are controlled by local adaptors. There are three local adaptors: one sets the CPU frequency, another schedules tasks, and a third adapts QoS parameters. Major changes in the runtime environment, such as low battery levels or wide variations in workload, trigger the central coordinator to decide upon a new set of global adaptations. The coordinator chooses the most energy-optimal adaptations by solving a constrained optimization problem. It then broadcasts its decision to the local adaptors, which implement these changes but are free to adjust the initial settings within the execution of specific tasks.

Another multilayer framework for multimedia applications has been developed by Fei et al. [2004]. Their framework consists of four layers divided into user space and system space. The system space consists of the target platform and the operating system running on top of it. Directly above the operating system, in user space, is a middleware layer that decides how to apply adaptations, guided by information it receives from the other layers. All applications execute on top of the middleware coordinator, which is responsible for deciding when to admit different applications into the system and how to select their QoS levels. When applications first enter the system, they store their meta-information inside the middleware layer. This includes information about the QoS levels they offer and the energy consumption at each of these levels. The middleware layer, in turn, monitors battery levels using counters inside the operating system. Based on the battery levels, the middleware decides on the most energy-efficient QoS setting for each application. It then admits applications into the system according to user-defined priorities.

A fourth example is the work of AbouGhazaleh et al. [2003], which explores how compilers and operating systems can interact to save energy. This work targets real-time applications with fixed deadlines and worst-case execution times. In this approach, the compiler instruments specific program regions with code that saves the remaining worst-case execution cycles into a register. The operating system periodically reads this register and adjusts the speed of the processor so that the task finishes by its worst-case execution time.
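The division of labor in this compiler/OS collaboration can be sketched as follows. This is our own simplified model, not the published algorithm: one class stands in for the register that compiler-inserted code updates, and the OS-side routine picks the slowest available frequency that can still retire the remaining worst-case cycles in the time left.

```python
# Hypothetical sketch of cooperative compiler/OS voltage scaling.
# Names, frequencies, and the discrete-frequency policy are
# illustrative assumptions.

class WcetRegister:
    """Stands in for the register that compiler-inserted code writes."""
    def __init__(self, total_wc_cycles):
        self.remaining_wc_cycles = total_wc_cycles

    def region_completed(self, wc_cycles_of_region):
        # Emitted by the compiler at the end of each instrumented region.
        self.remaining_wc_cycles -= wc_cycles_of_region

def pick_frequency_hz(reg, time_left_s, available_hz):
    """OS side: choose the slowest available frequency that still
    retires the remaining worst-case cycles before time runs out."""
    needed = reg.remaining_wc_cycles / time_left_s
    for f in sorted(available_hz):
        if f >= needed:
            return f
    return max(available_hz)   # cannot meet it; run as fast as possible
```

When a region finishes faster than its worst case, the remaining-cycle count drops, and the next OS read can legally select a lower frequency, which is where the energy savings come from.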

3.10. Commercial Systems

To understand what power management techniques industry is adopting, we examine the low-power techniques used in four widely used processors, the Pentium 4, the Pentium M, the PXA27x, and the Transmeta Crusoe. We then discuss three power management strategies by IBM, ARM, and National Semiconductor that are rapidly gaining in importance.

ACM Computing Surveys, Vol. 37, No. 3, September 2005.


228 V. Venkatachalam and M. Franz

3.10.1. The Pentium 4 Processor. Though its goal is high performance, the Pentium 4 processor also contains features to manage its power consumption [Gunther et al. 2001; Hinton et al. 2004]. One of Intel's goals in including these features was to prevent the Pentium's internal temperatures from becoming dangerously high due to the increased power dissipation. To meet this goal, the designers included a thermal detection and response mechanism that inhibits the processor clock whenever the observed temperature exceeds a safe threshold. To monitor temperature variations, the processor features a diode-based thermal sensor that sits at the hottest area of the chip and measures temperature via voltage drops across the diode. Whenever the temperature rises into the danger zone, the sensor issues a STOPCLOCK request, causing the main clock signal to be inhibited from reaching most of the processor until the temperature leaves the danger zone; this bounds the response time while the chip cools. The temperature sensor also provides temperature information to higher levels of software through output signals (some of which are in registers), allowing the compiler or operating system, for instance, to activate other techniques in response to the high temperatures.
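The detection-and-response loop can be modeled as a tiny state machine. The 85 °C trip point and the hysteresis band below are invented for illustration (Intel does not publish the exact thresholds); the hysteresis keeps the clock from flapping on and off around the threshold.

```python
# Toy model of diode-based thermal throttling with hysteresis.

DANGER_C = 85.0       # illustrative trip temperature
HYSTERESIS_C = 5.0    # must cool this far below the trip to resume

class ThermalMonitor:
    def __init__(self):
        self.clock_stopped = False

    def sample(self, temperature_c):
        if temperature_c >= DANGER_C:
            self.clock_stopped = True          # issue STOPCLOCK
        elif temperature_c <= DANGER_C - HYSTERESIS_C:
            self.clock_stopped = False         # resume the clock
        return self.clock_stopped
```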

The Pentium 4 also supports the low-power operating states defined by the Advanced Configuration and Power Interface (ACPI) specification, allowing software to control the processor power modes. In particular, it features a model-specific register allowing software to influence the processor clock. In addition to stopping the clock, the Pentium 4 features Intel Basic SpeedStep Technology, which allows two settings for the processor clock frequency and voltage, a high setting for performance and a low setting for power. The high setting is normally used when the computer is connected to a wall outlet, and the low setting is normally used when the computer is running on batteries, as in the case of a laptop.
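A two-point policy of this kind is easy to state; the sketch below also estimates the dynamic-power ratio between the two operating points using the usual P ∝ CV²f model. The frequency/voltage pairs are illustrative, not the Pentium 4's actual operating points.

```python
# Two operating points in the style of Basic SpeedStep: (Hz, volts).
HIGH = (2400e6, 1.5)    # wall power: favor performance
LOW = (1600e6, 1.2)     # battery: favor power

def operating_point(on_ac_power):
    return HIGH if on_ac_power else LOW

def dynamic_power_ratio():
    # P is proportional to C * V^2 * f; the capacitance C cancels, so
    # the low point here dissipates ~43% of the high point's power.
    (fh, vh), (fl, vl) = HIGH, LOW
    return (vl * vl * fl) / (vh * vh * fh)
```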

3.10.2. The Pentium M Processor. The Pentium M is the fruit of Intel's efforts to bring the Pentium 4 to the mobile domain. It carefully balances performance-enhancing features with several power-saving features that increase battery lifetime [Genossar and Shamir 2003; Gochman et al. 2003]. It uses three main strategies to reduce dynamic power consumption: reducing the total instructions and micro-operations executed, reducing the switching activity in the circuit, and reducing the energy dissipated per transistor switch.

To save energy, the Pentium M integrates several techniques that reduce the total switching activity. These include hardware for predicting idle units and inhibiting their clock signals, buses whose components are activated only when data needs to be transferred, and a technique called execution stacking, which clusters units that perform similar functions into similar regions so that the processor can selectively activate the parts of the circuit that will be needed by an instruction.

To reduce static power dissipation, the Pentium M incorporates low-leakage transistors in the caches. To further reduce both dynamic and leakage energy throughout the processor, the Pentium M supports an enhanced version of the Intel SpeedStep technology which, unlike its predecessor, allows the processor to transition between six different frequency and voltage settings.

3.10.3. The Intel PXA27x Processors. The Intel PXA27x processors used in many wireless handheld devices feature the Intel Wireless SpeedStep power manager [Intel Corporation 2004]. This power manager uses memory boundedness as the criterion for deciding how to manage the processor's power modes. It receives information from an idle profiler, which monitors the system idle thread, and a performance profiler, which monitors the processor's built-in performance counters to gather statistics pertaining to cache and TLB misses, instructions executed, and processor stalls. Using this information, the power manager determines how memory bounded the workload is and decides how to adjust the processor's power modes as well as its clock frequency and voltage.
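A memory-boundedness heuristic of this sort can be sketched as follows. The counter names, the 0.30/0.05 thresholds, and the decision rule are invented for illustration (Intel does not publish the Wireless SpeedStep policy); only the frequency bounds correspond to the PXA27x's roughly 104–624 MHz operating range. The intuition: when the core mostly waits on memory, lowering its clock costs little performance.

```python
# Counter-driven DVFS decision, sketched.

def classify(counters):
    # counters: dict with 'mem_stall_cycles' and 'cycles' deltas.
    stall_ratio = counters["mem_stall_cycles"] / counters["cycles"]
    if stall_ratio > 0.30:
        return "memory-bound"     # CPU mostly waits: scale down
    if stall_ratio < 0.05:
        return "cpu-bound"        # keep the core fast
    return "mixed"

def next_frequency(current_hz, counters, step_hz=104e6):
    kind = classify(counters)
    if kind == "memory-bound":
        return max(104e6, current_hz - step_hz)
    if kind == "cpu-bound":
        return min(624e6, current_hz + step_hz)
    return current_hz
```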

3.10.4. The Transmeta Crusoe Processor. Transmeta's Crusoe processor [Transmeta Corporation 2001, 2003] saves power not only by means of its LongRun power manager but also through its overall design, which shifts many of the complexities of instruction execution from hardware into software. The Crusoe relies on a code morphing engine, a layer of software that interprets or compiles x86 programs into the processor's native VLIW instructions, saving the generated code in a Translation Cache so that it can be reused. This layer of software is also responsible for monitoring executing programs for hotspots and reoptimizing code on the fly. As a result, features that are typically implemented in hardware (e.g., out-of-order execution) are instead implemented in software. This reduces the total on-chip real estate and its accompanying power dissipation. To further reduce power dissipation, Crusoe's LongRun modulates the clock frequency and voltage according to workload demands: it identifies idle periods of the operating system and scales the clock frequency and voltage accordingly.
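The actual LongRun policy is proprietary, but interval-based scaling of this flavor can be sketched as follows (the frequency table, target utilization, and decision rule are all invented): measure what fraction of the last interval the OS spent busy rather than idle, then pick the lowest frequency that would have kept that work under a target utilization.

```python
# Interval-based DVS in the spirit of LongRun, sketched.

FREQUENCIES = [300e6, 533e6, 800e6, 1000e6]  # illustrative settings (Hz)
TARGET_UTILIZATION = 0.8                     # headroom for bursts

def pick_frequency(busy_fraction, current_hz):
    # Cycles actually consumed per second at the current setting:
    demand_hz = busy_fraction * current_hz
    for f in FREQUENCIES:                    # ascending
        if demand_hz <= TARGET_UTILIZATION * f:
            return f
    return FREQUENCIES[-1]
```

For instance, a workload that keeps a 1 GHz setting only 30% busy fits comfortably at the 533 MHz setting.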

Transmeta's later-generation Efficeon processor contains enhanced versions of the code morphing engine and the LongRun power manager. Some news articles claim that the new LongRun includes techniques to reduce leakage, but the details are proprietary.

3.10.5. IBM Dynamic Power Management. IBM has developed an extensible operating system module [Brock and Rajamani 2003] that allows power management policies to be developed for a wide range of embedded devices. For portability, this module abstracts away from the underlying architecture and provides a simple interface allowing designers to specify device-specific parameters (e.g., available clock frequencies) and to integrate their own policies in a "plug-and-play" manner. The main idea is to model the underlying hardware as a state machine composed of different operating points and operating states. The operating points are clock frequency and voltage settings, while the operating states are power mode settings such as active, idle, and deep sleep. Designers specify the operating points and states for their target architecture and, using this abstraction, write policies for transitioning between the states. IBM has experimented with a wide range of policies for the PowerPC 405LP, from simple policies that adjust the frequency based on whether the processor is idle or active, to more complex policies that integrate information about application workloads and deadlines.
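The state-machine abstraction might be declared roughly as below. The interfaces, names, and operating-point values are invented for illustration (this is not IBM's actual API); the point is that the board description and the policy are separate, so a new policy plugs in as a single transition function.

```python
# Board description: declared once per target architecture.
OPERATING_POINTS = {            # name -> (frequency Hz, voltage V)
    "slow": (152e6, 1.0),
    "fast": (380e6, 1.3),
}
OPERATING_STATES = {            # state -> operating point used, if any
    "active": "fast",
    "idle": "slow",
    "deep_sleep": None,         # clocks gated; no operating point
}

class PowerManager:
    def __init__(self, policy):
        self.state = "active"
        self.policy = policy    # plug-and-play transition rule

    def on_event(self, event):
        self.state = self.policy(self.state, event)
        return OPERATING_STATES[self.state]

# A trivial idle/active policy, one of many that could be plugged in:
def simple_policy(state, event):
    return {"cpu_idle": "idle", "work_arrived": "active",
            "suspend": "deep_sleep"}.get(event, state)

pm = PowerManager(simple_policy)
```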

3.10.6. PowerWise and Intelligent Energy Management. National Semiconductor and ARM Inc. have collaborated to develop a single end-to-end energy management system for embedded processors. Their system brings together two technologies, ARM's Intelligent Energy Management (IEM) software, which modulates the processor speed based on workloads, and National Semiconductor's PowerWise Adaptive Voltage Scaling (AVS), which dynamically adjusts the supply voltage for a given clock frequency based on variations in temperature and other ambient system effects.

The IEM software consists of a three-level decision hierarchy of policies for deciding how fast the processor should run. At the bottom is a baseline algorithm that selects the optimal frequency based on estimating future task workloads from past workloads measured over a sliding history window. At the top is an algorithm more suited for interactive and media applications, which considers how fast these applications would need to run to deliver smooth performance to users. Between these two layers is an interface allowing applications to communicate their workload requirements directly. Each of these three layers combines its performance prediction with a confidence rating, which indicates how to compare its decision with that of other layers. Whenever there is a context switch, the IEM software analyzes the decisions and confidence ratings of each layer to finally decide how fast to run the processor.
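One plausible arbitration rule is sketched below. ARM does not publish the exact scheme, so the rule here, highest confidence wins with ties going to the higher layer, and all the numbers are illustrative guesses.

```python
# Combining three layers' speed predictions by confidence, sketched.

def decide_speed(predictions):
    # predictions: list of (layer_rank, predicted_utilization, confidence)
    # ordered bottom (0) to top (2); utilization in [0, 1] then drives
    # the frequency choice.
    best = max(predictions, key=lambda p: (p[2], p[0]))
    return best[1]

layers = [
    (0, 0.60, 0.5),  # baseline: sliding-window workload history
    (1, 0.75, 0.9),  # application-communicated requirement
    (2, 0.40, 0.7),  # interactive/media smoothness heuristic
]
```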

The PowerWise Adaptive Voltage Scaling technology [Dhar et al. 2002] aims to save more energy than conventional voltage-scaled systems by choosing the minimum acceptable voltage for a given clock frequency. Conventional DVS processors determine the appropriate voltage for a clock frequency by referring to a table that is developed offline using a worst-case model of ambient hardware characteristics. As a result, the voltages chosen are often higher than they need to be. PowerWise gets around this problem by adding a feedback loop. It continually monitors variations in temperature and other ambient system effects through performance counters and uses this information to dynamically adjust the supply voltage for each clock frequency. This results in an additional 45% energy savings over conventional voltage scaling.
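A closed loop of this shape can be sketched in a few lines. All numbers and the `slack_ok` interface are invented for illustration: instead of a worst-case table voltage, the controller steps the supply down while an on-chip monitor reports timing slack at the current frequency and temperature, and steps back up when the slack disappears.

```python
# Illustrative closed-loop adaptive voltage scaling.

V_MIN, V_MAX, V_STEP = 0.9, 1.4, 0.025   # volts; invented bounds

def adjust_voltage(v_now, slack_ok):
    # slack_ok: hardware monitor says the logic still meets timing
    # comfortably at the current frequency/voltage/temperature.
    if slack_ok:
        return max(V_MIN, round(v_now - V_STEP, 3))
    return min(V_MAX, round(v_now + V_STEP, 3))
```

Repeatedly applied, the loop converges to the lowest voltage the ambient conditions actually allow, rather than the worst-case table entry.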

3.11. A Glimpse Into Emerging Radical Technologies

Imagine a laptop that runs on hydrogen fuel or a mobile phone powered by the same type of engines that propel gigantic aircraft. Such ideas may sound far-fetched, but they are exactly the kinds of innovations that some researchers claim will eventually solve the energy problem. We close our survey by briefly discussing two of the more radical technologies that are likely to gain in importance over the next decade, both aimed at improving energy efficiency: fuel cells and MEMS systems.

3.11.1. Fuel Cells. Fuel cells [Dyer 2004] are being developed to replace the batteries used in mobile devices. Batteries have a limited capacity. Once depleted, they must be discarded unless they are rechargeable; recharging a battery can take several hours, and even rechargeable batteries eventually die. Moreover, battery technology has been advancing slowly. Building more efficient batteries still means building bigger batteries, and bigger, heavier batteries are not suited for small mobile devices.

Fuel cells are similar to batteries in that they generate electricity by means of a chemical reaction, but unlike batteries, fuel cells can in principle supply energy indefinitely. The main components of a fuel cell are an anode, a cathode, a membrane separating the anode from the cathode, and a link to transfer the generated electricity. The fuel enters the anode, where it reacts with a catalyst and splits into protons and electrons. The protons diffuse through the membrane, while the electrons are forced to travel through the link, generating electricity. When the protons reach the cathode, they combine with oxygen in the air to produce water and heat as by-products. If the fuel cell uses a water-diluted fuel (as some do), then this waste water can be recycled back to the anode.

Fuel cells have a number of advantages. One is that fuels (e.g., hydrogen) are abundantly available from a wide variety of natural resources, and many offer energy densities high enough to allow portable devices to run far longer than they do on batteries. Another is that refueling is significantly faster than recharging a battery; in some cases, it merely involves spraying more fuel into a tank. A third advantage is that there is no limit to how many times a fuel cell can be refueled. As long as it contains fuel, it generates electricity.

Several companies are aggressively developing fuel cell technologies for portable devices, including Micro Fuel Cell Inc., NEC, Toshiba, Medis, Panasonic, Motorola, Samsung, and Neah Systems. Already a number of prototype cells have emerged, as well as prototype mobile computers powered by hydrogen or alcohol-based fuels. Despite these rapid advances, some industry analysts estimate that it will take another 5 to 10 years for fuel cells to become commonplace in mobile devices.

Fuel cells also have several drawbacks. First, they can get very hot (e.g., 500–1000 degrees Celsius). Second, the metallic materials and mechanical components they are composed of can be quite expensive. Third, fuel cells are flammable. In particular, fuel-powered devices will require safety measures as well as more flexible laws allowing them inside airplanes.

3.11.2. MEMS Systems. Microelectromechanical systems (MEMS) are miniature versions of large-scale devices commonly used to convert mechanical energy into electrical energy. Researchers at MIT and the Georgia Institute of Technology [Epstein 2004] are exploring a radical way of using them to solve the energy problem. They are developing prototype millimeter-scale versions of the gigantic gas turbine engines that power airplanes and drive electrical generators. These microengines, they claim, will give mobile computers unprecedented amounts of untethered lifetime.

The microengines work on the same principles as their large-scale counterparts. They suck air into a compressor and ignite it with fuel. The compressed air then spins a set of turbines connected to a generator that produces electrical power. The fuels used could be hydrogen, diesel-based fuels, or more energy-dense solid fuels.

Made from layers of silicon wafers, these tiny engines are supposed to output the same levels of electrical power per unit of fuel as their large-scale counterparts. Their proponents claim they have two advantages. First, they can output far more power using less fuel than fuel cells or batteries alone. In fact, the ones under development are expected to output 10 to 100 Watts of power almost effortlessly and keep mobile devices powered for days. Moreover, as a result of their high energy density, these engines would require less space than either fuel cells or batteries.

This technology, according to researchers, is likely to be commercialized over the next four years. However, it is too early to tell whether it will in fact replace batteries and fuel cells. One problem is that jet engines produce hot exhaust gases that could raise chip temperatures to dangerous levels, possibly requiring new materials for a chip to withstand these temperatures. Other issues include flammability and the power dissipation of the rotating turbines.

4. CONCLUSION

Power and energy management has grown into a multifaceted effort that brings together researchers from such diverse areas as physics, mechanical engineering, electrical engineering, design automation, logic and high-level synthesis, computer architecture, operating systems, compiler design, and application development. We have examined how the power problem arises and how it has been addressed along multiple levels ranging from transistors to applications. We have also surveyed major commercial power management technologies and provided a glimpse into some emerging technologies. We conclude by noting that the field is still active, and that researchers are continually developing new algorithms and heuristics along each level as well as exploring how to integrate algorithms from multiple levels. Given the wide variety of microarchitectural and software techniques available today and the astoundingly large number of techniques that will be available in the future, it is highly likely that we will overcome the limits imposed by high power consumption and continue to build processors offering greater levels of performance and versatility. However, only time will tell which approaches will ultimately succeed in solving the power problem.

REFERENCES

ABOUGHAZALEH, N., CHILDERS, B., MOSSE, D., MELHEM, R., AND CRAVEN, M. 2003. Energy management for real-time embedded applications with compiler support. In Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, 284–293.


ALBONESI, D., DROPSHO, S., DWARKADAS, S., FRIEDMAN, E., HUANG, M., KURSUN, V., MAGKLIS, G., SCOTT, M., SEMERARO, G., BOSE, P., BUYUKTOSUNOGLU, A., COOK, P., AND SCHUSTER, S. 2003. Dynamically tuning processor resources with adaptive processing. IEEE Computer Magazine 36, 12, 49–58.

ANAND, M., NIGHTINGALE, E., AND FLINN, J. 2004. Ghosts in the machine: Interfaces for better power management. In Proceedings of the International Conference on Mobile Systems, Applications, and Services. 23–35.

AZEVEDO, A., ISSENIN, I., CORNEA, R., GUPTA, R., DUTT, N., VEIDENBAUM, A., AND NICOLAU, A. 2002. Profile-based dynamic voltage scheduling using program checkpoints. In Proceedings of the Conference on Design, Automation and Test in Europe. 168–175.

BAHAR, R. I. AND MANNE, S. 2001. Power and energy reduction via pipeline balancing. In Proceedings of the 28th Annual International Symposium on Computer Architecture. ACM Press, 218–229.

BANAKAR, R., STEINKE, S., LEE, B.-S., BALAKRISHNAN, M., AND MARWEDEL, P. 2002. Scratchpad memory: Design alternative for cache on-chip memory in embedded systems. In Proceedings of the 10th International Symposium on Hardware/Software Codesign. 73–78.

BANERJEE, K. AND MEHROTRA, A. 2001. Global interconnect warming. IEEE Circuits and Devices Magazine 17, 5, 16–32.

BISHOP, B. AND IRWIN, M. J. 1999. Databus charge recovery: Practical considerations. In Proceedings of the International Symposium on Low Power Electronics and Design. 85–87.

BROCK, B. AND RAJAMANI, K. 2003. Dynamic power management for embedded systems. In Proceedings of the IEEE International SOC Conference. 416–419.

BUTTAZZO, G. C. 2002. Scalable applications for energy-aware processors. In Proceedings of the 2nd International Conference on Embedded Software. Springer-Verlag, 153–165.

BUTTS, J. A. AND SOHI, G. S. 2000. A static power model for architects. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture. Monterey, CA. 191–201.

BUYUKTOSUNOGLU, A., SCHUSTER, S., BROOKS, D., BOSE, P., COOK, P. W., AND ALBONESI, D. 2001. An adaptive issue queue for reduced power at high performance. In Proceedings of the 1st International Workshop on Power-Aware Computer Systems. 25–39.

CALHOUN, B. H., HONORE, F. A., AND CHANDRAKASAN, A. 2003. Design methodology for fine-grained leakage control in MTCMOS. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 104–109.

CHEN, D., CONG, J., LI, F., AND HE, L. 2004. Low-power technology mapping for FPGA architectures with dual supply voltages. In Proceedings of the ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. ACM Press, 109–117.

CHEN, G., KANG, B., KANDEMIR, M., VIJAYKRISHNAN, N., AND IRWIN, M. 2003. Energy-aware compilation and execution in Java-enabled mobile devices. In Proceedings of the 17th Parallel and Distributed Processing Symposium. IEEE Press, 34a.

CHOI, K., SOMA, R., AND PEDRAM, M. 2004. Dynamic voltage and frequency scaling based on workload decomposition. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 174–179.

CLEMENTS, P. C. 1996. A survey of architecture description languages. In Proceedings of the 8th International Workshop on Software Specification and Design. IEEE Computer Society, 16–25.

DALLY, W. J. AND TOWLES, B. 2001. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th Conference on Design Automation. ACM Press, 684–689.

DALTON, A. B. AND ELLIS, C. S. 2003. Sensing user intention and context for energy management. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems. 151–156.

DE, V. AND BORKAR, S. 1999. Technology and design challenges for low power and high performance. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '99). ACM Press, 163–168.

DHAR, S., MAKSIMOVIC, D., AND KRANZEN, B. 2002. Closed-loop adaptive voltage scaling controller for standard-cell ASICs. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '02). ACM Press, 103–107.

DINIZ, P. C. 2003. A compiler approach to performance prediction using empirical-based modeling. In International Conference on Computational Science. 916–925.

DROPSHO, S., BUYUKTOSUNOGLU, A., BALASUBRAMONIAN, R., ALBONESI, D. H., DWARKADAS, S., SEMERARO, G., MAGKLIS, G., AND SCOTT, M. L. 2002. Integrating adaptive on-chip storage structures for reduced dynamic power. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 141–152.

DROPSHO, S., SEMERARO, G., ALBONESI, D. H., MAGKLIS, G., AND SCOTT, M. L. 2004. Dynamically trading frequency for complexity in a GALS microprocessor. In Proceedings of the 37th International Symposium on Microarchitecture. IEEE Computer Society, 157–168.

DUDANI, A., MUELLER, F., AND ZHU, Y. 2002. Energy-conserving feedback EDF scheduling for embedded systems with real-time constraints. In Proceedings of the Joint Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, 213–222.


DYER, C. 2004. Fuel cells and portable electronics. In Symposium on VLSI Circuits Digest of Technical Papers. 124–127.

EBERGEN, J., GAINSLEY, J., AND CUNNINGHAM, P. 2004. Transistor sizing: How to control the speed and energy consumption of a circuit. In Proceedings of the 10th International Symposium on Asynchronous Circuits and Systems. 51–61.

EPSTEIN, A. 2004. Millimeter-scale, micro-electro-mechanical systems gas turbine engines. J. Eng. Gas Turb. Power 126, 205–226.

ERNST, D., KIM, N., DAS, S., PANT, S., RAO, R., PHAM, T., ZIESLER, C., BLAAUW, D., AUSTIN, T., FLAUTNER, K., AND MUDGE, T. 2003. Razor: A low-power pipeline based on circuit-level timing speculation. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 7–18.

FAN, X., ELLIS, C. S., AND LEBECK, A. R. 2003. Synergy between power-aware memory systems and processor voltage scaling. In Proceedings of the Workshop on Power-Aware Computer Systems. 164–179.

FEI, Y., ZHONG, L., AND JHA, N. 2004. An energy-aware framework for coordinated dynamic software management in mobile computers. In Proceedings of the IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems. 306–317.

FERDINAND, C. 1997. Cache behavior prediction for real time systems. Ph.D. thesis, Universität des Saarlandes.

FLAUTNER, K., REINHARDT, S., AND MUDGE, T. 2001. Automatic performance setting for dynamic voltage scaling. In Proceedings of the 7th Annual International Conference on Mobile Computing and Networking. ACM Press, 260–271.

FLINN, J. AND SATYANARAYANAN, M. 1999. Energy-aware adaptation for mobile applications. In Proceedings of the 17th ACM Symposium on Operating Systems Principles. ACM Press, 48–63.

FOLEGNANI, D. AND GONZALEZ, A. 2001. Energy-effective issue logic. In Proceedings of the 28th Annual International Symposium on Computer Architecture. ACM Press, 230–239.

GAO, F. AND HAYES, J. P. 2003. ILP-based optimization of sequential circuits for low power. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 140–145.

GENOSSAR, D. AND SHAMIR, N. 2003. Intel Pentium M processor power estimation, budgeting, optimization, and validation. Intel Tech. J. 7, 2, 44–49.

GHOSE, K. AND KAMBLE, M. B. 1999. Reducing power in superscalar processor caches using subbanking, multiple line buffers, and bit-line segmentation. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 70–75.

GIVARGIS, T., VAHID, F., AND HENKEL, J. 2001. System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, 25–30.

GOCHMAN, S., RONEN, R., ANATI, I., BERKOVITS, A., KURTS, T., NAVEH, A., SAEED, A., SPERBER, Z., AND VALENTINE, R. 2003. The Intel Pentium M processor: Microarchitecture and performance. Intel Tech. J. 7, 2, 21–59.

GOMORY, R. E. AND HU, T. C. 1961. Multi-terminal network flows. J. SIAM 9, 4, 551–569.

GOVIL, K., CHAN, E., AND WASSERMAN, H. 1995. Comparing algorithms for dynamic speed-setting of a low-power CPU. In Proceedings of the 1st Annual International Conference on Mobile Computing and Networking. ACM Press, 13–25.

GRUIAN, F. 2001. Hard real-time scheduling for low-energy using stochastic data and DVS processors. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 46–51.

GUNTHER, S., BINNS, F., CARMEAN, D., AND HALL, J. 2001. Managing the impact of increasing microprocessor power consumption. Intel Tech. J.

GURUMURTHI, S., SIVASUBRAMANIAM, A., KANDEMIR, M., AND FRANKE, H. 2003. DRPM: Dynamic speed control for power management in server class disks. SIGARCH Comput. Architect. News 31, 2, 169–181.

HEATH, T., PINHEIRO, E., HOM, J., KREMER, U., AND BIANCHINI, R. 2004. Code transformations for energy-efficient device management. IEEE Trans. Comput. 53, 8, 974–987.

HINTON, G., SAGER, D., UPTON, M., BOGGS, D., CARMEAN, D., KYKER, A., AND ROUSSEL, P. 2004. The microarchitecture of the Intel Pentium 4 processor on 90nm technology. Intel Tech. J. 8, 1, 1–17.

HO, Y.-T. AND HWANG, T.-T. 2004. Low power design using dual threshold voltage. In Proceedings of the Conference on Asia South Pacific Design Automation. IEEE Press, Piscataway, NJ, 205–208.

HOM, J. AND KREMER, U. 2003. Energy management of virtual memory on diskless devices. In Compilers and Operating Systems for Low Power. Kluwer Academic Publishers, Norwell, MA, 95–113.

HOSSAIN, R., ZHENG, M., AND ALBICKI, A. 1996. Reducing power dissipation in CMOS circuits by signal probability based transistor reordering. IEEE Trans. Comput.-Aided Design Integrated Circuits Syst. 15, 3, 361–368.

HSU, C. AND FENG, W. 2004. Effective dynamic voltage scaling through CPU-boundedness detection. In Workshop on Power Aware Computing Systems.

HSU, C.-H. AND KREMER, U. 2003. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 38–48.

HU, J., IRWIN, M., VIJAYKRISHNAN, N., AND KANDEMIR, M. 2002. Selective trace cache: A low power and high performance fetch mechanism. Tech. Rep. CSE-02-016, Department of Computer Science and Engineering, The Pennsylvania State University (Oct.).

HU, J., VIJAYKRISHNAN, N., IRWIN, M., AND KANDEMIR, M. 2003. Using dynamic branch behavior for power-efficient instruction fetch. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI. 127–132.

HU, J. S., NADGIR, A., VIJAYKRISHNAN, N., IRWIN, M. J., AND KANDEMIR, M. 2003. Exploiting program hotspots and code sequentiality for instruction cache leakage management. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 402–407.

HUANG, M., RENAU, J., YOO, S.-M., AND TORRELLAS, J. 2000. A framework for dynamic energy efficiency and temperature management. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture. ACM Press, 202–213.

HUANG, M. C., RENAU, J., AND TORRELLAS, J. 2003. Positional adaptation of processors: Application to energy reduction. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA '03). ACM Press, 157–168.

HUGHES, C. J. AND ADVE, S. V. 2004. A formal approach to frequent energy adaptations for multimedia applications. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04). IEEE Computer Society, 138.

HUGHES, C. J., SRINIVASAN, J., AND ADVE, S. V. 2001. Saving energy with architectural and frequency adaptations for multimedia applications. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 34). IEEE Computer Society, 250–261.

Intel Corporation. 2004. Wireless Intel SpeedStep Power Manager. Intel Corporation.

ISCI, C. AND MARTONOSI, M. 2003. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, 93–104.

ISHIHARA, T. AND YASUURA, H. 1998. Voltage scheduling problem for dynamically variable voltage processors. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 197–202.

ITRS Roadmap. The International Technology Roadmap for Semiconductors. Available at http://public.itrs.net.

IYER, A. AND MARCULESCU, D. 2001. Power aware microarchitecture resource scaling. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Press, 190–196.

IYER, A. AND MARCULESCU, D. 2002a. Microarchitecture-level power management. IEEE Trans. VLSI Syst. 10, 3, 230–239.

IYER, A. AND MARCULESCU, D. 2002b. Power efficiency of voltage scaling in multiple clock, multiple voltage cores. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM Press, 379–386.

JONE, W.-B., WANG, J. S., LU, H.-I., HSU, I. P., AND CHEN, J.-Y. 2003. Design theory and implementation for low-power segmented bus systems. ACM Trans. Design Autom. Electr. Syst. 8, 1, 38–54.

KABADI, M., KANNAN, N., CHIDAMBARAM, P., NARAYANAN, S., SUBRAMANIAN, M., AND PARTHASARATHI, R. 2002. Dead-block elimination in cache: A mechanism to reduce i-cache power consumption in high performance microprocessors. In Proceedings of the International Conference on High Performance Computing. Springer-Verlag, 79–88.

KANDEMIR, M., RAMANUJAM, J., AND CHOUDHARY, A. 2002. Exploiting shared scratch pad memory space in embedded multiprocessor systems. In Proceedings of the 39th Conference on Design Automation. ACM Press, 219–224.

KAXIRAS, S., HU, Z., AND MARTONOSI, M. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture. ACM Press, 240–251.

KIM, C. AND ROY, K. 2002. Dynamic Vth scaling scheme for active leakage power reduction. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. IEEE Computer Society, 163–167.

KIM, N. S., FLAUTNER, K., BLAAUW, D., AND MUDGE, T. 2002. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 219–230.

KIN, J., GUPTA, M., AND MANGIONE-SMITH, W. H. 1997. The filter cache: An energy efficient memory structure. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 184–193.

KISTLER, T. AND FRANZ, M. 2001. Continuous program optimization: Design and evaluation. IEEE Trans. Comput. 50, 6, 549–566.

KISTLER, T. AND FRANZ, M. 2003. Continuous program optimization: A case study. ACM Trans. Program. Lang. Syst. 25, 4, 500–548.

KOBAYASHI AND SAKURAI. 1994. Self-adjusting threshold voltage scheme (SATS) for low-voltage high-speed operation. In Proceedings of the IEEE Custom Integrated Circuits Conference. 271–274.


KONDO, M. AND NAKAMURA, H. 2004. Dynamic processor throttling for power efficient computations. In Workshop on Power Aware Computing Systems.

KONG, B., KIM, S., AND JUN, Y. 2001. Conditional-capture flip-flop for statistical power reduction. IEEE J. Solid State Circuits 36, 8, 1263–1271.

KRANE, R., PARSONS, J., AND BAR-COHEN, A. 1988. Design of a candidate thermal control system for a cryogenically cooled computer. IEEE Trans. Components, Hybrids, Manufact. Techn. 11, 4, 545–556.

KRAVETS, R. AND KRISHNAN, P. 1998. Power management techniques for mobile communication. In Proceedings of the 4th Annual ACM/IEEE International Conference on Mobile Computing and Networking. ACM Press, 157–168.

KURSUN, E., GHIASI, S., AND SARRAFZADEH, M. 2004. Transistor level budgeting for power optimization. In Proceedings of the 5th International Symposium on Quality Electronic Design. 116–121.

LEBECK, A. R., FAN, X., ZENG, H., AND ELLIS, C. 2000. Power aware page allocation. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, 105–116.

LEE, S. AND SAKURAI, T. 2000. Run-time voltage hopping for low-power real-time systems. In Proceedings of the 37th Conference on Design Automation. ACM Press, 806–809.

LI, H., KATKOORI, S., AND MAK, W.-K. 2004. Power minimization algorithms for LUT-based FPGA technology mapping. ACM Trans. Design Autom. Electr. Syst. 9, 1, 33–51.

LI, X., LI, Z., DAVID, F., ZHOU, P., ZHOU, Y., ADVE, S., AND KUMAR, S. 2004. Performance directed energy management for main memory and disks. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI). ACM Press, New York, NY, 271–283.

LIM, S.-S., BAE, Y. H., JANG, G. T., RHEE, B.-D., MIN, S. L., PARK, C. Y., SHIN, H., PARK, K., MOON, S.-M., AND KIM, C. S. 1995. An accurate worst case timing analysis for RISC processors. IEEE Trans. Softw. Eng. 21, 7, 593–604.

LIU, M., WANG, W.-S., AND ORSHANSKY, M. 2004. Leakage power reduction by dual-Vth designs under probabilistic analysis of Vth variation. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, New York, NY, 2–7.

LLOPIS, R. AND SACHDEV, M. 1996. Low power, testable dual edge triggered flip-flops. In Proceedings of the International Symposium on Low Power Electronics and Design. IEEE Press, 341–345.

LORCH, J. AND SMITH, A. 2001. Improving dynamic voltage scaling algorithms with PACE. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM Press, 50–61.

LUZ, V. D. L., KANDEMIR, M., AND KOLCU, I. 2002. Automatic data migration for reducing energy consumption in multi-bank memory systems. In Proceedings of the 39th Conference on Design Automation. ACM Press, 213–218.

LYUBOSLAVSKY, V., BISHOP, B., NARAYANAN, V., AND IRWIN, M. J. 2000. Design of databus charge recovery mechanism. In Proceedings of the International Conference on ASIC. ACM Press, 283–287.

MAGKLIS, G., SCOTT, M. L., SEMERARO, G., ALBONESI, D. H., AND DROPSHO, S. 2003. Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor. In Proceedings of the 30th Annual International Symposium on Computer Architecture. ACM Press, 14–27.

MAGKLIS, G., SEMERARO, G., ALBONESI, D., DROPSHO, S., DWARKADAS, S., AND SCOTT, M. 2003. Dynamic frequency and voltage scaling for a multiple-clock-domain microprocessor. IEEE Micro 23, 6, 62–68.

MARCULESCU, D. 2004. Application adaptive energy efficient clustered architectures. In Proceedings of the International Symposium on Low Power Electronics and Design. 344–349.

MARTIN, T. AND SIEWIOREK, D. 2001. Nonideal battery and main memory effects on CPU speed-setting for low power. IEEE Trans. VLSI Syst. 9, 1, 29–34.

MENG, Y., SHERWOOD, T., AND KASTNER, R. 2005. Exploring the limits of leakage power reduction in caches. ACM Trans. Architecture Code Optimiz., 1, 221–246.

MOHAPATRA, S., CORNEA, R., DUTT, N., NICOLAU, A., AND VENKATASUBRAMANIAN, N. 2003. Integrated power management for video streaming to mobile handheld devices. In Proceedings of the 11th ACM International Conference on Multimedia. ACM Press, 582–591.

NEDOVIC, N., ALEKSIC, M., AND OKLOBDZIJA, V. 2001. Conditional techniques for low power consumption flip-flops. In Proceedings of the 8th International Conference on Electronics, Circuits, and Systems. 803–806.

NILSEN, K. D. AND RYGG, B. 1995. Worst-case execution time analysis on modern processors. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems. ACM Press, 20–30.

PALM, J., LEE, H., DIWAN, A., AND MOSS, J. E. B. 2002. When to use a compilation service? In Proceedings of the Joint Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, 194–203.

PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 2000. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Autom. Electr. Syst. 5, 3, 682–704.


PAPATHANASIOU, A. E. AND SCOTT, M. L. 2002. Increasing disk burstiness for energy efficiency. Technical Report 792, Department of Computer Science, University of Rochester (Nov.).

PATEL, K. N. AND MARKOV, I. L. 2003. Error-correction and crosstalk avoidance in DSM busses. In Proceedings of the International Workshop on System-Level Interconnect Prediction. ACM Press, 9–14.

PENZES, P., NYSTROM, M., AND MARTIN, A. 2002. Transistor sizing of energy-delay-efficient circuits. Tech. Rep. 2002003, Department of Computer Science, California Institute of Technology.

PEREIRA, C., GUPTA, R., AND SRIVASTAVA, M. 2002. PASA: A software architecture for building power aware embedded systems. In Proceedings of the IEEE CAS Workshop on Wireless Communication and Networking.

POLLACK, F. 1999. New microarchitecture challenges in the coming generations of CMOS process technologies. International Symposium on Microarchitecture.

PONOMAREV, D., KUCUK, G., AND GHOSE, K. 2001. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 90–101.

POWELL, M., YANG, S.-H., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. N. 2001. Reducing leakage in a high-performance deep-submicron instruction cache. IEEE Trans. VLSI Syst. 9, 1, 77–90.

RUTENBAR, R. A., CARLEY, L. R., ZAFALON, R., AND DRAGONE, N. 2001. Low-power technology mapping for mixed-swing logic. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 291–294.

SACHS, D., ADVE, S., AND JONES, D. 2003. Cross-layer adaptive video coding to reduce energy on general purpose processors. In Proceedings of the International Conference on Image Processing. 109–112.

SASANKA, R., HUGHES, C. J., AND ADVE, S. V. 2002. Joint local and global hardware adaptations for energy. SIGARCH Computer Architecture News 30, 5, 144–155.

SCHMIDT, R. AND NOTOHARDJONO, B. 2002. High-end server low-temperature cooling. IBM J. Res. Devel. 46, 6, 739–751.

SEMERARO, G., ALBONESI, D. H., DROPSHO, S. G., MAGKLIS, G., DWARKADAS, S., AND SCOTT, M. L. 2002. Dynamic frequency and voltage control for a multiple clock domain microarchitecture. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 356–367.

SEMERARO, G., MAGKLIS, G., BALASUBRAMONIAN, R., ALBONESI, D. H., DWARKADAS, S., AND SCOTT, M. L. 2002. Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 29–40.

SETA, K., HARA, H., KURODA, T., KAKUMU, M., AND SAKURAI, T. 1995. 50% active-power saving without speed degradation using standby power reduction (SPR) circuit. In Proceedings of the IEEE International Solid-State Circuits Conference. IEEE Press, 318–319.

SGROI, M., SHEETS, M., MIHAL, A., KEUTZER, K., MALIK, S., RABAEY, J., AND SANGIOVANNI-VINCENTELLI, A. 2001. Addressing the system-on-a-chip interconnect woes through communication-based design. In Proceedings of the 38th Conference on Design Automation. ACM Press, 667–672.

SHIN, D., KIM, J., AND LEE, S. 2001. Low-energy intra-task voltage scheduling using static timing analysis. In Proceedings of the 38th Conference on Design Automation. ACM Press, 438–443.

STAN, M. AND BURLESON, W. 1995. Bus-invert coding for low-power I/O. IEEE Trans. VLSI Syst., 49–58.

STANLEY-MARBELL, P., HSIAO, M., AND KREMER, U. 2002. A hardware architecture for dynamic performance and energy adaptation. In Proceedings of the Workshop on Power-Aware Computer Systems. 33–52.

STROLLO, A., NAPOLI, E., AND CARO, D. D. 2000. New clock-gating techniques for low-power flip-flops. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 114–119.

SULTANIA, A., SYLVESTER, D., AND SAPATNEKAR, S. 2004. Transistor and pin reordering for gate oxide leakage reduction in dual Tox circuits. In IEEE International Conference on Computer Design. 228–233.

SYLVESTER, D. AND KEUTZER, K. 1998. Getting to the bottom of deep submicron. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM Press, 203–211.

TAN, T., RAGHUNATHAN, A., AND JHA, N. 2003. Software architectural transformations: A new approach to low energy embedded software. In Design, Automation and Test in Europe. 1046–1051.

TAYLOR, C. N., DEY, S., AND ZHAO, Y. 2001. Modeling and minimization of interconnect energy dissipation in nanometer technologies. In Proceedings of the 38th Conference on Design Automation. ACM Press, 754–757.

Transmeta Corporation. 2001. LongRun Power Management: Dynamic Power Management for Crusoe Processors. Transmeta Corporation.

Transmeta Corporation. 2003. Crusoe Processor Product Brief: Model TM5800. Transmeta Corporation.

TURING, A. 1937. Computability and lambda-definability. J. Symbolic Logic 2, 4, 153–163.


UNNIKRISHNAN, P., CHEN, G., KANDEMIR, M., AND MUDGETT, D. R. 2002. Dynamic compilation for energy adaptation. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM Press, 158–163.

VENKATACHALAM, V., WANG, L., GAL, A., PROBST, C., AND FRANZ, M. 2003. ProxyVM: A network-based compilation infrastructure for resource-constrained devices. Technical Report 03-13, University of California, Irvine.

VICTOR, B. AND KEUTZER, K. 2001. Bus encoding to prevent crosstalk delay. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, 57–63.

WANG, H., PEH, L.-S., AND MALIK, S. 2003. Power-driven design of router microarchitectures in on-chip networks. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'36). IEEE Computer Society, 105–116.

WEI, L., CHEN, Z., JOHNSON, M., ROY, K., AND DE, V. 1998. Design and optimization of low voltage high performance dual threshold CMOS circuits. In Proceedings of the 35th Annual Conference on Design Automation. ACM Press, 489–494.

WEISER, M., WELCH, B., DEMERS, A. J., AND SHENKER, S. 1994. Scheduling for reduced CPU energy. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation. 13–23.

WEISSEL, A. AND BELLOSA, F. 2002. Process cruise control: Event-driven clock scaling for dynamic power management. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems. ACM Press, 238–246.

WON, H.-S., KIM, K.-S., JEONG, K.-O., PARK, K.-T., CHOI, K.-M., AND KONG, J.-T. 2003. An MTCMOS design methodology and its application to mobile computing. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design. ACM Press, 110–115.

YUAN, W. AND NAHRSTEDT, K. 2003. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP'03). 149–163.

ZENG, H., ELLIS, C. S., LEBECK, A. R., AND VAHDAT, A. 2002. ECOSystem: Managing energy as a first class operating system resource. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, 123–132.

ZHANG, H. AND RABAEY, J. 1998. Low-swing interconnect interface circuits. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM Press, 161–166.

ZHANG, H., WAN, M., GEORGE, V., AND RABAEY, J. 2001. Interconnect architecture exploration for low-energy reconfigurable single-chip DSPs. In Proceedings of the International Symposium on Systems Synthesis. ACM Press, 33–38.

ZHANG, W., HU, J. S., DEGALAHAL, V., KANDEMIR, M., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2002. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, 208–218.

ZHANG, W., KARAKOY, M., KANDEMIR, M., AND CHEN, G. 2003. A compiler approach for reducing data cache energy. In Proceedings of the 17th Annual International Conference on Supercomputing. ACM Press, 76–85.

ZHAO, P., DARWISH, T., AND BAYOUMI, M. 2004. High-performance and low-power conditional discharge flip-flop. IEEE Trans. VLSI Syst. 12, 5, 477–484.

Received December 2003; revised February and June 2005; accepted September 2005
