Low Power Principles

Upload: vermajiii

Post on 06-Apr-2018

  • 8/3/2019 Low Power Principles

    1/58

Low Power Principles

Author: Agatino Pennisi

[[email protected]]

Low Power Architectures and Design

AST-Lab Catania


    Index

1. Introduction

2. Basic Principles

2.1. Sources of Power Consumption

2.2. Switching Power

2.3. Short-Circuit Power

2.4. Static Power

2.5. Power-Delay and Energy-Delay Products

3. Technology Level Optimizations

3.1. Technology Scaling

3.2. Threshold Voltage Reduction

3.3. Technology Level Conclusion

4. Layout Level Optimizations

5. Circuit Level Optimizations

5.1. Dynamic Logic

5.2. Pass-Transistor Logic

5.3. Asynchronous Logic

5.4. Transistor Sizing

5.5. Design Style

5.6. Circuit Level Conclusion

6. Logic and Architecture Level Optimizations

6.1. Logic Level Optimizations

6.2. Architecture Level Optimizations

7. Software and System Level Optimizations

Conclusions

References


    1. Introduction

The growing market of mobile, battery-powered electronic systems (e.g., cellular phones, personal digital assistants, etc.) demands the design of microelectronic circuits with low power dissipation that can be powered by lightweight batteries with long times between re-charges.

The power consumed by a circuit is defined as p(t) = i(t)·v(t), where i(t) is the instantaneous current provided by the power supply and v(t) is the instantaneous supply voltage. Power minimization targets either the maximum instantaneous power or the average power. The latter impacts battery lifetime and the cost of the heat-dissipation system; the former constrains the design of the power grid and the power-supply circuits.

    It is important to stress from the outset that power minimization is never the only objective in real-

    life designs. Performance is always a critical metric that cannot be neglected. Unfortunately, in most

    cases, power can be reduced at the price of some performance degradation. For this reason, several

metrics for joint power-performance evaluation have been proposed in the past. In many designs, the power-delay product (i.e., energy) is an acceptable metric. Energy minimization rules out design choices that


heavily compromise performance to reduce power consumption. When performance has priority over power consumption, the energy-delay product (which is equivalent to power × delay²) can be adopted to tightly control performance degradation.

    Besides power vs. performance, another key trade-off in VLSI design is power vs. flexibility.

    Several authors have observed that application specific designs are orders of magnitude more power

    efficient than general-purpose systems programmed to perform the same computation. On the other

    hand, flexibility (programmability) is often an indispensable requirement, and designers must strive to

    achieve maximum power efficiency without compromising flexibility.


    2. Basic Principles

2.1. Sources of Power Consumption

The three major sources of power dissipation in a digital CMOS circuit are:

P = P_Switching + P_Short-Circuit + P_Leakage    (Eq. 2.1)

Fig. 2.1. Sources of power consumption


P_Switching, also called switching power, is due to charging and discharging the capacitors driven by the circuit.

P_Short-Circuit, called short-circuit power, is caused by the short-circuit currents that arise when pairs of PMOS/NMOS transistors are conducting simultaneously.

P_Leakage originates from substrate injection and subthreshold effects. For older technologies (0.8 µm and above), P_Switching was predominant. For deep-submicron processes, P_Leakage becomes more important.

Design for low power implies the ability to reduce all three components.

Optimizations can be achieved by facing the power problem from different perspectives: design and technology. Enhanced design capabilities mostly impact switching and short-circuit power; technology improvements, on the other hand, contribute to reductions of all three components.


    2.2. Switching Power

    Switching power for a CMOS gate working in a synchronous environment is modeled by the following

    equation:

P_Switching = (1/2) · C_L · V_DD^2 · f_Clock · E_SW    (Eq. 2.2)

where C_L is the output load of the gate, V_DD is the supply voltage, f_Clock is the clock frequency, and E_SW is the switching activity of the gate, defined as the probability of the gate's output making a logic transition during one clock cycle.

Reductions of P_Switching are achievable by:

1. supply voltage scaling

2. frequency scaling

3. minimization of switched capacitance
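As a numerical sketch of Eq. (2.2), the switching-power model can be evaluated directly. The component values below (50 fF load, 2.5 V supply, 200 MHz clock, 10% activity) are illustrative assumptions, not figures from the slides:

```python
# Switching-power model of Eq. (2.2): P = 1/2 * C_L * V_DD^2 * f_Clock * E_SW.
# All component values are illustrative assumptions.

def switching_power(c_load, v_dd, f_clock, e_sw):
    """Average switching power of a gate in a synchronous design."""
    return 0.5 * c_load * v_dd ** 2 * f_clock * e_sw

# Example: 50 fF load, 2.5 V supply, 200 MHz clock, 10% switching activity.
p = switching_power(50e-15, 2.5, 200e6, 0.1)        # -> 3.125e-06 W

# Halving V_DD cuts P by 4x (quadratic); halving f_Clock cuts it by 2x (linear).
p_low_v = switching_power(50e-15, 1.25, 200e6, 0.1)
p_low_f = switching_power(50e-15, 2.5, 100e6, 0.1)
```

The two follow-up calls reproduce the quadratic-vs-linear trade-off discussed next: voltage scaling is the more powerful knob, but both cost speed.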


    1. Supply Voltage Scaling

advantage: scaling V_DD scales P_Switching down quadratically

drawback: scaling V_DD lowers circuit speed (decreasing circuit performance)

To compensate for the decrease in circuit performance introduced by the reduced voltage, speed optimization is applied first, followed by supply voltage scaling, which brings the design back to its original timing, but with a lower power requirement.

    2. Frequency Scaling

advantage: scaling f_Clock scales P_Switching down linearly

drawback: scaling f_Clock lowers circuit speed (decreasing circuit performance)

Selective frequency scaling (as well as voltage scaling) may thus be applied to units that are not performance-critical, at no penalty in the overall system speed.


    3. Minimization of Switched Capacitance

Optimization approaches that have a lower impact on performance, yet still allow significant power savings, are those targeting the minimization of the switched capacitance (i.e., the product of the capacitive load and the switching activity).

Static solutions (i.e., applicable at design time) handle switched-capacitance minimization through area optimization (which corresponds to a decrease in the capacitive load) and through switching-activity reduction via the exploitation of different kinds of signal correlations (temporal, spatial, spatio-temporal). Dynamic techniques, on the other hand, aim at eliminating the power waste that may originate from certain system workloads (i.e., the data being processed).


    2.3. Short-Circuit Power

In actual designs, the assumption of zero rise and fall times for the input waveforms is not correct. The finite slope of the input signal causes a direct current path between V_DD and GND for a short period of time during switching, while the NMOS and the PMOS transistors are conducting simultaneously. This is illustrated in Figure 2.2. Under the (reasonable) assumption that the resulting current spikes can be approximated as triangles, and that the inverter is symmetrical in its rising and falling responses, we can compute the energy consumed per switching period,

Fig. 2.2. Short-circuit currents during transients

E_dp = V_DD · I_peak · t_sc / 2 + V_DD · I_peak · t_sc / 2 = t_sc · V_DD · I_peak    (Eq. 2.3)

as well as the average power consumption

P_dp = t_sc · V_DD · I_peak · f = C_sc · V_DD^2 · f    (Eq. 2.4)
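A small numeric sketch of Eqs. (2.3) and (2.4), approximating each current spike as a triangle of peak I_peak and base t_sc; the numbers (2.5 V, 0.2 mA peak, 100 ps overlap, 200 MHz) are illustrative assumptions:

```python
# Short-circuit (direct-path) dissipation, Eqs. (2.3)-(2.4).
# All numeric values are illustrative assumptions.

def direct_path_energy(v_dd, i_peak, t_sc):
    """Energy per switching period: two triangular spikes of area I_peak*t_sc/2."""
    return 2 * (v_dd * i_peak * t_sc / 2)        # = t_sc * V_DD * I_peak

def direct_path_power(v_dd, i_peak, t_sc, f):
    """Average power: energy per period times switching frequency."""
    return direct_path_energy(v_dd, i_peak, t_sc) * f

e = direct_path_energy(2.5, 0.2e-3, 100e-12)     # 5e-14 J per period
p = direct_path_power(2.5, 0.2e-3, 100e-12, 200e6)

# Equivalent short-circuit capacitance C_sc = t_sc * I_peak / V_DD, so that
# P_dp can also be written as C_sc * V_DD^2 * f, as in Eq. (2.4).
c_sc = 100e-12 * 0.2e-3 / 2.5
```

The last line shows the rewriting used in Eq. (2.4): the direct-path loss behaves like an extra switched capacitance C_sc.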


The short-circuit (also called direct-path) power dissipation is proportional to the switching activity, similar to the capacitive power dissipation. t_sc represents the time both devices are conducting. For a linear input slope, this time is reasonably well approximated by Eq. (2.5), where t_s represents the 0-100% transition time.

t_sc = ((V_DD - 2·V_T) / V_DD) · t_s = ((V_DD - 2·V_T) / V_DD) · t_r(f) / 0.8    (Eq. 2.5)

    Ipeak is determined by the saturation current of the devices and is hence directly proportional to the

    sizes of the transistors. The peak current is also a strong function of the ratio between input and

    output slopes. This relationship is best illustrated by the following simple analysis. Consider a static

CMOS inverter with a 0→1 transition at the input. Assume first that the load capacitance is very large, so that the output fall time is significantly larger than the input rise time (Figure 2.3a). Under

    those circumstances, the input moves through the transient region before the output starts to change.

    As the source-drain voltage of the PMOS device is approximately 0 during that period, the device

    shuts off without ever delivering any current. The short-circuit current is close to zero in this case.

    Consider now the reverse case, where the output capacitance is very small, and the output fall time is

    substantially smaller than the input rise time (Figure 2.3b). The drain-source voltage of the PMOS

    device equals VDD for most of the transition period, guaranteeing the maximal short-circuit current

    (equal to the saturation current of the PMOS). This clearly represents the worst-case condition.


    Fig. 2.3. Impact of load capacitance on short-circuit current

    The conclusions of the above analysis are confirmed in Figure 2.4, which plots the short-circuit

    current through the NMOS transistor during a low-to-high transition as a function of the load

    capacitance.

    Fig. 2.4. CMOS inverter short-circuit current through NMOS transistor as a

    function of the load capacitance (for a fixed input slope of 500 psec).


    This analysis leads to the conclusion that the short-circuit dissipation is minimized by making the

    output rise/fall time larger than the input rise/fall time. On the other hand, making the output rise/fall

    time too large slows down the circuit and can cause short-circuit currents in the fan-out gates. This

presents a perfect example of how local optimization that ignores the global picture can lead to an

    inferior solution.


    2.4. Static Power

The static (or steady-state) power dissipation of a circuit is expressed by Eq. (2.6), where I_stat is the

    current that flows between the supply rails in the absence of switching activity.

P_stat = I_stat · V_DD    (Eq. 2.6)

    Ideally, the static current of the CMOS inverter is equal to zero, as the PMOS and NMOS devices are

    never on simultaneously in steady-state operation. There is, unfortunately, a leakage current flowing

    through the reverse-biased diode junctions of the transistors, located between the source or drain and

    the substrate as shown in Figure 2.5. This contribution is, in general, very small and can be ignored.

For the device sizes under consideration, the leakage current per unit drain area typically ranges between 10-100 pA/µm² at room temperature. For a die with 1 million gates, each with a drain area of 0.5 µm² and operated at a supply voltage of 2.5 V, the worst-case power consumption due to diode leakage equals 0.125 mW, which is clearly not much of an issue. However, be aware that the junction leakage currents are caused by thermally generated carriers. Their value increases exponentially with increasing junction temperature. At 85°C (a common junction temperature limit for commercial hardware), the leakage currents increase by a factor of 60 over their room-temperature values. Keeping the overall operating temperature of a circuit low is consequently a desirable goal.
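The back-of-the-envelope estimate above can be reproduced directly from the figures quoted in the text:

```python
# Worked version of the diode-leakage estimate in the text: 1 million gates,
# 0.5 um^2 drain area each, worst-case 100 pA/um^2 at room temperature, 2.5 V.

n_gates = 1_000_000
drain_area_um2 = 0.5
leakage_density = 100e-12          # A per um^2, worst case
v_dd = 2.5

i_leak = n_gates * drain_area_um2 * leakage_density   # total leakage current, A
p_leak = i_leak * v_dd                                # 0.125 mW, as in the text

# The text quotes a ~60x increase at an 85 C junction temperature.
p_leak_85c = p_leak * 60                              # 7.5 mW
```

Even with the 60x thermal penalty the diode-leakage term stays in the milliwatt range for this example, which is why the text treats subthreshold conduction, not junction leakage, as the emerging concern.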


    Fig. 2.5. Sources of leakage currents in CMOS inverter (for Vin = 0 V)

    As the temperature is a strong function of the dissipated heat and its removal mechanisms, this can

    only be accomplished by limiting the power dissipation of the circuit and/or by using chip packages

    that support efficient heat removal.

    An emerging source of leakage current is the subthreshold current of the transistors. An MOS

    transistor can experience a drain-source current, even when VGS is smaller than the threshold voltage

    (Figure 2.6).

The closer the threshold voltage is to zero volts, the larger the leakage current at V_GS = 0 V and the larger the static power consumption. To offset this effect, the threshold voltage of the device has generally been kept sufficiently high. Standard processes feature V_T values that are never smaller than 0.5-0.6 V and that in some cases are even substantially higher (~0.75 V).


    Fig. 2.6. Decreasing the threshold increases the subthreshold current at VGS=0

    This approach is being challenged by the reduction in supply voltages that typically goes with deep-

    submicron technology scaling. Scaling the supply voltages while keeping the threshold voltage

    constant results in an important loss in performance, especially when VDD approaches 2VT .

    One approach to address this performance issue is to scale the device thresholds down as well. This

moves the delay curve in the right-hand plot of Figure 3.1 to the left, which means that the performance penalty for lowering

    the supply voltage is reduced. Unfortunately, the threshold voltages are lower-bounded by the amount

    of allowable subthreshold leakage current, as demonstrated in Figure 2.6. The choice of the threshold

    voltage hence represents a trade-off between performance and static power dissipation.


    The continued scaling of the supply voltage predicted for the next generations of CMOS technologies

    will however force the threshold voltages ever downwards, and will make subthreshold conduction a

    dominant source of power dissipation. Process technologies that contain devices with sharper turn-off

    characteristic will therefore become more attractive. An example of the latter is the SOI (Silicon-on-

    Insulator) technology whose MOS transistors have slope-factors that are close to the ideal 60

    mV/decade.

    This lower bound on the thresholds is in some sense artificial. The idea that the leakage current in a

    static CMOS circuit has to be zero is a preconception. Certainly, the presence of leakage currents

    degrades the noise margins, because the logic levels are no longer equal to the supply rails. As long as

    the noise margins are within range, this is not a compelling issue. The leakage currents, of course,

cause an increase in static power dissipation. This is offset by the drop in supply voltage, which is enabled by the reduced thresholds at no cost in performance, and results in a quadratic reduction in

dynamic power. For a 0.25 µm CMOS process, the following circuit configurations obtain the same performance: a 3 V supply with 0.7 V V_T, and a 0.45 V supply with 0.1 V V_T. The dynamic power consumption of the latter is, however, 45 times smaller! Choosing the correct values of supply and threshold voltages once again requires a trade-off. The optimal operating point depends upon the activity of the circuit.
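The "45 times smaller" figure follows directly from the quadratic V_DD dependence of Eq. (2.2), since the two configurations run at the same speed and frequency:

```python
# The text's 0.25 um example: a 3 V / 0.7 V-V_T design and a 0.45 V / 0.1 V-V_T
# design reach the same performance. With equal C_L, f and activity, dynamic
# power scales as V_DD^2 (Eq. 2.2), so the ratio is simply (3 / 0.45)^2.

ratio = (3.0 / 0.45) ** 2        # ~44.4, i.e. roughly the 45x quoted in the text
```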


In the presence of a sizable static power dissipation, it is essential that non-active modules are powered down, lest static power dissipation become dominant. Power-down, also called standby, can be accomplished by disconnecting the unit from the supply rails or by lowering the supply voltage.


    2.5. Power-Delay and Energy-Delay Products

    The total power consumption of the CMOS inverter is now expressed as the sum of its three

    components:

P_tot = P_dyn + P_dp + P_stat = (C_L · V_DD^2 + V_DD · I_peak · t_s) · f_0→1 + V_DD · I_stat    (Eq. 2.7)

In typical CMOS circuits, the capacitive dissipation is by far the dominant factor. The direct-path consumption can be kept within bounds by careful design, and should hence not be an issue. Leakage is ignorable at present, but this might change in the not-too-distant future.

In Chapter 1, we introduced the power-delay product, PDP, as a quality measure for a logic gate.

PDP = P_av · t_p    (Eq. 2.8)

The PDP presents a measure of energy, as is apparent from the units (W·s = Joule). Assuming that the gate is switched at its maximum possible rate of f_max = 1/(2·t_p), and ignoring the contributions of the static and direct-path currents to the power consumption, we find

PDP = C_L · V_DD^2 · f_max · t_p = C_L · V_DD^2 / 2    (Eq. 2.9)

The PDP stands for the average energy consumed per switching event (that is, for a 0→1 or a 1→0 transition). Remember that earlier we had defined E_av as the average energy per switching cycle (or per energy-consuming event). As each inverter cycle contains a 0→1 and a 1→0 transition, E_av is hence twice the PDP.


    The validity of the PDP as a quality metric for a process technology or gate topology is questionable.

It measures the energy needed to switch the gate, which is an important property for sure. Yet for a given structure, this number can be made arbitrarily low by reducing the supply voltage. From this perspective, the optimum voltage to run the circuit at would be the lowest possible value that still ensures functionality. This comes at a major expense in performance, as discussed earlier. A more

    relevant metric should combine a measure of performance and energy. The energy-delay product,

    EDP, does exactly that.

EDP = PDP · t_p = P_av · t_p^2 = (C_L · V_DD^2 / 2) · t_p    (Eq. 2.10)

    It is worth analyzing the voltage dependence of the EDP. Higher supply voltages reduce delay, but

    harm the energy, and the opposite is true for low voltages. An optimum operation point should hence

    exist. Assuming that NMOS and PMOS transistors have comparable threshold and saturation voltages,

    we can simplify the following propagation delay expression.

t_pHL = (0.52 · C_L · V_DD) / ((W/L)_n · k'_n · V_DSATn · (V_DD - V_Tn - V_DSATn/2))  ≈  t_p = (α · C_L · V_DD) / (V_DD - V_Te)    (Eq. 2.11)

where V_Te = V_T + V_DSAT/2, and α is a technology parameter. Combining Eq. (2.10) and Eq. (2.11),

EDP ≈ (α · C_L^2 · V_DD^3) / (2 · (V_DD - V_Te))    (Eq. 2.12)


    The optimum supply voltage can be obtained by taking the derivative of Eq. (2.12) with respect to VDD,

    and equating the result to 0.

V_DDopt = (3/2) · V_Te    (Eq. 2.13)

    The remarkable outcome from this analysis is the low value of the supply voltage that simultaneously

    optimizes performance and energy. For sub-micron technologies with thresholds in the range of 0.5 V,

    the optimum supply is situated around 1 V.
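Eq. (2.13) can be checked numerically; the V_DSAT value below is an illustrative assumption (the text only fixes V_T ≈ 0.5 V):

```python
# Optimum supply voltage from Eq. (2.13): V_DDopt = (3/2) * V_Te, with
# V_Te = V_T + V_DSAT/2. V_DSAT = 0.2 V is an assumed, illustrative value.

def v_dd_optimum(v_t, v_dsat):
    """EDP-optimal supply voltage for given threshold and saturation voltages."""
    v_te = v_t + v_dsat / 2
    return 1.5 * v_te

# With V_T = 0.5 V (as in the text) and an assumed V_DSAT of 0.2 V, the
# optimum lands near the ~1 V the text quotes.
v_opt = v_dd_optimum(0.5, 0.2)   # 0.9 V
```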


3. Technology Level Optimizations

    3.1. Technology Scaling

    Scaling of physical dimensions is a well-known technique for reducing circuit power consumption. The

    first-order effects of scaling can be fairly easily derived. Device gate capacitances are of the form:

C_Gate = C_ox · W · L = (ε_ox / t_ox) · W · L    (Eq. 3.1)

If we scale down W, L, and t_ox by S, then this capacitance will scale down by S as well. Consequently, if system data rates and supply voltages remain unchanged, this factor-of-S reduction in capacitance is passed on directly to power:

Fixed performance, fixed voltage:  P ∝ 1/S    (Eq. 3.2)

The effect of scaling on delays is equally promising. Based on Eq. (3.3), the transistor current drive increases linearly with S.

I_dd = (C_ox / 2) · (W/L) · (V_dd - V_t)^2    (Eq. 3.3)


As a result, propagation delays, which are proportional to capacitance and inversely proportional to drive current, scale down by a factor of S^2.

Assuming we are only trying to maintain system throughput rather than increase it, the improvement in circuit performance can be traded for lower power by reducing the supply voltage. In particular, neglecting V_t effects, the voltage can be reduced by a factor of S^2. This results in an S^4 reduction in device currents, which, along with the capacitance scaling, leads to an S^5 reduction in power:

Fixed performance, variable voltage:  P ∝ 1/S^5    (Eq. 3.4)
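The first-order bookkeeping behind Eqs. (3.2) and (3.4) can be sketched as follows (first-order effects only, ignoring the interconnect parasitics discussed next):

```python
# First-order scaling from Section 3.1: scaling W, L and t_ox by S reduces
# gate capacitance by S. At fixed voltage and data rate, P ~ 1/S (Eq. 3.2);
# additionally scaling the voltage by S^2 gives P ~ 1/S^5 (Eq. 3.4).

def power_scaling(s, scale_voltage=False):
    """Relative power after scaling dimensions by S (first order only)."""
    cap_factor = 1 / s                   # C ~ W*L/t_ox scales down by S
    if scale_voltage:
        v_factor = (1 / s ** 2) ** 2     # P ~ V^2, and V is reduced by S^2
        return cap_factor * v_factor     # 1/S^5 overall
    return cap_factor                    # 1/S at fixed voltage

p_fixed_v = power_scaling(2)                        # 0.5  = 1/S
p_scaled_v = power_scaling(2, scale_voltage=True)   # 1/32 = 1/S^5
```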

    This discussion, however, ignores many important second-order effects. For example, as scaling

    continues, interconnect parasitics eventually begin to dominate and change the picture substantially.

    The resistance of a wire is proportional to its length and inversely proportional to its thickness and

    width. Since in this discussion we are considering the impact of technology scaling on a fixed design,

    the local and global wire lengths should scale down by S along with the width and thickness of the

    wire. This means that the wire resistance should scale up by a factor of S overall. The wire

    capacitance is proportional to its width and length and inversely proportional to the oxide thickness.

    Consequently, the wire capacitance scales down by a factor of S. To summarize:


R_w ∝ S,  C_w ∝ 1/S,  t_wire = R_w · C_w ∝ 1    (Eq. 3.5)

    This means that, unlike gate delays, the intrinsic interconnect delay does not scale down with physical

    dimensions. So at some point interconnect delays will start to dominate over gate delays and it will no

    longer be possible to scale down the supply voltage. This means that once again power is reduced

    solely due to capacitance scaling:

Parasitics dominated:  P ∝ 1/S    (Eq. 3.6)

    Actually, the situation is even worse since the above analysis did not consider second-order effects

    such as the fringing component of wire capacitance, which may actually grow with reduced

    dimensions. As a result, realistically speaking, power may not scale down at all, but instead may stay

    approximately constant with technology scaling or even increase:

Including 2nd-order effects:  P ∝ 1 or more    (Eq. 3.7)


    The conclusion is that technology scaling offers significant benefits in terms of power only up to a

    point. Once parasitics begin to dominate, the power improvements slack off or disappear completely.

    So we cannot rely on technology scaling to reduce power indefinitely. We must turn to other

    techniques for lowering power consumption.

    3.2. Threshold Voltage Reduction

    Many process parameters, aside from lithographic dimensions, can have a large impact on circuit

    performance. For example, at low supply voltages the value of the threshold voltage (Vt) is extremely

    important. Threshold voltage places a limit on the minimum supply voltage that can be used without

incurring unreasonable delay penalties (Fig. 3.1). Based on this, it would seem reasonable to consider

    reducing threshold voltages in a low-power process.

    Fig. 3.1. Energy and delay as a function of supply voltage


Unfortunately, sub-threshold conduction and noise margin considerations limit how low V_t can be set. Although devices are ideally off for gate voltages below V_t, in reality there is always some sub-threshold conduction, even for V_GS below V_t.


3.3. Technology Level Conclusion

The methodology should be applicable not only to different technologies, but also to different circuit and logic styles. Whenever possible, scaling and circuit techniques should be combined with the high-level methodology to further reduce power consumption; however, the general low-power strategy should not require these tricks. The advantages of scaling and low-level techniques cannot be overemphasized, but they should not be the sole arena from which the designer can extract power gains.


4. Layout Level Optimizations

    There are a number of layout-level techniques that can be applied to reduce power. The simplest of

    these techniques is to select upper level metals to route high activity signals. The higher level metals

    are physically separated from the substrate by a greater thickness of silicon dioxide. Since the physical

capacitance of these wires decreases with increasing tox , there is some advantage to routing

    the highest activity signals in the higher level metals. For example, in a typical process metal three

    will have about a 30% lower capacitance per unit area than metal two. It should be noted, however,

    that the technique is most effective for global rather than local routing, since connecting to a higher

    level metal requires more vias, which add area and capacitance to the circuit. Still, the concept of

    associating high activity signals with low physical capacitance nodes is an important one and appears

    in many different contexts in low-power design.

    For example, we can combine this notion with the locality theme to arrive at a general strategy for

    low-power placement and routing. The placement and routing problem crops up in many different

    guises in VLSI design. Place and route can be performed on pads, functional blocks, standard cells,

    gate arrays, etc. Traditional placement involves minimizing area and delay. Minimizing delay, in turn,

    translates to minimizing the physical capacitance (or length) of wires.


    In contrast, placement for low-power concentrates on minimizing the activity-capacitance product

    rather than the capacitance alone. In summary, high activity wires should be kept short or, in a

    manner of speaking, local. Tools have been developed that use this basic strategy to achieve about an

    18% reduction in power.

    Although intelligent placement and routing of standard cells and gate arrays can help to improve their

    power efficiency, the locality achieved by low-power place and route tools rarely approaches what can

    be achieved by a full-custom design. Design-time issues and other economic factors, however, may in

    many cases preclude the use of full-custom design. In these instances, the concepts presented here

    regarding low-power placement and routing of standard cells and gate arrays may prove useful.

    Moreover, even for custom designs, these low-power strategies can be applied to placement and

    routing at the block level.
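The activity-capacitance objective described in this section can be sketched as a cost function; the net data below (activity, capacitance pairs) is illustrative, not from the text:

```python
# Low-power placement objective from Section 4: minimize the sum of
# activity * capacitance over the nets, rather than capacitance alone.
# The net data below is an illustrative assumption.

def switched_capacitance(nets):
    """Cost = sum of (switching activity * wire capacitance) over all nets."""
    return sum(activity * cap for activity, cap in nets)

# Two placements of the same three nets (activity, capacitance in farads).
# Both have 60 fF of total wire, but the second keeps the high-activity
# net short, so it wins on power despite equal total capacitance.
placement_a = [(0.5, 40e-15), (0.1, 10e-15), (0.05, 10e-15)]
placement_b = [(0.5, 10e-15), (0.1, 40e-15), (0.05, 10e-15)]

cost_a = switched_capacitance(placement_a)   # 21.5 fF effective
cost_b = switched_capacitance(placement_b)   # 9.5 fF effective
```

This is the "high-activity wires should be kept short" rule in miniature: a placer that minimizes cost_b-style objectives trades wire length between nets according to their activities.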


5. Circuit Level Optimizations

    In this section, we go beyond the traditional synchronous fully-complementary static CMOS circuit

    style to consider the relative advantages and disadvantages of other design strategies; we will

    consider five topics relating to low-power circuit design: dynamic logic, pass-transistor logic,

    asynchronous logic, transistor sizing, and design style (e.g. full custom versus standard cell).

    5.1. Dynamic Logic

    In static logic, node voltages are always maintained by a conducting path from the node to one of the

    supply rails. In contrast, dynamic logic nodes go through periods during which there is no path to the

    rails, and voltages are maintained as charge dynamically stored on nodal capacitances. Figure 5.1

    shows an implementation of a complex boolean expression in both static and dynamic logic. In the

    dynamic case, the clock period is divided into a pre-charge and an evaluation phase. During pre-

    charge, the output is charged to Vdd. Then, during the next clock phase, the NMOS tree evaluates the

logic function and discharges the output node if necessary. Relative to static CMOS, dynamic logic has both advantages and disadvantages in terms of power.

    Historically, dynamic design styles have been touted for their inherent low-power properties. For

    example, dynamic design styles often have significantly reduced device counts.


Fig. 5.1. Static and dynamic implementations of F = ¬(A·B + C)

    Since the logic evaluation function is fulfilled by the NMOS tree alone, the PMOS tree can be replaced

    by a single pre-charge device. These reduced device counts result in a corresponding decrease in

    capacitive loading, which can lead to power savings. Moreover, by avoiding stacked PMOS

    transistors, dynamic logic is amenable to low voltage operation where the ability to stack devices is

limited. In addition, dynamic gates don't experience short-circuit power dissipation. Whenever static circuits switch, a brief pulse of transient current flows from Vdd to ground, consuming power.

    Furthermore, dynamic logic nodes are guaranteed to have a maximum of one transition per clock

    cycle.


    Static gates do not follow this pattern and can experience a glitching phenomenon whereby output

nodes undergo unwanted transitions before settling at their final value. This causes excess power dissipation in static gates. So in some sense, dynamic logic avoids some of the overhead and waste

    associated with fully-complementary static logic.

    In practice, however, dynamic circuits have several disadvantages. For instance, each of the pre-

    charge transistors in the chip must be driven by a clock signal. This implies a dense clock distribution

    network and its associated capacitance and driving circuitry. These components can contribute

    significant power consumption to the chip. In addition, with each gate influenced by the clock, issues

    of skew become even more important and difficult to handle.

    Fig. 5.2. Output activities for static and dynamic logic gates (with random inputs)


    Also, the clock is a high (actually, maximum) activity signal, and having it connected to the PMOS

pull-up network can introduce unnecessary activity into the circuit. For commonly used boolean logic gates, Figure 5.2 shows the probability that the outputs make an energy-consuming (i.e. zero-to-one)

    transition for random gate inputs. In all cases, the activity of the dynamic gates is higher than that of

    the static gates. We can show that, in general, for any boolean signal X, the activity of a dynamically

    pre-charged wire carrying X must always be at least as high as the activity of a statically-driven wire:

dynamic case:  P_wire(0→1) = P(X = 0)

static case:   P_wire(0→1) = P(X_t = 0) · P(X_t+1 = 1 | X_t = 0) ≤ P(X = 0)    (Eq. 5.1)
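A concrete instance of Eq. (5.1): the output of a 2-input NAND with independent, equiprobable inputs drawn fresh each cycle (the setting of Figure 5.2):

```python
# Eq. (5.1) for a 2-input NAND output X with random i.i.d. inputs.
# X = 0 only when both inputs are 1, so P(X=0) = 1/4.
#
# Dynamic (pre-charged) wire: it is re-charged every cycle in which it was
# discharged, so its 0->1 activity equals P(X=0).
# Static wire: it makes a 0->1 transition only when X actually goes 0 -> 1;
# with i.i.d. inputs, P(X_{t+1}=1 | X_t=0) = P(X=1).

p_x0 = 1 / 4
p_x1 = 1 - p_x0

dynamic_activity = p_x0          # 0.25
static_activity = p_x0 * p_x1    # 3/16 = 0.1875
```

As Eq. (5.1) states, the dynamically pre-charged wire can never have lower 0→1 activity than the statically driven one, since the static activity is the dynamic one multiplied by a conditional probability at most equal to 1.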

    In conclusion, dynamic logic has certain advantages and disadvantages for low-power operation. The

    key is to determine which of the conflicting factors is dominant. In certain cases, a dynamic

    implementation might actually achieve a lower overall power consumption. Furthermore, the savings

    in terms of glitching and short-circuit power, while possibly significant, can also be achieved in static

    logic through other means (discussed in Section 6). All of this, coupled with the robustness of static

logic at low voltages, gives the designer less incentive to select a dynamic implementation of a low-power system.


    5.2. Pass-Transistor Logic

    As with dynamic logic, pass-transistor logic offers the possibility of reduced transistor counts. Figure

    5.3 illustrates this fact with an equivalent pass-transistor implementation of the static logic function of

    Figure 5.1. Once again, the reduction in transistors results in lower capacitive loading from devices.

    This might make pass-transistor logic attractive as a low-power circuit style.

    Fig. 5.3. Complementary pass-transistor implementations of F = (A·B + C)'

    Like dynamic logic, however, pass-transistor circuits suffer from several drawbacks. First, pass

    transistors have asymmetrical voltage driving capabilities. For example, NMOS transistors do not

    pass high voltages efficiently, and experience reduced current drive as well as a Vt drop at the

    output. If the output is used to drive a PMOS gate, static power dissipation can result.
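
    A quick numeric illustration of the threshold drop (the supply and threshold values below are assumed for illustration, not taken from the text):

```python
VDD = 2.5   # supply voltage (V), assumed value
VTN = 0.5   # NMOS threshold voltage (V), assumed value

# An NMOS pass transistor with gate and drain at VDD stops conducting
# once its source rises to VDD - VTN (it needs Vgs > Vtn to stay on),
# so a logic '1' is degraded by one threshold drop at the output.
v_out_high = VDD - VTN
print(f"degraded '1' at pass-transistor output: {v_out_high:.2f} V")

# If this degraded '1' drives a PMOS gate, that PMOS sees
# Vsg = VDD - v_out_high = VTN, so it is barely (or not quite) off,
# and a static current path through the driven stage can result.
v_sg_pmos = VDD - v_out_high
print(f"Vsg left on the driven PMOS: {v_sg_pmos:.2f} V")
```
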


    These flaws can be remedied with additional hardware - for instance, a complementary transmission gate consisting of an NMOS and a PMOS pass transistor in parallel, or a level-restoring circuit, as shown in Figure 5.4. Unfortunately, this forfeits the power savings offered by reduced device counts.

    Also, efficient layout of pass-transistor networks can be problematic. Sharing of source/drain diffusion

    regions is often not possible, resulting in increased parasitic junction capacitances.

    In summary, there may be situations in which pass-transistor logic is more power efficient than fully-

    complementary logic; however, the benefits are likely to be small relative to the orders of magnitude

    savings possible from higher level techniques. So, again, circuit-level power saving techniques should

    be used whenever appropriate, but should be subordinate to higher level considerations.


    Fig. 5.4. Level-restoring Circuit


    5.3. Asynchronous Logic

    Asynchronous logic refers to a circuit style employing no global clock signal for synchronization.

    Instead, synchronization is provided by handshaking circuitry used as an interface between gates (see

    Figure 5.5). While more common at the system level, asynchronous logic has failed to gain acceptance

    at the circuit level, mainly on area and performance grounds. It is worthwhile to

    reevaluate asynchronous circuits in the context of low power.

    Fig. 5.5. Asynchronous circuit with handshaking

    Typically, asynchronous circuits are classified as shown in Figure 5.6.

    The self-timed concept is based on an architecture containing registers, arithmetic logic units, control units, and control signals, but no clock signal. The sequence of computations is managed by local synchronization signals (see Figure 5.7).


    Fig. 5.6. Classification of asynchronous circuits

    Fig. 5.7. Example of a self-timed system

    Strictly speaking, self-timed systems are a subset of asynchronous systems, which in general have no global clock signal. Within self-timed systems there is a class of systems, called speed independent, which work correctly regardless of the delays of their internal components, though not of the delays of their interconnections (Fig. 5.8).

    Fig. 5.8. Example of a speed independent system

    Fig. 5.9. Example of a delay insensitive system


    Within speed-independent systems there is a class of systems, called delay insensitive, which work correctly regardless of the delays of both their internal components and their interconnections (Fig. 5.9).

    The primary power advantages of asynchronous logic can be classified as avoiding waste. The clock signal

    in synchronous logic contains no information; therefore, power associated with the clock driver and

    distribution network is in some sense wasted. Avoiding this power consumption component might offer

    significant benefits. In addition, asynchronous logic uses completion signals, thereby avoiding

    glitching, another form of wasted power. Finally, with no clock signal and with computation triggered

    by the presence of new data, asynchronous logic contains a sort of built in power-down mechanism for

    idle periods.

    While asynchronous sounds like the ideal low-power design style, several issues impede its acceptance

    in low-power arenas. Depending on the size of its associated logic structure, the overhead of the

    handshake interface and completion signal generation circuitry can be large in terms of both area and

    power.

    Since this circuitry does not contribute to the actual computations, transitions on handshake signals

    are wasted. This is similar to the waste due to clock power consumption, though it is not as severe

    since handshake signals have lower activity than clocks. Finally, fewer design tools support asynchronous logic than synchronous logic, making asynchronous circuits more difficult to design.


    At the small granularity with which it is commonly implemented, the overhead of the asynchronous

    interface circuitry dominates over the power saving attributes of the design style. It should be emphasized, however, that this is mainly a function of the granularity of the handshaking circuitry. It

    would certainly be worthwhile to consider using asynchronous techniques to eliminate the necessity of

    distributing a global clock between blocks of larger granularity. For example, large modules could

    operate synchronously off local clocks, but communicate globally using asynchronous interfaces. In

    this way, the interface circuitry would represent a very small overhead component, and the most

    power consuming aspects of synchronous circuitry (i.e. global clock distribution) would be avoided.
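
    As a rough illustration of the handshaking idea, the sketch below (entirely hypothetical, not from the text) models a four-phase req/ack channel between two blocks and counts control-wire transitions, which occur only when data actually moves:

```python
def four_phase_transfer(items):
    """Sketch of a four-phase (return-to-zero) req/ack handshake channel.

    Each transfer costs four control-wire transitions (req up, ack up,
    req down, ack down), and the channel makes no transitions at all
    while idle -- computation is triggered only by the arrival of data.
    """
    received = []
    control_transitions = 0
    for item in items:
        control_transitions += 1   # sender raises req (data valid)
        received.append(item)      # receiver latches the data...
        control_transitions += 1   # ...and raises ack
        control_transitions += 1   # sender lowers req
        control_transitions += 1   # receiver lowers ack; channel idle again
    return received, control_transitions

data = [3, 1, 4, 1, 5]
out, transitions = four_phase_transfer(data)
print(out, transitions)   # 4 control transitions per item -> 20 for 5 items
```

    The overhead is proportional to the number of transfers, which is why coarse-grained (inter-block) handshaking amortizes it far better than per-gate handshaking.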

    5.4. Transistor Sizing

    Regardless of the circuit style employed, the issue of transistor sizing for low power arises. The

    primary trade-off involved is between performance and cost - where cost is measured by area and

    power. Transistors with larger gate widths provide more current drive than smaller transistors.

    Unfortunately, they also contribute more device capacitance to the circuit and, consequently, result in higher power dissipation. Moreover, larger devices experience more severe short-circuit currents,

    which should be avoided whenever possible.


    In addition, if all devices in a circuit are sized up, then the loading capacitance increases in the same

    proportion as the current drive, resulting in little performance improvement beyond the point of overcoming fixed parasitic capacitance components. In this sense, large transistors become self-

    loading and the benefit of large devices must be reevaluated. A sensible low-power strategy is to use

    minimum size devices whenever possible. Along the critical path, however, devices should be sized up

    to overcome parasitics and meet performance requirements.
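
    A first-order model makes the self-loading effect concrete. All constants below are illustrative assumptions:

```python
# First-order sizing model: delay ~ C_load / I_drive.  Drive current and
# device self-capacitance both scale with gate width W, while the fixed
# parasitic (wiring) load does not -- so delay saturates as W grows,
# but switched capacitance (hence dynamic power) keeps growing linearly.
C_FIXED = 20.0   # fixed parasitic load driven by the gate, fF (assumed)
C_SELF = 2.0     # device capacitance per unit width, fF/unit (assumed)
I_UNIT = 1.0     # drive current per unit width (arbitrary units)

def delay(w):
    return (C_FIXED + C_SELF * w) / (I_UNIT * w)

def switched_cap(w):   # proxy for dynamic power at fixed activity
    return C_FIXED + C_SELF * w

for w in (1, 2, 4, 8, 16, 32):
    print(f"W={w:2d}  delay={delay(w):6.2f}  switched cap={switched_cap(w):5.1f} fF")
# Delay approaches the self-loading limit C_SELF / I_UNIT = 2.0, so each
# added unit of width buys less speed, while power grows without bound.
```
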

    5.5. Design Style

    Another decision which can have a large impact on the overall chip power consumption is selection of

    design style: e.g. full custom, gate array, standard cell, etc. Not surprisingly, full-custom design offers

    the best possibility of minimizing power consumption. In a custom design, all the principles of low-

    power including locality, regularity, and sizing can be applied optimally to individual circuits.

    Unfortunately, this is a costly alternative in terms of design time, and can rarely be employed

    exclusively as a design strategy. Other possible design styles include gate arrays and standard cells.

    Gate arrays offer one alternative for reducing design cycles at the expense of area, power, and

    performance. While not offering the flexibility of full-custom design, gate-array CAD tools could

    nevertheless be altered to place increased emphasis on power. For example, gate arrays offer some

    control over transistor sizing through the use of parallel transistor connections.


    Standard cell synthesis is another commonly employed strategy for reducing design time. Current

    standard cell libraries and tools, however, offer little hope of achieving low power operation. In many ways, standard cells represent the antithesis of a low-power methodology. First and foremost,

    standard cells are often severely oversized. Most standard cell libraries were designed for maximum

    performance and worst-case loading from inter-cell routing. As a result, they experience significant

    self-loading and waste correspondingly significant amounts of power. To overcome this difficulty,

    standard cell libraries must be expanded to include a selection of cells of identical functionality, but

    varying driving strengths. With this in place, synthesis tools could select the smallest (and lowest

    power cell) required to meet timing constraints, while avoiding the wasted power associated with

    oversized transistors. In addition, the standard cell layout style with its islands of devices and

    extensive routing channels tends to violate the principles of locality central to low-power design.


    5.6. Circuit Level Conclusion

    Clearly, numerous circuit-level techniques are available to the low-power designer. These techniques

    include careful selection of a circuit style: static vs. dynamic, synchronous vs. asynchronous, fully-

    complementary vs. pass-transistor, etc. Other techniques involve transistor sizing or selection of a

    design methodology such as full-custom or standard cell. Some of these techniques can be applied in

    conjunction with higher level power reduction techniques. When possible, designers should take

    advantage of this fact and exploit both low and high-level techniques in concert. Often, however,

    circuit-level techniques will conflict with the low-power strategies based on higher abstraction levels.

    In these cases, the designer must determine which techniques offer the largest power reductions. As

    evidenced by the previous discussion, circuit-level techniques typically offer reductions of a factor of two or less, while some higher level strategies with their more global impact can produce savings of

    an order of magnitude or more. In such situations, considerations imposed by the higher-level

    technique should dominate and the designer should employ those circuit-level methodologies most

    amenable to the selected high-level strategy.


    6. Logic and Architecture Level Optimizations

    Logic-level power optimization has been extensively researched in the last few years. Given the

    complexity of modern digital devices, hand-crafted logic-level optimization is extremely expensive in

    terms of design time and effort. Hence, it is cost-effective only for structured logic in large-volume

    components, like microprocessors (e.g., functional units in the data-path). Fortunately, several

    optimizations for low power have been automated and are now available in commercial logic synthesis

    tools, enabling logic-level power optimization even for unstructured logic and for low-volume VLSI

    circuits. During logic optimization, technology parameters such as supply voltage are fixed, and the

    degrees of freedom are in selecting the functionality and sizing the gates implementing a given logic

    specification. As for technology and circuit-level techniques, power is never the only cost metric of

    interest. In most cases, performance is tightly constrained as well.

    6.1. Logic Level Optimizations

    A common setting is constrained power optimization, where a logic network can be transformed to

    minimize power only if critical path length is not increased. Under this hypothesis, an effective

    technique is based on path equalization.


    Path equalization ensures that signal propagation from inputs to outputs of a logic network follows

    paths of similar length. When paths are equalized, most gates have aligned transitions at their inputs,

    thereby minimizing spurious switching activity (which is created by misaligned input transitions). This

    technique is very helpful in arithmetic circuits, such as adders or multipliers.
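
    The effect of misaligned input transitions can be seen in a toy simulation (the circuit and delay values are invented for illustration): an OR gate fed by a signal and its inverted copy arriving through a slower path produces a spurious output pulse that disappears once the two path delays are equalized:

```python
def simulate_or(a_wave, inv_delay):
    """y(t) = a(t) OR (NOT a(t - inv_delay)): one input arrives late
    through an inverter path.  With unequal path delays, a falling edge
    on 'a' makes y glitch 1->0->1; with equalized paths it never does.
    """
    transitions = 0
    prev = None
    for t in range(len(a_wave)):
        a = a_wave[t]
        b = 1 - a_wave[max(0, t - inv_delay)]   # delayed inverted copy
        out = a | b
        if prev is not None and out != prev:
            transitions += 1                    # count output switching
        prev = out
    return transitions

a = [1] * 5 + [0] * 5                  # 'a' falls once, at t = 5
t_skewed = simulate_or(a, inv_delay=2)  # misaligned paths: spurious pulse
t_equal = simulate_or(a, inv_delay=0)   # equalized paths: no glitch
print(t_skewed, t_equal)   # 2 spurious transitions vs 0
```
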

    Glue logic and controllers have much more irregular structure than arithmetic units, and their gate-

    level implementations are characterized by a wide distribution of path delays. These circuits can be

    optimized for power by resizing. Resizing focuses on fast combinational paths. Gates on fast paths

    are down-sized, thereby decreasing their input capacitances, while at the same time slowing down

    signal propagation. By slowing down fast paths, propagation delays are equalized, and power is

    reduced by joint spurious switching and capacitance reduction. Resizing does notalways imply down-

    sizing.Power can be reduced also by enlarging (or buffering) heavily loaded gates, to increase their

    output slew rates. Fast transitions minimize short-circuit power of the gates in the fan-out of the gate

    which has been sized up, but its input capacitance is increased. In most cases, resizing is a complex

    optimization problem involving a tradeoff between output switching power and internal short-circuit

    power on several gates at the same time.

    Other logic-level power minimization techniques are re-factoring, remapping, phase assignment and

    pin swapping. All these techniques can be classified as local transformations. They are applied on

    gate netlists, and focus on nets with large switched capacitance.


    Most of these techniques replace a gate, or a small group of gates, around the target net, in an effort

    to reduce capacitance and switching activity. Similarly to resizing, local transformations must

    carefully balance short circuit and output power consumption.

    Fig. 6.1. Local transformations: (a) re-mapping, (b) phase assignment, (c) pin swapping

    Figure 6.1 shows three examples of local transformations. In (a) a re-mapping transformation is

    shown, where a high-activity node (marked with x ) is removed thanks to a new mapping onto an

    AND-OR gate. In (b), phase assignment is exploited to eliminate one of the two high-activity nets

    marked with x. Finally, pin swapping is applied in (c) to connect a high-activity net with the input

    pin of the 4-input NAND with minimum input capacitance.
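
    A sketch of the pin-swapping arithmetic, using the standard switched-power expression P = a·C·Vdd²·f (the pin capacitances, activities, and operating point below are invented for illustration):

```python
# Switched power on a net: P = a * C * Vdd^2 * f  (activity a, capacitance C).
VDD, FREQ = 1.2, 100e6   # assumed operating point (V, Hz)

def switched_power(activity, cap_farads):
    return activity * cap_farads * VDD ** 2 * FREQ

# Hypothetical input pins of a 4-input NAND: logically equivalent, but
# with different input capacitances.  Two nets drive this gate: a
# high-activity net (a = 0.40) and a quiet one (a = 0.05).
# Naive assignment: high-activity net on the highest-capacitance pin.
p_before = switched_power(0.40, 2.0e-15) + switched_power(0.05, 1.1e-15)
# Pin swapping: move the high-activity net to the lowest-capacitance pin.
p_after = switched_power(0.40, 1.1e-15) + switched_power(0.05, 2.0e-15)
print(f"before swap: {p_before * 1e9:.1f} nW, after swap: {p_after * 1e9:.1f} nW")
```

    The function computed by the gate is unchanged; only the assignment of nets to equivalent pins differs, yet the switched power drops noticeably.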


    6.2. Architecture Level Optimizations

    Complex digital circuits usually contain units (or parts thereof) that are not performing useful

    computations at every clock cycle. Think, for example, of arithmetic units or register files within a

    microprocessor or, more simply, to registers of an ordinary sequential circuit. The idea, known for a

    long time in the community of IC designers, is to disable the logic which is not in use during some

    particular clock cycles, with the objective of limiting power consumption. In fact, stopping certain

    units from making useless transitions causes a decrease in the overall switched capacitance of the

    system, thus reducing the switching component of the power dissipated. Optimization techniques based

    on the principle above belong to the broad class of dynamic power management (DPM) methods.

    The natural domain of applicability of DPM is system-level design; therefore, it will be discussed in greater detail in the next section. Nevertheless, this paradigm has also been successfully adopted in

    the context of architectural optimization.

    Clock gating provides a way to selectively stop the clock, and thus force the original circuit to make no

    transition, whenever the computation to be carried out by a hardware unit at the next clock cycle is

    useless. In other words, the clock signal is disabled in accordance with the idle conditions of the unit.

    As an example of use of the clock-gating strategy, consider the traditional block diagram of a

    sequential circuit, shown on the left of Figure 6.2.


    Fig. 6.2. Example of gated clock architecture

    It consists of a combinational logic block and an array of state registers which are fed by the next-

    state logic and which provide some feed-back information to the combinational block itself through the

    present-state input signals. The corresponding gated-clock architecture is shown on the right of the

    figure. The circuit is assumed to have a single clock, and the registers are assumed to be edge-

    triggered flip-flops. The combinational block Fa is controlled by the primary inputs, the present-state

    inputs, and the primary outputs of the circuit, and it implements the activation function of the clock

    gating mechanism. Its purpose is to selectively stop the local clock of the circuit when no state or

    output transition takes place. The block named L is a latch, transparent when the global clock signal

    CLK is inactive. Its presence is essential for a correct operation of the system, since it takes care of

    filtering glitches that may occur at the output of block Fa.
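
    A behavioral sketch of the idea (the activation function here is a toy one, chosen purely for illustration): the clock reaches the register only when the stored state would actually change, so the result is identical but far fewer clock edges are delivered:

```python
def run(inputs, gated):
    """Count clock edges delivered to a register over an input sequence.

    'fa' below plays the role of the activation function Fa: it enables
    the clock only when the next input differs from the stored state,
    i.e. only when a state transition would actually take place.
    """
    state = 0
    clock_edges = 0
    for x in inputs:
        fa = x != state            # activation function: idle-condition test
        if not gated or fa:        # gated clock = CLK AND latched Fa
            clock_edges += 1       # the register receives a clock edge
            state = x
    return state, clock_edges

inputs = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
final_free, edges_free = run(inputs, gated=False)
final_gated, edges_gated = run(inputs, gated=True)
print(edges_free, edges_gated)   # 10 vs 3: same final state, fewer edges
assert final_free == final_gated
```
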


    The clock management logic is synthesized from the Boolean function representing the idle conditions

    of the circuit. It may well be the case that considering all such conditions results in additional circuitry

    that is too large and power consuming. It may then be necessary to synthesize a simplified function,

    which dissipates the minimum possible power, and stops the clock with maximum efficiency. Because

    of its effectiveness, clock-gating has been applied extensively in real designs and it has lately found its

    way in industry-strength CAD tools (e.g., Power Compiler by Synopsys).

    Power savings obtained by gating the clock distribution network of some hardware resources come at

    the price of a global decrease in performance. In fact, resuming the operation of an inactive resource

    introduces a latency penalty that negatively impacts system speed. In other words, with clock gating

    (or with any similar DPM technique), performance and throughput of an architecture are traded for

    power.


    7. Software and System Level Optimizations

    Electronic systems and subsystems consist of hardware platforms with several software layers. Many

    system features depend on the hardware/software interaction, e.g., programmability and flexibility,

    performance and energy consumption. Software does not consume energy per se, but it is the

    execution and storage of software that requires energy consumption by the underlying hardware.

    Software execution corresponds to performing operations on hardware, as well as accessing and

    storing data.

    Thus, software execution involves power dissipation for computation, storage, and communication.

    Moreover, storage of computer programs in semiconductor memories requires energy (refresh of

    DRAMs, static power for SRAMs).

    The energy budget for storing programs is typically small (with the choice of appropriate components)

    and predictable at design time. Thus, we will concentrate on energy consumption of software during

    its execution. Nevertheless, it is important to remember that reducing the size of programs, which is a

    usual objective in compilation, correlates with reducing their energy storage costs. Additional

    reduction of code size can be achieved by means of compression techniques. The energy cost of

    executing a program depends on its machine code and on the hardware architecture parameters.



    The machine code is derived from the source code from compilation. Typically, the energy cost of the

    machine code is affected by the back-end of software compilation, that controls the type, number and

    order of operations, and by the means of storing data, e.g., locality (registers vs. memory arrays),

    addressing, order. Nevertheless, some architecture independent optimizations can be useful in general

    to reduce energy consumption, e.g., selective loop unrolling and software pipelining.

    Software instructions can be characterized by the number of cycles needed to execute them and by the

    energy required per cycle. The energy consumed by an instruction depends weakly on the state of the

    processor (i.e., on the previously executed instruction).

    On the other hand, the energy varies significantly when the instruction requires storage in registers or

    in memory (caches).
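
    A toy energy model (all instruction costs below are invented for illustration) shows how a register spill changes the energy of an otherwise identical computation:

```python
# Toy per-instruction energy model: energy per cycle is roughly uniform,
# but instructions that touch memory cost extra cycles and extra
# per-access energy.  All numbers are illustrative only.
ENERGY_PER_CYCLE_NJ = 1.0
MEM_ACCESS_EXTRA_NJ = 3.0
CYCLES = {"add": 1, "mul": 2, "load": 3, "store": 3}

def program_energy(instrs):
    total = 0.0
    for op in instrs:
        total += CYCLES[op] * ENERGY_PER_CYCLE_NJ
        if op in ("load", "store"):
            total += MEM_ACCESS_EXTRA_NJ   # addressing, decoding, array access
    return total

# Same computation, two register-allocation outcomes: the second spills
# an intermediate value to memory and pays for it in energy.
no_spill = ["load", "add", "mul", "store"]
with_spill = ["load", "add", "store", "load", "mul", "store"]
print(program_energy(no_spill), program_energy(with_spill))  # 15.0 27.0
```
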

    The traditional goal of a compiler is to speed up the execution of the generated code, by reducing the

    code size (which correlates with execution latency) and minimizing spills to memory.

    Interestingly enough, executing machine code of minimum size would consume the minimum energy, if

    we neglect the interaction with memory and we assume a uniform energy cost of each instruction.

    Energy-efficient compilation strives at achieving machine code that requires less energy as compared

    to a performance-driven traditional compiler, by leveraging the non-uniformity in instruction energy

    cost, and the different energy costs for storage in registers and in main memory due to addressing and

    address decoding. Nevertheless, results are sometimes contradictory.



    Whereas for some architectures energy-efficient compilation gives a competitive advantage as

    compared to traditional compilation, for some others the most compact code is also the most

    economical in terms of energy, thus obviating the need of specific low-power compilers.

    Power-aware operating systems (OSs) trade generality for energy efficiency. In the case of embedded

    electronic systems, OSs are streamlined to support just the required applications. On the other hand,

    such an approach may not be applicable to OSs for personal computers, where the user wants to

    retain the ability of executing a wide variety of applications.

    Energy efficiency in an operating system can be achieved by designing an energy aware task

    scheduler. Usually, a scheduler determines the set of start times for each task, with the goal of

    optimizing a cost function related to the completion time of all tasks, and to satisfy real time

    constraints, if applicable. Since tasks are associated with resources having specific energy models, the

    scheduler can exploit this information to reduce run-time power consumption.

    Operating systems achieve major energy savings by implementing dynamic power management (DPM)

    of the system resources. DPM dynamically reconfigures an electronic system to provide the requested

    services and performance levels with a minimum number of active components or a minimum load on

    such components. Dynamic power management encompasses a set of techniques that achieve energy-

    efficient computation by selectively shutting down or slowing down system components when they are

    idle (or partially unexploited). DPM can be implemented in different forms including, but not limited



    to, clock gating, clock throttling, supply voltage shut-down, and dynamically varying power supplies.
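
    The shutdown decision behind DPM is often framed in terms of a break-even time: an idle period is worth a shutdown only if it is long enough to amortize the transition overhead. The numbers below are illustrative assumptions:

```python
# A component is worth shutting down only when the idle period exceeds
# the break-even time, at which the shutdown/wake-up overhead is paid
# back by the lower sleep power.  All numbers are illustrative.
P_ON = 400e-3          # power while active but idle (W)
P_SLEEP = 10e-3        # power in the sleep state (W)
E_TRANSITION = 1.5e-3  # energy to enter and leave the sleep state (J)

def idle_energy(t_idle, shut_down):
    """Energy spent over an idle period of t_idle seconds."""
    if shut_down:
        return E_TRANSITION + P_SLEEP * t_idle
    return P_ON * t_idle

# Break-even: E_TRANSITION + P_SLEEP * t == P_ON * t
t_be = E_TRANSITION / (P_ON - P_SLEEP)
print(f"break-even idle time: {t_be * 1e3:.2f} ms")

# Long idle periods favor shutdown; short ones do not.
assert idle_energy(2 * t_be, shut_down=True) < idle_energy(2 * t_be, shut_down=False)
assert idle_energy(0.5 * t_be, shut_down=True) > idle_energy(0.5 * t_be, shut_down=False)
```

    A DPM policy, then, is essentially a predictor of whether the upcoming idle period will exceed this break-even time.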

    Several system-level design trade-offs can be explored to reduce energy consumption. Some of these

    design choices belong to the domain of hardware/software co-design, and leverage the migration of

    hardware functions to software or vice versa. For example, the Advanced Configuration and Power

    Interface (ACPI) standard, initiated by Intel, Microsoft and Toshiba, provides a portable hw/sw

    interface that makes it easy to implement DPM policies for personal computers in software.


    Conclusions

    Electronic design aims at striking a balance between performance and power efficiency. Designing

    low power applications is a multi-faceted problem, because of the plurality of embodiments that a

    system specification may have and the variety of degrees of freedom that designers have to cope with

    power reduction. In this brief tutorial, we showed different design options and the corresponding

    advantages and disadvantages. We tried to relate general-purpose low-power design solutions to a few

    successful chips that use them to various extents. Even though we described only a few samples of

    design techniques and implementations, we think that our samples are representative of the state of the

    art of current technologies and can suggest future developments and improvements.


    References

    [1] J. Rabaey and M. Pedram, Low Power Design Methodologies. Kluwer, 1996.

    [2] J. Mermet and W. Nebel, Low Power Design in Deep Submicron Electronics. Kluwer, 1997.

    [3] A. Chandrakasan and R. Brodersen, Low-Power CMOS Design. IEEE Press, 1998.

    [4] T. Burd and R. Brodersen, Processor Design for Portable Systems, Journal of VLSI Signal Processing Systems, vol. 13, no. 2-3, pp. 203-221, August 1996.

    [5] D. Ditzel, Transmeta's Crusoe: Cool Chips for Mobile Computing, Hot Chips Symposium, August 2000.

    [6] J. Montanaro, et al., A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor, IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, November 1996.

    [7] V. Lee, et al., A 1-V Programmable DSP for Wireless Communications, IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1766-1776, November 1997.

    [8] M. Takahashi, et al., A 60-mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme, IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1772-1780, November 1998.

    [9] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, Low-Power CMOS Digital Design, IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.

    [10] F. Najm, A Survey of Power Estimation Techniques in VLSI Circuits, IEEE Transactions on VLSI Systems, vol. 2, no. 4, pp. 446-455, December 1994.

    [11] M. Pedram, Power Estimation and Optimization at the Logic Level, International Journal of High-Speed Electronics and Systems, vol. 5, no. 2, pp. 179-202, 1994.

    [12] P. Landman, High-Level Power Estimation, ISLPED-96: ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 29-35, Monterey, California, August 1996.

    [13] E. Macii, M. Pedram, F. Somenzi, High-Level Power Modeling, Estimation, and Optimization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061-1079, November 1998.

    [14] S. Borkar, Design Challenges of Technology Scaling, IEEE Micro, vol. 19, no. 4, pp. 23-29, July-August 1999.

    [15] S. Thompson, P. Packan, and M. Bohr, MOS Scaling: Transistor Challenges for the 21st Century, Intel Technology Journal, Q3, 1998.

    [16] Z. Chen, J. Shott, and J. Plummer, CMOS Technology Scaling for Low Voltage Low Power Applications, ISLPE-94: IEEE International Symposium on Low Power Electronics, pp. 56-57, San Diego, CA, October 1994.

    [17] Y. Ye, S. Borkar, and V. De, A New Technique for Standby Leakage Reduction in High-Performance Circuits, 1998 Symposium on VLSI Circuits, pp. 40-41, Honolulu, Hawaii, June 1998.

    [18] M. Pedram, Power Minimization in IC Design: Principles and Applications, ACM Transactions on Design Automation of Electronic Systems, vol. 1, no. 1, pp. 3-56, January 1996.

    [19] B. Chen and I. Nedelchev, Power Compiler: A Gate Level Power Optimization and Synthesis System, ICCD-97: IEEE International Conference on Computer Design, pp. 74-79, Austin, Texas, October 1997.

    [20] L. Benini, P. Siegel, and G. De Micheli, Automatic Synthesis of Gated Clocks for Power Reduction in Sequential Circuits, IEEE Design and Test of Computers, vol. 11, no. 4, pp. 32-40, December 1994.

    [21] Y. Yoshida, B.-Y. Song, H. Okuhata, T. Onoye, and I. Shirakawa, An Object Code Compression Approach to Embedded Processors, ISLPED-97: ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 265-268, Monterey, California, August 1997.

    [22] L. Benini, A. Macii, E. Macii, and M. Poncino, Selective Instruction Compression for Memory Energy Reduction in Embedded Systems, ISLPED-99: ACM/IEEE 1999 International Symposium on Low Power Electronics and Design, pp. 206-211, San Diego, California, August 1999.

    [23] H. Lekatsas and W. Wolf, Code Compression for Low Power Embedded Systems, DAC-37: ACM/IEEE Design Automation Conference, pp. 294-299, Los Angeles, California, June 2000.

    [24] S. Segars, K. Clarke, and L. Goudge, Embedded Control Problems, Thumb and the ARM7TDMI, IEEE Micro, vol. 15, no. 5, pp. 22-30, October 1995.

    [25] D. Brooks, et al., Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors, IEEE Micro, vol. 20, no. 6, pp. 26-44, November 2000.

    [26] L. Benini and G. De Micheli, Dynamic Power Management: Design Techniques and CAD Tools. Kluwer, 1997.

    [27] Intel, SA-1100 Microprocessor Technical Reference Manual. 1998.

    [28] L. Benini, A. Bogliolo, and G. De Micheli, A Survey of Design Techniques for System-Level Dynamic Power Management, IEEE Transactions on VLSI Systems, vol. 8, no. 3, pp. 299-316, June 2000.

    Some Interesting Links

    1) Center for Low Power Electronics:

    http://clpe.ece.arizona.edu/

    2) Bibliography on Dynamic Power Management:

    http://www.cse.unsw.edu.au/~danielp/cs1/power/files/bib.shtml

    3) European Low Power Initiative for Electronic System Design:

    http://www.ddtc.dimes.tudelft.nl/LowPower/index_f.html

    4) Low Power IP Library:

    http://www.ee.ed.ac.uk/~SLIg/iplibrary.html