power efficient technology decomposition and mapping under an extended power consumption model


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 1994

Power Efficient Technology Decomposition and Mapping Under an Extended Power Consumption Model

Chi-Ying Tsui, Massoud Pedram, Member, IEEE, and Alvin M. Despain, Member, IEEE

Abstract-We propose a new power consumption model that accounts for the power consumption at the internal nodes of a CMOS gate. Next, we address the problem of minimizing the average power consumption during the technology dependent phase of logic synthesis. Our approach consists of two steps. In the first step, we generate a NAND decomposition of an optimized Boolean network such that the sum of average switching rates for all nodes in the network is minimum. In the second step, we perform a power efficient technology mapping that finds a minimal power mapping for given timing constraints (subject to the unknown load problem).

I. INTRODUCTION

PORTABLE applications have shifted from conventional low performance products such as wristwatches and calculators to high throughput and computationally intensive products such as notebook computers and cellular phones.

The new applications require high speed yet low power

consumption, because for such products longer battery life

translates to extended use and better marketability. With

the convergence of telecommunications, computers, consumer

electronics, and biomedical technologies, the number of low-

power applications is expected to grow. Another driving force

behind design for low power is that excessive power con-

sumption is becoming the limiting factor in integrating more

transistors on a single- or multi-chip module due to cooling, packaging, and reliability problems. Exploring the trade-offs

between area, performance, and power during synthesis and

design is thus demanding more attention.

Prior Work

Many researchers (in solid state circuits and technology areas) have been studying low power/low voltage design techniques. For example, research is being conducted in low power DRAM and SRAM designs, aggressive voltage scaling, and process optimization for active logic circuits, device modeling and simulation tools for low power circuits, low power analog circuit design, etc. Other researchers (in computer architecture area) are exploring instruction set architectures and novel memory management schemes for low-power processor design using self-clocking, static, and dynamic power management strategies, etc. The computer-aided design community has recently started paying more attention to power estimation and low-power design.

Manuscript received August 16, 1993; revised March 18, 1994. This research was supported in part by the NSF's Research Initiation Award under contract No. MIP-9211668 and DARPA under contract No. J-FBI-91-194. This paper was recommended by Associate Editor R. E. Bryant.

The authors are with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089 USA.

IEEE Log Number 9401870.

Power consumption is a function of the switching of the load

capacitances, which in turn depends on the sequence of logic

values applied to the circuit inputs. It is impractical to exhaustively simulate all possible input sequences, hence probabilistic

approaches are developed to estimate the switching activities.

If the set of all possible transitions at the inputs of the circuit

is known, we can consider each transition as an event with a

certain probability. Over this probability space, the transition

at an internal node of the circuit becomes a stochastic process [16]. The statistical information about the node activities can

then be captured by the expected number of transitions which

in turn depends on the signal and switching probabilities of

the node.

Several researchers have studied the problem of estimating

power consumption. Cirit [6] estimates the dynamic power

consumption at the transistor level based on the probability

of the drain of a transistor making a power-consuming tran-

sition. Burch et al. [4] introduce the concept of probability

waveforms. Given such waveforms at the primary inputs and

with some convenient partitioning of the circuit, they examine

every sub-circuit and derive corresponding waveforms at the internal circuit nodes. Najm et al. [15] use the notions of

transition probabilities and probabilistic simulation to estimate

the average current drawn by a circuit. These methods assume

inputs to sub-circuits are independent and thus do not account

for the reconvergent fanout and input correlations.

Ghosh et al. [9] propose symbolic simulation in order to produce a set of Boolean functions that represent conditions for switching at each gate in the circuit. A general delay model that

correctly computes the Boolean conditions causing glitchings

is used and correlations due to the reconvergence of input

signals are taken into account.

Recently, research on synthesis for low power has been carried out. Algorithms that minimize power consumption during multi-level logic minimization [19], [17], state assignment for finite state machines [17], and placement [22] were proposed.

0278-0070/94$04.00 © 1994 IEEE

All of the above power-estimation tools and low power synthesis and design algorithms are based on a simple power-consumption model where power is consumed only during the charging and discharging of the external capacitances driven by the gates. This assumption surely underestimates the power

consumption of the circuit since it does not account for the

power consumption due to the charging and discharging of

the source/drain capacitance of the internal nodes of the gates.

Power consumption due to internal capacitances contributes

a significant portion of the total power consumption in the

circuit (see Section V). Any power-estimation tool ignoring

this factor will thus be inaccurate.

In this paper, we propose a new power-consumption model

that considers the internal power consumption and present a

technique to estimate the average internal power consumption

for dynamic DOMINO CMOS as well as static CMOS circuits.

We also propose a low-power technology decomposition pro-

cedure that produces a decomposed network with minimum

switching activities and a low-power technology mapping

algorithm that hides the high switching nodes inside gates

and reduces the fanout load driven by high switching activity

nodes.

Calculation of Switching Probabilities

Estimation of switching probabilities depends on the delay

model used. Under a zero delay model where gate delays are

assumed to be zero [9], signal transitions due to glitching are

ignored. This tends to underestimate the power consumption.

There is, however, a good correlation between the estimated

power under a zero delay model and a real delay model [19].

In the following, we assume a zero delay model (see Section

VI, however, for more discussion on issues regarding the use

of a real delay model).

Input correlations also affect the calculation of signal prob-

abilities. In combinational circuits such as datapath modules,

the input patterns are usually random. Thus, the primary inputs

can be assumed to be spatially and temporally uncorrelated.

What this means is that each circuit input is independent of

the others, and furthermore, the present value of the input is

independent of its previous value. In sequential circuits such as finite state machines, the present state bits are spatially

and temporally correlated, thus the independence assumption

does not hold. Reference [21] describes exact and approximate

methods to calculate signal and transition probabilities in

such circuits. In the remainder of this paper, we assume that

the circuit inputs are spatially uncorrelated (see Section VI,

however, for a method to approximate the spatial correlations

among circuit inputs).

For n-type dynamic circuits, the output is precharged to 1 and hence the switching probability is given by

where $P_{o=0}$ is the probability that signal $o$ assumes value 0.

For p-type dynamic circuits, the output is predischarged to 0 and hence the switching probability is given by

where $P_{o=1}$ is the probability that signal $o$ assumes value 1.

For static circuits, if we assume that the present input signal value is independent of the previous value, then the switching probabilities are given by

$$P_{o_1(x \to u) \mid o_2(y \to v)} = P_{o_1 = x \mid o_2 = y}\, P_{o_1 = u \mid o_2 = v} \quad (4)$$

where $P_{o_1 = x \mid o_2 = y}$ is the probability that signal $o_1$ assumes x given that signal $o_2$ assumes y. The signal probability at the output of a node is calculated by first building an Ordered Binary-Decision Diagram (OBDD) [3] corresponding to the

global function of the node and then performing a linear

traversal of the OBDD using the procedure given in [14].

Organization of the Paper

The rest of this paper is organized as follows. In Section

II we review the conventional power-consumption models and present an extended model that accounts for the internal

power consumption. We also introduce the notion of charging

and discharging functions for an internal node of a gate and

present techniques that use these functions to estimate the

internal power consumption. In Section III, we describe a procedure for constructing a NAND-decomposed network that

has minimum total switching activity. Section IV presents a

technology mapping paradigm for low power that exploits the

power/delay trade-off curves to generate a mapped network

with minimum power consumption subject to given timing

constraints. Experimental results are presented in Section V.

We discuss some issues and extensions in Section VI and give

conclusions in Section VII.

II. POWER CONSUMPTION MODELING AND ANALYSIS

A Simple Power Consumption Model

Power consumption in CMOS circuits is caused by three

sources: the charging and discharging of capacitive loads during output switchings, the short circuit current that flows

during output transitions, and leakage current. The last two

sources can be made small with proper device and circuit design techniques [23]. We therefore concentrate on the dynamic power consumption.

The average power consumption for a gate g in a synchronous CMOS circuit is given by

$$P_g^{avg} = 0.5\,\frac{V_{dd}^2}{T_{cycle}}\,C_g^{load}\,E(\text{switching}) \quad (5)$$

where

$$C_g^{load} = \sum_{j \in fanout(g)} C_j + C_{wire} \quad (6)$$

and $C_j$ is the gate capacitance of j, $C_{wire}$ is the wiring capacitance of the output net, $V_{dd}$ is the supply voltage, $T_{cycle}$ is the clock cycle time, and E(switching) is the expected number of transitions at the output of g per clock cycle. E(switching) is a function of the design style, the logic function being computed, the delay model, and the switching probabilities of the primary inputs.
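As a rough numerical illustration of the simple model, the sketch below assumes the usual dynamic-power form $0.5\,C\,V_{dd}^2\,E/T_{cycle}$, with the load taken as the sum of the fanout gate capacitances plus the wire capacitance, as the symbol definitions above suggest. The function name and all component values are hypothetical and are not taken from the paper.

```python
def gate_dynamic_power(fanout_caps, c_wire, vdd, t_cycle, e_switching):
    """Average dynamic power (watts) of one gate under the simple model.

    fanout_caps : gate capacitances (farads) of the gates driven by this gate
    c_wire      : wiring capacitance of the output net (farads)
    e_switching : expected number of output transitions per clock cycle
    """
    c_load = sum(fanout_caps) + c_wire
    return 0.5 * vdd ** 2 / t_cycle * c_load * e_switching


# Hypothetical values: two fanout gates, 5 V supply, 50 ns clock period.
power = gate_dynamic_power([10e-15, 20e-15], 5e-15, 5.0, 50e-9, 0.5)
print(f"{power * 1e6:.3f} uW")
```

Because the load and the switching activity multiply, a small gate driven by a highly active signal can dissipate more than a large gate on a quiet net, which is why area alone is a poor proxy for power.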



Fig. 1. Charging and discharging of an internal node of a 2-input NOR gate.

Since the gate capacitance is proportional to the gate area,

the traditional approach for minimizing the power consump-

tion has been to minimize the total gate area. However, the

power consumption also depends on the switching activities

of the gates. Indeed, we have to minimize the total weighted

switching activity in the circuit, i.e., $\sum C^{load}\,E(\text{switching})$. Therefore, the minimum area solution is in general different

from the minimum power solution.

Power Consumption at Internal Nodes of a Gate

In a typical 0.8-µm technology, the diffusion capacitance is about 20% of the gate capacitance [1]. The charging and discharging of the source/drain capacitance of transistors in a gate thus makes a considerable contribution to the power consumption of the gate. In particular, the internal capacitances may be charged and discharged without any change in the gate output. This is illustrated in Fig. 1. For a two-input NOR gate, if the inputs are changing from 01 to 10, the output remains unchanged. However, when the inputs are 01, the capacitance of node A is charged; it is then discharged when the inputs are changed to 10. This contributes to the power consumption in the circuit. The amount of power consumption depends on the frequency of charging and discharging and the source/drain capacitances (hence sizes of the transistors).

Internal power consumption is especially important for complex gates that have more internal nodes with higher parasitic

capacitances. A complex gate can be logically decomposed

into a network consisting of only two input AND/OR gates

and inverters. It may appear that the power consumption of

internal nodes of the complex gate can be accounted for

by calculating the power consumption of the external nodes

in the corresponding decomposed network. This is incorrect,

however, for the following reasons:

1) A complex gate may be decomposed in different ways (see Fig. 2(b)).

2) There does not exist a one-to-one correspondence between internal nodes of a complex gate and the external nodes of any given decomposition (internal node A of the four input NAND gate shown in Fig. 2(a) is one such node).

3) The internal nodes of a complex gate may float while the external nodes of a given decomposition are always either low or high.

Fig. 2. (a) Four-input NAND gate and (b) two-input AND and inverter decompositions.

The power consumption at the internal node n of gate g in

a static circuit is given by

$$P_g^{int}(n) = 0.5\,\frac{V_{dd}^2}{T_{cycle}}\,C_{SD}^{n}\,E_n(\text{charge/discharge}) \quad (7)$$

where $C_{SD}^{n}$ is the source/drain capacitance at n and $E_n$(charge/discharge) is the expected number of charge and discharge events at n per clock cycle.1

$C_{SD}^{n}$ and $E_n$(charge/discharge) for internal nodes of gates in dynamic circuits are somewhat different and deserve more explanation. We will consider n-type dynamic circuits. Similar results can be derived for p-type dynamic circuits. Here, output nodes are charged to 1 during the precharge period while all the gate inputs that are fed from the previous stage are maintained at 0. Therefore, only the precharged output nodes have a direct charging path to $V_{dd}$. Consider Fig. 3. During the evaluation period, if $i_1$ is 1 and $i_2$ is 0, there will be charge sharing between output node out and internal node n and the voltage seen at n will be equal to $\frac{C_{out}}{C_{out} + C_n} V_{dd}$.

The power consumption due to charge sharing is thus equal to

$$P_g^{int}(n) = 0.5\,\frac{V_{dd}^2}{T_{cycle}}\,C_n^{eff}\,E_n(\text{charge/discharge}) \quad (8)$$

where $C_n^{eff} = \left(\frac{C_{out}}{C_{out} + C_n}\right)^2 C_n$ is the effective capacitance seen at n.
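The charge-sharing expressions can be illustrated with a short calculation. The sketch below assumes ideal charge conservation between a precharged output node and initially discharged internal nodes; the function name, the optional list of additional connected nodes, and the capacitance values are hypothetical.

```python
def charge_sharing(c_out, c_n, vdd, c_others=()):
    """Return (voltage seen at node n, effective capacitance of node n).

    c_out    : capacitance of the precharged output node
    c_n      : source/drain capacitance of internal node n
    c_others : capacitances of further internal nodes connected at the same time
    Assumes every internal node starts fully discharged.
    """
    ratio = c_out / (c_out + c_n + sum(c_others))
    return ratio * vdd, ratio ** 2 * c_n


# Hypothetical values: 100 fF output node, 10 fF internal nodes, 5 V supply.
v_n, c_eff = charge_sharing(100e-15, 10e-15, 5.0)
_, c_eff_two = charge_sharing(100e-15, 10e-15, 5.0, c_others=(10e-15,))
print(v_n, c_eff, c_eff_two)
```

With a second internal node connected, the shared voltage is lower, so the effective capacitance attributed to n shrinks.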

1 Equation (7) assumes that the threshold voltage $V_t$ is negligible compared to $V_{dd}$. In general, if the internal node n is in the nmos (or pmos) chain of the gate, then $V_{dd}^2$ in (7) should be replaced by $(V_{dd} - V_{tN})^2$ (or $(V_{dd} - |V_{tP}|)^2$) where $V_{tN}$ (or $|V_{tP}|$) is the threshold voltage of the nmos (or pmos) transistor.

The effective capacitance of an internal node depends on how many internal nodes are connected to the precharged node at the same time. In Fig. 3, if $i_2$ is also 1, then m is connected to n and out. The effective capacitance seen at n is thus smaller and is given by

$$C_n^{eff} = \left(\frac{C_{out}}{C_{out} + C_n + C_m}\right)^2 C_n$$

In general, the effective capacitance of a node can be approximated as

(9)

where $C_j$ is the source/drain capacitance of internal node j that lies on the path from n to the precharged node. Charge sharing creates a drop in the output voltage and if this drop is greater than a threshold voltage, it may create a logic error or loss of noise margin in the subsequent stages. Therefore, for a dynamic circuit to operate properly, the voltage drop due to charge sharing must be small. This implies that $C_{out} \gg C_j$ and $C_n^{eff} \approx C_n$.

When the charge sharing problem is severe, designers often precharge the internal nodes of the pulldown NFET chain for n-type circuits. In this case, for each precharged node n, since it is precharged every cycle, $E_n$(charge/discharge) is equal to $P_{low}(n)$.

Fig. 3. Charge sharing in dynamic circuits.

An Extended Power Consumption Model

To capture the power consumption due to internal capacitances, (5) must be augmented as follows:

$$P_g^{avg} = 0.5\,\frac{V_{dd}^2}{T_{cycle}}\,C_g^{load}\,E(\text{switching}) + \sum_{i \in internal\_nodes(g)} P_g^{int}(i) \quad (10)$$

where $P_g^{int}(i)$ is given by either (7) or (8).

Calculation of Expected Number of Charge/Discharge Events

The internal power consumption depends on $C_{SD}^{i}$, which is determined by the size of transistors in the gate, and on $E_i$(charge/discharge), which depends on the global function of i in terms of the primary inputs, the switching probabilities of the primary inputs, and the internal structure of the gate. The main theorem for calculating $E_i$(charge/discharge) is given next.

Theorem: The expected number of charge/discharge events per clock cycle at some internal node i that is not precharged (or predischarged) is given by

where $P_{high}(i)$ and $P_{low}(i)$ denote the probabilities of i being charged to $V_{dd}$ and discharged to Gnd.

Proof: We find E(n), the expected number of patterns $h(f^*)l$ in a string of n characters from the alphabet {h, f, l}, where h, f, l denote the input combinations that give rise to charging, floating, and discharging conditions at node i, respectively, and $f^*$ means zero or more f characters.

E(n) is composed of two parts. The first part is the expected number of patterns $h(f^*)l$ in a string of n - 1 characters (i.e., E(n - 1)). The second part is the product of P(h) and the probability $P((f^*)l,\, n-1)$ that a string with n - 1 characters begins with pattern $(f^*)l$. The reason for this is that if the first character is an l or f, it will not create any pattern in addition to those included in E(n - 1); if the first character is an h, then any string of n - 1 characters that begins with pattern $(f^*)l$ will add one additional $h(f^*)l$ pattern to E(n). Since

$$P((f^*)l,\, n-1) = P(l) + P(f)P(l) + \cdots + P(f)^{n-2}P(l)$$

E(n) is given by the following recurrence formula

$$E(n) = E(n-1) + P(h)\,P((f^*)l,\, n-1)$$

This recurrence formula can be solved to yield

Each character represents an input vector, and n input vectors means n cycles have elapsed. So, the expected number of charge/discharge events per clock cycle is given by

as n approaches $\infty$. Because

this is equal to

□
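The recurrence used in the proof can be checked numerically. The sketch below iterates $E(n) = E(n-1) + P(h)\,P((f^*)l,\, n-1)$ and compares $E(n)/n$ with $P(h)P(l)/(P(h)+P(l))$, which is the limit obtained by summing the geometric series when $P(h)+P(f)+P(l)=1$. The probabilities are illustrative, and the sketch makes no assumption about how the theorem's final expression scales this per-pattern rate.

```python
def expected_patterns(p_h, p_f, p_l, n):
    """Iterate E(k) = E(k-1) + P(h) * P((f*)l, k-1) for k = 2..n."""
    e = 0.0        # E(1) = 0: a single character cannot contain h(f*)l
    prefix = 0.0   # P((f*)l, k-1) = sum over j = 0..k-2 of P(f)^j * P(l)
    term = p_l     # next geometric term P(f)^j * P(l)
    for _ in range(2, n + 1):
        prefix += term
        term *= p_f
        e += p_h * prefix
    return e


def limiting_rate(p_h, p_l):
    """Limit of E(n)/n as n grows, assuming P(h) + P(f) + P(l) = 1."""
    return p_h * p_l / (p_h + p_l)


p_h, p_f, p_l = 0.2, 0.5, 0.3   # illustrative probabilities
n = 100_000
print(expected_patterns(p_h, p_f, p_l, n) / n, limiting_rate(p_h, p_l))
```

The two printed values agree to within a term of order 1/n, which reflects the boundary effect at the start of the string.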

In the following, we show how to calculate $P_{high}(i)$ and $P_{low}(i)$. For both dynamic and static circuits, $P_{high}(n)$ and $P_{low}(n)$ are given by

where $F_c(n)$ and $F_d(n)$ are Boolean functions representing conditions for n to be charged and discharged, respectively.

Let TG(N, E) be the transistor graph of gate g where each node represents an internal node of the gate. For dynamic circuits, $V_{dd}$ (or Gnd) is replaced by the precharged (or predischarged) node. Each edge that represents an n-transistor is labeled with the signal name of the gate of the corresponding transistor. Each edge that represents a p-transistor is labeled with the inverted gate signal of the corresponding p-transistor. From the transistor graph, logic functions that charge/discharge the internal nodes can be obtained as stated below.

Theorem: Let o denote the gate output node, G be the logic function of o, paths($n_1$, $n_2$) denote the set of all simple paths connecting $n_1$ and $n_2$ in TG(N, E), and $L(P) = x_1 \wedge x_2 \wedge \cdots \wedge x_n$ where P is a simple path consisting of edges $e_1, e_2, \ldots, e_n$ and $x_i$ is the label of edge $e_i$. The charging and discharging functions for internal node n are given as follows (proof omitted):

n-type dynamic:

$$F_c^{dynamic\_N}(n) = G \wedge \bigvee_{P \in paths(n,\,precharged\_node)} L(P) \quad (17)$$

$$F_d^{dynamic\_N}(n) = \bigvee_{P \in paths(n,\,Gnd)} L(P)$$

static PMOS:

$$F_d^{static\_P}(n) = \overline{G} \wedge \bigvee_{P \in paths(n,\,o)} L(P) \quad (24)$$

where $\wedge$ and $\vee$ denote Boolean AND and OR operations.

Fig. 4 shows an example of the charging and discharging functions for a 3-input NOR gate.

Fig. 4. The transistor graph and charging/discharging functions for a 3-input NOR gate (TG: transistor graph; CF: charging function; DF: discharging function).
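The path-based construction can be sketched in a few lines. The code below builds only the disjunction over simple paths of the conjunction of edge labels, then evaluates its probability by enumerating input assignments under independent inputs. The graph encoding, the node names (`p1`, `p2`), and the example 3-input NOR topology are assumptions made for illustration; any additional factor involving the gate output function G is left out.

```python
from itertools import product


def simple_paths(edges, src, dst, visited=None):
    """Yield each simple path from src to dst as a list of edge labels."""
    visited = (visited or set()) | {src}
    if src == dst:
        yield []
        return
    for u, v, label in edges:
        for a, b in ((u, v), (v, u)):          # transistor edges are undirected
            if a == src and b not in visited:
                for rest in simple_paths(edges, b, dst, visited):
                    yield [label] + rest


def path_function(edges, src, dst):
    """Return f(assignment) = OR over simple paths P of L(P)."""
    paths = list(simple_paths(edges, src, dst))

    def f(assign):
        return any(all(assign[s] == pol for s, pol in p) for p in paths)
    return f


def probability(f, p_one):
    """P(f = 1) by exhaustive enumeration, assuming independent inputs."""
    names = sorted(p_one)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        assign = dict(zip(names, bits))
        if f(assign):
            weight = 1.0
            for s in names:
                weight *= p_one[s] if assign[s] else 1.0 - p_one[s]
            total += weight
    return total


# Hypothetical 3-input NOR: series PMOS chain Vdd-p1-p2-out, parallel NMOS to Gnd.
# A label (signal, True) is an n-transistor; (signal, False) is the inverted signal.
edges = [("Vdd", "p1", ("a", False)), ("p1", "p2", ("b", False)),
         ("p2", "out", ("c", False)),
         ("out", "Gnd", ("a", True)), ("out", "Gnd", ("b", True)),
         ("out", "Gnd", ("c", True))]
p_one = {"a": 0.5, "b": 0.5, "c": 0.5}
print(probability(path_function(edges, "p2", "Vdd"), p_one))   # path to Vdd
print(probability(path_function(edges, "p2", "Gnd"), p_one))   # path to Gnd
```

Exhaustive enumeration is exponential in the number of inputs and is used here only because a single gate has few of them; the OBDD-based procedure mentioned earlier serves the same purpose for global functions.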

III. POWER EFFICIENT TECHNOLOGY DECOMPOSITION

It is difficult to come up with a decomposed network that leads to a minimum power-consumption implementation after power-efficient technology mapping since gate loading and mapping information are unknown at this stage. Nevertheless, we have observed that a decomposition scheme that minimizes the sum of the switching activities at the internal nodes of the network is a good starting point for power-efficient technology mapping. We illustrate this point with a simple example (Fig. 5(a)). A four-input AND gate can be decomposed into a tree of 2-input AND gates in two ways. These two decompositions have different total switching activities. Assuming n-type dynamic circuit and independent inputs, let P(a) = 0.7, P(b) = 0.3, P(c) = 0.3, and P(d) = 0.7. The total switching activities for configurations A and B are 2.317 and 2.464, respectively. Configuration A appears to be better than configuration B since the sum of the switching activities at its internal nodes is smaller. Furthermore, if we assume that the cell library has 2- and 3-input AND gates, the minimum power mapping with a power value of 2.107 is obtained from configuration A (Fig. 5(b)). In this example, decomposition with lower switching activity leads to mapping with lower power consumption.

Fig. 5. An example showing the effect of technology decomposition on total switching activity.
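The quoted totals can be reproduced with a short calculation. The sketch below assumes that the switching probability of each AND node equals the probability that its output is 1 (n-type dynamic logic with independent inputs) and that the totals include the four primary inputs. The tree shapes are an inference, since the figure is not reproduced here: a chain ((a·b)·c)·d for configuration A and a balanced tree (a·b)·(c·d) for configuration B are chosen because they yield the quoted numbers.

```python
def and_tree_activity(tree, p_one):
    """Return (P(output = 1), sum of switching probabilities over all nodes).

    A tree is either an input name (str) or a tuple of subtrees feeding one AND gate.
    """
    if isinstance(tree, str):
        return p_one[tree], p_one[tree]
    prob, total = 1.0, 0.0
    for child in tree:
        p, t = and_tree_activity(child, p_one)
        prob *= p          # independent inputs: AND output probability is the product
        total += t
    return prob, total + prob


p_one = {"a": 0.7, "b": 0.3, "c": 0.3, "d": 0.7}
config_a = ((("a", "b"), "c"), "d")     # assumed chain decomposition
config_b = (("a", "b"), ("c", "d"))     # assumed balanced decomposition
mapped = (("a", "b", "c"), "d")         # 3-input AND followed by 2-input AND

for name, tree in [("A", config_a), ("B", config_b), ("mapped", mapped)]:
    print(name, round(and_tree_activity(tree, p_one)[1], 3))
```

In the mapped solution the node a·b, with activity 0.21, is absorbed inside the 3-input gate, which accounts for the drop from 2.317 to 2.107.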

We denote the problem of generating a NAND-decomposed

network with minimum total switching activity as the MIN-

POWER decomposition. The performance-oriented version of

the above problem requires that the increase in the height of the decomposed network (compared to the undecomposed network) be bounded. We refer to this problem as the BOUNDED-HEIGHT MINPOWER decomposition.

Tree Decomposition

We describe algorithms for solving the MINPOWER decom-

position for a fanout-free logic function (i.e. a function that

has a tree realization).

The basic algorithm is similar to Huffman’s algorithm [10] for constructing a binary tree with minimum average weighted path length. We denote the leaves of a binary tree by $v_1, v_2, \ldots, v_n$, the “path length” from the root to $v_i$ by $l_i$, and the weight of leaf $v_i$ by $w_i$. Assuming that the root is at level zero (the highest level), leaf $v_i$ is at level $l_i$. Given a set of weights $w_i$, there is a simple $O(n \log n)$-time algorithm due to Huffman for constructing a binary tree such that the cost function $\sum_{i=1}^{n} w_i l_i$ is minimum.
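For reference, a minimal sketch of the classical Huffman construction with additive weights follows. It returns both the sum of the internal node weights and the leaf depths, so the identity between that sum and $\sum w_i l_i$ can be checked directly. The function name and the sample weights are illustrative; tracking leaf lists makes this version slower than the $O(n \log n)$ bound, which is acceptable for a demonstration.

```python
import heapq


def huffman_cost(weights):
    """Return (sum of internal node weights, depth of each leaf) for a Huffman tree."""
    # Heap entries: (weight, unique tie-breaker, leaves below this node).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    depth = [0] * len(weights)
    cost = 0
    counter = len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)    # two smallest weights
        w2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            depth[leaf] += 1                    # every leaf below moves one level down
        cost += w1 + w2                         # weight of the new internal node
        heapq.heappush(heap, (w1 + w2, counter, leaves1 + leaves2))
        counter += 1
    return cost, depth


weights = [1, 2, 3, 4]
cost, depth = huffman_cost(weights)
print(cost, depth, sum(w * d for w, d in zip(weights, depth)))
```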

In a more general setting, a weight combination function F(x, y) (which is any symmetric function chosen as a binary operator on the weight space U) is used to produce the weight W of internal nodes during tree construction. For each tree T, tree cost function $G(W_1, W_2, \ldots, W_{n-1})$ gives the cost.2 Parker [11] characterized a wide class of weight combination functions for which Huffman’s algorithm produces optimal trees under the corresponding tree cost functions. We give some definitions first and then state Parker’s main theorem.

Definition: A weight combination function F is quasi-linear if $F(x, y) = \phi^{-1}(\lambda\phi(x) + \lambda\phi(y))$ where $\lambda$ is a nonzero constant and $\phi$ is a real-valued invertible function on the weight space U. (Note F is symmetric, and conjugate under $\phi$ to the linear map $\lambda(x + y)$.)

Definition: A tree cost function $G: U^{n-1} \to R$ is Schur concave if

$$(x_i - x_j)\left(\frac{\partial G}{\partial x_i} - \frac{\partial G}{\partial x_j}\right) \le 0 \quad (25)$$

for all $x_i, x_j \in U$; $i, j \in 1, \ldots, n-1$.

2 In Huffman’s original paper, F(x, y) = x + y and $G = \sum_{i=1}^{n-1} W_i$. It is easy to show that $G = \sum_{i=1}^{n} w_i l_i$.

Theorem: If the weight combination function F is quasi-linear and the corresponding function $\phi$ is convex, positive (or concave, negative) and strictly monotone and $\lambda \ge 1$, then the Huffman tree will have the least cost when G is any Schur concave function of the internal node weights [11].

If F(x, y) does not satisfy the above conditions, then Huffman’s algorithm may not produce the optimal solution. We propose the following $O(n^2 \log n)$ greedy algorithm to solve the decomposition problem for general weight combination functions.

Algorithm 3.1: For every pair $w_i$ and $w_j$ of the n nonnegative weights $w_1, w_2, \ldots, w_n$, compute $F_{ij}(w_i, w_j)$ and store it in list L. Find the smallest $F_{ij}$, say $F_{12}$. Replace the two nodes by a single node having the weight $W_1 = F_{12}$ and two sons with weights $w_1$ and $w_2$. Eliminate all $F_{1k}(w_1, w_k)$ and $F_{2k}(w_2, w_k)$ from L. Compute $F_{1j}(W_1, w_j)$ and insert it into L. Do this recursively for the n - 1 weights $W_1, w_3, \ldots, w_n$. The final single node with weight $W_{n-1}$ is then the root of the binary tree.
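The greedy procedure can be sketched as follows. For brevity this version rescans all pairs at every step rather than maintaining the sorted list L, so it is slower than the stated bound; the function name and the use of multiplication as the example combination function are illustrative assumptions.

```python
import operator
from itertools import combinations


def greedy_decompose(weights, combine):
    """Repeatedly merge the pair of nodes whose combined weight is smallest.

    Returns the weights of the internal nodes in the order they are created;
    the last entry is the weight of the root.
    """
    nodes = list(weights)
    internal = []
    while len(nodes) > 1:
        # Pick the pair (i, j) minimizing F(w_i, w_j); rescanning replaces list L.
        i, j = min(combinations(range(len(nodes)), 2),
                   key=lambda pair: combine(nodes[pair[0]], nodes[pair[1]]))
        merged = combine(nodes[i], nodes[j])
        internal.append(merged)
        nodes = [w for k, w in enumerate(nodes) if k not in (i, j)] + [merged]
    return internal


# Illustrative use with a product combination function on four weights.
internal = greedy_decompose([0.7, 0.3, 0.3, 0.7], operator.mul)
print(internal, sum(internal))
```

Any symmetric function of two weights can be passed as `combine`; when a node needs to carry several weights, the same loop applies with tuples as node weights and a key that maps the combined tuple to a scalar cost.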

To solve the MINPOWER tree decomposition problem, we must use appropriate weight combination and tree cost functions. We thus distinguish between two cases as follows.

Dynamic Circuits: For n-type circuits, dynamic gate outputs are precharged to 1 and switching occurs when the output changes to 0 during the evaluation phase. For a 2-input AND gate composed of a 2-input NAND gate and a static inverter, the inverter output is 0 during the precharge period and the switching probability (without input signal correlations) is given by

$$W_o = w_{i_1} w_{i_2} \quad (26)$$

where the $W_o$ and $w_i$ values are probabilities of the output and inputs assuming value 1.

For p-type circuits, dynamic gate outputs are predischarged to 0 and transition occurs when the output evaluates to 1. The corresponding formula for the switching probability for a 2-input AND gate composed of a 2-input NAND gate and a static inverter is then given by

$$W_{\bar{o}} = 1 - (1 - w_{\bar{i}_1})(1 - w_{\bar{i}_2}) \quad (27)$$

where the $W_{\bar{o}}$ and $w_{\bar{i}}$ values are probabilities of the output and inputs assuming value 0.



The weight combination functions $F(w_{i_1}, w_{i_2}) = W_o$ or $F(w_{\bar{i}_1}, w_{\bar{i}_2}) = W_{\bar{o}}$ are used during the AND decomposition. The corresponding tree cost function G is given by

$$G = \sum_{i=1}^{n-1} W_i \quad (28)$$

Lemma: $W_o$ and $W_{\bar{o}}$, given in (26) and (27) above, are quasi-linear functions.

Proof: For (26), since $w_i$ is within the range of [0, 1] we can take $\phi(x) = -\log(x)$, which is a convex, positive, and decreasing function and $\lambda = 1$. This shows that $W_o$ is quasi-linear. Similarly, for (27), since $w_{\bar{i}}$ is within the range of [0, 1] we can take $\phi(x) = -\log(1 - x)$ which is a convex, positive and increasing function and $\lambda = 1$. This shows that $W_{\bar{o}}$ is quasi-linear. □

Lemma: G given in (28) above is Schur concave.

Proof: Since $(W_i - W_j)\left(\frac{\partial G}{\partial W_i} - \frac{\partial G}{\partial W_j}\right) = 0$, G is Schur concave. □

Theorem: MINPOWER tree decomposition for dynamic CMOS circuits with uncorrelated input signals can be solved optimally by Huffman's algorithm using the weight combination functions (26) and (27).

Proof: Follows from Lemma 3.2 and 3.3. □

If the input signals to the AND gate are correlated, (26) and (27) cannot be used as the weight combination function. The switching probabilities for n-type and p-type circuits are then given by

$$W_o = w_{i_1} w_{i_2 \mid i_1} \quad (29)$$

$$W_{\bar{o}} = 1 - (1 - w_{\bar{i}_1})(1 - w_{\bar{i}_2 \mid i_1}) \quad (30)$$

respectively, where $w_{i_2 \mid i_1}$ ($w_{\bar{i}_2 \mid i_1}$) is the conditional probability of $i_2$ ($\bar{i}_2$) given $i_1$.

Lemma: $W_o$ and $W_{\bar{o}}$ given in (29) and (30) are not quasi-linear functions.

Proof: We present the proof for (29). Proof for the other case is similar.

One criterion for a function F to be quasi-linear is increasingness, i.e., F(u, x) < (>) F(u, y) if x < (>) y [11]. However, in (29), we could have $w_{i_2} \le w_{i_3}$, yet $w_{i_2 \mid i_1} > w_{i_3 \mid i_1}$, thus $F(w_{i_1}, w_{i_2}) > F(w_{i_1}, w_{i_3})$. $W_o$ does not satisfy the increasingness property and hence is not quasi-linear.

Since the $W_o$'s given in (29) and (30) are not quasi-linear, we use Algorithm 3.1 to solve the decomposition problem.
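For the uncorrelated n-type dynamic case, where the weight of an AND node is the product of its input weights, the optimality of Huffman's procedure can be spot-checked on small instances. The sketch below compares the Huffman result with an exhaustive search over all merge orders; the helper names and the sample weights are illustrative, and the exhaustive search is only practical for a handful of inputs.

```python
import heapq
import operator
from itertools import combinations


def huffman_internal_weights(weights, combine):
    """Huffman's procedure with a general weight combination function."""
    heap = list(weights)
    heapq.heapify(heap)
    internal = []
    while len(heap) > 1:
        x = heapq.heappop(heap)      # two smallest current weights
        y = heapq.heappop(heap)
        w = combine(x, y)
        internal.append(w)
        heapq.heappush(heap, w)
    return internal


def best_cost_brute_force(weights, combine):
    """Minimum sum of internal node weights over every possible merge order."""
    if len(weights) == 1:
        return 0.0
    best = float("inf")
    for i, j in combinations(range(len(weights)), 2):
        merged = combine(weights[i], weights[j])
        rest = [w for k, w in enumerate(weights) if k not in (i, j)]
        best = min(best, merged + best_cost_brute_force(rest + [merged], combine))
    return best


probs = [0.7, 0.3, 0.3, 0.7]         # probabilities of each input being 1
huffman = sum(huffman_internal_weights(probs, operator.mul))
print(huffman, best_cost_brute_force(probs, operator.mul))
```

For the p-type case the same code applies with weights taken as probabilities of 0 and `combine = lambda x, y: 1 - (1 - x) * (1 - y)`.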

Static Circuits: For static CMOS circuits, we need to minimize the sum of the probabilities for output switching from 0 to 1 and 1 to 0. Thus, the weight combination function $W_o$ for a 2-input AND gate is equal to $W_{o(0\to1)} + W_{o(1\to0)}$, where

$$W_{o(0\to1)} = w_{i_1(0\to1)} w_{i_2(0\to1)} + w_{i_1(1\to1)} w_{i_2(0\to1)} + w_{i_1(0\to1)} w_{i_2(1\to1)} \quad (31)$$

$$W_{o(1\to0)} = w_{i_1(1\to1)} w_{i_2(1\to0)} + w_{i_1(1\to0)} w_{i_2(1\to1)} + w_{i_1(1\to0)} w_{i_2(1\to0)} \quad (32)$$

Indeed, a chain-like tree decomposition will be obtained.

If input signals are correlated, conditional probabilities between the input signal transitions have to be used in order to calculate the output transition probability. $W_{o(0\to1)}$ and $W_{o(1\to0)}$ are then given by

$$W_{o(0\to1)} = w_{i_1(0\to1)} w_{i_2(0\to1) \mid i_1(0\to1)} + w_{i_1(1\to1)} w_{i_2(0\to1) \mid i_1(1\to1)} + w_{i_1(0\to1)} w_{i_2(1\to1) \mid i_1(0\to1)} \quad (33)$$

$$W_{o(1\to0)} = w_{i_1(1\to1)} w_{i_2(1\to0) \mid i_1(1\to1)} + w_{i_1(1\to0)} w_{i_2(1\to1) \mid i_1(1\to0)} + w_{i_1(1\to0)} w_{i_2(1\to0) \mid i_1(1\to0)} \quad (34)$$

Note that for static circuits, every internal node is assigned multiple weights (i.e., transition probabilities such as $w_{i(0\to1)}$ and $w_{i(1\to0)}$). The weight combination uses all these weights through (31)-(34). This general class of problems cannot be solved using Huffman's algorithm, which applies when each internal node is given a single weight. Therefore, we resort to Algorithm 3.1.

Under the temporal independence assumption for the gate inputs, (31) and (32) can be replaced by $W_{o(0\to1)} + W_{o(1\to0)} = 2 w_{i_1} w_{i_2} (1 - w_{i_1} w_{i_2})$ while (33) and (34) can be replaced by $W_{o(0\to1)} + W_{o(1\to0)} = 2 w_{i_1} w_{i_2 \mid i_1} (1 - w_{i_1} w_{i_2 \mid i_1})$. The resulting weight combination functions $F(w_{i_1}, w_{i_2})$ are not quasi-linear, and hence Algorithm 3.1 must be used.
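The reduction to the closed form under temporal independence can be verified numerically. The sketch below enumerates the input transition pairs that make a 2-input AND output rise (both inputs end at 1, not both started at 1) or fall (both inputs start at 1, not both end at 1), assumes each transition probability factors as the product of the before and after signal probabilities, and compares the sum with $2 w_{i_1} w_{i_2} (1 - w_{i_1} w_{i_2})$. The function names and sample probabilities are illustrative.

```python
def transition(w, before, after):
    """P(signal goes before -> after), assuming temporal independence."""
    p_before = w if before else 1.0 - w
    p_after = w if after else 1.0 - w
    return p_before * p_after


def and_switching_static(w1, w2):
    """P(output 0->1) + P(output 1->0) for a 2-input AND with independent inputs."""
    rise = (transition(w1, 0, 1) * transition(w2, 0, 1)
            + transition(w1, 1, 1) * transition(w2, 0, 1)
            + transition(w1, 0, 1) * transition(w2, 1, 1))
    fall = (transition(w1, 1, 1) * transition(w2, 1, 0)
            + transition(w1, 1, 0) * transition(w2, 1, 1)
            + transition(w1, 1, 0) * transition(w2, 1, 0))
    return rise + fall


def closed_form(w1, w2):
    p = w1 * w2                      # probability that the AND output is 1
    return 2.0 * p * (1.0 - p)


print(and_switching_static(0.7, 0.3), closed_form(0.7, 0.3))
```

The closed form peaks when the output probability is 0.5, so for static circuits the greedy merge favors nodes whose output probability is far from one half rather than simply small.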

Bounded-Height Tree Decomposition

Here, the objective is to construct a MINPOWER binary tree for a given list of weights (switching probabilities) with the restriction that the height of the tree (defined as $\max_i l_i$) does not exceed a given integer L. The best known algorithm for solving the BOUNDED-HEIGHT MINSUM problem is an O(nL) algorithm due to Larmore and Hirschberg [13]. Their approach (based on the PACKAGE-MERGE algorithm) transforms the BOUNDED-HEIGHT MINSUM tree decomposition problem to an instance of the COIN COLLECTOR'S problem.3 The PACKAGE step in this algorithm uses Huffman's algorithm to merge two items at level i to form a new item at level i + 1. The MERGE step merges the newly formed items with the original items at each level. Details of the algorithm can be found in [13].

For the general weight combination functions, the Larmore-Hirschberg algorithm has to be modified as follows: In the PACKAGE step, replace Huffman's algorithm by Algorithm 3.1; the MERGE step is unchanged. Using the modified Larmore-Hirschberg algorithm, the BOUNDED-HEIGHT MINPOWER tree decomposition can be solved heuristically for the general weight combination functions.

IV. POWER EFFICIENT TECHNOLOGY MAPPING

The problem of technology mapping for low power can then be stated as follows: Given a Boolean network representing a combinational logic circuit optimized by technology independent synthesis procedures and a target library, we bind nodes in the network to gates in the library such that average power consumption of the final implementation is minimized and timing constraints are satisfied. A successful and efficient solution to the minimum area mapping problem was suggested in [12] and implemented in programs such as DAGON and MIS. The idea is to reduce technology mapping to DAG covering and to approximate DAG covering by a sequence of tree coverings that can be performed optimally using dynamic programming.

3 An instance (I, S) of the COIN COLLECTOR'S problem is defined as follows: given a set I of items, each of which has a width and a weight, find a subset of I whose widths sum to S and which has minimum total weight.

The traditional goal of technology mapping is to minimize the
total area (or delay) of the mapped circuit. In [5], a near-optimal
solution is presented for finding the minimum area
solution under delay constraints. That approach consists of
two steps. In the first step, delay functions (which capture
arrival time-gate area trade-offs) at all nodes in the network
are generated. In the second step, a mapping solution based
on the computed delay functions and the required times at
the primary outputs is found. For a NAND-decomposed tree,
subject to load calculation errors, this two-step approach finds
the minimum area mapping satisfying all delay constraints if
such a solution exists.

Our low-power technology mapper follows a procedure

similar to the above, with the difference that the objective

is to minimize the sum over all gates of the average power

consumed in the gate. The approach also consists of two steps:

First, a postorder traversal is used to determine a set of possible
arrival times and accumulated power consumptions at each
node of the network. Once the user specifies a single required
time, a second (preorder) traversal starting from the primary
outputs is performed to determine the mapping solution that
minimizes the average power subject to the required time
constraints.

Terminology

Consider a match $g$ at node $n$ of a NAND-decomposed
tree. The inputs to node $n$ consist of nodes $n_i$ that fan out
to node $n$ (that is, $n = \overline{n_i} + \overline{n_j}$ if $n$ has two inputs
or $n = \overline{n_i}$ if $n$ has a single input). The nodes that are
covered by match $g$ are denoted by merged($n$, $g$). The nodes
that are not in merged($n$, $g$) but fan in to merged($n$, $g$) are
denoted by inputs($n$, $g$). The mapped-parent($n_i$) is some
node $n$ for which there exists a matching gate $g$ such that
$n_i \in$ inputs($n$, $g$). Note that given node $n$ and gate $g$ matching
at $n$, inputs($n$, $g$) are uniquely determined. However, $n_i$ may
have many distinct mapped parents.

With each node in the network we store a power-delay

curve. A point on the curve represents the arrival time at

the output of the node and the average power consumed in

its mapped transitive fanin cone (but excluding the power

consumed at the load driven by the node, which has yet to be
decided by a later mapping). In addition to the power and
delay values, the matching gate and input bindings for the
match are also stored with each point on the curve. Points on
the curve represent various mapping solutions with different
trade-offs between average power and speed. A point $(t_1, p_1)$
is inferior if there exists another point $(t_2, p_2)$ on the curve
such that either $p_1 > p_2$ and $t_1 \ge t_2$, or $p_1 \ge p_2$ and
$t_1 > t_2$. All inferior points are dropped from the power-delay
curve.
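The inferiority test can be sketched in a few lines of Python. This is only an illustration: the function name `prune_inferior` and the representation of a curve as a list of `(time, power)` tuples are assumptions, and the stored gate and input bindings are omitted.

```python
def prune_inferior(points):
    """Drop inferior (arrival_time, power) points from a power-delay curve.

    After sorting by time and then power, a point survives only if its
    power is strictly lower than that of every point that is at least
    as fast, which is exactly the complement of the inferiority rule.
    """
    kept = []
    best_power = float("inf")
    for t, p in sorted(set(points)):
        if p < best_power:
            kept.append((t, p))
            best_power = p
    return kept


curve = prune_inferior([(3, 5), (2, 7), (4, 5), (3, 6), (5, 1)])
```

The surviving points are sorted by increasing arrival time and strictly decreasing power, which is convenient when a required time is later used to pick a point.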

Arrival Time and Power Cost Calculation

We have adopted the pin-dependent SIS library delay model
for the calculation of the arrival time and power consumption.
Suppose that gate $g$ has matched at node $n$; then the output
arrival time at $n$ is given by

$$A(n, g, C_n) = \max_{i \in \mathrm{inputs}(n, g)} \left( \tau_{i,g} + R_{i,g} C_n + A(n_i, g_i, C_{n_i}) \right) \qquad (35)$$

where $\tau_{i,g}$ is the intrinsic gate delay from input $i$ to the output of
$g$, $R_{i,g}$ is the drive resistance of $g$ corresponding to a signal
transition at input $i$, $C_n$ is the load capacitance seen at $n$,
$A(n_i, g_i, C_{n_i})$ is the arrival time at input $i$ corresponding to
load $C_{n_i}$ seen at that input, and $g_i$ is the best match found
at input $i$.
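The arrival time calculation can be sketched as follows, assuming the usual form of a pin-dependent model in which the output arrival time is the latest of the per-pin terms (intrinsic delay, plus drive resistance times load, plus the input arrival time). The function and parameter names are hypothetical.

```python
def arrival_time(intrinsic, drive_res, load_cap, input_arrivals):
    """Output arrival time of a gate under a pin-dependent delay model.

    intrinsic[i]      : intrinsic delay from pin i to the output
    drive_res[i]      : drive resistance for a transition at pin i
    load_cap          : load capacitance seen at the gate output
    input_arrivals[i] : arrival time of the signal connected to pin i
    """
    return max(
        tau + res * load_cap + arr
        for tau, res, arr in zip(intrinsic, drive_res, input_arrivals)
    )


t_out = arrival_time([1.0, 1.2], [0.5, 0.4], 2.0, [3.0, 2.5])
```

Because the load term depends on the output load, the same match gives different arrival times for different fanout choices, which is the source of the unknown load problem mentioned in the text.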

Under the simple power consumption model, the power cost
for gate $g$ matching at node $n$ is given by

(36)

where $E_{n_i}$ is the expected switching activity at node $n_i$, and
$P_{avg}(n_i, g_i)$ is the accumulated power cost at $n_i$ assuming a
matching gate $g_i$.

The input to the technology mapper is a 2-input NAND-decomposed
network of the optimized network from the
technology-independent phase. The switching activity $E_n$ of
each node $n$ of the decomposed network is calculated prior
to the mapping.
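A minimal sketch of this precomputation for a NAND-decomposed tree is given below. It assumes a zero-delay model, spatially and temporally independent signals, and the common estimate $E = 2p(1 - p)$ for the expected number of transitions per cycle of a signal with probability $p$; these assumptions and all function names are illustrative rather than taken from this section.

```python
def nand2_probability(p_a, p_b):
    """Signal probability at the output of a 2-input NAND with independent inputs."""
    return 1.0 - p_a * p_b


def inverter_probability(p_a):
    """Signal probability at the output of an inverter."""
    return 1.0 - p_a


def switching_activity(p):
    """Expected transitions per cycle for a temporally independent signal."""
    return 2.0 * p * (1.0 - p)


# Two primary inputs with probability 0.5 feeding a NAND, then an inverter.
p_nand = nand2_probability(0.5, 0.5)
p_inv = inverter_probability(p_nand)
activities = [switching_activity(p_nand), switching_activity(p_inv)]
```

Since these values depend only on the node functions, they can be computed once on the decomposed network and reused for every candidate match.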

Under the extended power consumption model, the power
cost is given by

(37)

where the power term for internal node $j$ is given by (7).

The typical gate library model contains information such as
input capacitance for each input pin, pin-dependent intrinsic
gate delay and the pin-dependent output drive. To support the
calculation of the internal power consumption, we add to this
data the following information: 1) the source/drain capacitance
for each internal node and 2) the charging and discharging
functions in terms of the gate inputs for each internal node.

When calculating the power cost at node $n$, the power
contribution from the output load driven by $n$ is not included.
We have the following lemma, however, for the power cost
calculation:

Lemma 4.1: The power cost of a match $g$ at node $n$ given by
(37) is the exact power cost under a zero delay model.


Proof: Under a zero delay model, the switching activity at
node $n$ depends only on the global function of the node in
terms of the circuit primary inputs and not on its particular
implementation. Hence, $E_n$ is independent of the gate matching
at $n$. Furthermore, note that since the power consumption due
to the output load of $n$ is calculated only when the fanout gate of
$n$ is known, (37) is not subject to the unknown load problem.

The average power contribution of $n$'s output load will
be included at mapped-parent($n$). When the mapping reaches
a primary output, every point on the power-delay curve has a
power value equal to the total average power consumption of
the mapped tree minus the power consumption at the primary
output load. The power consumption at the primary output
load depends on $E_n$ at the output and the load capacitance
connected to it, which are both independent of the mapping
configuration. Hence, it only causes a fixed shift of the curve
along the power axis; it does not affect the selection of the
optimal point from the power-delay curve.

Tree Mapping

In this section, we focus on tree mapping. Later, we shall
describe the extension to DAG mapping. In particular, we
describe two tree-traversal operations applied to a NAND-decomposed
tree in order to obtain a technology mapping
solution that minimizes the average power consumption while
satisfying the timing constraints.

Tree Traversals: On the first traversal, we begin at the
leaf nodes of the NAND-decomposed tree. Since each node $n$
possesses a set of possible arrival time-average power points
that are reflected in its power-delay function, the power-delay
function at any mapped-parent($n$) must also reflect these
possible arrival time-average power trade-offs. A postorder
traversal of the NAND-decomposed tree is performed, where
for each node $n$ and for each gate $g$ matching at $n$, a new
power-delay function is produced by appropriately merging
the power-delay functions at the inputs($n$, $g$). Merging must
occur in the common region in order to ensure that the
resulting function reflects feasible matches at the inputs($n$, $g$).
The power-delay functions for successive gates $g$ matching at
$n$ are then merged by applying a lower-bound merge operation
on the corresponding power-delay functions (see [5] for details
of these operations). At a given node $n$, the resulting power-delay
function will describe the arrival time-average power
trade-offs in propagating a signal from the tree inputs to the
output of $n$.

The second traversal begins when the mapping reaches
the root (primary output). The user is allowed to select the
arrival time-average power trade-off that is most suitable
for the application. Given the required time $t$ at the root of
the tree, a suitable $(t, p)$ point on the power-delay function
for the root node is chosen. The gate $g$ matching at the
root that corresponds to this point and inputs(root, $g$) are
thus identified. The required times $t_i$ at inputs(root, $g$)
are computed from $t$ and $g$. The preorder traversal resumes
at inputs(root, $g$), where $t_i$ is the constraining factor and a
matching gate $g_i$ with minimum power consumption satisfying
$t_i$ is sought.
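The two traversals can be illustrated on a single node with the Python sketch below. It is a deliberately simplified stand-in: instead of the piecewise merging of [5], it enumerates every combination of input points, and it assumes that all candidate gates cover the same inputs. The dictionary fields (`tau`, `res`, `power`) and all function names are hypothetical.

```python
from itertools import product


def prune(points):
    """Keep non-inferior points; a point is (time, power, gate_name, input_points)."""
    kept, best_power = [], float("inf")
    for pt in sorted(set(points), key=lambda q: (q[0], q[1])):
        if pt[1] < best_power:
            kept.append(pt)
            best_power = pt[1]
    return kept


def match_curve(gate, input_curves, load_cap):
    """Postorder step for one match: combine the curves stored at its inputs."""
    points = []
    for combo in product(*input_curves):
        t = max(tau + res * load_cap + pt[0]
                for tau, res, pt in zip(gate["tau"], gate["res"], combo))
        p = gate["power"] + sum(pt[1] for pt in combo)
        points.append((t, p, gate["name"], combo))
    return prune(points)


def node_curve(gates, input_curves, load_cap):
    """Lower-bound merge over all gates matching at the node."""
    merged = []
    for gate in gates:
        merged.extend(match_curve(gate, input_curves, load_cap))
    return prune(merged)


def select(curve, required_time):
    """Preorder step: minimum-power point that meets the required time."""
    feasible = [pt for pt in curve if pt[0] <= required_time]
    return min(feasible, key=lambda pt: pt[1]) if feasible else None


leaf = [(0.0, 0.0, "PI", ())]
fast = {"name": "nand2_fast", "tau": [1.0, 1.0], "res": [0.5, 0.5], "power": 2.0}
slow = {"name": "nand2_lp", "tau": [2.0, 2.0], "res": [0.5, 0.5], "power": 1.0}
root_curve = node_curve([fast, slow], [leaf, leaf], load_cap=1.0)
choice = select(root_curve, required_time=2.0)
```

Each stored point keeps the input points it was built from, so the preorder pass can recover the bindings at the inputs once a root point has been chosen.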

Optimality of the Tree Mapping Algorithm: The following
theorem can be stated.

Theorem: Under a zero delay model, the tree mapping
algorithm finds the minimum power consumption solution for
given delay constraints (subject to the error due to unknown
loads during the arrival time calculation).

Proof: From Lemma 4.1, under a zero delay model, the
power consumption calculation is exact and not subject to the
unknown load problem and hence is similar in nature to the
area calculation. The proof of optimality of the tree mapping
algorithm is thus similar to that of the area-delay mapping
algorithm.

DAG Mapping

Most real circuits are not trees, but general DAGs.
The problem of mapping a DAG, even for the unit fanout
model, is NP-hard [2]. Therefore, we resort to heuristics. During
the power-delay curve computation step, nodes are visited in
postorder. For each node, we compute the power-delay curve
as in the case of trees. However, if the input for a candidate
match at the node comes from a multiple fanout node,
we divide the average power contribution of that input by
the fanout count of the input node. By reducing the average
power contribution, we favor a solution in which multiple
fanout nodes are preserved after mapping, which reduces logic
duplication and improves the final mapped average power.
This heuristic, which permits tree boundary crossing only for
nodes with relatively few fanouts, was also adopted by the
MIS mapper [7]. During the gate selection step, if we come to
a node that has already been mapped, we check whether the mapped
solution at the node satisfies the timing requirement. If so,
we keep the mapping; otherwise, we replace it with another
solution from the power-delay curve that satisfies the current
timing requirement and has minimum average power.
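Both DAG heuristics are small enough to sketch directly. The helper names below are hypothetical, and a curve point is assumed to be a `(time, power)` tuple.

```python
def shared_input_power(accumulated_power, fanout_count):
    """Power charged to one candidate match for an input that has several fanouts."""
    return accumulated_power / max(1, fanout_count)


def reselect(current, curve, required_time):
    """Gate selection at a node that may already be mapped.

    Keep the existing solution if it meets the new required time;
    otherwise take the minimum-power point of the curve that does.
    """
    if current is not None and current[0] <= required_time:
        return current
    feasible = [pt for pt in curve if pt[0] <= required_time]
    return min(feasible, key=lambda pt: pt[1]) if feasible else None


curve = [(2.0, 9.0), (3.0, 6.0), (5.0, 4.0)]
kept = reselect((5.0, 4.0), curve, required_time=6.0)
replaced = reselect((5.0, 4.0), curve, required_time=3.5)
share = shared_input_power(12.0, 3)
```

Dividing by the fanout count makes a shared node look cheaper to each of its parents, which biases the mapper toward keeping that node as a gate output instead of duplicating its logic.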

V. EXPERIMENTAL RESULTS

Table I shows the ratio of the internal power consumption

(due to the parasitic source/drain capacitances) to the power

consumption due to the external gate capacitances for a subset

of ISCAS-89 and MCNC-91 benchmarks. It is seen that if

internal power consumption is disregarded, then the power

consumption will be underestimated by an average of 16.9%.
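The percentage in Table I is the internal power expressed relative to the gate-capacitance power. The short sketch below reproduces the figure for the C432 row (18.0 internal, 92.0 gate) and, as an additional observation that is not stated in the text, converts the 16.9% average ratio into a share of the combined power. The function names are hypothetical.

```python
def internal_to_gate_ratio(internal_power, gate_power):
    """Internal power as a percentage of the gate-capacitance power."""
    return 100.0 * internal_power / gate_power


def share_of_total(ratio_percent):
    """Internal power as a percentage of internal plus gate power."""
    return 100.0 * ratio_percent / (100.0 + ratio_percent)


c432_ratio = internal_to_gate_ratio(18.0, 92.0)   # about 19.6
avg_share = share_of_total(16.9)                   # about 14.5
```

In other words, a model that ignores internal nodes misses roughly one seventh of the combined gate and internal power on these benchmarks.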

To evaluate the effectiveness of the technology decomposition
and mapping for low power, we applied our algorithms
to the above benchmarks. Static CMOS circuits were used in
the experiments and all primary inputs were assumed to be
spatially and temporally independent. The delay was calculated
using the augmented pin-dependent SIS library delay model as
described in Section IV. We used an industrial library with
44 gates.

Table II shows the total switching activity of the decomposed
networks for different technology decomposition
methods. It is shown that the low power technology decomposition
reduces the total switching activity in the networks
by an average of 5% over the conventional balanced-tree
decomposition method.



TABLE I
INTERNAL POWER CONSUMPTION VERSUS EXTERNAL POWER CONSUMPTION.
I: POWER CONSUMPTION DUE TO INTERNAL CAPACITANCE,
II: POWER CONSUMPTION DUE TO GATE CAPACITANCE

circuit        I       II   (I/II)*100
C1908       49.0    207.0      23.6
C432        18.0     92.0      19.5
alu2        31.0    132.0      23.4
apex7       17.0    107.0      15.8
cordic       5.3     34.5      15.3
ex2         18.0    134.0      13.4
pair       128.0    760.0      16.8
parity       7.2     34.7      20.7
s349        12.3     59.4      20.7
s386         4.8     41.8      11.4
s400        12.2     75.8      16.0
s420        12.4     72.0      17.2
s444        12.4     76.4      16.2
s510        21.5    120.3      17.8
s641        10.8     79.0      13.6
s713        10.7     80.3      13.3
s832        18.5    122.7      15.0
x1          20.5    147.3      13.9
x3          58.1    353.0      16.4
avg.                           16.9

TABLE II
THE TOTAL SWITCHING ACTIVITY IN THE NETWORK FOR DIFFERENT DECOMPOSITION SCHEMES.
I: BALANCED DECOMPOSITION, II: POWER EFFICIENT DECOMPOSITION

circuit        I       II   % reduction
C1908        672      657       2.23
C432         335      318       5.07
alu2         422      389       7.82
apex7        394      385       2.28
cordic       119      110       7.56
ex2          473      457       3.38
pair        2628     2491       5.21
parity       110      110       0
s349         207      206       0.48
s386         125      109      12.80
s400         247      239       3.24
s420         275      258       6.18
s444         252      245       2.78
s510         410      392       4.39
s641         300      282       6.00
s713         309      284       8.09
s832         385      352       8.57
x1           499      453       9.22
x3          1224     1192       2.61
avg.                            5.15

TABLE III
TECHNOLOGY DECOMPOSITION AND MAPPING RESULTS.
A: ad-map WITH BALANCED DECOMPOSITION; B: pd-map WITH BALANCED DECOMPOSITION

                     A                        B
circuit     area   delay   power     area   delay   power
C1908       6091     6.1     282     6945     5.4     256
C432        2278    30.7     131     2429    30.1     110
alu2        4266    26.0     200     5246    29.0     163
apex7       2734    11.7     142     2990    11.6     124
cordic       820     6.4      47      921     6.6      39
ex2         3822    11.3     176     4371     9.8     152
pair       18380    28.2    1001    21583    29.8     888
parity       853     7.0      52      795     6.1      41
s349        1637    12.6      83     1753    13.1      71
s386        1579    11.4      64     2031    11.7      46
s400        1829    10.8      98     2056    11.3      88
s420        1785    15.2      98     2139    18.4      84
s444        1819    11.0     101     1993    12.2      88
s510        2869    20.4     164     3376    15.8     141
s641        2137    21.1     108     2297    20.2      89
s713        2119    21.5     108     2264    20.0      91
s832        3450    14.3     178     3941    11.0     141
x1          3683     9.4     204     4025     8.1     167
x3          9200    13.4     485    10509    11.5     411

TABLE IV
TECHNOLOGY DECOMPOSITION AND MAPPING RESULTS.
C: pd-map WITH minpower-tdecomp; D: pd-map WITH bhminpower-tdecomp

                     C                        D
circuit     area   delay   power     area   delay   power
C1908       7226     6.8     249     7058     5.6     254
C432        2506    30.3     106     2456    29.1     110
alu2        5451    29.0     161     5375    28.5     159
apex7       3123    11.9     123     3090    11.5     124
cordic       866     8.3      37      879     7.2      38
ex2         4522    10.2     150     4426    12.2     150
pair       21803    29.6     867    21896    29.7     878
parity       795     6.1      41      795     6.1      41
s349        1754    13.1      71     1765    13.1      72
s386        2114    11.1      45     2036    10.5      45
s400        2112    11.1      86     2034    11.4      87
s420        2290    19.4      81     2243    18.6      81
s444        2043    12.1      88     1999    12.1      88
s510        3442    15.2     138     3344    15.5     138
s641        2512    19.7      86     2372    20.1       ?
s713        2513    21.8      87     2337    20.1       ?
s832        4223    11.3     134     4024    10.6     138
x1          4293     8.1     159     4125     8.9     161
x3         10691    14.2     406    10643    12.7     409

Tables III and IV contain our experimental results using different
technology decomposition and mapping combinations.
All methods have the same starting point, that is, circuits
optimized by the SIS rugged script [18]. Method A uses the
area-delay mapping (ad-map) algorithm of [5] and methods B
to D use the power-delay mapping (pd-map) algorithm. Methods
A and B take in a NAND-decomposed network generated
by the conventional balanced-tree decomposition algorithm.
Method C uses the MINPOWER technology decomposition
(minpower-tdecomp), while method D uses the BOUNDED-HEIGHT
MINPOWER decomposition (bhminpower-tdecomp).

To see the impact of pd-map on the average power
consumption, we compare the results of methods A and B. It is
seen that with pd-map, the power consumption is improved
by an average of 15.9% over ad-map. The active cell area is
increased by an average of 12.5% while the circuit delay is
improved by 2.8%.

To illustrate the impact of minpower-tdecomp on
the average power consumption, we compare the results of
methods B and C. The power consumption is improved by
an average of 2.5% at the expense of a 3.6% increase in area and
a 3.4% degradation in performance. From B and D, we see that
bhminpower-tdecomp improves the power consumption by
an average of 1.2% over the conventional decomposition at
the expense of a 1.2% increase in area and a 1.8% degradation
in performance (note that bhminpower-tdecomp improves
performance by an average of 1.3% over minpower-tdecomp).
The best overall result is achieved when pd-map is applied
after the network is decomposed using minpower-tdecomp.
The average power consumption is reduced by an average of
17.9% at the expense of a 16.5% increase in area while the
average circuit delay is almost unchanged.

Comparing Tables II and IV, we observe that although the
reduction in power consumption after mapping is not directly
proportional to the reduction in the switching activities of
the network before mapping, networks with lower switching
activity often result in a mapped circuit with lower power
consumption. We believe the small gain of minpower-tdecomp
is due to the fact that most nodes in the optimized network
are relatively simple because of the fast extract and quick
decomposition operations performed prior to the technology
decomposition step. Therefore, minpower-tdecomp does
not have much freedom in improving the power efficiency
through NAND decomposition.

It is noteworthy that in most cases, in order to reduce the
power consumption, circuit area is increased. The explanation
is as follows. To minimize power consumption, pd-map hides
the high switching nodes inside the complex gates and/or
reduces the load capacitance of these nodes. In general, these
two operations do not produce a minimum area mapping.

To see how power consumption is actually reduced by
pd-map, we selected the s832 circuit and plotted the number
of nets and the average loading on nets versus the switching
rate. Figs. 6 and 7 summarize the results. From Fig. 6, we
see that pd-map tends to reduce the number of high switching
activity nets at the expense of increasing the number of low
switching activity nets. From Fig. 7, we see that for the
remaining high switching activity nets, pd-map tries to reduce
the average loading on nets. By doing so, pd-map minimizes
the total weighted switching activities and hence the total
power consumption in the circuit.

VI. DISCUSSIONS AND EXTENSIONS

We assumed a zero-delay model (which does not account
for power consumption due to glitches) during the technology
mapping phase. This shortcoming can be overcome by using a
real delay model (such as the SIS pin-dependent library delay
model) when calculating the expected number of transitions.
Using a real delay model, however, has several drawbacks.
The first is the huge computational effort needed to calculate the
possible glitches for each node. In [9], symbolic simulation
is used to produce a set of Boolean functions that represent
conditions for switching at each gate in the circuit at a
specific time instance. This procedure is exact but
requires exponential computation time and storage space. In
[20], a faster method of estimating the power consumption
including glitches is described using the notion of transition
probability. The transition probabilities are propagated from
the primary inputs to the circuit outputs using a linear time
algorithm. Although the latter method significantly reduces the
computation time and space requirement, it is still costly.

Fig. 6. Number of nets versus switching rate for s832 using ad-map versus pd-map.

Fig. 7. Average load per net versus switching rate for s832 using ad-map versus pd-map.

The second drawback is that under a real delay model, the
dynamic programming-based tree-mapping algorithm is not
guaranteed to find an optimum solution, even for a tree. The
mapping algorithm assumes that the current best solution is
derived from the best solutions stored at the fanin nodes of
the matching gate. This is true for power estimation under a
zero delay model, but not for that under a real delay model.
Consider Fig. 8 as an example. Let $S_1$ and $S_2$ be two mapping
candidates at $n$ with the same delay. Let $n$ have a larger
$E_n$(switching) value for $S_1$. It appears that $S_2$ is superior
to $S_1$, and thus $S_1$ is dropped from the power-delay curve
(Fig. 8(a)). However, when doing mapping at $m$, the mapped
parent of $n$, due to the difference in the transition waveform
timing for $S_1$ and $S_2$, the best solution at $m$ may come
from $S_1$ instead of $S_2$ (Fig. 8(b)). So, if $S_1$ is dropped from
the power-delay curve, the optimal solution may not be found.
The reason is that glitches depend on the relative timing of
various events on the transition waveforms of the gate inputs.
It is very expensive to store all partial solutions, so we have to
drop the locally inferior solutions. The dynamic programming
approach therefore acts only as a heuristic (even for trees).

Fig. 8. An example to illustrate complications due to a real delay model.

Another assumption we made was that primary inputs
are spatially uncorrelated. If this assumption is relaxed, we
must use the conditional probabilities to calculate the signal
probability at the outputs of intermediate nodes. It is, however,
too costly to consider conditional probabilities among all
subsets of primary inputs. Instead, we can consider only the
pairwise conditional probabilities and use the correlation coefficient
method proposed by Ercolani et al. [8] to approximate
the conditional probabilities among other subsets of primary
inputs.

Several extensions can be made to pd-map to improve
its effectiveness and versatility. First, pin permutation can be
considered during the gate matching procedure. In general,
library gates have pins that are functionally equivalent, which
means that inputs can be permuted on those pins without
changing the function of the gate output. These equivalent pins
may have different input pin loads and pin-dependent delays.
If high switching activity inputs are matched with pins that
have low input load, the power consumption can be reduced.
In particular, the internal power consumption depends on
the switching activities and the pin assignment of the input
signals. To find the minimum power pin assignment for gate $g$
matching at node $n$, we must solve the following optimization
problem:

$\ldots \sum_{j \in \mathrm{internal\_nodes}(g)} \ldots$

where $PP(n, g)$ is the set of all valid pin permutations
for gate $g$ matching at node $n$, pins($g$) is the set of pins
of $g$, $\pi(i)$ is the input that is mapped to pin $i$ under pin
permutation $\pi$, and $E_j(\mathrm{charge/discharge}, \pi)$ is the expected
number of charge/discharge events for an internal node $j$ under
pin permutation $\pi$. The optimum solution can be found by
enumerating all valid pin permutations. This is not so costly
since the number of pins for typical library gates is small (< 6).
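The enumeration can be sketched as below. The cost used here, input pin capacitance times the activity of the assigned signal plus a caller-supplied internal-node term, is only an assumed stand-in for the objective above; the sketch also assumes that all pins are functionally equivalent, and every name in it is hypothetical.

```python
from itertools import permutations


def best_pin_assignment(pin_caps, activity, internal_cost=lambda perm: 0.0):
    """Enumerate pin permutations and return the cheapest one.

    pin_caps[i]   : input capacitance of pin i
    activity[s]   : switching activity of input signal s
    internal_cost : callable giving the internal-node cost of a permutation
                    (perm[i] is the signal assigned to pin i)
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(activity):
        cost = sum(cap * activity[sig] for cap, sig in zip(pin_caps, perm))
        cost += internal_cost(perm)
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


perm, cost = best_pin_assignment([1.0, 2.0], {"a": 0.5, "b": 0.1})
```

With no internal term, the high-activity signal `a` lands on the low-capacitance pin; an internal-node term can reverse that choice, which is why the two contributions have to be minimized together.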

The pin permutation can be directly incorporated in the
technology mapping process as follows. For each match $g$, all
equivalent pin permutations are generated and the corresponding
power and delay costs are calculated. Permutations that
result in inferior power-delay trade-off points will be dropped.
Permutations that result in non-inferior solutions are stored in
the power-delay trade-off curve along with the corresponding
power and delay values. This may lead to a large number of
points on the trade-off curve. In that case, we can restrict the
number of points on the curves to a user-defined number $M$
and thus manage the space complexity (see Section 4.5).
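How the curve is thinned to $M$ points is not described in this section, so the rule below is purely an assumption for illustration: keep the fastest and the lowest-power points and sample the rest evenly by position. The function name `limit_points` is hypothetical.

```python
def limit_points(curve, m):
    """Reduce a pruned power-delay curve (sorted by time) to at most m points.

    Assumed rule: always keep both end points and pick the remaining
    points at evenly spaced positions along the curve.
    """
    if m < 2:
        raise ValueError("m must be at least 2 to keep both end points")
    if len(curve) <= m:
        return list(curve)
    last = len(curve) - 1
    indices = sorted({round(k * last / (m - 1)) for k in range(m)})
    return [curve[i] for i in indices]


curve = [(1, 9), (2, 7), (3, 6), (4, 4), (5, 1)]
thinned = limit_points(curve, 3)
```

Any thinning rule trades optimality for memory, since a dropped point might have been the best choice for some required time.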

The concept of the power-delay trade-off curve can be extended
to include the area trade-off. Instead of generating a set
of (power, delay) values, the trade-off curve can consist
of a set of (power, delay, area) values. One drawback of
the three-dimensional trade-off space is that the number of
points that must be generated in order to find good mapping
solutions becomes very large. A balance between optimality
and computation cost has to be reached in order to determine
how many trade-off points should be stored.
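The inferiority rule carries over to three dimensions as ordinary dominance. The quadratic-time sketch below is a minimal illustration with hypothetical names; points are `(power, delay, area)` tuples.

```python
def dominates(a, b):
    """True if a is no worse than b in every coordinate and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def prune_3d(points):
    """Keep only the non-dominated (power, delay, area) points."""
    unique = sorted(set(points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


surface = prune_3d([(1, 5, 3), (2, 5, 3), (1, 6, 2), (1, 5, 3)])
```

Unlike the two-dimensional curve, the surviving set is no longer totally ordered, which is one reason the number of stored points grows quickly.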


VII. CONCLUDING REMARKS

We first described an extended power consumption model
that includes the power consumption at the internal nodes of
a gate. We then showed how to calculate the internal power
consumption based on the notion of charging and discharging
probabilities and transistor graphs. Experimental results show
that the internal power consumption is about 16.5% of the
power consumption due to gate capacitances and hence is
significant. Based on the extended power consumption model,
we presented algorithms for low power technology decomposition
and mapping. In particular, we generate networks with
minimum total switching activity and feed them to a
delay-constrained power efficient technology mapper that hides the
highly active nodes inside the mapped gates. Experimental
results show that significant reduction in power consumption
is achieved without performance degradation.

Technology mapping for low power under a real delay
model was also addressed, where we showed that even for
trees, optimal mapping cannot be obtained. We also considered
signal-to-pin assignment for low power.

REFERENCES

[1] MOSIS data sheet, Information Sciences Institute, University of Southern California, 1992.
[2] R. K. Brayton, G. D. Hachtel, and A. L. Sangiovanni-Vincentelli, "Multilevel logic synthesis," Proc. IEEE, vol. 78, pp. 264-300, Feb. 1990.
[3] R. Bryant, "Graph-based algorithms for Boolean function manipulation," IEEE Trans. Comput., vol. C-35, pp. 677-691, Aug. 1986.
[4] R. Burch, F. Najm, P. Yang, and D. Hocevar, "Pattern independent current estimation for reliability analysis of CMOS circuits," in Proc. 25th Design Automation Conf., June 1988, pp. 294-299.
[5] K. Chaudhary and M. Pedram, "A near-optimal algorithm for technology mapping minimizing area under delay constraints," in Proc. 29th Design Automation Conf., June 1992, pp. 492-498.
[6] M. A. Cirit, "Estimating dynamic power consumption of CMOS circuits," in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1987, pp. 534-537.
[7] E. Detjens, G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, "Technology mapping in MIS," in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1987, pp. 116-119.
[8] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco, "Estimate of signal probability in combinational logic networks," in Proc. European Test Conf., 1989, pp. 132-138.
[9] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average switching activity in combinational and sequential circuits," in Proc. 29th Design Automation Conf., June 1992, pp. 253-259.
[10] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952.
[11] D. S. Parker, Jr., "Conditions for optimality of the Huffman algorithm," SIAM J. Computing, vol. 9, no. 3, pp. 470-489, Aug. 1980.
[12] K. Keutzer, "DAGON: Technology binding and local optimization by DAG matching," in Proc. 24th Design Automation Conf., June 1987, pp. 341-347.
[13] L. Larmore and D. S. Hirschberg, "A fast algorithm for optimal length-limited Huffman codes," J. Assoc. for Computing Machinery, vol. 37, no. 3, pp. 464-473, 1990.
[14] F. Najm, "Transition density, a stochastic measure of activity in digital circuits," in Proc. 28th Design Automation Conf., June 1991, pp. 644-649.
[15] F. N. Najm, R. Burch, P. Yang, and I. Hajj, "Probabilistic simulation for reliability analysis of CMOS VLSI circuits," IEEE Trans. Computer-Aided Design, vol. 9, no. 4, pp. 439-450, Apr. 1990.
[16] A. Papoulis, Probability, Random Variables and Stochastic Processes. McGraw-Hill, 1984.
[17] S. C. Prasad and K. Roy, "Circuit activity driven multilevel logic optimization for low power reliable operation," in Proc. European Design Automation Conf., Feb. 1993, pp. 368-372.
[18] H. Savoj and H. Y. Wang, "Improved scripts in MIS-II for logic minimization of combinational circuits," in Proc. Int. Workshop on Logic Synthesis, May 1991.
[19] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipation and random pattern testability of CMOS combinational logic networks," in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1992, pp. 402-407.
[20] C. Y. Tsui, M. Pedram, and A. Despain, "Efficient estimation of dynamic power consumption under real delay model," in Proc. IEEE Int. Conf. Computer-Aided Design, Nov. 1993, pp. 224-228.
[21] C. Y. Tsui, M. Pedram, and A. Despain, "Exact and approximate methods for calculating signal and transition probabilities in FSMs," in Proc. 31st Design Automation Conf., June 1994, pp. 18-23.
[22] H. Vaishnav and M. Pedram, "Pcube: A performance driven placement algorithm for low power designs," in Proc. EURO-DAC, Sept. 1993, pp. 72-77.
[23] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits," IEEE J. Solid-State Circuits, vol. SC-19, pp. 468-473, Aug. 1984.

Chi-ying Tsui received his B.Sc. degree in electrical engineering from the University of Hong Kong in 1982 and the M.Sc. degree in computer engineering from the University of Southern California in 1989. Currently, he is a Ph.D. candidate in computer engineering at the University of Southern California, working on VLSI design and CAD algorithms for high performance low power microprocessor design. His research interests include power analysis and optimization for CMOS circuits and hardware/software co-design for high performance low power processors.

Massoud Pedram (S'88-M'90) is an assistant professor of Electrical Engineering - Systems at the University of Southern California. He received his B.S. degree in Electrical Engineering from the California Institute of Technology in 1986 and his M.S. and Ph.D. degrees in Electrical Engineering and Computer Sciences from the University of California, Berkeley in 1989 and 1991, respectively. Dr. Pedram received the best paper award in the CAD track at the IEEE Int'l Conf. on Computer Design (1989) and the National Science Foundation's Research Initiation Award (1992). He has served on the Technical Program Committee of the ACM/IEEE Design Automation Conference since 1992 and was the General Chair of the 1994 International Workshop on Low Power Design. His research interests and contributions have been in the areas of logic synthesis, physical design and CAD for low power.

Alvin Despain (S'58-M'65) received his B.S. (1960), M.S. (1962), and Ph.D. (1966) degrees in Electrical Engineering from The University of Utah. He has been an assistant research professor at The University of Utah, an associate professor at Utah State University, a visiting associate professor at Stanford University, a professor at The University of California at Berkeley and has been at USC since 1989. Dr. Despain is a pioneer in the study of high performance computer systems for symbolic calculations. His research group builds experimental software and hardware systems including compilers, custom VLSI processors, and multiprocessor systems. The goal is to determine principles for the design of high performance computer systems. His research interests include computer architecture, multiprocessor and multicomputer systems, logic programming, and design automation. Dr. Despain is a member of IEEE, ACM, and AAAI.
