
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 6, JUNE 2004 573

Reducing Dynamic Power Consumption in Synchronous Sequential Digital Designs Using Retiming and Supply Voltage Scaling

Noureddine Chabini and Wayne Wolf, Fellow, IEEE

Abstract: The problem of minimizing dynamic power consumption by scaling down the supply voltage of computational elements off critical paths is widely addressed in the literature for the case of combinational designs. The problem is NP-hard in general. To address the problem in the case of synchronous sequential digital designs, one needs to move some registers while applying voltage scaling. Moving these registers shifts some computational elements off critical paths, and can be done by basic retiming. Integrating basic retiming and supply voltage scaling to address this NP-hard problem cannot in general be done in polynomial run time. In this paper, we propose to first apply a guided retiming and then to apply supply voltage scaling on the retimed design. We devise new polynomial time algorithms to realize this guided retiming and the supply voltage scaling on the retimed design. We also show that the problem in the case of combinational designs is not NP-hard for some combinational circuits with certain structure, and give a polynomial time algorithm to solve it optimally. Methods to determine lower bounds on the optimal reduction of dynamic power consumption are also provided. Experimental results on known benchmarks show that the proposed approach can reduce dynamic power consumption by up to 61% for single-phase designs with minimal clock period. They also show that it can solve the problem optimally and produce converter-free designs with reduced dynamic power consumption. For large circuits from the ISCAS'89 benchmark suite, the proposed algorithms run in 15 s to 1 h.

Index Terms: CMOS, combinational digital designs, dynamic power, power consumption, retiming, sequential digital designs, supply voltage scaling, timing constraints.

I. INTRODUCTION

Power consumption has become a critical issue in the design of current and next-generation digital systems. For portable systems, one needs to reduce the power consumption to prolong the battery life. Prolonging battery life is required for some critical portable systems such as wearable medical systems. It has also become a product differentiator in the market. The power consumed in high-speed systems is converted to heat, which requires special cooling devices. For the latter systems, one then needs to reduce the cooling cost.

Manuscript received March 14, 2003; revised August 13, 2003.

N. Chabini is with the Department of Electrical and Computer Engineering, Royal Military College of Canada, Kingston, ON, K7K 7B4, Canada (e-mail: [email protected]).

W. Wolf is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2004.827569

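In CMOS technology, the average power consumption in a digital design is defined as follows [7]:

$$P_{\text{avg}} = P_{\text{dynamic}} + P_{\text{short-circuit}} + P_{\text{leakage}} + P_{\text{static}} \tag{1}$$

where $P_{\text{dynamic}}$, $P_{\text{short-circuit}}$, $P_{\text{leakage}}$, and $P_{\text{static}}$ stand, respectively, for dynamic power, short-circuit power, leakage power, and static power.

Dynamic power is arguably the dominant component in (1). It is a quadratic function of the supply voltage, denoted $V_{dd}$ [7]:

$$P_{\text{dynamic}} = \alpha \, C_L \, V_{dd}^{2} \, f \tag{2}$$

where $\alpha$ is the switching activity factor, $C_L$ is the loading capacitance, and $f$ is the clock frequency.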
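The quadratic dependence on the supply voltage can be illustrated numerically. The following is a minimal sketch, assuming the usual form of (2), namely the product of switching activity, load capacitance, squared supply voltage, and clock frequency; the function name and all numerical values are hypothetical and chosen only for illustration.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power, assuming the form alpha * C_L * V_dd^2 * f."""
    return alpha * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point; only the supply voltage changes.
p_high = dynamic_power(alpha=0.2, c_load=1e-12, v_dd=5.0, f_clk=100e6)
p_low = dynamic_power(alpha=0.2, c_load=1e-12, v_dd=3.0, f_clk=100e6)

# Relative saving when going from 5 V to 3 V: 1 - (3/5)^2.
reduction = 1.0 - p_low / p_high
```

Under these assumptions the saving depends only on the ratio of the two voltages, which is why lowering the supply of even a few elements can be worthwhile. The sketch ignores the accompanying increase in execution delay.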

The short-circuit power accounts for a small portion of the total power consumed described by (1). It is approximately defined by the following [7]:

(3)

where $K$ is a constant and $V_t$ is the threshold voltage.

Even though a transistor is "off," it conducts some nonzero current. We denote by $I_{\text{leakage}}$ the total current conducted by transistors that are in "off" mode. Leakage power consumption can be approximately expressed as follows [7]:

$$P_{\text{leakage}} = I_{\text{leakage}} \, V_{dd} \tag{4}$$

The current $I_{\text{leakage}}$ depends on the threshold voltage, and may increase quickly when the latter becomes too small.

The magnitude of the dissipation of $P_{\text{static}}$ depends on the logic family used [7]. For instance, one may need to use complementary CMOS gates instead of pseudo-NMOS ones to reduce the consumption of $P_{\text{static}}$.

Basic retiming has been proposed in [3] as an optimization technique for synchronous sequential digital designs. This technique changes the location of registers in the design in order to achieve one of the following goals: 1) minimizing the clock period; 2) minimizing the number of registers; or 3) minimizing the number of registers for a target clock period.

Minimizing dynamic power for synchronous sequential digital designs is addressed in the literature. Paper [6] presented heuristics to minimize the switching activity. The approach in [6] is based on the fact that registers have to be positioned on the output of computational elements of high switching activity, since the output of a register switches only at the arrival of the clock signal, whereas a computational element may switch many times during the clock period.

1063-8210/04$20.00 © 2004 IEEE

In [5], fixed-phase retiming is proposed to reduce dynamic power consumption. The edge-triggered circuit is first transformed to a two-phase level-clocked circuit, by replacing each edge-triggered flip-flop by two latches. Using the resulting level-clocked circuit, the latches of one phase are kept fixed, while the latches belonging to the other phase are moved onto wires with high switching activity and loading capacitance.

Due to the quadratic term in (2), the dynamic power may be significantly reduced by scaling down the supply voltage of some computational elements. Scaling down the supply voltage of a computational element, however, increases the execution delay of that element.

For a fixed threshold, and based on (3) and (4), scaling down the supply voltage of some computational elements may also reduce the short-circuit power and leakage power consumptions. In this paper, we focus on minimizing dynamic power consumption.

The problem of minimizing dynamic power consumption (MDPC) by scaling down the supply voltage of computational elements off critical paths has been widely addressed in the case of combinational digital designs. For this kind of design, the interested reader is referred to [2], [8], [9], [12], and [13] for a literature review on approaches to the problem. Digital systems with multiple supply voltages have been designed [10], [11], which demonstrates the feasibility of using multiple supply voltages to reduce power consumption.

In this paper, we address the MDPC problem for synchronous sequential digital designs. Since critical paths are related to the position of registers in the design, our aim is not just to scale down the supply voltage of computational elements off critical paths, but also to move registers from their positions in order to maximize the number of computational elements off critical paths, leading to a minimum dynamic power consumption. Registers have to be moved from their positions by basic retiming [3]. Instead of unifying basic retiming and supply voltage scaling, we propose to apply a guided retiming followed by the application of voltage scaling on the retimed design. Unifying basic retiming and supply voltage scaling to minimize dynamic power consumption cannot be done in general by a polynomial time algorithm, since the supply voltage scaling problem is NP-hard in general even for the case of combinational designs [2].

We provide new polynomial time algorithms to realize the guided retiming as well as the supply voltage scaling on the retimed design. The algorithms are based on linear programs. Experimental results have shown that this approach can reduce dynamic power consumption by up to 61%. We compare our approach with the approach in [1], which unifies basic retiming and supply voltage scaling to address the MDPC problem using a mixed integer linear program (MILP). Based on the obtained numerical results, our approach can reduce dynamic power consumption by factors equal or very close to the reduction factors obtained by [1]. Also, it is based on polynomial time algorithms, whereas [1] uses an MILP whose worst-case execution time is exponential for the case of the MDPC problem. Although we were able to optimally solve this MILP with run times ranging from 1 to 25 min using small high-level designs, it was not possible to solve it for large designs such as circuits from the ISCAS'89 benchmark suite [14], even after around 1 day of nonstop computation. By contrast, experimental results have shown that the run time of the proposed approach on the latter circuits ranges from 15 s to 1 h.

We also show that the MDPC problem in the case of combinational designs is not NP-hard for some combinational circuits with certain structure, and give a polynomial time algorithm to optimally solve it. Also, we provide methods to determine lower bounds on the value of dynamic power consumption that can be obtained if the general form of this problem is solved optimally.

The rest of the paper is organized as follows. In Section II, we present how a synchronous sequential digital design is modeled in this paper, give an introduction to the basic retiming technique, and define incidence, unimodular, and Eulerian matrices. Section III shows, with an example, that one needs to use retiming in minimizing dynamic power consumption for sequential designs by supply voltage scaling. The formulation of the problem we address and a heuristic resolution approach are presented in Section IV. In Section V, we devise a method to realize the guided retiming. We devise exact and heuristic methods to realize the supply voltage scaling on the retimed design in Section VI. Although the supply voltage scaling problem is NP-hard, we identify in Section VII some special cases for which this problem can be optimally solved in polynomial run time, and give methods to solve them. Then, we present how one can determine lower bounds on the minimal reduction of dynamic power consumption by supply voltage scaling. Extensions are outlined in Section VIII. Experimental numerical results are provided in Section IX to show the effectiveness of the approach in reducing dynamic power consumption. Section X concludes the paper.

II. PRELIMINARIES

A. Design Representation

We represent a synchronous sequential digital design (as in [3]) by a directed cyclic graph $G = (V, E, d, w)$, where $V$ is the set of computational elements in the design, and $E$ is the set of edges that represent interconnections between vertices. Each vertex $v$ in $V$ has a nonnegative integer execution delay $d(v)$. Each edge $e_{u,v}$, from node $u$ to node $v$, in $E$ is weighted with a register count $w(e_{u,v})$, representing the number of registers on the wire between $u$ and $v$.

Fig. 1 presents a directed cyclic graph model of a synchronous sequential circuit. The execution delay of each computational element of this circuit is specified as a label on the left of each node. The symbol inside each node is the label of that node.

Since a combinational digital design does not have registers on the wires, it can be modeled as an acyclic graph $G = (V, E, d)$, where $V$, $E$, and $d$ are defined as above.

B. Basic Retiming

As introduced in Section I, basic retiming (or retiming for short in this paper) [3] moves registers in the design in order to optimize some figures of merit. To move registers, one needs to assign an integer value to each computational element $v$. The physical meaning of the assigned values can be viewed as follows. Let $r(v)$ be the integer value assigned to a computational element $v$. If $r(v)$ is positive, then we have to move $r(v)$ registers from each output wire of $v$ and put them on each input wire of $v$, assuming that we have at least $r(v)$ registers on each output wire of $v$. If $r(v)$ is negative, the previous process is reversed. When $r(v)$ is equal to zero, no register has to be moved across $v$.

Fig. 1. Directed cyclic graph model.

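Let $G = (V, E, d, w)$ be a synchronous sequential digital design. Mathematically, retiming is defined as a function $r : V \to \mathbb{Z}$, which transforms $G$ to a functionally equivalent synchronous sequential digital design $G_r = (V, E, d, w_r)$. The set $\mathbb{Z}$ represents the integers.

The weight of each edge $e_{u,v}$ in the retimed design $G_r$ is defined as follows:

$$w_r(e_{u,v}) = w(e_{u,v}) + r(v) - r(u) \tag{5}$$

Since the weight of each edge in $G_r$ represents the number of registers on that edge, we must have

$$w_r(e_{u,v}) \ge 0, \quad \forall e_{u,v} \in E \tag{6}$$

Any retiming $r$ that satisfies (6) is called a valid retiming. From expressions (5) and (6) one can deduce the following inequality:

$$r(u) - r(v) \le w(e_{u,v}), \quad \forall e_{u,v} \in E \tag{7}$$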
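The edge-weight update and the validity check can be sketched in a few lines. This is an illustrative sketch, assuming the update rule in which an edge from u to v gains r(v) registers and loses r(u) registers; the three-node ring, the node names, and the function names are hypothetical and are not the design of Fig. 1.

```python
def retime(w, r):
    """Retimed register counts: w_r(u, v) = w(u, v) + r[v] - r[u]."""
    return {(u, v): regs + r[v] - r[u] for (u, v), regs in w.items()}


def is_valid(w, r):
    """A retiming is valid when no edge ends up with a negative count."""
    return all(regs >= 0 for regs in retime(w, r).values())


# Hypothetical ring a -> b -> c -> a with two registers on the arc (c, a).
w = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}

r_good = {"a": 0, "b": 1, "c": 1}   # moves one register across b and c
r_bad = {"a": 1, "b": 0, "c": 0}    # would need a register on (a, b)

w_good = retime(w, r_good)
```

In this example the total register count around the ring is unchanged by the valid retiming, as expected from the fact that the change along a path depends only on its endpoints.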

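We denote by $P_{u,v}$ a path from node $u$ to node $v$ in $V$. Equation (5) implies that for every two nodes $u$ and $v$ in $V$, the change in the register count along any path $P_{u,v}$ after retiming depends only on its two endpoints

$$w_r(P_{u,v}) = w(P_{u,v}) + r(v) - r(u) \tag{8}$$

where

$$w(P_{u,v}) = \sum_{e_{x,y} \in P_{u,v}} w(e_{x,y}) \tag{9}$$

We denote by $d(P_{u,v})$ the delay of a path $P_{u,v}$ from node $u$ to node $v$. $d(P_{u,v})$ is the sum of the execution delays of all the computational elements that belong to $P_{u,v}$.

A 0-weight path is a path such that $w(P_{u,v}) = 0$. The minimal clock period of a synchronous sequential digital design is the delay of the longest 0-weight path. It is defined by the following:

$$\Pi = \max_{P_{u,v}} \{\, d(P_{u,v}) \mid w(P_{u,v}) = 0 \,\} \tag{10}$$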
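The clock period defined above can be computed by a longest-path search restricted to register-free arcs. The sketch below is one possible way to do this, assuming that every cycle of the design carries at least one register (so the register-free subgraph is acyclic); the graph, delays, and function name are hypothetical.

```python
from functools import lru_cache


def clock_period(d, w):
    """Delay of the longest path that uses only arcs with zero registers."""
    zero_succ = {v: [] for v in d}
    for (u, v), regs in w.items():
        if regs == 0:
            zero_succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Delay of u plus the best register-free continuation, if any.
        return d[u] + max((longest_from(v) for v in zero_succ[u]), default=0)

    return max(longest_from(v) for v in d)


# Hypothetical ring a -> b -> c -> a.
d = {"a": 10, "b": 20, "c": 5}
w_before = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
w_after = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}  # after a valid retiming

period_before = clock_period(d, w_before)
period_after = clock_period(d, w_after)
```

Here the longest register-free path shrinks from a, b, c to b, c once a register is placed on the arc from a to b, which illustrates how retiming changes the critical paths.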

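Two matrices called $W$ and $D$ are used in most retiming algorithms. They are defined as follows [3]:

$$W(u,v) = \min \{\, w(P_{u,v}) \mid P_{u,v} \text{ is a path from } u \text{ to } v \,\} \tag{11}$$

$$D(u,v) = \max \{\, d(P_{u,v}) \mid P_{u,v} \text{ is a path from } u \text{ to } v \text{ and } w(P_{u,v}) = W(u,v) \,\} \tag{12}$$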
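One standard way to obtain both matrices is an all-pairs shortest-path computation on lexicographically ordered pairs, in the spirit of [3]. The sketch below assumes that W(u, v) is the minimum register count over paths from u to v and D(u, v) the maximum delay among those minimum-register paths, and that every cycle carries at least one register. The function name and the example graph are hypothetical, and Floyd-Warshall is used here for brevity rather than efficiency.

```python
import itertools
import math


def compute_w_d(d, w):
    """Return dictionaries W and D keyed by (u, v) for reachable pairs."""
    nodes = list(d)
    unreachable = (math.inf, 0)
    # dist[u][v] = (registers, -delay of every node on the path except v)
    dist = {u: {v: unreachable for v in nodes} for u in nodes}
    for u in nodes:
        dist[u][u] = (0, 0)
    for (u, v), regs in w.items():
        dist[u][v] = min(dist[u][v], (regs, -d[u]))
    # product() varies the last index fastest, so k is the outer loop.
    for k, i, j in itertools.product(nodes, repeat=3):
        if math.isinf(dist[i][k][0]) or math.isinf(dist[k][j][0]):
            continue
        via = (dist[i][k][0] + dist[k][j][0], dist[i][k][1] + dist[k][j][1])
        if via < dist[i][j]:
            dist[i][j] = via
    W = {(u, v): dist[u][v][0] for u in nodes for v in nodes
         if not math.isinf(dist[u][v][0])}
    D = {(u, v): d[v] - dist[u][v][1] for (u, v) in W}
    return W, D


# Hypothetical ring a -> b -> c -> a.
d = {"a": 10, "b": 20, "c": 5}
w = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
W, D = compute_w_d(d, w)
```

Comparing tuples lexicographically makes the search prefer fewer registers first and, among ties, the larger accumulated delay (stored negated).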

The matrices $W$ and $D$ can be computed as explained in [3].

One application of retiming is to minimize the clock period of synchronous sequential digital designs. For instance, for Fig. 1, the clock period is , which is equal to . However, we can obtain $\Pi = 30$ if we apply the following retiming vector to the vector of nodes in $G$, where the value located at the $i$th position in the retiming vector corresponds to the value assigned by $r$ to the node located at the $i$th position in the vector of nodes. The retimed design is presented in Fig. 2.

For the purpose of this paper, we extract from [3] the following two theorems (Theorems 1 and 2), which are also proved in [3].

Theorem 1: Let $G = (V, E, d, w)$ be a synchronous digital design, and let $\Pi$ be a positive real number. Then there is a retiming $r$ of $G$ such that the clock period of the resulting retimed design $G_r$ is less than or equal to $\Pi$ if and only if there exists an assignment of integer values $r(v)$ to each node $v$ in $V$ such that the following conditions are satisfied: (1) $r(u) - r(v) \le w(e_{u,v})$, $\forall e_{u,v} \in E$, and (2) $r(u) - r(v) \le W(u,v) - 1$, $\forall u, v \in V$ such that $D(u,v) > \Pi$.

Theorem 2: Let $G = (V, E, d, w)$ be a synchronous digital design, and let $\Pi$ be a positive real number. Then the clock period of $G$ is less than or equal to $\Pi$ if and only if there exists a function $s : V \to [0, \Pi]$ such that $d(v) \le s(v)$, $\forall v \in V$, and such that $s(u) + d(v) \le s(v)$, $\forall e_{u,v} \in E$ with $w(e_{u,v}) = 0$.

C. Incidence, Unimodular and Eulerian Matrices

We denote by $A$ the incidence matrix of a directed graph. The rows of $A$ are indexed by the edges of the graph and its columns are indexed by the vertices of the graph. The entries of $A$ are defined as follows: $a_{e,v}$ is equal to 1 if $v$ is the tail of the edge $e$, to $-1$ if $v$ is the head of $e$, and to 0 otherwise. For instance, the incidence matrix of the graph in Fig. 1 is

where the rows of $A$ are indexed starting from the top left corner by the arcs , , , , , and , respectively. Its columns are indexed starting from the top left corner by nodes 1, 2, 3, 4, and 5, respectively.

A matrix is totally unimodular if every one of its square submatrices has a determinant equal to $-1$, 0, or 1 [18].

A submatrix of a matrix is said to be Eulerian [18], [19] if the sum of the entries of each of its rows and the sum of the entries of each of its columns are all even.
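For small matrices, total unimodularity can be checked directly from the definition by enumerating all square submatrices. The brute-force sketch below is exponential and intended only as an illustration; the incidence matrix used is that of a hypothetical three-node ring (rows are arcs, columns are nodes, 1 at the tail and -1 at the head), not the matrix of Fig. 1.

```python
from itertools import combinations


def det(mat):
    """Integer determinant by Laplace expansion (fine for tiny matrices)."""
    if len(mat) == 1:
        return mat[0][0]
    total = 0
    for j, entry in enumerate(mat[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in mat[1:]]
            total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(mat):
    """True when every square submatrix has determinant -1, 0, or 1."""
    n_rows, n_cols = len(mat), len(mat[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[mat[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Incidence matrix of the hypothetical ring a -> b -> c -> a.
ring = [[1, -1, 0],
        [0, 1, -1],
        [-1, 0, 1]]
# A small counterexample: its determinant is 2.
counterexample = [[1, 1],
                  [-1, 1]]
```

The ring matrix passes the check, consistent with the known fact that incidence matrices of directed graphs are totally unimodular.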


Fig. 2. Retimed design with minimal clock period $\Pi = 30$.

Fig. 3. Other retimed design with $\Pi = 30$.

III. IMPORTANCE OF UNIFYING RETIMING AND SUPPLY VOLTAGE SCALING

We show with an example the importance of combining retiming and supply voltage scaling to minimize the dynamic power consumption for synchronous sequential digital designs.

Let us use the design in Fig. 1. Suppose that we have two supply voltages, 5 and 3 V. Also, assume that if one uses the supply voltage 5 V, the computational elements will have, respectively, the execution delays , and will consume, respectively, the dynamic powers . If the supply voltage 3 V is used, the execution delays and the dynamic power consumptions for these computational elements will be, respectively, and .

Our objective is to minimize the dynamic power consumption for designs operating with a target clock period. Assume that we want a single-phase clocked design with a minimal clock period. For the design in Fig. 1, which operates initially with supply voltage 5 V, the minimal clock period is $\Pi = 30$, which is obtained as discussed in Section II-B. A possible retimed design with clock period $\Pi = 30$ is presented in Fig. 2. Another possible retimed design with the same clock period $\Pi = 30$ is presented in Fig. 3.

We now have two possible designs that both operate with a clock period $\Pi = 30$ and 5 V as a supply voltage. If one wants to reduce dynamic power consumption by scaling down the supply voltage of computational elements off critical paths, then the design in Fig. 3 is better than the one in Fig. 2. Indeed, let us compute reduction factors of dynamic power consumption for those designs using the two supply voltages above and using $\Pi = 30$ as the length of each critical path. For the design in Fig. 2, there are three critical paths. Consequently, no computational element can operate with supply voltage 3 V. Hence, we cannot reduce the dynamic power consumption for that design. For the design in Fig. 3, only two paths are critical. Consequently, computational elements 1, 2, 3, and 4 must be powered by 5 V, and computational element 5 can operate with supply voltage 3 V. In this case, dynamic power consumption is reduced by 8.38%.

To minimize, by supply voltage scaling, the dynamic power consumption for synchronous sequential designs, it is then clear that one needs to simultaneously apply retiming and supply voltage scaling.

IV. PROBLEM DEFINITION AND A HEURISTIC APPROACH FOR ITS RESOLUTION

We are given a synchronous sequential digital design $G = (V, E, d, w)$ that operates with a given supply voltage, which is called here the highest supply voltage. The design is assumed to operate at a clock period, $\Pi$, that is given by the designer or determined by applying a retiming for clock period minimization on $G$. We are also given multiple supply voltages, as well as the delay and the dynamic power consumption of each computational element for each supply voltage. The number of supply voltages and their values are assumed to be known and are not necessarily the same for all the computational elements. (In this paper, we do not deal with the problem of determining the number of supply voltages to be used, or the value of each one of them for each computational element.) Our objective is to assign new supply voltages, from the set of the given supply voltages, to the computational elements of $G$ in order to minimize the total dynamic power consumption. Each computational element will have one and only one supply voltage. Since we want to keep $G$ operating at the clock period $\Pi$, the computational elements on critical paths will be kept operating at their original supply voltages, while the supply voltages of those off critical paths are replaced by lower supply voltages.

As seen in Section III, minimizing dynamic power consumption in synchronous sequential digital designs by only changing the supply voltages of computational elements off critical paths is not enough. We also have to apply retiming to shift some computational elements off the critical paths.

In summary, our main problem in this paper is to develop an approach that applies, in some manner, a retiming and a supply voltage scaling, to obtain a functionally equivalent design that operates at a target clock period and consumes the minimum dynamic power. We refer to this problem as the MDPC problem.

To optimally solve the MDPC problem, one needs to unify retiming and supply voltage scaling. An MILP is proposed in [1] to realize this unification. However, since the problem is NP-hard in its general form, this MILP cannot be solved, in general, by a polynomial time algorithm. Thus, heuristic approaches are required to determine approximate solutions to the problem. In the following, we propose a heuristic to determine this kind of solution. We call this heuristic No-Way. To the best of our knowledge, No other Way is available in the literature at this moment to determine approximate solutions to the MDPC problem.

Before presenting the algorithm No-Way, we first introduce the basic idea on which it is based. This idea comes from the following observation. Imagine that we have a design where there is one and only one register on each wire that connects two computational elements. In this case, there is no need to compute a retiming. The supply voltage of each computational element can be scaled down until the execution delay of that element becomes equal to the given clock period. For this design, the MDPC problem transforms to the problem of finding a way to apply voltage scaling on that design in order to minimize the dynamic power consumption without changing the clock period of the design.
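The observation above, for a design with a register on every wire, reduces to an independent choice per computational element. The following sketch illustrates that special case only; the function name and the (voltage, delay, power) figures are hypothetical and are not taken from the paper's example.

```python
def scale_fully_registered(options, period):
    """Pick, for every element, the feasible operating point with least power.

    options[v] is a list of (voltage, delay, power) tuples; an operating
    point is feasible when its delay does not exceed the clock period.
    """
    choice = {}
    for v, points in options.items():
        feasible = [p for p in points if p[1] <= period]
        if not feasible:
            raise ValueError(f"element {v!r} cannot meet period {period}")
        choice[v] = min(feasible, key=lambda p: p[2])
    return choice


# Hypothetical data: two elements, two supply voltages each.
options = {
    "x": [(5.0, 10, 50.0), (3.0, 18, 18.0)],
    "y": [(5.0, 25, 80.0), (3.0, 40, 28.8)],
}
choice = scale_fully_registered(options, period=30)
```

With a clock period of 30, element x can run at 3 V while y must stay at 5 V, because its delay at 3 V would exceed the period. When some wires carry no register, the elements are no longer independent and this per-element rule does not apply.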

When only some wires have one or more registers, the idea is to maximize the number of wires having registers on them. Note that the resulting design must still have the target clock period. Maximizing the number of this kind of wire can be done by applying a retiming for a target clock period. Let us check whether this helps to reduce dynamic power consumption and, if so, whether it is sometimes possible to obtain the same dynamic power reduction as if one unifies retiming and supply voltage scaling.

Assume that we have to solve the MDPC problem for the design in Fig. 1 for a target clock period $\Pi = 30$, using the supply voltages presented in Section III. As explained in Section III, there are two possible retimed designs, both with $\Pi = 30$. The number of wires having registers is 3 for Fig. 2 and 4 for Fig. 3. As presented in Section III, dynamic power consumption is reduced only for the design in Fig. 3, which is the design that has the maximum number of arcs with registers. Hence, by first maximizing the number of arcs with registers and then applying supply voltage scaling on the resulting design, one can sometimes reduce dynamic power consumption. To answer the question of whether the proposed approach can obtain the same dynamic power reduction as if one unifies retiming and supply voltage scaling, we used the exact approach in [1] to solve this problem, and compared the obtained design with the design in Fig. 3. We found that the two designs are identical. Hence, the answer to the question is yes.

Our proposed heuristic, called No-Way, to determine an approximate solution to the MDPC problem is presented below. The algorithm No-Way produces a design with reduced dynamic power consumption, assuming that we do not consider the dynamic power consumed by registers. Since reducing the switching activities might require placing registers on the output of many computational elements, No-Way could also reduce the switching activities in the design (see the summary of [5] and [6] presented in Section I). The design produced by No-Way is guaranteed to operate with the target clock period $\Pi$.

Algorithm No-Way
Inputs:
In1- Synchronous sequential design $G = (V, E, d, w)$.
In2- Target clock period $\Pi$.
In3- Possible supply voltages for each $v \in V$:
    In3.1: the supply voltages available for each $v \in V$.
    In3.2: at each supply voltage, each $v \in V$ has a given execution delay and consumes a given dynamic power.
Output: A functionally equivalent design with clock period $\Pi$, and with a reduced dynamic power consumption.
Begin
1- Apply a retiming on $G$ to maximize the number of arcs $e_{u,v}$ such that $w_r(e_{u,v}) \ge 1$, while keeping the clock period equal to $\Pi$. Let $G_r$ be the resulting retimed design.
2- Remove all arcs $e_{u,v}$ such that $w_r(e_{u,v}) \ge 1$ from $G_r$. Let $G_c$ be the resulting combinational design.
3- Solve the MDPC problem for the case of combinational designs using $G_c$, while keeping the clock period equal to $\Pi$. ($G_c$ may not be connected. This may help to reduce the run time of the method used to solve the latter problem.) Let $G_s$ be the resulting design.
4- Put all the arcs removed in Step 2 back into $G_s$.
5- Return $G_s$.
End
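Steps 2 and 4 of the algorithm are simple bookkeeping on the retimed graph. The sketch below illustrates Step 2 and the remark that the resulting combinational design may fall apart into independent pieces; the function names and the retimed three-node ring are hypothetical, and Steps 1 and 3 are not implemented here.

```python
def remove_registered_arcs(w_r):
    """Step 2: keep register-free arcs, set aside arcs carrying registers."""
    kept = [arc for arc, regs in w_r.items() if regs == 0]
    removed = {arc: regs for arc, regs in w_r.items() if regs >= 1}
    return kept, removed


def components(nodes, arcs):
    """Weakly connected components of the combinational graph (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in arcs:
        parent[find(u)] = find(v)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Hypothetical retimed ring: registers on (a, b) and (c, a).
w_r = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
kept, removed = remove_registered_arcs(w_r)
parts = components(["a", "b", "c"], kept)
```

Each component can be handed separately to whatever method solves the combinational voltage scaling problem in Step 3, and the arcs in `removed` are restored afterwards as in Step 4.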

In Sections V and VI, we present methods to realize Steps 1 and 3 of the algorithm No-Way. Note that, in Step 1 of the algorithm, one may also need to consider register sharing [3] in order to reduce the total number of registers.

V. MAXIMIZING THE NUMBER OF WIRES WITH REGISTERS IN SYNCHRONOUS SEQUENTIAL DESIGNS

In this section, we provide a method to determine a retiming that transforms a synchronous sequential design, $G = (V, E, d, w)$, operating with a clock period $\Pi$, to another functionally equivalent design, $G_r = (V, E, d, w_r)$, operating with the same clock period $\Pi$ and having a maximum number of arcs that possess registers.

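For every arc $e_{u,v}$ in $E$, let $m(e_{u,v})$ be a 0–1 unknown variable defined as follows: $m(e_{u,v})$ takes the value 1 if $w_r(e_{u,v}) \ge 1$, and 0 otherwise. Based on the definition of the $m(e_{u,v})$'s and on Inequality (6), we then have

$$0 \le m(e_{u,v}) \le 1, \quad \forall e_{u,v} \in E \tag{13}$$

$$m(e_{u,v}) \le w_r(e_{u,v}), \quad \forall e_{u,v} \in E \tag{14}$$

Using (5), we can transform (14) to

$$r(u) - r(v) + m(e_{u,v}) \le w(e_{u,v}), \quad \forall e_{u,v} \in E \tag{15}$$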

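From Theorem 1, we can derive the following theorem.

Theorem 3: Let $G = (V, E, d, w)$ be a synchronous digital design, and let $\Pi$ be a positive real number. Then there is a retiming $r$ of $G$ such that the clock period of the resulting retimed design $G_r$ is less than or equal to $\Pi$ iff there exists an assignment of integer values $r(v)$ to each node $v$ in $V$, and 0/1 values $m(e_{u,v})$ to each arc $e_{u,v}$ in $E$, such that the following conditions are satisfied: (1) $0 \le m(e_{u,v}) \le 1$, $\forall e_{u,v} \in E$, (2) $r(u) - r(v) + m(e_{u,v}) \le w(e_{u,v})$, $\forall e_{u,v} \in E$, and (3) $r(u) - r(v) \le W(u,v) - 1$, $\forall u, v \in V$ such that $D(u,v) > \Pi$.

Proof: It can be easily derived from the proof of Theorem 1.

The number of arcs that possess registers in $G_r$ is

$$N = \sum_{e_{u,v} \in E} m(e_{u,v}) \tag{16}$$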


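The system of inequalities in Theorem 3 may have many possible solutions. To have a solution that maximizes $N$, one can add the following formula

$$\text{Maximize } N \tag{17}$$

to that system. We obtain the Integer Linear Program (ILP) (18)–(23). By solving this ILP, a retiming that maximizes $N$ is then obtained.

$$\text{Maximize } \sum_{e_{u,v} \in E} m(e_{u,v}) \tag{18}$$

Subject to

$$m(e_{u,v}) \ge 0, \quad \forall e_{u,v} \in E \tag{19}$$

$$m(e_{u,v}) \le 1, \quad \forall e_{u,v} \in E \tag{20}$$

$$r(u) - r(v) + m(e_{u,v}) \le w(e_{u,v}), \quad \forall e_{u,v} \in E \tag{21}$$

$$r(u) - r(v) \le W(u,v) - 1, \quad \forall u, v \in V \text{ such that } D(u,v) > \Pi \tag{22}$$

$$r(v) \text{ and } m(e_{u,v}) \text{ are non-negative integers} \tag{23}$$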
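For very small graphs, the objective of the ILP can be checked against an exhaustive search. The sketch below is not the ILP and not the polynomial-time method of this section: it simply enumerates bounded non-negative retiming values and tests validity and the clock period directly. All names, the bound `max_shift`, and the three-node ring are hypothetical, and every cycle is assumed to carry at least one register.

```python
from functools import lru_cache
from itertools import product


def retime(w, r):
    """w_r(u, v) = w(u, v) + r[v] - r[u]."""
    return {(u, v): regs + r[v] - r[u] for (u, v), regs in w.items()}


def clock_period(d, w):
    """Delay of the longest register-free path."""
    succ = {v: [] for v in d}
    for (u, v), regs in w.items():
        if regs == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest(u):
        return d[u] + max((longest(v) for v in succ[u]), default=0)

    return max(longest(v) for v in d)


def max_registered_arcs(d, w, period, max_shift=2):
    """Best (count, r, w_r) over retimings with values in 0..max_shift."""
    nodes = list(d)
    best = None
    for values in product(range(max_shift + 1), repeat=len(nodes)):
        r = dict(zip(nodes, values))
        w_r = retime(w, r)
        if min(w_r.values()) < 0:          # not a valid retiming
            continue
        if clock_period(d, w_r) > period:  # violates the target period
            continue
        count = sum(1 for regs in w_r.values() if regs >= 1)
        if best is None or count > best[0]:
            best = (count, r, w_r)
    return best


# Hypothetical ring a -> b -> c -> a with two registers on (c, a).
d = {"a": 10, "b": 20, "c": 5}
w = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
best_35 = max_registered_arcs(d, w, period=35)
best_20 = max_registered_arcs(d, w, period=20)
best_19 = max_registered_arcs(d, w, period=19)
```

In this example at most two of the three arcs can carry registers, since the ring holds two registers in total, and a target period below the largest element delay admits no solution. Such a search is useful only as a cross-check on toy instances.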

Theorem 4: Assuming that the clock period of the input design is less than or equal to $\Pi$, the ILP (18)–(23) always has a solution, and can always be solved optimally in polynomial run time.

Proof: Since we assume that the input design has a clock period less than or equal to $\Pi$, we do not need to move registers for clock period minimization. A retiming therefore exists, since the function that assigns the same value to all computational elements is one possible retiming such that the clock period of the retimed design is less than or equal to $\Pi$. Hence, by Theorem 3, the system (19)–(23) always has a solution, which implies that the ILP always has a solution.

The ILP (18)–(23) can be solved in polynomial run time. Indeed, the right-hand sides of inequalities (19)–(22) are integers. Linear programming theory [15] says that if the constraint matrix of an ILP is totally unimodular and if the right-hand sides of the inequalities [(19)–(22) in our case] are integers, then the ILP and its relaxation obtained by ignoring the integrality constraints [ignoring (23) in our case] have the same optimal solution. This relaxed ILP is a linear program, and hence it can be solved optimally in polynomial run time by using, for instance, the methods of [16], [17]. Now, to complete the proof, we have to prove that the constraint matrix of the ILP (18)–(23) is totally unimodular.

The nonnegativity constraint for the unknown variables is already represented by Inequality (19). By first adding to the ILP (18)–(23) a nonnegativity constraint for the unknown variables

[derived from (23)], the constraint matrix of the ILP is

where the submatrices of are as follows. Matrices , , , and are identity matrices: each row contains exactly one 1, and the other entries of that row are equal to 0. Matrices , , , and are zero matrices, all of whose entries are equal to 0. The matrix is the incidence matrix of the directed cyclic graph,

, modeling the input design. The matrix is the incidence matrix of the graph derived from by removing all its arcs and adding new arcs to . The added arcs express the such-that clause in (22). Each row of and contains exactly one 1 and one −1; the other entries of that row are equal to 0.

To prove that is totally unimodular, we first state the following theorem, which is also proved in [18].

Theorem 5: A matrix is totally unimodular if and only if, for every square Eulerian submatrix of , the sum of the entries of is divisible by 4.

Let be an arbitrary square Eulerian submatrix of . The sum of the entries of each row of is even; all columns of have this property too. If is a zero matrix, then the sum of its entries is divisible by 4. In this case, is totally unimodular by Theorem 5.

We focus now on the case where is a nonzero matrix. Due to the Eulerian property, any nonzero entry of comes only from

or . Recall that each row of and contains exactly one 1 and one −1; the other entries of that row are equal to 0. Since the sum of the entries of each row of must be even, such a row contains exactly one 1 and one −1, which come from a row of or ; the rest of its entries are equal to 0. In this case, the sum of all the entries of each row of is equal to zero. Hence, the sum of all the entries of is equal to 0, which is divisible by 4. Consequently, by Theorem 5, is also totally unimodular in this case.

For an arbitrary square Eulerian submatrix of , we have proved that the sum of all the entries of is divisible by 4. By Theorem 5, we then have that the constraint matrix of the ILP (18)–(23) is totally unimodular, and hence this ILP is solvable in polynomial run-time.
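The criterion of Theorem 5 can be checked numerically on small matrices. The Python sketch below compares it with the definition of total unimodularity (every square submatrix has determinant −1, 0, or 1) on a hypothetical matrix formed by an arc-by-node incidence matrix placed next to an identity matrix, and on a classic matrix that is not totally unimodular. The example graph, the sign convention (−1 at the tail, 1 at the head), and all function names are assumptions made for illustration; the check is exponential and only meant for toy sizes.

```python
from itertools import combinations

def det(m):
    # Integer determinant by cofactor expansion (adequate for tiny matrices).
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * entry * det(minor)
    return total

def square_submatrices(a):
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), k):
            for cs in combinations(range(cols), k):
                yield [[a[i][j] for j in cs] for i in rs]

def is_tu_by_determinants(a):
    # Definition: every square submatrix has determinant in {-1, 0, 1}.
    return all(det(s) in (-1, 0, 1) for s in square_submatrices(a))

def is_tu_by_eulerian_sums(a):
    # Theorem 5: every square submatrix whose row sums and column sums are
    # all even (Eulerian) must have a total entry sum divisible by 4.
    for s in square_submatrices(a):
        eulerian = (all(sum(row) % 2 == 0 for row in s)
                    and all(sum(col) % 2 == 0 for col in zip(*s)))
        if eulerian and sum(map(sum, s)) % 4 != 0:
            return False
    return True

# Hypothetical 3-node graph; rows are indexed by arcs (tail, head).
arcs = [(0, 1), (1, 2), (2, 0), (0, 2)]
incidence = [[1 if n == head else -1 if n == tail else 0 for n in range(3)]
             for tail, head in arcs]
identity = [[int(i == j) for j in range(len(arcs))] for i in range(len(arcs))]
constraint = [n_row + i_row for n_row, i_row in zip(incidence, identity)]

# A 0/1 matrix with determinant 2, hence not totally unimodular.
bad = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]

print(is_tu_by_determinants(constraint), is_tu_by_eulerian_sums(constraint))
print(is_tu_by_determinants(bad), is_tu_by_eulerian_sums(bad))
```

Both tests accept the incidence-plus-identity matrix and reject the second matrix, whose only offending Eulerian submatrix is the matrix itself (entry sum 6).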

VI. AN APPROACH FOR SOLVING THE MDPC PROBLEM FOR COMBINATIONAL DESIGNS

The latency, denoted here by , of a combinational design can be fixed by the designer or assigned to a value equal to its lower bound. This lower bound, denoted here by , is defined as

(24)

The problem we address in this section is defined as follows. We are given a combinational design that operates with a given supply voltage, which is called here the highest supply voltage. We are also given multiple supply voltages, and the delay as well as the dynamic power consumption of each computational element for each supply voltage. The number of supply voltages and their values are assumed to be known and are not necessarily the same for all the computational elements (again, we do not address in this paper the problem of determining the number of supply voltages to be used, and the value of each one of them for each computational element). Our objective is


then to assign new supply voltages, from the set of the given supply voltages, to the computational elements of in order to minimize the total dynamic power consumption without changing the target latency of the design. Each computational element will have one and only one supply voltage. Since we want to still have the given latency , the computational elements on critical paths will be kept operating at their original supply voltages, while the supply voltages of those off critical paths are replaced by lower supply voltages.

Before presenting an approach for solving the problem, we first provide additional notation and definitions. Based on the supply voltages used, we assume that we have different implementations for each computational element . If supply voltage , where , is used, then the computational element has an execution delay and consumes the dynamic power . Assume that the supply voltages are sorted from the highest to the lowest. For each , and for each such that , we denote by a binary variable, which is equal to 1 if supply voltage is used, and to 0 otherwise.

We state the following theorem, which we will use to develop an MILP to optimally solve the problem.

Theorem 6: Let be a combinational design, and let be a positive real number. Then the latency of is less than or equal to iff there exists an assignment of nonnegative real values to each node in such that the following conditions are satisfied: (1) , , (2)

, , and (3) , .

Proof: It is the same as the proof of Theorem 2, since we obtain Theorem 2 if we replace with .

Based on the definition of the problem, the execution delay of each computational element can be expressed as follows:

(25)

(26)

From Theorem 6 and (25) and (26), one can then derive the following theorem.

Theorem 7: Let be a combinational design, and let be a positive real number. Then the latency of is less than or equal to iff there exists an assignment of nonnegative real values to each node in , and 0/1 values to each binary variable such that the following conditions are satisfied

Proof: It can be easily derived from Theorem 6, using (25) and (26).

We now investigate the determination of an MILP to optimally solve the MDPC problem in the case of combinational designs. The objective function to minimize in the MILP is the total dynamic power consumption, which is equivalent to

(27)

The system of equalities and inequalities in Theorem 7 may have many possible solutions. To have a solution that leads to a minimum dynamic power consumption, one may add Expression (27) to that system as well as 0–1 constraints on variables ’s. We obtain the MILP (28)–(33). By solving it we obtain the optimal solution to the MDPC problem.

(28)

(29)

(30)

(31)

(32)

and

(33)

Theorem 8: Assuming that the latency of the input design is less than or equal to when the highest supply voltages are used, the MILP (28)–(33) always has a solution, and the design obtained from the optimal solution to this MILP has a latency less than or equal to .

Proof: The MILP always has a solution. Indeed, the input design is assumed to have a latency less than or equal to when the highest supply voltages are used. From Theorem 7, we know that the system (29)–(33) has a solution, which implies that the MILP always has a solution.

The design obtained from the optimal solution to this MILP has a latency less than or equal to . Indeed, this solution satisfies (29)–(32); hence, Theorem 7 applies, which implies that the latency of the design is less than or equal to .

The size of the combinational designs in Step 3 of the algorithm No-Way may be very small. Consequently, the MILP (28)–(33) can in this case be used to solve the problem optimally in reasonable run time.
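For very small designs, the optimum that the MILP targets can also be found by exhaustive enumeration, which gives a convenient reference when testing a solver model. The Python sketch below tries every assignment of one implementation per computational element, discards those that exceed the latency bound, and keeps the cheapest. It is not the MILP (28)–(33): the three-node design, its delay and power figures, and the names `latency` and `min_power_assignment` are hypothetical.

```python
from itertools import product

def latency(nodes, edges, delay):
    # Longest path delay in a DAG: finish(v) = delay(v) + latest predecessor.
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    memo = {}
    def finish(v):
        if v not in memo:
            memo[v] = delay[v] + max((finish(u) for u in preds[v]), default=0)
        return memo[v]
    return max(finish(v) for v in nodes)

def min_power_assignment(impl, edges, bound):
    # impl[v] lists (delay, power) pairs, sorted from highest to lowest voltage.
    nodes = list(impl)
    best = None
    for choice in product(*(range(len(impl[v])) for v in nodes)):
        delay = {v: impl[v][k][0] for v, k in zip(nodes, choice)}
        if latency(nodes, edges, delay) > bound:
            continue  # this assignment violates the latency bound
        power = sum(impl[v][k][1] for v, k in zip(nodes, choice))
        if best is None or power < best[0]:
            best = (power, dict(zip(nodes, choice)))
    return best

# Hypothetical design: a and b both feed c; index 0 is the highest voltage.
impl = {
    "a": [(2, 10), (3, 6)],
    "b": [(1, 10), (2, 6)],
    "c": [(2, 10), (3, 6)],
}
edges = [("a", "c"), ("b", "c")]
best_power, best_choice = min_power_assignment(impl, edges, bound=4)
print(best_power, best_choice)
```

Here only b, which is off the critical path a to c, can move to the lower voltage, so the total power drops from 30 to 26. The enumeration grows exponentially with the number of elements, consistent with the NP-hardness noted in the text.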

Note that this MILP cannot always be solved in polynomial run time, since the problem is NP-hard in general. Hence, for general and very large designs, heuristics to determine approximate solutions to the problem are required. We provide in this section a new heuristic to determine this kind of solution to the


problem. We will return in Section VII to some special cases of this problem that can be solved in polynomial run time.

Since the execution delay of any implementation of the computational element is between and , we then have

(34)

(35)

Linear programs are solvable in polynomial time. To determine approximate solutions to the MDPC problem, a linear program (LP) can be derived from the MILP (28)–(33). This LP can be obtained by dropping the constraints (33), and adding the constraints (34) and (35) to the constraints of the resulting linear program. The dynamic power consumption obtained in the optimal solution of this LP is a lower bound on the value of (28).
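The lower-bound property of a relaxation can be seen on a deliberately simplified instance. Assume, purely for illustration, a design that is a single path with two supply voltages per element: switching an element to the low voltage adds a known extra delay and saves a known amount of power, and the path has a fixed timing slack. The 0–1 problem is then a knapsack problem, and its relaxation to fractional choices is solved exactly by a greedy rule. The Python sketch below is this stand-in, not the LP derived from the MILP (28)–(33); all numbers and names are hypothetical.

```python
from itertools import combinations

def relaxed_saving(items, slack):
    # Fractional relaxation: take items by decreasing saving per unit of
    # extra delay, splitting the last one. items are (extra_delay, saving)
    # pairs with extra_delay > 0.
    total, remaining = 0.0, float(slack)
    for extra, saving in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        if remaining <= 0:
            break
        fraction = min(1.0, remaining / extra)
        total += fraction * saving
        remaining -= fraction * extra
    return total

def exact_saving(items, slack):
    # 0-1 optimum by enumerating every subset (toy sizes only).
    best = 0
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            if sum(extra for extra, _ in subset) <= slack:
                best = max(best, sum(saving for _, saving in subset))
    return best

# Hypothetical path of three elements with a timing slack of 2 units.
items = [(2, 6), (1, 4), (3, 3)]
power_at_high_voltage = 30
lower_bound = power_at_high_voltage - relaxed_saving(items, slack=2)
exact_minimum = power_at_high_voltage - exact_saving(items, slack=2)
print(lower_bound, exact_minimum)
```

The relaxed problem can save 7 units of power while the best realizable assignment saves only 6, so the relaxed optimum (23) bounds the true minimum (24) from below.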

Solving this LP may lead to nonrealizable solutions, since there may exist at least one computational element such that

found in these solutions does not correspond to any execution delay of the available implementations of . However, a nonrealizable solution to the MDPC problem can be transformed into a realizable one. Indeed, once the LP is solved, the supply voltage under which computational element must operate is , where is a nonnegative integer such that

. Note that, if is on a critical path, then .
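The conversion to a realizable design can be sketched as a rounding step. The selection condition above lost its symbols in this copy, so the Python sketch below assumes the rule suggested by the example in Section IX (a computed 4.75 V becomes 5 V rather than 4.5 V): choose the lowest available voltage whose delay does not exceed the delay returned by the LP. The function name, the tolerance, and the sample delays are hypothetical.

```python
def realizable_voltage_index(lp_delay, delays_high_to_low, tol=1e-9):
    # delays_high_to_low[k] is the element's delay at the k-th supply voltage,
    # with voltages sorted from highest to lowest, so delays are nondecreasing.
    # Returns the largest index whose delay still fits within lp_delay.
    chosen = 0  # falling back to the highest voltage is always timing-safe
    for k, delay in enumerate(delays_high_to_low):
        if delay <= lp_delay + tol:
            chosen = k
    return chosen

# Hypothetical element with three implementations.
voltages = [5.0, 4.5, 4.0]
delays = [1.0, 1.5, 2.5]
for lp_delay in (1.0, 1.2, 2.0, 2.5):
    k = realizable_voltage_index(lp_delay, delays)
    print(lp_delay, "->", voltages[k], "V")
```

Rounding toward the faster implementation can only shorten path delays, so the latency bound satisfied by the LP solution remains satisfied after rounding.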

VII. SPECIAL CASES OF THE MDPC PROBLEM

In this section, we address the MDPC problem for combinational designs. Although this problem is NP-hard, there are some cases where it can be solved in polynomial run-time. In this section, we will provide one such case and devise a linear program to optimally solve the corresponding MDPC problem. We will also show how lower bounds on the optimal solution of this problem can be obtained.

Using the notation and definitions given in Section VI, a special case of this problem can be defined when: (1) ,

(i.e., we have two supply voltages for each computational element; recall that we have to use only one of them), (2) , , and (3) both the latency as well as all the possible execution delays for each are integers. In this case, the MDPC problem can be optimally solved in polynomial time, by using for instance the linear program that we will develop in the rest of this section.

We denote by the reduction of dynamic power if one switches the supply voltage of computational element from the high supply voltage to the low supply voltage. is mathematically defined as follows:

(36)

is nonnegative, since the dynamic power consumed at the high supply voltage is always greater than or equal to the one at the low supply voltage.

For each computational element , let be a 0–1 variable defined as follows. is equal to 1 if the low supply voltage is used to drive , and equal to 0 otherwise.

The execution delay of each computational element can then be defined as follows:

(37)

(38)

As we did for deriving Theorem 7, by using (37) and (38), one can derive from Theorem 6 the following theorem.

Theorem 9: Let be a combinational design, and let be a positive real number. Then the latency of is less than or equal to if there exists an assignment of nonnegative real values to each node in , and 0/1 values to each binary variable such that the following conditions are satisfied:

Proof: It can be easily derived from Theorem 6.

For this special case of the MDPC problem, instead of minimizing the objective function

(39)

as we did for the MILP (28)–(33), one can then maximize the reduction in dynamic power consumption described by (36), which is equivalent to (40)

(40)
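The equivalence between minimizing the total power and maximizing the total reduction follows from a one-line identity. The symbols in the block below are assumptions introduced for illustration, since the original notation is not legible in this copy: x_v is the 0–1 variable that selects the low supply voltage for element v, P_H(v) and P_L(v) are its dynamic power at the high and low supply voltages, and ΔP(v) is the reduction defined by (36).

```latex
P_{\mathrm{total}}(x)
  = \sum_{v} \bigl[ (1 - x_v)\,P_H(v) + x_v\,P_L(v) \bigr]
  = \sum_{v} P_H(v) \;-\; \sum_{v} \Delta P(v)\,x_v ,
\qquad
\Delta P(v) = P_H(v) - P_L(v) \ge 0 .
```

Since the first sum does not depend on x, minimizing the total power is the same as maximizing the weighted sum of the x_v with weights ΔP(v).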

The system of inequalities in Theorem 9 may have many possible solutions. To have a solution that leads to a minimum dynamic power consumption, one can add Expression (40) to that system as well as 0–1 constraints on variables ’s. We obtain the MILP (41)–(46). By solving it we obtain the optimal solution to this special case of the MDPC problem

(41)

(42)

(43)

(44)

(45)

(46)

Theorem 10: Assuming that the latency of the input design is less than or equal to when the highest supply voltages are used,


(1) the MILP (41)–(46) always has a solution, (2) the design obtained from the optimal solution to this MILP has a latency less than or equal to , and (3) this MILP can be optimally solved in polynomial run-time.

Proof: Parts (1) and (2) of the theorem can be proved as we did for Theorem 8.

To prove part (3), we proceed as we did for Theorem 4. To write the constraint matrix of the MILP, we first have to transform the expression (46) to the equivalent expressions

(47)

(48)

(49)

After transforming (46) to (47)–(49) and arranging the order of the constraints (for the proof’s purposes), the MILP (41)–(46) transforms to the equivalent MILP (50)–(56)

(50)

(51)

(52)

(53)

(54)

(55)

(56)

The MILPs (41)–(46) and (50)–(56) have the same optimal solution, since we have only replaced (46) by (47)–(49). In the rest of this proof, we will then use the MILP (50)–(56) instead of the MILP (41)–(46). Without excluding the optimal solution of the MILP (50)–(56), we assume for the purpose of the theorem’s proof that we will solve it on an extended graph

obtained from the graph modeling the input design. is derived from by adding dummy nodes that we connect to one of the nodes of by dummy arcs. Each one of these dummy nodes must be a source node of one and only one arc. Let be the number of nodes in that are the source of more than one arc from . Dummy nodes are added to until the number of nodes in that are the source node of one and only one arc from becomes equal to . To avoid excluding the optimal solution of the MILP, any dummy node must have (i.e., execution delay equal to 0 at the high supply voltage) and (this will force the objective function of the MILP to take the same value whether or is used).

The right-hand sides of inequalities (51)–(55) are integers. Consequently, if the constraint matrix of the MILP (50)–(56) is totally unimodular, then its optimal solution can be determined in polynomial run-time by first removing constraint (56) and then solving the resulting linear program obtained from that MILP. To complete the proof, we have to prove that is totally unimodular.

The constraint matrix of the MILP (50)–(56) is

where the submatrices of are as follows. Matrices , , , , and are identity matrices: each row contains exactly one 1, and the other entries of that row are equal to 0. Matrices , , and are zero matrices, all of whose entries are equal to 0. The matrix is the incidence matrix of . Each row of contains exactly one 1 and one −1; the other entries of that row are equal to 0. Recall that the rows of the incidence matrix are indexed by the edges of the graph. The matrix looks like the incidence matrix, but each one of its entries located at row and column is equal to 1 if is the tail of and to 0 otherwise. Any row of contains one and only one entry equal to 1; all the other entries of that row are equal to 0. There exist columns of

(columns induced by dummy nodes in are among them) such that each one of them contains one and only one entry equal to 1; all the other entries of the column are equal to 0. Let be the set of those columns. However, some columns of could contain more than one entry equal to 1. Let be the set of those columns. Since the rows of are indexed by arcs, one can always arrange these arcs such that columns from are never adjacent. Hence, matrix can always be arranged in such a way that any one of its columns that contains more than one entry equal to 1 is surrounded by two columns, each of which contains one and only one entry equal to 1, with all the other entries of the column equal to 0.

We use Theorem 5 to prove that is totally unimodular. Let be an arbitrary square Eulerian submatrix of . The sum of

the entries of each row of is even; all columns of have this property too. If is a zero matrix, then the sum of its entries is divisible by 4. In this case, is totally unimodular by Theorem 5. We focus in the rest of the proof on the case where is a nonzero matrix.

Due to the Eulerian property, no nonzero entry of comes from matrices , and (because for each row or column of , the sum of the entries of that row or column must be even). Hence, all the nonzero entries of come from . Since each row of contains exactly one 1 and one −1, and all the other entries of that row are equal to 0, each row of contains exactly one 1 and one −1; all the other entries of that row are equal to 0. In this case, the sum of all the entries of each row of is 0, which implies that the sum of all the entries of is 0. This sum is divisible by 4, which implies that is totally unimodular by Theorem 5.

For an arbitrary square Eulerian submatrix of , we have proved that the sum of all the entries of is divisible by 4. By Theorem 5, we then have that the constraint matrix of the MILP (50)–(56) is totally unimodular, and hence the MILP is solvable in polynomial run-time.


We now examine the MDPC problem in the case of general dual supply voltages. Let

(57)

In the case of

such that (58)

the MDPC problem may not be optimally solvable in polynomial run time, since it is NP-hard. In this case, we can derive lower bounds on the minimal value of (28) by: 1) adapting the MILP (41)–(46) as presented in the next paragraph, or 2) transforming the design to a design for which ,

, and then applying the MILP (41)–(46) on the transformed design. Note that such a lower bound could be used to prune the solution space when exact algorithms based on branch-and-bound are used to optimally solve this NP-hard problem, or to evaluate the quality of heuristics used in determining approximate solutions for this problem.

When (58) is true, the MILP (41)–(46) transforms to the MILP (59)–(64). A lower bound on the minimal value of (28) can then be determined in polynomial run-time by solving the linear program (59)–(63)

(59)

(60)

(61)

(62)

(63)

(64)

VIII. EXTENSIONS

The proposed approach can be extended to further reduce the power consumption for sequential digital designs. In the present version of the proposed approach, the number of registers sometimes decreases by applying the suggested kind of retiming. Of course, it may increase. However, an increase in the number of registers does not mean that the total power consumed by the design will always be increased. Indeed, since registers switch only at the arrival of the clock, compared to computational elements that may switch many times during the clock cycle, an increase in the number of registers after retiming can sometimes help in reducing dynamic power consumption due to switching activities (see the synthesis of [5] and [6] presented in Section I).

There is more than one strategy to control the power consumption due to registers in the retimed design. The first strategy is to produce a retimed design without increasing the number of registers. We first present how that strategy can be exploited with the proposed guided retiming before presenting other strategies.

By some mathematical manipulations as presented in the paper on basic retiming [3], the number of registers in the retimed design is

(65)

where is the number of predecessors of computational element minus the number of the successors of . Note that ’s can take negative or positive integer values. From expression (65), it is clear that if

(66)

then any valid retiming will not increase the number of registers in the retimed design. Consequently, by adding inequality (66) to the constraints of the ILP (18)–(23), our proposed guided retiming can maximize the number of arcs that possess registers without increasing the original number of registers. Let be the ILP (18)–(23) with the added constraint (66). Like the ILP (18)–(23), one property of the is that it can be solved in polynomial run-time. Indeed, when the values of ’s are in the set , one can prove this property as we did for Theorem 4. When some values of ’s are not in , then to carry out the same proof as that of Theorem 4, we need the constraint matrix of the to be totally unimodular, which implies that all the entries of must be in . To have all the entries of in , the graph modeling the design can be extended by dummy nodes in order to always have ’s with values in . Without loss of generality, Fig. 4 illustrates how dummy nodes can be added to the graph modeling the design. The symbol on each arc in Fig. 4 means that the weight of that arc is equal to its weight in the original graph (i.e., the graph before adding dummy nodes). Once dummy nodes are added, arcs are created to connect them. The delay of each dummy node is 0, which ensures that timings will not be affected by the added dummy nodes. The weight of each added arc that connects to each computational element is denoted by . The value of must be at least the number of registers that any valid retiming would move across when those arcs are not added. An algorithm to compute upper bounds on the number of registers that any valid retiming can move across is provided in [1]. That algorithm can be used to compute tight values for ’s.
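The register-count expression described above (the original count plus a sum, over computational elements, of the retiming value times the number of predecessors minus the number of successors) can be checked numerically. The Python sketch below assumes the Leiserson–Saxe convention in which an arc from u to v with weight w has weight w + r(v) − r(u) after retiming; the example graph, the retiming values, and the function names are hypothetical.

```python
def registers_direct(edges, r):
    # Sum of the retimed arc weights, w + r(v) - r(u) for each arc u -> v.
    return sum(w + r[v] - r[u] for u, v, w in edges)

def registers_by_formula(edges, r):
    # Original register count plus sum over nodes of r(v) * a(v), where
    # a(v) = (number of incoming arcs) - (number of outgoing arcs).
    balance = [0] * len(r)
    for u, v, _ in edges:
        balance[v] += 1
        balance[u] -= 1
    original = sum(w for _, _, w in edges)
    return original + sum(r[v] * balance[v] for v in range(len(r)))

# Hypothetical four-node graph with one chord, and a candidate retiming.
edges = [(0, 1, 2), (1, 2, 0), (2, 3, 1), (3, 0, 0), (1, 3, 1)]
r = [0, -1, 0, 0]
print(registers_direct(edges, r), registers_by_formula(edges, r))
```

In this example node 1 has one predecessor and two successors, so the retiming value −1 contributes one additional register (from 4 to 5). A constraint that keeps the weighted sum of retiming values nonpositive, which is presumably the role of (66), would rule this retiming out.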

If the designer accepts that the number of registers can be increased in the retimed design, then a second strategy to control the power consumption due to these registers is to make the guided retiming smarter. The guided retiming can be made smarter by, for instance, assigning a cost to each arc. Each could for instance be an upper bound on the switching activity related to the wire . Using upper bounds on switching activities as the costs, one needs to put registers on the wires with high costs in order to reduce power consumption due to switching activities. If those costs are assigned to the arcs, then instead of maximizing (18) in the ILP (18)–(23), one needs to maximize

(67)

subject to constraints (19)–(23).


Fig. 4. Illustration of how dummy nodes are added.

A third strategy to control the power consumption in the retimed design due to registers concerns the way supply voltage scaling is applied on the retimed design. One may need to give preference to first scaling down the supply voltage of computational elements that have registers on their outputs; this might also reduce the dynamic power due to these registers, if we assume that any computational element and its fanout registers have the same supply voltage.

IX. EXPERIMENTAL RESULTS

The objective in this section is to assess the quality, in terms of dynamic power consumption, of the designs produced by the proposed approach. Circuits of small size are used (1) to present the impact of excluding retiming from the process of reducing dynamic power consumption in sequential designs by supply voltage scaling, (2) to compare the quality of the designs produced by the proposed approach versus those produced by the exact method proposed in [1], and (3) to test whether or not the proposed approach can reduce dynamic power consumption without increasing the number of registers. These circuits are at the system level, where computational elements are for instance IP blocks such as adders and multipliers. To also assess the execution times required by the algorithms of the proposed approach, we use large size circuits from the ISCAS’89 benchmark suite [14]. As supply voltages, we use the first supply voltages from the set {5 V, 4.5 V, 4 V, 3.5 V, 3 V, 2.5 V, 2 V, 1.5 V}, where

; recall that the problem of determining the right number of supply voltages and their values for each computational element is not the problem we address in this paper. The difference between two successive supply voltages is fixed to 0.5 V, since we also want to test the effectiveness of the proposed approach in producing designs without level converters based on [9]. The approach in [9] assumes that a level-converter between computational elements and can be omitted if Inequality (68) is satisfied

(68)

where is a given value, which is fixed arbitrarily here to 0.5 V. We assume that supply voltages are greater than , where is the threshold voltage. We use 0.7 V, which is a typical value used in the literature. To determine the delay

and the power consumed , for a given computational element and supply voltage , we proceed as follows. First, we use the expression

described in [8], where is assumed to be known. Second, ’s are determined assuming that the fanout of the computational element has the same loading capacitance. From (2), we then have

For each circuit, the clock period is fixed to the minimal value determined by applying a retiming for clock period minimization on the circuit operating at the highest supply voltages (i.e., 5 V). Larger reductions of dynamic power consumption should be obtained if the clock period is fixed to a value greater than this one.
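The two expressions used to derive delays and power at the lower supply voltages are not legible in this copy. As a stand-in, the Python sketch below uses two widely used first-order models, which are assumptions here and not necessarily the expressions of [8] and (2): delay proportional to V / (V − V_t)², and dynamic power proportional to V² at fixed load capacitance and clock frequency. The reference voltage of 5 V and the threshold of 0.7 V follow the text; the function names and sample values are hypothetical.

```python
def delay_at(voltage, delay_ref, v_ref=5.0, v_t=0.7):
    # Assumed first-order model: delay proportional to V / (V - V_t)^2,
    # scaled so that delay_at(v_ref, d) == d.
    def shape(v):
        return v / (v - v_t) ** 2
    return delay_ref * shape(voltage) / shape(v_ref)

def power_at(voltage, power_ref, v_ref=5.0):
    # Assumed model: dynamic power proportional to V^2 when the load
    # capacitance, clock frequency, and switching activity are unchanged.
    return power_ref * (voltage / v_ref) ** 2

supply_voltages = [5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
for v in supply_voltages:
    print(v, round(delay_at(v, 1.0), 3), round(power_at(v, 10.0), 3))
```

Under these assumed models, halving the supply voltage from 5 V to 2.5 V cuts dynamic power to a quarter while the delay grows by a factor of roughly 2.9, which is the trade-off that voltage scaling off the critical paths exploits.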

We implement the algorithm No-Way. For Steps 1 and 3 of this algorithm, we use respectively the ILP (18)–(23), and the LP derived from the MILP (28)–(33) as discussed in Section VI. Once the LP is solved, the obtained solution is then converted to a realizable design as discussed in Section VI (i.e., if for example we found a supply voltage of value say 4.75 V, we changed it to 5 V instead of 4.5 V, since a slower component can always be replaced by a faster one while satisfying timing constraints). The dynamic power consumption of the design is then computed.

All experiments were done using an UltraSparc 10 with 1 GB of RAM. We use [4] to solve the ILPs and LPs. Recall that, based on Theorem 4, these ILPs are solvable in polynomial time. For each circuit, the ILPs and LPs are automatically generated by a module we coded in C++.

The numerical results obtained are presented in Figs. 5–17. Figs. 5–8 summarize relative dynamic power savings using: (1) supply voltage scaling without retiming, (2) the proposed approach using the ILP (18)–(23) for computing the guided retiming, (3) the exact approach in [1], and (4) the proposed approach using the ILP (18)–(23) and constraint (66) for computing the guided retiming, respectively, using the small circuits as described above. For the numerical results on these four plots, (68) is not used in the constraints of the LP. Also, results when 8 supply voltages are used are omitted, since they are the same as when 7 supply voltages are used. As can be observed from Fig. 5, relative dynamic power saving factors are zero except for two circuits. For these two circuits, relative dynamic power savings are always less than or equal to those obtained by the proposed approach or by the exact approach in [1]. Consequently, retiming is useful for reducing dynamic power consumption in sequential designs by supply voltage scaling. As one can observe from Fig. 6, relative dynamic power saving factors are from 2.78% to 36.22%, and are sometimes the same as those obtained in Fig. 7 by the exact approach in [1]. This indicates that the proposed approach is sometimes able to solve the MDPC problem optimally. From Fig. 8, we deduce that the proposed approach was able to produce designs with


Fig. 5. Relative dynamic power saving in case of supply voltage scaling without retiming.

Fig. 6. Heuristic approach [in this paper: the guided retiming is computed using ILP (18)–(23)]: without level-converter constraints.

reduced dynamic power consumption without increasing the number of registers after applying the guided retiming.

Figs. 9 and 10 report relative dynamic power savings using the proposed approach, and the exact approach in [1], respectively, when (68) is incorporated into the constraints of the LP for the proposed approach, as well as into the constraints of the MILP for the exact approach. We use the same circuits as those used to report numerical results in Figs. 6 and 7. By comparing Figs. 9 and 10, one can deduce that the performance of the proposed approach is still close to the exact one in [1].

Since the MDPC problem is NP-hard in general, the exact approach (i.e., MILP) in [1] cannot be used to produce designs with reduced dynamic power consumption in reasonable run time for circuits of large size such as circuits from the ISCAS’89 benchmark suite. The run time for solving this MILP is less than 25 min [1] when small circuits are used to get the numerical results summarized by Figs. 7 and 10. However, it was impossible to solve it when circuits from the ISCAS’89 benchmark suite are used, even after 1 day of nonstop computation. In contrast to this MILP, the proposed approach has been exercised on circuits from this benchmark suite, and proved able to produce designs with reduced dynamic power consumption in reasonable run time. Indeed, when (68) is not used in the constraints of the LP, relative dynamic power savings as high as 61% have been obtained (see Fig. 11) in a run time of around 1 h (we will return to discuss the run time in the sequel). When this inequality is incorporated in the constraints of the LP, relative dynamic power savings as high as 40% are obtained (see Fig. 15).

We counted the level-converters required for the produced designs once the supply voltages are assigned. When (68) is not used in the constraints of the LP, the number of required level-


Fig. 7. Optimal approach [1]: without level-converter constraints.

Fig. 8. Heuristic approach [in this paper: the guided retiming is computed using ILP (18)–(23) and adding constraint (66)]: 1. Without level-converter constraints, and 2. Without increasing the number of registers.

converters varies from 0 to 182 (see Fig. 14). But when this inequality is incorporated in the constraints of the LP, this number became 0 except for circuits S1238 and S1488 (see Fig. 17). For these two circuits, 2 level-converters were required in the produced design when 7 and 8 supply voltages are used, in order to save an additional very small portion of dynamic power compared to the case when 2–6 supply voltages are used. This indicates that the proposed approach can produce designs with reduced dynamic power consumption and also without level-converters.
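Counting the required level-converters for a given voltage assignment is a simple pass over the arcs. Inequality (68) is not legible in this copy, so the Python sketch below assumes one plausible reading of it: a converter is needed on an arc from u to v only when the receiver's supply voltage exceeds the driver's by more than the given margin (0.5 V in these experiments). The example netlist, the voltages, and the function name are hypothetical.

```python
def count_level_converters(edges, voltage, margin=0.5, tol=1e-9):
    # Assumed rule: an arc u -> v needs a converter when the receiver v runs
    # at a supply voltage more than `margin` above that of the driver u.
    return sum(1 for u, v in edges if voltage[v] - voltage[u] > margin + tol)

# Hypothetical assignment on a four-element design.
voltage = {"a": 5.0, "b": 4.5, "c": 3.0, "d": 5.0}
edges = [("a", "b"), ("b", "d"), ("c", "d"), ("a", "c")]
print(count_level_converters(edges, voltage))
```

With this assignment only the arc from c (3 V) to d (5 V) needs a converter; the arc from b (4.5 V) to d stays within the 0.5 V margin. Because successive supply voltages differ by exactly 0.5 V, neighbouring voltage levels never require a converter under this rule.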

Dynamic power consumed by level-converters may be ignored if, for instance, computational elements are adders and/or multipliers as done in [12]. When computational elements consume power close to the power consumed by a level-converter, the effectiveness of the proposed approach without using (68) will depend on the ratio of the total power saved to the total power consumed by level-converters. Currently, we do not have benchmarks to experimentally measure this ratio.


Fig. 9. Heuristic approach [in this paper: the guided retiming is computed using ILP (18)–(23)]: with level-converter constraints.

Fig. 10. Optimal approach [1]: with level-converter constraints.

Fig. 11. Heuristic approach [in this paper: the guided retiming is computed using ILP (18)–(23)]: without level-converter constraints.


Fig. 12. Run time for solving the ILP (18)–(23): without level-converter constraints.

Fig. 13. Run time for solving the LP for supply voltage scaling: without level-converter constraints.

Fig. 14. Number of level converters required in the design: without level-converter constraints.

Fig. 15. Heuristic approach [this paper: The guided retiming computed using ILP (18)–(23)]: with level-converter constraints.

The run time required by the algorithm No-Way is mainly determined by the run time for Step 1 and the run time for Step 3. Consequently, we measured only the latter run times. Figs. 12, 13, and 16 summarize the run times for solving, respectively, the ILP, the LP without (68), and the LP with (68) in its constraints. Recall that large size circuits from the ISCAS’89 benchmark suite are used to this end. As expected, experimental results have shown that the run time of the ILP is the dominant component in the total run time of the algorithm.

X. CONCLUSIONS

In this paper, we have addressed the problem of minimizing dynamic power consumption in synchronous sequential


Fig. 16. Run time for solving the LP for supply voltage scaling: with level-converter constraints.

Fig. 17. Number of level converters required in the design: with level-converter constraints.

designs under timing constraints using retiming and supplyvoltage scaling. Minimizing dynamic power consumption insynchronous sequential digital designs by only changing thesupply voltages of computational element off critical paths isnot enough. We have to also apply retiming to shift some com-putational elements from the critical paths. Unifying retimingand supply voltage scaling to address that problem cannotalways be done in polynomial run time since the problem isNP-hard in general even for the case of combinational designs.

We have proposed a heuristic approach to determine approximate solutions to the problem. Instead of unifying basic retiming and supply voltage scaling, we have proposed to first determine a guided retiming and then to apply supply voltage scaling to the retimed design. We have proposed polynomial-time algorithms to realize this guided retiming as well as the supply voltage scaling. We have also shown that, for combinational designs, the problem is not NP-hard for some combinational circuits with certain structure, and we have given a polynomial-time algorithm to solve it optimally. In addition, we have provided polynomial-time algorithms to determine lower bounds on the reduction of dynamic power consumption in combinational designs that one can obtain if the problem is solved optimally.
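The second stage of this flow, lowering the supply voltage of elements that have timing slack, can be illustrated with a minimal sketch. It is not the LP formulation used in this paper: it swaps in a simple greedy heuristic, assumes only two supply voltages, ignores level converters, and takes as input the combinational part of an already retimed design. The function names, the node names, and the delay values are all hypothetical.

```python
from collections import defaultdict


def arrival_times(order, preds, delay):
    """Longest-path arrival time at the output of each node of a DAG."""
    arrival = {}
    for v in order:  # 'order' must be a topological order
        arrival[v] = delay[v] + max((arrival[u] for u in preds[v]), default=0.0)
    return arrival


def greedy_voltage_scaling(order, edges, d_high, d_low, clock_period):
    """Return the set of nodes moved to the low supply voltage.

    Assumption: the design meets clock_period when every node runs at the
    high voltage. A node is moved to the low voltage only if the longest
    path still fits within clock_period after the move.
    """
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    delay = dict(d_high)
    low = set()
    for v in order:
        trial = dict(delay)
        trial[v] = d_low[v]  # slower delay at the lower voltage
        if max(arrival_times(order, preds, trial).values()) <= clock_period:
            delay = trial
            low.add(v)
    return low


# Hypothetical four-element circuit: a -> c, b -> c, c -> d.
order = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "d")]
d_high = {"a": 2.0, "b": 1.0, "c": 2.0, "d": 1.0}
d_low = {"a": 3.0, "b": 2.0, "c": 3.0, "d": 2.0}

# The critical path a -> c -> d takes 5.0 time units at the high voltage,
# so only the off-critical element b can be slowed down.
low_nodes = greedy_voltage_scaling(order, edges, d_high, d_low, clock_period=5.0)
print(sorted(low_nodes))
```

With a clock period equal to the critical-path delay, only the element off the critical path is scaled down. This is why retiming, which changes which elements lie on critical paths, is applied first.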

The proposed approach proved effective in producing designs with reduced dynamic power consumption, including designs without level converters. Although the current implementation has not yet been optimized for run time, experimental results have shown that around 1 h was enough for the proposed algorithms to determine approximate solutions to the problem for large designs such as circuits from the ISCAS’89 benchmark suite.

ACKNOWLEDGMENT

The authors would like to thank the three anonymous reviewers for their valuable comments, from which this paper has benefited.

REFERENCES

[1] N. Chabini, I. Chabini, E.-M. Aboulhamid, and Y. Savaria, “Unification of basic retiming and supply voltage scaling to minimize dynamic power consumption for synchronous digital designs,” in Proc. Great Lakes Symp. VLSI, Washington, DC, Apr. 2003.

[2] J.-M. Chang and M. Pedram, “Energy minimization using multiple supply voltages,” IEEE Trans. VLSI Syst., vol. 5, pp. 1–8, 1997.

[3] C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,” Algorithmica, pp. 5–35, Jan. 1991.

[4] The LP_Solve Tool. [Online]. Available: ftp://ftp.ics.ele.tue.nl/pub/lp_solve/

[5] K. N. Lalgudi and M. Papaefthymiou, “Fixed-phase retiming for low power,” in Proc. Int. Symp. Low-Power Electronics and Design, 1996, pp. 259–264.

[6] J. Monteiro, S. Devadas, and A. Ghosh, “Retiming sequential circuits for low power,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 1993, pp. 398–402.

[7] A. Raghunathan, N. K. Jha, and S. Dey, High-Level Power Analysis and Optimization. Norwell, MA: Kluwer, 1997.

[8] K. Usami and M. Horowitz, “Clustered voltage scaling technique for low-power design,” in Proc. Int. Workshop Low-Power Design, 1995, pp. 3–8.

[9] Y.-J. Yeh, S.-Y. Kuo, and J.-Y. Jou, “Converter-free multiple-voltage scaling techniques for low-power CMOS digital design,” IEEE Trans. Computer-Aided Design, vol. 20, pp. 172–176, Jan. 2001.

[10] I. Mutsunorial et al., “A low-power design method using multiple supply voltages,” in Proc. Int. Symp. Low-Power Electronics and Design, Aug. 1997, pp. 36–41.

[11] T. Pering, T. D. Burd, and R. W. Brodersen, “Voltage scheduling in the lpARM microprocessor system,” in Int. Symp. Low-Power Electronics and Design, 2000, pp. 96–101.


[12] S. Raje and M. Sarrafzadeh, “Scheduling with multiple voltages,” J. VLSI Integration, vol. 23, pp. 37–59, 1997.

[13] K. Roy, W. Liqiong, and C. Zhanping, “Multiple-Vdd multiple-Vth CMOS (MVCMOS) for low power applications,” in Proc. IEEE Int. Symp. Circuits and Systems, vol. 1, 1999, pp. 366–370.

[14] ISCAS’89 Benchmark Suite. [Online]. Available: http://www.cbl.ncsu.edu/benchmarks

[15] A. Schrijver, Theory of Linear and Integer Programming. New York: Wiley, 1986.

[16] L.-G. Khachian, “A polynomial algorithm in linear programming,” Soviet Math Doklady, vol. 20, pp. 191–194, 1979.

[17] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” Combinatorica, vol. 4, pp. 373–395, 1984.

[18] P. Camion, “Characterization of totally unimodular matrices,” Proc. Amer. Math. Soc., vol. 16, pp. 1068–1073, 1965.

[19] C. Berge, Théorie des Graphes et ses Applications. Paris, France: Dunod, 1958.

Noureddine Chabini received the B.Sc. degree in computer science and automatic systems from Caddi Ayyad University, Marrakech, Morocco, in 1995, and the M.Sc. and Ph.D. degrees in computer science from the University of Montreal, Montreal, QC, Canada, in 1998 and 2001, respectively.

He is currently an Assistant Professor in the Department of Electrical and Computer Engineering at the Royal Military College of Canada, Kingston, ON, Canada. He also holds an invited research-collaborator position at Princeton University, Princeton, NJ. Before joining the Royal Military College of Canada, he was an Invited Researcher at Princeton University. Before joining Princeton University in 2002, he was an Invited Researcher at the École Polytechnique de Montréal, Canada. His scientific research interests include design for low-power consumption, design for high performance, embedded systems design, system-on-chip design, hardware/software codesign, synthesis, compiler optimizations, CAD for digital systems design, and application of combinatorial optimization techniques to digital systems design.

Wayne Wolf (F’98) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1980, 1981, and 1984, respectively.

Currently, he is Professor of Electrical Engineering and Associated Faculty in the Department of Computer Science, Princeton University, Princeton, NJ. From 1984 to 1989, he was with AT&T Bell Laboratories, Murray Hill, NJ. His research interests include embedded computing, multimedia systems, VLSI, and computer-aided design. He is the author of Computers as Components: Principles of Embedded Computer System Design (San Mateo, CA: Morgan Kaufmann, 2000) and Modern VLSI Design, 3rd edition (Englewood Cliffs, NJ: Prentice-Hall, 2002).

Dr. Wolf is a Fellow of ACM, and received the ASEE/EED and HP Frederick E. Terman Award in 2003. He is an IEEE Computer Society Golden Core Member and a Member of Phi Beta Kappa and Tau Beta Pi.