

Research Article
Modelling and Automated Implementation of Optimal Power Saving Strategies in Coarse-Grained Reconfigurable Architectures

Francesca Palumbo,1 Tiziana Fanni,2 Carlo Sau,2 Paolo Meloni,2 and Luigi Raffo2

1 POLCOMING, Information Engineering Unit, University of Sassari, Sassari, Italy
2 Department of Electrical and Electronic Engineering (DIEE), University of Cagliari, Cagliari, Italy

Correspondence should be addressed to Tiziana Fanni; tiziana.fanni@diee.unica.it

Received 18 April 2016; Accepted 14 September 2016

Academic Editor: Wen B. Jone

Copyright © 2016 Francesca Palumbo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper focuses on how to efficiently reduce power consumption in coarse-grained reconfigurable designs, to allow their effective adoption in heterogeneous architectures supporting and accelerating complex and highly variable multifunctional applications. We propose a design flow for this kind of architecture that, besides their automatic customization, is also capable of determining their optimal power management support. Power and clock gating implementation costs are estimated in advance, before their physical implementation, on the basis of the functional, technological, and architectural parameters of the baseline design. Experimental results, on 90 and 45 nm CMOS technologies, demonstrate that the proposed approach guides the designer towards optimal implementation.

1. Introduction

Electronic devices on the market rely on the execution of computation-intensive applications on complex heterogeneous systems. Coarse-grained reconfigurable (CGR) platforms combine the high performance levels provided by Application Specific Integrated Circuit (ASIC) designs with an increased flexibility, allowing the execution of a larger set of applications over the same substrate [1, 2]. However, in the dark silicon era, due to the limited available power budget, a gap exists between the number of transistors that can be placed within a die and the number that can actually be active during execution [3, 4]. Therefore, systems are also required to be energy efficient, and CGR designs must integrate specific power management techniques for the functional logic regions constituting them.

Several effective techniques for power monitoring [5] and reduction [6] have been presented in the state of the art. Among them, voltage/frequency scaling [7, 8] and power shut-off schemes [9, 10] can be extremely beneficial. However, their integration requires manual intervention by the designer, resulting in a complex, error-prone, and time-consuming process. While commercial synthesizers [11] allow the automatic implementation of low-overhead saving strategies at the gate level, such as fine-grained clock gating, they only provide implementation-level instruments to apply more complex strategies, like power gating. In the CGR systems field, the Multi-Dataflow Composer (MDC) tool, combining the dataflow-based system specification approach with the coarse-grained reconfigurable design paradigm, is capable of automatically generating run-time reconfigurable multifunctional systems, featuring flexibility and area minimization [12]. MDC was originally meant to address reconfigurable codec implementations and was conceived to be exploited within MPEG Reconfigurable Video Coding (MPEG-RVC) studies. However, it has also been successfully adopted in different resource- and power-constrained scenarios [13, 14], where only microprogrammed solutions had been used so far, exploiting either single-core digital signal processors [15] or custom multicore embedded processors [16]. The MDC design suite is composed of different extensions. The work presented in this paper is related to the MDC power management extension, which has been previously addressed in [17, 18]. The MDC tool identifies, in the generated CGR system, the minimum set of disjoint, functionally homogeneous logic areas of the system, called logic regions. These are exploited to automatically implement dynamic power management strategies, applying indistinctly to all of them either clock gating [17] or power gating [18].

Hindawi Publishing Corporation, Journal of Electrical and Computer Engineering, Volume 2016, Article ID 4237350, 27 pages, http://dx.doi.org/10.1155/2016/4237350

The work presented in this paper proposes a power modelling methodology and improves the MDC power management extension by integrating this methodology within its automated flow. We introduce an algorithm that analyses the identified logic regions and, on the basis of one single synthesis and a minimal set of simulations (one for each scenario of the multifunctional problem), is capable of optimally characterizing the power management support. Separately for each logic region of the CGR design, this flow assesses both clock and power gating management costs and determines which is the optimal power saving strategy (if any), prior to any physical system implementation. The algorithm is based on detailed static and dynamic power consumption models that take into account functional, architectural, and technological parameters to define the potential overhead and benefits of the considered solutions. As a future perspective, besides its application within the MPEG-RVC scenario, the proposed modelling strategy may also be extended to support other complex autonomous computing systems [19], where the number of involved resources may change at run-time.

The rest of this paper is organized as follows. Section 2 reports the background of this work. Section 3 describes the operation of MDC: in particular, Section 3.1 focuses on the base operation, while Section 3.2 details the proposed power estimation models and their integration within the tool. Section 4 presents the designs under test used to validate the proposed approach: Section 4.1 involves an FFT use case targeting a 90 nm CMOS technology, while Section 4.2 presents the experimental results conducted to assess the enhanced MDC flow on a zoom coprocessor (targeting both 90 nm and 45 nm CMOS technologies). Finally, Section 4.4 details the benefits of the proposed models, before concluding with some final remarks in Section 5.

2. Background

This work presents a model of power consumption for CGR systems. The model is capable of estimating the system power dissipation from the early stages of the design flow, and it has been integrated within an automated flow that decides which power saving technique, between clock gating and power gating, has to be applied to each portion of the design.

This section provides an overview of the state of the art in the field of power-aware optimization and, in more detail, of the main aspects involved in the proposed approach. Section 2.1 introduces the distinctive features of CGR systems and the main architectural trends for this kind of device. Section 2.2 deals with power issues in digital systems, with a particular emphasis on modelling strategies and design automation.

2.1. Coarse-Grained Reconfigurable Systems. Reconfigurable architectures are usually conceived as collections of functional units (FUs) whose functionality and connections can be configured at run time to adapt them to different applications or operating modes. Such systems can be classified according to the granularity of the FUs. Fine-grained approaches, typically exploiting FPGA devices as underlying technology, involve bit-level FUs, resulting in higher flexibility but requiring long configuration times (due to the configuration bitstream size). Coarse-grained reconfigurability, on the other hand, provides word-level FUs, thus offering less flexibility while guaranteeing faster configuration phases. CGR architectures are usually exploited to design flexible ASICs, making them capable of switching among a finite set of functionalities. In such design cases, high efficiency in terms of area occupation of the designed system can be easily obtained, but not all the resources are evenly involved in the computation. Dedicated power management techniques are needed to reduce the power consumption overhead related to resources that are not involved in each operating mode [17]. CGR architectures have already proved suitable for application scenarios that require flexibility along with strong area, power, and execution efficiency [13, 20].

One of the main issues of CGR architectures is their complex mapping and programming [21, 22]. Several works have tried to automate the mapping of applications and computational kernels onto CGR and multicore systems [23–25]. The mapping problem requires specific knowledge of the considered kernels, which usually have to be identified and specified by means of hardware description languages. The mapping effort is directly proportional to the number of involved kernels [26]. Recently, dataflow models have proved to be very useful in this scenario [27, 28]. Dataflows describe programs through a graph whose nodes are processing elements (actors) linked by point-to-point unidirectional channels, managed according to a FIFO protocol. Actors encapsulate their own state and communicate only through atomic packets of data (tokens). Due to their intrinsic modularity, dataflows favour the definition and reuse of hardware and software components. Furthermore, they are natively capable of highlighting the intrinsic parallelism of the specified applications. The Multi-Dataflow Composer (MDC) tool adopted within the presented work relies on dataflows (the RVC-CAL formalism by MPEG is currently supported) to solve the CGR mapping problem. It exploits the characteristics of this kind of model to provide several advanced features (e.g., power management [17, 18] or automatic generation of coprocessing units [29]).
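The dataflow vocabulary used above (actors, FIFO channels, tokens, encapsulated state) can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not reproduce RVC-CAL or any MDC/Orcc API; the `Channel` and `Accumulator` names are hypothetical.

```python
from collections import deque


class Channel:
    """Point-to-point unidirectional FIFO channel carrying tokens."""

    def __init__(self):
        self._fifo = deque()

    def put(self, token):
        self._fifo.append(token)

    def get(self):
        return self._fifo.popleft()

    def __len__(self):
        return len(self._fifo)


class Accumulator:
    """Hypothetical actor: keeps a running sum as private state."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out
        self._total = 0  # state is encapsulated, never shared

    def fire(self):
        """Consume one token, update the state, produce one token."""
        if len(self.inp) == 0:
            return False  # no token available: the actor cannot fire
        self._total += self.inp.get()
        self.out.put(self._total)
        return True


def run_demo():
    """Feed three tokens through the actor and collect its outputs."""
    src_to_acc, acc_to_sink = Channel(), Channel()
    actor = Accumulator(src_to_acc, acc_to_sink)
    for token in (1, 2, 3):
        src_to_acc.put(token)
    while actor.fire():
        pass
    return [acc_to_sink.get() for _ in range(len(acc_to_sink))]
```

The actor only interacts with the outside world through its two channels, which is what makes it easy to reuse or to share among several networks.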

2.2. Power Management. Power consumption in digital devices is composed mainly of two different contributions: dynamic and static. The former is due to capacitance charging/discharging when logic transitions occur (i.e., switching activity). The latter is due to leakage currents and is consumed even when no circuit activity is present. Modern designers need to consider both terms when conceiving smart management strategies. Several techniques (clock gating, multifrequency, operand isolation, multithreshold, multisupply libraries, power gating, etc.) exist and, in some cases, they are automatically implemented by commercial synthesis/place-and-route tools. In custom computing systems, some advanced design tools support the designers in the application-driven customization of the hardware architectures [30, 31]. However, generally speaking, invasive techniques (requiring the insertion of additional logic and target technology support, such as the availability of dedicated cells and processes on the implementation stack) still need tools to be guided with significant manual effort by the designer.
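As a reminder of the two contributions named at the start of this section, the usual first-order textbook approximations for CMOS logic can be written as follows. These are not the models proposed in this paper, and all symbols are assumptions introduced here: $\alpha_{\mathrm{sw}}$ is the switching activity factor (unrelated to the functionality labelled $\alpha$ in the later example), $C_L$ the switched capacitance, $V_{DD}$ the supply voltage, $f_{\mathrm{clk}}$ the clock frequency, and $I_{\mathrm{leak}}$ the total leakage current.

```latex
P_{\mathrm{tot}} = P_{\mathrm{dyn}} + P_{\mathrm{stat}}, \qquad
P_{\mathrm{dyn}} \approx \alpha_{\mathrm{sw}}\, C_{L}\, V_{DD}^{2}\, f_{\mathrm{clk}}, \qquad
P_{\mathrm{stat}} \approx I_{\mathrm{leak}}\, V_{DD}
```

In these terms, stopping the clock of an idle region removes its switching activity and therefore acts on $P_{\mathrm{dyn}}$ only, whereas cutting its supply also removes its leakage term.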

Clock gating is an example of a fairly noninvasive technique. It may reduce the dynamic power consumption due to the clock tree and to sequential logic by up to 40% [32]. It consists in shutting off the clock of the unused synchronous logic by means of simple AND gates. Clock gating has been deeply automated and is available in most commercial synthesizers. In the MPEG-RVC community, recent studies [33] presented an extension of a High-Level Synthesis tool, Xronos, that selectively switches off the clock signal for parts of the circuit that are idle due to stalls in the pipeline, in order to reduce power consumption. Moreover, as mentioned, the MDC tool has the capability of identifying independent circuitry regions by means of a graph-based analysis of the input dataflow specifications. These logic regions can be clock gated to dynamically adapt power consumption when switching between different functionalities [17, 18]. From the technical point of view, in ASIC designs AND gates can be used directly on the clock to disable it, while in FPGA designs the clock network cannot be modified by the insertion of any custom logic and dedicated cells are required (Xilinx boards, e.g., are equipped with dedicated blocks (BUFGs) whose outputs can drive distinct regions of logic, powering down different design portions (when enabled)). Clock gating can be applied at different granularities: fine-grained approaches act on single registers, whereas coarse-grained ones refer, as in [17, 18, 33], to a set of resources. Commercial synthesizers can normally automate only fine-grained strategies.

Power gating is quite invasive. The main idea behind it is as follows: if a specific portion of the design is not used in a given computation mode, then it can be completely switched off by means of a sleep transistor. This technique, like clock gating, is applicable at different granularities: fine-grained approaches require driving a different sleep transistor for every cell in the system, while coarse-grained ones again operate on a set of resources, instantiating one sleep transistor to drive different cells connected to a shared power network. MDC, as discussed in [18], also supports automatic power gating for CGR architectures. Each identified logic region in the CGR system is implemented (regardless of its nature or characteristics) as a different power domain (PD) that, in order to be managed, requires the insertion and driving of the following resources:

(i) The sleep transistor between the gated region and the main power supply, to switch on/off the derived power supply.

(ii) The isolation logic between the gated region and normally-on cells, to avoid the transmission of spurious signals in input to the normally-on cells.

(iii) The state retention logic, to maintain, where needed, the internal state of the gated region.

MDC, besides defining the power gated design netlist, also provides the automatic definition of the power format file. This file specifies the shut-off, isolation, and state retention rules (with 10 PDs, 3 × 10 = 30 interfaces have to be defined), along with their respective enable signals ([1 × shut-off + 1 × isol + 2 × reten] × 10 = 40 signals) and a dedicated Power Controller to properly drive them. The power format file automatically created by MDC is compliant with the Silicon Integration Initiative's Common Power Format (CPF), whose definition is driven mainly by developers using Cadence [34] tools.
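The arithmetic in the parentheses above generalizes to any number of power domains. The following Python sketch only restates those per-domain counts; the function and constant names are hypothetical and do not correspond to any MDC or CPF API.

```python
# Per power domain, as stated in the text: three rule types in the power
# format file and four enable signals (1 shut-off, 1 isolation, 2 retention).
RULES_PER_PD = {"shut_off": 1, "isolation": 1, "state_retention": 1}
ENABLES_PER_PD = {"shut_off": 1, "isolation": 1, "state_retention": 2}


def cpf_interface_counts(num_power_domains):
    """Return (rules, enable_signals) needed for the given number of PDs."""
    rules = sum(RULES_PER_PD.values()) * num_power_domains
    enables = sum(ENABLES_PER_PD.values()) * num_power_domains
    return rules, enables
```

For 10 power domains, `cpf_interface_counts(10)` returns `(30, 40)`, matching the figures quoted in the text.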

2.2.1. Modelling. To the best of our knowledge, the literature does not treat the problem of modelling power gating and clock gating costs in CGR designs. Some approaches only partially address the issue. For example, [35] focusses on low-power techniques and power modelling for FPGAs. In [36], only clock gating is taken into account: different power states (on the basis of the clock enable signals) are defined, and their consumption is characterized by low-level Power Analysis results. Reference [37] focusses on estimating the leakage reduction for power gating and reverse body bias.

The CASPER simulator for shared memory many-core processors [38] includes precharacterized libraries containing power dissipation models of different hardware components, enabling accurate power estimation at a high-level exploration stage. In particular, the authors implement Chip-wide Dynamic Voltage Frequency Scaling and Performance Aware Core-Specific Frequency Scaling. The FALPEM framework [39] provides power estimations at the pre-register transfer level (RTL) stage, specifically targeting the power consumed by the clock network and interconnect, but power and clock gating costs are not defined. Other approaches perform an estimation that considers different components. Li et al. [40] propose an architecture-level integrated power, area, and timing modelling framework for multicore systems, which evaluates system building blocks (CPU, buses, etc.) for different technology nodes, also providing power gating support. Finally, the work presented in [41] focuses on on-chip networks.

3. Design Suite for Coarse-Grained Power-Aware Systems

This section discusses the proposed technique for modelling the power consumption of a CGR system when clock gating or power gating is applied. These models, combined in a selection algorithm, can be exploited to develop an automated design flow for power-efficient CGR systems, where the optimal saving strategy is selected for each identified working set of resources.
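The shape of such a per-region selection can be sketched as follows, under strong assumptions. The power models themselves are introduced in Section 3.2 and are not reproduced here: the sketch simply takes one power estimate per strategy as given and keeps the cheapest. The function name, the region names, and all numbers are hypothetical placeholders in arbitrary units, not results from the paper.

```python
def select_strategy(estimates):
    """Pick, for each logic region, the strategy with the lowest estimate.

    `estimates` maps a region name to a dict {strategy: estimated power}.
    The key "none" stands for leaving the region unmanaged.
    """
    return {
        region: min(costs, key=costs.get)
        for region, costs in estimates.items()
    }


# Placeholder numbers in arbitrary units, invented for illustration only.
EXAMPLE_ESTIMATES = {
    "LR_a": {"none": 10.0, "clock_gating": 7.0, "power_gating": 8.5},
    "LR_b": {"none": 10.0, "clock_gating": 9.0, "power_gating": 4.0},
    "LR_c": {"none": 1.0, "clock_gating": 1.2, "power_gating": 1.5},
}
```

With these placeholder inputs, each region ends up with a different choice, including the case in which the overhead of either technique outweighs its benefit and no strategy is applied.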

In this work, we have embedded these models and the algorithm in the Multi-Dataflow Composer (MDC) tool, a framework capable of CGR systems characterization. MDC provides a comprehensive design suite automating several tasks of the synthesis and development of CGR systems, within design flows targeting both FPGA [29] and ASIC. The tool provides extensive support for dynamic power management [17, 18], addressing power-constrained design scenarios and completely automating the implementation and control of clock gating and power gating strategies in the final CGR platform. Nevertheless, such techniques are not currently addressed in a hybrid manner: the users must choose the approach to be used in the design without an a priori automated analysis process. Such an unsupported selection may easily lead to suboptimal implementations on the final platform.

In the following, Section 3.1 provides an overview of the MDC baseline functionality and of the current power management support. Section 3.2 discusses the proposed power models and the automated selection algorithm identifying the optimal power management strategy in CGR systems. In both sections, step-by-step examples are presented to clarify the methodology.

3.1. The Multi-Dataflow Composer Tool. MDC automates the generation and management of CGR systems, facing the complex mapping of multiple applications onto a single reconfigurable architecture. It automates the mapping process and guarantees the minimization of hardware resources, allowing for significant area/energy savings [12, 42]. In the literature, this problem is known as datapath merging, and it deals with the combination of a set of input datapaths, described by means of graphs, onto a single reconfigurable datapath. It aims at maximally sharing (among the different input graphs) both processing nodes and connections.

MDC is naturally compliant with the RVC-CAL formalism and natively supports Dataflow Process Network (DPN) models as input. Currently, it is interfaced with the Open RVC-CAL Compiler, Orcc [43]. The Orcc front-end is responsible for parsing, one by one, the high-level DPN specifications of the different datapaths that MDC will merge within the CGR system. Please note that MDC can be interfaced with other graph parsers, so that it can easily be adapted to any other dataflow-based modelling environment.

As depicted in Figure 1, the MDC baseline flow involves three main phases:

(1) The input DPNs parsing, performed by the Orcc front-end, translates the RVC-CAL specifications into Java Intermediate Representations (IRs), which are basically directed graphs.

(2) The datapath merging, performed by the MDC front-end, combines the IRs into a reconfigurable IR, inserting (where necessary) special switching actors (responsible for properly distributing the token flow among the different merged DPNs) and keeping track of the system programmability through a dedicated Configuration Table (TAB in Figure 1).

(3) The hardware platform generation, performed by the MDC back-end, leads to the creation of the RTL that describes the CGR system itself, where each actor of the reconfigurable IR is mapped onto a different hardware FU. In this phase, the hardware communication protocol and the HDL (Hardware Description Language) components library (providing the RTL descriptions of the required FUs, manually or automatically generated) are provided as input to the tool.

At the hardware level, reconfiguration takes place in a single clock cycle. It is achieved through low-overhead switching elements (SBoxes) that allow the sharing of common resources among different input DPNs. SBoxes are simple combinatorial multiplexers and demultiplexers whose configuration is stored in dedicated Look-Up Tables that, according to the Configuration Table, compute the selectors necessary for correct data forwarding, in order to implement the requested functionality.
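A behavioural sketch of this mechanism, in Python and under stated assumptions, may help. The table contents, the network identifiers `net0`/`net1`, the SBox name `SB_in`, and the function names are all hypothetical; the real SBoxes are combinatorial hardware blocks, and the actual Configuration Table format is not described in this section.

```python
# Hypothetical configuration table: network ID -> selector of each SBox.
CONFIG_TABLE = {
    "net0": {"SB_in": 0},
    "net1": {"SB_in": 1},
}


def sbox_mux(inputs, selector):
    """2-to-1 SBox: forward the token on the selected input."""
    return inputs[selector]


def sbox_demux(token, selector, num_outputs=2):
    """1-to-2 SBox: forward the token to the selected output only."""
    return [token if i == selector else None for i in range(num_outputs)]


def forward(network_id, token_a, token_b):
    """Look up the selector for the active network and route one token."""
    selector = CONFIG_TABLE[network_id]["SB_in"]
    return sbox_mux((token_a, token_b), selector)
```

Switching functionality then amounts to changing the network identifier used for the lookup, which is consistent with reconfiguration being a single-cycle operation: no datapath is rebuilt, only selectors change.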

3.1.1. Automated Power Management. When dealing with reconfigurable architectures, and in particular with CGR systems, the power consumption has to be carefully taken into consideration. This kind of system is affected by resource redundancy, mainly due to the FUs that are not shared among different functionalities. Thus, when a certain functionality is executed, the part of the design not involved in the computation is in an idle state and can uselessly consume precious power. Fortunately, the unused resources for each implemented functionality depend on the input specifications and are therefore fixed at design-time.

Given these considerations, a CGR system can be characterized by a set of disjoint logic regions (LRs), each grouping the resources that are always active/inactive at the same time. The MDC power management extension is capable of automatically identifying LRs. It performs the LRs identification at a high level of abstraction, on the reconfigurable IR, by exploiting the intrinsic modularity of the dataflow graphs. Once the LRs have been identified, the MDC dynamic power manager automatically applies, according to the user selection, either clock gating [17] or power gating [18] on the resulting CGR hardware platform. The identification of the minimal number of LRs is guaranteed, so as to minimize the power overhead of the extra logic needed to implement the selected power saving strategies. An overview of the power management extension is provided in Figure 1.

For each input DPN $G_i$, the currently available algorithm determines the set $V'_i$, which contains all the resources of the reconfigurable IR activated by $G_i$. These are the original sets of LRs, which represent the starting point for the algorithm to find the final LRs by iteratively comparing two $V'_i$ sets at a time and determining their possible overlap. If an overlap is found, its resources are removed from the two considered sets and a new $V'_j$ (corresponding to a new LR) involving these shared resources is issued. $V'_j$ groups resources that are shared among different input DPNs, while the remaining resources in the two $V'_i$ will uniquely belong to the originally considered DPNs. This compare and split identification process guarantees that the number of LRs found by the MDC dynamic power manager is the minimum achievable one.
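The compare and split process described above can be sketched in Python as follows. This is a minimal reading of the paragraph, not the MDC implementation: the function name and the data layout (a dict from DPN name to the set of resources it activates, and regions as pairs of resources and activating DPNs) are assumptions made for the example.

```python
from itertools import combinations


def compare_and_split(activated):
    """Split per-DPN resource sets into disjoint logic regions.

    `activated` maps each DPN name to the set of resources it activates.
    Returns a list of (resources, activating_dpns) pairs.
    """
    regions = [(set(res), {dpn}) for dpn, res in activated.items()]
    changed = True
    while changed:
        changed = False
        for (i, (res_a, dpns_a)), (j, (res_b, dpns_b)) in combinations(
            enumerate(regions), 2
        ):
            shared = res_a & res_b
            if not shared:
                continue
            # Remove the overlap from both sets and issue a new region
            # that is activated by the DPNs of both original sets.
            candidates = [
                (res_a - shared, dpns_a),
                (res_b - shared, dpns_b),
                (shared, dpns_a | dpns_b),
            ]
            rest = [r for k, r in enumerate(regions) if k not in (i, j)]
            regions = rest + [c for c in candidates if c[0]]
            changed = True
            break
    return regions
```

Each split moves the shared resources out of two sets and into one, so the loop stops once no two regions overlap. Keeping the activating DPNs next to each region is a convenience of this sketch: it is the information later needed to decide when a region can be switched off.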

Figure 1: MDC design suite: baseline flow (a) and corresponding power extension (b). The baseline flow (a) comprises DPNs parsing, datapath merging, and hardware platform generation; the power management extension (b) comprises logic regions identification and power saving application.

If this number is still too high for the considered target platform, as can happen if FPGAs are the target devices (in FPGA devices the number of hardware blocks that can drive the different LRs is limited; e.g., 32 BUFG units are available in Xilinx boards for clock management purposes), an LRs merging process has to be applied. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LRs merging process is applied. Two LRs at a time are unified (details on how to merge different LR sets can be found in [17], where two merging strategies, a power-aware one and a number-aware one, are presented) until the constraint fixed by the user is met. This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra unnecessary power consumption.
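Why merging is suboptimal can be seen from a naive sketch. The selection rule used here (always unify the two regions with the fewest resources) is only a stand-in chosen for brevity; it is not the power-aware or the number-aware strategy of [17], and the function name and data layout are assumptions.

```python
def merge_until(regions, max_regions):
    """Naively unify two regions at a time until the constraint is met.

    `regions` is a list of (resources, activating_dpns) pairs. This sketch
    always merges the two regions with the fewest resources; the
    strategies presented in [17] are not reproduced.
    """
    regions = [(set(res), set(dpns)) for res, dpns in regions]
    while len(regions) > max_regions and len(regions) > 1:
        regions.sort(key=lambda region: len(region[0]))
        (res_a, dpns_a), (res_b, dpns_b) = regions[0], regions[1]
        # The unified region must be on whenever either original was on.
        regions = [(res_a | res_b, dpns_a | dpns_b)] + regions[2:]
    return regions
```

After a merge, the unified region is activated by the union of the original activating DPNs, so a DPN that needed only one of the two original regions now also keeps the resources of the other one switched on.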

The MDC power management extension, during the HDL generation phase, also provides the implementation of the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee of the effectiveness of the approach. Clock gating acts only on the dynamic contribution to the power consumption and requires a minimum logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR, plus one unique Enable Generator to properly set the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each Flip-Flop whose state has to be kept also when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A different clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol, for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore, one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switch, state retention, and isolation cells.

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.
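The overhead items listed above can be tallied with a small Python sketch. It only counts the cells named in the text and says nothing about their area or power cost, which is what the models of Section 3.2 address. The `LogicRegion` fields and the `always_on` flag are assumptions of this sketch; the flag anticipates the remark in the step-by-step example that a region activated by every functionality needs no switching logic.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    retained_flip_flops: int   # flip-flops whose state must survive shut-off
    outgoing_bits: int         # bit-wise wires towards other regions
    always_on: bool = False


def clock_gating_overhead(regions):
    """One AND gate per gated region plus a single Enable Generator."""
    gated = [r for r in regions if not r.always_on]
    return {"and_gates": len(gated), "enable_generators": 1}


def power_gating_overhead(regions):
    """Cells listed in the text for each switchable region."""
    gated = [r for r in regions if not r.always_on]
    return {
        "power_switches": len(gated),
        "retention_cells": sum(r.retained_flip_flops for r in gated),
        "isolation_cells": sum(r.outgoing_bits for r in gated),
        "clock_gating_cells": len(gated),
        "power_controllers": 1,
    }
```

Even at this level of detail the asymmetry is visible: the clock gating overhead grows with the number of regions only, while the power gating overhead also grows with the number of retained flip-flops and boundary wires of each region.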

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled $\alpha$, $\beta$, and $\gamma$, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow starting from these input IRs ($\alpha_{\mathit{java}}$, $\beta_{\mathit{java}}$, and $\gamma_{\mathit{java}}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor $A$ between $\alpha$ and $\gamma$ and actor $C$ among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha = \{A, B, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$

(ii) $V'_\beta = \{D, E, C, \mathrm{SB\_2}\}$

(iii) $V'_\gamma = \{A, F, G, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$

The compare and split identification process produces five different LRs:

(i) $\mathrm{LR1} = \{B\}$

(ii) $\mathrm{LR2} = \{C, \mathrm{SB\_2}\}$

(iii) $\mathrm{LR3} = \{D, E\}$

(iv) $\mathrm{LR4} = \{A, \mathrm{SB\_0}, \mathrm{SB\_1}\}$

(v) $\mathrm{LR5} = \{F, G\}$

LR4 and LR2 involve shared resources being activatedrespectively by 120572 and 120574 and by 120572 120573 and 120574 while the remain-ing LRs involve nonshared resources LR1 is activated only by120572 LR3 only by 120573 and LR5 only by 120574
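The partition above can be reproduced with a few lines of Python. This is only a minimal sketch: it groups actors by the exact set of functionalities that activate them, which yields the same five regions for this example, but it is not claimed to be the compare and split procedure implemented inside MDC. The function name `identify_logic_regions` and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict


def identify_logic_regions(actor_sets):
    """Group actors by the exact set of functionalities that use them.

    actor_sets maps a functionality name to the set of actors it activates
    (the V'_i sets of the example). Hypothetical helper, not the MDC API.
    """
    users = defaultdict(set)
    for functionality, actors in actor_sets.items():
        for actor in actors:
            users[actor].add(functionality)

    regions = defaultdict(set)
    for actor, functionalities in users.items():
        regions[frozenset(functionalities)].add(actor)
    return dict(regions)


V = {
    "alpha": {"A", "B", "C", "SB_0", "SB_1", "SB_2"},
    "beta": {"D", "E", "C", "SB_2"},
    "gamma": {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"},
}

regions = identify_logic_regions(V)
for functionalities, actors in sorted(regions.items(), key=lambda kv: sorted(kv[1])):
    print(sorted(actors), "activated by", sorted(functionalities))
```

A region whose key contains every functionality (here the one holding $C$ and $SB\_2$) is the always on region, which is why LR2 never needs gating logic.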

At this point, the selected power saving strategy is applied during the CGR system HDL generation. Figure 2 shows both the final designs resulting from the application of clock gating and power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. It can be noticed from Figure 2 that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs), where, in this case, SBoxes are also taken into consideration: power gating turns off the whole PD power supply, and it also has an effect on combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is always on (being activated by all the input DPNs).

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a blindly applied unique power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and of (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated on the basis of the LR characteristics before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the higher its static dissipation.

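The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(LR_i) &= P_{\mathrm{lkgON}}(LR_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(LR_i) \\
&= \sum_{\mathrm{actors} \in LR_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right]
\end{aligned}
\tag{1}
$$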

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkgON}}(LR_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(LR_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation on the LR when disabled; therefore, (1) does not present any $P_{\mathrm{lkgOFF}}(LR_i)$ term.

$P_{\mathrm{lkgON}}(LR_i)$ is obtained as the product of the LR activation time $T_{i\mathrm{ON}}$ and the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to their need (according to the implemented functionality) to preserve or not their status, by means of state retention cells, when the region is inactive. It involves, in turn, two terms. The first one refers to the registers whose state can be lost, and it is estimated on the basis of the static consumption of the sequential cells, $P_{\mathrm{lkg}}(\mathrm{reg})$, weighted by the fraction of registers that are not retained, $(\mathrm{reg} - \mathrm{rtn})/\mathrm{reg}$. The second one refers to the retention cells, and it is estimated starting from the number of registers whose state has to be maintained (rtn), multiplied by the leakage of a single state retention cell, $P_{\mathrm{lkg}}(\mathrm{RC})$, whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(LR_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the register values). Note that, unlike $P_{\mathrm{lkgON}}$, for the three abovementioned terms $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{ON}})$), multiplied by the activation time $T_{i\mathrm{ON}}$ and by the overall number of cells within the LR (e.g., iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}})$), multiplied by the inactive time $T_{i\mathrm{OFF}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR; an a priori characterization phase is nevertheless required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.
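The structure of (1) can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `power_gating_model` and its argument layout are hypothetical, $T_{i\mathrm{OFF}}$ is taken as $1 - T_{i\mathrm{ON}}$, and the sample values are the LR5 leakage figures of Tables 2 and 3, with the library values read with two decimals (17.15, 4.27, and so on).

```python
def power_gating_model(actors, t_on, n_iso, p_rc, p_iso, p_contr, p_cg):
    """Evaluate the structure of the power gating model for one logic region.

    actors  -- list of (p_cmb, p_reg, n_reg, n_rtn) tuples, one per actor
    t_on    -- fraction of time the region is active (T_iON)
    n_iso   -- number of isolation cells of the region
    p_rc    -- power of a single state retention cell
    p_iso, p_contr, p_cg -- (on, off) power pairs of the overhead cells
    """
    t_off = 1.0 - t_on  # assumption: T_iOFF is the complement of T_iON

    def on_off(pair):
        # Weight the on-state and off-state power by the respective times.
        return pair[0] * t_on + pair[1] * t_off

    active = 0.0
    for p_cmb, p_reg, n_reg, n_rtn in actors:
        # Registers that are not retained keep their ordinary leakage,
        # scaled by the non-retained fraction (guard for register-free actors).
        not_retained = p_reg * (n_reg - n_rtn) / n_reg if n_reg else 0.0
        active += p_cmb + p_rc * n_rtn + not_retained

    return active * t_on + on_off(p_iso) * n_iso + on_off(p_contr) + on_off(p_cg)


# LR5 = {F, G}: (lkg comb, lkg seq, reg, rtn) per actor, values in nW.
lr5_actors = [(213, 1232, 128, 128), (273, 1385, 128, 64)]
lr5_leakage = power_gating_model(
    lr5_actors, t_on=0.3, n_iso=32, p_rc=17.15,
    p_iso=(4.27, 1.39), p_contr=(95.44, 88.63), p_cg=(5.77, 4.71),
)
print(round(lr5_leakage, 3))  # 1509.219
```

Only the active part of the region is multiplied by the activation time, while each overhead cell contributes in both states, which is exactly the asymmetry described above.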


3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Frequently, commercial tools (e.g., Cadence Encounter Digital Implementation System) consider dynamic power as composed of two main terms, as depicted by (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44].

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} \, SW_j + f \sum_{\mathrm{cell}_i} P_i \, SW_i
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each $\mathrm{net}_j$ (bearing a specific capacitance $C_{\mathrm{load}_j}$) and the related switching activity ($SW_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($SW_i$).
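As a small illustration of (2), the sketch below evaluates the two contributions separately. All the numbers are made up for the example and do not come from the paper; the function name `dynamic_power` and the choice of units (frequency in MHz, capacitances in farads, cell power in watts per MHz) are assumptions made here so that both sums end up in watts.

```python
def dynamic_power(f_mhz, vdd, nets, cells):
    """Return (P_net, P_int) in watts.

    f_mhz -- operating frequency in MHz
    vdd   -- supply voltage in volts
    nets  -- list of (load capacitance in F, switching activity) pairs
    cells -- list of (power per MHz in W/MHz, switching activity) pairs
    """
    f_hz = f_mhz * 1e6
    # Net term: 1/2 * f * VDD^2 * sum(C_load_j * SW_j)
    p_net = 0.5 * f_hz * vdd ** 2 * sum(c * sw for c, sw in nets)
    # Internal term: f * sum(P_i * SW_i), with P_i expressed per MHz
    p_int = f_mhz * sum(p * sw for p, sw in cells)
    return p_net, p_int


# Hypothetical two-net, two-cell circuit.
p_net, p_int = dynamic_power(
    f_mhz=100.0,
    vdd=1.0,
    nets=[(2e-15, 0.5), (4e-15, 0.25)],
    cells=[(1e-9, 0.5), (2e-9, 0.25)],
)
print(p_net, p_int)
```

Both terms scale linearly with the switching activity, which is why the dynamic model needs simulation data rather than library values alone.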

Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{\mathrm{int}}(LR_i) &= P_{\mathrm{intON}}(LR_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(LR_i) \\
&= \sum_{\mathrm{actors} \in LR_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right]
\end{aligned}
\tag{3}
$$

Considering a prospective power gating implementation, the different parts of (3) basically reflect those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of switching activity of the nodes. For this reason, the netlist of the baseline CGR system alone is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the optimal identification of the region to be gated. We have already planned to extend the proposed methodology to include the net contribution in the near future.

3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and acts positively only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power one, referring to a clock gated design.

$$
\begin{aligned}
P_{\mathrm{lkg}}(LR_i) &= P_{\mathrm{lkg}}(LR_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(LR_i) \\
&= \sum_{\mathrm{actors} \in LR_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right]
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(LR_i) &= P_{\mathrm{int}}(LR_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(LR_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{LR_i}) + P_{\mathrm{intON}}(\mathrm{seq}_{LR_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(LR_i) \\
&= \sum_{\mathrm{actors} \in LR_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i\mathrm{ON}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right]
\end{aligned}
\tag{5}
$$

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage. Equation (5), instead, always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(LR_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating management is implemented at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is much the same as (4), apart from the fact that, in the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following.

(i) Architectural Parameters. LR composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LR behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.


Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation of $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class: indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline CGR platform (with no power saving applied) and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations follows for LR5, which includes actors $F$ and $G$.

When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

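$$
\begin{aligned}
P_{\mathrm{lkg}}(LR_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 1509.219
\end{aligned}
\tag{6}
$$

The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(LR_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (537 + 383.25 \cdot 128) + (363 + 383.25 \cdot 64 + 44068 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 2.7 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 + \left[ 1449 \cdot 0.3 + 1488 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] \\
&= 30712.72
\end{aligned}
\tag{7}
$$

When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(LR_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 3186.96 \\[6pt]
P_{\mathrm{int}}(LR_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5\mathrm{ON}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5\mathrm{ON}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (537 + 22489 \cdot 0.3) + (363 + 44068 \cdot 0.3) \right] + \left[ 1351 \cdot 0.3 + 1320 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] \\
&= 22451.8
\end{aligned}
\tag{8}
$$

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.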
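The clock gating models (4) and (5) differ only in how the sequential contribution is weighted, which the following Python sketch makes explicit. It is a minimal illustration under assumptions: the function name `clock_gating_model` and the `gate_registers` flag are hypothetical, $T_{i\mathrm{OFF}}$ is taken as $1 - T_{i\mathrm{ON}}$, and the inputs are the LR5 entries of Tables 2 and 3 with the leakage library values read with two decimals.

```python
def clock_gating_model(actors, t_on, p_enab, p_cg, gate_registers):
    """Evaluate the clock gating model for one logic region.

    actors         -- list of (p_cmb, p_reg) pairs, one per actor
    t_on           -- fraction of time the region is active (T_iON)
    p_enab, p_cg   -- (on, off) power pairs of Enable Generator and CG cell
    gate_registers -- False for the static model (registers always leak),
                      True for the internal model (registers only count
                      during the active time)
    """
    t_off = 1.0 - t_on  # assumption: T_iOFF is the complement of T_iON

    def on_off(pair):
        return pair[0] * t_on + pair[1] * t_off

    reg_weight = t_on if gate_registers else 1.0
    core = sum(p_cmb + p_reg * reg_weight for p_cmb, p_reg in actors)
    return core + on_off(p_enab) + on_off(p_cg)


# LR5 = {F, G}, values in nW.
lr5_lkg = clock_gating_model(
    [(213, 1232), (273, 1385)], 0.3, (84.51, 76.54), (5.77, 4.71),
    gate_registers=False,
)
lr5_int = clock_gating_model(
    [(537, 22489), (363, 44068)], 0.3, (1351, 1320), (169, 292),
    gate_registers=True,
)
print(round(lr5_lkg, 2))  # 3186.96
print(lr5_int)
```

With the tabulated inputs, the internal estimate lands within a fraction of a nanowatt of the value reported for LR5, the small difference presumably coming from rounding of the published figures.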

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management


Table 1: Parameter classification. The table reports, for each considered parameter, its typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch | Funct | Tech | Description | Extraction |
|---|---|---|---|---|---|
| reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs among each other in the dataflow model |
| $T_{i\mathrm{ON}}$ | | x | | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $T_{i\mathrm{OFF}}$ | | x | | Off time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality; chosen by the designer |
| $P_{\mathrm{lkg/int}}(\mathrm{RC})$, $P_{\mathrm{lkg/int}}(\mathrm{ISO_{ON}})$, $P_{\mathrm{lkg/int}}(\mathrm{ISO_{OFF}})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{\mathrm{lkg/int}}(\mathrm{Contr_{ON}})$, $P_{\mathrm{lkg/int}}(\mathrm{Contr_{OFF}})$, $P_{\mathrm{lkg/int}}(\mathrm{Enab_{ON}})$, $P_{\mathrm{lkg/int}}(\mathrm{Enab_{OFF}})$ | | | x | Power estimation of the PG and CG controllers when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{cmb})$, $P_{\mathrm{lkg/int}}(\mathrm{reg})$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |


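Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

| Parameter | lkg power [nW] | int power [nW] |
|---|---|---|
| $\mathrm{Enab_{ON}}$ | 84.51 | 1351 |
| $\mathrm{Enab_{OFF}}$ | 76.54 | 1320 |
| $\mathrm{Contr_{ON}}$ | 95.44 | 1449 |
| $\mathrm{Contr_{OFF}}$ | 88.63 | 1488 |
| $\mathrm{CG_{ON}}$ | 5.77 | 169 |
| $\mathrm{CG_{OFF}}$ | 4.71 | 292 |
| $\mathrm{ISO_{ON}}$ | 4.27 | 2.7 |
| $\mathrm{ISO_{OFF}}$ | 1.39 | 0 |
| $P(\mathrm{RC})$ | 17.15 | 383.25 |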

    PG_set is empty
    CG_set is empty
    foreach LR_i in set_LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG total overhead
        if PG total overhead < 0 then
            estimate CG total overhead
            if PG total overhead < CG total overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG total overhead
        if CG total overhead < 0 then
            add LR_i to CG_set
        end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (see evaluate_area function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always on domain. Thus, for small LRs we can assume that it will not bring any benefit, so that power gating is not considered for implementation. Clock gating, instead, may still be beneficial due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (see evaluate_PG function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (see evaluate_CG function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always on domain.
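The selection logic of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription rather than the MDC implementation: the function name `select_strategies` and the tuple layout are assumptions, the overheads are supplied as precomputed percentage variations instead of being estimated from the models, and the sample figures are the areas and overheads of the step-by-step example of Section 3.2.7 (areas read as percentages of the total area).

```python
def select_strategies(regions, area_th):
    """Classify logic regions following the structure of Algorithm 1.

    regions maps an LR name to (area, pg_overhead, cg_overhead), where the
    overheads are power variations with respect to the baseline design
    (negative values mean a saving).
    """
    pg_set, cg_set, always_on = [], [], []

    def evaluate_cg(name, cg_over):
        # Clock gate only if it yields a net saving, otherwise leave always on.
        (cg_set if cg_over < 0 else always_on).append(name)

    def evaluate_pg(name, pg_over, cg_over):
        if pg_over < 0:
            # Both techniques are candidates: keep the larger saving.
            (pg_set if pg_over < cg_over else cg_set).append(name)
        else:
            evaluate_cg(name, cg_over)

    for name, (area, pg_over, cg_over) in regions.items():
        if area > area_th:
            evaluate_pg(name, pg_over, cg_over)
        else:
            evaluate_cg(name, cg_over)
    return pg_set, cg_set, always_on


example = {
    "LR1": (52.0, -86.45, -2.15),
    "LR3": (0.4, 0.02, 0.01),
    "LR4": (7.0, -1.2, -2.0),
    "LR5": (15.0, -0.89, -1.04),
}
pg_set, cg_set, always_on = select_strategies(example, area_th=5.0)
print(pg_set, cg_set, always_on)  # ['LR1'] ['LR4', 'LR5'] ['LR3']
```

Small regions skip the power gating estimation entirely, so the area threshold acts as a cheap filter in front of the more expensive comparison.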

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LR classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-size-fits-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.


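Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{\mathrm{ON}}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | $\alpha$ | 0.1 | B | 32 | 514 | 24 |
| LR3 | $\beta$ | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | $\alpha$, $\gamma$ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | $\gamma$ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |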

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, the MDC LR identification led to determining five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus, it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area ($\mathrm{area_{th}}$) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).

  (a) Its area is calculated: $\mathrm{area}_{LR1}$ = 52% of the total area.

  (b) $\mathrm{area}_{LR1} > \mathrm{area_{th}}$, so that a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).

    (1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

    (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.

    (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).

  (a) Its area is calculated: $\mathrm{area}_{LR3}$ = 0.4% of the total area.

  (b) $\mathrm{area}_{LR3} < \mathrm{area_{th}}$, so that a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).

    (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: $\mathrm{CG_{over}}$ = +0.01%.

    (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).

  (a) Its area is calculated: $\mathrm{area}_{LR4}$ = 7% of the total area.

  (b) $\mathrm{area}_{LR4} > \mathrm{area_{th}}$, so that a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support. (Blocks of the MDC baseline flow: ORCC DPNs parsing, datapath merging, and hardware platform generation, fed by the HDL actors library and the communication protocol. Blocks of the hybrid clock and power gating extension: logic regions identification, Power Analysis, and CG/PG HDL generation. The synthesizer receives HDL and scripts and returns reports; the Power Analysis block produces the LR class; the final output is the optimal HDL plus CPF.)

    (1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

    (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.

    (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).

  (a) Its area is calculated: $\mathrm{area}_{LR5}$ = 15% of the total area.

  (b) $\mathrm{area}_{LR5} > \mathrm{area_{th}}$, so that a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).

    (1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

    (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.

    (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to CG_set.
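The percentage variations quoted in the steps above follow from a single ratio, which can be checked with a few lines of Python. The sketch below is illustrative: the function name `overhead_percent` is an assumption, and the inputs are the per-region base, PG, and CG consumptions and the base design consumption as reported in Figure 4 (all in nW).

```python
BASE_DESIGN_NW = 4311417  # total consumption of the baseline design


def overhead_percent(p_strategy, p_base_lr, p_base_design=BASE_DESIGN_NW):
    """Percentage power variation of the whole design when one LR is gated.

    A negative value means that the strategy saves power overall.
    """
    return 100.0 * (p_strategy - p_base_lr) / p_base_design


# name: (base, with power gating, with clock gating), values in nW
figure4 = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}

variations = {
    name: (overhead_percent(pg, base), overhead_percent(cg, base))
    for name, (base, pg, cg) in figure4.items()
}
for name, (pg_over, cg_over) in variations.items():
    print(f"{name}: PG {pg_over:+.2f}%  CG {cg_over:+.2f}%")
```

Normalizing by the consumption of the whole baseline design, rather than by that of the single region, keeps the overheads of different regions directly comparable.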

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. (Threshold: $\mathrm{area_{th}}$ = 5%. Base design consumption: 4311417 nW. PG_set and CG_set are initially empty; at the end of the flow, PG_set = {LR1} and CG_set = {LR4, LR5}, while LR3 receives no power saving. The data used at each step are the following.)

| LR | Area [%] | Base [nW] | PG [nW] (over) | CG [nW] (over) |
|---|---|---|---|---|
| LR1 | 52 | 4143798 | 416765 (−86.45%) | 4050955 (−2.15%) |
| LR3 | 0.4 | 3266 | 4227 (+0.02%) | 3894 (+0.01%) |
| LR4 | 7 | 93793 | 40767 (−1.2%) | 9479 (−2.0%) |
| LR5 | 15 | 70560 | 32222 (−0.89%) | 25639 (−1.04%) |

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies. (The platform shows power domain PD1, with its power switch, state retention cell, and isolation cell driven by the Power Controller, and clock domains CD4 and CD5, driven by clock gating cells.)

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experimental results conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

119883119896 =119873

sum119899=0

119909119899119890minus1198942120587119896(119899119873) 119896 = 0 1 119873 minus 1 (9)

The FFT algorithm that has been adopted for this usecase has been proposed by Cooley and Tukey [45] It aimsat speeding up the calculation of a given size 119873 DFT byconsidering smaller DFTs of size 119903 called radix To obtainthe whole original DFT119872 stages of size 119903DFTs are requiredwhere 119873 = 119903119872 Small DFTs have to be multiplied by the so-called twiddle factors according to the decimation in timevariant of the algorithmWhen the radix 119903 = 2 the DTFs takethe name of butterflies by their block scheme The equationsdescribing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \qquad (10)$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \qquad (11)$$


Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11). [Butterfly diagram: inputs x(k = 0), x(k = 4), x(k = 2), x(k = 6), x(k = 1), x(k = 5), x(k = 3), x(k = 7); outputs X(n = 0) to X(n = 7).]

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to compute the outputs.
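As a minimal software sketch of the structure just described (not the hardware design used in this work), the following Python code evaluates a radix-2 decimation-in-time FFT built from the butterfly of (10), with twiddle factors taken as $w_n^k = e^{-i 2\pi (k/n)}$, and counts the butterflies it executes. The function names `dft` and `fft_radix2` and the bit-reversed input ordering are illustrative assumptions.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition (9), used as a reference."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def fft_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    Returns the transform and the number of butterflies executed.
    """
    N = len(x)
    stages = N.bit_length() - 1  # M such that N = 2**M
    # Reorder the inputs in bit-reversed order (0, 4, 2, 6, 1, 5, 3, 7 for N = 8).
    data = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(N)]
    butterflies = 0
    size = 2  # size n of the sub-DFTs combined in the current stage
    while size <= N:
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor w_n^k
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b          # X0 = x0 + x1 * w
                data[start + k + half] = a - b   # X1 = x0 - x1 * w
                butterflies += 1
        size *= 2
    return data, butterflies


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, -1, -2, 5]
    spectrum, count = fft_radix2(samples)
    print(count)  # three stages of four butterflies each
    print(max(abs(a - b) for a, b in zip(spectrum, dft(samples))))
```

For a size-8 input, the loop runs three stages of four butterflies, which corresponds to the 12 butterflies of the baseline design.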

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In these variants, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput decreases. In particular, four size-8 FFT configurations are considered:

(i) 12b: the baseline 12-butterfly FFT design, taking 3 clock periods to complete the transform.

(ii) 4b: involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b: involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b: involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations. [Plot: power ($\mu$W) of the Base design for 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit the requests of the external environment. For instance, in a battery operated environment, when the battery level falls below a given threshold, some throughput can be given up to consume less power.
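The trade-off can be illustrated with a small sketch that picks, among the four configurations, the one with the lowest baseline power that still meets a latency bound. The static and internal figures are read from Table 5 as nW values; summing them to obtain a total (the net term is ignored) and the selection policy itself are assumptions made for illustration only, and `pick_configuration` is a hypothetical helper, not part of MDC.

```python
# Latency in clock cycles and baseline power in nW (static and internal, Table 5).
CONFIGS = {
    "1b": {"latency_cc": 24, "static_nw": 1750407, "internal_nw": 1307259},
    "2b": {"latency_cc": 12, "static_nw": 1752106, "internal_nw": 1539855},
    "4b": {"latency_cc": 6, "static_nw": 1757485, "internal_nw": 2020953},
    "12b": {"latency_cc": 3, "static_nw": 1776056, "internal_nw": 2411257},
}


def power_uw(name):
    """Static plus internal power of a configuration, in microwatts."""
    cfg = CONFIGS[name]
    return (cfg["static_nw"] + cfg["internal_nw"]) / 1000.0


def pick_configuration(max_latency_cc):
    """Return the lowest-power configuration whose latency fits the bound."""
    feasible = [n for n, c in CONFIGS.items() if c["latency_cc"] <= max_latency_cc]
    if not feasible:
        raise ValueError("no configuration meets the requested latency")
    return min(feasible, key=power_uw)


if __name__ == "__main__":
    for bound in (24, 12, 6, 3):
        name = pick_configuration(bound)
        print(bound, name, round(power_uw(name), 1))
```

With a relaxed latency bound the single-butterfly configuration is selected, while tighter bounds force the larger and more power-hungry ones.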

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}(\mathrm{1b}) = 0.42$, $T_{ON}(\mathrm{2b}) = 0.21$, and $T_{ON}(\mathrm{4b}) = 0.04$.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

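Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso($e$) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso($r$) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |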
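The $T_{ON}$ row of Table 6 can be reproduced from the per-configuration data of Table 5 by accumulating, for every LR, the activation time of each configuration that uses it. The sketch below does exactly that; the function name `lr_activation_times` and the data layout are illustrative assumptions.

```python
from collections import defaultdict

# Activation time and activated LRs of each FFT configuration (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def lr_activation_times(config_t_on, config_lrs):
    """Sum, for each LR, the activation times of the configurations using it."""
    totals = defaultdict(float)
    for config, regions in config_lrs.items():
        for lr in regions:
            totals[lr] += config_t_on[config]
    # Round to two decimals, the precision used in the tables.
    return {lr: round(t, 2) for lr, t in sorted(totals.items())}


if __name__ == "__main__":
    for lr, t_on in lr_activation_times(CONFIG_T_ON, CONFIG_LRS).items():
        print(f"LR{lr}: {t_on:.2f}")
```

For LR2 the accumulation gives 0.42 + 0.21 + 0.04 = 0.67, as in the example discussed in the text.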

Figure 8: FFT use case: area percentage per LR. [Pie charts, FFT at 90 nm. Share of the overall system area (%): LR1 62.48, LR2 0.65, LR3 15.8, LR4 0.44, LR5 0.45, LR6 0.43, LR7 7.92, LR8 1.36. Sequential/combinatorial split (%): LR1 0.91/99.09, LR2 58.92/41.07, LR3 0.90/99.09, LR4 0/100, LR5 0/100, LR6 0/100, LR8 27.99/72.01.]

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration. [Bar chart "Power consumption variation after power gating integration": estimated (Est) and real (Real) power variation (%) for LR1 to LR8.]

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the overhead of the power saving strategy. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power (PG) and clock gating (CG).

| LR | PG overhead, Static (%) | PG overhead, Internal (%) | CG overhead, Static (%) | CG overhead, Internal (%) |
|---|---|---|---|---|
| LR1 | -40.87 | -9.47 | 0.00 | -11.33 |
| LR2 | +0.23 | -1.32 | 0.00 | -2.85 |
| LR3 | -10.44 | -2.11 | 0.00 | -2.59 |
| LR4 | -0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | -3.44 | -0.39 | 0.00 | -0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | -0.25 |

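Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est (%) | PG overhead, Real (%) | PG overhead, Err (%) | Overhead error, Iso (%) | Overhead error, Rtn (%) | Overhead error, Contr (%) | Net, Err (%) |
|---|---|---|---|---|---|---|---|
| LR1 | -25.22 | -25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | -6.28 | -6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | -1.92 | -1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |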

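Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est (%) | CG overhead, Real (%) | CG overhead, Err (%) | Overhead error, CGcell (%) | Net, Err (%) |
|---|---|---|---|---|---|
| LR1 | -5.73 | -5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | -1.42 | -1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | -1.34 | -1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | -0.39 | -0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | -0.12 | -0.12 | 0.29 | 1.08 | 0.47 |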

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart "Power consumption variation after clock gating integration": estimated (Est) and real (Real) power variation (%) for LR1, LR2, LR3, LR7, and LR8.]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
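The two outcomes above can be reproduced with a simplified sketch of the selection logic. This is only an approximation of Algorithm 1 as it is described in this section, not its actual listing: the function `select_strategy`, the fallback to clock gating for regions where power gating is not the better option, and the data layout are assumptions. Area shares are read from Figure 8, and estimated overheads from Tables 8 and 9; a missing entry means that no estimate is reported (for clock gating, because the region is fully combinatorial).

```python
# Share of the overall system area per LR, in percent (Figure 8).
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated power overhead in percent; negative values are savings.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}                              # Table 8
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}  # Table 9


def select_strategy(area_threshold):
    """Split the LRs into power gated, clock gated, and unmanaged sets."""
    power_gated, clock_gated, unmanaged = [], [], []
    for lr, area in AREA.items():
        pg = PG_EST.get(lr)
        cg = CG_EST.get(lr)
        pg_is_best = pg is not None and pg < 0 and (cg is None or pg < cg)
        if area > area_threshold and pg_is_best:
            power_gated.append(lr)    # above threshold and power gating saves more
        elif cg is not None and cg < 0:
            clock_gated.append(lr)    # clock gating still yields a saving
        else:
            unmanaged.append(lr)      # e.g. fully combinatorial, subthreshold regions
    return power_gated, clock_gated, unmanaged


if __name__ == "__main__":
    for threshold in (5, 10):
        print(threshold, select_strategy(threshold))
```

Raising the threshold from 5% to 10% moves LR7 from the power gated set to the clock gated one, while LR4, LR5, and LR6 stay unmanaged in both cases.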

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.


Figure 11: FFT use case: comparison between the base design and the four gated designs. Legend shows in brackets the power management area overhead (%) for each design with respect to the Base one. [Bar chart, FFT at 90 nm: static, internal, and total power of Base, CG_full (+0.03), PG_full (+4.66), DAT_5 (+2.18), and DAT_10 (+1.97).]

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (-45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of reducing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. The CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Plot: power of Base, CG_full, PG_full, DAT_5, and DAT_10 for 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, considering, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, resp.). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, the estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | # actors | $T_{ON}$ | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as can be seen in the reported histograms, on average there are roughly two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est (%) | PG saving, Real (%) | PG saving, Err (%) | PG Net, Err (%) | CG saving, Est (%) | CG saving, Real (%) | CG saving, Err (%) | CG Net, Err (%) |
|---|---|---|---|---|---|---|---|---|
| LR1 | -6.322 | -6.331 | 0.15 | 0.18 | -6.307 | -6.313 | 0.09 | 0.17 |
| LR2 | -23.464 | -23.463 | 0.00 | 1.26 | -23.369 | -23.373 | 0.02 | 1.67 |
| LR3 | -6.537 | -6.544 | 0.10 | 1.65 | -6.507 | -6.514 | 0.10 | 1.59 |
| LR4 | - | - | - | - | -0.958 | -0.964 | 0.68 | 1.34 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | - | - | - | - | -0.870 | -0.878 | 0.81 | 1.22 |
| LR7 | - | - | - | - | -1.033 | -1.039 | 0.63 | 0.15 |
| LR8 | -7.062 | -7.058 | 0.05 | 2.02 | -6.987 | -6.986 | 0.01 | 1.83 |
| LR9 | - | - | - | - | -2.436 | -2.443 | 0.27 | 1.55 |
| LR10 | -5.498 | -5.503 | 0.10 | 1.65 | -5.474 | -5.480 | 0.11 | 1.57 |
| LR11 | - | - | - | - | -3.329 | -3.285 | 1.32 | 3.67 |
| LR12 | -4.278 | -4.288 | 0.23 | 1.83 | -4.267 | -4.273 | 0.15 | 1.86 |
| LR13 | - | - | - | - | -0.649 | -0.656 | 1.01 | 1.13 |

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. Legend shows in brackets the power management area overhead (%) for each design with respect to the Base one. [Bar chart: static (lkg), internal (int), and total (tot) power of Base, CG_full (+1.73), PG_full (+6.40), DAT_5 (+4.55), and DAT_10 (+3.19).]

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (-62.38% for DAT_10 versus -62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering, respectively, power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. Legend shows in brackets the power management area overhead (%) for each design with respect to the Base one. [Bar chart: static (lkg), internal (int), and total (tot) power of Base, CG_full (+1.82), PG_full (+6.90), DAT_1 (+6.02), DAT_5 (+5.18), and DAT_10 (+3.35).]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we have decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) * \#switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $\#switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model, and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est (%) | PG saving, Real (%) | PG saving, Err (%) | PG Net, Err (%) | CG saving, Est (%) | CG saving, Real (%) | CG saving, Err (%) | CG Net, Err (%) |
|---|---|---|---|---|---|---|---|---|
| LR1 | -5.686 | -5.699 | 0.23 | 0.47 | -5.507 | -5.501 | 0.13 | 0.53 |
| LR2 | -22.990 | -22.982 | 0.03 | 1.02 | -22.063 | -22.029 | 0.15 | 0.81 |
| LR3 | -6.569 | -6.566 | 0.04 | 1.13 | -6.273 | -6.262 | 0.18 | 0.91 |
| LR4 | -0.946 | -0.944 | 0.15 | 1.82 | -0.919 | -0.917 | 0.22 | 1.52 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | -0.747 | -0.748 | 0.25 | 0.25 | -0.853 | -0.833 | 0.26 | 1.59 |
| LR7 | -0.955 | -0.957 | 0.17 | 1.36 | -0.935 | -0.933 | 0.21 | 1.49 |
| LR8 | -7.464 | -7.460 | 0.06 | 1.36 | -6.724 | -6.714 | 0.16 | 0.66 |
| LR9 | -2.475 | -2.482 | 0.25 | 1.10 | -2.350 | -2.345 | 0.18 | 1.09 |
| LR10 | -5.473 | -5.470 | 0.04 | 1.09 | -5.270 | -5.261 | 0.17 | 0.91 |
| LR11 | -3.422 | -3.361 | 1.79 | 2.58 | -3.257 | -3.205 | 1.63 | 2.04 |
| LR12 | -4.327 | -4.330 | 0.08 | 1.19 | -4.176 | -4.169 | 0.17 | 0.98 |
| LR13 | -0.587 | -0.580 | 1.25 | 3.70 | -0.623 | -0.621 | 0.28 | 1.85 |

phase of the proposed model is a real use case, but it is a small size design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
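The two percentages quoted above follow from dividing the 4-cycle wake-up protocol by the kernel execution time. A minimal sketch of this arithmetic is given below; the function name `wake_up_overhead_percent` is a hypothetical helper introduced only for illustration.

```python
WAKE_UP_CYCLES = 4  # clock cycles required by the switch on/off protocol


def wake_up_overhead_percent(execution_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    return 100.0 * wake_up_cycles / execution_cycles


if __name__ == "__main__":
    # Fastest zoom kernel (24 cycles) and smallest 8 x 8 HEVC block (160 cycles).
    for cycles in (24, 160):
        print(cycles, round(wake_up_overhead_percent(cycles), 1))
```

The 24-cycle kernel yields roughly 16.7%, while the 160-cycle block yields 2.5%.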

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR    PG overhead
      Est.     Real     Err.
LR1   −30.54   −26.10   17.07
LR2   −0.19    −1.99    90.22
LR3   −21.83   −6.95    214.30
LR4   −0.12    −0.67    80.97
LR5   −0.06    −0.59    90.11
LR6   −0.07    0.56     86.60
LR7   −15.39   −2.64    482.93
LR8   −0.38    −0.02    1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, when our approach is not used, 2 ∗ k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N ∗ (2 ∗ k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, applies when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
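The effort comparison in this section reduces to simple counting. The following sketch reproduces the figures quoted for the two use cases; the function and parameter names are hypothetical, and the treatment of fully combinatorial LRs (no clock gated design is synthesized for them) follows the simplification described in the text.

```python
def evaluation_effort(n_functionalities, n_regions, n_combinatorial=0):
    """Return ((syntheses, simulations) with the flow, (syntheses, simulations) without it)."""
    # Proposed flow: one baseline synthesis and one simulation per functionality.
    with_flow = (1, n_functionalities)

    # Exhaustive alternative: the baseline design, one power gated design per LR,
    # and one clock gated design per LR that contains sequential logic.
    syntheses = 1 + n_regions + (n_regions - n_combinatorial)
    # Every synthesized design is simulated once per functionality.
    without_flow = (syntheses, n_functionalities * syntheses)
    return with_flow, without_flow


fft = evaluation_effort(n_functionalities=4, n_regions=8, n_combinatorial=3)
zoom = evaluation_effort(n_functionalities=7, n_regions=13, n_combinatorial=1)

print(fft)   # ((1, 4), (14, 56))
print(zoom)  # ((1, 7), (26, 182))
```

With n_combinatorial left at 0, the function returns the general 2 ∗ k + 1 syntheses and N ∗ (2 ∗ k + 1) simulations.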

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate, customizing the datapath, is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for automating such mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

in this paper is related to the MDC power management extension, which has been previously addressed in [17, 18]. The MDC tool identifies, in the generated CGR system, the minimum set of disjoint, functionally homogeneous logic areas of the system, called logic regions. The latter are exploited to automatically implement dynamic power management strategies, applying indistinctly to all of them either clock [17] or power gating methodologies [18].

The work we are presenting in this paper intends to propose a power modelling methodology and to improve the MDC power management extension by integrating such methodology within its automated flow. We introduce an algorithm that analyses the identified logic regions and, on the basis of one single synthesis and a minimal set of simulations (one for each scenario of the multifunctional problem), is capable of optimally characterizing the power management support. This flow, separately for each logic region of the CGR design, is capable of assessing both clock and power gating management costs and of determining which is the optimal power saving strategy (if any), prior to any physical system implementation. The algorithm is based on detailed static and dynamic power consumption models that take into account functional, architectural, and technological parameters to define the potential overhead and benefits of the considered solutions. As a future perspective, besides its application within the MPEG-RVC scenario, the proposed modelling strategy may also be extended to support other complex autonomous computing systems [19], where the number of involved resources may change at run-time.

The rest of this paper is organized as follows. Section 2 reports the background of this work. Section 3 describes the operation of MDC; in particular, Section 3.1 focuses on the base operation, while Section 3.2 details the proposed power estimation models and their integration within the tool. Section 4 presents the designs under test used to validate the proposed approach: Section 4.1 involves an FFT use case targeting a 90 nm CMOS technology, while Section 4.2 presents the experimental results conducted to assess the enhanced MDC flow on a zoom coprocessor (targeting both 90 nm and 45 nm CMOS technologies). Finally, Section 4.4 details the benefits of the proposed models, before concluding with some final remarks in Section 5.

2. Background

This work presents a model of power consumption for CGR systems. The model is capable of estimating the system power dissipation from the early stages of the design flow, and it has been integrated within an automated flow that decides which power saving technique, clock gating or power gating, has to be applied to each portion of the design.

This section provides an overview of the state of the art in the field of power-aware optimization and, in more detail, of the main aspects involved in the proposed approach. Section 2.1 introduces the distinctive features of CGR systems and the main architectural trends for such devices. Section 2.2 deals with power issues in digital systems, with a particular emphasis on modelling strategies and design automation.

2.1. Coarse-Grained Reconfigurable Systems. Reconfigurable architectures are usually conceived as collections of functional units (FUs), whose functionality and connections can be configured at run time to adapt them to different applications or operating modes. Such systems can be classified according to the granularity of the FUs. Fine-grained approaches, typically exploiting FPGA devices as underlying technology, involve bit-level FUs, resulting in a higher flexibility but requiring long configuration time (due to the configuration bitstream size). Coarse-grained reconfigurability, on the other hand, provides word-level FUs, thus offering less flexibility while guaranteeing faster configuration phases. CGR architectures are usually exploited to design flexible ASICs, making them capable of switching among a finite set of functionalities. In such design cases, high efficiency in terms of area occupation of the designed system can be easily obtained, but not all the resources are evenly involved in the computation. Dedicated power management techniques are needed to reduce the power consumption overhead related to resources that are not involved in each operating mode [17]. CGR architectures have already proved suitable for application scenarios that require flexibility along with strong area, power, and execution efficiency [13, 20].

One of the main issues of CGR architectures is their complex mapping and programming [21, 22]. Several works tried to automate the mapping of applications and computational kernels onto CGR and multicore systems [23–25]. The mapping problem requires specific knowledge of the considered kernels, which usually have to be identified and specified by means of hardware description languages. The mapping effort is directly proportional to the number of involved kernels [26]. Recently, dataflow models have proved very useful in this scenario [27, 28]. Dataflows describe programs through a graph whose nodes are processing elements (actors), linked by point-to-point unidirectional channels managed according to a FIFO protocol. Actors encapsulate their own state and communicate only through atomic packets of data (tokens). Due to their intrinsic modularity, dataflows favour the definition and reuse of hardware and software components. Furthermore, they are natively capable of highlighting the intrinsic parallelism of the specified applications. The Multi-Dataflow Composer (MDC) tool, adopted within the presented work, relies on dataflows (the RVC-CAL formalism by MPEG is currently supported) to solve the CGR mapping problem. It exploits the characteristics of such models to provide several advanced features (e.g., power management [17, 18] or automatic generation of coprocessing units [29]).
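To make the dataflow vocabulary concrete (actors, tokens, point-to-point FIFO channels), the following is a minimal sketch in plain Python. It is a generic illustration and not RVC-CAL or MDC code: the Actor class, the firing rule (fire when every input FIFO holds a token), and the run scheduler are all assumptions introduced here.

```python
from collections import deque


class Actor:
    """Hypothetical processing node: fires when every input FIFO holds a token."""

    def __init__(self, name, function, n_inputs):
        self.name = name
        self.function = function                          # maps input tokens to one output token
        self.inputs = [deque() for _ in range(n_inputs)]  # one FIFO per input port
        self.targets = []                                 # (actor, port) pairs fed by this actor
        self.results = []                                 # tokens produced when nothing is connected

    def connect(self, target, port):
        self.targets.append((target, port))

    def can_fire(self):
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        tokens = [fifo.popleft() for fifo in self.inputs]  # consume one token per port
        out = self.function(*tokens)
        if not self.targets:
            self.results.append(out)
        for target, port in self.targets:
            target.inputs[port].append(out)                # produce one token downstream


def run(actors):
    """Naive scheduler: keep firing until no actor has enough tokens."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


add = Actor("add", lambda a, b: a + b, n_inputs=2)
scale = Actor("scale", lambda x: 2 * x, n_inputs=1)
add.connect(scale, 0)

for a, b in [(1, 2), (3, 4)]:
    add.inputs[0].append(a)
    add.inputs[1].append(b)

run([add, scale])
print(scale.results)  # [6, 14]
```

Because each actor touches only its own FIFOs, the two actors could fire concurrently on different tokens, which is the parallelism that dataflow descriptions expose.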

2.2. Power Management. Power consumption in digital devices is composed mainly of two different contributions: dynamic and static. The former is due to capacitance charging/discharging when logic transitions occur (i.e., switching activity). The latter is due to leakage currents, and it is consumed also when no circuit activity is present. Modern designers need to consider both terms when conceiving smart management strategies. Several techniques (clock gating, multifrequency, operand isolation, multithreshold, multisupply libraries, power gating, etc.) exist and, in some cases, they are automatically implemented by commercial synthesis/place-and-route tools. In custom computing systems, some advanced design tools support the designers in the application-driven customization of the hardware architectures [30, 31]. However, generally speaking, invasive techniques (those requiring the insertion of additional logic and target technology support, such as the availability of dedicated cells and processes on the implementation stack) still need tools to be guided with significant manual effort by the designer.
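The dynamic and static contributions named at the start of this subsection are commonly summarized by the first-order textbook expressions below. They are given only as a reminder and are not the estimation models proposed in this paper; the symbols (switching activity factor α, switched capacitance C_L, supply voltage V_DD, clock frequency f, leakage current I_leak) are assumptions introduced here.

```latex
P_{\mathrm{tot}} = P_{\mathrm{dyn}} + P_{\mathrm{stat}}, \qquad
P_{\mathrm{dyn}} \approx \alpha \, C_L \, V_{DD}^{2} \, f, \qquad
P_{\mathrm{stat}} \approx I_{\mathrm{leak}} \, V_{DD}
```

In these terms, clock gating lowers the effective switching activity of the clock tree and of sequential logic, whereas power gating removes the leakage term of a region that is switched off.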

Clock gating is an example of a quite noninvasive technique. It may reduce the dynamic power consumption due to the clock tree and to sequential logic by up to 40% [32]. It consists in shutting off the clock of the unused synchronous logic by means of simple AND gates. Clock gating has been deeply automated, and it is available on most commercial synthesizers. In the MPEG-RVC community, recent studies [33] presented an extension of a High-Level Synthesis tool, Xronos, to selectively switch off the clock signal for parts of the circuit that are idle due to stalls in the pipeline, in order to reduce power consumption. Moreover, as mentioned, the MDC tool has the capability of identifying, by means of a graph-based analysis of the input dataflow specifications, independent circuitry regions. These logic regions can be clock gated to dynamically adapt power consumption when switching between different functionalities [17, 18]. From the technical point of view, in ASIC designs AND gates can be used directly on the clock to disable it, while in FPGA designs the clock network cannot be modified by the insertion of any custom logic, and dedicated cells are required (Xilinx boards, e.g., are equipped with dedicated blocks (BUFGs) whose outputs can drive distinct regions of logic, powering down different design portions (when enabled)). Clock gating can be applied at different granularities: fine-grained approaches act on single registers, whereas coarse-grained ones refer, as in [17, 18, 33], to a set of resources. Commercial synthesizers normally can automate only fine-grained strategies.

Power gating is quite invasive. The main idea behind it is as follows: if a specific portion of the design is not used in a given computation mode, then it can be completely switched off by means of a sleep transistor. This technique, like clock gating, is applicable at different granularities: fine-grained approaches require driving a different sleep transistor for every cell in the system, while coarse-grained ones, again, operate on a set of resources, instantiating one sleep transistor to drive different cells connected to a shared power network. MDC, as discussed in [18], also supports automatic power gating for CGR architectures. Each identified logic region in the CGR system is implemented (regardless of its nature or characteristics) as a different power domain (PD) that, in order to be managed, requires inserting and driving the following resources:

(i) The sleep transistor between the gated region and the main power supply, to switch on/off the derived power supply.

(ii) The isolation logic between the gated region and normally-on cells, to avoid the transmission of spurious signals in input to the normally-on cells.

(iii) The state retention logic to maintain, where needed, the internal state of the gated region.

MDC, besides defining the power gated design netlist, also provides the automatic definition of the power format file. This file specifies the shut-off, isolation, and state retention rules (with 10 PDs, 3 ∗ 10 = 30 interfaces have to be defined) along with their respective enable signals ([1 ∗ shut-off + 1 ∗ isol + 2 ∗ reten] ∗ 10 = 40 signals) and a dedicated Power Controller to properly drive them. The power format file automatically created by MDC is compliant with the Silicon Integration Initiative's Common Power Format (CPF), whose definition is driven mainly by developers using Cadence [34] tools.
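The bookkeeping behind the 30 interfaces and 40 enable signals quoted above scales linearly with the number of power domains. The sketch below only restates that arithmetic; the constant and function names are hypothetical and do not correspond to any MDC or CPF identifier.

```python
# One rule each for shut-off, isolation, and state retention in every power domain.
RULES_PER_DOMAIN = 3

# Enable signals per power domain, as listed in the text: 1 shut-off, 1 isolation, 2 retention.
ENABLES_PER_DOMAIN = {"shut_off": 1, "isolation": 1, "retention": 2}


def power_format_size(n_power_domains):
    """Return (interfaces, enable signals) to be specified for the given number of PDs."""
    interfaces = RULES_PER_DOMAIN * n_power_domains
    signals = sum(ENABLES_PER_DOMAIN.values()) * n_power_domains
    return interfaces, signals


print(power_format_size(10))  # (30, 40)
```

Even for a moderate number of regions, writing these rules by hand would be tedious and error prone, which is the motivation for generating the power format file automatically.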

2.2.1. Modelling. To the best of our knowledge, the literature does not treat the problem of modelling power gating and clock gating costs in CGR designs. Some approaches only partially address the issue. For example, [35] focusses on low-power techniques and power modelling for FPGAs. In [36], only clock gating is taken into account: different power states (on the basis of the clock enable signals) are defined, and their consumption is characterized by low-level Power Analysis results. The work in [37] focusses on estimating the leakage reduction for power gating and reverse body bias.

The CASPER simulator for shared memory many-core processors [38] includes precharacterized libraries containing power dissipation models of different hardware components, enabling accurate power estimation at a high-level exploration stage. In particular, the authors implement Chip-wide Dynamic Voltage Frequency Scaling and Performance Aware Core-Specific Frequency Scaling. The FALPEM framework [39] provides power estimations at the pre-register transfer level (pre-RTL) stage, specifically targeting the power consumed by clock network and interconnect, but power and clock gating costs are not defined. Other approaches perform an estimation that considers different components. Li et al. [40] propose an architecture-level integrated power, area, and timing modelling framework for multicore systems, which evaluates system building blocks (CPU, buses, etc.) for different technology nodes, providing also power gating support. Finally, the work presented in [41] focuses on on-chip networks.

3. Design Suite for Coarse-Grained Power-Aware Systems

This section discusses the proposed technique for modelling the power consumption of a CGR system when clock gating or power gating is applied. These models, combined in a selection algorithm, can be exploited for developing an automated design flow for power-efficient CGR systems, where the optimal saving strategy is selected for each identified working set of resources.

In this work, we have embedded these models and the algorithm in the Multi-Dataflow Composer (MDC) tool, a framework capable of CGR systems characterization. MDC provides a comprehensive design suite, automating several tasks of the synthesis and development of CGR systems within design flows targeting both FPGA [29] and ASIC. The tool provides extensive support to dynamic power management [17, 18], addressing power-constrained design cases and completely automating implementation and control of clock gating and power gating strategies in the final CGR platform. Nevertheless, such techniques are not currently addressed in a hybrid manner: the users must choose the approach to be used in the design without an a priori automated analysis process. Such an unsupported selection may easily lead to suboptimal implementations on the final platform.

In the following, Section 3.1 provides an overview of the MDC baseline functionality and of the current power management support. Section 3.2 discusses the proposed power models and the automated selection algorithm identifying the optimal power management strategy in CGR systems. In both sections, step-by-step examples are presented to clarify the methodology.

3.1. The Multi-Dataflow Composer Tool. MDC automates generation and management of CGR systems, facing the complex mapping of multiple applications onto a single reconfigurable architecture. It automates the mapping process and guarantees the minimization of hardware resources, allowing for significant area/energy savings [12, 42]. In the literature, this problem is known as datapath merging, and it deals with the combination of a set of input datapaths, described by means of graphs, onto a single reconfigurable datapath. It aims at maximally sharing (among the different input graphs) both processing nodes and connections.

MDC is naturally compliant with the RVC-CAL formalism and natively supports Dataflow Process Network (DPN) models as input. Currently, it is interfaced with the Open RVC-CAL Compiler, Orcc [43]. The Orcc front-end is responsible for parsing, one by one, the high-level DPN specifications of the different datapaths that MDC will merge within the CGR system. Please note that MDC can be interfaced with other graph parsers, so that it can be easily adapted to any other dataflow-based modelling environment.

As depicted in Figure 1, the MDC baseline flow involves three main phases.

(1) The input DPNs parsing, performed by the Orcc front-end, translates the RVC-CAL specifications into Java Intermediate Representations (IRs), which are basically directed graphs.

(2) The datapath merging, performed by the MDC front-end, combines the IRs into a reconfigurable IR, inserting (where necessary) special switching actors (responsible for properly distributing the token flow among the different merged DPNs) and keeping trace of the system programmability through a dedicated Configuration Table (TAB in Figure 1).

(3) The hardware platform generation, performed by the MDC back-end, leads to the creation of the RTL that describes the CGR system itself, where each actor of the reconfigurable IR is mapped onto a different hardware FU. In this phase, the hardware communication protocol and the HDL (Hardware Description Language) components library (providing the RTL descriptions of the required FUs, manually or automatically generated) are provided as input to the tool.

At the hardware level, reconfiguration takes place in a single clock cycle. It is achieved through low overhead switching elements (SBoxes) that allow the sharing of common resources among different input DPNs. SBoxes are simple combinatorial multiplexers and demultiplexers, whose configuration is stored in dedicated Look-Up Tables that, according to the Configuration Table, compute the selectors necessary for the correct data forwarding in order to implement the requested functionality.

3.1.1. Automated Power Management. Dealing with reconfigurable architectures, and in particular with CGR systems, the power consumption has to be carefully taken into consideration. Such systems are affected by resource redundancy, mainly due to the FUs that are not shared among different functionalities. Thus, when a certain functionality is executed, the part of the design not involved in the computation is in an idle state and can uselessly consume precious power. Fortunately, the unused resources for each implemented functionality depend on the input specifications and are therefore fixed at design time.

Given these considerations, a CGR system can be characterized by a set of disjoint logic regions (LRs), grouping the resources that are always active/inactive at the same time. The MDC power management extension is capable of automatically identifying LRs. It performs the LRs identification at a high level of abstraction, on the reconfigurable IR, by exploiting the intrinsic modularity of the dataflow graphs. Once the LRs have been identified, the MDC dynamic power manager automatically applies, according to the user selection, either clock gating [17] or power gating [18] on the resulting CGR hardware platform. The identification of the minimal number of LRs is guaranteed, in order to minimize the power overhead of the extra logic needed to implement the selected power saving strategies. An overview of the power management extension is provided in Figure 1.

For each input DPN $G_i$, the currently available algorithm determines the set $V'_i$, which contains all the resources of the reconfigurable IR activated by $G_i$. These are the original sets of LRs that represent the starting point for the algorithm, which finds the final LRs by iteratively comparing two $V'_i$ sets at a time and determining their possible overlap. If an overlap is found, its resources are removed from the two considered sets and a new $V'_j$ (corresponding to a new LR) involving these shared resources is issued. $V'_j$ groups resources that are shared among different input DPNs, while the remaining resources in the two $V'_i$ will uniquely belong to the originally considered DPNs. This compare and split identification process guarantees that the number of LRs found by the MDC dynamic power manager is the minimum achievable one.


Figure 1: MDC design suite: baseline flow (a) and corresponding power extension (b). [Panel (a), MDC baseline flow: DPNs parsing, datapath merging, hardware platform generation. Panel (b), MDC power management extension: logic regions identification, power saving application.]

If this number is still too high for the considered target platform, as can happen when FPGAs are the target devices (in FPGA devices the number of hardware blocks that can drive the different LRs is limited; e.g., 32 BUFG units are available in Xilinx boards for clock management purposes), an LRs merging process has to be applied. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LRs merging process is applied. Two LRs at a time are unified until the constraint fixed by the user is met (details on how to merge different LR sets can be found in [17], where two merging strategies, a power-aware one and a number-aware one, are presented). This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra unnecessary power consumption.

During the HDL generation phase, the MDC power management extension also implements the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee on the effectiveness of the approach. Clock gating acts only on the dynamic contribution to the power consumption and requires a minimal logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR, plus one unique Enable Generator to properly set the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each flip-flop whose state has to be kept when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A different clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol, for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore,


Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR. [Panels: merging of the $\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$ input IRs; logic regions identification (LR1 to LR5); HDL generation of the clock gated platform (clock domains CD1 to CD5, Enable Generator) and of the power gated platform (power domains PD1 to PD5, state retention cells, isolation cells, power switches, Power Controller).]

one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switches, state retention cells, and isolation cells.

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled $\alpha$, $\beta$, and $\gamma$, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow, starting from these input IRs ($\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor $A$ between $\alpha$ and $\gamma$ and actor $C$ among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha = \{A, B, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$,
(ii) $V'_\beta = \{D, E, C, \mathrm{SB\_2}\}$,
(iii) $V'_\gamma = \{A, F, G, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$.

The compare and split identification process produces five different LRs:

(i) LR1 = $\{B\}$,
(ii) LR2 = $\{C, \mathrm{SB\_2}\}$,
(iii) LR3 = $\{D, E\}$,
(iv) LR4 = $\{A, \mathrm{SB\_0}, \mathrm{SB\_1}\}$,
(v) LR5 = $\{F, G\}$.

LR4 and LR2 involve shared resources, being activated, respectively, by $\alpha$ and $\gamma$ and by $\alpha$, $\beta$, and $\gamma$, while the remaining LRs involve nonshared resources: LR1 is activated only by $\alpha$, LR3 only by $\beta$, and LR5 only by $\gamma$.
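The identification step can be illustrated with a minimal Python sketch. It does not reproduce the pairwise compare and split procedure used by MDC; as a swapped-in shortcut, it groups each resource by the exact set of DPNs that activate it, which on this example yields the same five regions. The function name and the dictionary layout are assumptions made for illustration only.

```python
from collections import defaultdict

def identify_logic_regions(activated):
    """Group resources by the exact set of DPNs that activate them.

    `activated` maps a DPN name to its V' set of resources. This grouping
    is an illustrative stand-in for MDC's compare and split process.
    """
    signature = defaultdict(set)          # resource -> DPNs using it
    for dpn, resources in activated.items():
        for res in resources:
            signature[res].add(dpn)
    regions = defaultdict(set)            # set of DPNs -> resources
    for res, dpns in signature.items():
        regions[frozenset(dpns)].add(res)
    return dict(regions)

# Starting V' sets of the step-by-step example.
V = {
    "alpha": {"A", "B", "C", "SB_0", "SB_1", "SB_2"},
    "beta": {"D", "E", "C", "SB_2"},
    "gamma": {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"},
}
regions = identify_logic_regions(V)
for dpns, resources in sorted(regions.items(), key=lambda kv: sorted(kv[1])):
    print(sorted(resources), "activated by", sorted(dpns))
```

Each key of `regions` is the set of functionalities that activate the region, so the always-on region is simply the one whose key contains every DPN.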

At this point, the selected power saving strategy is applied during the CGR system HDL generation. Figure 2 shows both final designs, resulting from the application of clock gating and of power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. It can be noticed from Figure 2 that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.
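As an illustration of what the Enable Generator has to compute, the following Python sketch derives the clock enable values of the gated regions for each functionality of the example. The region activation sets come from the example above; the function names and the table layout are hypothetical and do not describe the HDL actually generated by MDC.

```python
# Regions activated by each functionality in the step-by-step example.
ACTIVE_LRS = {
    "alpha": {"LR1", "LR2", "LR4"},
    "beta": {"LR2", "LR3"},
    "gamma": {"LR2", "LR4", "LR5"},
}

def gated_regions(active_lrs):
    """Regions that need a clock gating cell: all regions except always-on ones."""
    all_lrs = set().union(*active_lrs.values())
    always_on = set.intersection(*active_lrs.values())
    return sorted(all_lrs - always_on)

def enable_table(active_lrs):
    """For each functionality, the enable value of every gated region."""
    gated = gated_regions(active_lrs)
    return {
        func: {lr: lr in lrs for lr in gated}
        for func, lrs in active_lrs.items()
    }

print(gated_regions(ACTIVE_LRS))        # four regions, one AND gate each
print(enable_table(ACTIVE_LRS)["beta"])
```

LR2 is excluded automatically because it belongs to every activation set, which matches the four AND gates reported for the clock gated platform.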

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs), where in this case SBoxes are also taken into consideration: power gating turns off the whole PD power supply, so it also affects combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is always on (being activated by all the input DPNs).

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a blindly applied unique power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated on the basis of the LR characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the larger its static dissipation.

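The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkgON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right].
\end{aligned}
\tag{1}
$$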

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkgON}}(\mathrm{LR}_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation on the LR when disabled; therefore, (1) does not present any $P_{\mathrm{lkgOFF}}(\mathrm{LR}_i)$ term.

$P_{\mathrm{lkgON}}(\mathrm{LR}_i)$ is obtained as the product of the LR activation time $T_{i\mathrm{ON}}$ and the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to their need (according to the implemented functionality) of preserving or not their status, by means of state retention cells, when the region is inactive. It therefore involves, in turn, two terms. The first one refers to the registers whose state can be lost and is estimated on the basis of the static consumption of the sequential cells ($P_{\mathrm{lkg}}(\mathrm{reg})$), scaled by the fraction of registers that are not retained. The second one refers to the retention cells and is estimated starting from the number of registers whose state has to be maintained (rtn), multiplied by the leakage of a single state retention cell ($P_{\mathrm{lkg}}(\mathrm{RC})$), whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the register values). Note that, unlike $P_{\mathrm{lkgON}}$, for the three abovementioned terms $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{ON}})$), multiplied by the activation time $T_{i\mathrm{ON}}$ and by the overall number of cells within the LR (e.g., iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}})$), multiplied by the inactive time $T_{i\mathrm{OFF}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR, but an a priori characterization phase is required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.
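The structure of (1) can be sketched in a few lines of Python. This is an illustrative transcription, not the MDC implementation: the `Actor` class, the function name, and the (on, off) tuple convention are assumptions, and $T_{i\mathrm{OFF}}$ is assumed to be $1 - T_{i\mathrm{ON}}$. The sample figures are the LR5 leakage values used in the worked example of Section 3.2.5, in nW.

```python
from dataclasses import dataclass

@dataclass
class Actor:
    p_cmb: float  # power of the combinatorial cells [nW]
    p_reg: float  # power of the sequential cells [nW]
    reg: int      # number of registers
    rtn: int      # number of registers that need state retention

def power_gating_estimate(actors, t_on, iso, p_rc, p_iso, p_contr, p_cg):
    """Evaluate the power gating model of (1) for one logic region.

    p_iso, p_contr and p_cg are (on, off) pairs; t_off is assumed to be
    1 - t_on. Actors without registers contribute no sequential term.
    """
    t_off = 1.0 - t_on
    on_state = 0.0
    for a in actors:
        not_retained = a.p_reg * (a.reg - a.rtn) / a.reg if a.reg else 0.0
        on_state += a.p_cmb + p_rc * a.rtn + not_retained
    on_state *= t_on

    def weighted(pair):
        return pair[0] * t_on + pair[1] * t_off

    overhead = weighted(p_iso) * iso + weighted(p_contr) + weighted(p_cg)
    return on_state + overhead

# LR5 (actors F and G), leakage figures in nW.
lr5 = [Actor(213, 1232, 128, 128), Actor(273, 1385, 128, 64)]
p_lkg_lr5 = power_gating_estimate(
    lr5, t_on=0.3, iso=32, p_rc=17.15,
    p_iso=(4.27, 1.39), p_contr=(95.44, 88.63), p_cg=(5.77, 4.71),
)
print(round(p_lkg_lr5, 3))
```

Keeping the on/off characterization of each overhead block as a pair makes the same function reusable for any block whose consumption is weighted by the activation and inactivity times.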


3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Commercial tools (e.g., Cadence Encounter Digital Implementation System) frequently consider dynamic power as composed of two main terms, as shown in (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44].

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} SW_j + f \sum_{\mathrm{cell}_i} P_i SW_i. \tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each net $j$ ($C_{\mathrm{load}_j}$) and the related switching activity ($SW_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($SW_i$).
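A small numerical sketch may help to read (2). The Python function below evaluates the two terms separately; all input values are invented for illustration, and the unit handling is an assumption (capacitances in farads and frequency in hertz for the net term, per-cell power in watts per MHz for the internal term).

```python
def dynamic_power(f_hz, vdd, nets, cells):
    """Evaluate the two terms of (2).

    nets:  iterable of (C_load [F], switching activity) pairs
    cells: iterable of (P_i [W/MHz], switching activity) pairs
    Returns (P_net, P_int) in watts. Units are an assumption of this sketch.
    """
    p_net = 0.5 * f_hz * vdd ** 2 * sum(c_load * sw for c_load, sw in nets)
    # P_i is given per MHz, so the frequency is expressed in MHz here.
    p_int = (f_hz / 1e6) * sum(p_cell * sw for p_cell, sw in cells)
    return p_net, p_int

# Hypothetical two-net, two-cell design clocked at 100 MHz with VDD = 1 V.
nets = [(2e-15, 0.25), (1e-15, 0.5)]
cells = [(4e-9, 0.25), (2e-9, 0.5)]
p_net, p_int = dynamic_power(100e6, 1.0, nets, cells)
print(p_net, p_int)
```

Both terms scale linearly with the frequency and with the switching activity, which is why accurate activity data are needed before the internal contribution can be estimated.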

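Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{intON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right].
\end{aligned}
\tag{3}
$$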

Considering a prospective power gating implementation, the different parts of (3) basically reflect those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of node switching activity. For this reason, the netlist of the baseline CGR system is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the optimal identification of the regions to be gated. We plan to extend the proposed methodology to include the nets contribution in the near future.

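3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and acts positively only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static and for the dynamic power estimation of a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right],
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{\mathrm{LR}_i}) + P_{\mathrm{intON}}(\mathrm{seq}_{\mathrm{LR}_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i\mathrm{ON}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{i\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i\mathrm{OFF}} \right].
\end{aligned}
\tag{5}
$$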

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating management is implemented at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is much the same as (4), apart from the fact that, in the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. LRs composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.


Table 1 reports, for each parameter considered in (1), (3), (4), and (5), its classification. A deeper explanation of $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations follows for LR5, which includes actors $F$ and $G$.

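When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 + \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 \\
&\quad + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 1509.219.
\end{aligned}
\tag{6}
$$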

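The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (53.7 + 38.325 \cdot 128) + (36.3 + 38.325 \cdot 64 + 4406.8 \cdot 0.5) \right] \cdot 0.3 + \left[ 0.27 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 \\
&\quad + \left[ 144.9 \cdot 0.3 + 148.8 \cdot 0.7 \right] + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] = 3071.272.
\end{aligned}
\tag{7}
$$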

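When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 3186.96, \\
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5\mathrm{ON}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5\mathrm{ON}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5\mathrm{OFF}} \right] \\
&= \left[ (53.7 + 2248.9 \cdot 0.3) + (36.3 + 4406.8 \cdot 0.3) \right] + \left[ 135.1 \cdot 0.3 + 132.0 \cdot 0.7 \right] + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] = 2245.18.
\end{aligned}
\tag{8}
$$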

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.
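The clock gating estimates for LR5 can be cross-checked with a short Python sketch of (4) and (5). The function names and the tuple-based data layout are assumptions made for illustration, $T_{5\mathrm{OFF}}$ is taken as $1 - T_{5\mathrm{ON}}$, and the figures are the LR5 values used in the worked equations, in nW.

```python
def weighted(pair, t_on):
    """Time-weighted power of a block given as an (on, off) pair."""
    return pair[0] * t_on + pair[1] * (1.0 - t_on)

def clock_gating_leakage(actors, t_on, p_enab, p_cg):
    """Model (4): leakage is unaffected by gating, so no time weighting
    is applied to the (cmb, reg) contributions of the actors."""
    logic = sum(p_cmb + p_reg for p_cmb, p_reg in actors)
    return logic + weighted(p_enab, t_on) + weighted(p_cg, t_on)

def clock_gating_internal(actors, t_on, p_enab, p_cg):
    """Model (5): sequential logic contributes only while the LR is on."""
    logic = sum(p_cmb + p_reg * t_on for p_cmb, p_reg in actors)
    return logic + weighted(p_enab, t_on) + weighted(p_cg, t_on)

# LR5: (cmb, reg) power pairs for actors F and G, in nW.
lkg_actors = [(213, 1232), (273, 1385)]
int_actors = [(53.7, 2248.9), (36.3, 4406.8)]

lkg_cg = clock_gating_leakage(lkg_actors, 0.3, (84.51, 76.54), (5.77, 4.71))
int_cg = clock_gating_internal(int_actors, 0.3, (135.1, 132.0), (16.9, 29.2))
print(round(lkg_cg, 2), round(int_cg, 2))
```

With these inputs the leakage estimate reproduces the value of the worked example, and the internal estimate agrees with it to within a few hundredths of a nW.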

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management


Table 1: Parameter classification. Table reports, for each considered parameter, typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch. | Funct. | Tech. | Description | Extraction |
|---|---|---|---|---|---|
| reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs among each other in the dataflow model |
| $T_{i\mathrm{ON}}$ | | x | | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $T_{i\mathrm{OFF}}$ | | x | | Off time | As above |
| rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{\mathrm{lkg/int}}(\mathrm{RC})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{\mathrm{lkg/int}}(\mathrm{ISO_{ON}})$ | | | x | As above | As above |
| $P_{\mathrm{lkg/int}}(\mathrm{ISO_{OFF}})$ | | | x | As above | As above |
| $P_{\mathrm{lkg/int}}(\mathrm{Contr_{ON}})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{Contr_{OFF}})$ | | | x | As above | As above |
| $P_{\mathrm{lkg/int}}(\mathrm{Enab_{ON}})$ | | | x | As above | As above |
| $P_{\mathrm{lkg/int}}(\mathrm{Enab_{OFF}})$ | | | x | As above | As above |
| $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |
| $P_{\mathrm{lkg/int}}(\mathrm{reg})$ | x | x | x | As above | As above |


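Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

| Parameter | lkg power [nW] | int power [nW] |
|---|---|---|
| $\mathrm{Enab_{ON}}$ | 84.51 | 135.1 |
| $\mathrm{Enab_{OFF}}$ | 76.54 | 132.0 |
| $\mathrm{Contr_{ON}}$ | 95.44 | 144.9 |
| $\mathrm{Contr_{OFF}}$ | 88.63 | 148.8 |
| $\mathrm{CG_{ON}}$ | 5.77 | 16.9 |
| $\mathrm{CG_{OFF}}$ | 4.71 | 29.2 |
| $\mathrm{ISO_{ON}}$ | 4.27 | 0.27 |
| $\mathrm{ISO_{OFF}}$ | 1.39 | 0 |
| $P(\mathrm{RC})$ | 17.15 | 38.325 |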

PG_set is empty
CG_set is empty
foreach LR_i in set LRs do
    evaluate_area(LR_i, area_th)
end

function evaluate_area(LR_i, area_th)
    calculate LR_i area
    if area_LR > area_th then
        evaluate_PG(LR_i)
    else
        evaluate_CG(LR_i)
    end

function evaluate_PG(LR_i)
    estimate PG_total_overhead
    if PG_total_overhead < 0 then
        estimate CG_total_overhead
        if PG_total_overhead < CG_total_overhead then
            add LR_i to PG_set
        else
            add LR_i to CG_set
        end
    else
        evaluate_CG(LR_i)
    end

function evaluate_CG(LR_i)
    estimate CG_total_overhead
    if CG_total_overhead < 0 then
        add LR_i to CG_set
    end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always on domain. Thus, for small LRs we can assume that it will not bring any benefit, so that power gating is not considered for implementation. Clock gating, instead, may still be beneficial, due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always on domain.
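The selection logic can be sketched in Python as follows. This is an illustrative rendering of Algorithm 1 under stated assumptions, not the MDC code: the total overhead of a technique is assumed to be the estimated total power with that technique minus the baseline (ungated) power of the region, so that a negative value means a net saving. The dictionary keys and the sample regions are hypothetical.

```python
def select_strategies(regions, area_th):
    """Classify each logic region as power gated, clock gated or always on.

    `regions` maps a region name to a dict with hypothetical keys:
      area - region area as a percentage of the total area
      base - baseline (ungated) total power
      pg   - estimated total power with power gating, (1) + (3)
      cg   - estimated total power with clock gating, (4) + (5)
    """
    pg_set, cg_set, always_on = [], [], []
    for name, r in regions.items():
        pg_overhead = r["pg"] - r["base"]   # negative means a net saving
        cg_overhead = r["cg"] - r["base"]
        if r["area"] > area_th and pg_overhead < 0:
            # Both techniques are candidates: keep the cheaper one.
            (pg_set if pg_overhead < cg_overhead else cg_set).append(name)
        elif cg_overhead < 0:
            cg_set.append(name)
        else:
            always_on.append(name)          # gating would not pay off
    return pg_set, cg_set, always_on

# Hypothetical regions, not taken from the paper.
example = {
    "big":   {"area": 30, "base": 100.0, "pg": 40.0, "cg": 60.0},
    "mid":   {"area": 20, "base": 100.0, "pg": 80.0, "cg": 70.0},
    "small": {"area": 2,  "base": 10.0,  "pg": 5.0,  "cg": 8.0},
    "tiny":  {"area": 1,  "base": 5.0,   "pg": 9.0,  "cg": 6.0},
}
pg_set, cg_set, always_on = select_strategies(example, area_th=5)
print(pg_set, cg_set, always_on)
```

Note that the region named `small` is clock gated even though power gating would save more, because regions below the area threshold are never considered for power gating.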

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR_class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing this flow with respect to what is discussedin Section 31 does not require designers to opt for a specificpower management technique On the basis of the proposedpower estimation models and by linking MDC with a logicsynthesis tool the presented flow is capable of overcomingthe limit of providing a one-fit-to-all solution Each LR in aCGR design is supported (where necessary) with the optimalpower management technique


Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

Logic region   Kernel   T_ON   Actors           iso   reg   rtn
LR1            α        0.1    B                32    514   24
LR3            β        0.6    D, E             32    8     8
LR4            α, γ     0.4    A, SB_0, SB_1    96    256   64
LR5            γ        0.3    F, G             32    265   192

        Power [nW]
Actor   lkg seq   int seq   lkg comb   int comb   reg   rtn
B       801       104987    121411     3916599    512   24
D       48        1104      51         319        4     4
E       56        1437      53         198        4     4
A       3264      89238     0          0          256   64
SB_0    0         0         307        409        -     -
SB_1    0         0         225        350        -     -
F       1232      22489     213        537        128   128
G       1385      44068     273        363        128   64

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

Logic region   lkg PG [nW]   int PG [nW]   lkg CG [nW]   int CG [nW]
LR1            12406.9       404358.71     122294.15     3928700.60
LR3            342.67        3884.44       294.67        3599
LR4            2062.11       38705.08      3880.86       5598.6
LR5            1509.58       30712.72      3186.96       22451.8

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, the MDC LR identification determined five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of the total area.
    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, leading overall to a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of the total area.
    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of the total area.
    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support. The MDC baseline flow (DPNs parsing, datapath merging, hardware platform generation) is extended with the hybrid clock and power gating flow (logic regions identification, power analysis, CGPG HDL generation), which exchanges scripts and reports with the synthesizer and produces the optimal HDL plus CPF.

        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, leading overall to a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of the total area.
    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, leading overall to a larger power saving. Thus, LR5 is added to CG_set.
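The percentages quoted in the steps above can be re-derived from the per-region consumptions and the base design consumption reported in Figure 4. The snippet below is a minimal check of that arithmetic; the names `BASE_TOTAL_NW`, `LR_POWER_NW`, and `overhead_percent` are illustrative, and the assignment of the four data rows of Figure 4 to LR1, LR3, LR4, and LR5 is assumed from the matching area values.

```python
# Total consumption of the baseline design, in nW (Figure 4).
BASE_TOTAL_NW = 4311417

# Per-region consumption in nW: (baseline, power gated, clock gated), from Figure 4.
LR_POWER_NW = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}


def overhead_percent(gated_nw: float, base_nw: float,
                     total_nw: float = BASE_TOTAL_NW) -> float:
    """Power variation of one region, as a percentage of the whole baseline design.

    A negative result means that the gated implementation saves power.
    """
    return 100.0 * (gated_nw - base_nw) / total_nw


overheads = {
    lr: (round(overhead_percent(pg, base), 2), round(overhead_percent(cg, base), 2))
    for lr, (base, pg, cg) in LR_POWER_NW.items()
}

for lr, (pg_over, cg_over) in overheads.items():
    print(f"{lr}: PG {pg_over:+.2f}%  CG {cg_over:+.2f}%")
```

With two decimals, LR4 evaluates to −1.23% (PG) and −1.96% (CG), which the text reports as −1.2% and −2.0%.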

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1 (threshold Area_th = 5%; base design consumption 4311417 nW). Data reported in the figure:

LR    Area [%]   Base [nW]   PG [nW] (over)      CG [nW] (over)
LR1   52         4143798     416765 (−86.45%)    4050955 (−2.15%)
LR3   0.4        3266        4227 (+0.02%)       3894 (+0.01%)
LR4   7          93793       40767 (−1.2%)       9479 (−2.0%)
LR5   15         70560       32222 (−0.89%)      25639 (−1.04%)

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$ X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \qquad k = 0, 1, \ldots, N-1. \tag{9} $$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs are called butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$ X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10} $$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$ w_n^k = e^{-i 2\pi (k/n)}, \tag{11} $$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.
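As an illustration of how butterflies compose into a full transform, the following sketch builds a recursive radix-2 decimation-in-time FFT out of the butterfly operation and checks it against a direct evaluation of the DFT. It is a software model written for clarity, not the pipelined hardware design of the use case; the function names are hypothetical, and the twiddle factor is taken as $e^{-i 2\pi k/n}$, the standard Cooley-Tukey definition.

```python
import cmath
from typing import List, Sequence, Tuple


def twiddle(k: int, n: int) -> complex:
    """Twiddle factor w_n^k = exp(-i * 2 * pi * k / n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0: complex, x1: complex, w: complex) -> Tuple[complex, complex]:
    """Radix-2 butterfly: X0 = x0 + x1 * w, X1 = x0 - x1 * w."""
    t = x1 * w
    return x0 + t, x0 - t


def dft(x: Sequence[complex]) -> List[complex]:
    """Direct O(N^2) evaluation of the DFT, used here as a reference."""
    n = len(x)
    return [sum(x[j] * twiddle(k * j, n) for j in range(n)) for k in range(n)]


def fft_radix2(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


samples = [1, 2, 3, 4, 0, -1, -2, -3]  # arbitrary size-8 input
fast = fft_radix2(samples)
slow = dft(samples)
print(max(abs(a - b) for a, b in zip(fast, slow)))  # numerical difference only
```

For a size-8 input the recursion unrolls into three levels of four butterflies each, which corresponds to the 12 butterflies of the baseline design.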

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to compute the outputs.

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves a single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (power in µW for 1b at 24 clock cycles, 2b at 12, 4b at 6, and 12b at 3).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery-operated environment, when the battery level drops below a given threshold, some throughput can be given up in order to consume less power.
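A run-time policy of this kind could look like the sketch below, which picks the fastest FFT configuration whose power fits a given budget. The policy itself, the function `pick_config`, and the budget values are assumptions made for illustration; only the latencies and the power figures (static plus internal, for the baseline design at 90 nm) come from the configuration list and from Table 5.

```python
from typing import Optional

# (name, latency in clock cycles, static + internal power in nW), from Table 5.
# Ordered from the fastest to the slowest configuration.
CONFIGS = [
    ("12b", 3, 1776056 + 2411257),
    ("4b", 6, 1757485 + 2020953),
    ("2b", 12, 1752106 + 1539855),
    ("1b", 24, 1750407 + 1307259),
]


def pick_config(power_budget_nw: float) -> Optional[str]:
    """Return the lowest-latency configuration that fits the power budget.

    Returns None when even the single-butterfly configuration exceeds it.
    """
    for name, _latency, power_nw in CONFIGS:
        if power_nw <= power_budget_nw:
            return name
    return None


# Hypothetical budgets, e.g. derived from the battery level.
for budget in (5_000_000, 4_000_000, 3_500_000, 3_000_000):
    print(budget, pick_config(budget))
```

Lowering the budget moves the selection from 12b towards 1b, trading throughput for power as in Figure 7.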

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934

Figure 8: FFT use case: area percentage per LR (FFT, 90 nm). LR1: 62.48%, LR2: 0.65%, LR3: 15.8%, LR4: 0.44%, LR5: 0.45%, LR6: 0.43%, LR7: 7.92%, LR8: 1.36%. For each LR the figure also reports the split between sequential and combinatorial logic.

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design, it is possible to retrieve the area occupancy and the logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.
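Besides area and logic composition, the models rely on the activation time of each LR. As a cross-check of the data in Tables 5 and 6, the $T_{ON}$ row of Table 6 can be rebuilt by summing, for every LR, the activation times of the configurations that use it. The helper name `lr_activation_times` is illustrative; the input values are those of Table 5.

```python
from typing import Dict, Tuple

# Activation time of each FFT configuration and the LRs it activates (Table 5).
CONFIG_T_ON: Dict[str, float] = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS: Dict[str, Tuple[int, ...]] = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def lr_activation_times(t_on: Dict[str, float],
                        lrs: Dict[str, Tuple[int, ...]]) -> Dict[int, float]:
    """Sum, for each LR, the activation times of the configurations using it."""
    totals: Dict[int, float] = {}
    for cfg, regions in lrs.items():
        for lr in regions:
            totals[lr] = totals.get(lr, 0.0) + t_on[cfg]
    # Round to two decimals to hide floating-point noise in the sums.
    return {lr: round(total, 2) for lr, total in sorted(totals.items())}


lr_t_on = lr_activation_times(CONFIG_T_ON, CONFIG_LRS)
for lr, value in lr_t_on.items():
    print(f"LR{lr}: T_ON = {value:.2f}")
```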

Figure 9: FFT use case: comparison between the estimated and real power variation (in %) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overheads due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected; the error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, one of the smallest regions, does not provide any saving if power gated, yet it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

      PG overhead [%]       CG overhead [%]
LR    Static    Internal    Static   Internal
LR1   −40.87    −9.47       0.00     −11.33
LR2   +0.23     −1.32       0.00     −2.85
LR3   −10.44    −2.11       0.00     −2.59
LR4   −0.08     0.10        -        -
LR5   0.08      0.13        -        -
LR6   0.11      0.17        -        -
LR7   −3.44     −0.39       0.00     −0.79
LR8   1.42      2.35        0.00     −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

      PG overhead [%]           Overhead error [%]      Net [%]
LR    Est      Real     Err     Iso     Rtn    Contr    Err
LR1   −25.22   −25.37   0.59    5.18    4.64   0.75     1.03
LR3   −6.28    −6.22    1.07    16.34   2.97   0.76     1.26
LR7   −1.92    −1.92    0.24    9.24    1.08   0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

      CG overhead [%]         Overhead [%]   Net [%]
LR    Est     Real    Err     CGcell Err     Err
LR1   −5.73   −5.77   0.06    0.83           0.18
LR2   −1.42   −1.42   0.12    1.03           0.40
LR3   −1.34   −1.38   0.05    0.84           0.21
LR7   −0.39   −0.39   0.05    0.96           0.22
LR8   −0.12   −0.12   0.29    1.08           0.47

Figure 10: FFT use case: comparison between the estimated and real power variation (in %) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: Threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All of them lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: Threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is also performed on LR7, which results in a negative overhead. Hence, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
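The effect of the two thresholds can be reproduced with a short script. It is a simplification of Algorithm 1 that skips the overhead estimation and relies on two outcomes reported in the text: every region above the threshold benefits from power gating, and every sub-threshold region with sequential logic benefits from clock gating. The area percentages are the ones read from Figure 8; the names `LR_AREA_PCT`, `FULLY_COMBINATORIAL`, and `split_regions` are illustrative.

```python
from typing import List, Tuple

# Area of each LR as a percentage of the whole system (read from Figure 8).
LR_AREA_PCT = {
    "LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
    "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36,
}

# Regions without sequential logic cannot be clock gated.
FULLY_COMBINATORIAL = {"LR4", "LR5", "LR6"}


def split_regions(area_th_pct: float) -> Tuple[List[str], List[str]]:
    """Return (power gated regions, clock gated regions) for a given threshold.

    Assumes that PG pays off for all regions above the threshold and CG for
    all remaining regions that contain sequential logic.
    """
    pg_set = [lr for lr, area in LR_AREA_PCT.items() if area > area_th_pct]
    cg_set = [
        lr for lr, area in LR_AREA_PCT.items()
        if area <= area_th_pct and lr not in FULLY_COMBINATORIAL
    ]
    return pg_set, cg_set


for threshold in (5, 10):
    pg_set, cg_set = split_regions(threshold)
    print(f"DAT_{threshold}: PG {pg_set}  CG {cg_set}")
```

Raising the threshold from 5% to 10% moves LR7 from the power gated set to the clock gated one, while LR4, LR5, and LR6 stay in the always on domain in both cases.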

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted (Base, CG_full, PG_full, DAT_5, and DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overheads are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON   Functionality                       LRs
abs          1        0.03   Absolute value calculation          (4)
chgb         7        0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding             (11)
sbwlabel     17       0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always on domain.

Design   >Th                                        PG_set                           CG_set                                                 NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13   LR1, LR2, LR3, LR8, LR10, LR12   LR4, LR6, LR7, LR9, LR11, LR13                         LR5
DAT_10   LR1, LR8                                   LR1, LR8                         LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13   LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified in the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario, using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption of each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, there are on average more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore clock gating does not provide any positive effect on it. At the same time, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

       PG saving [%]               Net [%]   CG saving [%]               Net [%]
LR     Est       Real      Err     Err       Est       Real      Err     Err
LR1    −6.322    −6.331    0.15    0.18      −6.307    −6.313    0.09    0.17
LR2    −23.464   −23.463   0.00    1.26      −23.369   −23.373   0.02    1.67
LR3    −6.537    −6.544    0.10    1.65      −6.507    −6.514    0.10    1.59
LR4    -         -         -       -         −0.958    −0.964    0.68    1.34
LR5    -         -         -       -         -         -         -       -
LR6    -         -         -       -         −0.870    −0.878    0.81    1.22
LR7    -         -         -       -         −1.033    −1.039    0.63    0.15
LR8    −7.062    −7.058    0.05    2.02      −6.987    −6.986    0.01    1.83
LR9    -         -         -       -         −2.436    −2.443    0.27    1.55
LR10   −5.498    −5.503    0.10    1.65      −5.474    −5.480    0.11    1.57
LR11   -         -         -       -         −3.329    −3.285    1.32    3.67
LR12   −4.278    −4.288    0.23    1.83      −4.267    −4.273    0.15    1.86
LR13   -         -         -       -         −0.649    −0.656    1.01    1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

baseThat is due to the 12 (one for each region but LR5) clockgating cells introduced in the always on domain of this designwhich never contribute to save any static power consumptionWhen power gating is applied there is always a benefit interms of static power consumption DAT 5 saving is slightlyhigher than the PG full one both being over 51 DAT 10is still beneficial but its saving is limited to the 15 Pleasenote that the difference between DAT 5 and DAT 10 (interms of static consumption) demonstrates that in the areathresholding step of the proposed Algorithm 1 it is better toopt for small area threshold values to achieve higher savingresults

In terms of area occupancy reported in the legend ofFigure 13 the PG full design is the one with the largestoverhead +64 DAT 5 which is the most beneficial in

terms of power shows a slightly smaller overall overhead+455 of the whole system area The CG full is again theless invasive one leading just to +173 of area overheadSummarizing DAT 5 constitutes the optimal solution forthe zoom coprocessor scenario considering a 90 nm tech-nology DAT 10 which is less beneficial than DAT 5 insaving static power consumption is a better solution than afully power gated design presenting basically the same powersaving (minus6238 for DAT 10 versus minus6229 for PG full)but a smaller area overhead (+319 for DAT 10 versus+64 for PG full)

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12, for both power and clock gating implementations, also reports the error introduced by neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the area threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

[Figure 14 plot: bar chart of the leakage (lkg), internal (int), and total (tot) power, in μW, for the designs Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), and DAT_10 (+3.35%).]

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add in this second trial an additional design, DAT_1. With the area threshold set to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, as the area threshold is lowered, the area of the optimal design and that of the fully power gated one become pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction, when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).
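The error columns of Table 14 can be checked by hand. The sketch below assumes that the error is the relative deviation of the estimate from the real value, expressed in percent; this section does not restate the formula, so the definition and the function name `relative_error_pct` are assumptions. The sample values are the LR11 power gating entries of Table 14, read as percentages with three decimals. Since the published errors were presumably computed on unrounded data, only approximate agreement should be expected.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative deviation of an estimate from the measured value, in percent.

    Assumption: this is how the Err. columns are defined; the formula
    is not restated in this section of the paper.
    """
    if real == 0:
        raise ValueError("real value must be non-zero")
    return 100.0 * abs(estimated - real) / abs(real)


# LR11, power gating, 45 nm (Table 14): estimated vs. real saving, in percent.
lr11_error = relative_error_pct(-3.422, -3.361)
print(f"LR11 PG estimation error: {lr11_error:.2f}%")
```

The same function can be applied row by row to spot entries whose reported error does not match the rounded estimate and real values.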

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \ast \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch off $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | >Th | PG set | CG set | NA
DAT_1  | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5
DAT_5  | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR   | PG Est. | PG Real | PG Err. | PG Net Err. | CG Est. | CG Real | CG Err. | CG Net Err.
LR1  | −5.686  | −5.699  | 0.23 | 0.47 | −5.507  | −5.501  | 0.13 | 0.53
LR2  | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81
LR3  | −6.569  | −6.566  | 0.04 | 1.13 | −6.273  | −6.262  | 0.18 | 0.91
LR4  | −0.946  | −0.944  | 0.15 | 1.82 | −0.919  | −0.917  | 0.22 | 1.52
LR5  | --      | --      | --   | --   | --      | --      | --   | --
LR6  | −0.747  | −0.748  | 0.25 | 0.25 | −0.853  | −0.833  | 0.26 | 1.59
LR7  | −0.955  | −0.957  | 0.17 | 1.36 | −0.935  | −0.933  | 0.21 | 1.49
LR8  | −7.464  | −7.460  | 0.06 | 1.36 | −6.724  | −6.714  | 0.16 | 0.66
LR9  | −2.475  | −2.482  | 0.25 | 1.10 | −2.350  | −2.345  | 0.18 | 1.09
LR10 | −5.473  | −5.470  | 0.04 | 1.09 | −5.270  | −5.261  | 0.17 | 0.91
LR11 | −3.422  | −3.361  | 1.79 | 2.58 | −3.257  | −3.205  | 1.63 | 2.04
LR12 | −4.327  | −4.330  | 0.08 | 1.19 | −4.176  | −4.169  | 0.17 | 0.98
LR13 | −0.587  | −0.580  | 1.25 | 3.70 | −0.623  | −0.621  | 0.28 | 1.85

phase of the proposed model is a real use case, but it is a small-size design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
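The two percentages above follow from dividing the 4-cycle wake-up time by the kernel execution time. A minimal sketch of this arithmetic is given below; the function and constant names are illustrative and do not belong to the MDC tool.

```python
# Cycles required by the switch on/off protocol for each transition (Section 4.3).
WAKE_UP_CYCLES = 4


def wake_up_overhead_pct(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time expressed as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


zoom_overhead = wake_up_overhead_pct(24)    # fastest zoom coprocessor kernel
hevc_overhead = wake_up_overhead_pct(160)   # smallest (8 x 8) HEVC interpolation block
print(f"zoom: {zoom_overhead:.1f}%  HEVC: {hevc_overhead:.1f}%")  # zoom: 16.7%  HEVC: 2.5%
```

Under this simple ratio, the overhead becomes negligible only when kernels run for well over a hundred cycles, which is the condition met by the HEVC example.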

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, therefore, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR  | PG overhead Est. | PG overhead Real | Err. (%)
LR1 | −30.54 | −26.10 | 17.07
LR2 | −0.19  | −1.99  | 90.22
LR3 | −21.83 | −6.95  | 214.30
LR4 | −0.12  | −0.67  | 80.97
LR5 | −0.06  | −0.59  | 90.11
LR6 | −0.07  | 0.56   | 86.60
LR7 | −15.39 | −2.64  | 482.93
LR8 | −0.38  | −0.02  | 1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any powermanagement support

(2) synthesize one power gated design and one clockgated design for each LR

(3) perform 119873 different hardware simulations for thebaseline design to retrieve the real switching activityof the system

(4) perform 119873 different hardware simulations for eachpower gated and clock gated design to retrieve theirreal switching activity

(5) compare each power gated design and clock gateddesign in the different operating conditions withrespect to the synthesized baseline CGR design

Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
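The counts above can be reproduced with a few lines of arithmetic. The following sketch encodes the effort with and without the proposed flow, skipping the clock gated synthesis for fully combinatorial regions; the function names are illustrative and are not part of MDC.

```python
def effort_without_flow(n_functionalities: int, k_regions: int, n_combinatorial: int = 0):
    """Syntheses and simulations needed when every gated variant is built and simulated.

    One power gated design per LR, one clock gated design per LR that is not
    fully combinatorial, plus the baseline; each design is simulated once per
    functionality.
    """
    syntheses = k_regions + (k_regions - n_combinatorial) + 1
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def effort_with_flow(n_functionalities: int):
    """The proposed flow: baseline synthesis only, simulated once per functionality."""
    return 1, n_functionalities


fft_brute = effort_without_flow(4, 8, n_combinatorial=3)    # (14, 56)
zoom_brute = effort_without_flow(7, 13, n_combinatorial=1)  # (26, 182)
print("FFT :", effort_with_flow(4), "vs", fft_brute)
print("zoom:", effort_with_flow(7), "vs", zoom_brute)
```

With no combinatorial regions, the first function reduces to the $2k + 1$ syntheses and $N(2k + 1)$ simulations stated in the text.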

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow-to-hardware system composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps forward both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her Ph.D. scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.



designers need to consider both terms when conceiving smart management strategies. Several techniques (clock gating, multifrequency, operand isolation, multithreshold, multisupply libraries, power gating, etc.) exist, and in some cases they are automatically implemented by commercial synthesis/place-and-route tools. In custom computing systems, some advanced design tools support the designers in the application-driven customization of the hardware architectures [30, 31]. However, generally speaking, invasive techniques (requiring the insertion of additional logic and target technology support, such as the availability of dedicated cells and processes on the implementation stack) still need tools to be guided, with significant manual effort, by the designer.

Clock gating is an example of a quite noninvasive technique. It may reduce the dynamic power consumption due to the clock tree and to sequential logic by up to 40% [32]. It consists in shutting off the clock of the unused synchronous logic by means of simple AND gates. Clock gating has been deeply automated, and it is available on most commercial synthesizers. In the MPEG-RVC community, recent studies [33] presented an extension of a High-Level Synthesis tool, Xronos, to selectively switch off the clock signal for parts of the circuit that are idle due to stalls in the pipeline, in order to reduce power consumption. Moreover, as mentioned, the MDC tool has the capability of identifying, by means of a graph-based analysis of the input dataflow specifications, independent circuitry regions. These logic regions can be clock gated to dynamically adapt power consumption when switching between different functionalities [17, 18]. From the technical point of view, in ASIC designs AND gates can be used directly on the clock to disable it, while in FPGA designs the clock network cannot be modified by the insertion of any custom logic, and dedicated cells are required (Xilinx boards, e.g., are equipped with dedicated blocks (BUFGs) whose outputs can drive distinct regions of logic, powering down different design portions (when enabled)). Clock gating can be applied at different granularities: fine-grained approaches act on single registers, whereas coarse-grained ones refer, as in [17, 18, 33], to a set of resources. Commercial synthesizers normally can automate only fine-grained strategies.

Power gating is quite invasive. The main idea behind it is as follows: if a specific portion of the design is not used in a given computation mode, then it can be completely switched off by means of a sleep transistor. This technique, like the clock gating one, is applicable at different granularities: fine-grained approaches require driving a different sleep transistor for every cell in the system, while coarse-grained ones, again, operate on a set of resources, instantiating one sleep transistor to drive different cells connected to a shared power network. MDC, as discussed in [18], also supports automatic power gating for CGR architectures. Each identified logic region in the CGR system is implemented (regardless of its nature or characteristics) as a different power domain (PD) that, in order to be managed, requires inserting and driving the following resources:

(i) The sleep transistor, between the gated region and the main power supply, to switch on/off the derived power supply.

(ii) The isolation logic, between the gated region and normally-on cells, to avoid the transmission of spurious signals in input to the normally-on cells.

(iii) The state retention logic, to maintain, where needed, the internal state of the gated region.

MDC, besides defining the power gated design netlist, also provides the automatic definition of the power format file. This file specifies the shut-off, isolation, and state retention rules (if you have 10 PDs, you are required to define 3 * 10 = 30 interfaces), along with their respective enable signals ([1 * shut-off + 1 * isol + 2 * reten] * 10 = 40 signals), and a dedicated Power Controller to properly drive them. The power format file automatically created by MDC is compliant with the Silicon Integration Initiative’s Common Power Format (CPF), whose definition is driven mainly by developers using Cadence [34] tools.
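To make the scaling of this bookkeeping explicit, the sketch below counts the rules and enable signals as a function of the number of power domains, using the per-domain figures quoted above. The function and dictionary names are illustrative assumptions; they do not reproduce MDC internals or CPF syntax.

```python
# Enable signals needed by each rule of a single power domain, as quoted in the text:
# 1 for shut-off, 1 for isolation, 2 for state retention.
ENABLE_SIGNALS_PER_RULE = {"shut_off": 1, "isolation": 1, "state_retention": 2}


def power_format_counts(power_domains: int):
    """Return (rules, enable signals) that the power format file must define."""
    if power_domains < 0:
        raise ValueError("power_domains must be non-negative")
    rules = len(ENABLE_SIGNALS_PER_RULE) * power_domains
    signals = sum(ENABLE_SIGNALS_PER_RULE.values()) * power_domains
    return rules, signals


print(power_format_counts(10))  # (30, 40)
```

Both quantities grow linearly with the number of power domains, which is why writing the file by hand quickly becomes error prone.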

2.2.1. Modelling. To the best of our knowledge, the literature does not treat the problem of modelling power gating and clock gating costs in CGR designs. Some approaches address the issue only partially. For example, [35] focusses on low-power techniques and power modelling for FPGAs. In [36] only clock gating is taken into account: different power states (on the basis of the clock enable signals) are defined, and their consumption is characterized by low-level Power Analysis results. Finally, [37] focusses on estimating the leakage reduction for power gating and reverse body bias.

The CASPER simulator for shared memory many-core processors [38] includes precharacterized libraries containing power dissipation models of different hardware components, enabling accurate power estimation at a high-level exploration stage. In particular, the authors implement Chip-wide Dynamic Voltage Frequency Scaling and Performance-Aware Core-Specific Frequency Scaling. The FALPEM framework [39] provides power estimations at the pre-register transfer level (RTL) stage, specifically targeting the power consumed by the clock network and the interconnect, but power and clock gating costs are not defined. Other approaches perform an estimation that considers different components. Li et al. [40] propose an architecture-level integrated power, area, and timing modelling framework for multicore systems, which evaluates system building blocks (CPU, buses, etc.) for different technology nodes, also providing power gating support. Finally, the work presented in [41] focuses on on-chip networks.

3. Design Suite for Coarse-Grained Power-Aware Systems

This section discusses the proposed technique for modelling the power consumption of a CGR system when clock gating or power gating is applied. These models, combined in a selection algorithm, can be exploited to develop an automated design flow for power-efficient CGR systems, where the optimal saving strategy is selected for each identified working set of resources.

In this work we have embedded these models and the algorithm in the Multi-Dataflow Composer (MDC) tool, a framework capable of characterizing CGR systems. MDC provides a comprehensive design suite that automates several tasks in the synthesis and development of CGR systems, within design flows targeting both FPGA [29] and ASIC. The tool provides extensive support for dynamic power management [17, 18], addressing power-constrained design scenarios and completely automating the implementation and control of clock gating and power gating strategies in the final CGR platform. Nevertheless, such techniques are not currently addressed in a hybrid manner: users must choose the approach to be used in the design without any a priori automated analysis. Such an unsupported selection may easily lead to suboptimal implementations on the final platform.

In the following, Section 3.1 provides an overview of the MDC baseline functionality and of the current power management support, while Section 3.2 discusses the proposed power models and the automated selection algorithm that identifies the optimal power management strategy in CGR systems. In both sections, step-by-step examples are presented to clarify the methodology.

3.1. The Multi-Dataflow Composer Tool. MDC automates the generation and management of CGR systems, tackling the complex mapping of multiple applications onto a single reconfigurable architecture. It automates the mapping process and minimizes the hardware resources, allowing for significant area/energy savings [12, 42]. In the literature this problem is known as datapath merging: it deals with the combination of a set of input datapaths, described by means of graphs, onto a single reconfigurable datapath. It aims at maximally sharing (among the different input graphs) both processing nodes and connections.

MDC is naturally compliant with the RVC-CAL formalism and natively supports Dataflow Process Network (DPN) models as input. Currently it is interfaced with the Open RVC-CAL Compiler, Orcc [43]. The Orcc front-end is responsible for parsing, one by one, the high-level DPN specifications of the different datapaths that MDC will merge within the CGR system. Please note that MDC can be interfaced with other graph parsers, so that it can easily be adapted to any other dataflow-based modelling environment.

As depicted in Figure 1, the MDC baseline flow involves three main phases:

(1) The input DPNs parsing, performed by the Orcc front-end, translates the RVC-CAL specifications into Java Intermediate Representations (IRs), which are basically directed graphs.

(2) The datapath merging, performed by the MDC front-end, combines the IRs into a reconfigurable IR, inserting (where necessary) special switching actors (responsible for properly distributing the token flow among the different merged DPNs) and keeping track of the system programmability through a dedicated Configuration Table (TAB in Figure 1).

(3) The hardware platform generation, performed by the MDC back-end, leads to the creation of the RTL that describes the CGR system itself, where each actor of the reconfigurable IR is mapped onto a different hardware FU. In this phase the hardware communication protocol and the HDL (Hardware Description Language) components library (providing the RTL descriptions of the required FUs, manually or automatically generated) are provided as input to the tool.

At the hardware level, reconfiguration takes place in a single clock cycle. It is achieved through low overhead switching elements (SBoxes) that allow common resources to be shared among different input DPNs. SBoxes are simple combinatorial multiplexers and demultiplexers whose configuration is stored in dedicated Look-Up tables that, according to the Configuration Table, compute the selectors necessary to forward data correctly and thus implement the requested functionality.

3.1.1. Automated Power Management. When dealing with reconfigurable architectures, and in particular with CGR systems, power consumption has to be carefully taken into consideration. Such systems are affected by resource redundancy, mainly due to the FUs that are not shared among different functionalities. Thus, when a certain functionality is executed, the part of the design not involved in the computation is idle and can uselessly consume precious power. Fortunately, the unused resources for each implemented functionality depend on the input specifications and are therefore fixed at design time.

Given these considerations, a CGR system can be characterized by a set of disjoint logic regions (LRs), grouping the resources that are always active/inactive at the same time. The MDC power management extension is capable of automatically identifying LRs. It performs the LRs identification at a high level of abstraction, on the reconfigurable IR, by exploiting the intrinsic modularity of the dataflow graphs. Once the LRs have been identified, the MDC dynamic power manager automatically applies, according to the user selection, either clock gating [17] or power gating [18] on the resulting CGR hardware platform. Identifying the minimal number of LRs minimizes the power overhead of the extra logic needed to implement the selected power saving strategy. An overview of the power management extension is provided in Figure 1.

For each input DPN $G_i$, the currently available algorithm determines the set $V'_i$, which contains all the resources of the reconfigurable IR activated by $G_i$. These sets are the initial LRs and represent the starting point for the algorithm, which finds the final LRs by iteratively comparing two $V'_i$ sets at a time and checking whether they overlap. If an overlap is found, the shared resources are removed from the two considered sets and a new set $V'_j$ (corresponding to a new LR) containing them is created. $V'_j$ groups resources that are shared among different input DPNs, while the remaining resources in the two $V'_i$ belong uniquely to the originally considered DPNs. This compare and split identification process guarantees that the number of LRs found by the MDC dynamic power manager is the minimum achievable one.


[Figure 1 diagram. (a) MDC baseline flow: DPNs parsing (Orcc), datapath merging (producing the IR and the TAB), and hardware platform generation, which takes the HDL actors library and the communication protocol as inputs and produces the HDL. (b) MDC power management extension: logic regions identification, followed by power saving application during HDL generation, producing either the clock gated HDL or the power gated HDL + power controller + CPF.]

Figure 1: MDC design suite: baseline flow (a) and corresponding power extension (b).

If this number is still too high for the considered target platform, as can happen when FPGAs are the target devices (in FPGA devices the number of hardware blocks that can drive the different LRs is limited; e.g., 32 BUFG units are available in Xilinx boards for clock management purposes), an LRs merging process has to be applied. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LRs merging process is applied. Two LRs at a time are unified until the constraint fixed by the user is met (details on how to merge different LR sets can be found in [17], where two merging strategies, a power-aware one and a number-aware one, are presented). This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra unnecessary power consumption.

During the HDL generation phase, the MDC power management extension also implements the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee on the effectiveness of the approach. Clock gating acts only on the dynamic contribution to the power consumption and requires a minimal logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR, plus a single Enable Generator that sets the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each flip-flop whose state has to be kept when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A separate clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol, for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore,


[Figure 2 diagram. Merging: the three input IRs ($\alpha_{java}$, $\beta_{java}$, $\gamma_{java}$) are combined into a reconfigurable IR with actors A to G and the SBoxes SB_0, SB_1, and SB_2. Logic regions identification: the sets $V'_\alpha$, $V'_\beta$, and $V'_\gamma$ are split into LR1 to LR5. Power saving during HDL generation: the clock gated platform (clock domains CD1 to CD5, with CD2 always on, and an Enable Generator driving en1, en3, en4, and en5) and the power gated platform (power domains PD1 to PD5, with PD2 always on, plus power switches, state retention cells, isolation cells, and a Power Controller).]

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.

one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switches, state retention cells, and isolation cells.

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and by its dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled α, β, and γ, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow, starting from these input IRs ($\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor A between α and γ and actor C among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha$ = {A, B, C, SB_0, SB_1, SB_2},
(ii) $V'_\beta$ = {D, E, C, SB_2},
(iii) $V'_\gamma$ = {A, F, G, C, SB_0, SB_1, SB_2}.

The compare and split identification process produces five different LRs:

(i) LR1 = {B},
(ii) LR2 = {C, SB_2},
(iii) LR3 = {D, E},
(iv) LR4 = {A, SB_0, SB_1},
(v) LR5 = {F, G}.

LR4 and LR2 involve shared resources, being activated, respectively, by α and γ and by α, β, and γ, while the remaining LRs involve nonshared resources: LR1 is activated only by α, LR3 only by β, and LR5 only by γ.
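The compare and split identification can be illustrated with a minimal sketch applied to the three starting sets of this example. This is not the MDC implementation: the function name `compare_and_split` and the loop structure are assumptions, chosen only to follow the textual description (compare two sets at a time, move any overlap into a new set).

```python
def compare_and_split(resource_sets):
    """Split overlapping resource sets into disjoint logic regions.

    Illustrative sketch only: two sets are compared at a time and any
    overlap is moved into a new set, until all sets are disjoint.
    """
    regions = [set(s) for s in resource_sets if s]
    changed = True
    while changed:
        changed = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                shared = regions[i] & regions[j]
                if shared:
                    regions[i] -= shared
                    regions[j] -= shared
                    regions.append(shared)
                    # Drop sets emptied by the split before restarting.
                    regions = [r for r in regions if r]
                    changed = True
                    break
            if changed:
                break
    return {frozenset(r) for r in regions}


v_alpha = {"A", "B", "C", "SB_0", "SB_1", "SB_2"}
v_beta = {"D", "E", "C", "SB_2"}
v_gamma = {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"}

for region in sorted(compare_and_split([v_alpha, v_beta, v_gamma]), key=sorted):
    print(sorted(region))
```

On these inputs the sketch returns the same five regions listed above (LR1 to LR5). Each split strictly reduces the total number of stored elements, so the loop terminates.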

At this point, the selected power saving strategy is applied during the HDL generation of the CGR system. Figure 2 shows both final designs, resulting from the application of clock gating and of power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. Figure 2 shows that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.
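The role of the Enable Generator in this example can be sketched as a simple look-up from the requested functionality to the clock enables of the gated regions. The data structure and the names `ACTIVE_REGIONS` and `enable_signals` are hypothetical; only the activation pattern (LR1 by α, LR3 by β, LR4 by α and γ, LR5 by γ, LR2 by all) is taken from the text.

```python
# Regions activated by each functionality, as stated in the example.
ACTIVE_REGIONS = {
    "alpha": {"LR1", "LR2", "LR4"},
    "beta": {"LR2", "LR3"},
    "gamma": {"LR2", "LR4", "LR5"},
}

# A region used by every functionality never needs gating (LR2 here).
ALWAYS_ON = set.intersection(*ACTIVE_REGIONS.values())
GATED = sorted(set.union(*ACTIVE_REGIONS.values()) - ALWAYS_ON)


def enable_signals(functionality):
    """Return the clock enable value of each gated region."""
    active = ACTIVE_REGIONS[functionality]
    return {region: region in active for region in GATED}


print(len(GATED))              # number of AND gates required
print(enable_signals("beta"))  # only LR3 is enabled
```

The four gated regions correspond to the four AND gates mentioned above, one enable signal each.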

The power gated platform is depicted in the bottom right corner of Figure 2. Here the LRs define the power domains (PDs) of the architecture, and in this case the SBoxes are also taken into consideration: power gating turns off the whole PD power supply and therefore also affects combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is always on (being activated by all the input DPNs).

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a single, blindly applied power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and (2) autonomously applying the optimal power reduction technique to each LR. Power and clock gating overheads are estimated on the basis of the LRs characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks such as the Enable Generator or the Power Controller. The models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the larger its static dissipation.

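The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg,ON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right]
\end{aligned}
\tag{1}
$$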

For a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkg,ON}}(\mathrm{LR}_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since the switch is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation in the LR when it is disabled; therefore (1) does not contain any $P_{\mathrm{lkg,OFF}}(\mathrm{LR}_i)$ term. $P_{\mathrm{lkg,ON}}(\mathrm{LR}_i)$ is obtained as the product of the LR activation time $T_{i,\mathrm{ON}}$ and the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to whether they need (according to the implemented functionality) to preserve their status by means of state retention cells when the region is inactive. It therefore involves, in turn, two terms. The first one refers to the registers whose state can be lost: it is estimated from the static consumption of the sequential cells, $P_{\mathrm{lkg}}(\mathrm{reg})$, scaled by the fraction of registers that are not retained. The second one refers to the retention cells: it is estimated as the number of registers whose state has to be maintained (rtn) multiplied by the leakage of a single state retention cell, $P_{\mathrm{lkg}}(\mathrm{RC})$, whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the register values). Note that, unlike $P_{\mathrm{lkg,ON}}$, for the three abovementioned terms $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes the static consumption in both the on and the off states of the LR. For the on state, the model takes the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{ON}})$) multiplied by the activation time $T_{i,\mathrm{ON}}$ and by the overall number of cells within the LR (e.g., iso). For the off state, it takes the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{OFF}})$) multiplied by the inactive time $T_{i,\mathrm{OFF}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR; an a priori characterization phase is nevertheless required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.


3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Commercial tools (e.g., Cadence Encounter Digital Implementation System) frequently consider dynamic power as composed of two main terms, as shown in (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44]:

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load},j}\, \mathrm{SW}_j + f \sum_{\mathrm{cell}_i} P_i\, \mathrm{SW}_i
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each net $j$ (bearing a specific capacitance $C_{\mathrm{load},j}$) and the related switching activity ($\mathrm{SW}_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($\mathrm{SW}_i$).
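As a purely numerical illustration of (2), the following sketch evaluates both contributions for a handful of nets and cells. All input values are made up for the example, and the unit conventions (frequency in MHz, capacitances in farads, $P_i$ in watts per MHz) are assumptions of this sketch rather than something stated in the text.

```python
def dynamic_power(f_mhz, vdd, nets, cells):
    """Evaluate P_dyn = P_net + P_int as in (2).

    nets:  iterable of (C_load in farads, switching activity) pairs
    cells: iterable of (P_i in watts per MHz, switching activity) pairs
    Returns (P_net, P_int) in watts. Units are an assumption of this sketch.
    """
    f_hz = f_mhz * 1e6
    p_net = 0.5 * f_hz * vdd ** 2 * sum(c * sw for c, sw in nets)
    p_int = f_mhz * sum(p * sw for p, sw in cells)
    return p_net, p_int


# Hypothetical design fragment: two nets and two cells at 100 MHz, 1.0 V.
nets = [(1e-15, 0.5), (2e-15, 0.25)]
cells = [(1e-9, 0.2), (3e-9, 0.1)]
p_net, p_int = dynamic_power(100.0, 1.0, nets, cells)
print(p_net, p_int, p_net + p_int)
```

Both terms scale linearly with the frequency and with the switching activity, which is why the dynamic models below need activity data from simulation.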

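Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int,ON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right]
\end{aligned}
\tag{3}
$$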

For a prospective power gating implementation, the different parts of (3) basically mirror those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data on the switching activity of the nodes. For this reason, the netlist of the baseline CGR system alone is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real switching activity of the system, as provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the identification of the optimal region to be gated. We plan to extend the proposed methodology to include the net contribution in the near future.

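3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. The clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and acts positively only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power estimation of a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right]
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{\mathrm{LR}_i}) + P_{\mathrm{int,ON}}(\mathrm{seq}_{\mathrm{LR}_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i,\mathrm{ON}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{i,\mathrm{OFF}} \right]
\end{aligned}
\tag{5}
$$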

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating is managed at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is essentially the same as (4), apart from the fact that, in the dynamic model, the effect of clock gating is estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. The LRs composition determines the amount of combinatorial and sequential cells involved.

(ii) Functional Parameters. The LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. The target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.


Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation of $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class: they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change when power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs into a CGR system where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definitions provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline CGR platform (with no power saving applied) and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained with the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations follows for LR5, which includes actors F and G.

When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

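$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 1509.219
\end{aligned}
\tag{6}
$$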
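The evaluation of (1) for LR5 can be reproduced with a short sketch. The class and function names are hypothetical, and the register counts of F and G (128 each) are inferred from the numeric substitution in (6) rather than stated explicitly: the register term of F vanishes, so all its registers are taken as retained, and the 0.5 factor for G with 64 retained registers implies 128 registers.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    p_cmb: float  # leakage of the combinatorial cells [nW]
    p_reg: float  # leakage of the sequential cells [nW]
    reg: int      # number of registers (inferred, see lead-in)
    rtn: int      # number of retained registers


def pg_static_power(actors, t_on, iso, p_rc, p_iso, p_contr, p_cg):
    """Static power of one LR under power gating, following (1).

    p_iso, p_contr and p_cg are (ON, OFF) pairs; t_on is the fraction of
    time the region is active, so the off time is 1 - t_on.
    """
    t_off = 1.0 - t_on

    def weighted(on_off):
        return on_off[0] * t_on + on_off[1] * t_off

    core = 0.0
    for a in actors:
        not_retained = (a.reg - a.rtn) / a.reg if a.reg else 0.0
        core += a.p_cmb + p_rc * a.rtn + a.p_reg * not_retained
    overhead = weighted(p_iso) * iso + weighted(p_contr) + weighted(p_cg)
    return core * t_on + overhead


actor_f = Actor(p_cmb=213.0, p_reg=1232.0, reg=128, rtn=128)
actor_g = Actor(p_cmb=273.0, p_reg=1385.0, reg=128, rtn=64)
p_lr5 = pg_static_power([actor_f, actor_g], t_on=0.3, iso=32, p_rc=17.15,
                        p_iso=(4.27, 1.39), p_contr=(95.44, 88.63),
                        p_cg=(5.77, 4.71))
print(round(p_lr5, 3))  # 1509.219
```

The retention cells dominate this result (about 988 nW of the total), which shows how strongly the number of retained registers affects the convenience of power gating.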

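The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (537 + 38325 \cdot 128) + (363 + 38325 \cdot 64 + 44068 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 27 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 + \left[ 1449 \cdot 0.3 + 1488 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] \\
&= 3021772
\end{aligned}
\tag{7}
$$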

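When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 3186.96 \\
P_{\mathrm{int}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5,\mathrm{ON}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5,\mathrm{ON}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG}_{\mathrm{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (537 + 22489 \cdot 0.3) + (363 + 44068 \cdot 0.3) \right] + \left[ 1351 \cdot 0.3 + 1320 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] \\
&= 224518
\end{aligned}
\tag{8}
$$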
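The clock gating models (4) and (5) share the same structure and differ only in whether the sequential contribution is weighted by the activation time. The sketch below makes this explicit with a single hypothetical function, `cg_power`, and evaluates only the static case for LR5, using the leakage values substituted above; the flag name `gate_sequential` is an assumption of this sketch.

```python
def cg_power(actors, t_on, p_enab, p_cg, gate_sequential):
    """Power of one LR under clock gating, following (4) and (5).

    actors: iterable of (P(cmb), P(reg)) pairs
    p_enab, p_cg: (ON, OFF) pairs for the Enable Generator and the CG cell
    gate_sequential: False for the static model (4), where registers always
        leak; True for the internal model (5), where the sequential term
        counts only during the active time.
    """
    t_off = 1.0 - t_on
    seq_factor = t_on if gate_sequential else 1.0
    core = sum(p_cmb + p_reg * seq_factor for p_cmb, p_reg in actors)
    overhead = (p_enab[0] * t_on + p_enab[1] * t_off
                + p_cg[0] * t_on + p_cg[1] * t_off)
    return core + overhead


# Static (leakage) estimate for LR5: actors F and G, active 30% of the time.
lr5_leakage = [(213.0, 1232.0), (273.0, 1385.0)]
p_static = cg_power(lr5_leakage, t_on=0.3, p_enab=(84.51, 76.54),
                    p_cg=(5.77, 4.71), gate_sequential=False)
print(round(p_static, 2))  # 3186.96
```

The static figure is about twice the power gating estimate of (6), as expected, since clock gating leaves the leakage of the region untouched and only adds overhead to it.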

Table 4 summarizes all the values obtained by applying the proposed static and dynamic models to all the different logic regions.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management


Table 1: Parameter classification. The table reports, for each considered parameter, its typology (architectural, functional, and technological), description, and extraction method.

Parameter | Arch. | Funct. | Tech. | Description | Extraction
reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports
iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs to each other in the dataflow model
TiON | | x | | Activation time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario
TiOFF | | x | | Off time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario
rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality; chosen by the designer
P_lkg/int(RC), P_lkg/int(ISO_ON), P_lkg/int(ISO_OFF) | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run
P_lkg/int(Contr_ON), P_lkg/int(Contr_OFF), P_lkg/int(Enab_ON), P_lkg/int(Enab_OFF) | | | x | Power estimation of the PG and CG controllers when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials
P_lkg/int(cmb), P_lkg/int(reg) | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist


Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

Parameter | lkg power [nW] | int power [nW]
Enab_ON | 84.51 | 1351
Enab_OFF | 76.54 | 1320
Contr_ON | 95.44 | 1449
Contr_OFF | 88.63 | 1488
CG_ON | 5.77 | 169
CG_OFF | 4.71 | 292
ISO_ON | 4.27 | 27
ISO_OFF | 1.39 | 0
P(RC) | 17.15 | 38325

    PG_set is empty
    CG_set is empty
    foreach LR_i in set_LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG total overhead
        if PG total overhead < 0 then
            estimate CG total overhead
            if PG total overhead < CG total overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG total overhead
        if CG total overhead < 0 then
            add LR_i to CG_set
        end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-fit-to-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial, since it requires a very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm moves on to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
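The selection logic of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the MDC implementation: the `LogicRegion` class, the `classify` function, and the idea of passing precomputed overhead estimates are assumptions made here for compactness (Algorithm 1 estimates the overheads on demand from the power models). The sample values are the ones used in the step-by-step example of Section 3.2.7.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    """Hypothetical container for the quantities Algorithm 1 works on."""
    name: str
    area_pct: float         # LR area, as a percentage of the whole design
    pg_overhead_pct: float  # estimated total power variation if power gated
    cg_overhead_pct: float  # estimated total power variation if clock gated


def classify(regions, area_th):
    """Return (PG_set, CG_set, always_on) following the flow of Algorithm 1."""
    pg_set, cg_set, always_on = [], [], []
    for lr in regions:
        # evaluate_area + evaluate_PG: power gating is considered only for
        # regions above the area threshold and only if it yields a saving.
        if lr.area_pct > area_th and lr.pg_overhead_pct < 0:
            # A more negative overhead means a larger saving.
            if lr.pg_overhead_pct < lr.cg_overhead_pct:
                pg_set.append(lr.name)
            else:
                cg_set.append(lr.name)
        # evaluate_CG: small regions, or regions where PG brings no saving.
        elif lr.cg_overhead_pct < 0:
            cg_set.append(lr.name)
        else:
            always_on.append(lr.name)
    return pg_set, cg_set, always_on


# Values taken from the step-by-step example (area and overheads in percent).
EXAMPLE_REGIONS = [
    LogicRegion("LR1", 52.0, -86.45, -2.15),
    LogicRegion("LR3", 0.4, 0.02, 0.01),
    LogicRegion("LR4", 7.0, -1.2, -2.0),
    LogicRegion("LR5", 15.0, -0.89, -1.04),
]

if __name__ == "__main__":
    print(classify(EXAMPLE_REGIONS, area_th=5.0))
```

With a 5% threshold the sketch places LR1 in the power gated set, LR4 and LR5 in the clock gated set, and leaves LR3 in the always-on domain, which matches the classification discussed in the text.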

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts needed to perform the synthesis of the CGR design and the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-fit-to-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.


Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | T_ON | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | α | 0.1 | B | 32 | 514 | 24 |
| LR3 | β | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | α, γ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | γ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of total area.
    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from the power consumption of the LR when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of total area.
    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of total area.
    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support. (Blocks of the MDC baseline flow: DPNs parsing, datapath merging, and hardware platform generation. Blocks of the hybrid clock and power gating extension: logic regions identification, Power Analysis, and CG/PG HDL generation, interfaced with the synthesizer through scripts and reports.)

        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of total area.
    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to CG_set.

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application

Figure 4 data (Area_th = 5%; base design consumption: 4311417 nW; PG_set and CG_set are initially empty):

| LR | Area [%] | Base [nW] | PG [nW] (overhead [%]) | CG [nW] (overhead [%]) |
|---|---|---|---|---|
| LR1 | 52 | 4143798 | 416765 (−86.45) | 4050955 (−2.15) |
| LR3 | 0.4 | 3266 | 4227 (+0.02) | 3894 (+0.01) |
| LR4 | 7 | 93793 | 40767 (−1.2) | 9479 (−2.0) |
| LR5 | 15 | 70560 | 32222 (−0.89) | 25639 (−1.04) |

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1.

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.
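The percentage overheads quoted in the step-by-step example can be reproduced from the consumption values reported in Figure 4. The sketch below is only a numerical check: the function name `overhead_pct` is introduced here for illustration, and the formula is the one described in words in step (i)(b)(1), that is, the difference between the gated and baseline consumption of the LR, divided by the total consumption of the baseline design.

```python
BASE_DESIGN_NW = 4311417  # total baseline consumption reported in Figure 4


def overhead_pct(p_lr_gated, p_lr_base, p_design_base=BASE_DESIGN_NW):
    """Percentage power variation of the whole design when one LR is gated.

    A negative value means that the strategy saves power.
    """
    return 100.0 * (p_lr_gated - p_lr_base) / p_design_base


# LR1 values from Figure 4 (all in nW).
pg_lr1 = overhead_pct(416765, 4143798)    # power gating, about -86.45
cg_lr1 = overhead_pct(4050955, 4143798)   # clock gating, about -2.15

# LR5 values from Figure 4 (all in nW).
pg_lr5 = overhead_pct(32222, 70560)       # power gating, about -0.89
cg_lr5 = overhead_pct(25639, 70560)       # clock gating, about -1.04

if __name__ == "__main__":
    for label, value in [("PG LR1", pg_lr1), ("CG LR1", cg_lr1),
                         ("PG LR5", pg_lr5), ("CG LR5", cg_lr5)]:
        print(f"{label}: {value:+.2f}%")
```

For LR1 the power gating variation is the more negative of the two, so power gating wins; for LR5 the opposite holds, which is why LR5 ends up in CG_set.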

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented both on a 90 nm and on a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \qquad (9)$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a given size $N$ DFT by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \qquad (10)$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \qquad (11)$$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.
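As an illustration of (9), (10), and (11), the following sketch builds a radix-2 decimation in time FFT out of butterflies and checks it against the direct DFT. It is a software model written for this purpose only, under the assumption that the twiddle factor is $w_n^k = e^{-i 2\pi k/n}$; the function names are hypothetical and it does not describe the hardware actors used in the CGR design.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n), as assumed for (11)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly of (10): returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def dft(x):
    """Direct evaluation of (9), used here only as a reference."""
    n = len(x)
    return [sum(x[j] * twiddle(k * j, n) for j in range(n)) for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 decimation in time FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # One butterfly combines the k-th outputs of the two half-size DFTs.
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, -1, -2, -3]
    fast = fft_radix2(samples)
    slow = dft(samples)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # numerically close to 0
```

For a size 8 input the recursion performs 4 + 2·2 + 4·1 = 12 butterflies, which is the butterfly count of the baseline design in Figure 6.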

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12 butterflies design then requires three clock periods to produce the outputs.

From the baseline 12 butterflies design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size 8 FFT configurations are considered:

(i) 12b is the baseline 12 butterflies FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (Base design; x axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y axis: power [μW]).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be given up to consume less power.
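A run-time policy exploiting this trade-off could look like the sketch below. The selection rule (pick the lowest latency configuration whose power fits a given budget) and the names `CONFIGS` and `fastest_within_budget` are assumptions made for illustration; the article does not prescribe a specific policy. Latencies are those listed for the four configurations, while static and internal power values are the ones reported in Table 5 (net power is not included).

```python
# (configuration, latency in clock cycles, static power [nW], internal power [nW])
CONFIGS = [
    ("1b", 24, 1750407, 1307259),
    ("2b", 12, 1752106, 1539855),
    ("4b", 6, 1757485, 2020953),
    ("12b", 3, 1776056, 2411257),
]


def total_power_nw(config):
    """Static plus internal power of one configuration, in nW."""
    _, _, static_nw, internal_nw = config
    return static_nw + internal_nw


def fastest_within_budget(budget_nw):
    """Return the name of the lowest latency configuration within the budget.

    Returns None when no configuration fits the given power budget.
    """
    feasible = [c for c in CONFIGS if total_power_nw(c) <= budget_nw]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c[1])[0]


if __name__ == "__main__":
    for budget in (3_000_000, 3_500_000, 5_000_000):
        print(budget, fastest_within_budget(budget))
```

With these numbers, a budget of 3.5 mW selects 2b, a budget of 5 mW allows the full 12b design, and a budget of 3 mW cannot be met by any configuration.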

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | T_ON | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| T_ON | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso(e) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso(r) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |

Figure 8: FFT use case: area percentage per LR (FFT, 90 nm). Area shares with respect to the overall system area: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. For each LR, the figure also reports the split between sequential and combinatorial logic.

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving or not. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.
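The activation times of Table 6 follow directly from Table 5: the $T_{ON}$ of an LR is the sum of the $T_{ON}$ of the configurations that activate it. The short sketch below reproduces this computation; the dictionary and function names are introduced here for illustration only, and the results are rounded to two decimals, as in the tables.

```python
# Activation time of each FFT configuration and the LRs it activates (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def region_activation_times():
    """Sum, for each LR, the activation times of the configurations using it."""
    t_on = {}
    for config, regions in CONFIG_LRS.items():
        for lr in regions:
            t_on[lr] = t_on.get(lr, 0.0) + CONFIG_T_ON[config]
    # Round to two decimals to hide floating point noise.
    return {lr: round(value, 2) for lr, value in sorted(t_on.items())}


if __name__ == "__main__":
    for lr, value in region_activation_times().items():
        print(f"LR{lr}: T_ON = {value}")
```

The values obtained for LR1 to LR8 are 0.33, 0.67, 0.37, 0.04, 0.21, 0.42, 0.58, and 0.96, as listed in the first row of Table 6.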

Figure 9: FFT use case: comparison between the estimated (Est) and real power variation [%] due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the overhead of the power saving strategy. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


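Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

| LR | PG overhead, static [%] | PG overhead, internal [%] | CG overhead, static [%] | CG overhead, internal [%] |
|---|---|---|---|---|
| LR1 | −40.87 | −9.47 | 0.00 | −11.33 |
| LR2 | +0.23 | −1.32 | 0.00 | −2.85 |
| LR3 | −10.44 | −2.11 | 0.00 | −2.59 |
| LR4 | −0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | −3.44 | −0.39 | 0.00 | −0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | −0.25 |

Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est [%] | PG overhead, Real [%] | PG overhead, Err [%] | Overhead error, Iso [%] | Overhead error, Rtn [%] | Overhead error, Contr [%] | Net Err [%] |
|---|---|---|---|---|---|---|---|
| LR1 | −25.22 | −25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | −6.28 | −6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | −1.92 | −1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est [%] | CG overhead, Real [%] | CG overhead, Err [%] | Overhead error, CGcell [%] | Net Err [%] |
|---|---|---|---|---|---|
| LR1 | −5.73 | −5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | −1.42 | −1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | −1.34 | −1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | −0.39 | −0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | −0.12 | −0.12 | 0.29 | 1.08 | 0.47 |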

Figure 10: FFT use case: comparison between the estimated (Est) and real power variation [%] due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

that clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation.

(i) DAT_5: threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
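The two classifications above can be reproduced from the data of the FFT use case. The sketch below is an illustration under explicit assumptions: areas are the shares shown in Figure 8, estimated overheads are the Est columns of Tables 8 and 9, and regions without a clock gating estimate (the fully combinatorial LR4, LR5, and LR6) are treated as bringing no saving. Since Table 8 reports power gating estimates only for LR1, LR3, and LR7, the sketch is meaningful only for thresholds at which no other region is above the threshold; the names `AREA_PCT`, `PG_EST`, `CG_EST`, and `select` are hypothetical.

```python
# Area share of each LR in percent (Figure 8).
AREA_PCT = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
            "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated overall power variation in percent (Est columns of Tables 8 and 9).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select(area_th):
    """Return (PG_set, CG_set) for a given area threshold, as in Algorithm 1."""
    pg_set, cg_set = [], []
    for lr, area in AREA_PCT.items():
        pg_over = PG_EST.get(lr)        # None: no PG estimate available
        cg_over = CG_EST.get(lr, 0.0)   # assumption: no estimate, no saving
        if (area > area_th and pg_over is not None
                and pg_over < 0 and pg_over < cg_over):
            pg_set.append(lr)           # PG saves power and beats CG
        elif cg_over < 0:
            cg_set.append(lr)           # CG is the only (or better) saving
        # otherwise the region stays in the always-on domain
    return pg_set, cg_set


if __name__ == "__main__":
    print("DAT_5 :", select(5.0))
    print("DAT_10:", select(10.0))
```

With a 5% threshold the power gated regions are LR1, LR3, and LR7 and the clock gated ones are LR2 and LR8; raising the threshold to 10% moves LR7 to the clock gated set, in agreement with the two cases described above.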

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. The legend shows in brackets the power management area overhead [%] of each design with respect to the Base one: CG_full (+0.03), PG_full (+4.66), DAT_5 (+2.18), DAT_10 (+1.97).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG_full, capable of reducing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted (designs: Base, CG_full, PG_full, DAT_5, DAT_10; x axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1, respectively. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. Errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 provides an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net Err) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | Actors | T_ON | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG_set | CG_set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

421 CGR System of the Zoom Coprocessor The zoomapplication is meant to scale an image according to thegiven zooming factor Missing pixels of the zoomed imageare calculated by adaptively interpolating the neighbouringvalues The original application is a C program It has beenexecuted and profiled on a general purpose machine inorder to identify the most computational intensive portionsof the code (computational kernels) with the intention ofaccelerating them on a CGR hardware accelerator Sevencomputational kernels have been identified and modelled asdataflow networks Table 10 summarizes the kernels com-position (in terms of number of dataflow actors) activationtime and main functionality Then these dataflow kernelshave been combined by MDC to obtain a multidataflowspecification constituting the computing core of the CGRaccelerator in charge of accelerating the zoom application

Thirteen LRs are identified on the CGR zoom coproces-sor The main difference between this scenario and the FFTone is that in the zoom coprocessor it is not necessary toretain the status of any kernel when switching among themThis means that applying power gating no retention cells areneeded in the identified regions

422 Zoom Coprocessor Validation Results at 90 nm CMOSTechnology This section ismeant to provide the discussion ofthe achieved results in the zoom coprocessor scenario usingthe same 90 nmCMOS technology adopted for the FFT CGRdesigns assessment From the implementation point of viewwe will discuss the same designs we considered for the FFTuse case defined as follows

(i) Base the baseline CGR design without any powersaving

(ii) PG full the CGR design where all the 13 LRs arepower gated

(iii) CG full the CGR design where all the 13 LRs areclock gated

(iv) DAT 5 the CGR design where hybrid power andclock gating support are implemented by means ofthe proposed flow setting the Area Threshold to 5in Algorithm 1

(v) DAT 10 the CGR design where hybrid power andclock gating support are implemented bymeans of theproposed flow setting the Area Threshold to 10 inAlgorithm 1

Please refer to Table 11 for the composition of DAT 5 andDAT 10

Figure 13 depicts static internal and total power con-sumption for each considered design In this case thedifferences among CG full PG full DAT 5 and DAT 10are not so evident The reason is that in this scenario thedynamic power consumption (due to the internal power) isconsiderably higher than the static one As you can see inthe reported histograms on average there are approximatelymore than two orders of magnitude of difference Powergating and clock gating demonstrate to be equally capable ofcutting down the internal power consumptionLR5 is the onlyregion that Algorithm 1 completely discards by any form ofpower management both in the DAT 5 design and in theDAT 10one It is fully combinatorial therefore clock gatingdoes not provide any positive effect on it Nevertheless it is sosmall (065 of the whole system area) that if power gated itcannot provide any substantial benefit A closer observationof the histograms confirms what we already got for the FFTdespite the similar trend for all the designs which lead tomore than the 62 of power saving DAT 5 consumes lessthan any other (6261 of saving) while the CG full design isthe less beneficial (6229 of saving)

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy. All values are percentages.

LR      PG saving                     Net       CG saving                     Net
        Est       Real      Err       Err       Est       Real      Err       Err
LR1     −6.322    −6.331    0.15      0.18      −6.307    −6.313    0.09      0.17
LR2     −23.464   −23.463   0.00      1.26      −23.369   −23.373   0.02      1.67
LR3     −6.537    −6.544    0.10      1.65      −6.507    −6.514    0.10      1.59
LR4     -         -         -         -         −0.958    −0.964    0.68      1.34
LR5     -         -         -         -         -         -         -         -
LR6     -         -         -         -         −0.870    −0.878    0.81      1.22
LR7     -         -         -         -         −1.033    −1.039    0.63      0.15
LR8     −7.062    −7.058    0.05      2.02      −6.987    −6.986    0.01      1.83
LR9     -         -         -         -         −2.436    −2.443    0.27      1.55
LR10    −5.498    −5.503    0.10      1.65      −5.474    −5.480    0.11      1.57
LR11    -         -         -         -         −3.329    −3.285    1.32      3.67
LR12    −4.278    −4.288    0.23      1.83      −4.267    −4.273    0.15      1.86
LR13    -         -         -         -         −0.649    −0.656    1.01      1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The histogram reports static (lkg), internal (int), and total (tot) power for each design. Legend shows, in brackets, the power management area overhead for each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario at 90 nm. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering respectively power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error introduced by neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The histogram reports static (lkg), internal (int), and total (tot) power for each design. Legend shows, in brackets, the power management area overhead for each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support are implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected on the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become quite similar.
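The weight of the static contribution on the total saving can be illustrated with a short numerical sketch. It is not part of the proposed flow: it only assumes that the total power is the sum of a static and a dynamic term, and it uses rounded, illustrative saving figures (the constants below are assumptions, not measured values) to show why the gap between a power gated and a clock gated design widens when the dynamic-to-static ratio drops from about 100 to about 10.

```python
def total_saving(static_saving, dynamic_saving, dyn_to_static_ratio):
    """Fraction of the total base power that is saved.

    Static base power is normalised to 1, so the dynamic base power
    equals the dynamic-to-static ratio.
    """
    static_base = 1.0
    dynamic_base = float(dyn_to_static_ratio)
    saved = static_saving * static_base + dynamic_saving * dynamic_base
    return saved / (static_base + dynamic_base)


# Illustrative, rounded inputs (assumptions, not values from the tables).
DYNAMIC_SAVING = 0.62     # both techniques cut dynamic power similarly
PG_STATIC_SAVING = 0.52   # power gating also cuts static power
CG_STATIC_SAVING = -0.03  # clock gating cells add a small static overhead


def pg_vs_cg_gap(dyn_to_static_ratio):
    """Difference in total saving between power gating and clock gating."""
    pg = total_saving(PG_STATIC_SAVING, DYNAMIC_SAVING, dyn_to_static_ratio)
    cg = total_saving(CG_STATIC_SAVING, DYNAMIC_SAVING, dyn_to_static_ratio)
    return pg - cg


for ratio in (100, 10):
    print(f"ratio {ratio}: gap = {pg_vs_cg_gap(ratio):.2%}")
```

With these assumed inputs the gap grows by roughly one order of magnitude when the ratio drops from 100 to 10, which is consistent with the qualitative trend discussed above.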

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).
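The compositions listed in Table 13 can be reproduced from the estimates of Table 14 under one simple reading of the selection flow, offered here as a sketch and not as Algorithm 1 itself: a region above the area threshold is power gated when its estimated power gating saving beats the clock gating one, any other region with a beneficial clock gating estimate is clock gated, and the rest stays in the always-on domain. The estimates are copied from Table 14; the area shares in `AREA_PCT` are hypothetical values, chosen only so that the "> Th" sets of Table 13 come out.

```python
# Estimated savings from Table 14 as (power gating, clock gating); more
# negative means a larger saving. None marks a region with no estimate (LR5).
ESTIMATES = {
    "LR1": (-5.686, -5.507),
    "LR2": (-22.990, -22.063),
    "LR3": (-6.569, -6.273),
    "LR4": (-0.946, -0.919),
    "LR5": (None, None),
    "LR6": (-0.747, -0.853),
    "LR7": (-0.955, -0.935),
    "LR8": (-7.464, -6.724),
    "LR9": (-2.475, -2.350),
    "LR10": (-5.473, -5.270),
    "LR11": (-3.422, -3.257),
    "LR12": (-4.327, -4.176),
    "LR13": (-0.587, -0.623),
}

# HYPOTHETICAL area shares (% of the system area): illustrative assumptions,
# not figures reported in the paper.
AREA_PCT = {
    "LR1": 12.0, "LR2": 9.0, "LR3": 8.0, "LR4": 3.0, "LR5": 0.5,
    "LR6": 6.0, "LR7": 3.0, "LR8": 11.0, "LR9": 4.0, "LR10": 7.0,
    "LR11": 4.0, "LR12": 7.0, "LR13": 6.0,
}


def partition(threshold_pct):
    """Split the logic regions into (PG set, CG set, NA) for an area threshold."""
    pg_set, cg_set, not_assigned = [], [], []
    for lr, (pg_est, cg_est) in ESTIMATES.items():
        above_threshold = AREA_PCT[lr] > threshold_pct
        pg_saves = pg_est is not None and pg_est < 0
        cg_saves = cg_est is not None and cg_est < 0
        if above_threshold and pg_saves and (not cg_saves or pg_est < cg_est):
            pg_set.append(lr)        # power gating is the better estimate
        elif cg_saves:
            cg_set.append(lr)        # fall back to clock gating
        else:
            not_assigned.append(lr)  # stays in the always-on domain
    return pg_set, cg_set, not_assigned


for threshold in (1, 5, 10):
    pg_set, cg_set, not_assigned = partition(threshold)
    print(f"DAT_{threshold}: PG={pg_set} CG={cg_set} NA={not_assigned}")
```

Under these assumptions, LR6 and LR13 end up clock gated even when they exceed the threshold, because their clock gating estimate is the more favourable one, while LR5 is never assigned.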

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \ast \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   Set      LRs
DAT_1    > Th     LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
         PG set   LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
         CG set   LR6, LR13
         NA       LR5
DAT_5    > Th     LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
         PG set   LR1, LR2, LR3, LR8, LR10, LR12
         CG set   LR4, LR6, LR7, LR9, LR11, LR13
         NA       LR5
DAT_10   > Th     LR1, LR8
         PG set   LR1, LR8
         CG set   LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
         NA       LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy. All values are percentages.

LR      PG saving                     Net       CG saving                     Net
        Est       Real      Err       Err       Est       Real      Err       Err
LR1     −5.686    −5.699    0.23      0.47      −5.507    −5.501    0.13      0.53
LR2     −22.990   −22.982   0.03      1.02      −22.063   −22.029   0.15      0.81
LR3     −6.569    −6.566    0.04      1.13      −6.273    −6.262    0.18      0.91
LR4     −0.946    −0.944    0.15      1.82      −0.919    −0.917    0.22      1.52
LR5     -         -         -         -         -         -         -         -
LR6     −0.747    −0.748    0.25      0.25      −0.853    −0.833    0.26      1.59
LR7     −0.955    −0.957    0.17      1.36      −0.935    −0.933    0.21      1.49
LR8     −7.464    −7.460    0.06      1.36      −6.724    −6.714    0.16      0.66
LR9     −2.475    −2.482    0.25      1.10      −2.350    −2.345    0.18      1.09
LR10    −5.473    −5.470    0.04      1.09      −5.270    −5.261    0.17      0.91
LR11    −3.422    −3.361    1.79      2.58      −3.257    −3.205    1.63      2.04
LR12    −4.327    −4.330    0.08      1.19      −4.176    −4.169    0.17      0.98
LR13    −0.587    −0.580    1.25      3.70      −0.623    −0.621    0.28      1.85

phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. Thus, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
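The two overhead figures quoted above follow directly from the 4-cycle switch-on protocol. A minimal sketch of the arithmetic, assuming (as the quoted figures suggest) that the overhead is the ratio between the wake-up cycles and the kernel execution cycles; the function name is illustrative:

```python
WAKE_UP_CYCLES = 4  # clock cycles required by each on/off transition


def wake_up_overhead(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a fraction of the kernel execution time."""
    return wake_up_cycles / kernel_cycles


# Fastest zoom coprocessor kernel: 24 cycles.
print(f"zoom: {wake_up_overhead(24):.1%}")    # zoom: 16.7%
# HEVC interpolation filter on the smallest 8 x 8 block: 160 cycles.
print(f"HEVC: {wake_up_overhead(160):.1%}")   # HEVC: 2.5%
```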

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy, using reports generated without the real switching activity.

LR      PG overhead
        Est       Real      Err
LR1     −30.54    −26.10    17.07
LR2     −0.19     −1.99     90.22
LR3     −21.83    −6.95     214.30
LR4     −0.12     −0.67     80.97
LR5     −0.06     −0.59     90.11
LR6     −0.07     0.56      86.60
LR7     −15.39    −2.64     482.93
LR8     −0.38     −0.02     1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points (1) and (3). In numbers, it corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be done, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
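The counts above can be checked with a few lines of code. The sketch below only encodes the counting rules stated in this section (one power gated design per LR, one clock gated design per non-combinatorial LR, plus the baseline, each simulated once per functionality); the function names are illustrative.

```python
def exhaustive_effort(n_functionalities, k_regions, combinatorial_regions=0):
    """Syntheses and simulations needed without the proposed flow."""
    power_gated = k_regions
    clock_gated = k_regions - combinatorial_regions  # no CG trial if combinatorial
    syntheses = 1 + power_gated + clock_gated         # plus the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """Syntheses and simulations needed with the proposed flow."""
    return 1, n_functionalities


print(exhaustive_effort(4, 8, 3), proposed_effort(4))    # FFT: (14, 56) (1, 4)
print(exhaustive_effort(7, 13, 1), proposed_effort(7))   # zoom: (26, 182) (1, 7)
```

With no fully combinatorial region, `exhaustive_effort` reduces to the general $2k + 1$ syntheses and $N(2k + 1)$ simulations given above.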

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps further both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


a framework capable of CGR systems characterization. MDC provides a comprehensive design suite, automating several development tasks of the synthesis and development of CGR systems, within design flows targeting both FPGA [29] and ASIC. The tool provides extensive support to dynamic power management [17, 18], addressing power-constrained design cases/scenarios and completely automating implementation and control of clock gating and power gating strategies in the final CGR platform. Nevertheless, such techniques are not currently addressed in a hybrid manner: the users must choose the approach to be used in the design without an a priori automated analysis process. Such an unsupported selection may easily lead to suboptimal implementations on the final platform.

In the following, Section 3.1 provides an overview of the MDC baseline functionality and of the current power management support. Section 3.2 discusses the proposed power models and the automated selection algorithm identifying the optimal power management strategy in CGR systems. In both sections, step-by-step examples are presented to clarify the methodology.

3.1. The Multi-Dataflow Composer Tool. MDC automates the generation and management of CGR systems, facing the complex mapping of multiple applications onto a single reconfigurable architecture. It automates the mapping process and guarantees the minimization of hardware resources, allowing for significant area/energy savings [12, 42]. In literature, this problem is known as datapath merging, and it deals with the combination of a set of input datapaths, described by means of graphs, onto a single reconfigurable datapath. It aims at maximally sharing (among the different input graphs) both processing nodes and connections.

MDC is naturally compliant with the RVC-CAL formalism and natively supports Dataflow Process Network (DPN) models as input. Currently, it is interfaced with the Open RVC-CAL Compiler, Orcc [43]. The Orcc front-end is responsible for parsing, one by one, the high-level DPN specifications of the different datapaths that MDC will merge within the CGR system. Please note that MDC can be interfaced with other graph parsers, so that it can be easily adapted to any other dataflow-based modelling environment.

As depicted by Figure 1, the MDC baseline flow involves three main phases.

(1) The input DPNs parsing, performed by the Orcc front-end, translates the RVC-CAL specifications into Java Intermediate Representations (IRs), which are basically directed graphs.

(2) The datapath merging, performed by the MDC front-end, combines the IRs into a reconfigurable IR, inserting (where necessary) special switching actors (responsible for properly distributing the token flow among the different merged DPNs) and keeping trace of the system programmability through a dedicated Configuration Table (TAB in Figure 1).

(3) The hardware platform generation, performed by the MDC back-end, leads to the creation of the RTL that describes the CGR system itself, where each actor of the reconfigurable IR is mapped onto a different hardware FU. In this phase, the hardware communication protocol and the HDL (Hardware Description Language) components library (providing the RTL descriptions of the required FUs, manually or automatically generated) are provided as input to the tool.

At the hardware level, reconfiguration takes place in a single clock cycle. It is achieved through low overhead switching elements (SBoxes) that allow the sharing of common resources among different input DPNs. SBoxes are simple combinatorial multiplexers and demultiplexers, whose configuration is stored into dedicated Look-Up Tables that, according to the Configuration Table, compute the selectors necessary for the correct data forwarding in order to implement the requested functionality.

3.1.1. Automated Power Management. Dealing with reconfigurable architectures, and in particular with CGR systems, power consumption has to be carefully taken into consideration. Such systems are affected by resource redundancy, mainly due to the FUs that are not shared among different functionalities. Thus, when a certain functionality is executed, the part of the design not involved in the computation is idle and can uselessly consume precious power. Fortunately, the unused resources for each implemented functionality depend on the input specifications and, therefore, are fixed at design time.

Given these considerations, a CGR system can be characterized by a set of disjoint logic regions (LRs), each grouping the resources that are always active/inactive at the same time. The MDC power management extension is capable of automatically identifying LRs. It performs the LRs identification at a high level of abstraction, on the reconfigurable IR, by exploiting the intrinsic modularity of the dataflow graphs. Once the LRs have been identified, the MDC dynamic power manager automatically applies, according to the user selection, either clock gating [17] or power gating [18] on the resulting CGR hardware platform. The identification of the minimal number of LRs is guaranteed, in order to minimize the power overhead of the extra logic needed to implement the selected power saving strategies. An overview of the power management extension is provided by Figure 1.

For each input DPN $G_i$, the currently available algorithm determines the set $V'_i$, which contains all the resources of the reconfigurable IR activated by $G_i$. These sets are the starting point for the algorithm, which finds the final LRs by iteratively comparing two $V'_i$ sets at a time and determining their possible overlap. If an overlap is found, its resources are removed from the two considered sets and a new $V'_j$ (corresponding to a new LR) containing these shared resources is issued. $V'_j$ groups resources that are shared among different input DPNs, while the remaining resources in the two $V'_i$ uniquely belong to the originally considered DPNs. This compare and split identification process guarantees that the number of LRs found by the MDC dynamic power manager is the minimum achievable one.

Figure 1: MDC design suite: baseline flow (a) and corresponding power extension (b). The baseline flow comprises DPNs parsing, datapath merging, and hardware platform generation; the power management extension comprises logic regions identification and power saving application (clock gating or power gating).

If this number is still too high for the considered target platform, as can happen if FPGAs are the target devices (in FPGA devices the number of hardware blocks that can drive the different LRs is limited; e.g., 32 BUFG units are available in Xilinx boards for clock management purposes), an LRs merging process has to be applied. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LRs merging process is applied. Two LRs at a time are unified until the constraint fixed by the user is met (details on how to merge different LR sets can be found in [17], where two merging strategies, a power-aware one and a number-aware one, are presented). This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra unnecessary power consumption.

During the HDL generation phase, the MDC power management extension also implements the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee on the effectiveness of the approach. Clock gating acts only on the dynamic contribution of the power consumption and requires a minimum logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR, plus one unique Enable Generator to properly set the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each flip-flop whose state has to be kept also when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A different clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol, for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore, one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switch, state retention, and isolation cells.

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled $\alpha$, $\beta$, and $\gamma$, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow, starting from these input IRs ($\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor $A$ between $\alpha$ and $\gamma$ and actor $C$ among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha = \{A, B, C, SB\_0, SB\_1, SB\_2\}$,
(ii) $V'_\beta = \{D, E, C, SB\_2\}$,
(iii) $V'_\gamma = \{A, F, G, C, SB\_0, SB\_1, SB\_2\}$.

The compare and split identification process produces five different LRs:

(i) LR1 = $\{B\}$,
(ii) LR2 = $\{C, SB\_2\}$,
(iii) LR3 = $\{D, E\}$,
(iv) LR4 = $\{A, SB\_0, SB\_1\}$,
(v) LR5 = $\{F, G\}$.

LR4 and LR2 involve shared resources, being activated, respectively, by $\alpha$ and $\gamma$ and by $\alpha$, $\beta$, and $\gamma$, while the remaining LRs involve nonshared resources: LR1 is activated only by $\alpha$, LR3 only by $\beta$, and LR5 only by $\gamma$.
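The compare and split process can be illustrated with a minimal Python sketch. It is not the MDC implementation (which works on the Java IR): the function name `compare_and_split` and the dictionary `V_PRIME` are hypothetical, and resources are represented simply by their names. Two overlapping sets at a time are replaced by their exclusive remainders plus the shared part, until all sets are disjoint.

```python
from itertools import combinations


def compare_and_split(initial_sets):
    """Refine possibly overlapping resource sets into disjoint regions."""
    regions = [frozenset(s) for s in initial_sets if s]
    while True:
        # Find any two current sets that still share resources.
        overlapping = next(
            ((a, b) for a, b in combinations(regions, 2) if a & b), None
        )
        if overlapping is None:
            return set(regions)  # all sets are disjoint: these are the LRs
        a, b = overlapping
        regions.remove(a)
        regions.remove(b)
        # Keep the two exclusive remainders and issue the shared part
        # as a new set; empty or already present parts are skipped.
        for part in (a - b, b - a, a & b):
            if part and part not in regions:
                regions.append(part)


# Starting sets of the example (resources activated by each DPN).
V_PRIME = {
    "alpha": {"A", "B", "C", "SB_0", "SB_1", "SB_2"},
    "beta": {"D", "E", "C", "SB_2"},
    "gamma": {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"},
}

logic_regions = compare_and_split(V_PRIME.values())
for region in sorted(logic_regions, key=sorted):
    print(sorted(region))
```

On the three sets of the example, the sketch ends with the same five regions listed above. Each split strictly reduces the total size of the sets, so the loop always terminates.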

At this point, the selected power saving strategy is applied during the CGR system HDL generation. Figure 2 shows both final designs, resulting from the application of clock gating and of power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. It can be noticed from Figure 2 that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs), where, in this case, SBoxes are also taken into consideration: power gating turns off the whole PD power supply, so it also affects combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is always on (being activated by all the input DPNs).
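The role of the Enable Generator (and, for power gating, of the Power Controller) in this example can be sketched behaviourally in Python. This is only an illustration under stated assumptions: the real blocks are HDL modules generated by MDC, and the names `LR_ACTIVATED_BY`, `gated_regions`, and `enable_table` are hypothetical. The sketch derives which regions need gating logic and which ones must be enabled for each functionality.

```python
FUNCTIONALITIES = ("alpha", "beta", "gamma")

# Functionalities activating each LR of the example.
LR_ACTIVATED_BY = {
    "LR1": {"alpha"},
    "LR2": {"alpha", "beta", "gamma"},
    "LR3": {"beta"},
    "LR4": {"alpha", "gamma"},
    "LR5": {"gamma"},
}


def gated_regions(activated_by, functionalities):
    """Regions that are idle for at least one functionality."""
    everything = set(functionalities)
    return sorted(
        lr for lr, funcs in activated_by.items() if funcs != everything
    )


def enable_table(activated_by, functionalities):
    """For each functionality, the enable value of every gated region."""
    gated = gated_regions(activated_by, functionalities)
    return {
        func: {lr: func in activated_by[lr] for lr in gated}
        for func in functionalities
    }


GATED = gated_regions(LR_ACTIVATED_BY, FUNCTIONALITIES)
ENABLES = enable_table(LR_ACTIVATED_BY, FUNCTIONALITIES)
for func, enables in ENABLES.items():
    print(func, enables)
```

LR2 never appears in the table, which mirrors the fact that no gating logic is inserted for the always on region and that four gated regions remain.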

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a blindly applied unique power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated based on the LRs characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the larger its corresponding static dissipation.

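The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg_{ON}}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i_{\mathrm{ON}}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right]
\end{aligned}
\tag{1}
$$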

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkg_{ON}}}(\mathrm{LR}_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation on the LR when disabled; therefore, (1) does not present any $P_{\mathrm{lkg_{OFF}}}(\mathrm{LR}_i)$ term. $P_{\mathrm{lkg_{ON}}}(\mathrm{LR}_i)$ is obtained as the product of the LR activation time $T_{i_{\mathrm{ON}}}$ and the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to their need (according to the implemented functionality) of preserving or not their status, by means of state retention cells, when the region is inactive. It therefore involves, in turn, two terms. The first one refers to the registers whose state can be lost, and it is estimated on the basis of the static consumption of the sequential cells ($P_{\mathrm{lkg}}(\mathrm{reg})$), scaled by the fraction of registers that are not retained. The second one refers to the retention cells, and it is estimated starting from the number of registers whose state has to be maintained (rtn), multiplied by the leakage of a single state retention cell ($P_{\mathrm{lkg}}(\mathrm{RC})$), whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the registers value). Note that, unlike $P_{\mathrm{lkg_{ON}}}$, for the three abovementioned terms $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{ON}})$) multiplied by the activation time $T_{i_{\mathrm{ON}}}$ and by the overall number of cells within the LR (e.g., iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}})$) multiplied by the inactive time $T_{i_{\mathrm{OFF}}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR; however, an a priori characterization phase is required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.


3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the nodes switching activity. Commercial tools (e.g., Cadence Encounter Digital Implementation System) frequently consider dynamic power as composed of two main terms, as depicted by (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44].

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} SW_j + f \sum_{\mathrm{cell}_i} P_i \, SW_i
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each net $j$ (bearing a specific capacitance $C_{\mathrm{load}_j}$) and the related switching activity ($SW_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($SW_i$).
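The two sums of (2) can be evaluated with a few lines of Python. This is a minimal sketch with hypothetical inputs: the function name `dynamic_power`, the unit choices (capacitance in farad, cell power in watt per MHz), and the numeric values are assumptions made only for illustration and are not taken from any tool report.

```python
def dynamic_power(freq_mhz, vdd, nets, cells):
    """Return (P_net, P_int) in watts, following the two sums of (2).

    nets:  iterable of (load capacitance in farad, switching activity)
    cells: iterable of (power per MHz in watt/MHz, switching activity)
    """
    freq_hz = freq_mhz * 1e6
    # Net term: one half of f * VDD^2 times the switched load capacitance.
    p_net = 0.5 * freq_hz * vdd ** 2 * sum(c * sw for c, sw in nets)
    # Internal term: f times the per-cell power weighted by activity.
    p_int = freq_mhz * sum(p * sw for p, sw in cells)
    return p_net, p_int


# Hypothetical design: two nets and two cells at 100 MHz and 1.0 V.
P_NET, P_INT = dynamic_power(
    freq_mhz=100.0,
    vdd=1.0,
    nets=[(10e-15, 0.2), (5e-15, 0.1)],
    cells=[(2e-9, 0.2), (1e-9, 0.1)],
)
print(f"P_net = {P_NET:.3e} W, P_int = {P_INT:.3e} W")
```

Both terms scale linearly with the operating frequency and with the switching activity, which is why the dynamic estimation needs activity data from simulation.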

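Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int_{ON}}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i_{\mathrm{ON}}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right]
\end{aligned}
\tag{3}
$$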

Considering a prospective power gating implementation, the different parts of (3) basically reflect those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of nodes switching activity. For this reason, the netlist of the baseline CGR system is not sufficient to retrieve accurate values from the power reports, and one different simulation of the netlist for every implemented functionality is required. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term seems not to affect the optimal identification of the region to be gated. We plan to extend the proposed methodology to include the nets contribution in the near future.

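3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and it positively acts only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic one, referring to a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right]
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{\mathrm{LR}_i}) + P_{\mathrm{int_{ON}}}(\mathrm{seq}_{\mathrm{LR}_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i_{\mathrm{ON}}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i_{\mathrm{OFF}}} \right]
\end{aligned}
\tag{5}
$$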

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, implementing clock gating management at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is pretty much the same as (4), apart from the fact that, dealing with the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. LRs composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the different cells characterization.


Table 1 reports, for each parameter considered in (1), (3), (4), and (5), its classification. A deeper explanation of $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations for LR5, which includes actors $F$ and $G$, follows.

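When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5_{\mathrm{ON}}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&= \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 1509.219
\end{aligned}
\tag{6}
$$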

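The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5_{\mathrm{ON}}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&= \left[ (53.7 + 38.325 \cdot 128) + (36.3 + 38.325 \cdot 64 + 4406.8 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 0.27 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 + \left[ 144.9 \cdot 0.3 + 148.8 \cdot 0.7 \right] + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] \\
&= 3071.272
\end{aligned}
\tag{7}
$$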
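The power gating models (1) and (3) share the same structure, so a single routine can evaluate both. The following Python sketch is an illustration, not the MDC code: the names `Actor`, `Overhead`, and `power_gating_estimate` are hypothetical, the off time is assumed to be one minus the on time, and the numeric values are the LR5 entries of Tables 2 and 3 as read here (e.g., 17.15 nW for the leakage of one retention cell).

```python
from typing import NamedTuple


class Actor(NamedTuple):
    p_cmb: float  # power of the combinatorial cells [nW]
    p_reg: float  # power of the sequential cells [nW]
    reg: int      # number of registers
    rtn: int      # number of registers needing state retention


class Overhead(NamedTuple):
    rc: float         # one state retention cell
    iso_on: float     # one isolation cell, LR on
    iso_off: float    # one isolation cell, LR off
    contr_on: float   # Power Controller, LR on
    contr_off: float  # Power Controller, LR off
    cg_on: float      # clock gating cell, LR on
    cg_off: float     # clock gating cell, LR off


def power_gating_estimate(actors, iso, t_on, cost):
    """Evaluate the structure shared by (1) and (3) for one LR."""
    t_off = 1.0 - t_on  # assumption: on and off fractions sum to one
    active = 0.0
    for a in actors:
        # Fraction of registers whose state may be lost (0 if no registers).
        not_retained = (a.reg - a.rtn) / a.reg if a.reg else 0.0
        active += a.p_cmb + cost.rc * a.rtn + a.p_reg * not_retained
    isolation = (cost.iso_on * t_on + cost.iso_off * t_off) * iso
    controller = cost.contr_on * t_on + cost.contr_off * t_off
    clock_gate = cost.cg_on * t_on + cost.cg_off * t_off
    return active * t_on + isolation + controller + clock_gate


# LR5 = {F, G}: leakage and internal figures of the two actors.
LKG_ACTORS = [Actor(213, 1232, 128, 128), Actor(273, 1385, 128, 64)]
INT_ACTORS = [Actor(53.7, 2248.9, 128, 128), Actor(36.3, 4406.8, 128, 64)]
LKG_COST = Overhead(17.15, 4.27, 1.39, 95.44, 88.63, 5.77, 4.71)
INT_COST = Overhead(38.325, 0.27, 0.0, 144.9, 148.8, 16.9, 29.2)

lkg_lr5 = power_gating_estimate(LKG_ACTORS, iso=32, t_on=0.3, cost=LKG_COST)
int_lr5 = power_gating_estimate(INT_ACTORS, iso=32, t_on=0.3, cost=INT_COST)
print(f"lkg PG = {lkg_lr5:.3f} nW, int PG = {int_lr5:.3f} nW")
```

With these inputs the sketch yields about 1509.2 nW for the leakage term and about 3071.3 nW for the internal term, the latter being the LR5 value listed in Table 4.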

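When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&= \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] \\
&= 3186.96, \\
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5_{\mathrm{ON}}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5_{\mathrm{ON}}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5_{\mathrm{ON}}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5_{\mathrm{OFF}}} \right] \\
&= \left[ (53.7 + 2248.9 \cdot 0.3) + (36.3 + 4406.8 \cdot 0.3) \right] + \left[ 135.1 \cdot 0.3 + 132.0 \cdot 0.7 \right] + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] \\
&= 2245.18
\end{aligned}
\tag{8}
$$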
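The clock gating models (4) and (5) can be sketched in the same hedged way. The function names `clock_gating_leakage` and `clock_gating_internal` are hypothetical, the off time is again assumed to be one minus the on time, and the inputs are the LR5 values of Tables 2 and 3 as read here. The only structural difference between the two functions is that the sequential contribution is weighted by the on time in the internal model.

```python
def _overhead(t_on, enab, cg):
    """Enable Generator plus clock gating cell, weighted by on/off time."""
    t_off = 1.0 - t_on  # assumption: on and off fractions sum to one
    enab_on, enab_off = enab
    cg_on, cg_off = cg
    return enab_on * t_on + enab_off * t_off + cg_on * t_on + cg_off * t_off


def clock_gating_leakage(actors, t_on, enab, cg):
    """Model (4): leakage is not reduced by clock gating."""
    base = sum(p_cmb + p_reg for p_cmb, p_reg in actors)
    return base + _overhead(t_on, enab, cg)


def clock_gating_internal(actors, t_on, enab, cg):
    """Model (5): sequential cells dissipate only while the LR is on."""
    base = sum(p_cmb + p_reg * t_on for p_cmb, p_reg in actors)
    return base + _overhead(t_on, enab, cg)


# LR5 = {F, G}: (combinatorial, sequential) power of each actor [nW].
lkg_cg_lr5 = clock_gating_leakage(
    [(213, 1232), (273, 1385)], t_on=0.3, enab=(84.51, 76.54), cg=(5.77, 4.71)
)
int_cg_lr5 = clock_gating_internal(
    [(53.7, 2248.9), (36.3, 4406.8)], t_on=0.3, enab=(135.1, 132.0), cg=(16.9, 29.2)
)
print(f"lkg CG = {lkg_cg_lr5:.3f} nW, int CG = {int_cg_lr5:.3f} nW")
```

With these inputs the leakage estimate is about 3186.96 nW and the internal estimate is about 2245 nW, in line with the LR5 figures of the clock gated case.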

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

Table 1: Parameter classification. For each considered parameter, the table reports typology (architectural, functional, or technological), description, and extraction method.

Parameter                          Arch.  Funct.  Tech.  Description; extraction
reg                                x                     Number of sequential cells in the considered LR; provided by the synthesis reports.
iso                                x                     Number of estimated isolation cells in the design; obtained by counting the number of wires that connect the different LRs to each other in the dataflow model.
$T_{i_{ON}}$                              x              Activation time; found by means of a high-level (directly on the dataflow) profiling of the targeted scenario.
$T_{i_{OFF}}$                             x              Off time; found by means of a high-level (directly on the dataflow) profiling of the targeted scenario.
rtn                                       x              Number of retention cells in the design; strictly related to the LR functionality and chosen by the designer.
$P_{lkg/int}(RC)$                                 x      Power estimation of retention and isolation cells; determined by the selected synthesis process and obtainable without any implementation run.
$P_{lkg/int}(ISO_{ON})$                           x      Power estimation of retention and isolation cells; determined by the selected synthesis process and obtainable without any implementation run.
$P_{lkg/int}(ISO_{OFF})$                          x      Power estimation of retention and isolation cells; determined by the selected synthesis process and obtainable without any implementation run.
$P_{lkg/int}(Contr_{ON})$                         x      Power estimation of PG and CG controller when the LR is on or off; requires characterization according to the target technology with dedicated synthesis trials.
$P_{lkg/int}(Contr_{OFF})$                        x      Power estimation of PG and CG controller when the LR is on or off; requires characterization according to the target technology with dedicated synthesis trials.
$P_{lkg/int}(Enab_{ON})$                          x      Power estimation of PG and CG controller when the LR is on or off; requires characterization according to the target technology with dedicated synthesis trials.
$P_{lkg/int}(Enab_{OFF})$                         x      Power estimation of PG and CG controller when the LR is on or off; requires characterization according to the target technology with dedicated synthesis trials.
$P_{lkg/int}(cmb)$                 x      x       x      Power consumed by combinatorial and sequential cells within the considered LR; gathered from the reports of the baseline CGR system netlist.
$P_{lkg/int}(reg)$                 x      x       x      Power consumed by combinatorial and sequential cells within the considered LR; gathered from the reports of the baseline CGR system netlist.

Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

Parameter      lkg power [nW]   int power [nW]
Enab_ON        84.51            135.1
Enab_OFF       76.54            132.0
Contr_ON       95.44            144.9
Contr_OFF      88.63            148.8
CG_ON          5.77             16.9
CG_OFF         4.71             29.2
ISO_ON         4.27             0.27
ISO_OFF        1.39             0
P(RC)          17.15            38.325

Algorithm 1: Automatic power saving strategy selection for CGR systems.

    PG_set is empty
    CG_set is empty
    foreach LR_i in set LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG total overhead
        if PG total overhead < 0 then
            estimate CG total overhead
            if PG total overhead < CG total overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG total overhead
        if CG total overhead < 0 then
            add LR_i to CG_set
        end

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial, due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5), to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always on domain.
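The selection logic of Algorithm 1 can be condensed into a short Python sketch. It is a hedged illustration, not the Power Analysis block itself: the function name `classify_regions` is hypothetical, the three functions of Algorithm 1 are folded into one loop, and an overhead is assumed to be the estimated total power with the technique applied minus the total power of the ungated region, so that a negative value means an overall saving. The example figures are invented for illustration only.

```python
def classify_regions(regions, area_threshold):
    """Split LRs into power gated, clock gated, and always on groups.

    regions maps an LR name to (area, pg_overhead, cg_overhead); an
    overhead is negative when the technique saves power overall.
    """
    pg_set, cg_set, always_on = [], [], []
    for name, (area, pg_overhead, cg_overhead) in regions.items():
        # Power gating is evaluated only for large enough regions and
        # is kept only if it saves power and beats clock gating.
        if area > area_threshold and pg_overhead < 0:
            if pg_overhead < cg_overhead:
                pg_set.append(name)
            else:
                cg_set.append(name)
        elif cg_overhead < 0:
            cg_set.append(name)
        else:
            always_on.append(name)
    return pg_set, cg_set, always_on


# Hypothetical figures, for illustration only: (area, PG, CG overheads).
EXAMPLE = {
    "big_idle": (5000.0, -120.0, -40.0),
    "big_busy": (4000.0, 15.0, -5.0),
    "small": (200.0, -30.0, -10.0),
    "tiny": (50.0, 8.0, 2.0),
}
pg_set, cg_set, always_on = classify_regions(EXAMPLE, area_threshold=1000.0)
print("PG:", pg_set, "CG:", cg_set, "always on:", always_on)
```

Note that a small region is never power gated, even when its estimated power gating overhead is negative, because the area threshold is checked first.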

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts needed to perform the synthesis of the CGR design and the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LR classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CGPG HDL Generation block to automatically define the hybrid clock and power gating management support for the given CGR design.

Summarizing, this flow, unlike the one discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-size-fits-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.


Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

Logic region   Kernel   T_ON   Actors           iso   reg   rtn
LR1            α        0.1    B                32    514   24
LR3            β        0.6    D, E             32    8     8
LR4            α, γ     0.4    A, SB_0, SB_1    96    256   64
LR5            γ        0.3    F, G             32    265   192

Actor   lkg seq [nW]   int seq [nW]   lkg comb [nW]   int comb [nW]   reg   rtn
B       801            104987         121411          3916599         512   24
D       48             1104           51              319             4     4
E       56             1437           53              198             4     4
A       3264           89238          0               0               256   64
SB_0    0              0              307             409             -     -
SB_1    0              0              225             350             -     -
F       1232           22489          213             537             128   128
G       1385           44068          273             363             128   64

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

Logic region   lkg PG [nW]   int PG [nW]   lkg CG [nW]   int CG [nW]
LR1            124069        40435871      12229415      392870060
LR3            34267         388444        29467         3599
LR4            206211        3870508       388086        55986
LR5            150958        3071272       318696        224518

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, the MDC LR identification determined five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of the total area.
    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, leading overall to a larger power saving. Thus LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of the total area.
    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of the total area.
    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, leading overall to a larger power saving. Thus LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of the total area.
    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient, if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, leading overall to a larger power saving. Thus LR5 is added to CG_set.
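The percentages quoted in the steps above can be reproduced from the power values reported in Figure 4. The following Python sketch does so under a few assumptions: the power figures are taken as printed in the figure (only their ratios matter here), the helper names are hypothetical, and the decision rule is a simplified reading of Algorithm 1.

```python
BASE_TOTAL_NW = 4311417  # base design consumption reported in Figure 4
AREA_TH = 5.0            # area threshold [%] used in the example

# LR: (area [%], baseline power, power gated power, clock gated power),
# values as printed in Figure 4.
FIG4 = {
    "LR1": (52.0, 4143798, 416765, 4050955),
    "LR3": (0.4, 3266, 4227, 3894),
    "LR4": (7.0, 93793, 40767, 9479),
    "LR5": (15.0, 70560, 32222, 25639),
}


def overhead_pct(gated, base, total=BASE_TOTAL_NW):
    """Percentage variation of the total power; negative means saving."""
    return 100.0 * (gated - base) / total


def decide(area_pct, pg_over, cg_over, area_th=AREA_TH):
    """Simplified selection: PG only above threshold and only if it wins."""
    if area_pct > area_th and pg_over < 0 and pg_over < cg_over:
        return "PG"
    return "CG" if cg_over < 0 else "always-on"


def run_example():
    results = {}
    for name, (area, base, pg, cg) in FIG4.items():
        pg_over = overhead_pct(pg, base)
        cg_over = overhead_pct(cg, base)
        results[name] = (round(pg_over, 2), round(cg_over, 2),
                         decide(area, pg_over, cg_over))
    return results


if __name__ == "__main__":
    for name, row in run_example().items():
        print(name, row)
```

Rounded to two decimals, the sketch gives −86.45 and −2.15 for LR1 and −0.89 and −1.04 for LR5, matching the text; for LR4 it gives −1.23 and −1.96, which the example reports with one decimal as −1.2 and −2.0.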

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application of clock gating. All the remaining logic, which includes region LR3, is always on.

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1 (Area_th = 5%; base design consumption 4311417 nW). Starting from empty PG_set and CG_set, the flow ends with PG_set = {LR1} and CG_set = {LR4, LR5}. Data reported in the figure:

LR    Area [%]   Base [nW]   PG [nW] (overhead)    CG [nW] (overhead)
LR1   52         4143798     416765 (−86.45%)      4050955 (−2.15%)
LR3   0.4        3266        4227 (+0.02%)         3894 (+0.01%)
LR4   7          93793       40767 (−1.2%)         9479 (−2.0%)
LR5   15         70560       32222 (−0.89%)        25639 (−1.04%)

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \tag{9}$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a given size $N$ DFT by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to compute the outputs.
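As a software illustration of this structure (not of the hardware design itself), the sketch below builds a size 8 radix-2 decimation in time FFT out of butterflies, counts them, and compares the result with a direct DFT. It assumes the standard twiddle factor definition $e^{-i 2\pi k/n}$ and bit-reversed input ordering; all function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: X0 = x0 + x1*w, X1 = x0 - x1*w."""
    t = x1 * w
    return x0 + t, x0 - t


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative decimation in time FFT; returns (spectrum, butterflies used)."""
    n = len(x)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "size must be a power of two"
    data = [x[bit_reverse(i, stages)] for i in range(n)]
    count = 0
    size = 2
    while size <= n:                       # one iteration per stage
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                a, b = start + k, start + k + half
                data[a], data[b] = butterfly(data[a], data[b], twiddle(k, size))
                count += 1
        size *= 2
    return data, count


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, -1, -2, -3]
    spectrum, used = fft_radix2(samples)
    print(used, max(abs(a - b) for a, b in zip(spectrum, dft(samples))))
```

For 8 samples the loop runs three stages of four butterflies each, so `used` is 12, in line with the baseline design; the twiddle factors used per stage are $w_2^0$, then $w_4^0$ and $w_4^1$, then $w_8^0$ to $w_8^3$.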

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size 8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption (μW) trade-off for the 4 different 8-size FFT configurations of the Base design (1b: 24 clock cycles, 2b: 12, 4b: 6, 12b: 3).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery operated environment, when the battery level drops below a given threshold, some throughput can be sacrificed to consume less power.

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time (T_ON) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its T_ON is 0.67, which is the sum of T_ON(1b) = 0.42, T_ON(2b) = 0.21, and T_ON(4b) = 0.04.

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and the logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that impacts power consumption the most. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques actually lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 8: FFT use case (90 nm): area percentage per LR. LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%.

Figure 9: FFT use case: comparison between the estimated and real power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the overhead of the power saving strategy. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Table 7: FFT use case: detailed static and dynamic saving [%] due to power gating (PG) and clock gating (CG).

LR    PG overhead          CG overhead
      Static    Internal   Static   Internal
LR1   −40.87    −9.47      0.00     −11.33
LR2   +0.23     −1.32      0.00     −2.85
LR3   −10.44    −2.11      0.00     −2.59
LR4   −0.08     0.10       -        -
LR5   0.08      0.13       -        -
LR6   0.11      0.17       -        -
LR7   −3.44     −0.39      0.00     −0.79
LR8   1.42      2.35       0.00     −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy [%].

LR    PG overhead              Overhead error          Net
      Est      Real     Err    Iso     Rtn    Contr    Err
LR1   −25.22   −25.37   0.59   5.18    4.64   0.75     1.03
LR3   −6.28    −6.22    1.07   16.34   2.97   0.76     1.26
LR7   −1.92    −1.92    0.24   9.24    1.08   0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy [%].

LR    CG overhead            Overhead       Net
      Est     Real    Err    CGcell Err     Err
LR1   −5.73   −5.77   0.06   0.83           0.18
LR2   −1.42   −1.42   0.12   1.03           0.40
LR3   −1.34   −1.38   0.05   0.84           0.21
LR7   −0.39   −0.39   0.05   0.96           0.22
LR8   −0.12   −0.12   0.29   1.08           0.47

Figure 10: FFT use case: comparison between the estimated and real power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation.

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
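The two selections can be cross-checked with a short Python sketch that feeds the estimated overheads of Tables 8 and 9 and the area percentages of Figure 8 to a simplified version of the selection rule. The dictionary and function names are hypothetical; regions without an entry in a dictionary are treated as not eligible for that technique, which is an assumption that only holds for the two thresholds considered here.

```python
# Area percentage per LR, as read from Figure 8.
AREA_PCT = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
            "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}

# Estimated total overheads [%] (negative means saving).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}          # Table 8
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34,
          "LR7": -0.39, "LR8": -0.12}                         # Table 9
# LR4, LR5, and LR6 are fully combinatorial: no clock gating estimate.


def select(area_th):
    """Return (PG set, CG set, discarded LRs) for a given area threshold [%]."""
    pg_set, cg_set, discarded = [], [], []
    for lr, area in AREA_PCT.items():
        # Power gating is evaluated only above the area threshold.
        pg = PG_EST.get(lr) if area > area_th else None
        cg = CG_EST.get(lr)
        if pg is not None and pg < 0 and (cg is None or cg >= 0 or pg < cg):
            pg_set.append(lr)
        elif cg is not None and cg < 0:
            cg_set.append(lr)
        else:
            discarded.append(lr)
    return pg_set, cg_set, discarded


if __name__ == "__main__":
    print("DAT_5 ", select(5.0))
    print("DAT_10", select(10.0))
```

Moving the threshold from 5% to 10% only changes the treatment of LR7, which drops out of the power gated set and becomes clock gated.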

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power (μW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, little difference is visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, in Figure 12 the trade-off levels between power and latency (and, in turn, throughput) are illustrated for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption (μW) trade-off for the 4 different 8-size FFT configurations (1b: 24 clock cycles, 2b: 12, 4b: 6, 12b: 3) when gated designs are adopted (Base, CG_full, PG_full, DAT_5, DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, which consider the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1, respectively. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only to the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net Err) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.

Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON   Functionality                       LRs
abs          1        0.03   Absolute value calculation          (4)
chgb         7        0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding             (11)
sbwlabel     17       0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                         PG_set                           CG_set                                                 NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13   LR1, LR2, LR3, LR8, LR10, LR12   LR4, LR6, LR7, LR9, LR11, LR13                         LR5
DAT_10   LR1, LR8                                    LR1, LR8                         LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13   LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine, in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that in the zoom coprocessor it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario, using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that in this scenario the dynamic power consumption (due to the internal power) is considerably higher than the static one: as can be seen in the reported histograms, on average there is a difference of more than two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy [%].

LR     PG saving                  Net    CG saving                  Net
       Est       Real      Err    Err    Est       Real      Err    Err
LR1    −6.322    −6.331    0.15   0.18   −6.307    −6.313    0.09   0.17
LR2    −23.464   −23.463   0.00   1.26   −23.369   −23.373   0.02   1.67
LR3    −6.537    −6.544    0.10   1.65   −6.507    −6.514    0.10   1.59
LR4    -         -         -      -      −0.958    −0.964    0.68   1.34
LR5    -         -         -      -      -         -         -      -
LR6    -         -         -      -      −0.870    −0.878    0.81   1.22
LR7    -         -         -      -      −1.033    −1.039    0.63   0.15
LR8    −7.062    −7.058    0.05   2.02   −6.987    −6.986    0.01   1.83
LR9    -         -         -      -      −2.436    −2.443    0.27   1.55
LR10   −5.498    −5.503    0.10   1.65   −5.474    −5.480    0.11   1.57
LR11   -         -         -      -      −3.329    −3.285    1.32   3.67
LR12   −4.278    −4.288    0.23   1.83   −4.267    −4.273    0.15   1.86
LR13   -         -         -      -      −0.649    −0.656    1.01   1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power (μW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario when considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12, for both power and clock gating implementations, also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs:

(i) Base: the same as in the 90 nm synthesis trial.
(ii) PG_full: the same as in the 90 nm synthesis trial.
(iii) CG_full: the same as in the 90 nm synthesis trial.
(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the area threshold to 1% in Algorithm 1.
(v) DAT_5: the same as in the 90 nm synthesis trial.
(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of leakage (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we have decided to add, in this second trial, an additional design: DAT_1. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is therefore partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall savings than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, as the area threshold is lowered, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm becomes less beneficial. Still, in the automated flow, its presence makes the overall process more robust: it avoids useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models, targeting the 45 nm CMOS technology, is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \ast \mathit{switches}$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch off $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
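The leakage term described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the numeric values are hypothetical, and only the product of the per-switch leakage and the number of switches comes from the text.

```python
def switch_leakage_overhead(p_lkg_sw: float, n_switches: int) -> float:
    """Leakage added by the power switches of one domain: P_lkg(SW) * switches."""
    if n_switches < 0:
        raise ValueError("the number of switches cannot be negative")
    return p_lkg_sw * n_switches


def net_leakage_saving(gated_leakage: float, p_lkg_sw: float, n_switches: int) -> float:
    """Leakage removed by gating the domain minus the leakage of its switches.

    A positive result means that the switches pay for themselves.
    """
    return gated_leakage - switch_leakage_overhead(p_lkg_sw, n_switches)


if __name__ == "__main__":
    # Hypothetical values in arbitrary power units, not taken from the paper.
    print(net_leakage_saving(gated_leakage=50.0, p_lkg_sw=0.5, n_switches=8))
```

With one column of N switches per domain, as assumed in the text, `n_switches` would simply be N.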

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model, and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design    Column    LRs
DAT_1     >Th       LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
          PG set    LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
          CG set    LR6, LR13
          NA        LR5
DAT_5     >Th       LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
          PG set    LR1, LR2, LR3, LR8, LR10, LR12
          CG set    LR4, LR6, LR7, LR9, LR11, LR13
          NA        LR5
DAT_10    >Th       LR1, LR8
          PG set    LR1, LR8
          CG set    LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
          NA        LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step (values in %; "--" marks entries that are not reported).

        PG saving                      Net     CG saving                      Net
LR      Est        Real       Err     Err     Est        Real       Err     Err
LR1     −5.686     −5.699     0.23    0.47    −5.507     −5.501     0.13    0.53
LR2     −22.990    −22.982    0.03    1.02    −22.063    −22.029    0.15    0.81
LR3     −6.569     −6.566     0.04    1.13    −6.273     −6.262     0.18    0.91
LR4     −0.946     −0.944     0.15    1.82    −0.919     −0.917     0.22    1.52
LR5     --         --         --      --      --         --         --      --
LR6     −0.747     −0.748     0.25    0.25    −0.853     −0.833     0.26    1.59
LR7     −0.955     −0.957     0.17    1.36    −0.935     −0.933     0.21    1.49
LR8     −7.464     −7.460     0.06    1.36    −6.724     −6.714     0.16    0.66
LR9     −2.475     −2.482     0.25    1.10    −2.350     −2.345     0.18    1.09
LR10    −5.473     −5.470     0.04    1.09    −5.270     −5.261     0.17    0.91
LR11    −3.422     −3.361     1.79    2.58    −3.257     −3.205     1.63    2.04
LR12    −4.327     −4.330     0.08    1.19    −4.176     −4.169     0.17    0.98
LR13    −0.587     −0.580     1.25    3.70    −0.623     −0.621     0.28    1.85

phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. Thus, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
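The two percentages quoted above follow from dividing the 4-cycle wake-up time by the kernel execution length. A minimal Python sketch of this arithmetic is given below; the function and constant names are illustrative and not part of the MDC tool.

```python
WAKE_UP_CYCLES = 4  # clock cycles per switch on/off transition, as stated in the text


def wake_up_overhead(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time expressed as a fraction of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles


if __name__ == "__main__":
    # Kernel lengths taken from the text: 24 cycles (zoom) and 160 cycles (HEVC 8 x 8 block).
    for name, cycles in (("zoom, fastest kernel", 24), ("HEVC, 8 x 8 block", 160)):
        print(f"{name}: {100 * wake_up_overhead(cycles):.1f}% of the execution time")
```

The first ratio is about 16.7% (the "almost 17%" in the text), and the second one is 2.5%.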

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity (values in %).

        PG overhead
LR      Est       Real      Err
LR1     −30.54    −26.10    17.07
LR2     −0.19     −1.99     90.22
LR3     −21.83    −6.95     214.30
LR4     −0.12     −0.67     80.97
LR5     −0.06     −0.59     90.11
LR6     −0.07     0.56      86.60
LR7     −15.39    −2.64     482.93
LR8     −0.38     −0.02     1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
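The effort figures above can be reproduced with a short Python sketch. The function names are hypothetical; the counting rules are those given in the text, with fully combinatorial LRs skipping the clock gated synthesis.

```python
def effort_with_flow(n_functionalities: int) -> tuple[int, int]:
    """(syntheses, simulations) needed by the proposed flow: points 1 and 3 only."""
    return 1, n_functionalities


def effort_without_flow(n_functionalities: int, n_regions: int,
                        n_combinational: int = 0) -> tuple[int, int]:
    """(syntheses, simulations) needed when every gated design is built and simulated."""
    power_gated = n_regions                    # one power gated design per LR
    clock_gated = n_regions - n_combinational  # combinatorial LRs have no clock to gate
    syntheses = power_gated + clock_gated + 1  # plus the baseline design
    return syntheses, n_functionalities * syntheses


if __name__ == "__main__":
    for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom coprocessor", 7, 13, 1)):
        print(name, effort_with_flow(n), effort_without_flow(n, k, comb))
```

For the FFT the second function gives 14 syntheses and 56 simulations, and for the zoom coprocessor 26 syntheses and 182 simulations, matching the values reported above.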

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for addressing mapping problems in an automated manner. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflows into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding the designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Figure 1: MDC design suite: baseline flow (a), composed of DPNs parsing, datapath merging, and hardware platform generation, and corresponding power extension (b), composed of logic regions identification and power saving application.

If this number is still too high for the considered target platform, as can happen if FPGAs are the target devices, an LR merging process has to be applied. In FPGA devices, the number of hardware blocks that can drive the different LRs is limited; for example, 32 BUFG units are available in Xilinx boards for clock management purposes. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LR merging process is applied. Two LRs at a time are unified until the constraint fixed by the user is met; details on how to merge different LR sets can be found in [17], where two merging strategies, a power-aware one and a number-aware one, are presented. This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra, unnecessary power consumption.
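The pairwise merging loop described above can be sketched as follows. This is a minimal illustration, not the MDC implementation: LRs are modelled as sets of actor names, and the default pair-selection rule (unify the two LRs whose union is smallest) is an assumption made here for the example; it is not one of the power-aware or number-aware strategies detailed in [17].

```python
from itertools import combinations


def union_size(a: frozenset, b: frozenset) -> int:
    """Hypothetical merging cost: number of actors in the unified region."""
    return len(a | b)


def merge_logic_regions(regions, max_regions, cost=union_size):
    """Unify two LRs at a time until at most max_regions remain."""
    if max_regions < 1:
        raise ValueError("max_regions must be at least 1")
    merged = [frozenset(r) for r in regions]
    while len(merged) > max_regions:
        # Pick the pair of regions with the lowest merging cost.
        i, j = min(combinations(range(len(merged)), 2),
                   key=lambda pair: cost(merged[pair[0]], merged[pair[1]]))
        unified = merged[i] | merged[j]
        merged = [r for idx, r in enumerate(merged) if idx not in (i, j)]
        merged.append(unified)
    return merged


if __name__ == "__main__":
    # Hypothetical LRs and constraint, e.g. a device offering only 3 gating resources.
    lrs = [{"A"}, {"B"}, {"C", "D", "E"}, {"F", "G"}]
    print(merge_logic_regions(lrs, max_regions=3))
```

Any other selection rule can be plugged in through the `cost` argument, which is where a power-aware criterion would go.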

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.

The MDC power management extension, during the HDL generation phase, also provides the implementation of the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee of the effectiveness of the approach. Clock gating acts only on the dynamic contribution to the power consumption and requires a minimum logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR, plus one unique Enable Generator to properly set the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each flip-flop whose state has to be kept also when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A different clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol, for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore, one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switches, state retention cells, and isolation cells.

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled $\alpha$, $\beta$, and $\gamma$, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow starting from these input IRs ($\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor $A$ between $\alpha$ and $\gamma$ and actor $C$ among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha = \{A, B, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$,
(ii) $V'_\beta = \{D, E, C, \mathrm{SB\_2}\}$,
(iii) $V'_\gamma = \{A, F, G, C, \mathrm{SB\_0}, \mathrm{SB\_1}, \mathrm{SB\_2}\}$.

The compare and split identification process produces five different LRs:

(i) LR1 $= \{B\}$,
(ii) LR2 $= \{C, \mathrm{SB\_2}\}$,
(iii) LR3 $= \{D, E\}$,
(iv) LR4 $= \{A, \mathrm{SB\_0}, \mathrm{SB\_1}\}$,
(v) LR5 $= \{F, G\}$.

LR4 and LR2 involve shared resources, being activated, respectively, by $\alpha$ and $\gamma$ and by $\alpha$, $\beta$, and $\gamma$, while the remaining LRs involve nonshared resources: LR1 is activated only by $\alpha$, LR3 only by $\beta$, and LR5 only by $\gamma$.
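The five regions listed above can be reproduced with a short sketch. It is not the compare and split procedure implemented in MDC; it assumes, for illustration only, that an LR is a maximal group of actors activated by exactly the same set of input networks. The function name `identify_logic_regions` and the dictionary layout are hypothetical.

```python
from collections import defaultdict

# Actor sets of the three input networks, as listed in the example above.
networks = {
    "alpha": {"A", "B", "C", "SB_0", "SB_1", "SB_2"},
    "beta": {"D", "E", "C", "SB_2"},
    "gamma": {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"},
}

def identify_logic_regions(networks):
    """Group actors by the exact set of networks that activate them."""
    users_of = defaultdict(set)
    for name, actors in networks.items():
        for actor in actors:
            users_of[actor].add(name)
    regions = defaultdict(set)
    for actor, users in users_of.items():
        # Actors with the same activation signature end up in the same region.
        regions[frozenset(users)].add(actor)
    return dict(regions)

regions = identify_logic_regions(networks)
for users, actors in sorted(regions.items(), key=lambda kv: sorted(kv[1])):
    print(sorted(actors), "activated by", sorted(users))
```

Under this assumption, the region activated by all three networks (here $C$ and SB_2) is the one that never needs to be switched off.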

At this point, the selected power saving strategy is applied during the HDL generation of the CGR system. Figure 2 shows both final designs, resulting from the application of clock gating and of power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. Figure 2 shows that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the power domains (PDs) of the architecture where, in this case, SBoxes are also taken into consideration: power gating turns off the power supply of the whole PD, so it also affects combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is always on (being activated by all the input DPNs).
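To make the difference in logic overhead concrete, the sketch below counts the extra cells each technique would insert if applied blindly to the four switchable regions. The per-region numbers of isolation cells (iso) and retained registers (rtn) are those reported in Table 3 of this paper; the function names and the dictionary layout are illustrative assumptions.

```python
# Per-LR cell counts from Table 3 (LR2 is always on, so it is omitted).
logic_regions = {
    "LR1": {"iso": 32, "rtn": 24},
    "LR3": {"iso": 32, "rtn": 8},
    "LR4": {"iso": 96, "rtn": 64},
    "LR5": {"iso": 32, "rtn": 192},
}

def clock_gating_overhead(regions):
    """One AND gate per gated LR plus a single shared Enable Generator."""
    return {"and_gates": len(regions), "enable_generators": 1}

def power_gating_overhead(regions):
    """One switch and one clock gating cell per LR, plus isolation and retention cells."""
    return {
        "power_switches": len(regions),
        "clock_gating_cells": len(regions),
        "isolation_cells": sum(r["iso"] for r in regions.values()),
        "retention_cells": sum(r["rtn"] for r in regions.values()),
        "power_controllers": 1,
    }

cg_cells = clock_gating_overhead(logic_regions)
pg_cells = power_gating_overhead(logic_regions)
print(cg_cells)
print(pg_cells)
```

The count only covers the number of inserted cells; whether that overhead pays off depends on the power models introduced in Section 3.2.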

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a single, blindly applied power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and (2) autonomously applying the optimal power reduction technique to each LR. Power and clock gating overheads are estimated, on the basis of the LR characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. This synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks, such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the larger its static dissipation.

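The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg,ON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{1}
$$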

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkg,ON}}(\mathrm{LR}_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation in the LR when disabled; therefore, (1) does not present any $P_{\mathrm{lkg,OFF}}(\mathrm{LR}_i)$ term. $P_{\mathrm{lkg,ON}}(\mathrm{LR}_i)$ is obtained by multiplying the LR activation time $T_{i,\mathrm{ON}}$ by the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to whether they need (according to the implemented functionality) to preserve their status by means of state retention cells when the region is inactive. It involves, in turn, two terms. The first one refers to the registers whose state can be lost, and it is estimated by scaling the static consumption of the sequential cells ($P_{\mathrm{lkg}}(\mathrm{reg})$) by the fraction of registers that are not retained. The second one refers to the retention cells, and it is estimated as the number of registers whose state has to be maintained (rtn) multiplied by the leakage of a single state retention cell ($P_{\mathrm{lkg}}(\mathrm{RC})$), whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the register values). Note that, unlike $P_{\mathrm{lkg,ON}}$, $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes, for these three terms, the static consumption in both the on and off states of the LR. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{ON}})$) multiplied by the activation time $T_{i,\mathrm{ON}}$ and by the overall number of cells within the LR (e.g., iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}})$) multiplied by the inactive time $T_{i,\mathrm{OFF}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR; their consumption values cannot be retrieved directly from any ASIC library, so an a priori characterization phase is required of the designer.
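As a minimal sketch, equation (1) can be written as a function and evaluated with the LR5 values used in the worked example of Section 3.2.5 (Tables 2 and 3). The names `Actor`, `pg_static_power`, and `LKG_LIB` are hypothetical, and the code assumes that on and off times are normalized so that the off time equals one minus the on time, as in that example.

```python
from dataclasses import dataclass

@dataclass
class Actor:
    lkg_cmb: float  # leakage of the combinatorial cells [nW]
    lkg_reg: float  # leakage of all the registers of the actor [nW]
    reg: int        # number of registers
    rtn: int        # registers whose state must be retained

def pg_static_power(actors, t_on, iso, lib):
    """Evaluate equation (1) for one logic region; result in nW."""
    t_off = 1.0 - t_on  # assumption: on and off times are normalized
    active = 0.0
    for a in actors:
        # Non-retained registers keep a share (reg - rtn) / reg of the register leakage.
        lost = a.lkg_reg * (a.reg - a.rtn) / a.reg if a.reg else 0.0
        active += a.lkg_cmb + lib["RC"] * a.rtn + lost
    isolation = (lib["ISO_ON"] * t_on + lib["ISO_OFF"] * t_off) * iso
    controller = lib["Contr_ON"] * t_on + lib["Contr_OFF"] * t_off
    clock_gate = lib["CG_ON"] * t_on + lib["CG_OFF"] * t_off
    return active * t_on + isolation + controller + clock_gate

# Leakage values [nW] used for LR5 in the worked example of this paper.
LKG_LIB = {"RC": 17.15, "ISO_ON": 4.27, "ISO_OFF": 1.39,
           "Contr_ON": 95.44, "Contr_OFF": 88.63, "CG_ON": 5.77, "CG_OFF": 4.71}
lr5 = [Actor(213, 1232, 128, 128), Actor(273, 1385, 128, 64)]  # actors F and G
lr5_lkg_pg = pg_static_power(lr5, t_on=0.3, iso=32, lib=LKG_LIB)
print(round(lr5_lkg_pg, 3))
```

With these inputs the function returns approximately 1509.219 nW, the value obtained in (6). The guard on `reg` is an addition of this sketch, needed for purely combinatorial actors such as the SBoxes.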


3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Commercial tools (e.g., Cadence Encounter Digital Implementation System) frequently consider dynamic power as composed of two main terms, as shown in (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44]:

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} \mathrm{SW}_j + f \sum_{\mathrm{cell}_i} P_i \, \mathrm{SW}_i.
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each $\mathrm{net}_j$ (bearing a specific capacitance $C_{\mathrm{load}_j}$) and the related switching activity ($\mathrm{SW}_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($\mathrm{SW}_i$).

Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

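$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int,ON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{3}
$$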

Considering a prospective power gating implementation, the different parts of (3) basically mirror those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of node switching activity. For this reason, the netlist of the baseline CGR system alone is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real switching activity of the system, as provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the optimal identification of the region to be gated. We plan to extend the proposed methodology to include the net contribution in the near future.

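3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and acts positively only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power one, referring to a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right],
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}\,\mathrm{LR}_i) + P_{\mathrm{int,ON}}(\mathrm{seq}\,\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i,\mathrm{ON}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{5}
$$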

At the logic region level, (4) always considers the combinatorial and sequential contributions in both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part in both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when implementing clock gating management at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is essentially the same as (4), apart from the fact that, in the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. LR composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LR behaviour defines the region activation time and whether its status has to be preserved.

(iii) Technological Parameters. The target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.


Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation of $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definitions provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline CGR platform (with no power saving applied) and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations follows for LR5, which includes actors $F$ and $G$.

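When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 + \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 \\
&\quad + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 1509.219.
\end{aligned}
\tag{6}
$$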

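The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (537 + 383.25 \cdot 128) + (363 + 383.25 \cdot 64 + 44068 \cdot 0.5) \right] \cdot 0.3 + \left[ 2.7 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 \\
&\quad + \left[ 1449 \cdot 0.3 + 1488 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] = 30712.72.
\end{aligned}
\tag{7}
$$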

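When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] + \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 3186.96, \\
P_{\mathrm{int}}(\mathrm{LR5}) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5,\mathrm{ON}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5,\mathrm{ON}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (537 + 22489 \cdot 0.3) + (363 + 44068 \cdot 0.3) \right] + \left[ 1351 \cdot 0.3 + 1320 \cdot 0.7 \right] + \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] = 22451.8.
\end{aligned}
\tag{8}
$$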

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.
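The clock gating figures for LR5 can be checked with a few lines of code. The sketch below implements (4) and (5) in a single function; the names `cg_power`, `LIB`, and `LR5`, as well as the nested dictionary layout, are illustrative assumptions, and on and off times are assumed to be normalized so that they sum to one.

```python
def cg_power(actors, t_on, lib, kind):
    """Equation (4) for kind='lkg' and equation (5) for kind='int'; result in nW."""
    t_off = 1.0 - t_on  # assumption: on and off times are normalized
    # Leakage is unaffected by clock gating; register internal power counts only while on.
    seq_scale = 1.0 if kind == "lkg" else t_on
    region = sum(a["cmb"][kind] + a["reg"][kind] * seq_scale for a in actors)
    enable = lib["Enab_ON"][kind] * t_on + lib["Enab_OFF"][kind] * t_off
    gate = lib["CG_ON"][kind] * t_on + lib["CG_OFF"][kind] * t_off
    return region + enable + gate

# Overhead cell values [nW] used for LR5 in (8).
LIB = {
    "Enab_ON": {"lkg": 84.51, "int": 1351}, "Enab_OFF": {"lkg": 76.54, "int": 1320},
    "CG_ON": {"lkg": 5.77, "int": 169}, "CG_OFF": {"lkg": 4.71, "int": 292},
}
# Actors F and G of LR5, with the power values [nW] used in (8).
LR5 = [
    {"cmb": {"lkg": 213, "int": 537}, "reg": {"lkg": 1232, "int": 22489}},
    {"cmb": {"lkg": 273, "int": 363}, "reg": {"lkg": 1385, "int": 44068}},
]

lr5_lkg_cg = cg_power(LR5, 0.3, LIB, "lkg")
lr5_int_cg = cg_power(LR5, 0.3, LIB, "int")
print(round(lr5_lkg_cg, 2), round(lr5_int_cg, 1))
```

The leakage result matches the value in (8) to two decimal places. The internal result differs from the reported 22451.8 by about 0.3 nW, which is presumably due to rounding of the tabulated inputs.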

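Table 1: Parameter classification. For each considered parameter, the table reports its typology (architectural, functional, or technological), description, and extraction method.

| Parameter | Arch | Funct | Tech | Description | Extraction |
|---|---|---|---|---|---|
| reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs to each other in the dataflow model |
| $T_{i,\mathrm{ON}}$ | | x | | Activation time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario |
| $T_{i,\mathrm{OFF}}$ | | x | | Off time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario |
| rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{\mathrm{lkg/int}}(\mathrm{RC})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{\mathrm{lkg/int}}(\mathrm{ISO_{ON}})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{\mathrm{lkg/int}}(\mathrm{ISO_{OFF}})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{\mathrm{lkg/int}}(\mathrm{Contr_{ON}})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{Contr_{OFF}})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{Enab_{ON}})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{Enab_{OFF}})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology, with dedicated synthesis trials |
| $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |
| $P_{\mathrm{lkg/int}}(\mathrm{reg})$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |

Table 2: Contributions of static and internal power consumption extracted from the reference technology library or characterized by synthesis trials.

| Parameter | lkg power [nW] | int power [nW] |
|---|---|---|
| $\mathrm{Enab_{ON}}$ | 84.51 | 1351 |
| $\mathrm{Enab_{OFF}}$ | 76.54 | 1320 |
| $\mathrm{Contr_{ON}}$ | 95.44 | 1449 |
| $\mathrm{Contr_{OFF}}$ | 88.63 | 1488 |
| $\mathrm{CG_{ON}}$ | 5.77 | 169 |
| $\mathrm{CG_{OFF}}$ | 4.71 | 292 |
| $\mathrm{ISO_{ON}}$ | 4.27 | 2.7 |
| $\mathrm{ISO_{OFF}}$ | 1.39 | 0 |
| $P(\mathrm{RC})$ | 17.15 | 383.25 |

    PG_set is empty
    CG_set is empty
    foreach LR_i in set_LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG_total_overhead
        if PG_total_overhead < 0 then
            estimate CG_total_overhead
            if PG_total_overhead < CG_total_overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG_total_overhead
        if CG_total_overhead < 0 then
            add LR_i to CG_set
        end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switch-off technique for all of them.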

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See the evaluate_area Function in Algorithm 1). As previously discussed, power gating is a rather invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial, thanks to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See the evaluate_PG Function in Algorithm 1). The power gating cost is estimated in order to find out whether it can lead to power saving. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm moves to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See the evaluate_CG Function in Algorithm 1). The clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
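The three steps can be condensed into a short executable sketch. It is a simplified rendering of Algorithm 1, not the MDC implementation: overheads are taken as absolute differences with respect to the baseline consumption of each LR (the sign and the ordering are the same as for the percentage variations), and the input figures are the per-LR totals reported in Figure 4. The names `LRS` and `select_strategies` are hypothetical.

```python
# Per-LR area share [%] and total power [nW] for the baseline, power gated (pg),
# and clock gated (cg) versions, as reported in Figure 4 of this paper.
LRS = {
    "LR1": {"area": 52.0, "base": 4143798, "pg": 416765, "cg": 4050955},
    "LR3": {"area": 0.4, "base": 3266, "pg": 4227, "cg": 3894},
    "LR4": {"area": 7.0, "base": 93793, "pg": 40767, "cg": 9479},
    "LR5": {"area": 15.0, "base": 70560, "pg": 32222, "cg": 25639},
}

def select_strategies(lrs, area_th):
    """Classify each LR as power gated, clock gated, or always on."""
    pg_set, cg_set, always_on = [], [], []
    for name, lr in lrs.items():
        pg_over = lr["pg"] - lr["base"]  # negative means power gating saves power
        cg_over = lr["cg"] - lr["base"]  # negative means clock gating saves power
        if lr["area"] > area_th and pg_over < 0:
            # Large enough region and power gating pays off: keep the better technique.
            (pg_set if pg_over < cg_over else cg_set).append(name)
        elif cg_over < 0:
            cg_set.append(name)
        else:
            always_on.append(name)
    return pg_set, cg_set, always_on

pg_set, cg_set, always_on = select_strategies(LRS, area_th=5.0)
print(pg_set, cg_set, always_on)
```

With a 5% area threshold, this classification places LR1 in the power gated set, LR4 and LR5 in the clock gated set, and LR3 in the always-on domain.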

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LR classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating support for the given CGR design.

Summarizing, with respect to what is discussed in Section 3.1, this flow does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-size-fits-all solution: each LR in a CGR design is supported (where necessary) with the optimal power management technique.


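Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{\mathrm{ON}}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | $\alpha$ | 0.1 | B | 32 | 514 | 24 |
| LR3 | $\beta$ | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | $\alpha$, $\gamma$ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | $\gamma$ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |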

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

[Figure 3 blocks: MDC baseline flow (DPNs parsing with Orcc, datapath merging, hardware platform generation); hybrid clock and power gating extension (logic regions identification, Power Analysis, CG/PG HDL generation), exchanging scripts and reports with the synthesizer and producing the optimal HDL and CPF.]

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LR identification led to five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus, it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).

    (a) Its area is calculated: area_LR1 = 52% of the total area.

    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).

        (1) The static and dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.

        (3) Power gating is more beneficial than clock gating, yielding a larger overall power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).

    (a) Its area is calculated: area_LR3 = 0.4% of the total area.

    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).

        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.

        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).

    (a) Its area is calculated: area_LR4 = 7% of the total area.

    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

        (1) The static and dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.

        (3) Clock gating is more beneficial than power gating, yielding a larger overall power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).

    (a) Its area is calculated: area_LR5 = 15% of the total area.

    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).

        (1) The static and dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.

        (3) Clock gating is more beneficial than power gating, yielding a larger overall power saving. Thus, LR5 is added to CG_set.
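The percentage variations quoted in the steps above follow from the definition given for LR1. As a worked check, using the per-LR totals and the base design consumption reported in Figure 4 (the symbol $\Delta$ is introduced here only for illustration and does not appear in the original notation):

$$
\Delta_{\mathrm{PG}}(\mathrm{LR1}) = \frac{P_{\mathrm{PG}}(\mathrm{LR1}) - P_{\mathrm{base}}(\mathrm{LR1})}{P_{\mathrm{base}}^{\mathrm{tot}}} \cdot 100 = \frac{416765 - 4143798}{4311417} \cdot 100 \approx -86.45\%,
$$

$$
\Delta_{\mathrm{CG}}(\mathrm{LR1}) = \frac{P_{\mathrm{CG}}(\mathrm{LR1}) - P_{\mathrm{base}}(\mathrm{LR1})}{P_{\mathrm{base}}^{\mathrm{tot}}} \cdot 100 = \frac{4050955 - 4143798}{4311417} \cdot 100 \approx -2.15\%.
$$

Since both variations share the same denominator, comparing them is equivalent to comparing the absolute power of the two gated versions of the region.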

The resulting hardware designwith the hybrid applicationof clock gating and power gating is shown in Figure 5Comparing this design with the two reported in Figure 2 wecan notice as now the power gating is applied only to regionLR1 (called PD1 in the figure) while the clock gating is appliedto regions LR4 and LR5 (called CD4 and CD5 in the figure)the SBoxes SB 0 and SB 1 included in region LR4 are purelycombinatorial so they are not affected by the application

14 Journal of Electrical and Computer Engineering

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. (In the example, the area threshold Area_th is set to 5% and the base design consumption is 4311417 nW.)


Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in detail the results obtained for the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \tag{9}$$
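As an illustration, the DFT definition in (9) can be evaluated directly by summing over the $N$ input samples for each output index $k$. The snippet below is only a minimal sketch: the function name `dft` and the test vectors are assumptions made for this example and are not part of the presented flow.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition (hypothetical helper)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


# A unit impulse has a flat spectrum; a constant signal has only a DC term.
impulse = dft([1, 0, 0, 0])
constant = dft([1, 1, 1, 1])
```

The quadratic cost of this direct evaluation is what the FFT decomposition discussed next is meant to reduce.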

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called the radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. The small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$


Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.
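The butterfly and the twiddle factors can be sketched in software as follows. This is a minimal recursive decimation-in-time model, given under the assumption that the twiddle factor is $w_n^k = e^{-i 2\pi k / n}$ with $n$ the size of the sub-DFT being combined; the function names and the butterfly counter are hypothetical and do not describe the hardware design.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n) (assumed standard form)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x, counter):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2.

    counter is a one-element list used to count the executed butterflies.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], counter)
    odd = fft_radix2(x[1::2], counter)
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
        counter[0] += 1
    return out


counter = [0]
spectrum = fft_radix2([1, 2, 3, 4, 5, 6, 7, 8], counter)
```

For a size-8 input, the counter records the butterflies executed across the three stages, which can be compared with the butterfly count of the design discussed in the text.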

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce the outputs.

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations (x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power in µW).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit the requests of the external environment. For instance, in a battery-operated environment, when the battery level drops below a given threshold, some throughput can be sacrificed to consume less power.
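The run-time trade-off described above could be exploited by a policy such as the one sketched below. This is a hypothetical illustration only: the selection function and the power budgets are assumptions, while the static and internal power figures and the latencies are those reported for the four FFT configurations (Table 5 and the configuration list).

```python
# (static power [nW], internal power [nW], latency [clock cycles]) per configuration.
CONFIGS = {
    "1b": (1750407, 1307259, 24),
    "2b": (1752106, 1539855, 12),
    "4b": (1757485, 2020953, 6),
    "12b": (1776056, 2411257, 3),
}


def total_power_nw(name):
    """Static plus internal power of a configuration (net power is neglected)."""
    static, internal, _latency = CONFIGS[name]
    return static + internal


def fastest_within_budget(budget_nw):
    """Hypothetical policy: lowest-latency configuration that fits the budget."""
    feasible = [name for name in CONFIGS if total_power_nw(name) <= budget_nw]
    if not feasible:
        return None
    return min(feasible, key=lambda name: CONFIGS[name][2])
```

With a generous budget the policy picks the fully parallel 12b configuration, and it falls back to slower configurations as the budget shrinks.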

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, the activation time ($T_{ON}$) of any given LR has been obtained by summing up the activation times of the FFT configurations that activate that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}(\mathrm{1b}) = 0.42$, $T_{ON}(\mathrm{2b}) = 0.21$, and $T_{ON}(\mathrm{4b}) = 0.04$.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

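Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso($e$) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso($r$) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |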
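The $T_{ON}$ row of Table 6 can be reproduced from the per-configuration activation times and LR lists of Table 5. The following sketch shows the accumulation; the function and variable names are assumptions introduced for this example only.

```python
# Activation time of each FFT configuration and the LRs it activates (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def region_activation_times(t_on, lrs):
    """Sum, for each LR, the activation times of the configurations using it."""
    totals = {}
    for config, regions in lrs.items():
        for lr in regions:
            totals[lr] = totals.get(lr, 0.0) + t_on[config]
    # Round to two decimals, the precision used in the tables.
    return {lr: round(value, 2) for lr, value in sorted(totals.items())}


lr_t_on = region_activation_times(CONFIG_T_ON, CONFIG_LRS)
```

For instance, `lr_t_on[2]` accumulates the contributions of 1b, 2b, and 4b, which are the three configurations activating LR2.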

Figure 8: FFT use case: area percentage per LR (panel title: Area LRs, FFT, 90 nm; for each LR the sequential and combinatorial shares are also reported).

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design, it is possible to retrieve the area occupancy and the logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so that power gating should be a very suitable strategy for this LR. From Figure 8, it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration (x-axis: LR1 to LR8; y-axis: power variation in %; series: Est, Real).

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and the internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the overhead of the power saving strategy. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please note


Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

| LR | PG overhead, Static | PG overhead, Internal | CG overhead, Static | CG overhead, Internal |
|---|---|---|---|---|
| LR1 | −40.87 | −9.47 | 0.00 | −11.33 |
| LR2 | +0.23 | −1.32 | 0.00 | −2.85 |
| LR3 | −10.44 | −2.11 | 0.00 | −2.59 |
| LR4 | −0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | −3.44 | −0.39 | 0.00 | −0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | −0.25 |

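Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est | PG overhead, Real | PG overhead, Err | Overhead error, Iso | Overhead error, Rtn | Overhead error, Contr | Net, Err |
|---|---|---|---|---|---|---|---|
| LR1 | −25.22 | −25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | −6.28 | −6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | −1.92 | −1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |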

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est | CG overhead, Real | CG overhead, Err | Overhead, CGcell Err | Net, Err |
|---|---|---|---|---|---|
| LR1 | −5.73 | −5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | −1.42 | −1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | −1.34 | −1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | −0.39 | −0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | −0.12 | −0.12 | 0.29 | 1.08 | 0.47 |

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration (x-axis: LR1, LR2, LR3, LR7, LR8; y-axis: power variation in %; series: Est, Real).

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) requires a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so that the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
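The two partitions above can be reproduced with a simplified reading of the selection flow. The sketch below is not Algorithm 1 itself: the function `partition` is hypothetical, the per-LR area shares are assumed from Figure 8 (only their position relative to the thresholds matters here), and the estimated overall overheads are those listed in Tables 8 and 9. Regions with no clock gating estimate are the fully combinatorial ones.

```python
# Assumed per-LR area shares [%] (read from Figure 8).
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated overall overheads [%]; a negative value is a saving.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def partition(area_threshold):
    """Split the LRs into power gated, clock gated, and always-on sets."""
    pg_set, cg_set, always_on = [], [], []
    for lr, area in AREA.items():
        cg = CG_EST.get(lr)  # None: no clock gating benefit can be estimated
        # Power gating is evaluated only for regions above the area threshold.
        pg = PG_EST.get(lr) if area > area_threshold else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            pg_set.append(lr)
        elif cg is not None and cg < 0:
            cg_set.append(lr)
        else:
            always_on.append(lr)
    return pg_set, cg_set, always_on


dat_5 = partition(5)
dat_10 = partition(10)
```

Raising the threshold from 5% to 10% moves LR7 from the power gated set to the clock gated one, while the fully combinatorial small regions stay always on in both cases.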

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.


Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of reducing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations when gated designs are adopted (series: Base, CG_full, PG_full, DAT_5, DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only to the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overheads are also quite low (below 5% and 1%, respectively). The estimation of the isolation cells overhead is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of $P(\mathrm{ISO_{ON}})$ and $P(\mathrm{ISO_{OFF}})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net, Err) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | Actors | $T_{ON}$ | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general-purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified in the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As can be seen in the reported histograms, the difference is, on average, about two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est | PG saving, Real | PG saving, Err | Net, Err | CG saving, Est | CG saving, Real | CG saving, Err | Net, Err |
|---|---|---|---|---|---|---|---|---|
| LR1 | −6.322 | −6.331 | 0.15 | 0.18 | −6.307 | −6.313 | 0.09 | 0.17 |
| LR2 | −23.464 | −23.463 | 0.00 | 1.26 | −23.369 | −23.373 | 0.02 | 1.67 |
| LR3 | −6.537 | −6.544 | 0.10 | 1.65 | −6.507 | −6.514 | 0.10 | 1.59 |
| LR4 | - | - | - | - | −0.958 | −0.964 | 0.68 | 1.34 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | - | - | - | - | −0.870 | −0.878 | 0.81 | 1.22 |
| LR7 | - | - | - | - | −1.033 | −1.039 | 0.63 | 0.15 |
| LR8 | −7.062 | −7.058 | 0.05 | 2.02 | −6.987 | −6.986 | 0.01 | 1.83 |
| LR9 | - | - | - | - | −2.436 | −2.443 | 0.27 | 1.55 |
| LR10 | −5.498 | −5.503 | 0.10 | 1.65 | −5.474 | −5.480 | 0.11 | 1.57 |
| LR11 | - | - | - | - | −3.329 | −3.285 | 1.32 | 3.67 |
| LR12 | −4.278 | −4.288 | 0.23 | 1.83 | −4.267 | −4.273 | 0.15 | 1.86 |
| LR13 | - | - | - | - | −0.649 | −0.656 | 1.01 | 1.13 |

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead: +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead: +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.
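The way the Err columns are computed is not spelled out in this section. Assuming that they report the relative difference between estimated and real savings with respect to the real value, the power gating errors of Table 12 can be approximately recomputed as sketched below. The function name is hypothetical, and small deviations from the tabulated errors are expected, since the table entries are themselves rounded.

```python
# (estimated saving, real saving, tabulated error) for the power gated LRs of Table 12.
PG_ROWS = {
    "LR1": (-6.322, -6.331, 0.15),
    "LR2": (-23.464, -23.463, 0.00),
    "LR3": (-6.537, -6.544, 0.10),
    "LR8": (-7.062, -7.058, 0.05),
    "LR10": (-5.498, -5.503, 0.10),
    "LR12": (-4.278, -4.288, 0.23),
}


def relative_error_percent(estimated, real):
    """Relative estimation error in percent (assumed definition)."""
    return abs(estimated - real) / abs(real) * 100.0


recomputed = {lr: relative_error_percent(est, real)
              for lr, (est, real, _tabulated) in PG_ROWS.items()}
```

Under this assumption, the recomputed values stay close to the tabulated ones for all six regions.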

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are the following:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Since we are targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we have decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve truly optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when the area threshold is lowered, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial; still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(\mathrm{SW}) \ast \mathit{switches}$, where $P_{lkg}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions can be neglected.

The FFT use case is a very simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation

Journal of Electrical and Computer Engineering 23

Table13Z

oom

coprocessorat45n

mCM

OStechn

olog

ycharacteriz

ationoftheh

ybrid

clockand

powergateddesig

nsachievedwith

thep

ropo

sedautomated

flow

DAT

1areathresho

ld1

DAT

5areathresho

ld5

DAT

10areathresho

ld10N

Astands

forn

otassig

nedandinclu

destho

seLR

sthatare

placed

inthea

lwayso

ndo

main

Design

gtThPG

set

CGset

NA

DAT

1LR1LR2LR3LR4LR 6LR7LR8LR9LR10LR11LR12LR13

LR1LR2LR3LR4LR7LR8LR9LR10LR11LR12

LR6LR13

LR5

DAT

5LR1LR2LR3LR6LR8LR10LR12LR13

LR1LR2LR3LR8LR10LR12

LR4LR6LR7LR9LR11LR13

LR5

DAT

10

LR1LR8

LR1LR8

LR2LR3LR4LR6LR7LR9LR10LR11LR12LR13

LR5

24 Journal of Electrical and Computer Engineering

Table 14 Zoom coprocessor at 45 nm CMOS technology power gating overhead estimation step and clock gating overhead estimation stepaccuracy

LR PG saving Net CG saving NetEst Real Err Err Est Real Err Err

LR1 minus5686 minus5699 023 047 minus5507 minus5501 013 053LR2 minus22990 minus22982 003 102 minus22063 minus22029 015 081LR3 minus6569 minus6566 004 113 minus6273 minus6262 018 091LR4 minus0946 minus0944 015 182 minus0919 minus0917 022 152LR5 mdash mdash mdash mdash mdash mdash mdash mdashLR6 minus0747 minus0748 025 025 minus0853 minus0833 026 159LR7 minus0955 minus0957 017 136 minus0935 minus0933 021 149LR8 minus7464 minus7460 006 136 minus6724 minus6714 016 066LR9 minus2475 minus2482 025 110 minus2350 minus2345 018 109LR10 minus5473 minus5470 004 109 minus5270 minus5261 017 091LR11 minus3422 minus3361 179 258 minus3257 minus3205 163 204LR12 minus4327 minus4330 008 119 minus4176 minus4169 017 098LR13 minus0587 minus0580 125 370 minus0623 minus0621 028 185

phase of the proposed model is a real use case but it is a smallsize design where the execution of the fastest kernels lasts 24clock cyclesThen the wake-up time overhead is in the worstcase almost 17 of the total execution time If we consider abigger and more complex real use case such as interpolationfiltering for motion compensation in High Efficiency VideoCoding [47] we would achieve the condition for neglectingthe dynamic power consumption of the sleep transistorsand the wake-up time overhead This application involves 2-dimensional filters working on subblocks of pictures belong-ing to the same video sequence The smallest block cor-responding to the fastest execution time has 8 times 8 pixelsIn this case 160 overall cycles are required to filter the wholeblock and the wake-up time overhead is just 25 of the totalexecution time
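The two percentages quoted above can be checked with a few lines of Python. This is only a sketch: it assumes that the overhead is the ratio between the 4 wake-up cycles of the switch on/off protocol and the kernel execution cycles, which is the reading that reproduces the quoted figures; the function name is hypothetical.

```python
# Cycles required by one on/off transition, as stated for the protocol in [11].
WAKE_UP_CYCLES = 4


def wake_up_overhead(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time as a fraction of the kernel execution time (assumed ratio)."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles


if __name__ == "__main__":
    cases = [("zoom, fastest kernel", 24), ("HEVC interpolation, 8x8 block", 160)]
    for name, cycles in cases:
        print(f"{name}: {100 * wake_up_overhead(cycles):.1f}% of the execution time")
```

With 24 cycles the ratio is about 16.7%, and with 160 cycles it is 2.5%, in line with the values reported in the text.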

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, depicts the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

| LR | PG overhead Est | PG overhead Real | Err |
|---|---|---|---|
| LR1 | -30.54 | -26.10 | 17.07 |
| LR2 | -0.19 | -1.99 | 90.22 |
| LR3 | -21.83 | -6.95 | 214.30 |
| LR4 | -0.12 | -0.67 | 80.97 |
| LR5 | -0.06 | -0.59 | 90.11 |
| LR6 | -0.07 | 0.56 | 86.60 |
| LR7 | -15.39 | -2.64 | 482.93 |
| LR8 | -0.38 | -0.02 | 1941.81 |

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, it corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
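The counts above follow directly from $N$, $k$, and the number of fully combinatorial LRs. The following Python sketch reproduces them; the function and parameter names are illustrative and are not part of the MDC tool.

```python
def evaluation_effort(n_functionalities: int, n_regions: int, n_combinatorial: int = 0):
    """Return ((syntheses, simulations) with the flow, (syntheses, simulations) without it)."""
    # Proposed flow: one baseline synthesis and one simulation per functionality.
    with_flow = (1, n_functionalities)
    # Exhaustive evaluation: the baseline design, one power gated design per LR,
    # and one clock gated design per LR that contains sequential logic.
    designs = 1 + n_regions + (n_regions - n_combinatorial)
    without_flow = (designs, n_functionalities * designs)
    return with_flow, without_flow


if __name__ == "__main__":
    print("FFT: ", evaluation_effort(4, 8, 3))
    print("Zoom:", evaluation_effort(7, 13, 1))
```

For the FFT this gives 1 synthesis and 4 simulations against 14 and 56, and for the zoom coprocessor 1 and 7 against 26 and 182, matching the figures in the text.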

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow to hardware systems composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps further both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™-Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

[Figure 2 diagram. Panels: merging of the input IRs ($\alpha_{java}$, $\beta_{java}$, $\gamma_{java}$) into a reconfigurable IR with SBoxes SB_0, SB_1, and SB_2; logic regions identification (LR1 to LR5); power management (clock gating or power gating); HDL generation of the clock gated platform (clock domains CD1 to CD5, Enable Generator) and of the power gated platform (power domains PD1 to PD5, Power Controller, power switches, isolation and state retention cells).]

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.

one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switch, state retention, and isolation cells.

3.1.2. Step-by-Step Example. In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities, labelled $\alpha$, $\beta$, and $\gamma$, are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow, starting from these input IRs ($\alpha_{java}$, $\beta_{java}$, and $\gamma_{java}$). At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share actor $A$ between $\alpha$ and $\gamma$ and actor $C$ among all three functionalities.

Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting $V'_i$ sets are

(i) $V'_\alpha = \{A, B, C, SB\_0, SB\_1, SB\_2\}$,

(ii) $V'_\beta = \{D, E, C, SB\_2\}$,

(iii) $V'_\gamma = \{A, F, G, C, SB\_0, SB\_1, SB\_2\}$.

The compare and split identification process produces five different LRs:

(i) $LR1 = \{B\}$,

(ii) $LR2 = \{C, SB\_2\}$,

(iii) $LR3 = \{D, E\}$,

(iv) $LR4 = \{A, SB\_0, SB\_1\}$,

(v) $LR5 = \{F, G\}$.

LR4 and LR2 involve shared resources, being activated, respectively, by $\alpha$ and $\gamma$ and by $\alpha$, $\beta$, and $\gamma$, while the remaining LRs involve nonshared resources: LR1 is activated only by $\alpha$, LR3 only by $\beta$, and LR5 only by $\gamma$.
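The partition above can be reproduced by grouping actors according to the set of functionalities that activate them. The Python sketch below does exactly that; it is an illustrative equivalent for this example, not the compare and split procedure implemented in MDC, and all function and variable names are hypothetical.

```python
from collections import defaultdict


def identify_logic_regions(actor_sets):
    """Group actors by the set of functionalities that activate them.

    actor_sets maps each functionality name to the set of actors it uses.
    Returns a dict: frozenset of functionalities -> set of actors (one LR each).
    """
    activators = defaultdict(set)
    for functionality, actors in actor_sets.items():
        for actor in actors:
            activators[actor].add(functionality)

    regions = defaultdict(set)
    for actor, functionalities in activators.items():
        regions[frozenset(functionalities)].add(actor)
    return dict(regions)


# Starting sets of the step-by-step example.
V = {
    "alpha": {"A", "B", "C", "SB_0", "SB_1", "SB_2"},
    "beta": {"D", "E", "C", "SB_2"},
    "gamma": {"A", "F", "G", "C", "SB_0", "SB_1", "SB_2"},
}

regions = identify_logic_regions(V)

if __name__ == "__main__":
    for functionalities, actors in sorted(regions.items(), key=lambda kv: sorted(kv[1])):
        # A region activated by every functionality never needs to be switched off.
        note = " (always on)" if functionalities == frozenset(V) else ""
        print(sorted(actors), "activated by", sorted(functionalities), note)
```

In this example the grouping yields five regions, and the one containing $C$ and SB_2 is activated by all three functionalities, which is why it does not need any gating logic.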

At this point, the selected power saving strategy is applied during the CGR system HDL generation. Figure 2 shows both the final designs resulting from the application of clock gating and power gating.

The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD, since they are fully combinatorial modules. It can be noticed from Figure 2 that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs), where, in this case, SBoxes are also taken into consideration. Power gating turns off the whole PD power supply, and it also has an effect on combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than the clock gating one. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is an always on one (being activated by all the input DPNs).

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a blindly applied unique power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated, based on the LR characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks, such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.
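To make the idea of a per-region choice concrete, the following Python sketch applies one possible selection rule to the estimates of Table 14 and to the above-threshold sets of Table 13. The rule is an assumption inferred from those two tables, not the authors' stated algorithm: an LR is power gated only if it is above the area threshold and its estimated power gating saving is larger than the clock gating one; otherwise it is clock gated. It also assumes that a more negative value means a larger power reduction. LR5 is left out because it belongs to the always on domain.

```python
# Estimated (PG saving, CG saving) per LR, transcribed from Table 14.
# Assumption: a more negative value means a larger power reduction.
ESTIMATES = {
    "LR1": (-5.686, -5.507), "LR2": (-22.990, -22.063), "LR3": (-6.569, -6.273),
    "LR4": (-0.946, -0.919), "LR6": (-0.747, -0.853), "LR7": (-0.955, -0.935),
    "LR8": (-7.464, -6.724), "LR9": (-2.475, -2.350), "LR10": (-5.473, -5.270),
    "LR11": (-3.422, -3.257), "LR12": (-4.327, -4.176), "LR13": (-0.587, -0.623),
}

# LRs above the area threshold for each design of Table 13.
ABOVE_THRESHOLD = {
    "DAT 1": set(ESTIMATES),
    "DAT 5": {"LR1", "LR2", "LR3", "LR6", "LR8", "LR10", "LR12", "LR13"},
    "DAT 10": {"LR1", "LR8"},
}


def select_strategy(estimates, above_threshold):
    """Split LRs into a power gated set and a clock gated set (hypothetical rule)."""
    pg_set, cg_set = set(), set()
    for lr, (pg_saving, cg_saving) in estimates.items():
        # Power gating is considered only above the area threshold and chosen
        # only when it is estimated to save more than clock gating.
        if lr in above_threshold and pg_saving < cg_saving:
            pg_set.add(lr)
        else:
            cg_set.add(lr)
    return pg_set, cg_set


if __name__ == "__main__":
    for design, above in ABOVE_THRESHOLD.items():
        pg_set, cg_set = select_strategy(ESTIMATES, above)
        print(design, "PG:", sorted(pg_set), "CG:", sorted(cg_set))
```

Under these assumptions the rule reproduces the PG and CG sets listed in Table 13 for the three thresholds, which suggests it is at least consistent with the reported results.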

3.2.1. Power Gating Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the higher its corresponding static dissipation.

The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{lkg}(LR_i) &= P_{lkg_{ON}}(LR_i) + Ext\_Over_{lkg}(LR_i) \\
&= \sum_{actors \in LR_i} \left[ P_{lkg}(cmb) + P_{lkg}(RC) \cdot \#rtn + P_{lkg}(reg) \cdot \frac{\#reg - \#rtn}{\#reg} \right] \cdot Ti_{ON} \\
&\quad + \left[ P_{lkg}(ISO_{ON}) \cdot Ti_{ON} + P_{lkg}(ISO_{OFF}) \cdot Ti_{OFF} \right] \cdot \#iso \\
&\quad + \left[ P_{lkg}(Contr_{ON}) \cdot Ti_{ON} + P_{lkg}(Contr_{OFF}) \cdot Ti_{OFF} \right] \\
&\quad + \left[ P_{lkg}(CG_{ON}) \cdot Ti_{ON} + P_{lkg}(CG_{OFF}) \cdot Ti_{OFF} \right]
\end{aligned}
\tag{1}
$$

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{lkg_{ON}}(LR_i)$ corresponds to the static consumption when the LR is active, and $Ext\_Over_{lkg}(LR_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation on the LR when disabled; therefore, (1) does not present any $P_{lkg_{OFF}}(LR_i)$. $P_{lkg_{ON}}(LR_i)$ is obtained as the multiplication of the LR activation time $Ti_{ON}$ and the sum of the leakage power of the involved actors, considering separately combinatorial and sequential logic. The former, $P_{lkg}(cmb)$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers ($\#reg$) within the LR and to whether they need (according to the implemented functionality) to preserve their status by means of state retention cells when the region is inactive. It involves, in turn, two terms. The first one refers to the registers whose state can be lost, and it is estimated on the basis of the static consumption of the sequential cells ($P_{lkg}(reg)$), as an average on the number of registers that are not retained. The second one refers to the retention cells, and it is estimated starting from the number of registers whose state has to be maintained ($\#rtn$), multiplied by the leakage of a single state retention cell ($P_{lkg}(RC)$), whose value is retrieved from the target ASIC library.

$Ext\_Over_{lkg}(LR_i)$ is composed of three terms: the first one is related to the isolation cells ($\#iso$), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch off protocol requires applying clock gating at the region level before retaining the registers' values). Note that, unlike $P_{lkg_{ON}}$, for the three abovementioned terms $Ext\_Over_{lkg}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{lkg}(ISO_{ON})$) multiplied by the activation time $Ti_{ON}$ and by the overall number of cells within the LR (e.g., $\#iso$). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{lkg}(ISO_{OFF})$) multiplied by the inactive time $Ti_{OFF}$ and by the overall number of cells within the LR (e.g., $\#iso$). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR, but an a priori characterization phase is required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.
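The static model (1) can be evaluated with a short routine once the per-actor and per-cell leakage values are known. The Python sketch below is a direct transcription of (1) under two assumptions that the text does not state explicitly: the activation and inactive times are expressed as fractions of the observation window, and the retention counts are given per actor. The class names, field names, and numeric values are hypothetical and serve only to exercise the formula.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ActorLeakage:
    p_cmb: float  # P_lkg(cmb): leakage of the actor's combinatorial cells
    p_reg: float  # P_lkg(reg): leakage of the actor's sequential cells
    n_reg: int    # #reg: registers in the actor
    n_rtn: int    # #rtn: registers that need a state retention cell


@dataclass
class GatingLeakage:
    p_rc: float         # P_lkg(RC): leakage of one state retention cell
    p_iso_on: float     # P_lkg(ISO_ON), per isolation cell
    p_iso_off: float    # P_lkg(ISO_OFF), per isolation cell
    p_contr_on: float   # P_lkg(Contr_ON), Power Controller
    p_contr_off: float  # P_lkg(Contr_OFF)
    p_cg_on: float      # P_lkg(CG_ON), clock gating cell
    p_cg_off: float     # P_lkg(CG_OFF)


def pg_static_power(actors: Iterable[ActorLeakage], n_iso: int,
                    t_on: float, t_off: float, lib: GatingLeakage) -> float:
    """Evaluate equation (1) for one logic region."""
    active = 0.0
    for a in actors:
        # Fraction of registers whose state is not retained.
        not_retained = (a.n_reg - a.n_rtn) / a.n_reg if a.n_reg else 0.0
        active += a.p_cmb + lib.p_rc * a.n_rtn + a.p_reg * not_retained
    p_on = active * t_on
    iso = (lib.p_iso_on * t_on + lib.p_iso_off * t_off) * n_iso
    contr = lib.p_contr_on * t_on + lib.p_contr_off * t_off
    cg = lib.p_cg_on * t_on + lib.p_cg_off * t_off
    return p_on + iso + contr + cg


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any technology library.
    lib = GatingLeakage(2.0, 0.4, 0.2, 1.0, 0.5, 0.2, 0.1)
    lr = [ActorLeakage(p_cmb=10.0, p_reg=8.0, n_reg=4, n_rtn=1)]
    print(pg_static_power(lr, n_iso=5, t_on=0.25, t_off=0.75, lib=lib))
```

The same structure applies to the internal dynamic term of a power gated region, with the leakage values replaced by the corresponding internal power values extracted from the simulations.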

3.2.2. Power Gating Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the nodes' switching activity. Frequently, commercial tools (e.g., Cadence Encounter Digital Implementation System) consider dynamic power as composed of two main terms, as depicted by (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44].

$$
P_{dyn} = P_{net} + P_{int} = \frac{1}{2} f V_{DD}^{2} \sum_{net_j} C_{load_j} SW_j + f \sum_{cell_i} P_i SW_i
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{net}$ accounts for the load capacitance of each $net_j$ (bearing a specific capacitance $C_{load_j}$) and the related switching activity ($SW_j$), whereas $P_{int}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($SW_i$).
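As a worked illustration of (2), the Python sketch below computes the two contributions from per-net and per-cell data. It is a plain transcription of the formula; the function name, the input layout, and the numeric values are hypothetical, and the caller is assumed to provide quantities in mutually consistent units.

```python
def dynamic_power(f, vdd, nets, cells):
    """Evaluate equation (2) and return (P_net, P_int).

    f     -- operating frequency
    vdd   -- supply voltage V_DD
    nets  -- iterable of (C_load_j, SW_j) pairs, one per net
    cells -- iterable of (P_i, SW_i) pairs, one per cell
    """
    p_net = 0.5 * f * vdd ** 2 * sum(c_load * sw for c_load, sw in nets)
    p_int = f * sum(p_cell * sw for p_cell, sw in cells)
    return p_net, p_int


if __name__ == "__main__":
    # Illustrative values only.
    nets = [(2.0, 0.5), (4.0, 0.25)]
    cells = [(0.5, 0.5), (0.25, 1.0)]
    p_net, p_int = dynamic_power(f=100.0, vdd=1.0, nets=nets, cells=cells)
    print("P_net =", p_net, "P_int =", p_int, "P_dyn =", p_net + p_int)
```

Both sums weight each element by its switching activity, which is why the estimates depend so strongly on activity data collected from hardware simulations.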

Currently, the developed model is able to estimate only the $P_{int}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{int}(LR_i) ={}& P_{intON}(LR_i) + \mathrm{Ext\_Over}_{int}(LR_i) \\
={}& \sum_{actors \in LR_i} \left[ P_{int}(cmb) + P_{int}(RC) \ast rtn + P_{int}(reg) \ast \frac{reg - rtn}{reg} \right] \ast T_{iON} \\
&+ \left[ P_{int}(ISO_{ON}) \ast T_{iON} + P_{int}(ISO_{OFF}) \ast T_{iOFF} \right] \ast iso \\
&+ \left[ P_{int}(Contr_{ON}) \ast T_{iON} + P_{int}(Contr_{OFF}) \ast T_{iOFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \ast T_{iON} + P_{int}(CG_{OFF}) \ast T_{iOFF} \right]
\end{aligned}
\tag{3}
$$

Considering a prospective power gating implementation, the different parts of (3) basically reflect the ones of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of nodes switching activity. For this reason, the netlist of the baseline CGR system is not sufficient to retrieve accurate values from the power reports, and one different simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.

The $P_{net}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term seems not to affect the identification of the optimal region to be gated. We have already planned to extend the proposed methodology to include the nets contribution in the near future.

3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and it positively acts only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power one, referring to a clock gated design:

$$
\begin{aligned}
P_{lkg}(LR_i) ={}& P_{lkg}(LR_i) + \mathrm{Ext\_Over}_{lkg}(LR_i) \\
={}& \sum_{actors \in LR_i} \left[ P_{lkg}(cmb) + P_{lkg}(reg) \right] \\
&+ \left[ P_{lkg}(Enab_{ON}) \ast T_{iON} + P_{lkg}(Enab_{OFF}) \ast T_{iOFF} \right] \\
&+ \left[ P_{lkg}(CG_{ON}) \ast T_{iON} + P_{lkg}(CG_{OFF}) \ast T_{iOFF} \right]
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{int}(LR_i) ={}& P_{int}(LR_i) + \mathrm{Ext\_Over}_{int}(LR_i) \\
={}& P_{int}(comb_{LR_i}) + P_{intON}(seq_{LR_i}) + \mathrm{Ext\_Over}_{int}(LR_i) \\
={}& \sum_{actors \in LR_i} \left[ P_{int}(cmb) + P_{int}(reg) \ast T_{iON} \right] \\
&+ \left[ P_{int}(Enab_{ON}) \ast T_{iON} + P_{int}(Enab_{OFF}) \ast T_{iOFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \ast T_{iON} + P_{int}(CG_{OFF}) \ast T_{iOFF} \right]
\end{aligned}
\tag{5}
$$

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage. Equation (5), instead, always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{int}(LR_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating management is implemented at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is pretty much the same as (4), apart from the fact that, dealing with the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following.

(i) Architectural Parameters. LRs composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.


Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation about $P_{lkg/int}(cmb)$ and $P_{lkg/int}(reg)$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, here follows the detailed characterization of the equations for LR5, which includes actors $F$ and $G$.

When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{lkg}(LR5) ={}& \left[ \left( P_{lkg}(comb_F) + P_{lkg}(RC) \ast rtn_F + P_{lkg}(reg_F) \ast \frac{reg_F - rtn_F}{reg_F} \right) \right. \\
&\left. + \left( P_{lkg}(comb_G) + P_{lkg}(RC) \ast rtn_G + P_{lkg}(reg_G) \ast \frac{reg_G - rtn_G}{reg_G} \right) \right] \ast T_{5ON} \\
&+ \left[ P_{lkg}(ISO_{ON}) \ast T_{5ON} + P_{lkg}(ISO_{OFF}) \ast T_{5OFF} \right] \ast iso_5 \\
&+ \left[ P_{lkg}(Contr_{ON}) \ast T_{5ON} + P_{lkg}(Contr_{OFF}) \ast T_{5OFF} \right] \\
&+ \left[ P_{lkg}(CG_{ON}) \ast T_{5ON} + P_{lkg}(CG_{OFF}) \ast T_{5OFF} \right] \\
={}& \left[ (213 + 17.15 \ast 128) + (273 + 17.15 \ast 64 + 1385 \ast 0.5) \right] \ast 0.3 \\
&+ \left[ 4.27 \ast 0.3 + 1.39 \ast 0.7 \right] \ast 32 + \left[ 95.44 \ast 0.3 + 88.63 \ast 0.7 \right] \\
&+ \left[ 5.77 \ast 0.3 + 4.71 \ast 0.7 \right] = 1509.219
\end{aligned}
\tag{6}
$$

The internal power consumption is given by (3):

$$
\begin{aligned}
P_{int}(LR5) ={}& \left[ \left( P_{int}(comb_F) + P_{int}(RC) \ast rtn_F + P_{int}(reg_F) \ast \frac{reg_F - rtn_F}{reg_F} \right) \right. \\
&\left. + \left( P_{int}(comb_G) + P_{int}(RC) \ast rtn_G + P_{int}(reg_G) \ast \frac{reg_G - rtn_G}{reg_G} \right) \right] \ast T_{5ON} \\
&+ \left[ P_{int}(ISO_{ON}) \ast T_{5ON} + P_{int}(ISO_{OFF}) \ast T_{5OFF} \right] \ast iso_5 \\
&+ \left[ P_{int}(Contr_{ON}) \ast T_{5ON} + P_{int}(Contr_{OFF}) \ast T_{5OFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \ast T_{5ON} + P_{int}(CG_{OFF}) \ast T_{5OFF} \right] \\
={}& \left[ (537 + 383.25 \ast 128) + (363 + 383.25 \ast 64 + 44068 \ast 0.5) \right] \ast 0.3 \\
&+ \left[ 2.7 \ast 0.3 + 0 \ast 0.7 \right] \ast 32 + \left[ 1449 \ast 0.3 + 1488 \ast 0.7 \right] \\
&+ \left[ 169 \ast 0.3 + 292 \ast 0.7 \right] = 30712.72
\end{aligned}
\tag{7}
$$

When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{lkg}(LR5) ={}& \left[ \left( P_{lkg}(comb_F) + P_{lkg}(reg_F) \right) + \left( P_{lkg}(comb_G) + P_{lkg}(reg_G) \right) \right] \\
&+ \left[ P_{lkg}(Enab_{ON}) \ast T_{5ON} + P_{lkg}(Enab_{OFF}) \ast T_{5OFF} \right] \\
&+ \left[ P_{lkg}(CG_{ON}) \ast T_{5ON} + P_{lkg}(CG_{OFF}) \ast T_{5OFF} \right] \\
={}& \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \ast 0.3 + 76.54 \ast 0.7 \right] \\
&+ \left[ 5.77 \ast 0.3 + 4.71 \ast 0.7 \right] = 3186.96 \\
P_{int}(LR5) ={}& \left[ \left( P_{int}(comb_F) + P_{int}(reg_F) \ast T_{5ON} \right) + \left( P_{int}(comb_G) + P_{int}(reg_G) \ast T_{5ON} \right) \right] \\
&+ \left[ P_{int}(Enab_{ON}) \ast T_{5ON} + P_{int}(Enab_{OFF}) \ast T_{5OFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \ast T_{5ON} + P_{int}(CG_{OFF}) \ast T_{5OFF} \right] \\
={}& \left[ (537 + 22489 \ast 0.3) + (363 + 44068 \ast 0.3) \right] + \left[ 1351 \ast 0.3 + 1320 \ast 0.7 \right] \\
&+ \left[ 169 \ast 0.3 + 292 \ast 0.7 \right] = 22451.8
\end{aligned}
\tag{8}
$$
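The LR5 computation can be replayed with a short Python sketch of the four models. The dictionary layout and the function names are hypothetical; the numbers are those of the worked example, with the decimal placement of the characterization values assumed so that the static power gating, internal power gating, and static clock gating totals come out as 1509.219, 30712.72, and 3186.96 nW. With the same inputs, the internal clock gating estimate evaluates to about 22451.5 nW, within 0.3 nW of the value reported in (8).

```python
# (leakage, internal) power in nW of the overhead cells; decimal placement
# is an assumption consistent with the worked example for LR5.
TECH = {
    "Enab_ON": (84.51, 1351.0), "Enab_OFF": (76.54, 1320.0),
    "Contr_ON": (95.44, 1449.0), "Contr_OFF": (88.63, 1488.0),
    "CG_ON": (5.77, 169.0), "CG_OFF": (4.71, 292.0),
    "ISO_ON": (4.27, 2.7), "ISO_OFF": (1.39, 0.0),
    "RC": (17.15, 383.25),
}
LKG, INT = 0, 1  # index into the (leakage, internal) pairs

# Actors of LR5: (leakage, internal) power of sequential and combinatorial cells.
LR5 = [
    {"seq": (1232.0, 22489.0), "comb": (213.0, 537.0), "reg": 128, "rtn": 128},  # F
    {"seq": (1385.0, 44068.0), "comb": (273.0, 363.0), "reg": 128, "rtn": 64},   # G
]


def on_off(cell, kind, t_on):
    """Time-weighted power of an overhead cell (on for t_on, off otherwise)."""
    return TECH[cell + "_ON"][kind] * t_on + TECH[cell + "_OFF"][kind] * (1.0 - t_on)


def power_gating(actors, kind, t_on, iso):
    """Models (1) and (3): kind selects leakage (LKG) or internal (INT) power."""
    core = 0.0
    for a in actors:
        # Fraction of registers that are NOT replaced by retention cells.
        plain = (a["reg"] - a["rtn"]) / a["reg"] if a["reg"] else 0.0
        core += a["comb"][kind] + TECH["RC"][kind] * a["rtn"] + a["seq"][kind] * plain
    return (core * t_on + on_off("ISO", kind, t_on) * iso
            + on_off("Contr", kind, t_on) + on_off("CG", kind, t_on))


def clock_gating(actors, kind, t_on):
    """Models (4) and (5): leakage is unaffected, internal power of the
    sequential cells is counted only while the region is active."""
    seq_weight = 1.0 if kind == LKG else t_on
    core = sum(a["comb"][kind] + a["seq"][kind] * seq_weight for a in actors)
    return core + on_off("Enab", kind, t_on) + on_off("CG", kind, t_on)


if __name__ == "__main__":
    for label, value in [
        ("lkg PG", power_gating(LR5, LKG, 0.3, 32)),
        ("int PG", power_gating(LR5, INT, 0.3, 32)),
        ("lkg CG", clock_gating(LR5, LKG, 0.3)),
        ("int CG", clock_gating(LR5, INT, 0.3)),
    ]:
        print(f"{label}: {value:.3f} nW")
```

Keeping the leakage and internal values in the same pair lets one function serve both the static and the dynamic model, which mirrors how (1)/(3) and (4)/(5) share their structure.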

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management

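Table 1: Parameter classification. The table reports, for each considered parameter, its typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch | Funct | Tech | Description | Extraction |
|---|---|---|---|---|---|
| $reg$ | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| $iso$ | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs to each other in the dataflow model |
| $T_{iON}$ | | x | | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $T_{iOFF}$ | | x | | Off time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $rtn$ | | x | | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{lkg/int}(RC)$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{lkg/int}(ISO_{ON})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{lkg/int}(ISO_{OFF})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{lkg/int}(Contr_{ON})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Contr_{OFF})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Enab_{ON})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Enab_{OFF})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(cmb)$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |
| $P_{lkg/int}(reg)$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |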

Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

| Parameters | lkg power [nW] | int power [nW] |
|---|---|---|
| $Enab_{ON}$ | 84.51 | 1351 |
| $Enab_{OFF}$ | 76.54 | 1320 |
| $Contr_{ON}$ | 95.44 | 1449 |
| $Contr_{OFF}$ | 88.63 | 1488 |
| $CG_{ON}$ | 5.77 | 169 |
| $CG_{OFF}$ | 4.71 | 292 |
| $ISO_{ON}$ | 4.27 | 2.7 |
| $ISO_{OFF}$ | 1.39 | 0 |
| $P(RC)$ | 17.15 | 383.25 |

PG_set is empty
CG_set is empty
foreach LR_i in set LRs do
    evaluate_area(LR_i, area_th)
end

function evaluate_area(LR_i, area_th)
    calculate LR_i area
    if area_LR > area_th then
        evaluate_PG(LR_i)
    else
        evaluate_CG(LR_i)
    end

function evaluate_PG(LR_i)
    estimate PG total overhead
    if PG total overhead < 0 then
        estimate CG total overhead
        if PG total overhead < CG total overhead then
            add LR_i to PG_set
        else
            add LR_i to CG_set
        end
    else
        evaluate_CG(LR_i)
    end

function evaluate_CG(LR_i)
    estimate CG total overhead
    if CG total overhead < 0 then
        add LR_i to CG_set
    end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function in Algorithm 1). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always on domain. Thus, for small LRs we can assume that it will not bring any benefit, so that power gating is not considered for implementation. Clock gating, instead, may still be beneficial due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the one resulting from the sum of (4) and (5), to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always on domain.
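The selection logic of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the MDC implementation: the function name, the callback-based interface (area_of, pg_overhead, cg_overhead), and the returned always on list are assumptions made for the example. Overheads are expressed as signed variations of the total consumption, so a negative value means a saving.

```python
def select_strategies(regions, area_th, area_of, pg_overhead, cg_overhead):
    """Classify each logic region as power gated, clock gated, or always on.

    area_of, pg_overhead, cg_overhead: callables taking a region name and
    returning, respectively, its area and its estimated total overheads.
    """
    pg_set, cg_set = [], []

    def evaluate_cg(lr):
        # Clock gate only if it yields an overall saving.
        if cg_overhead(lr) < 0:
            cg_set.append(lr)

    def evaluate_pg(lr):
        pg = pg_overhead(lr)
        if pg < 0:
            # Both strategies are compared on their total overhead.
            if pg < cg_overhead(lr):
                pg_set.append(lr)
            else:
                cg_set.append(lr)
        else:
            evaluate_cg(lr)

    for lr in regions:
        # Area thresholding: small regions skip the power gating estimation.
        if area_of(lr) > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)

    always_on = [lr for lr in regions if lr not in pg_set and lr not in cg_set]
    return pg_set, cg_set, always_on


if __name__ == "__main__":
    # Hypothetical inputs: areas in % of the total area, overheads in %.
    area = {"LR1": 52.0, "LR3": 0.4, "LR4": 7.0, "LR5": 15.0}
    pg = {"LR1": -86.45, "LR3": 0.02, "LR4": -1.2, "LR5": -0.89}
    cg = {"LR1": -2.15, "LR3": 0.01, "LR4": -2.0, "LR5": -1.04}
    print(select_strategies(list(area), 5.0, area.get, pg.get, cg.get))
```

Passing the estimators as callables keeps the selection independent of how the overheads are obtained, whether from the models (1), (3), (4), and (5) or from postsynthesis reports.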

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CGPG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow is capable of overcoming the limit of providing a one-size-fits-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.

Table 3: Parameter and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{ON}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | $\alpha$ | 0.1 | B | 32 | 514 | 24 |
| LR3 | $\beta$ | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | $\alpha$, $\gamma$ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | $\gamma$ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of total area.
    (b) area_LR1 > area_th, so that a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is -86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is -2.15%.
        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of total area.
    (b) area_LR3 < area_th, so that a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of total area.
    (b) area_LR4 > area_th, so that a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support. (Blocks shown: the MDC baseline flow, with DPNs parsing, datapath merging, and hardware platform generation; the hybrid clock and power gating extension, with logic regions identification, power analysis, and CGPG HDL generation; the synthesizer, exchanging scripts and reports with the tool.)

        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is -1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is -2.0%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of total area.
    (b) area_LR5 > area_th, so that a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is -0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is -1.04%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to CG_set.
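The percentage overheads quoted in the steps above can be recomputed from the absolute power values reported for this example (the power of each LR in the baseline design, with power gating, and with clock gating, plus the total baseline consumption of 4311417 nW). The sketch below assumes these values and uses hypothetical names; it only illustrates the arithmetic of the overhead definition.

```python
BASE_TOTAL_NW = 4311417.0  # total consumption of the baseline design

# Per-LR power in nW: (baseline, power gated, clock gated).
LR_POWER_NW = {
    "LR1": (4143798.0, 416765.0, 4050955.0),
    "LR3": (3266.0, 4227.0, 3894.0),
    "LR4": (93793.0, 40767.0, 9479.0),
    "LR5": (70560.0, 32222.0, 25639.0),
}


def overhead_percent(p_strategy, p_base_lr, p_base_total=BASE_TOTAL_NW):
    """Signed variation of the total consumption; negative means saving."""
    return (p_strategy - p_base_lr) / p_base_total * 100.0


if __name__ == "__main__":
    for lr, (base, pg, cg) in LR_POWER_NW.items():
        print(f"{lr}: PG {overhead_percent(pg, base):+.2f}%  "
              f"CG {overhead_percent(cg, base):+.2f}%")
```

Normalizing by the consumption of the whole baseline design, rather than by that of the single region, makes the overheads of different LRs directly comparable.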

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that now power gating is applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application

| LR | Area [%] | Base [nW] | PG [nW] (over [%]) | CG [nW] (over [%]) | Outcome |
|---|---|---|---|---|---|
| LR1 | 52 | 4143798 | 416765 (-86.45) | 4050955 (-2.15) | added to PG_set |
| LR3 | 0.4 | 3266 | 4227 (+0.02) | 3894 (+0.01) | no power saving applied |
| LR4 | 7 | 93793 | 40767 (-1.2) | 9479 (-2.0) | added to CG_set |
| LR5 | 15 | 70560 | 32222 (-0.89) | 25639 (-1.04) | added to CG_set |

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1 (threshold area_th = 5%; base design consumption 4311417 nW; PG_set and CG_set are initially empty).

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology and has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario. An image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$
X_k = \sum_{n=0}^{N-1} x_n e^{-i 2 \pi k (n/N)}, \quad k = 0, 1, \ldots, N - 1. \tag{9}
$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a given size $N$ DFT by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, from the shape of their block scheme. The equations describing a butterfly are

$$
X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}
$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$
w_n^k = e^{-i 2 \pi (k/n)}, \tag{11}
$$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12 butterflies design then requires three clock periods to elaborate the outputs.
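As a software illustration of the structure just described, the following Python sketch implements a recursive radix-2 decimation in time FFT built on the butterfly of (10) and counts the butterflies it executes. It is not the hardware dataflow of the use case; the function names are hypothetical, and the twiddle factor is assumed to be the standard $e^{-i 2 \pi k / n}$.

```python
import cmath


def twiddle(k, n):
    """Standard twiddle factor exp(-i*2*pi*k/n) (assumed definition)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x):
    """Decimation in time FFT; len(x) must be a power of two.

    Returns (spectrum, number_of_butterflies_executed).
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])], 0
    even, count_even = fft_radix2(x[0::2])
    odd, count_odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out, count_even + count_odd + n // 2


def dft(x):
    """Direct evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    spectrum, butterflies = fft_radix2(samples)
    error = max(abs(a - b) for a, b in zip(spectrum, dft(samples)))
    print(butterflies, error)
```

For a size 8 input the recursion executes 4 butterflies at each of the 3 levels, that is, 12 overall, which matches the butterfly count of the baseline design.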

From the baseline 12 butterflies design, several variants have been derived by decreasing the number of involved butterflies. In such a way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size 8 FFT configurations are considered:

(i) 12b is the baseline 12 butterflies FFT design, taking 3 clock periods to finalize the transform;

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles;

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles;

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (base power in $\mu$W for 1b, 24 clock cycles; 2b, 12 clock cycles; 4b, 6 clock cycles; 12b, 3 clock cycles).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit the requests of the external environment. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be waived to consume less power.
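A run-time policy of this kind could look like the following Python sketch, which picks the fastest FFT configuration whose total (static plus internal) power fits a given budget. The policy itself and all names are hypothetical and are not part of MDC; the latency figures are those of the four configurations, and the power figures are the static and internal values reported for them at 90 nm, assumed here to be in nW.

```python
# name: (latency in clock cycles, static power, internal power) in nW.
CONFIGS = {
    "1b": (24, 1750407, 1307259),
    "2b": (12, 1752106, 1539855),
    "4b": (6, 1757485, 2020953),
    "12b": (3, 1776056, 2411257),
}


def pick_configuration(budget_nw):
    """Return the lowest-latency configuration within the power budget,
    or None if even the smallest configuration exceeds it."""
    feasible = [(latency, name)
                for name, (latency, static, internal) in CONFIGS.items()
                if static + internal <= budget_nw]
    return min(feasible)[1] if feasible else None


if __name__ == "__main__":
    for budget in (3_000_000, 3_500_000, 5_000_000):
        print(budget, pick_configuration(budget))
```

A battery monitor would lower the budget as the charge drops, trading throughput for power in the way described above.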

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating the same region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}(1b) = 0.42$, $T_{ON}(2b) = 0.21$, and $T_{ON}(4b) = 0.04$.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso(e) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso(r) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |

Figure 8: FFT use case: area percentage per LR at 90 nm (LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%), with the sequential and combinatorial share of each region.

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given in terms of percentage with respect to the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area. It is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so that power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving or not. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error due to neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained by applying (1) and (3) for a prospective power gating implementation, and (4) and (5) considering a possible clock gating implementation, are shown in Table 7. Please notice

Table 7: FFT use case: detailed static and dynamic saving due to power (PG) and clock gating (CG).

| LR | PG overhead, static [%] | PG overhead, internal [%] | CG overhead, static [%] | CG overhead, internal [%] |
|---|---|---|---|---|
| LR1 | -40.87 | -9.47 | 0.00 | -11.33 |
| LR2 | +0.23 | -1.32 | 0.00 | -2.85 |
| LR3 | -10.44 | -2.11 | 0.00 | -2.59 |
| LR4 | -0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | -3.44 | -0.39 | 0.00 | -0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | -0.25 |

Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est | PG overhead, Real | PG overhead, Err | Overhead error, Iso | Overhead error, Rtn | Overhead error, Contr | Net, Err |
|---|---|---|---|---|---|---|---|
| LR1 | -25.22 | -25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | -6.28 | -6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | -1.92 | -1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est | CG overhead, Real | CG overhead, Err | Overhead, CGcell Err | Net, Err |
|---|---|---|---|---|---|
| LR1 | -5.73 | -5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | -1.42 | -1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | -1.34 | -1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | -0.39 | -0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | -0.12 | -0.12 | 0.29 | 1.08 | 0.47 |

Figure 10: FFT use case: comparison between the estimated and real power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

that clock gating static overhead is not appreciable since onesingle clock gating cell is required per LR

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than the one achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Then, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
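The thresholding and selection steps described above can be sketched in Python as follows. This is only an illustration reconstructed from the prose, not Algorithm 1 itself (which is defined in Section 3.2.6): the names `Region` and `select_strategies` are hypothetical, and the fall-through of above-threshold regions that fail the power gating check to the clock gating check is an assumption. The overhead figures are the total estimates reported in Tables 8 and 9, while the area shares are placeholder values chosen only so that the regions fall on the same side of the 5% and 10% thresholds as in the text (only the fact that LR1 exceeds 60% of the area comes from the paper).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    area_pct: float               # share of the design area (%), ILLUSTRATIVE values
    pg_overhead: Optional[float]  # estimated PG overhead (%), None if not estimated
    cg_overhead: Optional[float]  # estimated CG overhead (%), None if fully combinatorial


def select_strategies(regions, area_threshold_pct):
    """Split regions into power gated, clock gated, and not-assigned sets."""
    pg_set, cg_set, not_assigned = [], [], []
    for r in regions:
        cg = r.cg_overhead
        # Step 1: power gating is evaluated only above the area threshold.
        if r.area_pct > area_threshold_pct and r.pg_overhead is not None:
            saves_power = r.pg_overhead < 0  # negative overhead means saving
            beats_cg = cg is None or r.pg_overhead < cg
            if saves_power and beats_cg:
                pg_set.append(r.name)
                continue
        # Step 2 (assumed): every remaining region is evaluated for clock gating.
        if cg is not None and cg < 0:
            cg_set.append(r.name)
        else:
            not_assigned.append(r.name)
    return pg_set, cg_set, not_assigned


# Overheads from Tables 8 and 9; area shares are hypothetical placeholders.
FFT_REGIONS = [
    Region("LR1", 62.0, -25.22, -5.73),
    Region("LR2", 4.0, None, -1.42),
    Region("LR3", 12.0, -6.28, -1.34),
    Region("LR4", 1.0, None, None),
    Region("LR5", 1.0, None, None),
    Region("LR6", 1.0, None, None),
    Region("LR7", 7.0, -1.92, -0.39),
    Region("LR8", 2.0, None, -0.12),
]

if __name__ == "__main__":
    for threshold in (5.0, 10.0):
        print(threshold, select_strategies(FFT_REGIONS, threshold))
```

With these inputs, the 5% threshold yields LR1, LR3, and LR7 for power gating and LR2 and LR8 for clock gating, while the 10% threshold moves LR7 to the clock gated set, matching the DAT_5 and DAT_10 compositions described in the text.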

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, shows that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted (x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power; series: Base, CG_full, PG_full, DAT_5, DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, considering, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of $P(\mathrm{ISO}_{\mathrm{ON}})$ and $P(\mathrm{ISO}_{\mathrm{OFF}})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (not including the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit, accelerating a zoom application, has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | Actors | $T_{ON}$ | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. Then, these dataflow kernels have been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, on average there are about two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est (%) | PG saving, Real (%) | PG saving, Err (%) | PG Net, Err (%) | CG saving, Est (%) | CG saving, Real (%) | CG saving, Err (%) | CG Net, Err (%) |
|---|---|---|---|---|---|---|---|---|
| LR1 | −6.322 | −6.331 | 0.15 | 0.18 | −6.307 | −6.313 | 0.09 | 0.17 |
| LR2 | −23.464 | −23.463 | 0.00 | 1.26 | −23.369 | −23.373 | 0.02 | 1.67 |
| LR3 | −6.537 | −6.544 | 0.10 | 1.65 | −6.507 | −6.514 | 0.10 | 1.59 |
| LR4 | n/a | n/a | n/a | n/a | −0.958 | −0.964 | 0.68 | 1.34 |
| LR5 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| LR6 | n/a | n/a | n/a | n/a | −0.870 | −0.878 | 0.81 | 1.22 |
| LR7 | n/a | n/a | n/a | n/a | −1.033 | −1.039 | 0.63 | 0.15 |
| LR8 | −7.062 | −7.058 | 0.05 | 2.02 | −6.987 | −6.986 | 0.01 | 1.83 |
| LR9 | n/a | n/a | n/a | n/a | −2.436 | −2.443 | 0.27 | 1.55 |
| LR10 | −5.498 | −5.503 | 0.10 | 1.65 | −5.474 | −5.480 | 0.11 | 1.57 |
| LR11 | n/a | n/a | n/a | n/a | −3.329 | −3.285 | 1.32 | 3.67 |
| LR12 | −4.278 | −4.288 | 0.23 | 1.83 | −4.267 | −4.273 | 0.15 | 1.86 |
| LR13 | n/a | n/a | n/a | n/a | −0.649 | −0.656 | 1.01 | 1.13 |

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when evaluating the power saving percentages for the considered domains, considering, respectively, power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are the following:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power. Legend shows in brackets the power management area overhead for each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total Base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm becomes less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones as reference in this work) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \cdot \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume to use only one vertical real power stripe to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est (%) | PG saving, Real (%) | PG saving, Err (%) | PG Net, Err (%) | CG saving, Est (%) | CG saving, Real (%) | CG saving, Err (%) | CG Net, Err (%) |
|---|---|---|---|---|---|---|---|---|
| LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53 |
| LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81 |
| LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91 |
| LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52 |
| LR5 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59 |
| LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49 |
| LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66 |
| LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09 |
| LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91 |
| LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04 |
| LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98 |
| LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85 |

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small-size design, where the execution of the fastest kernels lasts 24 clock cycles. Then, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
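The two wake-up overhead figures quoted above can be reproduced with a short calculation. The sketch below assumes that the overhead is computed as the 4 wake-up cycles of the switch on/off protocol divided by the kernel execution cycles, which is the ratio that matches the quoted percentages; the function name `wake_up_overhead_pct` is hypothetical.

```python
WAKE_UP_CYCLES = 4  # cycles required by the switch on/off protocol, as stated in the text


def wake_up_overhead_pct(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time (assumed ratio)."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


if __name__ == "__main__":
    # Fastest zoom kernel: 24 cycles (the text reports "almost 17%").
    print(round(wake_up_overhead_pct(24), 2))
    # Smallest HEVC interpolation block: 160 cycles (the text reports 2.5%).
    print(wake_up_overhead_pct(160))
```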

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

| LR | PG overhead, Est (%) | PG overhead, Real (%) | PG overhead, Err (%) |
|---|---|---|---|
| LR1 | −30.54 | −26.10 | 17.07 |
| LR2 | −0.19 | −1.99 | 90.22 |
| LR3 | −21.83 | −6.95 | 214.30 |
| LR4 | −0.12 | −0.67 | 80.97 |
| LR5 | −0.06 | −0.59 | 90.11 |
| LR6 | −0.07 | 0.56 | 86.60 |
| LR7 | −15.39 | −2.64 | 482.93 |
| LR8 | −0.38 | −0.02 | 1941.81 |

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, it corresponds to one single synthesis and $N$ hardware simulations. On the contrary, not using our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be done, even without adopting the proposed approach, is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
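The effort comparison above reduces to simple counting, which the following Python sketch makes explicit. The function names are hypothetical; the counts follow the text, with each fully combinatorial LR removing one clock gated synthesis from the exhaustive evaluation.

```python
def exhaustive_effort(n_functionalities, n_regions, n_combinatorial=0):
    """Syntheses and hardware simulations needed without the proposed flow."""
    power_gated = n_regions
    clock_gated = n_regions - n_combinatorial  # combinatorial LRs need no CG trial
    syntheses = power_gated + clock_gated + 1  # +1 for the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_flow_effort(n_functionalities):
    """The proposed flow needs one baseline synthesis and N simulations."""
    return 1, n_functionalities


if __name__ == "__main__":
    print("FFT:", exhaustive_effort(4, 8, 3), "vs", proposed_flow_effort(4))
    print("Zoom:", exhaustive_effort(7, 13, 1), "vs", proposed_flow_effort(7))
```

With no fully combinatorial region, `exhaustive_effort` returns the general $2k + 1$ syntheses and $N(2k + 1)$ simulations given in the text.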

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective in addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

combinatorial modules. It can be noticed from Figure 2 that the clock gating overhead is limited to four AND gates (LR2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.

The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs), where, in this case, SBoxes are also taken into consideration. Power gating turns off the whole PD power supply, and it also has an effect on combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than the clock gating one. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR2 is avoided, since this region is an always-on one (being activated by all the input DPNs).

3.2. Automated Power Management with Hybrid Clock and Power Gating. To overcome the limits of a blindly applied unique power management strategy, we propose in this paper a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension and of (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated based on the LRs characteristics before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.

The estimation is based on two sets of models that determine the static and dynamic consumptions of each LR when clock gating or power gating is applied. The proposed models are derived after a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools, from the analysis of the power reports obtained after netlist simulation. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks, such as the Enable Generator or the Power Controller. Models are technology-dependent, since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.

3.2.1. Power Gating: Static Power Consumption Model. Static power can be estimated on the basis of the leakage contributions provided for each cell by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the higher its corresponding static dissipation.

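The proposed model for the static power consumption is defined as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkgON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{lkg}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{1}
$$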

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{\mathrm{lkgON}}(\mathrm{LR}_i)$ corresponds to the static consumption when the LR is active, and $\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation on the LR when disabled; therefore, (1) does not present any $P_{\mathrm{lkgOFF}}(\mathrm{LR}_i)$. $P_{\mathrm{lkgON}}(\mathrm{LR}_i)$ is obtained as the multiplication of the LR activation time $T_{i,\mathrm{ON}}$ and the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{\mathrm{lkg}}(\mathrm{cmb})$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (reg) within the LR and to their need (according to the implemented functionality) of preserving or not their status by means of state retention cells when the region is inactive. It involves, in turn, two terms. The first one refers to the registers whose state can be lost, and it is estimated from the static consumption of the sequential cells ($P_{\mathrm{lkg}}(\mathrm{reg})$) scaled by the fraction of registers that are not retained, $(\mathrm{reg} - \mathrm{rtn})/\mathrm{reg}$. The second one refers to the retention cells, and it is estimated starting from the number of registers whose state has to be maintained (rtn), multiplied by the leakage of a single state retention cell ($P_{\mathrm{lkg}}(\mathrm{RC})$), whose value is retrieved from the target ASIC library.

$\mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i)$ is composed of three terms: the first one is related to the isolation cells (iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the registers value). Note that, unlike $P_{\mathrm{lkgON}}$, for the three abovementioned terms $\mathrm{Ext\_Over}_{\mathrm{lkg}}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{ON}})$) multiplied by the activation time $T_{i,\mathrm{ON}}$ and by the overall number of cells within the LR (e.g., iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}})$) multiplied by the inactive time $T_{i,\mathrm{OFF}}$ and by the overall number of cells within the LR (e.g., iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR, but an a priori characterization phase is required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.

3.2.2. Power Gating: Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the nodes switching activity. Frequently, commercial tools (e.g., Cadence Encounter Digital Implementation System) consider dynamic power as composed of two main terms, as depicted by (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44]:

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} \mathrm{SW}_j + f \sum_{\mathrm{cell}_i} P_i \, \mathrm{SW}_i.
\tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each $\mathrm{net}_j$ (bearing a specific capacitance $C_{\mathrm{load}_j}$) and the related switching activity ($\mathrm{SW}_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($\mathrm{SW}_i$).

Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:
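As a rough illustration of how the two terms of (2) combine, the following Python sketch evaluates them for a tiny, purely hypothetical design. The class names, the units, and all numeric values are assumptions made for this example; they are not taken from the synthesis tool or from the paper's data.

```python
from dataclasses import dataclass


@dataclass
class Net:
    c_load: float  # load capacitance C_load_j, in farads
    sw: float      # switching activity SW_j (toggles per clock cycle)


@dataclass
class Cell:
    p_per_mhz: float  # internal power per MHz P_i, in nW/MHz
    sw: float         # switching activity SW_i


def net_power_w(f_hz: float, vdd: float, nets: list[Net]) -> float:
    """P_net = 1/2 * f * VDD^2 * sum_j(C_load_j * SW_j), in watts."""
    return 0.5 * f_hz * vdd ** 2 * sum(n.c_load * n.sw for n in nets)


def internal_power_nw(f_mhz: float, cells: list[Cell]) -> float:
    """P_int = f * sum_i(P_i * SW_i), in nW when P_i is given in nW/MHz."""
    return f_mhz * sum(c.p_per_mhz * c.sw for c in cells)


# Hypothetical design: two nets and two cells clocked at 100 MHz, VDD = 1.0 V.
nets = [Net(c_load=2e-15, sw=0.1), Net(c_load=5e-15, sw=0.2)]
cells = [Cell(p_per_mhz=0.5, sw=0.2), Cell(p_per_mhz=1.0, sw=0.1)]

p_net_nw = net_power_w(100e6, 1.0, nets) * 1e9  # convert W to nW
p_int_nw = internal_power_nw(100.0, cells)
p_dyn_nw = p_net_nw + p_int_nw
print(f"P_net = {p_net_nw:.1f} nW, P_int = {p_int_nw:.1f} nW, P_dyn = {p_dyn_nw:.1f} nW")
```

Both terms scale linearly with the frequency, while only the net term depends on the supply voltage and the wire loads.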

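$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{intON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \cdot \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \cdot T_{i,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \cdot \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{3}
$$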

Considering a prospective power gating implementation, the different parts of (3) basically reflect the ones of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of nodes switching activity. For this reason, the netlist of the baseline CGR system is not sufficient to retrieve accurate values from the power reports, and one different simulation of the netlist for every implemented functionality is required. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term seems not to affect the optimal identification of the region to be gated. We have already planned to extend the proposed methodology to include the nets contribution in the near future.

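3.2.3. Clock Gating: Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and it positively acts only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power one, referring to a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right],
\end{aligned}
\tag{4}
$$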

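$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{\mathrm{LR}_i}) + P_{\mathrm{intON}}(\mathrm{seq}_{\mathrm{LR}_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \cdot T_{i,\mathrm{ON}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{i,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{i,\mathrm{OFF}} \right].
\end{aligned}
\tag{5}
$$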

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating management is implemented at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is pretty much the same as (4), apart from the fact that, dealing with the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. LRs composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the different cells characterization.

Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation about $P_{\mathrm{lkg/int}}(\mathrm{cmb})$ and $P_{\mathrm{lkg/int}}(\mathrm{reg})$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations for LR5, which includes actors F and G, follows.

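When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{lkg}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{lkg}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{ISO_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{ISO_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Contr_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Contr_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (21.3 + 1.715 \cdot 128) + (27.3 + 1.715 \cdot 64 + 138.5 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 0.427 \cdot 0.3 + 0.139 \cdot 0.7 \right] \cdot 32 + \left[ 9.544 \cdot 0.3 + 8.863 \cdot 0.7 \right] \\
&\quad + \left[ 0.577 \cdot 0.3 + 0.471 \cdot 0.7 \right] = 150.9219.
\end{aligned}
\tag{6}
$$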

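The internal power consumption is given by (3):

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_F + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot \frac{\mathrm{reg}_F - \mathrm{rtn}_F}{\mathrm{reg}_F} \right) \right. \\
&\quad \left. + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{RC}) \cdot \mathrm{rtn}_G + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot \frac{\mathrm{reg}_G - \mathrm{rtn}_G}{\mathrm{reg}_G} \right) \right] \cdot T_{5,\mathrm{ON}} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{ISO_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \cdot \mathrm{iso}_5 \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Contr_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (53.7 + 38.325 \cdot 128) + (36.3 + 38.325 \cdot 64 + 4406.8 \cdot 0.5) \right] \cdot 0.3 \\
&\quad + \left[ 0.27 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 + \left[ 144.9 \cdot 0.3 + 148.8 \cdot 0.7 \right] \\
&\quad + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] = 3071.272.
\end{aligned}
\tag{7}
$$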
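The worked evaluations of (1) and (3) for LR5 can be cross-checked with a short Python sketch. This is only an illustration: the class and function names are hypothetical, the off time is assumed to be 1 minus the on time, and the decimal placement of the values read from Tables 2 and 3 (all in nW) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    p_cmb: float  # power of the combinatorial cells [nW]
    p_reg: float  # power of the sequential cells [nW]
    reg: int      # number of registers
    rtn: int      # number of registers needing state retention


@dataclass
class PgTech:
    p_rc: float       # one state retention cell [nW]
    iso_on: float     # one isolation cell, LR on [nW]
    iso_off: float    # one isolation cell, LR off [nW]
    contr_on: float   # Power Controller, LR on [nW]
    contr_off: float  # Power Controller, LR off [nW]
    cg_on: float      # clock gating cell, LR on [nW]
    cg_off: float     # clock gating cell, LR off [nW]


def pg_power(actors, iso, t_on, tech):
    """Evaluate the structure shared by (1) and (3) for one logic region."""
    t_off = 1.0 - t_on  # assumption: on and off times are normalized fractions
    active = sum(
        a.p_cmb + tech.p_rc * a.rtn
        + (a.p_reg * (a.reg - a.rtn) / a.reg if a.reg else 0.0)
        for a in actors
    ) * t_on
    isolation = (tech.iso_on * t_on + tech.iso_off * t_off) * iso
    controller = tech.contr_on * t_on + tech.contr_off * t_off
    clock_gate = tech.cg_on * t_on + tech.cg_off * t_off
    return active + isolation + controller + clock_gate


# LR5 (actors F and G): leakage values, then internal power values.
lkg = pg_power(
    [Actor(21.3, 123.2, 128, 128), Actor(27.3, 138.5, 128, 64)],
    iso=32, t_on=0.3,
    tech=PgTech(1.715, 0.427, 0.139, 9.544, 8.863, 0.577, 0.471),
)
internal = pg_power(
    [Actor(53.7, 2248.9, 128, 128), Actor(36.3, 4406.8, 128, 64)],
    iso=32, t_on=0.3,
    tech=PgTech(38.325, 0.27, 0.0, 144.9, 148.8, 16.9, 29.2),
)
print(f"P_lkg(LR5) = {lkg:.4f} nW, P_int(LR5) = {internal:.3f} nW")
```

Under these assumptions the sketch gives about 150.92 nW for the static term and about 3071.27 nW for the internal term, the latter matching the LR5 internal power gating entry of Table 4.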

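When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{lkg}}(\mathrm{comb}_F) + P_{\mathrm{lkg}}(\mathrm{reg}_F) \right) + \left( P_{\mathrm{lkg}}(\mathrm{comb}_G) + P_{\mathrm{lkg}}(\mathrm{reg}_G) \right) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{Enab_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{lkg}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (21.3 + 123.2) + (27.3 + 138.5) \right] + \left[ 8.451 \cdot 0.3 + 7.654 \cdot 0.7 \right] \\
&\quad + \left[ 0.577 \cdot 0.3 + 0.471 \cdot 0.7 \right] = 318.696, \\
P_{\mathrm{int}}(\mathrm{LR}_5) &= \left[ \left( P_{\mathrm{int}}(\mathrm{comb}_F) + P_{\mathrm{int}}(\mathrm{reg}_F) \cdot T_{5,\mathrm{ON}} \right) + \left( P_{\mathrm{int}}(\mathrm{comb}_G) + P_{\mathrm{int}}(\mathrm{reg}_G) \cdot T_{5,\mathrm{ON}} \right) \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{Enab_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG_{ON}}) \cdot T_{5,\mathrm{ON}} + P_{\mathrm{int}}(\mathrm{CG_{OFF}}) \cdot T_{5,\mathrm{OFF}} \right] \\
&= \left[ (53.7 + 2248.9 \cdot 0.3) + (36.3 + 4406.8 \cdot 0.3) \right] + \left[ 135.1 \cdot 0.3 + 132.0 \cdot 0.7 \right] \\
&\quad + \left[ 16.9 \cdot 0.3 + 29.2 \cdot 0.7 \right] = 2245.18.
\end{aligned}
\tag{8}
$$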
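The clock gating models (4) and (5) can be sketched in the same way. Again, the names below are hypothetical, the off time is assumed to be 1 minus the on time, and the decimal placement of the LR5 values taken from Tables 2 and 3 (in nW) is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    p_cmb: float  # power of the combinatorial cells [nW]
    p_reg: float  # power of the sequential cells [nW]


@dataclass
class CgTech:
    enab_on: float   # Enable Generator, LR on [nW]
    enab_off: float  # Enable Generator, LR off [nW]
    cg_on: float     # clock gating cell, LR on [nW]
    cg_off: float    # clock gating cell, LR off [nW]


def cg_overhead(t_on, tech):
    """Ext_Over: Enable Generator plus clock gating cell, weighted by time."""
    t_off = 1.0 - t_on  # assumption: on and off times are normalized fractions
    return (tech.enab_on * t_on + tech.enab_off * t_off
            + tech.cg_on * t_on + tech.cg_off * t_off)


def cg_static(actors, t_on, tech):
    """Equation (4): leakage is unaffected by clock gating."""
    return sum(a.p_cmb + a.p_reg for a in actors) + cg_overhead(t_on, tech)


def cg_internal(actors, t_on, tech):
    """Equation (5): sequential cells contribute only while the LR is on."""
    return sum(a.p_cmb + a.p_reg * t_on for a in actors) + cg_overhead(t_on, tech)


# LR5 (actors F and G) with T_ON = 0.3.
lkg = cg_static([Actor(21.3, 123.2), Actor(27.3, 138.5)], 0.3,
                CgTech(8.451, 7.654, 0.577, 0.471))
internal = cg_internal([Actor(53.7, 2248.9), Actor(36.3, 4406.8)], 0.3,
                       CgTech(135.1, 132.0, 16.9, 29.2))
print(f"P_lkg(LR5) = {lkg:.3f} nW, P_int(LR5) = {internal:.2f} nW")
```

With these inputs the static term evaluates to about 318.696 nW, and the internal term to about 2245.15 nW, which is within rounding of the value reported in (8).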

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management

Table 1: Parameter classification. Table reports, for each considered parameter, typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch | Funct | Tech | Description | Extraction |
|---|---|---|---|---|---|
| reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs among each other in the dataflow model |
| Ti_ON | | x | | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| Ti_OFF | | x | | Off time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality and is chosen by the designer |
| P_lkg/int(RC), P_lkg/int(ISO_ON), P_lkg/int(ISO_OFF) | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process and can be obtained without any implementation run |
| P_lkg/int(Contr_ON), P_lkg/int(Contr_OFF), P_lkg/int(Enab_ON), P_lkg/int(Enab_OFF) | | | x | Power estimation of PG and CG controller when the LR is on or off | It requires being characterized according to the target technology with dedicated synthesis trials |
| P_lkg/int(cmb), P_lkg/int(reg) | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered by the reports of the baseline CGR system netlist |

Table 2: Contributions of static and internal power consumption extracted from the reference technology library or characterized by synthesis trials.

| Parameters | lkg power [nW] | int power [nW] |
|---|---|---|
| Enab_ON | 8.451 | 135.1 |
| Enab_OFF | 7.654 | 132.0 |
| Contr_ON | 9.544 | 144.9 |
| Contr_OFF | 8.863 | 148.8 |
| CG_ON | 0.577 | 16.9 |
| CG_OFF | 0.471 | 29.2 |
| ISO_ON | 0.427 | 0.27 |
| ISO_OFF | 0.139 | 0 |
| P(RC) | 1.715 | 38.325 |

PG_set is empty
CG_set is empty
foreach LR_i in set LRs do
    evaluate_area(LR_i, area_th)
end

function evaluate_area(LR_i, area_th)
    calculate LR_i area
    if area_LR > area_th then
        evaluate_PG(LR_i)
    else
        evaluate_CG(LR_i)
    end

function evaluate_PG(LR_i)
    estimate PG_total_overhead
    if PG_total_overhead < 0 then
        estimate CG_total_overhead
        if PG_total_overhead < CG_total_overhead then
            add LR_i to PG_set
        else
            add LR_i to CG_set
        end
    else
        evaluate_CG(LR_i)
    end

function evaluate_CG(LR_i)
    estimate CG_total_overhead
    if CG_total_overhead < 0 then
        add LR_i to CG_set
    end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function in Algorithm 1). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs we can assume that it will not bring any benefit, so that power gating is not to be considered for implementation. Clock gating, instead, may still be beneficial due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
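The three steps above can be sketched in Python as follows. This is a minimal illustration and not the MDC implementation: the `LogicRegion` fields, the definition of an overhead as the gated total power minus the ungated baseline, and all the numbers in the example are assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    area_pct: float  # LR area as a percentage of the total area
    baseline: float  # estimated total power with no gating [nW]
    pg_total: float  # estimated total power if power gated, (1) + (3) [nW]
    cg_total: float  # estimated total power if clock gated, (4) + (5) [nW]


def select_strategies(lrs, area_th):
    """Classify each LR as power gated, clock gated, or always on."""
    pg_set, cg_set, always_on = [], [], []

    def evaluate_cg(lr):
        # Clock gate only if it lowers the total power (negative overhead).
        if lr.cg_total - lr.baseline < 0:
            cg_set.append(lr.name)
        else:
            always_on.append(lr.name)

    def evaluate_pg(lr):
        pg_overhead = lr.pg_total - lr.baseline
        if pg_overhead < 0:
            cg_overhead = lr.cg_total - lr.baseline
            if pg_overhead < cg_overhead:
                pg_set.append(lr.name)
            else:
                cg_set.append(lr.name)
        else:
            evaluate_cg(lr)

    for lr in lrs:
        # Area thresholding: small regions skip the power gating evaluation.
        if lr.area_pct > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)
    return pg_set, cg_set, always_on


# Hypothetical regions (names and numbers are illustrative only).
regions = [
    LogicRegion("LRa", 40.0, 1000.0, 300.0, 600.0),
    LogicRegion("LRb", 2.0, 100.0, 50.0, 80.0),
    LogicRegion("LRc", 10.0, 100.0, 120.0, 90.0),
    LogicRegion("LRd", 1.0, 10.0, 30.0, 12.0),
]
pg_set, cg_set, always_on = select_strategies(regions, area_th=5.0)
print(pg_set, cg_set, always_on)
```

In this made-up scenario, LRa is large and saves the most with power gating, LRb is below the area threshold and is clock gated, LRc is large but gains nothing from power gating and falls back to clock gating, and LRd stays in the always-on domain.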

Figure 3 provides an overviewof themodified design flowAs it can be noticed theMDC tool and its powermanagementextension are directly interfaced with the logic synthesizerAlgorithm 1 is implemented within the Power Analysis blockMDC baseline tool provides the HDL description of the plainCGR system and all the scripts to perform the synthesis ofthe CGR design and all the different hardware simulations(one for each inputDPNs) as required by the proposed powerestimationmodelsThe power reports are then fed back to theMDC power management extension and parsed within thePower Analysis to execute Algorithm 1 The LRs classification(see LR class in Figure 3) generated by the Power Analysisblock is used by the CGPG HDL Generation block toautomatically define the hybrid clock and power gatingpower management support for the given CGR design

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-size-fits-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.


Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

Logic region   Kernel   T_ON   Actors           iso   reg   rtn
LR1            α        0.1    B                32    514   24
LR3            β        0.6    D, E             32    8     8
LR4            α, γ     0.4    A, SB_0, SB_1    96    256   64
LR5            γ        0.3    F, G             32    265   192

Actor   lkg seq [nW]   int seq [nW]   lkg comb [nW]   int comb [nW]   reg   rtn
B       801            104987         121411          3916599         512   24
D       48             1104           51              319             4     4
E       56             1437           53              198             4     4
A       3264           89238          0               0               256   64
SB_0    0              0              307             409             --    --
SB_1    0              0              225             350             --    --
F       1232           22489          213             537             128   128
G       1385           44068          273             363             128   64

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

Logic region   lkg PG [nW]   int PG [nW]   lkg CG [nW]   int CG [nW]
LR1            12406.9       404358.71     122294.15     3928700.60
LR3            342.67        3884.44       294.67        3599
LR4            2062.11       38705.08      3880.86       5598.6
LR5            1509.58       30712.72      3186.96       22451.8

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique is blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of total area.
    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from the power consumption of the LR when PG is applied; the result is then divided by the total power consumption of the baseline design in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of total area.
    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of total area.
    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support. [Block diagram. MDC baseline flow: DPNs parsing (ORCC front end), datapath merging, hardware platform generation (HDL actors library, communication protocol, HDL generation). Hybrid clock and power gating extension: logic regions identification, Power Analysis (fed by the synthesizer reports and producing the LR class), CGPG HDL generation (power saving application), with optimal HDL + CPF as output. The synthesizer receives the HDL and the scripts and returns the reports.]

        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of total area.
    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus LR5 is added to CG_set.
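The decision flow walked through above can be condensed into a few lines of Python. The sketch below is only an illustration and not the MDC implementation: the function name `classify_lrs`, the dictionary layout, and the choice of passing pre-computed overhead percentages as inputs are assumptions made for this example (negative values denote a saving).

```python
def classify_lrs(regions, area_th):
    """Split logic regions into PG_set, CG_set and always-on regions.

    regions maps an LR name to (area_percent, pg_overhead, cg_overhead);
    overheads are percentages of the baseline total power, negative = saving.
    """
    pg_set, cg_set, always_on = [], [], []
    for name, (area, pg_over, cg_over) in regions.items():
        # Power gating is considered only above the area threshold, and it is
        # kept only when it saves power and saves more than clock gating.
        if area > area_th and pg_over < 0 and pg_over < cg_over:
            pg_set.append(name)
        # Otherwise clock gating is kept only if its overhead is negative.
        elif cg_over < 0:
            cg_set.append(name)
        else:
            always_on.append(name)
    return pg_set, cg_set, always_on


# Values of the step-by-step example (area threshold set to 5%).
example = {
    "LR1": (52.0, -86.45, -2.15),
    "LR3": (0.4, 0.02, 0.01),
    "LR4": (7.0, -1.2, -2.0),
    "LR5": (15.0, -0.89, -1.04),
}
pg_set, cg_set, always_on = classify_lrs(example, area_th=5.0)
print(pg_set, cg_set, always_on)
```

With the example values, LR1 ends up in the power gated set, LR4 and LR5 in the clock gated set, and LR3 in the always-on logic, matching the classification derived step by step.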

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. [Flow diagram. The area threshold is Area_th = 5% and the base design consumption is 4311417 nW; PG_set and CG_set are initially empty. For each LR, the figure reports the area, the baseline consumption, and the consumption (with the percentage overhead on the total) under power gating and clock gating.]

LR    Area    Base (nW)   PG (nW) (over.)      CG (nW) (over.)
LR1   52%     4143798     416765 (−86.45%)     4050955 (−2.15%)
LR3   0.4%    3266        4227 (+0.02%)        3894 (+0.01%)
LR4   7%      93793       40767 (−1.2%)        9479 (−2.0%)
LR5   15%     70560       32222 (−0.89%)       25639 (−1.04%)

Final classification: PG_set = {LR1}, CG_set = {LR4, LR5}; no power saving is applied to LR3.

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies. [Block diagram. Power domain PD1 with power switch, isolation (ISOL) and state retention cells; clock domains CD4 and CD5 driven by clock gating (CG) cells; a power controller drives the gating logic.]

of the clock gating. All the remaining logic, which includes region LR3, is always on.
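The percentage variations quoted in the example can be checked directly from the per-region consumption values reported in Figure 4. The snippet below is a minimal sketch of that arithmetic; the names `variation_percent` and `POWER_NW` are hypothetical and introduced only for this illustration.

```python
BASE_TOTAL_NW = 4311417  # total consumption of the baseline design (Figure 4)

# Per-LR consumption in nW: (baseline, with power gating, with clock gating).
POWER_NW = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}


def variation_percent(base_nw, gated_nw, total_nw=BASE_TOTAL_NW):
    """Power variation of one LR as a percentage of the baseline total power."""
    return 100.0 * (gated_nw - base_nw) / total_nw


for lr, (base, pg, cg) in POWER_NW.items():
    pg_var = round(variation_percent(base, pg), 2)
    cg_var = round(variation_percent(base, cg), 2)
    print(f"{lr}: PG {pg_var:+.2f}%  CG {cg_var:+.2f}%")
```

For LR4 the two-decimal results are −1.23% and −1.96%, which the example reports rounded to one decimal as −1.2% and −2.0%.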

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experimental results conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \tag{9}$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11). [Signal flow graph. Inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) to X(7); twiddle factors $w_2^0$ in the first stage, $w_4^0$ and $w_4^1$ in the second stage, and $w_8^0$ to $w_8^3$ in the third stage.]

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce the outputs.
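To make the structure of the size-8 transform concrete, the following sketch implements a textbook iterative radix-2 decimation-in-time FFT in software and counts the butterflies it executes. It is an illustration of the algorithm only, not a model of the hardware dataflow networks; the function names (`twiddle`, `dft`, `fft_radix2`) are chosen for this example.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def dft(x):
    """Direct DFT, used here only as a reference."""
    size = len(x)
    return [sum(x[n] * twiddle(k * n, size) for n in range(size)) for k in range(size)]


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 DIT FFT; returns (spectrum, number of butterflies)."""
    size = len(x)
    stages = size.bit_length() - 1
    assert 1 << stages == size, "size must be a power of two"
    # Inputs are consumed in bit-reversed order, as in the signal flow graph.
    data = [x[bit_reverse(i, stages)] for i in range(size)]
    butterflies = 0
    span = 2
    while span <= size:
        half = span // 2
        for start in range(0, size, span):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half] * twiddle(k, span)
                # Butterfly: X0 = x0 + x1*w, X1 = x0 - x1*w.
                data[start + k] = a + b
                data[start + k + half] = a - b
                butterflies += 1
        span *= 2
    return data, butterflies


samples = [1, 2, 3, 4, 5, 6, 7, 8]
spectrum, count = fft_radix2(samples)
print(count)
```

For eight samples the loop runs three stages of four butterflies each, and the result agrees with the direct DFT up to floating-point error.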

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations. [Plot of the Base design power (µW, axis from 3000 to 5000) for the configurations 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit external environment requests. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be waived to consume less power.
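As a purely illustrative sketch of such a run-time policy, the snippet below picks the fastest FFT configuration that fits a given power budget. The selection policy, the function name `select_config`, and the idea of expressing the battery state as a power budget are assumptions of this example, not part of the described system; latencies are those listed above, and the power figures are the static plus internal contributions reported in Table 5 (the net term is not included).

```python
# (latency in clock cycles, static + internal power in nW) per configuration.
CONFIGS = {
    "1b": (24, 1750407 + 1307259),
    "2b": (12, 1752106 + 1539855),
    "4b": (6, 1757485 + 2020953),
    "12b": (3, 1776056 + 2411257),
}


def select_config(power_budget_nw):
    """Return the lowest-latency configuration whose power fits the budget.

    Returns None when no configuration fits (hypothetical policy).
    """
    feasible = [
        (latency, name)
        for name, (latency, power) in CONFIGS.items()
        if power <= power_budget_nw
    ]
    if not feasible:
        return None
    return min(feasible)[1]


print(select_config(4_500_000), select_config(3_500_000))
```

A generous budget selects the 12-butterfly configuration, while tightening it trades latency for power by falling back to the variants with fewer butterflies.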

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934
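The activation times in Table 6 follow mechanically from Table 5: each LR accumulates the activation time of every configuration that uses it. A minimal sketch of this bookkeeping is given below; the dictionary and function names are hypothetical and used only for illustration.

```python
# Activation time and activated LRs of each FFT configuration (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def lr_activation_times():
    """Sum, for each LR, the activation times of the configurations using it."""
    t_on = {}
    for config, lrs in CONFIG_LRS.items():
        for lr in lrs:
            t_on[lr] = t_on.get(lr, 0.0) + CONFIG_T_ON[config]
    # Round to two decimals to hide floating-point accumulation noise.
    return {lr: round(t, 2) for lr, t in sorted(t_on.items())}


print(lr_activation_times())
```

The result reproduces the first row of Table 6, for example 0.67 for LR2 and 0.96 for LR8.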

Figure 8: FFT use case: area percentage per LR (90 nm). [Pie chart of the LR areas with respect to the overall system area: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. The sequential and combinatorial composition of each LR is also reported.]

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and the logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration. [Bar chart of the estimated (Est) and real (Real) power variation (%) after power gating integration, for LR1 to LR8.]

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

       PG overhead [%]       CG overhead [%]
LR     Static    Internal    Static    Internal
LR1    −40.87    −9.47       0.00      −11.33
LR2    +0.23     −1.32       0.00      −2.85
LR3    −10.44    −2.11       0.00      −2.59
LR4    −0.08     0.10        --        --
LR5    0.08      0.13        --        --
LR6    0.11      0.17        --        --
LR7    −3.44     −0.39       0.00      −0.79
LR8    1.42      2.35        0.00      −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

       PG overhead [%]            Overhead error [%]       Net [%]
LR     Est       Real      Err    Iso      Rtn    Contr    Err
LR1    −25.22    −25.37    0.59   5.18     4.64   0.75     1.03
LR3    −6.28     −6.22     1.07   16.34    2.97   0.76     1.26
LR7    −1.92     −1.92     0.24   9.24     1.08   0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

       CG overhead [%]          Overhead [%]    Net [%]
LR     Est      Real     Err    CGcell Err      Err
LR1    −5.73    −5.77    0.06   0.83            0.18
LR2    −1.42    −1.42    0.12   1.03            0.40
LR3    −1.34    −1.38    0.05   0.84            0.21
LR7    −0.39    −0.39    0.05   0.96            0.22
LR8    −0.12    −0.12    0.29   1.08            0.47

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart of the estimated (Est) and real (Real) power variation (%) after clock gating integration, for LR1, LR2, LR3, LR7, and LR8.]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: Threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: Threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as also occurred for DAT_5, they both achieve power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Then the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
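The two classifications can be reproduced from the published numbers. The sketch below is an illustrative re-implementation of the decision rule, not the tool's code: area percentages are those of Figure 8, estimated overheads are those of Tables 8 and 9, and the names `assemble`, `AREA`, `PG_EST`, and `CG_EST` are hypothetical. Regions without a clock gating estimate are the fully combinatorial ones.

```python
# Area of each LR as a percentage of the whole system (Figure 8).
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated total overheads in percent (Tables 8 and 9); negative = saving.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def assemble(area_th):
    """Return (PG_set, CG_set, always_on) for a given area threshold in percent."""
    pg_set, cg_set, always_on = [], [], []
    for lr, area in AREA.items():
        # Power gating is evaluated only for regions above the threshold.
        pg = PG_EST.get(lr) if area > area_th else None
        # No clock gating estimate exists for fully combinatorial regions.
        cg = CG_EST.get(lr)
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            pg_set.append(lr)
        elif cg is not None and cg < 0:
            cg_set.append(lr)
        else:
            always_on.append(lr)
    return pg_set, cg_set, always_on


print(assemble(5.0))
print(assemble(10.0))
```

Raising the threshold from 5% to 10% moves only LR7 from the power gated set to the clock gated one, as described in the two cases above.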

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs. Legend shows in brackets the power management area overhead for each design with respect to the Base one. [Histogram of static, internal, and total power (µW) for Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), and DAT_10 (+1.97%).]

(v) DAT_10: derived with the proposed automated flow capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10) are visible. Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, shows that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, in Figure 12 the trade-off levels between power and latency (and, in turn, throughput) are illustrated for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Plot of the power (µW, axis from 1000 to 5000) of Base, CG_full, PG_full, DAT_5, and DAT_10 for the configurations 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1, respectively. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. Errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net Err) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON   Functionality                       LRs
abs          1        0.03   Absolute value calculation          (4)
chgb         7        0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding             (11)
sbwlabel     17       0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   > Th                                         PG_set                              CG_set                                                    NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13    LR1, LR2, LR3, LR8, LR10, LR12      LR4, LR6, LR7, LR9, LR11, LR13                            LR5
DAT_10   LR1, LR8                                     LR1, LR8                            LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13      LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the FFT CGR designs assessment. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As the reported histograms show, on average there are approximately two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

        PG saving [%]                Net [%]   CG saving [%]                Net [%]
LR      Est        Real       Err    Err       Est        Real       Err    Err
LR1     −6.322     −6.331     0.15   0.18      −6.307     −6.313     0.09   0.17
LR2     −23.464    −23.463    0.00   1.26      −23.369    −23.373    0.02   1.67
LR3     −6.537     −6.544     0.10   1.65      −6.507     −6.514     0.10   1.59
LR4     --         --         --     --        −0.958     −0.964     0.68   1.34
LR5     --         --         --     --        --         --         --     --
LR6     --         --         --     --        −0.870     −0.878     0.81   1.22
LR7     --         --         --     --        −1.033     −1.039     0.63   0.15
LR8     −7.062     −7.058     0.05   2.02      −6.987     −6.986     0.01   1.83
LR9     --         --         --     --        −2.436     −2.443     0.27   1.55
LR10    −5.498     −5.503     0.10   1.65      −5.474     −5.480     0.11   1.57
LR11    --         --         --     --        −3.329     −3.285     1.32   3.67
LR12    −4.278     −4.288     0.23   1.83      −4.267     −4.273     0.15   1.86
LR13    --         --         --     --        −0.649     −0.656     1.01   1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. Legend shows in brackets the power management area overhead for each design with respect to the Base one. [Histogram of static (lkg), internal (int), and total (tot) power (µW) for Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), and DAT_10 (+3.19%).]

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power consumption. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher saving results.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead: +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead: +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, respectively considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs:

(i) Base: the same as in the 90 nm synthesis trial;

(ii) PG_full: the same as in the 90 nm synthesis trial;

(iii) CG_full: the same as in the 90 nm synthesis trial;

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the area threshold to 1% in Algorithm 1;

(v) DAT_5: the same as in the 90 nm synthesis trial;

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs (bar chart of the leakage, internal, and total power of each design). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add in this second trial an additional design, DAT_1. Setting the area threshold to 1%, nearly all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \ast \mathrm{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathrm{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model, and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | >Th | PG set | CG set | NA
DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5
DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR     PG saving               Net     CG saving               Net
       Est      Real     Err   Err     Est      Real     Err   Err
LR1    −5.686   −5.699   0.23  0.47    −5.507   −5.501   0.13  0.53
LR2    −22.990  −22.982  0.03  1.02    −22.063  −22.029  0.15  0.81
LR3    −6.569   −6.566   0.04  1.13    −6.273   −6.262   0.18  0.91
LR4    −0.946   −0.944   0.15  1.82    −0.919   −0.917   0.22  1.52
LR5    -        -        -     -       -        -        -     -
LR6    −0.747   −0.748   0.25  0.25    −0.853   −0.833   0.26  1.59
LR7    −0.955   −0.957   0.17  1.36    −0.935   −0.933   0.21  1.49
LR8    −7.464   −7.460   0.06  1.36    −6.724   −6.714   0.16  0.66
LR9    −2.475   −2.482   0.25  1.10    −2.350   −2.345   0.18  1.09
LR10   −5.473   −5.470   0.04  1.09    −5.270   −5.261   0.17  0.91
LR11   −3.422   −3.361   1.79  2.58    −3.257   −3.205   1.63  2.04
LR12   −4.327   −4.330   0.08  1.19    −4.176   −4.169   0.17  0.98
LR13   −0.587   −0.580   1.25  3.70    −0.623   −0.621   0.28  1.85

phase of the proposed model is a real use case, but it is a small size design, where the execution of the fastest kernels lasts 24 clock cycles. Then, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
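The two percentages quoted above follow from dividing the 4-cycle wake-up time by the kernel duration. A minimal sketch of this arithmetic is given below; the function name and its default argument are illustrative assumptions, not part of the MDC tool.

```python
# Wake-up overhead as a fraction of the kernel execution time.
# The 4-cycle figure is the switch on/off protocol latency quoted in the text;
# the helper name is hypothetical.
WAKE_UP_CYCLES = 4


def wake_up_overhead(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Return the wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


zoom = wake_up_overhead(24)    # fastest zoom kernel: 24 cycles
hevc = wake_up_overhead(160)   # 8 x 8 HEVC interpolation block: 160 cycles
print(f"zoom: {zoom:.1f}%  HEVC: {hevc:.1f}%")  # zoom: 16.7%  HEVC: 2.5%
```

The longer the kernel runs between two power domain transitions, the smaller the relative weight of the wake-up latency becomes.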

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR     PG overhead
       Est      Real     Err
LR1    −30.54   −26.10   17.07
LR2    −0.19    −1.99    90.22
LR3    −21.83   −6.95    214.30
LR4    −0.12    −0.67    80.97
LR5    −0.06    −0.59    90.11
LR6    −0.07    0.56     86.60
LR7    −15.39   −2.64    482.93
LR8    −0.38    −0.02    1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design, without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only steps (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases: for the FFT (Section 4.1), there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
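The counts above can be reproduced with a short sketch of the effort formula, in which fully combinatorial regions are excluded from the clock gating evaluation. The function name and its signature are illustrative assumptions.

```python
def flow_effort(n_functions: int, n_regions: int, n_combinatorial: int = 0,
                use_flow: bool = True) -> tuple[int, int]:
    """Return (syntheses, hardware simulations) needed to pick a strategy per LR.

    With the proposed flow only the baseline design is synthesized and simulated
    once per functionality. Without it, one power gated design per LR, one clock
    gated design per non-combinatorial LR, and the baseline are all synthesized,
    and each of them is simulated once per functionality.
    """
    if use_flow:
        return 1, n_functions
    syntheses = n_regions + (n_regions - n_combinatorial) + 1
    return syntheses, n_functions * syntheses


print(flow_effort(4, 8, 3, use_flow=False))   # FFT without the flow: (14, 56)
print(flow_effort(7, 13, 1, use_flow=False))  # zoom without the flow: (26, 182)
print(flow_effort(7, 13, 1))                  # zoom with the flow: (1, 7)
```

With no fully combinatorial regions, the exhaustive branch reduces to the general $2k + 1$ syntheses and $N(2k + 1)$ simulations stated in the text.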

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for the automated solution of mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC was previously applying a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her Ph.D. scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.



3.2.2. Power Gating Dynamic Power Consumption Model. Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Frequently, commercial tools (e.g., Cadence Encounter Digital Implementation System) consider dynamic power as composed of two main terms, as shown in (2): a net contribution, due to the power dissipated throughout the wires linking the cells, and an internal contribution, due to the dissipation occurring inside the cells [44]:

$$
P_{\mathrm{dyn}} = P_{\mathrm{net}} + P_{\mathrm{int}} = \frac{1}{2} f V_{DD}^{2} \sum_{\mathrm{net}_j} C_{\mathrm{load}_j} \mathrm{SW}_j + f \sum_{\mathrm{cell}_i} P_i \, \mathrm{SW}_i \tag{2}
$$

The operating frequency $f$ influences both terms. $P_{\mathrm{net}}$ accounts for the load capacitance of each $\mathrm{net}_j$ (bearing a specific capacitance $C_{\mathrm{load}_j}$) and the related switching activity ($\mathrm{SW}_j$), whereas $P_{\mathrm{int}}$ depends on the power per MHz dissipated by each cell ($P_i$) and the related switching activity ($\mathrm{SW}_i$).
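As a minimal sketch of how the two terms of (2) combine, the snippet below evaluates them from per-net and per-cell data. The class and function names are illustrative assumptions, as are the numeric values; the caller is assumed to provide consistent units (here $f$ in Hz and $P_i$ as power per Hz, whereas the text quotes $P_i$ per MHz).

```python
from dataclasses import dataclass


@dataclass
class Net:
    c_load: float  # load capacitance of the net (F)
    sw: float      # switching activity of the net


@dataclass
class Cell:
    p: float   # internal power per unit of frequency (assumed W/Hz here)
    sw: float  # switching activity of the cell


def dynamic_power(f: float, vdd: float, nets: list[Net], cells: list[Cell]) -> tuple[float, float]:
    """Return (P_net, P_int) following the two sums of (2)."""
    p_net = 0.5 * f * vdd ** 2 * sum(n.c_load * n.sw for n in nets)
    p_int = f * sum(c.p * c.sw for c in cells)
    return p_net, p_int


# Hypothetical values, chosen only to exercise the formula.
nets = [Net(c_load=1e-15, sw=0.2), Net(c_load=2e-15, sw=0.1)]
cells = [Cell(p=1e-15, sw=0.5)]
p_net, p_int = dynamic_power(f=1e8, vdd=1.0, nets=nets, cells=cells)
print(p_net, p_int, p_net + p_int)
```

Both terms scale linearly with $f$ and with the switching activity, which is why the switching activity has to come from simulation rather than from default tool assumptions.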

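Currently, the developed model is able to estimate only the $P_{\mathrm{int}}$ contribution, which can be expressed for each single LR as follows:

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}_{ON}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{RC}) \ast \mathrm{rtn} + P_{\mathrm{int}}(\mathrm{reg}) \ast \frac{\mathrm{reg} - \mathrm{rtn}}{\mathrm{reg}} \right] \ast Ti_{ON} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{ISO}_{ON}) \ast Ti_{ON} + P_{\mathrm{int}}(\mathrm{ISO}_{OFF}) \ast Ti_{OFF} \right] \ast \mathrm{iso} \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Contr}_{ON}) \ast Ti_{ON} + P_{\mathrm{int}}(\mathrm{Contr}_{OFF}) \ast Ti_{OFF} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{ON}) \ast Ti_{ON} + P_{\mathrm{int}}(\mathrm{CG}_{OFF}) \ast Ti_{OFF} \right]
\end{aligned}
\tag{3}
$$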

Considering a prospective power gating implementation, the different parts of (3) basically reflect those of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of switching activity of the nodes. For this reason, the netlist of the baseline CGR system is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.
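Equation (3) can be read as an active-time weighted sum of the region logic plus the overhead of isolation cells, power controller, and clock gating cell. The sketch below illustrates this reading under explicit assumptions: `rtn` and `reg` are taken as per-actor counts of retained and total registers, `iso` as the number of isolation cells, and $Ti_{OFF} = 1 - Ti_{ON}$; all names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    p_cmb: float  # internal power of the combinatorial cells
    p_reg: float  # internal power of all the registers of the actor
    n_reg: int    # number of registers ("reg" in (3), assumed to be a count)
    n_rtn: int    # number of retained registers ("rtn" in (3), assumed to be a count)


@dataclass
class Overhead:
    on: float   # internal power while the region is on
    off: float  # internal power while the region is off

    def average(self, t_on: float) -> float:
        # Assumption: Ti_ON and Ti_OFF are time fractions that sum to 1.
        return self.on * t_on + self.off * (1.0 - t_on)


def pg_internal_power(actors: list[Actor], t_on: float, p_rc: float,
                      iso: Overhead, n_iso: int,
                      contr: Overhead, cg: Overhead) -> float:
    """Internal power of a power gated LR, term by term as in (3)."""
    logic = 0.0
    for a in actors:
        # Registers without retention keep only their share of the register power.
        plain = a.p_reg * (a.n_reg - a.n_rtn) / a.n_reg if a.n_reg else 0.0
        logic += a.p_cmb + p_rc * a.n_rtn + plain
    return (logic * t_on
            + iso.average(t_on) * n_iso
            + contr.average(t_on)
            + cg.average(t_on))


# Hypothetical numbers, only to show how the terms combine.
total = pg_internal_power(
    actors=[Actor(p_cmb=10.0, p_reg=8.0, n_reg=4, n_rtn=1)],
    t_on=0.25, p_rc=3.0,
    iso=Overhead(0.2, 0.1), n_iso=2,
    contr=Overhead(1.0, 0.5), cg=Overhead(0.4, 0.0),
)
print(total)
```

In an actual estimation, the per-actor power values would come from the power reports of the baseline netlist annotated with the simulated switching activity.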

The $P_{\mathrm{net}}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the optimal identification of the region to be gated. We have already planned to extend the proposed methodology to include the net contribution in the near future.

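3.2.3. Clock Gating Static and Dynamic Power Consumption Models. Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and it positively acts only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static power estimation and for the dynamic power one, referring to a clock gated design:

$$
\begin{aligned}
P_{\mathrm{lkg}}(\mathrm{LR}_i) &= P_{\mathrm{lkg}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{lkg}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{lkg}}(\mathrm{cmb}) + P_{\mathrm{lkg}}(\mathrm{reg}) \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{Enab}_{ON}) \ast Ti_{ON} + P_{\mathrm{lkg}}(\mathrm{Enab}_{OFF}) \ast Ti_{OFF} \right] \\
&\quad + \left[ P_{\mathrm{lkg}}(\mathrm{CG}_{ON}) \ast Ti_{ON} + P_{\mathrm{lkg}}(\mathrm{CG}_{OFF}) \ast Ti_{OFF} \right]
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
P_{\mathrm{int}}(\mathrm{LR}_i) &= P_{\mathrm{int}}(\mathrm{LR}_i) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= P_{\mathrm{int}}(\mathrm{comb}_{\mathrm{LR}_i}) + P_{\mathrm{int}_{ON}}(\mathrm{seq}_{\mathrm{LR}_i}) + \mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i) \\
&= \sum_{\mathrm{actors} \in \mathrm{LR}_i} \left[ P_{\mathrm{int}}(\mathrm{cmb}) + P_{\mathrm{int}}(\mathrm{reg}) \ast Ti_{ON} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{Enab}_{ON}) \ast Ti_{ON} + P_{\mathrm{int}}(\mathrm{Enab}_{OFF}) \ast Ti_{OFF} \right] \\
&\quad + \left[ P_{\mathrm{int}}(\mathrm{CG}_{ON}) \ast Ti_{ON} + P_{\mathrm{int}}(\mathrm{CG}_{OFF}) \ast Ti_{OFF} \right]
\end{aligned}
\tag{5}
$$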
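A minimal sketch of (4) and (5) is given below. The only structural difference between the two is that the register term is weighted by $Ti_{ON}$ in the dynamic model. The function names, the `(on, off)` tuples, the assumption $Ti_{OFF} = 1 - Ti_{ON}$, and the numeric values are illustrative, not taken from the MDC tool.

```python
def _time_average(on_off: tuple[float, float], t_on: float) -> float:
    """Weight an (on, off) power pair by the active and idle time fractions."""
    on, off = on_off
    return on * t_on + off * (1.0 - t_on)  # assumption: Ti_OFF = 1 - Ti_ON


def cg_static_power(actors, t_on, enab, cg):
    """Static power of a clock gated LR as in (4).

    actors: list of (P_lkg(cmb), P_lkg(reg)) pairs; leakage is never gated.
    enab, cg: (on, off) leakage of the Enable Generator and clock gating cell.
    """
    logic = sum(p_cmb + p_reg for p_cmb, p_reg in actors)
    return logic + _time_average(enab, t_on) + _time_average(cg, t_on)


def cg_internal_power(actors, t_on, enab, cg):
    """Internal power of a clock gated LR as in (5).

    actors: list of (P_int(cmb), P_int(reg)) pairs; only the register
    contribution is limited to the active time of the region.
    """
    logic = sum(p_cmb + p_reg * t_on for p_cmb, p_reg in actors)
    return logic + _time_average(enab, t_on) + _time_average(cg, t_on)


# Hypothetical values: one actor, region active 25% of the time.
actors = [(10.0, 8.0)]
print(cg_static_power(actors, 0.25, enab=(0.4, 0.2), cg=(0.4, 0.0)))
print(cg_internal_power(actors, 0.25, enab=(0.4, 0.2), cg=(0.4, 0.0)))
```

When the region is always active the two functions coincide for the same inputs, which matches the fact that clock gating brings no benefit to a region that never idles.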

At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead $\mathrm{Ext\_Over}_{\mathrm{int}}(\mathrm{LR}_i)$ is given by the clock gating cell and the Enable Generator. Please remember that, implementing clock gating management at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is pretty much the same as (4), apart from the fact that, dealing with the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.

3.2.4. Parameters Discussion. The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following:

(i) Architectural Parameters. LRs composition determines the amount of involved combinatorial and sequential cells.

(ii) Functional Parameters. LRs behaviour defines the region activation time and whether its status has to be preserved or not.

(iii) Technological Parameters. Target technology has an impact on the ratio between dynamic and static power (as it will be demonstrated in Section 4.2) and on the different cells characterization.


Table 1 reports, for each parameter considered in (1), (3), (4), and (5), its classification. A deeper explanation about $P_{lkg/int}(cmb)$ and $P_{lkg/int}(reg)$ is necessary. They are not associated with any specific parameter class: indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, here follows the detailed characterization of the equations for LR5, which includes actors $F$ and $G$.

When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{lkg}(LR_5) ={}& \Big[ \Big( P_{lkg}(comb_F) + P_{lkg}(RC) \cdot rtn_F + P_{lkg}(reg_F) \cdot \frac{reg_F - rtn_F}{reg_F} \Big) \\
&+ \Big( P_{lkg}(comb_G) + P_{lkg}(RC) \cdot rtn_G + P_{lkg}(reg_G) \cdot \frac{reg_G - rtn_G}{reg_G} \Big) \Big] \cdot T_{5,ON} \\
&+ \left[ P_{lkg}(ISO_{ON}) \cdot T_{5,ON} + P_{lkg}(ISO_{OFF}) \cdot T_{5,OFF} \right] \cdot iso_5 \\
&+ \left[ P_{lkg}(Contr_{ON}) \cdot T_{5,ON} + P_{lkg}(Contr_{OFF}) \cdot T_{5,OFF} \right] \\
&+ \left[ P_{lkg}(CG_{ON}) \cdot T_{5,ON} + P_{lkg}(CG_{OFF}) \cdot T_{5,OFF} \right] \\
={}& \left[ (213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5) \right] \cdot 0.3 \\
&+ \left[ 4.27 \cdot 0.3 + 1.39 \cdot 0.7 \right] \cdot 32 + \left[ 95.44 \cdot 0.3 + 88.63 \cdot 0.7 \right] \\
&+ \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 1509.219
\end{aligned}
\tag{6}
$$

The internal power consumption is given by (3):

$$
\begin{aligned}
P_{int}(LR_5) ={}& \Big[ \Big( P_{int}(comb_F) + P_{int}(RC) \cdot rtn_F + P_{int}(reg_F) \cdot \frac{reg_F - rtn_F}{reg_F} \Big) \\
&+ \Big( P_{int}(comb_G) + P_{int}(RC) \cdot rtn_G + P_{int}(reg_G) \cdot \frac{reg_G - rtn_G}{reg_G} \Big) \Big] \cdot T_{5,ON} \\
&+ \left[ P_{int}(ISO_{ON}) \cdot T_{5,ON} + P_{int}(ISO_{OFF}) \cdot T_{5,OFF} \right] \cdot iso_5 \\
&+ \left[ P_{int}(Contr_{ON}) \cdot T_{5,ON} + P_{int}(Contr_{OFF}) \cdot T_{5,OFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \cdot T_{5,ON} + P_{int}(CG_{OFF}) \cdot T_{5,OFF} \right] \\
={}& \left[ (537 + 383.25 \cdot 128) + (363 + 383.25 \cdot 64 + 44068 \cdot 0.5) \right] \cdot 0.3 \\
&+ \left[ 2.7 \cdot 0.3 + 0 \cdot 0.7 \right] \cdot 32 + \left[ 1449 \cdot 0.3 + 1488 \cdot 0.7 \right] \\
&+ \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] = 30712.72
\end{aligned}
\tag{7}
$$
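The arithmetic of (6) and (7) can be replayed with a few lines of Python. This is only a minimal sketch: the `Actor` class and the `pg_power` helper are hypothetical names introduced here, the formula mirrors the structure of (6) and (7) as written above, and the decimal placement of the Table 2 and Table 3 values is an assumption consistent with the printed results. Actors are assumed to have at least one register.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    p_comb: float  # power of the combinatorial cells of the actor [nW]
    p_reg: float   # power of all the registers of the actor [nW]
    reg: int       # number of registers (assumed > 0)
    rtn: int       # number of registers replaced by retention cells


def pg_power(actors, p_rc, p_iso, p_contr, p_cg, n_iso, t_on):
    """Power of a power-gated LR; p_iso, p_contr, p_cg are (ON, OFF) pairs [nW]."""
    t_off = 1.0 - t_on
    # Gated logic: combinatorial cells, retention cells, non-retained registers.
    logic = sum(a.p_comb + p_rc * a.rtn + a.p_reg * (a.reg - a.rtn) / a.reg
                for a in actors)
    # Always-on overhead: isolation cells, power controller, clock gating cell.
    overhead = ((p_iso[0] * t_on + p_iso[1] * t_off) * n_iso
                + p_contr[0] * t_on + p_contr[1] * t_off
                + p_cg[0] * t_on + p_cg[1] * t_off)
    return logic * t_on + overhead


# LR5 = actors F and G, T_ON = 0.3, 32 isolation cells (values from Tables 2 and 3).
static_lr5 = pg_power([Actor(213, 1232, 128, 128), Actor(273, 1385, 128, 64)],
                      p_rc=17.15, p_iso=(4.27, 1.39), p_contr=(95.44, 88.63),
                      p_cg=(5.77, 4.71), n_iso=32, t_on=0.3)
internal_lr5 = pg_power([Actor(537, 22489, 128, 128), Actor(363, 44068, 128, 64)],
                        p_rc=383.25, p_iso=(2.7, 0.0), p_contr=(1449, 1488),
                        p_cg=(169, 292), n_iso=32, t_on=0.3)
print(round(static_lr5, 3), round(internal_lr5, 2))
```

With these inputs the sketch yields about 1509.22 nW for the static term and 30712.72 nW for the internal one, the latter matching the internal power gating value listed for LR5 in Table 4.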

When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{lkg}(LR_5) ={}& \left[ \left( P_{lkg}(comb_F) + P_{lkg}(reg_F) \right) + \left( P_{lkg}(comb_G) + P_{lkg}(reg_G) \right) \right] \\
&+ \left[ P_{lkg}(Enab_{ON}) \cdot T_{5,ON} + P_{lkg}(Enab_{OFF}) \cdot T_{5,OFF} \right] \\
&+ \left[ P_{lkg}(CG_{ON}) \cdot T_{5,ON} + P_{lkg}(CG_{OFF}) \cdot T_{5,OFF} \right] \\
={}& \left[ (213 + 1232) + (273 + 1385) \right] + \left[ 84.51 \cdot 0.3 + 76.54 \cdot 0.7 \right] \\
&+ \left[ 5.77 \cdot 0.3 + 4.71 \cdot 0.7 \right] = 3186.96, \\
P_{int}(LR_5) ={}& \left[ \left( P_{int}(comb_F) + P_{int}(reg_F) \cdot T_{5,ON} \right) + \left( P_{int}(comb_G) + P_{int}(reg_G) \cdot T_{5,ON} \right) \right] \\
&+ \left[ P_{int}(Enab_{ON}) \cdot T_{5,ON} + P_{int}(Enab_{OFF}) \cdot T_{5,OFF} \right] \\
&+ \left[ P_{int}(CG_{ON}) \cdot T_{5,ON} + P_{int}(CG_{OFF}) \cdot T_{5,OFF} \right] \\
={}& \left[ (537 + 22489 \cdot 0.3) + (363 + 44068 \cdot 0.3) \right] + \left[ 1351 \cdot 0.3 + 1320 \cdot 0.7 \right] \\
&+ \left[ 169 \cdot 0.3 + 292 \cdot 0.7 \right] = 22451.8
\end{aligned}
\tag{8}
$$
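The clock gating models (4) and (5) differ only in how the sequential contribution is weighted, which the following minimal Python sketch makes explicit. The `cg_power` helper and its `dynamic` flag are hypothetical names introduced for illustration, and the decimal placement of the Table 2 values is an assumption consistent with the printed results.

```python
def cg_power(actors, p_enab, p_cg, t_on, dynamic):
    """Power of a clock-gated LR [nW].

    actors: list of (p_comb, p_reg) pairs; p_enab, p_cg: (ON, OFF) pairs.
    dynamic=False follows (4): registers leak all the time.
    dynamic=True follows (5): registers switch only while the LR is on.
    """
    t_off = 1.0 - t_on
    seq_weight = t_on if dynamic else 1.0
    logic = sum(p_comb + p_reg * seq_weight for p_comb, p_reg in actors)
    return (logic
            + p_enab[0] * t_on + p_enab[1] * t_off   # Enable Generator
            + p_cg[0] * t_on + p_cg[1] * t_off)      # single clock gating cell


# LR5 = actors F and G, T_ON = 0.3 (values from Tables 2 and 3).
static_lr5 = cg_power([(213, 1232), (273, 1385)],
                      p_enab=(84.51, 76.54), p_cg=(5.77, 4.71),
                      t_on=0.3, dynamic=False)
internal_lr5 = cg_power([(537, 22489), (363, 44068)],
                        p_enab=(1351, 1320), p_cg=(169, 292),
                        t_on=0.3, dynamic=True)
print(round(static_lr5, 2), round(internal_lr5, 1))
```

The static term evaluates to about 3186.96 nW. The internal term evaluates to about 22451.5 nW with the tabulated inputs, slightly below the 22451.8 nW printed in (8); the gap is presumably due to rounding of the values reported in the tables.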

Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management


Table 1: Parameter classification. Table reports, for each considered parameter, typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch | Funct | Tech | Description | Extraction |
|---|---|---|---|---|---|
| reg | x | | | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | | | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs among each other in the dataflow model |
| $T_{i,ON}$ | | x | | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $T_{i,OFF}$ | | x | | Off time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| rtn | | x | | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{lkg/int}(RC)$, $P_{lkg/int}(ISO_{ON})$, $P_{lkg/int}(ISO_{OFF})$ | | | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{lkg/int}(Contr_{ON})$, $P_{lkg/int}(Contr_{OFF})$, $P_{lkg/int}(Enab_{ON})$, $P_{lkg/int}(Enab_{OFF})$ | | | x | Power estimation of PG and CG controller when the LR is on or off | Requires being characterized according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(cmb)$, $P_{lkg/int}(reg)$ | x | x | x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |


Table 2: Contributions of static and internal power consumption extracted from the reference technology library or characterized by synthesis trials.

| Parameters | lkg power [nW] | int power [nW] |
|---|---|---|
| $Enab_{ON}$ | 84.51 | 1351 |
| $Enab_{OFF}$ | 76.54 | 1320 |
| $Contr_{ON}$ | 95.44 | 1449 |
| $Contr_{OFF}$ | 88.63 | 1488 |
| $CG_{ON}$ | 5.77 | 169 |
| $CG_{OFF}$ | 4.71 | 292 |
| $ISO_{ON}$ | 4.27 | 2.7 |
| $ISO_{OFF}$ | 1.39 | 0 |
| $P(RC)$ | 17.15 | 383.25 |

    PG_set is empty
    CG_set is empty
    foreach LR_i in set LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG total overhead
        if PG total overhead < 0 then
            estimate CG total overhead
            if PG total overhead < CG total overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG total overhead
        if CG total overhead < 0 then
            add LR_i to CG_set
        end

Algorithm 1: Automatic power saving strategy selection for CGR systems.
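The selection logic of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription, not the MDC implementation: the function name `select_power_strategy` and the idea of passing the overhead estimators as callables are assumptions made here, and the sample areas and percentage overheads are those read from the step-by-step example of Section 3.2.7 (Figure 4).

```python
def select_power_strategy(regions, area_th, pg_overhead, cg_overhead):
    """Classify logic regions following the structure of Algorithm 1.

    regions: {LR name: area as a percentage of the total area}
    pg_overhead, cg_overhead: callables returning the estimated total
    overhead (in percent, negative means saving) for a given LR name.
    """
    pg_set, cg_set = [], []

    def evaluate_cg(lr):
        if cg_overhead(lr) < 0:
            cg_set.append(lr)

    def evaluate_pg(lr):
        pg = pg_overhead(lr)
        if pg < 0:
            # Both strategies are compared; the larger saving wins.
            if pg < cg_overhead(lr):
                pg_set.append(lr)
            else:
                cg_set.append(lr)
        else:
            evaluate_cg(lr)

    for lr, area in regions.items():  # evaluate_area
        if area > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)

    always_on = [lr for lr in regions if lr not in pg_set and lr not in cg_set]
    return pg_set, cg_set, always_on


AREA = {"LR1": 52.0, "LR3": 0.4, "LR4": 7.0, "LR5": 15.0}
PG_OVER = {"LR1": -86.45, "LR3": 0.02, "LR4": -1.2, "LR5": -0.89}
CG_OVER = {"LR1": -2.15, "LR3": 0.01, "LR4": -2.0, "LR5": -1.04}

pg_set, cg_set, always_on = select_power_strategy(
    AREA, 5.0, PG_OVER.get, CG_OVER.get)
print(pg_set, cg_set, always_on)
```

With a 5% area threshold, LR1 ends up in the power gated set, LR4 and LR5 in the clock gated set, and LR3 in the always-on domain, as in the worked example.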

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-fit-to-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so that power gating is not to be considered for implementation. Clock gating, instead, may still be beneficial, due to its very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CGPG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models and by linking MDC with a logic synthesis tool, the presented flow is capable of overcoming the limit of providing a one-fit-to-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.


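Table 3: Parameter and power consumption of each LR extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{ON}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | $\alpha$ | 0.1 | B | 32 | 514 | 24 |
| LR3 | $\beta$ | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | $\alpha$, $\gamma$ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | $\gamma$ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |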

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).

    (a) Its area is calculated: area_LR1 = 52% of total area.

    (b) area_LR1 > area_th, so that a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).

        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from the power consumption of the LR when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.

        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).

    (a) Its area is calculated: area_LR3 = 0.4% of total area.

    (b) area_LR3 < area_th, so that a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).

        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.

        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).

    (a) Its area is calculated: area_LR4 = 7% of total area.

    (b) area_LR4 > area_th, so that a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).


Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.

        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.

        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).

    (a) Its area is calculated: area_LR5 = 15% of total area.

    (b) area_LR5 > area_th, so that a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).

        (1) The static and the dynamic overheads are estimated, respectively, applying (1) and (3). The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.

        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to CG_set.
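The percentage overheads quoted in the steps above follow from a single normalization, which can be sketched as below. The helper name `overhead_percent` is hypothetical; the power values (in nW) are those reported in Figure 4 for the gated and baseline versions of each LR and for the whole baseline design.

```python
def overhead_percent(p_gated_lr, p_base_lr, p_base_total):
    """Power variation of one LR, normalized to the whole baseline design.

    A negative result means that the gated LR consumes less than the
    baseline one, that is, the strategy yields a saving.
    """
    return 100.0 * (p_gated_lr - p_base_lr) / p_base_total


P_BASE_TOTAL = 4311417  # baseline design consumption [nW]

pg_lr1 = overhead_percent(416765, 4143798, P_BASE_TOTAL)
cg_lr1 = overhead_percent(4050955, 4143798, P_BASE_TOTAL)
pg_lr3 = overhead_percent(4227, 3266, P_BASE_TOTAL)
cg_lr5 = overhead_percent(25639, 70560, P_BASE_TOTAL)

print(round(pg_lr1, 2), round(cg_lr1, 2), round(pg_lr3, 2), round(cg_lr5, 2))
```

Rounded to two decimals, these four values agree with the percentages used in the example for LR1 (power and clock gating), LR3 (power gating), and LR5 (clock gating).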

The resulting hardware design with the hybrid application of clock gating and power gating is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1, included in region LR4, are purely combinatorial, so they are not affected by the application


Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. Threshold: Area_th = 5%. Base design consumption: 4311417 nW.

| LR | Area [%] | Base [nW] | PG [nW] (over [%]) | CG [nW] (over [%]) |
|---|---|---|---|---|
| LR1 | 52 | 4143798 | 416765 (−86.45) | 4050955 (−2.15) |
| LR3 | 0.4 | 3266 | 4227 (+0.02) | 3894 (+0.01) |
| LR4 | 7 | 93793 | 40767 (−1.2) | 9479 (−2.0) |
| LR5 | 15 | 70560 | 32222 (−0.89) | 25639 (−1.04) |


Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario. An image coprocessing unit accelerating a zoom application has been implemented both on a 90 nm and on a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experimental results conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solving of differential equations to digital signal processing. We refer to the original DFT equation:

$$
X_k = \sum_{n=0}^{N-1} x_n e^{-i 2 \pi k (n/N)}, \quad k = 0, 1, \ldots, N-1.
\tag{9}
$$

The FFT algorithm that has been adopted for this use case has been proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a given size $N$ DFT by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix $r = 2$, the DFTs take the name of butterflies, by their block scheme. The equations describing a butterfly are

$$
X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k,
\tag{10}
$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$
w_n^k = e^{-i 2 \pi (k/n)},
\tag{11}
$$


Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.
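As an illustration of how the butterflies of (10) compose into a complete transform, the following Python sketch implements a recursive radix-2 decimation-in-time FFT and compares it with a direct evaluation of the DFT sum. It is a software sketch only, unrelated to the hardware actors of the use case; the function names are hypothetical, the standard twiddle factor $e^{-i 2 \pi k / n}$ is assumed, and the input length is assumed to be a power of two.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i * 2 * pi * k / n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + x1 * w, x0 - x1 * w)."""
    return x0 + x1 * w, x0 - x1 * w


def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


def dft(x):
    """Direct evaluation of the DFT sum, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


samples = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft_radix2(samples)
reference = dft(samples)
max_error = max(abs(a - b) for a, b in zip(fast, reference))
print(max_error < 1e-9)
```

For a size 8 input, the recursion performs 4 butterflies at the top level, 2 in each of the two size 4 halves, and 1 in each of the four size 2 quarters, that is, 12 butterflies overall.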

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12 butterflies design then requires three clock periods for the outputs elaboration.

From the baseline 12 butterflies design, several variants have been derived by decreasing the number of involved butterflies. In such a way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size 8 FFT configurations are considered:

(i) 12b is the baseline 12 butterflies FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (Base design; x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power in μW).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, fitting external environment requests. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be waived to consume less power.

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating the same region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}(1b) = 0.42$, $T_{ON}(2b) = 0.21$, and $T_{ON}(4b) = 0.04$.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso($e$) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso($r$) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |

Figure 8: FFT use case: area percentage per LR (90 nm). LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. For each LR, the figure also reports the split between sequential and combinatorial logic.

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design, it is possible to retrieve area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given in terms of percentage with respect to the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area. It is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so that power gating should be a very suitable strategy for this LR. From Figure 8, it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving or not. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.
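Among the functional parameters feeding the models, the activation time of each LR reported in Table 6 can be cross-checked against Table 5 by summing the activation times of the configurations that use it. The short Python sketch below does so; the dictionary layout and the function name `activation_times` are assumptions made for illustration.

```python
# FFT configuration -> (activation time, logic regions it activates), from Table 5.
CONFIGS = {
    "1b": (0.42, (2, 6, 8)),
    "2b": (0.21, (2, 5, 7, 8)),
    "4b": (0.04, (2, 3, 4, 7)),
    "12b": (0.33, (1, 3, 7, 8)),
}


def activation_times(configs):
    """Sum, for every LR, the activation times of the configurations using it."""
    t_on = {}
    for share, regions in configs.values():
        for lr in regions:
            t_on[lr] = t_on.get(lr, 0.0) + share
    # Round to two decimals to hide floating point noise.
    return {lr: round(t, 2) for lr, t in sorted(t_on.items())}


t_on = activation_times(CONFIGS)
print(t_on)
```

The resulting values (for instance 0.67 for LR2 and 0.96 for LR8) coincide with the $T_{ON}$ row of Table 6.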

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration (power variation in % for LR1 to LR8; series: Est, Real).

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due, respectively, to power gating and clock gating. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a little power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


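Table 7: FFT use case: detailed static and dynamic saving due to power (PG) and clock gating (CG).

| LR | PG overhead, Static [%] | PG overhead, Internal [%] | CG overhead, Static [%] | CG overhead, Internal [%] |
|---|---|---|---|---|
| LR1 | −40.87 | −9.47 | 0.00 | −11.33 |
| LR2 | +0.23 | −1.32 | 0.00 | −2.85 |
| LR3 | −10.44 | −2.11 | 0.00 | −2.59 |
| LR4 | −0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | −3.44 | −0.39 | 0.00 | −0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | −0.25 |

Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est [%] | PG overhead, Real [%] | PG overhead, Err [%] | Overhead error, Iso [%] | Overhead error, Rtn [%] | Overhead error, Contr [%] | Net Err [%] |
|---|---|---|---|---|---|---|---|
| LR1 | −25.22 | −25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | −6.28 | −6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | −1.92 | −1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est [%] | CG overhead, Real [%] | CG overhead, Err [%] | Overhead, CGcell Err [%] | Net Err [%] |
|---|---|---|---|---|---|
| LR1 | −5.73 | −5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | −1.42 | −1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | −1.34 | −1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | −0.39 | −0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | −0.12 | −0.12 | 0.29 | 1.08 | 0.47 |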

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration (power variation in % for LR1, LR2, LR3, LR7, and LR8; series: Est, Real).

that clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so that the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than the one achievable with a prospective clock gating implementation; therefore they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as occurred also for DAT_5, they both achieve power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Then, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
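The two classifications above can be reproduced with a compact Python sketch of the selection rule. The areas are those of Figure 8, while the estimated overheads are the "Est" columns of Tables 8 and 9. Two assumptions are made here for illustration: regions without an available estimate (for instance the fully combinatorial LR4, LR5, and LR6 for clock gating) are treated as not beneficial for that strategy, and the function name `classify` is hypothetical.

```python
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}    # [% of total area]
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}           # [%], Table 8
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34,
          "LR7": -0.39, "LR8": -0.12}                          # [%], Table 9


def classify(area_th):
    """Return (PG_set, CG_set) for a given area threshold in percent."""
    pg_set, cg_set = [], []
    for lr, area in AREA.items():
        pg = PG_EST.get(lr)  # None when no estimate is available (assumption)
        cg = CG_EST.get(lr)
        if area > area_th and pg is not None and pg < 0:
            # Power gating saves power: keep it only if it beats clock gating.
            if cg is None or pg < cg:
                pg_set.append(lr)
            else:
                cg_set.append(lr)
        elif cg is not None and cg < 0:
            cg_set.append(lr)
    return pg_set, cg_set


dat_5 = classify(5.0)
dat_10 = classify(10.0)
print(dat_5)
print(dat_10)
```

Raising the threshold from 5% to 10% only moves LR7 from the power gated set to the clock gated one, which is the single difference between DAT_5 and DAT_10.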

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.


Figure 11: FFT use case: comparison between the base design and the four gated designs (static, internal, and total power, FFT 90 nm). Legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT 10 derived with the proposed automated flowcapable of hybrid power and clock gating supportsetting the AreaThreshold to 10 in Algorithm 1

These designs have been synthesized with Cadence RTLCompiler targeting the same 90 nm CMOS technologyadopted to synthesise and simulate the baseline CGR designwhose results have fed the Power Analysis block of theproposed enhanced power management flow to assembleDAT 5 and DAT 10

The power consumption (internal static and total) ofthese designs is depicted in Figure 11 The reported datacorrespond to the real power consumed by the synthesiseddesigns Since LR1 occupies more than the 60 of the designarea and it is mainly combinatorial little differences amongthe entirely power gated design (PG full) and the hybrid clockand power gating ones (DAT 5 and DAT 10) are visibleNevertheless DAT 5 achieves the largest power saving(minus4512) among all the designs validating the proposedhybrid and selective management with respect to a one-fit-to-all solution CG full capable of diminishing only thedynamic power consumption is the worst design amongthose applying power management

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than those of DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC) when gated designs are adopted. Plotted designs: Base, CG_full, PG_full, DAT_5, DAT_10.

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system when the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only to the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net Err) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: distinctive features of the computational kernels.

Kernel       Actors   T_ON   Functionality                      LRs
abs          1        0.03   Absolute value calculation         (4)
chgb         7        0.33   Bilevel/grayscale block checking   (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation     (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution           (5, 7, 8, 9)
median       9        0.06   Median calculation                 (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding            (11)
sbwlabel     17       0.42   Edge block checking                (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                    PG set                        CG set                                               NA
DAT_5    LR1 LR2 LR3 LR6 LR8 LR10 LR12 LR13     LR1 LR2 LR3 LR8 LR10 LR12     LR4 LR6 LR7 LR9 LR11 LR13                            LR5
DAT_10   LR1 LR8                                LR1 LR8                       LR2 LR3 LR4 LR6 LR7 LR9 LR10 LR11 LR12 LR13          LR5
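The columns of Table 11 suggest a two-stage selection: regions above the area threshold are candidates for power gating, and the remaining ones fall back to clock gating or to the always-on domain. Algorithm 1 is not reproduced in this part of the paper, so the following Python sketch is only one possible reading of that behaviour. The decision rule (power gate a candidate only when its estimated saving is at least as large as the clock gating one), the `LogicRegion` fields, and all the numeric values are assumptions made for illustration; they are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    area_pct: float        # share of the total design area, in percent (made-up values)
    pg_delta_pct: float    # estimated power variation with power gating (negative = saving)
    cg_delta_pct: float    # estimated power variation with clock gating (negative = saving)
    combinatorial: bool = False  # fully combinatorial regions gain nothing from clock gating


def classify(regions, area_threshold_pct):
    """Split regions into the >Th, PG set, CG set and NA columns (assumed rule)."""
    result = {"over_th": [], "pg_set": [], "cg_set": [], "na": []}
    for lr in regions:
        # Clock gating is treated as ineffective on fully combinatorial regions.
        cg_delta = 0.0 if lr.combinatorial else lr.cg_delta_pct
        over_th = lr.area_pct > area_threshold_pct
        if over_th:
            result["over_th"].append(lr.name)
        if over_th and lr.pg_delta_pct < 0 and lr.pg_delta_pct <= cg_delta:
            result["pg_set"].append(lr.name)   # power gating saves at least as much
        elif cg_delta < 0:
            result["cg_set"].append(lr.name)   # clock gating still pays off
        else:
            result["na"].append(lr.name)       # left in the always-on domain
    return result


# Hypothetical regions, not taken from the paper.
regions = [
    LogicRegion("A", 60.0, -20.0, -5.0),
    LogicRegion("B", 8.0, -0.7, -0.9),
    LogicRegion("C", 3.0, -1.0, -0.5),
    LogicRegion("D", 0.6, -0.1, 0.0, combinatorial=True),
]
print(classify(regions, 5.0))
print(classify(regions, 1.0))
```

With these made-up inputs, lowering the threshold from 5% to 1% moves region C from the clock gated set to the power gated set, which mirrors the way the PG set grows when the area threshold is reduced.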

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified in the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, in both the DAT_5 and the DAT_10 designs. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Power gating could still be applied, but the region is so small (0.65% of the whole system area) that it would not provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step (all values in %).

        PG saving                       Net      CG saving                       Net
LR      Est       Real      Err        Err      Est       Real      Err        Err
LR1     −6.322    −6.331    0.15       0.18     −6.307    −6.313    0.09       0.17
LR2     −23.464   −23.463   0.00       1.26     −23.369   −23.373   0.02       1.67
LR3     −6.537    −6.544    0.10       1.65     −6.507    −6.514    0.10       1.59
LR4     -         -         -          -        −0.958    −0.964    0.68       1.34
LR5     -         -         -          -        -         -         -          -
LR6     -         -         -          -        −0.870    −0.878    0.81       1.22
LR7     -         -         -          -        −1.033    −1.039    0.63       0.15
LR8     −7.062    −7.058    0.05       2.02     −6.987    −6.986    0.01       1.83
LR9     -         -         -          -        −2.436    −2.443    0.27       1.55
LR10    −5.498    −5.503    0.10       1.65     −5.474    −5.480    0.11       1.57
LR11    -         -         -          -        −3.329    −3.285    1.32       3.67
LR12    −4.278    −4.288    0.23       1.83     −4.267    −4.273    0.15       1.86
LR13    -         -         -          -        −0.649    −0.656    1.01       1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario at 90 nm. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages of the considered domains are evaluated for power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.
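As a small worked example of how an entry of the Err columns can be read, the sketch below computes a relative error between an estimated and a real saving. The formula (absolute deviation divided by the real value) is an assumption: this section does not spell out the definition, and the function name is introduced here only for illustration. The input pair is the LR12 power gating entry of Table 12 (estimated −4.278%, real −4.288%).

```python
def relative_error_pct(estimated, real):
    """Relative deviation of an estimate from the real value, in percent.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    if real == 0:
        raise ValueError("real value must be non-zero")
    return abs(estimated - real) / abs(real) * 100.0


# LR12, power gating saving at 90 nm (values read from Table 12).
err = relative_error_pct(-4.278, -4.288)
print(f"{err:.2f}%")
```

For this pair the result rounds to the 0.23% listed in the table. Other rows may differ in the second decimal, presumably because the published errors were computed on unrounded data.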

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we assessed the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the AreaThreshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall savings than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which excludes only 3 LRs from power gating, saves up to 61.84% of the total Base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, its presence in the automated flow makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models, targeting the 45 nm CMOS technology, is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones as reference in this work) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) * switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch down N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
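The leakage term described above can be illustrated with a short numeric sketch. Only the product P_lkg(SW) * switches comes from the text; the way it is weighed against the saving (the domain leakage being fully removed for the fraction of time in which the domain is off, 1 − T_ON) is a simplifying assumption, and every number below is hypothetical rather than taken from the technology library.

```python
def switch_leakage_overhead(p_lkg_sw, n_switches):
    """Always-present leakage of the power switches: P_lkg(SW) * switches."""
    return p_lkg_sw * n_switches


def net_static_saving(p_lkg_domain, t_on, p_lkg_sw, n_switches):
    """Average static power saved by gating a domain, net of switch leakage.

    Assumption: when the domain is off (a fraction 1 - t_on of the time),
    its leakage is removed entirely; the switches leak all the time.
    """
    saved = p_lkg_domain * (1.0 - t_on)
    return saved - switch_leakage_overhead(p_lkg_sw, n_switches)


# Hypothetical figures: a domain leaking 50 uW, active 33% of the time,
# gated by one column of 8 header switches leaking 0.5 uW each.
net = net_static_saving(p_lkg_domain=50.0, t_on=0.33, p_lkg_sw=0.5, n_switches=8)
print(f"net static saving: {net:.2f} uW")
```

A positive result means that gating the domain saves more leakage than the switches themselves dissipate, which is the situation described for a single column of N switches.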

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                                       PG set                                           CG set                                        NA
DAT_1    LR1 LR2 LR3 LR4 LR6 LR7 LR8 LR9 LR10 LR11 LR12 LR13       LR1 LR2 LR3 LR4 LR7 LR8 LR9 LR10 LR11 LR12       LR6 LR13                                      LR5
DAT_5    LR1 LR2 LR3 LR6 LR8 LR10 LR12 LR13                        LR1 LR2 LR3 LR8 LR10 LR12                        LR4 LR6 LR7 LR9 LR11 LR13                     LR5
DAT_10   LR1 LR8                                                   LR1 LR8                                          LR2 LR3 LR4 LR6 LR7 LR9 LR10 LR11 LR12 LR13   LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step (all values in %).

        PG saving                       Net      CG saving                       Net
LR      Est       Real      Err        Err      Est       Real      Err        Err
LR1     −5.686    −5.699    0.23       0.47     −5.507    −5.501    0.13       0.53
LR2     −22.990   −22.982   0.03       1.02     −22.063   −22.029   0.15       0.81
LR3     −6.569    −6.566    0.04       1.13     −6.273    −6.262    0.18       0.91
LR4     −0.946    −0.944    0.15       1.82     −0.919    −0.917    0.22       1.52
LR5     -         -         -          -        -         -         -          -
LR6     −0.747    −0.748    0.25       0.25     −0.853    −0.833    0.26       1.59
LR7     −0.955    −0.957    0.17       1.36     −0.935    −0.933    0.21       1.49
LR8     −7.464    −7.460    0.06       1.36     −6.724    −6.714    0.16       0.66
LR9     −2.475    −2.482    0.25       1.10     −2.350    −2.345    0.18       1.09
LR10    −5.473    −5.470    0.04       1.09     −5.270    −5.261    0.17       0.91
LR11    −3.422    −3.361    1.79       2.58     −3.257    −3.205    1.63       2.04
LR12    −4.327    −4.330    0.08       1.19     −4.176    −4.169    0.17       0.98
LR13    −0.587    −0.580    1.25       3.70     −0.623    −0.621    0.28       1.85

phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. Then, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
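The two percentages quoted above can be reproduced with a few lines of Python. The sketch assumes that the overhead is the ratio between the 4 wake-up cycles of the switch on/off protocol and the execution cycles of the kernel, which is the reading that matches both quoted figures; the function and constant names are introduced here only for illustration.

```python
WAKE_UP_CYCLES = 4  # cycles required by the switch on/off protocol for each transition


def wake_up_overhead_pct(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time (assumed ratio)."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles * 100.0


# Fastest zoom kernel (24 cycles) and smallest 8 x 8 block of the HEVC filter (160 cycles).
for name, cycles in [("zoom, fastest kernel", 24), ("HEVC interpolation, 8x8 block", 160)]:
    print(f"{name}: {wake_up_overhead_pct(cycles):.1f}% wake-up overhead")
```

The first case gives roughly 16.7% ("almost 17%") and the second 2.5%, so the wake-up cost shrinks in proportion to the length of the kernel.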

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity (all values in %).

        PG overhead
LR      Est       Real      Err
LR1     −30.54    −26.10    17.07
LR2     −0.19     −1.99     90.22
LR3     −21.83    −6.95     214.30
LR4     −0.12     −0.67     80.97
LR5     −0.06     −0.59     90.11
LR6     −0.07     0.56      86.60
LR7     −15.39    −2.64     482.93
LR8     −0.38     −0.02     1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 * k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N * (2 * k + 1) hardware simulations are necessary. The only simplification that may be made even without adopting the proposed approach is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
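The effort comparison above can be checked with a short sketch. The counting rules follow the text (one power gated and one clock gated synthesis per LR, no clock gated synthesis for fully combinatorial LRs, N simulations per synthesized design); the function and parameter names are hypothetical and introduced only for this example.

```python
def exhaustive_effort(n_functionalities, n_regions, n_combinatorial=0):
    """Syntheses and simulations needed without the proposed flow.

    One power gated design per LR, one clock gated design per LR that is not
    fully combinatorial, plus the baseline; every design is simulated once
    per functionality.
    """
    syntheses = n_regions + (n_regions - n_combinatorial) + 1
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """The proposed flow needs only the baseline synthesis and its N simulations."""
    return 1, n_functionalities


for name, n, k, comb in [("FFT", 4, 8, 3), ("zoom", 7, 13, 1)]:
    print(name, "exhaustive:", exhaustive_effort(n, k, comb), "proposed:", proposed_effort(n))
```

With no fully combinatorial regions, `exhaustive_effort` reduces to the general 2 * k + 1 syntheses and N * (2 * k + 1) simulations given in the text.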

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath are not straightforward and require a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™-Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J Yi and J Kim ldquoPower modeling for digital circuits with clockgatingrdquo IEICE Electronics Express vol 12 no 24 Article ID20150817 2015

[37] H Xu R Vemuri and W-B Jone ldquoRun-time active leakagereduction by power gating and reverse body biasing an energyviewrdquo in Proceedings of the 26th IEEE International Conferenceon Computer Design (ICCD rsquo08) pp 618ndash625 IEEE October2008

[38] K Datta A Mukherjee G Cao et al ldquoCASPER embeddingpower estimation and hardware-controlled powermanagementin a cycle-accurate micro-architecture simulation platform formany-core multi-threading heterogeneous processorsrdquo Journalof Low Power Electronics and Applications vol 2 no 1 pp 30ndash68 2012

Journal of Electrical and Computer Engineering 27

[39] A. Chhabra, H. Rawat, M. Jain, et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Table 1 reports, for each parameter considered in (1), (3), (4), and (5), its classification. A deeper explanation of $P_{lkg/int}(cmb)$ and $P_{lkg/int}(reg)$ is necessary. They are not associated with any specific parameter class: indeed, they depend on the type and number of cells composing the considered LR, and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

3.2.5. Step-by-Step Example. The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definitions provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.

Starting from the data in Tables 2 and 3, the detailed characterization of the equations for LR5, which includes actors $F$ and $G$, follows.

When power gating is applied, the static power consumption of LR5 is derived according to (1) as follows:

$$
\begin{aligned}
P_{lkg}(LR_5) ={}& \Big[\Big(P_{lkg}(comb_F) + P_{lkg}(RC) \cdot rtn_F + P_{lkg}(reg_F) \cdot \frac{reg_F - rtn_F}{reg_F}\Big) \\
&+ \Big(P_{lkg}(comb_G) + P_{lkg}(RC) \cdot rtn_G + P_{lkg}(reg_G) \cdot \frac{reg_G - rtn_G}{reg_G}\Big)\Big] \cdot T_{5,ON} \\
&+ \big[P_{lkg}(ISO_{ON}) \cdot T_{5,ON} + P_{lkg}(ISO_{OFF}) \cdot T_{5,OFF}\big] \cdot iso_5 \\
&+ \big[P_{lkg}(Contr_{ON}) \cdot T_{5,ON} + P_{lkg}(Contr_{OFF}) \cdot T_{5,OFF}\big] \\
&+ \big[P_{lkg}(CG_{ON}) \cdot T_{5,ON} + P_{lkg}(CG_{OFF}) \cdot T_{5,OFF}\big] \\
={}& [(213 + 17.15 \cdot 128) + (273 + 17.15 \cdot 64 + 1385 \cdot 0.5)] \cdot 0.3 + [4.27 \cdot 0.3 + 1.39 \cdot 0.7] \cdot 32 \\
&+ [95.44 \cdot 0.3 + 88.63 \cdot 0.7] + [5.77 \cdot 0.3 + 4.71 \cdot 0.7] = 1509.219.
\end{aligned}
\tag{6}
$$

The internal power consumption is given by (3):

$$
\begin{aligned}
P_{int}(LR_5) ={}& \Big[\Big(P_{int}(comb_F) + P_{int}(RC) \cdot rtn_F + P_{int}(reg_F) \cdot \frac{reg_F - rtn_F}{reg_F}\Big) \\
&+ \Big(P_{int}(comb_G) + P_{int}(RC) \cdot rtn_G + P_{int}(reg_G) \cdot \frac{reg_G - rtn_G}{reg_G}\Big)\Big] \cdot T_{5,ON} \\
&+ \big[P_{int}(ISO_{ON}) \cdot T_{5,ON} + P_{int}(ISO_{OFF}) \cdot T_{5,OFF}\big] \cdot iso_5 \\
&+ \big[P_{int}(Contr_{ON}) \cdot T_{5,ON} + P_{int}(Contr_{OFF}) \cdot T_{5,OFF}\big] \\
&+ \big[P_{int}(CG_{ON}) \cdot T_{5,ON} + P_{int}(CG_{OFF}) \cdot T_{5,OFF}\big] \\
={}& [(537 + 383.25 \cdot 128) + (363 + 383.25 \cdot 64 + 44068 \cdot 0.5)] \cdot 0.3 + [2.7 \cdot 0.3 + 0 \cdot 0.7] \cdot 32 \\
&+ [1449 \cdot 0.3 + 1488 \cdot 0.7] + [169 \cdot 0.3 + 292 \cdot 0.7] = 30712.72.
\end{aligned}
\tag{7}
$$

When clock gating is considered, (4) and (5) are computed as follows:

$$
\begin{aligned}
P_{lkg}(LR_5) ={}& \big[\big(P_{lkg}(comb_F) + P_{lkg}(reg_F)\big) + \big(P_{lkg}(comb_G) + P_{lkg}(reg_G)\big)\big] \\
&+ \big[P_{lkg}(Enab_{ON}) \cdot T_{5,ON} + P_{lkg}(Enab_{OFF}) \cdot T_{5,OFF}\big] \\
&+ \big[P_{lkg}(CG_{ON}) \cdot T_{5,ON} + P_{lkg}(CG_{OFF}) \cdot T_{5,OFF}\big] \\
={}& [(213 + 1232) + (273 + 1385)] + [84.51 \cdot 0.3 + 76.54 \cdot 0.7] + [5.77 \cdot 0.3 + 4.71 \cdot 0.7] = 3186.96, \\
P_{int}(LR_5) ={}& \big[\big(P_{int}(comb_F) + P_{int}(reg_F) \cdot T_{5,ON}\big) + \big(P_{int}(comb_G) + P_{int}(reg_G) \cdot T_{5,ON}\big)\big] \\
&+ \big[P_{int}(Enab_{ON}) \cdot T_{5,ON} + P_{int}(Enab_{OFF}) \cdot T_{5,OFF}\big] \\
&+ \big[P_{int}(CG_{ON}) \cdot T_{5,ON} + P_{int}(CG_{OFF}) \cdot T_{5,OFF}\big] \\
={}& [(537 + 22489 \cdot 0.3) + (363 + 44068 \cdot 0.3)] + [1351 \cdot 0.3 + 1320 \cdot 0.7] + [169 \cdot 0.3 + 292 \cdot 0.7] = 22451.8.
\end{aligned}
\tag{8}
$$

Table 4 summarizes all the values obtained by applying the proposed static and dynamic models to all the different logic regions.
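The power gating estimate for LR5 can be re-traced numerically. The following is a minimal sketch, assuming the model has the form instantiated in (6) and (7); the names `Actor`, `time_weighted`, and `power_gated_estimate` are illustrative and do not belong to MDC. All numbers are the LR5 values from Tables 2 and 3, in nW.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Actor:
    """Power figures [nW] and register counts of one actor (hypothetical container)."""
    p_comb: float  # power of the combinatorial cells
    p_reg: float   # power of the sequential cells
    reg: int       # number of registers
    rtn: int       # number of registers replaced by retention cells


def time_weighted(on_off: Tuple[float, float], t_on: float) -> float:
    """Average an (ON, OFF) power pair over the activation time."""
    p_on, p_off = on_off
    return p_on * t_on + p_off * (1.0 - t_on)


def power_gated_estimate(actors: Sequence[Actor], t_on: float, iso: int,
                         p_rc: float, p_iso: Tuple[float, float],
                         p_contr: Tuple[float, float],
                         p_cg: Tuple[float, float]) -> float:
    """Estimated power of a power-gated LR, following the structure of (6)/(7)."""
    # Functional logic: it only consumes while the region is switched on.
    logic = sum(a.p_comb + p_rc * a.rtn + a.p_reg * (a.reg - a.rtn) / a.reg
                for a in actors)
    # Isolation cells, power controller and clock gating cell are always powered.
    return (logic * t_on
            + time_weighted(p_iso, t_on) * iso
            + time_weighted(p_contr, t_on)
            + time_weighted(p_cg, t_on))


# Static (leakage) contribution of LR5, actors F and G.
LKG_LR5 = power_gated_estimate(
    [Actor(213, 1232, 128, 128), Actor(273, 1385, 128, 64)],
    t_on=0.3, iso=32, p_rc=17.15,
    p_iso=(4.27, 1.39), p_contr=(95.44, 88.63), p_cg=(5.77, 4.71))

# Internal contribution of LR5.
INT_LR5 = power_gated_estimate(
    [Actor(537, 22489, 128, 128), Actor(363, 44068, 128, 64)],
    t_on=0.3, iso=32, p_rc=383.25,
    p_iso=(2.7, 0.0), p_contr=(1449, 1488), p_cg=(169, 292))

print(f"lkg PG = {LKG_LR5:.3f} nW, int PG = {INT_LR5:.2f} nW")
```

Under these assumptions the static value agrees with the result of (6), and the internal value agrees with the LR5 internal power gating entry of Table 4.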

3.2.6. Hybrid Clock and Power Gating Support and Integration in MDC. The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow in order to implement a fully automated power management

Table 1: Parameter classification. For each considered parameter, the table reports its typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch / Funct / Tech | Description | Extraction |
|---|---|---|---|
| $reg$ | x | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| $iso$ | x | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs to each other in the dataflow model |
| $T_{i,ON}$ | x | Activation time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $T_{i,OFF}$ | x | Off time | Found by means of a high-level (directly on the dataflow) profiling of the targeted scenario |
| $rtn$ | x | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{lkg/int}(RC)$, $P_{lkg/int}(ISO_{ON})$, $P_{lkg/int}(ISO_{OFF})$ | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; can be obtained without any implementation run |
| $P_{lkg/int}(Contr_{ON})$, $P_{lkg/int}(Contr_{OFF})$, $P_{lkg/int}(Enab_{ON})$, $P_{lkg/int}(Enab_{OFF})$ | x | Power estimation of PG and CG controller when the LR is on or off | Requires characterization according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(cmb)$, $P_{lkg/int}(reg)$ | x x x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |

Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

| Parameter | lkg power [nW] | int power [nW] |
|---|---|---|
| $Enab_{ON}$ | 84.51 | 1351 |
| $Enab_{OFF}$ | 76.54 | 1320 |
| $Contr_{ON}$ | 95.44 | 1449 |
| $Contr_{OFF}$ | 88.63 | 1488 |
| $CG_{ON}$ | 5.77 | 169 |
| $CG_{OFF}$ | 4.71 | 292 |
| $ISO_{ON}$ | 4.27 | 2.7 |
| $ISO_{OFF}$ | 1.39 | 0 |
| $P(RC)$ | 17.15 | 383.25 |

    PG_set is empty
    CG_set is empty
    foreach LR_i in set LRs do
        evaluate_area(LR_i, area_th)
    end

    function evaluate_area(LR_i, area_th)
        calculate LR_i area
        if area_LR > area_th then
            evaluate_PG(LR_i)
        else
            evaluate_CG(LR_i)
        end

    function evaluate_PG(LR_i)
        estimate PG total overhead
        if PG total overhead < 0 then
            estimate CG total overhead
            if PG total overhead < CG total overhead then
                add LR_i to PG_set
            else
                add LR_i to CG_set
            end
        else
            evaluate_CG(LR_i)
        end

    function evaluate_CG(LR_i)
        estimate CG total overhead
        if CG total overhead < 0 then
            add LR_i to CG_set
        end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-size-fits-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial due to the very small amount of additional logic it requires.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). The power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5), to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). The clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
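The three steps above can be sketched in a few lines of Python. This is only an illustration of the decision logic of Algorithm 1, not the MDC implementation: the function name `classify_regions` and the input layout (area and overheads already expressed as percentages) are assumptions made for the example. The sample data are the area and overhead figures of the step-by-step example of Section 3.2.7.

```python
from typing import Dict, List, Tuple

# name -> (area [% of total], PG total overhead [%], CG total overhead [%])
Regions = Dict[str, Tuple[float, float, float]]


def classify_regions(regions: Regions, area_th: float
                     ) -> Tuple[List[str], List[str], List[str]]:
    """Return (PG_set, CG_set, always_on) following Algorithm 1."""
    pg_set: List[str] = []
    cg_set: List[str] = []
    always_on: List[str] = []

    def evaluate_cg(name: str, cg_over: float) -> None:
        # Clock gate only if it yields a net saving (negative overhead).
        if cg_over < 0:
            cg_set.append(name)
        else:
            always_on.append(name)

    def evaluate_pg(name: str, pg_over: float, cg_over: float) -> None:
        if pg_over < 0:
            # Both strategies are compared; the larger saving wins.
            if pg_over < cg_over:
                pg_set.append(name)
            else:
                cg_set.append(name)
        else:
            evaluate_cg(name, cg_over)

    for name, (area, pg_over, cg_over) in regions.items():
        # Area thresholding: small regions skip the power gating evaluation.
        if area > area_th:
            evaluate_pg(name, pg_over, cg_over)
        else:
            evaluate_cg(name, cg_over)
    return pg_set, cg_set, always_on


EXAMPLE: Regions = {
    "LR1": (52.0, -86.45, -2.15),
    "LR3": (0.4, +0.02, +0.01),
    "LR4": (7.0, -1.2, -2.0),
    "LR5": (15.0, -0.89, -1.04),
}

PG_SET, CG_SET, ALWAYS_ON = classify_regions(EXAMPLE, area_th=5.0)
print(PG_SET, CG_SET, ALWAYS_ON)
```

In the actual flow the overheads are not given up front but estimated on demand from the power models; passing them as precomputed values simply keeps the sketch short.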

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LR classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CGPG HDL Generation block to automatically define the hybrid clock and power gating management support for the given CGR design.

In summary, with respect to what is discussed in Section 3.1, this flow does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limit of providing a one-size-fits-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.

Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{ON}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | α | 0.1 | B | 32 | 514 | 24 |
| LR3 | β | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | α, γ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | γ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 12406.9 | 404358.71 | 122294.15 | 3928700.60 |
| LR3 | 342.67 | 3884.44 | 294.67 | 3599 |
| LR4 | 2062.11 | 38705.08 | 3880.86 | 5598.6 |
| LR5 | 1509.58 | 30712.72 | 3186.96 | 22451.8 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LR identification led to determining five LRs, and the user-specified power management technique is blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of the total area.
    (b) area_LR1 > area_th, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is -86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is -2.15%.
        (3) Power gating is more beneficial than clock gating, yielding a larger overall power saving. Thus LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of the total area.
    (b) area_LR3 < area_th, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of the total area.
    (b) area_LR4 > area_th, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gating support. Blocks shown: ORCC DPNs parsing, datapath merging, and hardware platform generation (MDC baseline flow); logic regions identification, Power Analysis (fed by the synthesizer scripts and reports, producing the LR class), and CGPG HDL generation producing the optimal HDL + CPF (hybrid clock and power gating extension).

        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is -1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is -2.0%.
        (3) Clock gating is more beneficial than power gating, yielding a larger overall power saving. Thus LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of the total area.
    (b) area_LR5 > area_th, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is -0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is -1.04%.
        (3) Clock gating is more beneficial than power gating, yielding a larger overall power saving. Thus LR5 is added to CG_set.
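The percentage overheads quoted in the steps above follow from the per-region power figures reported in Figure 4. A minimal sketch of that calculation is given below, assuming the overhead is defined as described for LR1 (gated minus baseline power of the region, divided by the total baseline consumption of 4311417 nW); `overhead_pct` and the dictionary layout are illustrative names only.

```python
BASE_TOTAL_NW = 4_311_417  # base design consumption reported in Figure 4

# name -> (baseline, power gated, clock gated) consumption of the region [nW]
REGION_POWER_NW = {
    "LR1": (4_143_798, 416_765, 4_050_955),
    "LR3": (3_266, 4_227, 3_894),
    "LR4": (93_793, 40_767, 9_479),
    "LR5": (70_560, 32_222, 25_639),
}


def overhead_pct(p_gated: float, p_base: float,
                 p_total: float = BASE_TOTAL_NW) -> float:
    """Percentage variation of the overall consumption; negative means saving."""
    return 100.0 * (p_gated - p_base) / p_total


OVERHEADS = {
    name: (overhead_pct(pg, base), overhead_pct(cg, base))
    for name, (base, pg, cg) in REGION_POWER_NW.items()
}

for name, (pg_over, cg_over) in OVERHEADS.items():
    print(f"{name}: PG {pg_over:+.2f}%  CG {cg_over:+.2f}%")
```

With these inputs the sketch reproduces the overheads used in the example, up to the rounding adopted in the text (one decimal for LR4, two decimals elsewhere).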

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure). The SBoxes SB_0 and SB_1, included in region LR4, are purely combinatorial, so they are not affected by the application

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. Starting from empty PG_set and CG_set, the figure traces the calls to evaluate_area, evaluate_PG, and evaluate_CG for LR1, LR3, LR4, and LR5, with threshold area_th = 5% and a base design consumption of 4311417 nW. The data reported in the figure are the following.

| LR | Area [%] | Base [nW] | PG [nW] (overhead [%]) | CG [nW] (overhead [%]) |
|---|---|---|---|---|
| LR1 | 52 | 4143798 | 416765 (-86.45) | 4050955 (-2.15) |
| LR3 | 0.4 | 3266 | 4227 (+0.02) | 3894 (+0.01) |
| LR4 | 7 | 93793 | 40767 (-1.2) | 9479 (-2.0) |
| LR5 | 15 | 70560 | 32222 (-0.89) | 25639 (-1.04) |

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies. Labels shown: actors A to G, SBoxes SB_0, SB_1, and SB_2, inputs In1 to In3, output Out1, power domain PD1 (power switch, isolation, state retention), clock domains CD4 and CD5 (CG cells), power controller, Clk, and VDD.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$
X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \qquad k = 0, 1, \ldots, N-1. \tag{9}
$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called the radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs are called butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$
X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}
$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$
w_n^k = e^{-i 2\pi (k/n)}, \tag{11}
$$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Inputs x(k = 0), ..., x(k = 7) enter in the order 0, 4, 2, 6, 1, 5, 3, 7 and outputs X(n = 0), ..., X(n = 7) leave in natural order. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce the outputs.
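The structure of the size-8 transform can be checked with a short software model. The sketch below is a generic iterative radix-2 decimation-in-time FFT written for illustration (it is not the hardware description used in the use case); it applies the butterfly of (10) with the standard twiddle factor $w_n^k = e^{-i 2\pi k/n}$, counts the butterflies executed, and compares the result with a direct evaluation of the DFT. The function names are assumptions of the example.

```python
import cmath
from typing import List, Sequence, Tuple


def dft(x: Sequence[complex]) -> List[complex]:
    """Direct evaluation of the DFT definition (reference result)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


def fft_radix2(x: Sequence[complex]) -> Tuple[List[complex], int]:
    """Iterative radix-2 DIT FFT; returns (spectrum, number of butterflies).

    The length of x must be a power of two.
    """
    n = len(x)
    stages = n.bit_length() - 1
    # Reorder the inputs in bit-reversed order (0, 4, 2, 6, 1, 5, 3, 7 for n = 8).
    a = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(n)]
    butterflies = 0
    size = 2
    while size <= n:  # one iteration per stage
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                x0, x1 = a[start + k], a[start + k + half]
                a[start + k] = x0 + x1 * w
                a[start + k + half] = x0 - x1 * w
                butterflies += 1
        size *= 2
    return a, butterflies


SAMPLES = [1, 2, 3, 4, 5, 6, 7, 8]
SPECTRUM, BUTTERFLIES = fft_radix2(SAMPLES)
MAX_ERROR = max(abs(u - v) for u, v in zip(SPECTRUM, dft(SAMPLES)))
print(BUTTERFLIES, MAX_ERROR < 1e-9)
```

For an input of size 8 the loop runs three stages of four butterflies each, which corresponds to the 12 butterflies of the baseline design.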

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput decreases. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform;

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles;

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles;

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations. The chart reports the base power (µW, axis from 3000 to 5000) of 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit external environment requests. For instance, in a battery-operated environment, when the battery level drops below a given threshold, some throughput can be traded for lower power consumption.
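As an illustration of this kind of run-time trade-off, the sketch below picks the fastest FFT configuration that fits a given power budget. The selection policy and the function name `pick_config` are hypothetical and are not described in the paper; the latencies are those listed for the four configurations, and the power of each configuration is assumed here to be the sum of the static and internal contributions of Table 5 (net power neglected).

```python
from typing import NamedTuple, Optional


class FftConfig(NamedTuple):
    name: str
    latency_cycles: int
    power_nw: int  # static + internal power from Table 5 (assumption)


FFT_CONFIGS = [
    FftConfig("1b", 24, 1_750_407 + 1_307_259),
    FftConfig("2b", 12, 1_752_106 + 1_539_855),
    FftConfig("4b", 6, 1_757_485 + 2_020_953),
    FftConfig("12b", 3, 1_776_056 + 2_411_257),
]


def pick_config(power_budget_nw: float) -> Optional[FftConfig]:
    """Hypothetical policy: lowest latency among configurations within budget."""
    feasible = [c for c in FFT_CONFIGS if c.power_nw <= power_budget_nw]
    if not feasible:
        return None  # no configuration fits the budget
    return min(feasible, key=lambda c: c.latency_cycles)


for budget in (5_000_000, 3_500_000, 3_000_000):
    choice = pick_config(budget)
    print(budget, choice.name if choice else "none")
```

A real controller would also have to account for the cost of switching between configurations, which this sketch ignores.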

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, the activation time ($T_{ON}$) of any given LR has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}(1b) = 0.42$, $T_{ON}(2b) = 0.21$, and $T_{ON}(4b) = 0.04$.

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso(e) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso(r) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |
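The $T_{ON}$ row of Table 6 can be derived mechanically from Table 5: the activation time of a region is the sum of the activation times of the configurations that use it. A minimal sketch follows; the dictionary layout and the function name are chosen for the example only.

```python
from typing import Dict, Tuple

# FFT configuration -> (activation time, logic regions it activates), from Table 5
CONFIGS: Dict[str, Tuple[float, Tuple[int, ...]]] = {
    "1b": (0.42, (2, 6, 8)),
    "2b": (0.21, (2, 5, 7, 8)),
    "4b": (0.04, (2, 3, 4, 7)),
    "12b": (0.33, (1, 3, 7, 8)),
}


def region_activation_times(
        configs: Dict[str, Tuple[float, Tuple[int, ...]]]) -> Dict[int, float]:
    """Sum, for each LR, the activation times of the configurations using it."""
    t_on: Dict[int, float] = {}
    for share, regions in configs.values():
        for lr in regions:
            t_on[lr] = t_on.get(lr, 0.0) + share
    # Round to two decimals to hide floating-point noise in the sums.
    return {lr: round(t, 2) for lr, t in sorted(t_on.items())}


T_ON = region_activation_times(CONFIGS)
print(T_ON)
```

Each value obtained this way coincides with the corresponding $T_{ON}$ entry of Table 6, including the 0.67 worked out for LR2 in the text.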

Figure 8: FFT use case: area percentage per LR (90 nm). Values reported in the figure:

| LR | Area [%] | seq [%] | comb [%] |
|---|---|---|---|
| LR1 | 62.48 | 0.91 | 99.09 |
| LR2 | 0.65 | 58.92 | 41.07 |
| LR3 | 15.8 | 0.90 | 99.09 |
| LR4 | 0.44 | 0 | 100 |
| LR5 | 0.45 | 0 | 100 |
| LR6 | 0.43 | 0 | 100 |
| LR7 | 7.92 | 7.93 | 0.90 |
| LR8 | 1.36 | 27.99 | 72.01 |

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design, it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving or not. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the overhead of the power saving strategy. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained by applying (1) and (3) for a prospective power gating implementation, and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

| LR | PG overhead, Static | PG overhead, Internal | CG overhead, Static | CG overhead, Internal |
|---|---|---|---|---|
| LR1 | −40.87 | −9.47 | 0.00 | −11.33 |
| LR2 | +0.23 | −1.32 | 0.00 | −2.85 |
| LR3 | −10.44 | −2.11 | 0.00 | −2.59 |
| LR4 | −0.08 | 0.10 | - | - |
| LR5 | 0.08 | 0.13 | - | - |
| LR6 | 0.11 | 0.17 | - | - |
| LR7 | −3.44 | −0.39 | 0.00 | −0.79 |
| LR8 | 1.42 | 2.35 | 0.00 | −0.25 |

Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est | PG overhead, Real | PG overhead, Err | Overhead error, Iso | Overhead error, Rtn | Overhead error, Contr | Net, Err |
|---|---|---|---|---|---|---|---|
| LR1 | −25.22 | −25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | −6.28 | −6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | −1.92 | −1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est | CG overhead, Real | CG overhead, Err | Overhead, CGcell Err | Net, Err |
|---|---|---|---|---|---|
| LR1 | −5.73 | −5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | −1.42 | −1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | −1.34 | −1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | −0.39 | −0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | −0.12 | −0.12 | 0.29 | 1.08 | 0.47 |

Figure 10: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

that clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: Threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: Threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is also performed on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
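The two threshold settings above can be reproduced with a small selection routine. This is only a sketch of the decision logic as it is described in this section, not the Algorithm 1 implementation of the MDC tool: the function name, the dictionaries, and the tie-breaking rule are assumptions. Area shares are read from Figure 8 (LR3 is taken as 15.8%, consistent with its being above both thresholds), and the estimated overheads come from Tables 8 and 9.

```python
# Area share of each LR (% of the whole system), as read from Figure 8.
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}

# Estimated total overhead (%); a negative value is a saving (Tables 8 and 9).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
# Fully combinatorial regions (LR4, LR5, LR6) have no clock gating estimate.
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_strategy(area, pg_est, cg_est, threshold):
    """Assign each LR to 'PG', 'CG', or 'NA' (left in the always-on domain)."""
    plan = {}
    for lr, share in area.items():
        cg = cg_est.get(lr)  # None for fully combinatorial regions
        # Power gating is only evaluated for regions above the area threshold.
        pg = pg_est.get(lr) if share > threshold else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            plan[lr] = "PG"  # power gating saves more than clock gating
        elif cg is not None and cg < 0:
            plan[lr] = "CG"  # clock gating still provides a saving
        else:
            plan[lr] = "NA"  # no strategy is convenient
    return plan


DAT_5 = select_strategy(AREA, PG_EST, CG_EST, threshold=5.0)
DAT_10 = select_strategy(AREA, PG_EST, CG_EST, threshold=10.0)
print("DAT_5: ", DAT_5)
print("DAT_10:", DAT_10)
```

With the 10% threshold, LR7 is no longer a power gating candidate and falls back to clock gating, which matches the DAT_10 composition described above.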

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison of the static, internal, and total power between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. The CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than that of DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (1b: 24 CC, 2b: 12 CC, 4b: 6 CC, 12b: 3 CC) when gated designs are adopted.

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | Actors | $T_{ON}$ | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption of each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As the reported histograms show, the difference is on average about two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend of all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est | PG saving, Real | PG saving, Err | PG Net, Err | CG saving, Est | CG saving, Real | CG saving, Err | CG Net, Err |
|---|---|---|---|---|---|---|---|---|
| LR1 | −6.322 | −6.331 | 0.15 | 0.18 | −6.307 | −6.313 | 0.09 | 0.17 |
| LR2 | −23.464 | −23.463 | 0.00 | 1.26 | −23.369 | −23.373 | 0.02 | 1.67 |
| LR3 | −6.537 | −6.544 | 0.10 | 1.65 | −6.507 | −6.514 | 0.10 | 1.59 |
| LR4 | - | - | - | - | −0.958 | −0.964 | 0.68 | 1.34 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | - | - | - | - | −0.870 | −0.878 | 0.81 | 1.22 |
| LR7 | - | - | - | - | −1.033 | −1.039 | 0.63 | 0.15 |
| LR8 | −7.062 | −7.058 | 0.05 | 2.02 | −6.987 | −6.986 | 0.01 | 1.83 |
| LR9 | - | - | - | - | −2.436 | −2.443 | 0.27 | 1.55 |
| LR10 | −5.498 | −5.503 | 0.10 | 1.65 | −5.474 | −5.480 | 0.11 | 1.57 |
| LR11 | - | - | - | - | −3.329 | −3.285 | 1.32 | 3.67 |
| LR12 | −4.278 | −4.288 | 0.23 | 1.83 | −4.267 | −4.273 | 0.15 | 1.86 |
| LR13 | - | - | - | - | −0.649 | −0.656 | 1.01 | 1.13 |

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison of the static (lkg), internal (int), and total (tot) power between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages of the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not appreciably affected by this simplification.

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we assessed the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison of the static (lkg), internal (int), and total (tot) power between the base design and the five gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add in this second trial an additional design, DAT_1. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become very similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial; still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \ast \#switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $\#switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
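As a rough sanity check of this statement, the switch leakage term and a possible break-even condition can be written as follows. This is only a sketch: $P_{lkg}(LR)$ (the leakage of the ungated region) and the assumption that the region leaks nothing while it is switched off are illustrative and are not part of equation (1).

```latex
% Leakage added by one column of N header switches (always present)
P_{lkg}^{sw} = P_{lkg}(SW) \cdot \#switches = P_{lkg}(SW) \cdot N

% Hypothetical break-even condition: the leakage saved while the region
% is off (a fraction 1 - T_{ON} of the time) must exceed the switch leakage
(1 - T_{ON}) \cdot P_{lkg}(LR) > P_{lkg}(SW) \cdot N
```

Under these assumptions, regions that are almost always active (high $T_{ON}$) or very small (low $P_{lkg}(LR)$) are the ones for which the switch leakage is hardest to recover.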

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est | PG saving, Real | PG saving, Err | PG Net, Err | CG saving, Est | CG saving, Real | CG saving, Err | CG Net, Err |
|---|---|---|---|---|---|---|---|---|
| LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53 |
| LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81 |
| LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91 |
| LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59 |
| LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49 |
| LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66 |
| LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09 |
| LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91 |
| LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04 |
| LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98 |
| LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85 |

phase of the proposed model is a real use case, but it is a small-size design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block and the wake-up time overhead is just 2.5% of the total execution time.
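The two percentages quoted above follow directly from the 4-cycle switch-on protocol of Section 4.3. A minimal sketch of the calculation is given below; the function name is illustrative, and the overhead is assumed to be expressed relative to the kernel execution time, which is what reproduces the figures in the text.

```python
WAKE_UP_CYCLES = 4  # clock cycles needed for each on/off transition


def wake_up_overhead(execution_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    return 100.0 * wake_up_cycles / execution_cycles


# Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
for label, cycles in (("zoom, fastest kernel", 24), ("HEVC, 8x8 block", 160)):
    print(f"{label}: {wake_up_overhead(cycles):.1f}% wake-up overhead")
```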

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.
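The size of these errors can be checked from the estimated and real overheads themselves. The sketch below assumes that the error is the relative deviation of the estimate from the real value; the source does not state the formula explicitly, and values recomputed from the rounded entries of Table 15 differ slightly from the tabulated errors. Only three rows of the table are used here.

```python
def relative_error(estimated, real):
    """Relative deviation (%) of the estimate from the real value."""
    return 100.0 * abs(estimated - real) / abs(real)


# (estimated, real) PG overhead without real switching activity, from Table 15.
TABLE_15 = {
    "LR1": (-30.54, -26.10),
    "LR3": (-21.83, -6.95),
    "LR7": (-15.39, -2.64),
}

for lr, (est, real) in TABLE_15.items():
    print(f"{lr}: estimated {est}, real {real}, "
          f"error {relative_error(est, real):.1f}%")
```

For LR3 and LR7 the estimate overstates the saving by a factor of three to six, which would be enough to mislead the power gating versus clock gating comparison of Algorithm 1.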

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

| LR | PG overhead, Est | PG overhead, Real | PG overhead, Err |
|---|---|---|---|
| LR1 | −30.54 | −26.10 | 17.07 |
| LR2 | −0.19 | −1.99 | 90.22 |
| LR3 | −21.83 | −6.95 | 214.30 |
| LR4 | −0.12 | −0.67 | 80.97 |
| LR5 | −0.06 | −0.59 | 90.11 |
| LR6 | −0.07 | 0.56 | 86.60 |
| LR7 | −15.39 | −2.64 | 482.93 |
| LR8 | −0.38 | −0.02 | 1941.81 |

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2 \ast k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N \ast (2 \ast k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, applies when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N$ = 4 different functionalities and $k$ = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N$ = 7 and $k$ = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
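The counts above can be derived mechanically from $N$, $k$, and the number of fully combinatorial regions. The following sketch, with an illustrative function name, reproduces them for both use cases.

```python
def evaluation_effort(n_functions, n_regions, n_fully_comb=0):
    """Return ((syntheses, simulations) with the flow, (same) without it)."""
    with_flow = (1, n_functions)
    # One power gated design per LR, one clock gated design per LR that
    # contains sequential logic, plus the baseline design.
    designs = n_regions + (n_regions - n_fully_comb) + 1
    exhaustive = (designs, n_functions * designs)
    return with_flow, exhaustive


FFT = evaluation_effort(n_functions=4, n_regions=8, n_fully_comb=3)
ZOOM = evaluation_effort(n_functions=7, n_regions=13, n_fully_comb=1)
print("FFT :", FFT)
print("Zoom:", ZOOM)
```

The exhaustive effort grows with the product of the number of functionalities and the number of regions, while the effort of the proposed flow grows only with the number of functionalities.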

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable for accelerating multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate while customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps further both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC was previously applying a one-fit-to-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter® RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


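Table 1: Parameter classification. The table reports, for each considered parameter, its typology (architectural, functional, and technological), description, and extraction method.

| Parameter | Arch / Funct / Tech | Description | Extraction |
|---|---|---|---|
| reg | x | Number of sequential cells in the considered LR | Provided by the synthesis reports |
| iso | x | Number of estimated isolation cells in the design | Obtained by counting the number of wires that connect the different LRs to each other in the dataflow model |
| $T_{iON}$ | x | Activation time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario |
| $T_{iOFF}$ | x | Off time | Found by means of a high-level profiling (directly on the dataflow) of the targeted scenario |
| rtn | x | Number of retention cells in the design | Strictly related to the LR functionality and chosen by the designer |
| $P_{lkg/int}(RC)$ | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; it can be obtained without any implementation run |
| $P_{lkg/int}(ISO_{ON})$ | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; it can be obtained without any implementation run |
| $P_{lkg/int}(ISO_{OFF})$ | x | Power estimation of retention and isolation cells | Determined by the selected synthesis process; it can be obtained without any implementation run |
| $P_{lkg/int}(Contr_{ON})$ | x | Power estimation of PG and CG controller when the LR is on or off | It must be characterized according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Contr_{OFF})$ | x | Power estimation of PG and CG controller when the LR is on or off | It must be characterized according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Enab_{ON})$ | x | Power estimation of PG and CG controller when the LR is on or off | It must be characterized according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(Enab_{OFF})$ | x | Power estimation of PG and CG controller when the LR is on or off | It must be characterized according to the target technology with dedicated synthesis trials |
| $P_{lkg/int}(cmb)$ | x x x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |
| $P_{lkg/int}(reg)$ | x x x | Power consumed by combinatorial and sequential cells within the considered LR | Gathered from the reports of the baseline CGR system netlist |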

Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

| Parameter | lkg power [nW] | int power [nW] |
|---|---|---|
| $Enab_{ON}$ | 8451 | 1351 |
| $Enab_{OFF}$ | 7654 | 1320 |
| $Contr_{ON}$ | 9544 | 1449 |
| $Contr_{OFF}$ | 8863 | 1488 |
| $CG_{ON}$ | 577 | 169 |
| $CG_{OFF}$ | 471 | 292 |
| $ISO_{ON}$ | 427 | 27 |
| $ISO_{OFF}$ | 139 | 0 |
| $P(RC)$ | 1715 | 38325 |

PG_set is empty
CG_set is empty
foreach LR_i in set_LRs do
    evaluate_area(LR_i, area_th)
end

function evaluate_area(LR_i, area_th)
    calculate LR_i area
    if area_LR > area_th then
        evaluate_PG(LR_i)
    else
        evaluate_CG(LR_i)
    end

function evaluate_PG(LR_i)
    estimate PG total overhead
    if PG total overhead < 0 then
        estimate CG total overhead
        if PG total overhead < CG total overhead then
            add LR_i to PG_set
        else
            add LR_i to CG_set
        end
    else
        evaluate_CG(LR_i)
    end

function evaluate_CG(LR_i)
    estimate CG total overhead
    if CG total overhead < 0 then
        add LR_i to CG_set
    end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-fit-to-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable, always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial, since it requires a very small amount of additional logic.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). The power gating cost is estimated in order to find out whether it can lead to power saving or not. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm moves on to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). The clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
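The selection logic of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the MDC implementation: the `LogicRegion` class, the `classify` function, and the idea of passing the PG and CG total overheads as precomputed percentages are assumptions made here (in the proposed flow those overheads come from the estimation models). The sample values are the ones used in the step-by-step example of Figure 4.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    # Hypothetical container: overheads are assumed to be precomputed.
    name: str
    area_pct: float         # LR area as a percentage of the total area
    pg_overhead_pct: float  # estimated PG total overhead (negative = saving)
    cg_overhead_pct: float  # estimated CG total overhead (negative = saving)


def classify(regions, area_th):
    """Return (PG_set, CG_set, always_on) following Algorithm 1."""
    pg_set, cg_set, always_on = [], [], []

    def evaluate_cg(lr):
        # Clock gate only when it yields a net saving; else keep always on.
        (cg_set if lr.cg_overhead_pct < 0 else always_on).append(lr.name)

    def evaluate_pg(lr):
        if lr.pg_overhead_pct < 0:
            # PG saves power: keep it only if it beats clock gating.
            if lr.pg_overhead_pct < lr.cg_overhead_pct:
                pg_set.append(lr.name)
            else:
                cg_set.append(lr.name)
        else:
            evaluate_cg(lr)

    for lr in regions:
        # Area thresholding: small regions skip the PG evaluation.
        if lr.area_pct > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)
    return pg_set, cg_set, always_on


# Values of the step-by-step example (areas and overheads in percent).
EXAMPLE = [
    LogicRegion("LR1", 52.0, -86.45, -2.15),
    LogicRegion("LR3", 0.4, 0.02, 0.01),
    LogicRegion("LR4", 7.0, -1.2, -2.0),
    LogicRegion("LR5", 15.0, -0.89, -1.04),
]

if __name__ == "__main__":
    print(classify(EXAMPLE, area_th=5.0))
```

With a 5% area threshold, the sketch places LR1 in the power gated set, LR4 and LR5 in the clock gated set, and leaves LR3 in the always-on domain.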

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LR classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow overcomes the limitation of providing a one-fit-to-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.

Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

| Logic region | Kernel | $T_{ON}$ | Actors | iso | reg | rtn |
|---|---|---|---|---|---|---|
| LR1 | α | 0.1 | B | 32 | 514 | 24 |
| LR3 | β | 0.6 | D, E | 32 | 8 | 8 |
| LR4 | α, γ | 0.4 | A, SB_0, SB_1 | 96 | 256 | 64 |
| LR5 | γ | 0.3 | F, G | 32 | 265 | 192 |

| Actor | lkg seq [nW] | int seq [nW] | lkg comb [nW] | int comb [nW] | reg | rtn |
|---|---|---|---|---|---|---|
| B | 801 | 104987 | 121411 | 3916599 | 512 | 24 |
| D | 48 | 1104 | 51 | 319 | 4 | 4 |
| E | 56 | 1437 | 53 | 198 | 4 | 4 |
| A | 3264 | 89238 | 0 | 0 | 256 | 64 |
| SB_0 | 0 | 0 | 307 | 409 | - | - |
| SB_1 | 0 | 0 | 225 | 350 | - | - |
| F | 1232 | 22489 | 213 | 537 | 128 | 128 |
| G | 1385 | 44068 | 273 | 363 | 128 | 64 |

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

| Logic region | lkg PG [nW] | int PG [nW] | lkg CG [nW] | int CG [nW] |
|---|---|---|---|---|
| LR1 | 124069 | 40435871 | 12229415 | 392870060 |
| LR3 | 34267 | 388444 | 29467 | 3599 |
| LR4 | 206211 | 3870508 | 388086 | 55986 |
| LR5 | 150958 | 3071272 | 318696 | 224518 |

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, the MDC LR identification determined five LRs, and the user-specified power management technique was blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area ($area_{th}$) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: $area_{LR1}$ = 52% of the total area.
    (b) $area_{LR1} > area_{th}$, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient, provided that its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, since it yields a larger overall power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: $area_{LR3}$ = 0.4% of the total area.
    (b) $area_{LR3} < area_{th}$, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: $CG_{over}$ = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: $area_{LR4}$ = 7% of the total area.
    (b) $area_{LR4} > area_{th}$, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient, provided that its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, since it yields a larger overall power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: $area_{LR5}$ = 15% of the total area.
    (b) $area_{LR5} > area_{th}$, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient, provided that its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, since it yields a larger overall power saving. Thus, LR5 is added to CG_set.

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.
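The percentage overheads quoted in the example can be cross-checked with a short calculation. The sketch below assumes that the overhead is the difference between the estimated gated consumption and the baseline consumption of the LR, divided by the base design consumption; the function name `overhead_pct` and the dictionary layout are illustrative choices, while the nW values are the ones reported in Figure 4.

```python
BASE_TOTAL_NW = 4311417  # base design consumption reported in Figure 4

# (baseline, PG estimate, CG estimate) in nW for each LR, from Figure 4
ESTIMATES_NW = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}


def overhead_pct(gated_nw, base_nw, total_nw=BASE_TOTAL_NW):
    """Percentage power variation of the whole design (negative = saving)."""
    return 100.0 * (gated_nw - base_nw) / total_nw


if __name__ == "__main__":
    for name, (base, pg, cg) in ESTIMATES_NW.items():
        print(f"{name}: PG {overhead_pct(pg, base):+.2f}%  "
              f"CG {overhead_pct(cg, base):+.2f}%")
```

Under this assumption the computed variations agree, up to rounding, with the percentages listed for LR1, LR3, LR4, and LR5.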

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure). The SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application of clock gating. All the remaining logic, which includes region LR3, is always on.

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. The threshold is $Area_{th}$ = 5% and the base design consumption is 4311417 nW. PG_set and CG_set are initially empty; the figure reports the following values for the processed LRs:

| LR | Area | Base (nW) | PG (nW) (over) | CG (nW) (over) |
|---|---|---|---|---|
| LR1 | 52% | 4143798 | 416765 (−86.45%) | 4050955 (−2.15%) |
| LR3 | 0.4% | 3266 | 4227 (+0.02%) | 3894 (+0.01%) |
| LR4 | 7% | 93793 | 40767 (−1.2%) | 9479 (−2.0%) |
| LR5 | 15% | 70560 | 32222 (−0.89%) | 25639 (−1.04%) |

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \tag{9}$$
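As a reference point for the FFT discussed next, the DFT of (9) can be evaluated directly. The following is a minimal sketch, assuming that the sum runs over $n = 0, \ldots, N-1$; the function name `dft` is an illustrative choice. It requires on the order of $N^2$ complex multiplications, which is the cost that the FFT reduces.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT: X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


if __name__ == "__main__":
    # A unit impulse has a flat spectrum: every output bin equals 1.
    print(dft([1, 0, 0, 0]))
```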

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs are called butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to compute the outputs.
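The structure described above (three stages of four radix-2 butterflies for a size-8 transform) can be reproduced in software. The sketch below is a generic iterative decimation-in-time FFT written for illustration, not the hardware design of the use case; the function names and the butterfly counter are assumptions introduced here, and the twiddle factor is taken as $w_n^k = e^{-i 2\pi k/n}$.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x):
    """Iterative decimation-in-time FFT; returns (spectrum, butterfly count)."""
    N = len(x)
    stages = N.bit_length() - 1
    if N < 2 or 1 << stages != N:
        raise ValueError("length must be a power of two >= 2")
    # Bit-reversed input order (0, 4, 2, 6, 1, 5, 3, 7 for N = 8).
    data = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(N)]
    count = 0
    size = 2
    while size <= N:  # one iteration per stage
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                a, b = start + k, start + k + half
                data[a], data[b] = butterfly(data[a], data[b], twiddle(k, size))
                count += 1
        size *= 2
    return data, count


if __name__ == "__main__":
    spectrum, butterflies = fft_radix2([1, 2, 3, 4, 5, 6, 7, 8])
    print(butterflies)
```

For an input of length 8 the loop performs four butterflies in each of the three stages, which matches the 12 butterflies of the baseline design.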

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput decreases. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations. The vertical axis reports the base power (μW); the horizontal axis reports the configurations with their latency in clock cycles (1b, 24 CC; 2b, 12 CC; 4b, 6 CC; 12b, 3 CC).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit the requests of the external environment. For instance, in a battery-operated environment, when the battery level becomes lower than a given threshold, some throughput can be given up to consume less power.
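A possible run-time policy exploiting this trade-off can be sketched as follows. The selection rule (pick the lowest-latency configuration whose power fits a given budget) and the function `pick_configuration` are hypothetical and are not part of the presented flow; the latencies are those of the four configurations listed above, and the power of each configuration is approximated here by the sum of the static and internal contributions reported in Table 5.

```python
# (latency in clock cycles, static + internal power in nW), from Table 5
CONFIGS = {
    "1b": (24, 1750407 + 1307259),
    "2b": (12, 1752106 + 1539855),
    "4b": (6, 1757485 + 2020953),
    "12b": (3, 1776056 + 2411257),
}


def pick_configuration(power_budget_nw):
    """Return the lowest-latency configuration within the budget, or None."""
    feasible = [
        (latency, name)
        for name, (latency, power) in CONFIGS.items()
        if power <= power_budget_nw
    ]
    return min(feasible)[1] if feasible else None


if __name__ == "__main__":
    # A generous budget allows the fastest design; a tighter one trades
    # throughput for power.
    print(pick_configuration(5_000_000), pick_configuration(3_500_000))
```

With a budget of 5 mW the fastest 12b configuration fits, while a 3.5 mW budget falls back to 2b, giving up throughput to stay within the power limit.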

MDC identifies 8 different LRs in the CGR system TheLRs activated by each FFT variant are listed in Table 5 whiletheir characteristics are reported in Table 6 In this tablegiven any LR its activation time (119879ON) has been obtainedsumming up the activation times of the FFT configurationsactivating the same region (provided in Table 5) For exampleLR2 is activated by 1b 2b and 4b Its 119879ON is 067 which is thesum of 119879ON(1b) = 042 119879ON(2b) = 021 and 119879ON(4b) = 004413 PowerModelling andHybrid PowerManagement Assess-ment As explained in Section 32 the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions' architectural and functional characteristics.

          LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON      0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)    1024   1112   512    2324   1948   1756   256    5259
rtn       1024   518    256    0      0      0      128    512
reg       1024   518    256    0      0      0      128    512
iso(r)    1024   1032   512    2306   1926   1740   256    4934
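The T_ON row of Table 6 follows directly from Table 5. As a minimal sketch (the dictionary layout and the function name `region_on_time` are illustrative assumptions, not part of MDC), each region accumulates the activation share of every configuration that uses it:

```python
from collections import defaultdict

# Activation share (T_ON) and activated LRs of each FFT configuration (Table 5).
FFT_CONFIGS = {
    "1b": (0.42, (2, 6, 8)),
    "2b": (0.21, (2, 5, 7, 8)),
    "4b": (0.04, (2, 3, 4, 7)),
    "12b": (0.33, (1, 3, 7, 8)),
}


def region_on_time(configs):
    """Return {LR index: T_ON}, summing the shares of the configurations using it."""
    t_on = defaultdict(float)
    for share, regions in configs.values():
        for lr in regions:
            t_on[lr] += share
    # Round to two decimals to hide floating-point noise in the sums.
    return {lr: round(total, 2) for lr, total in sorted(t_on.items())}


T_ON = region_on_time(FFT_CONFIGS)

if __name__ == "__main__":
    for lr, value in T_ON.items():
        print(f"LR{lr}: {value}")
```

For LR2 this gives 0.42 + 0.21 + 0.04 = 0.67, matching the example in the text.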

Figure 8: FFT use case: area percentage per LR (90 nm), with the sequential and combinatorial composition of each region. [Pie chart: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36% of the overall system area.]

a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. Figure 8 also shows that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques would actually lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration. [Bar chart: estimated (Est) and real (Real) power variation (%) for LR1 to LR8.]

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained by applying (1) and (3) for a prospective power gating implementation, and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

        PG overhead [%]       CG overhead [%]
LR      Static    Internal    Static   Internal
LR1     −40.87    −9.47       0.00     −11.33
LR2     +0.23     −1.32       0.00     −2.85
LR3     −10.44    −2.11       0.00     −2.59
LR4     −0.08     0.10        -        -
LR5     0.08      0.13        -        -
LR6     0.11      0.17        -        -
LR7     −3.44     −0.39       0.00     −0.79
LR8     1.42      2.35        0.00     −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

        PG overhead [%]             Overhead error [%]         Net [%]
LR      Est      Real     Err       Iso     Rtn    Contr       Err
LR1     −25.22   −25.37   0.59      5.18    4.64   0.75        1.03
LR3     −6.28    −6.22    1.07      16.34   2.97   0.76        1.26
LR7     −1.92    −1.92    0.24      9.24    1.08   0.83        1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

        CG overhead [%]           Overhead [%]    Net [%]
LR      Est     Real    Err       CGcell Err      Err
LR1     −5.73   −5.77   0.06      0.83            0.18
LR2     −1.42   −1.42   0.12      1.03            0.40
LR3     −1.34   −1.38   0.05      0.84            0.21
LR7     −0.39   −0.39   0.05      0.96            0.22
LR8     −0.12   −0.12   0.29      1.08            0.47

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart: estimated (Est) and real (Real) power variation (%) for LR1, LR2, LR3, LR7, and LR8.]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
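The two threshold settings above can be reproduced with a small sketch of the selection logic. This is only a reading of the steps as described in this section, not a transcription of Algorithm 1: the function name `select_regions`, the data layout, and the fall-through to clock gating for an above-threshold region whose power gating is not convenient are assumptions. Area shares are those of Figure 8, and the estimated overheads are those of Tables 8 and 9 (all in %); fully combinatorial regions have no clock gating estimate.

```python
# Area share of each LR, in % of the whole system (Figure 8).
AREA_PCT = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
            "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated total overhead in % (negative means saving), Tables 8 and 9.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_regions(area_pct, pg_est, cg_est, area_threshold):
    """Split the LRs into power gated, clock gated, and always-on lists."""
    power_gated, clock_gated, always_on = [], [], []
    for lr in sorted(area_pct, key=lambda name: int(name[2:])):
        # None means fully combinatorial: there is nothing to clock gate.
        cg = cg_est.get(lr)
        # Power gating is only evaluated for regions above the area threshold.
        pg = pg_est.get(lr) if area_pct[lr] > area_threshold else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            power_gated.append(lr)
        elif cg is not None and cg < 0:
            clock_gated.append(lr)
        else:
            always_on.append(lr)
    return power_gated, clock_gated, always_on


if __name__ == "__main__":
    for threshold in (5.0, 10.0):
        print(threshold, select_regions(AREA_PCT, PG_EST, CG_EST, threshold))
```

Under these assumptions, raising the threshold from 5% to 10% moves LR7 from the power gated set to the clock gated one, while LR4, LR5, and LR6 stay unmanaged in both cases.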

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving;

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions;

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions;

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1;


Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%). [Bar chart: static, internal, and total power (μW) of each design.]

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Plot: power (μW) of Base, CG_full, PG_full, DAT_5, and DAT_10 for 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, respectively considering the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.
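The Err columns of Table 8 appear consistent with a relative error between the estimated and the real overhead. As a hedged illustration (the formula is an assumption, since this section does not state it explicitly, and the helper name `relative_error_pct` is hypothetical), the LR1 row of Table 8 can be checked as follows:

```python
def relative_error_pct(estimated, real):
    """Relative error of an estimate with respect to the measured value, in %.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    if real == 0:
        raise ValueError("relative error is undefined for a zero reference")
    return abs(estimated - real) / abs(real) * 100.0


if __name__ == "__main__":
    # LR1, power gating overhead (Table 8): Est = -25.22 %, Real = -25.37 %.
    print(f"{relative_error_pct(-25.22, -25.37):.2f}")
```

Because the tabulated overheads are rounded to two decimals, recomputing the error from them only approximates the tabulated Err values, especially for small regions such as LR7.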

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit, accelerating a zoom application, has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels' distinctive features.

Kernel       # actors   T_ON   Functionality                      LRs
abs          1          0.03   Absolute value calculation         (4)
chgb         7          0.33   Bilevel/grayscale block checking   (5, 6, 7, 12, 13)
cubic        10         0.06   Linear combination calculation     (1, 5, 9, 10, 13)
cubic_conv   6          0.09   Cubic filter convolution           (5, 7, 8, 9)
median       9          0.06   Median calculation                 (1, 3, 5, 11, 13)
min_max      1          0.01   Maximum/minimum finding            (11)
sbwlabel     17         0.42   Edge block checking                (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design DAT_5
  >Th:      LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:   LR1, LR2, LR3, LR8, LR10, LR12
  CG set:   LR4, LR6, LR7, LR9, LR11, LR13
  NA:       LR5

Design DAT_10
  >Th:      LR1, LR8
  PG set:   LR1, LR8
  CG set:   LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:       LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving;

(ii) PG_full: the CGR design where all the 13 LRs are power gated;

(iii) CG_full: the CGR design where all the 13 LRs are clock gated;

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1;

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As can be seen in the reported histograms, on average the two differ by approximately two orders of magnitude or more. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

        PG saving [%]                Net [%]    CG saving [%]                Net [%]
LR      Est       Real      Err     Err        Est       Real      Err     Err
LR1     −6.322    −6.331    0.15    0.18       −6.307    −6.313    0.09    0.17
LR2     −23.464   −23.463   0.00    1.26       −23.369   −23.373   0.02    1.67
LR3     −6.537    −6.544    0.10    1.65       −6.507    −6.514    0.10    1.59
LR4     -         -         -       -          −0.958    −0.964    0.68    1.34
LR5     -         -         -       -          -         -         -       -
LR6     -         -         -       -          −0.870    −0.878    0.81    1.22
LR7     -         -         -       -          −1.033    −1.039    0.63    0.15
LR8     −7.062    −7.058    0.05    2.02       −6.987    −6.986    0.01    1.83
LR9     -         -         -       -          −2.436    −2.443    0.27    1.55
LR10    −5.498    −5.503    0.10    1.65       −5.474    −5.480    0.11    1.57
LR11    -         -         -       -          −3.329    −3.285    1.32    3.67
LR12    −4.278    −4.288    0.23    1.83       −4.267    −4.273    0.15    1.86
LR13    -         -         -       -          −0.649    −0.656    1.01    1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%). [Bar chart: static (lkg), internal (int), and total (tot) power (μW) of each design.]

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead: +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead: +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, respectively considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are listed as follows:

(i) Base: the same as in the 90 nm synthesis trial;

(ii) PG_full: the same as in the 90 nm synthesis trial;

(iii) CG_full: the same as in the 90 nm synthesis trial;


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%). [Bar chart: static (lkg), internal (int), and total (tot) power (μW) of each design.]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1;

(v) DAT_5: the same as in the 90 nm synthesis trial;

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial; still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) * switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch down N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model, and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design DAT_1
  >Th:      LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
  PG set:   LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
  CG set:   LR6, LR13
  NA:       LR5

Design DAT_5
  >Th:      LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:   LR1, LR2, LR3, LR8, LR10, LR12
  CG set:   LR4, LR6, LR7, LR9, LR11, LR13
  NA:       LR5

Design DAT_10
  >Th:      LR1, LR8
  PG set:   LR1, LR8
  CG set:   LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:       LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

        PG saving [%]                Net [%]    CG saving [%]                Net [%]
LR      Est       Real      Err     Err        Est       Real      Err     Err
LR1     −5.686    −5.699    0.23    0.47       −5.507    −5.501    0.13    0.53
LR2     −22.990   −22.982   0.03    1.02       −22.063   −22.029   0.15    0.81
LR3     −6.569    −6.566    0.04    1.13       −6.273    −6.262    0.18    0.91
LR4     −0.946    −0.944    0.15    1.82       −0.919    −0.917    0.22    1.52
LR5     -         -         -       -          -         -         -       -
LR6     −0.747    −0.748    0.25    0.25       −0.853    −0.833    0.26    1.59
LR7     −0.955    −0.957    0.17    1.36       −0.935    −0.933    0.21    1.49
LR8     −7.464    −7.460    0.06    1.36       −6.724    −6.714    0.16    0.66
LR9     −2.475    −2.482    0.25    1.10       −2.350    −2.345    0.18    1.09
LR10    −5.473    −5.470    0.04    1.09       −5.270    −5.261    0.17    0.91
LR11    −3.422    −3.361    1.79    2.58       −3.257    −3.205    1.63    2.04
LR12    −4.327    −4.330    0.08    1.19       −4.176    −4.169    0.17    0.98
LR13    −0.587    −0.580    1.25    3.70       −0.623    −0.621    0.28    1.85

phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
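The two percentages quoted above follow from the 4-cycle switch-on protocol. A minimal sketch, assuming the overhead is simply the ratio between the wake-up cycles and the kernel execution cycles (the function name `wake_up_overhead_pct` is hypothetical):

```python
WAKE_UP_CYCLES = 4  # clock cycles required by the switch on/off protocol


def wake_up_overhead_pct(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


if __name__ == "__main__":
    cases = (("zoom, fastest kernel", 24), ("HEVC interpolation, 8 x 8 block", 160))
    for label, cycles in cases:
        print(f"{label}: {wake_up_overhead_pct(cycles):.1f}%")
```

The same ratio makes it easy to see how quickly the overhead becomes negligible as kernels get longer.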

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

        PG overhead [%]
LR      Est      Real     Err
LR1     −30.54   −26.10   17.07
LR2     −0.19    −1.99    90.22
LR3     −21.83   −6.95    214.30
LR4     −0.12    −0.67    80.97
LR5     −0.06    −0.59    90.11
LR6     −0.07    0.56     86.60
LR7     −15.39   −2.64    482.93
LR8     −0.38    −0.02    1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design, without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 * k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N * (2 * k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
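The counts above can be reproduced with a short sketch of the effort formulas given in this section. The function and parameter names are illustrative; the only inputs are N, k, and the number of fully combinatorial LRs (for which no clock gated design needs to be synthesized).

```python
def exhaustive_effort(n_functionalities, n_regions, n_combinatorial=0):
    """Syntheses and simulations needed without the proposed flow.

    One power gated design per LR, one clock gated design per LR that is not
    fully combinatorial, plus the baseline; each design is simulated once per
    functionality.
    """
    syntheses = 2 * n_regions + 1 - n_combinatorial
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """Syntheses and simulations needed by the proposed flow (baseline only)."""
    return 1, n_functionalities


if __name__ == "__main__":
    for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom", 7, 13, 1)):
        print(name, proposed_effort(n), exhaustive_effort(n, k, comb))
```

For both use cases the exhaustive exploration grows with N * (2 * k + 1), while the proposed flow stays at one synthesis and N simulations.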

5 Conclusions

In this paper we addressed the problem of power manage-ment in coarse-grained reconfigurable (CGR) systems Suchsystems are as suitable to accelerate multifunctional appli-cations as potentially energy inefficient In fact on a CGRsubstrate while a particular task is executed the resources notinvolved in the computation may potentially waste preciouspower if not properlymanagedOn top of that these architec-tures are also characterized by an intrinsic design difficultymapping several applications on the same substrate cus-tomizing the datapath is not so straightforward and requiresa deep knowledge of the target applications Dataflowmodelsof computation turned out to be very efficient for the develop-ment of automated mapping problems In our studies wehave exploited a dataflow-based approach to define a com-plete design suite for multifunctional CGR systems theMulti-Dataflow Composer (MDC) tool Besides automati-cally managing dataflow to hardware systems compositionMDC also supports the automated characterization of powerand clock gated platforms

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-fit-to-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her Ph.D. scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

Table 2: Contributions of static and internal power consumption, extracted from the reference technology library or characterized by synthesis trials.

Parameters    lkg power [nW]    int power [nW]
Enab_ON       8451              1351
Enab_OFF      7654              1320
Contr_ON      9544              1449
Contr_OFF     8863              1488
CG_ON         577               169
CG_OFF        471               292
ISO_ON        427               27
ISO_OFF       139               0
P(RC)         1715              38325

PG_set is empty
CG_set is empty
foreach LR_i in set_LRs do
    evaluate_area(LR_i, area_th)
end

function evaluate_area(LR_i, area_th)
    calculate LR_i area
    if area_LR > area_th then
        evaluate_PG(LR_i)
    else
        evaluate_CG(LR_i)
    end

function evaluate_PG(LR_i)
    estimate PG total overhead
    if PG total overhead < 0 then
        estimate CG total overhead
        if PG total overhead < CG total overhead then
            add LR_i to PG_set
        else
            add LR_i to CG_set
        end
    else
        evaluate_CG(LR_i)
    end

function evaluate_CG(LR_i)
    estimate CG total overhead
    if CG total overhead < 0 then
        add LR_i to CG_set
    end

Algorithm 1: Automatic power saving strategy selection for CGR systems.

strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-fit-to-all switching-off technique for all of them.

This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.

(1) Area Thresholding (See evaluate_area Function in Algorithm 1). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always-on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so power gating is not considered for implementation. Clock gating, instead, may still be beneficial, owing to the very small amount of additional logic it requires.

(2) Power Gating Overhead Estimation (See evaluate_PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3): if there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR may benefit from power gating (despite its larger overhead) or from clock gating.

(3) Clock Gating Overhead Estimation (See evaluate_CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always-on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.

The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG_set in Algorithm 1), which ones should be clock gated (see CG_set in Algorithm 1), and which ones should be included in the always-on domain.
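The selection logic of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the MDC implementation: the `LogicRegion` class and the `classify` function are hypothetical names, and the power and clock gating overheads are assumed to be already available as percentages (in the real flow they come from the estimation models, Equations (1), (3), (4), and (5), which are not reproduced here). A negative overhead means an overall power saving.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    area_pct: float     # share of the total design area, in percent
    pg_overhead: float  # estimated total power variation with power gating (%)
    cg_overhead: float  # estimated total power variation with clock gating (%)


def classify(regions, area_th):
    """Split regions into (PG_set, CG_set, always_on) following Algorithm 1."""
    pg_set, cg_set, always_on = [], [], []

    def evaluate_cg(lr):
        # Clock gate only if it yields an overall saving.
        (cg_set if lr.cg_overhead < 0 else always_on).append(lr.name)

    def evaluate_pg(lr):
        if lr.pg_overhead < 0:
            # Power gating saves power: keep it only if it beats clock gating.
            if lr.pg_overhead < lr.cg_overhead:
                pg_set.append(lr.name)
            else:
                cg_set.append(lr.name)
        else:
            evaluate_cg(lr)

    for lr in regions:
        # Area thresholding: small regions skip the power gating evaluation.
        if lr.area_pct > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)
    return pg_set, cg_set, always_on


if __name__ == "__main__":
    example = [
        LogicRegion("big", 40.0, -30.0, -5.0),
        LogicRegion("tiny", 0.5, 0.1, 0.2),
    ]
    print(classify(example, area_th=5.0))
```

Regions that end up in neither set are returned explicitly as the always-on list, which mirrors the third class mentioned in the text.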

Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR class in Figure 3) generated by the Power Analysis block is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating power management support for the given CGR design.

Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique. On the basis of the proposed power estimation models, and by linking MDC with a logic synthesis tool, the presented flow is capable of overcoming the limit of providing a one-fit-to-all solution. Each LR in a CGR design is supported (where necessary) with the optimal power management technique.

Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

Logic region    Kernel    T_ON    Actors           iso    reg    rtn
LR1             α         0.1     B                32     514    24
LR3             β         0.6     D, E             32     8      8
LR4             α, γ      0.4     A, SB_0, SB_1    96     256    64
LR5             γ         0.3     F, G             32     265    192

         Power [nW]
Actor    lkg seq    int seq    lkg comb    int comb    reg    rtn
B        801        104987     121411      3916599     512    24
D        48         1104       51          319         4      4
E        56         1437       53          198         4      4
A        3264       89238      0           0           256    64
SB_0     0          0          307         409         -      -
SB_1     0          0          225         350         -      -
F        1232       22489      213         537         128    128
G        1385       44068      273         363         128    64

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

Logic region    lkg PG [nW]    int PG [nW]    lkg CG [nW]    int CG [nW]
LR1             124069         40435871       12229415       392870060
LR3             34267          388444         29467          3599
LR4             206211         3870508        388086         55986
LR5             150958         3071272        318696         224518

3.2.7. Step-by-Step Example. In this section a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique is blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area (area_th) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: area_LR1 = 52% of total area.
    (b) area_LR1 > area_th, so that a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated, respectively, by applying (1) and (3). The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from the power consumption of the LR when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus, LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: area_LR3 = 0.4% of total area.
    (b) area_LR3 < area_th, so that a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG_over = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: area_LR4 = 7% of total area.
    (b) area_LR4 > area_th, so that a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.

        (1) The static and the dynamic overheads are estimated, respectively, by applying (1) and (3). The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: area_LR5 = 15% of total area.
    (b) area_LR5 > area_th, so that a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated, respectively, by applying (1) and (3). The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to CG_set.
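The percentages quoted in the steps above can be cross-checked with a short calculation. The Python sketch below is only an illustration under stated assumptions: the power figures are the per-region values reported with Figure 4 (base, power gated, and clock gated consumption, plus the base design total of 4311417 nW), taken as they appear there, and `overhead_pct` is a hypothetical helper name. Only ratios matter, so any common scale factor on the power figures would not change the result.

```python
BASE_TOTAL = 4311417  # total consumption of the baseline design (nW), from Figure 4

# Per-region consumption (base, power gated, clock gated), as listed with Figure 4.
REGIONS = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}


def overhead_pct(gated, base, total=BASE_TOTAL):
    """Percentage variation of total power when one region is gated.

    Negative values mean that the technique saves power overall.
    """
    return 100.0 * (gated - base) / total


if __name__ == "__main__":
    for name, (base, pg, cg) in REGIONS.items():
        print(f"{name}: PG {overhead_pct(pg, base):+.2f}%  "
              f"CG {overhead_pct(cg, base):+.2f}%")
```

The computed variations agree with the values used in the walkthrough up to rounding, and the sign convention (gated minus baseline) is the one that makes a saving negative.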

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure). The SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application of clock gating. All the remaining logic, which includes region LR3, is always on.

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. Base design consumption: 4311417 nW; area threshold area_th = 5%. The figure reports the following data and ends with PG_set = {LR1} and CG_set = {LR4, LR5}.

LR     Area [%]    Base [nW]    PG [nW] (overhead)    CG [nW] (overhead)
LR1    52          4143798      416765 (−86.45%)      4050955 (−2.15%)
LR3    0.4         3266         4227 (+0.02%)         3894 (+0.01%)
LR4    7           93793        40767 (−1.2%)         9479 (−2.0%)
LR5    15          70560        32222 (−0.89%)        25639 (−1.04%)

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario. An image coprocessing unit, accelerating a zoom application, has been implemented both on a 90 nm and on a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N - 1. \tag{9}$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a given size $N$ DFT by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size $r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce its outputs.
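To make the butterfly structure concrete, the following Python sketch implements a software radix-2 decimation-in-time FFT built only from the butterfly of (10) and the twiddle factors $w_n^k = e^{-i 2\pi k / n}$, and checks it against the direct DFT of (9). It is a behavioural illustration, not the hardware dataflow model used in the paper; the function names and the butterfly counter are assumptions introduced here.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i * 2 * pi * k / n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + x1 * w, x0 - x1 * w)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x, counter=None):
    """Recursive decimation-in-time FFT; len(x) must be a power of two.

    If a one-element list is passed as counter, it accumulates the
    number of butterflies executed.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], counter)
    odd = fft_radix2(x[1::2], counter)
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
        if counter is not None:
            counter[0] += 1
    return out


def dft(x):
    """Direct evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [sum(x[m] * twiddle(k * m, n) for m in range(n)) for k in range(n)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    count = [0]
    fast = fft_radix2(samples, count)
    slow = dft(samples)
    print("butterflies used:", count[0])
    print("max error vs DFT:", max(abs(a - b) for a, b in zip(fast, slow)))
```

For a size 8 input the recursion performs four butterflies per stage over three stages, which is the same count as the baseline hardware design.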

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size 8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (horizontal axis: configuration and latency in clock cycles, 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; vertical axis: base power in μW, from 3000 to 5000).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be given up in order to consume less power.

MDC identifies 8 different LRs in the CGR system TheLRs activated by each FFT variant are listed in Table 5 whiletheir characteristics are reported in Table 6 In this tablegiven any LR its activation time (119879ON) has been obtainedsumming up the activation times of the FFT configurationsactivating the same region (provided in Table 5) For exampleLR2 is activated by 1b 2b and 4b Its 119879ON is 067 which is thesum of 119879ON(1b) = 042 119879ON(2b) = 021 and 119879ON(4b) = 004413 PowerModelling andHybrid PowerManagement Assess-ment As explained in Section 32 the proposed flow requires


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   $T_{ON}$   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42       (2, 6, 8)      12.86      1750407       1307259
2b                  0.21       (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04       (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33       (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

           LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
$T_{ON}$   0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)     1024   1112   512    2324   1948   1756   256    5259
rtn        1024   518    256    0      0      0      128    512
reg        1024   518    256    0      0      0      128    512
iso(r)     1024   1032   512    2306   1926   1740   256    4934
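The $T_{ON}$ row of Table 6 can be reproduced from Table 5 by summing, for each LR, the activation times of the configurations that use it. The following is a minimal sketch of that bookkeeping; the dictionary and function names are illustrative and not taken from the MDC tool.

```python
# Activation time and activated LRs of each FFT configuration (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def lr_activation_times():
    """Sum, per LR, the activation times of the configurations using it."""
    t_on = {}
    for config, regions in CONFIG_LRS.items():
        for lr in regions:
            t_on[lr] = t_on.get(lr, 0.0) + CONFIG_T_ON[config]
    # Round to two decimals to hide floating-point noise.
    return {lr: round(value, 2) for lr, value in sorted(t_on.items())}


print(lr_activation_times())
```

For LR2 this gives 0.42 + 0.21 + 0.04 = 0.67, matching the example in the text.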

Figure 8: FFT use case (90 nm): area percentage per LR. LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. Sequential (seq) versus combinatorial (comb) composition: LR1 seq 0.91%, comb 99.09%; LR2 seq 58.92%, comb 41.07%; LR3 seq 0.90%, comb 99.09%; LR4, LR5, and LR6 seq 0%, comb 100%; LR8 seq 27.99%, comb 72.01%.

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design, it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area, and it is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques actually lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the power gating integration, for LR1 to LR8 (chart title: power consumption variation after power gating integration).

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected; the error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving (%) due to power gating (PG) and clock gating (CG).

LR     PG overhead            CG overhead
       Static    Internal     Static    Internal
LR1    -40.87    -9.47        0.00      -11.33
LR2    +0.23     -1.32        0.00      -2.85
LR3    -10.44    -2.11        0.00      -2.59
LR4    -0.08     0.10         -         -
LR5    0.08      0.13         -         -
LR6    0.11      0.17         -         -
LR7    -3.44     -0.39        0.00      -0.79
LR8    1.42      2.35         0.00      -0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

LR     PG overhead                Overhead error            Net
       Est       Real      Err    Iso      Rtn     Contr    Err
LR1    -25.22    -25.37    0.59   5.18     4.64    0.75     1.03
LR3    -6.28     -6.22     1.07   16.34    2.97    0.76     1.26
LR7    -1.92     -1.92     0.24   9.24     1.08    0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

LR     CG overhead               Overhead       Net
       Est      Real     Err     CGcell Err     Err
LR1    -5.73    -5.77    0.06    0.83           0.18
LR2    -1.42    -1.42    0.12    1.03           0.40
LR3    -1.34    -1.38    0.05    0.84           0.21
LR7    -0.39    -0.39    0.05    0.96           0.22
LR8    -0.12    -0.12    0.29    1.08           0.47

Figure 10: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8 (chart title: power consumption variation after clock gating integration).

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is also performed on LR7, which results in a negative overhead. Hence, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
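The two threshold cases above can be replayed with a short sketch. It is only an approximation of the selection logic as described in this section, not a transcription of Algorithm 1: the function name `select_strategy` and the data layout are assumptions, the area shares are read from Figure 8, and the estimated overheads come from Tables 8 and 9.

```python
# Area share (%) of each logic region, read from Figure 8.
AREA_PCT = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
            "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated power gating overhead (%), Table 8 (only these rows are reported).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
# Estimated clock gating overhead (%), Table 9. Fully combinatorial regions
# (LR4, LR5, LR6) have no entry because clock gating does not apply to them.
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_strategy(area_threshold_pct):
    """Split the regions into power gated, clock gated, and not assigned sets."""
    pg_set, cg_set, not_assigned = set(), set(), set()
    for lr, area in AREA_PCT.items():
        cg = CG_EST.get(lr)  # None when clock gating is not applicable
        # Power gating is only evaluated for regions above the area threshold.
        pg = PG_EST.get(lr) if area > area_threshold_pct else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            pg_set.add(lr)
        elif cg is not None and cg < 0:
            cg_set.add(lr)
        else:
            not_assigned.add(lr)
    return pg_set, cg_set, not_assigned


for threshold in (5.0, 10.0):
    pg_set, cg_set, na = select_strategy(threshold)
    print(threshold, sorted(pg_set), sorted(cg_set), sorted(na))
```

With a 5% threshold the sketch power gates LR1, LR3, and LR7 and clock gates LR2 and LR8; with 10%, LR7 moves to the clock gated set, as in the text.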

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.


Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power (µW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (-45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations when gated designs are adopted (x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power in µW, from 1000 to 5000; series: Base, CG_full, PG_full, DAT_5, DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only to the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overheads are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       actors   $T_{ON}$   Functionality                       LRs
abs          1        0.03       Absolute value calculation          (4)
chgb         7        0.33       Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06       Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09       Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06       Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01       Maximum/minimum finding             (11)
sbwlabel     17       0.42       Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                         PG set                            CG set                                                     NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13   LR1, LR2, LR3, LR8, LR10, LR12    LR4, LR6, LR7, LR9, LR11, LR13                             LR5
DAT_10   LR1, LR8                                    LR1, LR8                          LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13       LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As can be seen in the reported histograms, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR      PG saving                    Net     CG saving                    Net
        Est       Real      Err      Err     Est       Real      Err      Err
LR1     -6.322    -6.331    0.15     0.18    -6.307    -6.313    0.09     0.17
LR2     -23.464   -23.463   0.00     1.26    -23.369   -23.373   0.02     1.67
LR3     -6.537    -6.544    0.10     1.65    -6.507    -6.514    0.10     1.59
LR4     -         -         -        -       -0.958    -0.964    0.68     1.34
LR5     -         -         -        -       -         -         -        -
LR6     -         -         -        -       -0.870    -0.878    0.81     1.22
LR7     -         -         -        -       -1.033    -1.039    0.63     0.15
LR8     -7.062    -7.058    0.05     2.02    -6.987    -6.986    0.01     1.83
LR9     -         -         -        -       -2.436    -2.443    0.27     1.55
LR10    -5.498    -5.503    0.10     1.65    -5.474    -5.480    0.11     1.57
LR11    -         -         -        -       -3.329    -3.285    1.32     3.67
LR12    -4.278    -4.288    0.23     1.83    -4.267    -4.273    0.15     1.86
LR13    -         -         -        -       -0.649    -0.656    1.01     1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of leakage (lkg), internal (int), and total (tot) power (µW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

base. That is due to the 12 (one for each region but LR5) clock gating cells introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (-62.38% for DAT_10 versus -62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of leakage (lkg), internal (int), and total (tot) power (µW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. With the area threshold set to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial; still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \cdot switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                                                  PG set                                                   CG set                                                 NA
DAT_1    LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13       LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12      LR6, LR13                                              LR5
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13                            LR1, LR2, LR3, LR8, LR10, LR12                           LR4, LR6, LR7, LR9, LR11, LR13                         LR5
DAT_10   LR1, LR8                                                             LR1, LR8                                                 LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13   LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR      PG saving                    Net     CG saving                    Net
        Est       Real      Err      Err     Est       Real      Err      Err
LR1     -5.686    -5.699    0.23     0.47    -5.507    -5.501    0.13     0.53
LR2     -22.990   -22.982   0.03     1.02    -22.063   -22.029   0.15     0.81
LR3     -6.569    -6.566    0.04     1.13    -6.273    -6.262    0.18     0.91
LR4     -0.946    -0.944    0.15     1.82    -0.919    -0.917    0.22     1.52
LR5     -         -         -        -       -         -         -        -
LR6     -0.747    -0.748    0.25     0.25    -0.853    -0.833    0.26     1.59
LR7     -0.955    -0.957    0.17     1.36    -0.935    -0.933    0.21     1.49
LR8     -7.464    -7.460    0.06     1.36    -6.724    -6.714    0.16     0.66
LR9     -2.475    -2.482    0.25     1.10    -2.350    -2.345    0.18     1.09
LR10    -5.473    -5.470    0.04     1.09    -5.270    -5.261    0.17     0.91
LR11    -3.422    -3.361    1.79     2.58    -3.257    -3.205    1.63     2.04
LR12    -4.327    -4.330    0.08     1.19    -4.176    -4.169    0.17     0.98
LR13    -0.587    -0.580    1.25     3.70    -0.623    -0.621    0.28     1.85

phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. Hence, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
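The two percentages quoted above follow from dividing the 4-cycle wake-up time of Section 4.3 by the kernel execution length. A minimal sketch of this arithmetic is given below; the function name is illustrative, and the ratio is taken with respect to the kernel cycles only, which is the reading consistent with the 17% and 2.5% figures.

```python
WAKE_UP_CYCLES = 4  # clock cycles required by the switch on/off protocol


def wake_up_overhead_pct(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


# Zoom coprocessor, fastest kernels: 24 cycles.
print(round(wake_up_overhead_pct(24), 1))
# HEVC interpolation filter on an 8 x 8 block: 160 cycles.
print(wake_up_overhead_pct(160))
```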

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed using the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.
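To make the size of these errors concrete, the sketch below recomputes a relative error from the Est and Real values of three rows of Table 15. The error definition used here, |Est - Real| / |Real|, is an assumption: the section does not state the formula, but this definition yields values close to the reported Err column. Small differences are expected because the printed Est and Real values are rounded.

```python
def relative_error_pct(est, real):
    """Relative estimation error (%), assumed to be |est - real| / |real|."""
    if real == 0:
        raise ValueError("real overhead must be non-zero")
    return 100.0 * abs(est - real) / abs(real)


# (Est, Real) power gating overhead for three rows of Table 15.
ROWS = {"LR1": (-30.54, -26.10), "LR3": (-21.83, -6.95), "LR7": (-15.39, -2.64)}

for lr, (est, real) in ROWS.items():
    print(lr, round(relative_error_pct(est, real), 2))
```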

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR     PG overhead
       Est       Real      Err
LR1    -30.54    -26.10    17.07
LR2    -0.19     -1.99     90.22
LR3    -21.83    -6.95     214.30
LR4    -0.12     -0.67     80.97
LR5    -0.06     -0.59     90.11
LR6    -0.07     0.56      86.60
LR7    -15.39    -2.64     482.93
LR8    -0.38     -0.02     1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2 \cdot k + 1$ syntheses


($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N \cdot (2 \cdot k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
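The synthesis and simulation counts above can be checked with a few lines of code. This is a minimal sketch of the arithmetic described in this section; the function names are introduced here for illustration only.

```python
def exhaustive_effort(n_functionalities, k_regions, fully_combinatorial=0):
    """Syntheses and simulations needed without the proposed flow.

    One baseline synthesis, one power gated design per LR, and one clock
    gated design per LR that is not fully combinatorial. Every synthesized
    design is then simulated once per functionality.
    """
    syntheses = 1 + k_regions + (k_regions - fully_combinatorial)
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """The proposed flow needs only the baseline synthesis and its simulations."""
    return 1, n_functionalities


print(exhaustive_effort(4, 8, 3), proposed_effort(4))    # FFT use case
print(exhaustive_effort(7, 13, 1), proposed_effort(7))   # zoom coprocessor
```

With no fully combinatorial regions the first function reduces to the general $2 \cdot k + 1$ syntheses and $N \cdot (2 \cdot k + 1)$ simulations.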

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications but are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate while customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps further both in theMPEG-RVC field which the MDC tool belongs to and inthe definition of optimal power management strategies forCGR designs In this paper we have presented an automatedmethodology capable of estimating prior to any physicalimplementation the effectiveness and costs that power gatingor clock gating would have when implemented on top ofthe functional logic regions constituting a CGR systemThis methodology is based on static and dynamic powerestimation models that in a separate manner for each logicregion in the CGR design are capable of determining theoverhead of clock gating and power gating on the basis of thefunctional technological and architectural parameters of thebaseline CGR system These models and the correspondingestimation algorithm are applicable in any CGR scenario andare currently integrated in the MDC tool improving its basicfunctionality In fact MDC was normally applying a one-fit-to-all user-specified power reduction technique either clock

or power gating without any warranty of its effectiveness onthe different identified regions

By considering two different scenarios and adoptingdifferent ASIC technologies our assessments proved that theenhanced MDC flow is capable of guiding the designerstowards the definition of the optimal power managementsupport It is more efficient than the previous blindly appliedmethodology and the proposed models turned out to beextremely accurate Finally as demonstrated in Section 44the new flow drastically reduces the number of designs to besynthesized and simulated leading to saving both designereffort and computational time

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her Ph.D. scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Table 3: Parameter and power consumption of each LR extracted by the synthesis reports of the baseline CGR platform.

Logic region   Kernel   T_ON   Actors          iso   reg   rtn
LR1            α        0.1    B               32    514   24
LR3            β        0.6    D, E            32    8     8
LR4            α, γ     0.4    A, SB_0, SB_1   96    256   64
LR5            γ        0.3    F, G            32    265   192

Actor   Power [nW]                                   reg   rtn
        lkg_seq   int_seq   lkg_comb   int_comb
B       801       104987    121411     3916599      512   24
D       48        1104      51         319          4     4
E       56        1437      53         198          4     4
A       3264      89238     0          0            256   64
SB_0    0         0         307        409          -     -
SB_1    0         0         225        350          -     -
F       1232      22489     213        537          128   128
G       1385      44068     273        363          128   64

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.

Logic region   lkg PG [nW]   int PG [nW]   lkg CG [nW]   int CG [nW]
LR1            124069        40435871      12229415      392870060
LR3            34267         388444        29467         3599
LR4            206211        3870508       388086        55986
LR5            150958        3071272       318696        224518

3.2.7. Step-by-Step Example. In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique is blindly applied to all of them except LR2. This region is used by all the input DPNs; thus it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area ($area_{th}$) is set to 5%.

(i) LR1 is processed by invoking evaluate_area(LR1, 5).
    (a) Its area is calculated: $area_{LR1}$ = 52% of total area.
    (b) $area_{LR1} > area_{th}$, so a prospective power gating implementation on LR1 is taken into consideration by invoking evaluate_PG(LR1).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is calculated by subtracting the power consumption of the LR in the baseline design from its power consumption when PG is applied; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%.
        (3) Power gating is more beneficial than clock gating, determining overall a larger power saving. Thus LR1 is added to PG_set.

(ii) LR3 is processed by invoking evaluate_area(LR3, 5).
    (a) Its area is calculated: $area_{LR3}$ = 0.4% of total area.
    (b) $area_{LR3} < area_{th}$, so a prospective clock gating implementation on LR3 is considered straight away by invoking evaluate_CG(LR3).
        (1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: $CG_{over}$ = +0.01%.
        (2) Clock gating is not beneficial, since its overhead is positive. Thus LR3 is discarded and no power management policy will be applied to it.

(iii) LR4 is processed by invoking evaluate_area(LR4, 5).
    (a) Its area is calculated: $area_{LR4}$ = 7% of total area.
    (b) $area_{LR4} > area_{th}$, so a prospective power gating implementation on LR4 is taken into consideration by invoking evaluate_PG(LR4).

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gated support.

        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus LR4 is added to CG_set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5).
    (a) Its area is calculated: $area_{LR5}$ = 15% of total area.
    (b) $area_{LR5} > area_{th}$, so a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).
        (1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.
        (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.
        (3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus LR5 is added to CG_set.
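The decision flow walked through above can be sketched in Python as follows. This is a minimal illustration, not the MDC implementation: the `Region` type and function names are hypothetical, the per-region power values are those annotated in Figure 4, and falling back to clock gating when power gating is not retained is an assumption, since the example does not exercise that branch.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    area_pct: float  # share of the total area, in percent
    base_nw: float   # consumption in the baseline design
    pg_nw: float     # estimated consumption with power gating
    cg_nw: float     # estimated consumption with clock gating


def overhead_pct(gated_nw, base_nw, total_base_nw):
    """Power variation relative to the whole baseline design (negative = saving)."""
    return 100.0 * (gated_nw - base_nw) / total_base_nw


def classify(regions, total_base_nw, area_th):
    pg_set, cg_set, discarded = [], [], []
    for r in regions:
        cg = overhead_pct(r.cg_nw, r.base_nw, total_base_nw)
        if r.area_pct > area_th:
            pg = overhead_pct(r.pg_nw, r.base_nw, total_base_nw)
            if pg < 0 and pg < cg:
                pg_set.append(r.name)
                continue
        # Assumption: regions not retained for power gating fall back to clock gating.
        if cg < 0:
            cg_set.append(r.name)
        else:
            discarded.append(r.name)
    return pg_set, cg_set, discarded


TOTAL_BASE_NW = 4311417  # base design consumption reported in Figure 4
REGIONS = [
    Region("LR1", 52.0, 4143798, 416765, 4050955),
    Region("LR3", 0.4, 3266, 4227, 3894),
    Region("LR4", 7.0, 93793, 40767, 9479),
    Region("LR5", 15.0, 70560, 32222, 25639),
]

print(classify(REGIONS, TOTAL_BASE_NW, area_th=5.0))
```

Under these assumptions the sketch reproduces the outcome of the example: LR1 ends up in the power gated set, LR4 and LR5 in the clock gated set, and LR3 is left without power management.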

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure); the SBoxes SB_0 and SB_1 included in region LR4 are purely combinatorial, so they are not affected by the application of the clock gating. All the remaining logic, which includes region LR3, is always on.

Threshold: Area_th = 5%. Base design consumption: 4311417 nW.

LR    Area   Base (nW)   PG (nW) (over)       CG (nW) (over)
LR1   52%    4143798     416765 (−86.45%)     4050955 (−2.15%)
LR3   0.4%   3266        4227 (+0.02%)        3894 (+0.01%)
LR4   7%     93793       40767 (−1.2%)        9479 (−2.0%)
LR5   15%    70560       32222 (−0.89%)       25639 (−1.04%)

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1.

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate in detail the proposed flow. The second one presents a more complex scenario. An image coprocessing unit, accelerating a zoom application, has been implemented both on a 90 nm and on a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2 \pi k (n/N)}, \quad k = 0, 1, \ldots, N - 1. \qquad (9)$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, from their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \qquad (10)$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs; $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2 \pi (k/n)}, \qquad (11)$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).
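To make the relation between (9), (10), and (11) concrete, the following Python sketch compares a direct DFT with a recursive radix-2 decimation-in-time FFT built only from butterflies. It is an illustrative software model, not the hardware design of the use case; the function names are hypothetical, and the twiddle factor is assumed to be the standard $w_n^k = e^{-i 2 \pi k / n}$.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT, as in (9)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def twiddle(k, n):
    """Twiddle factor w_n^k, assumed to be exp(-i*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly, as in (10)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        out[k], out[k + N // 2] = butterfly(even[k], odd[k], twiddle(k, N))
    return out


def butterfly_count(N):
    """Butterflies of a radix-2 FFT: N/2 per stage, log2(N) stages."""
    stages = N.bit_length() - 1
    return (N // 2) * stages


samples = [1, 2, 3, 4, 0, 0, 0, 0]
max_diff = max(abs(a - b) for a, b in zip(fft_radix2(samples), dft(samples)))
print(max_diff < 1e-9, butterfly_count(8))
```

For a transform of size 8 the count of butterflies given by `butterfly_count` matches the 12 butterflies of the design depicted in Figure 6.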

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce its outputs.

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In such a way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (horizontal axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; vertical axis: base power (μW), from 3000 to 5000).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to requests from the external environment. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be waived to consume less power.
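As an illustration of this kind of run-time trade-off, the sketch below selects the fastest FFT configuration that fits a given power budget. The selection policy, the function name, and the notion of a numeric power budget are hypothetical and are not part of the presented flow; latencies are those listed for the four configurations, and power is taken as the sum of the static and internal contributions as transcribed from Table 5.

```python
# (latency in clock cycles, static + internal power in nW) per configuration
CONFIGS = {
    "1b": (24, 1750407 + 1307259),
    "2b": (12, 1752106 + 1539855),
    "4b": (6, 1757485 + 2020953),
    "12b": (3, 1776056 + 2411257),
}


def pick_configuration(power_budget_nw):
    """Return the lowest-latency configuration within the budget, or None."""
    feasible = [(latency, name)
                for name, (latency, power) in CONFIGS.items()
                if power <= power_budget_nw]
    if not feasible:
        return None
    return min(feasible)[1]


for budget in (5_000_000, 3_500_000, 3_000_000):
    print(budget, pick_configuration(budget))
```

A controller following such a policy would move from 12b towards 1b as the available budget shrinks, trading throughput for power as described above.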

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating the same region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design, it is possible to retrieve area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given in terms of percentage with respect to the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area. It is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8, it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 8: FFT use case: area percentage per LR (FFT, 90 nm). Area shares: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. Logic composition: LR1 seq 0.91%, comb 99.09%; LR2 seq 58.92%, comb 41.07%; LR3 seq 0.90%, comb 99.09%; LR4, LR5, and LR6 seq 0%, comb 100%; LR8 seq 27.99%, comb 72.01%.
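As a cross-check of the activation times reported in Table 6, the $T_{ON}$ of each LR can be recomputed from the per-configuration data of Table 5. The Python sketch below does so; the variable and function names are hypothetical, and the values are transcribed from Table 5 with the decimal separators restored (an assumption consistent with the worked LR2 example in the text).

```python
# Activation time of each FFT configuration and the LRs it activates (Table 5).
T_ON_CONFIG = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
ACTIVE_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def region_activation_times():
    """Sum, for each LR, the activation times of the configurations using it."""
    t_on = {lr: 0.0 for lr in range(1, 9)}
    for config, regions in ACTIVE_LRS.items():
        for lr in regions:
            t_on[lr] += T_ON_CONFIG[config]
    return {f"LR{lr}": round(value, 2) for lr, value in t_on.items()}


print(region_activation_times())
```

The resulting values coincide with the $T_{ON}$ row of Table 6, including the 0.67 obtained for LR2.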

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration (estimated versus real values for LR1 to LR8; vertical axis: power variation (%)).

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained byapplying (1) and (3) for a prospective power gating imple-mentation and (4) and (5) considering a possible clockgating implementation are shown in Table 7 Please notice

18 Journal of Electrical and Computer Engineering

Table 7 FFT use case detailed static and dynamic saving due to power (PG) and clock gating (CG)

LR PG overhead CG overheadStatic Internal Static Internal

LR1 minus4087 minus947 000 minus1133LR2 +023 minus132 000 minus285LR3 minus1044 minus211 000 minus259LR4 minus008 010 mdash mdashLR5 008 013 mdash mdashLR6 011 017LR7 minus344 minus039 000 minus079LR8 142 235 000 minus025

Table 8 FFT use case power gating overhead estimation step accuracy

LR PG overhead Overhead error NetEst Real Err Iso Rtn Contr Err

LR1 minus2522 minus2537 059 518 464 075 103LR3 minus628 minus622 107 1634 297 076 126LR7 minus192 minus192 024 924 108 083 136

Table 9 FFT use case clock gating overhead estimation step accu-racy

LR CG overhead Overhead NetEst Real Err CGcell Err Err

LR1 minus573 minus577 006 083 018LR2 minus142 minus142 012 103 040LR3 minus134 minus138 005 084 021LR7 minus039 minus039 005 096 022LR8 minus012 minus012 029 108 047

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart "Power consumption variation after clock gating integration"; y-axis: power variation (%); x-axis: LR1, LR2, LR3, LR7, LR8; series: Est, Real.]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as occurred for DAT_5, they both achieve power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Then, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
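The two-stage selection described in these two cases can be sketched in Python as follows. This is only a minimal illustration, not the authors' Algorithm 1 (which is defined in Section 3.2.6): the `Region` record, the function name `select_strategy`, and all numeric values are hypothetical. The per-region overheads are assumed to have been estimated beforehand, with a negative value meaning a saving and `None` marking the clock gating overhead of a fully combinatorial region.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    area_pct: float               # share of the total design area, in percent
    pg_overhead: float            # estimated power gating overhead (negative = saving)
    cg_overhead: Optional[float]  # estimated clock gating overhead; None if fully combinatorial


def select_strategy(regions, area_threshold_pct):
    """Return {region name: "PG" | "CG" | "NA"} for the given area threshold."""
    choice = {}
    for r in regions:
        cg = r.cg_overhead
        above_threshold = r.area_pct > area_threshold_pct
        pg_beats_cg = cg is None or r.pg_overhead < cg
        # Stage 1: power gating is considered only above the area threshold,
        # and only if it saves power and saves more than clock gating would.
        if above_threshold and r.pg_overhead < 0 and pg_beats_cg:
            choice[r.name] = "PG"
        # Stage 2: every other region is clock gated if that yields a saving.
        elif cg is not None and cg < 0:
            choice[r.name] = "CG"
        else:
            choice[r.name] = "NA"  # left in the always-on domain
    return choice


# Hypothetical regions (made-up numbers, not taken from the tables).
regions = [
    Region("A", area_pct=60.0, pg_overhead=-25.0, cg_overhead=-6.0),
    Region("B", area_pct=7.0, pg_overhead=-2.0, cg_overhead=-0.4),
    Region("C", area_pct=3.0, pg_overhead=-1.0, cg_overhead=-1.4),
    Region("D", area_pct=2.0, pg_overhead=0.2, cg_overhead=None),
]

print(select_strategy(regions, 5.0))
print(select_strategy(regions, 10.0))
```

In this toy data set, raising the threshold from 5% to 10% moves region B from the power gated set to the clock gated one, mirroring what happens to LR7 between DAT_5 and DAT_10.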

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1.


Figure 11: FFT use case: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart, FFT at 90 nm; bar groups: Static, Internal, Total; y-axis: power; legend: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).]

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. The CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than those of DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Line chart; x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power; series: Base, CG_full, PG_full, DAT_5, DAT_10.]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.
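As a rough aid to reading these percentages, the snippet below computes a relative error between an estimated and a measured overhead. It assumes the error is defined as |Est − Real| / |Real|, which this section does not state explicitly, and the helper name `relative_error_pct` is purely illustrative. Because the tables list rounded values, recomputing from them need not reproduce every printed error exactly.

```python
def relative_error_pct(estimated: float, measured: float) -> float:
    """Relative estimation error, as a percentage of the measured value."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(estimated - measured) / abs(measured) * 100.0


# LR1 row of Table 8: estimated versus real power gating overhead.
err = relative_error_pct(-25.22, -25.37)
print(f"{err:.2f}")  # 0.59
```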

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels' distinctive features.

Kernel        Actors   T_ON   Functionality                       LRs
abs           1        0.03   Absolute value calculation          (4)
chgb          7        0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic         10       0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv    6        0.09   Cubic filter convolution            (5, 7, 8, 9)
median        9        0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max       1        0.01   Maximum/minimum finding             (11)
sbwlabel      17       0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

DAT_5
  >Th:     LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

DAT_10
  >Th:     LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to the given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, there are on average more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. In addition, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR      PG saving                     Net     CG saving                     Net
        Est        Real       Err     Err     Est        Real       Err     Err
LR1     −6.322     −6.331     0.15    0.18    −6.307     −6.313     0.09    0.17
LR2     −23.464    −23.463    0.00    1.26    −23.369    −23.373    0.02    1.67
LR3     −6.537     −6.544     0.10    1.65    −6.507     −6.514     0.10    1.59
LR4     —          —          —       —       −0.958     −0.964     0.68    1.34
LR5     —          —          —       —       —          —          —       —
LR6     —          —          —       —       −0.870     −0.878     0.81    1.22
LR7     —          —          —       —       −1.033     −1.039     0.63    0.15
LR8     −7.062     −7.058     0.05    2.02    −6.987     −6.986     0.01    1.83
LR9     —          —          —       —       −2.436     −2.443     0.27    1.55
LR10    −5.498     −5.503     0.10    1.65    −5.474     −5.480     0.11    1.57
LR11    —          —          —       —       −3.329     −3.285     1.32    3.67
LR12    −4.278     −4.288     0.23    1.83    −4.267     −4.273     0.15    1.86
LR13    —          —          —       —       −0.649     −0.656     1.01    1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart; bar groups: lkg, int, tot; y-axis: power; legend: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).]

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we assessed the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart; bar groups: lkg, int, tot; y-axis: power; legend: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. With the area threshold set to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction, when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and it is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) ∗ switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch off N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

DAT_1
  >Th:     LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
  PG set:  LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
  CG set:  LR6, LR13
  NA:      LR5

DAT_5
  >Th:     LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

DAT_10
  >Th:     LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR      PG saving                     Net     CG saving                     Net
        Est        Real       Err     Err     Est        Real       Err     Err
LR1     −5.686     −5.699     0.23    0.47    −5.507     −5.501     0.13    0.53
LR2     −22.990    −22.982    0.03    1.02    −22.063    −22.029    0.15    0.81
LR3     −6.569     −6.566     0.04    1.13    −6.273     −6.262     0.18    0.91
LR4     −0.946     −0.944     0.15    1.82    −0.919     −0.917     0.22    1.52
LR5     —          —          —       —       —          —          —       —
LR6     −0.747     −0.748     0.25    0.25    −0.853     −0.833     0.26    1.59
LR7     −0.955     −0.957     0.17    1.36    −0.935     −0.933     0.21    1.49
LR8     −7.464     −7.460     0.06    1.36    −6.724     −6.714     0.16    0.66
LR9     −2.475     −2.482     0.25    1.10    −2.350     −2.345     0.18    1.09
LR10    −5.473     −5.470     0.04    1.09    −5.270     −5.261     0.17    0.91
LR11    −3.422     −3.361     1.79    2.58    −3.257     −3.205     1.63    2.04
LR12    −4.327     −4.330     0.08    1.19    −4.176     −4.169     0.17    0.98
LR13    −0.587     −0.580     1.25    3.70    −0.623     −0.621     0.28    1.85

phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. Thus, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
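The two wake-up overheads quoted in this paragraph follow from dividing the 4-cycle wake-up protocol by the kernel execution length. The short check below assumes exactly that ratio (wake-up cycles over kernel cycles); the function and constant names are illustrative only.

```python
WAKE_UP_CYCLES = 4  # clock cycles required by the switch on/off protocol per transition


def wake_up_overhead_pct(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles * 100.0


print(round(wake_up_overhead_pct(24), 1))   # zoom coprocessor, fastest kernels: 16.7
print(round(wake_up_overhead_pct(160), 1))  # 8 x 8 block filtering: 2.5
```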

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR     PG overhead
       Est       Real      Err
LR1    −30.54    −26.10    17.07
LR2    −0.19     −1.99     90.22
LR3    −21.83    −6.95     214.30
LR4    −0.12     −0.67     80.97
LR5    −0.06     −0.59     90.11
LR6    −0.07     0.56      86.60
LR7    −15.39    −2.64     482.93
LR8    −0.38     −0.02     1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 ∗ k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N ∗ (2 ∗ k + 1) hardware simulations are necessary. The only simplification that may be done, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
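The counts quoted for the two use cases can be reproduced with a few lines of Python. The sketch below simply encodes the counting rule stated in this section (one power gated design per LR, one clock gated design per LR that is not fully combinatorial, plus the baseline); the function names are illustrative and not part of the MDC tool.

```python
def exhaustive_effort(n_functionalities: int, k_regions: int, n_combinatorial: int = 0):
    """Syntheses and simulations needed when every region is evaluated by hand.

    One power gated design per region, one clock gated design per
    non-combinatorial region, plus the baseline design; each synthesized
    design is then simulated once per functionality.
    """
    syntheses = k_regions + (k_regions - n_combinatorial) + 1
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities: int):
    """The proposed flow needs only the baseline synthesis and its N simulations."""
    return 1, n_functionalities


print(exhaustive_effort(4, 8, 3), proposed_effort(4))    # FFT: (14, 56) (1, 4)
print(exhaustive_effort(7, 13, 1), proposed_effort(7))   # zoom: (26, 182) (1, 7)
```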

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate, customizing the datapath, is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective in addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps forward both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

Figure 3: Enhanced MDC design suite: integration of the automated hybrid clock and power gating support. (Blocks shown: the MDC baseline flow, with DPNs parsing, datapath merging, and hardware platform generation; the hybrid clock and power gating extension, with logic regions identification, power analysis, and power saving application, producing the optimal HDL + CPF.)

(1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

(2) Equation (4) is calculated and summed up to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%.

(3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR4 is added to the CG set.

(iv) Finally, LR5 is processed by invoking evaluate_area(LR5, 5%).

(a) Its area is calculated: area_LR5 = 15% of the total area.

(b) area_LR5 > area_th, so that a prospective power gating implementation on LR5 is taken into consideration by invoking evaluate_PG(LR5).

(1) The static and the dynamic overheads are estimated by applying (1) and (3), respectively. The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one.

(2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%.

(3) Clock gating is more beneficial than power gating, determining overall a larger power saving. Thus, LR5 is added to the CG set.

The resulting hardware design, with the hybrid application of clock gating and power gating, is shown in Figure 5. Comparing this design with the two reported in Figure 2, we can notice that power gating is now applied only to region LR1 (called PD1 in the figure), while clock gating is applied to regions LR4 and LR5 (called CD4 and CD5 in the figure). The SBoxes SB_0 and SB_1, included in region LR4, are purely combinatorial, so they are not affected by the application of clock gating. All the remaining logic, which includes region LR3, is always on.

Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1. The example uses an area threshold Area_th = 5% and a base design consumption of 4311417 nW; the data reported in the figure for each region are as follows.

LR    Area (%)   Base (nW)   PG (nW) (overhead)    CG (nW) (overhead)
LR1   52         4143798     416765 (−86.45%)      4050955 (−2.15%)
LR3   0.4        3266        4227 (+0.02%)         3894 (+0.01%)
LR4   7          93793       40767 (−1.2%)         9479 (−2.0%)
LR5   15         70560       32222 (−0.89%)        25639 (−1.04%)

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.
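The overhead percentages reported in the step-by-step example of Figure 4 are consistent with reading each of them as the change in a region's consumption, once gated, divided by the base design consumption. The minimal Python sketch below checks this reading against the figure's numbers. The relation and the helper names (`overhead_pct`, `summarize`) are assumptions inferred from the reported values; they are not the model equations (1)-(5), which estimate these quantities before synthesis.

```python
# Values transcribed from Figure 4 (nW). The relation below is an assumption
# inferred from those numbers, not one of the paper's model equations.
BASE_DESIGN_NW = 4311417

# region: (base, power gated, clock gated) consumption in nW
REGIONS = {
    "LR1": (4143798, 416765, 4050955),
    "LR3": (3266, 4227, 3894),
    "LR4": (93793, 40767, 9479),
    "LR5": (70560, 32222, 25639),
}


def overhead_pct(base_nw, gated_nw, total_nw=BASE_DESIGN_NW):
    """Change in a region's consumption as a percentage of the whole design.

    Negative values mean that gating the region saves power.
    """
    return 100.0 * (gated_nw - base_nw) / total_nw


def summarize(regions=REGIONS):
    """Return {region: (PG overhead %, CG overhead %)}, rounded to two decimals."""
    return {
        name: (round(overhead_pct(base, pg), 2), round(overhead_pct(base, cg), 2))
        for name, (base, pg, cg) in regions.items()
    }


if __name__ == "__main__":
    for name, (pg, cg) in summarize().items():
        print(f"{name}: PG {pg:+.2f}%  CG {cg:+.2f}%")
```

Under this reading, LR1 is the only region where power gating saves more than clock gating, LR3 shows a positive overhead for both techniques, and LR4 and LR5 save slightly more when clock gated, which matches the sets selected in the example.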

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experimental results obtained to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \quad (9)$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. The small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \quad (10)$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \quad (11)$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).
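To make the butterfly structure concrete, the following minimal Python sketch implements a recursive radix-2 decimation-in-time FFT built on the butterfly of (10) and checks it against a direct evaluation of the DFT definition (9). It assumes the standard twiddle factor $w_n^k = e^{-i 2\pi k/n}$; the function names are illustrative, and the code is a software reference rather than a model of the pipelined hardware design.

```python
import cmath


def twiddle(k, n):
    """Standard twiddle factor w_n^k = exp(-i*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly, as in (10): returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # size n/2 DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # size n/2 DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


def dft(x):
    """Direct evaluation of the DFT definition (9), used as a reference."""
    n = len(x)
    return [sum(x[m] * twiddle(k * m, n) for m in range(n)) for k in range(n)]


def butterflies_needed(n):
    """Number of radix-2 butterflies for a size-n FFT: (n/2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)
```

For a size-8 transform, `butterflies_needed(8)` gives 12, that is, three stages of four butterflies each.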

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce the outputs.

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations. (x-axis: configuration and latency in clock cycles, that is, 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, and 12b - 3 CC; y-axis: power of the Base design in µW.)

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery-operated environment, when the battery level drops below a given threshold, some throughput can be given up to consume less power.
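As an illustration of this kind of run-time trade-off, the sketch below picks the fastest FFT configuration whose power fits a given budget. The latency and power figures are transcribed from the configuration list and from Table 5; summing only the static and internal contributions (neglecting net power) and the selection policy itself are assumptions made for the example, not part of the proposed flow.

```python
# (name, latency in clock cycles, static power in nW, internal power in nW),
# transcribed from the configuration list and Table 5.
CONFIGS = [
    ("1b", 24, 1750407, 1307259),
    ("2b", 12, 1752106, 1539855),
    ("4b", 6, 1757485, 2020953),
    ("12b", 3, 1776056, 2411257),
]


def power_uw(static_nw, internal_nw):
    """Static plus internal power in microwatts (net power is neglected)."""
    return (static_nw + internal_nw) / 1000.0


def pick_configuration(budget_uw, configs=CONFIGS):
    """Hypothetical policy: fastest configuration within the power budget.

    If no configuration fits, fall back to the lowest-power one.
    """
    feasible = [c for c in configs if power_uw(c[2], c[3]) <= budget_uw]
    if feasible:
        return min(feasible, key=lambda c: c[1])[0]
    return min(configs, key=lambda c: power_uw(c[2], c[3]))[0]
```

With these figures, a generous budget selects 12b, while tightening it moves the choice progressively towards 1b, trading throughput for power.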

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area (%)   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques actually lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 8: FFT use case (90 nm): area percentage per LR. Area shares: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. Sequential/combinatorial split: LR1 0.91%/99.09%, LR2 58.92%/41.07%, LR3 0.90%/99.09%, LR4, LR5, and LR6 0%/100%, LR8 27.99%/72.01%.
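The per-region activation times of Table 6 can be re-derived from the per-configuration data of Table 5 by accumulating, for every LR, the activation times of the configurations that use it. The short Python sketch below does so; the dictionary layout and the function name are illustrative choices, while the numbers are those of Table 5.

```python
# Activation time and activated logic regions per FFT configuration (Table 5).
CONFIG_T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
CONFIG_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def region_activation_times(t_on=CONFIG_T_ON, lrs=CONFIG_LRS):
    """Sum, for each LR, the T_ON of every configuration that activates it."""
    totals = {}
    for cfg, regions in lrs.items():
        for lr in regions:
            totals[lr] = totals.get(lr, 0.0) + t_on[cfg]
    # Round to two decimals to match the precision of the tables.
    return {lr: round(t, 2) for lr, t in sorted(totals.items())}
```

The result reproduces the $T_{ON}$ row of Table 6, for instance 0.67 for LR2 and 0.96 for LR8, which is active in every configuration except 4b.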

Figure 9: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected; the error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Table 7: FFT use case: detailed static and dynamic saving (%) due to power gating (PG) and clock gating (CG).

LR    PG overhead           CG overhead
      Static    Internal    Static   Internal
LR1   −40.87    −9.47       0.00     −11.33
LR2   +0.23     −1.32       0.00     −2.85
LR3   −10.44    −2.11       0.00     −2.59
LR4   −0.08     0.10        n/a      n/a
LR5   0.08      0.13        n/a      n/a
LR6   0.11      0.17        n/a      n/a
LR7   −3.44     −0.39       0.00     −0.79
LR8   1.42      2.35        0.00     −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy (all values in %).

LR    PG overhead               Overhead error          Net
      Est      Real     Err     Iso     Rtn    Contr    Err
LR1   −25.22   −25.37   0.59    5.18    4.64   0.75     1.03
LR3   −6.28    −6.22    1.07    16.34   2.97   0.76     1.26
LR7   −1.92    −1.92    0.24    9.24    1.08   0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy (all values in %).

LR    CG overhead              Overhead       Net
      Est     Real    Err      CGcell Err     Err
LR1   −5.73   −5.77   0.06     0.83           0.18
LR2   −1.42   −1.42   0.12     1.03           0.40
LR3   −1.34   −1.38   0.05     0.84           0.21
LR7   −0.39   −0.39   0.05     0.96           0.22
LR8   −0.12   −0.12   0.29     1.08           0.47

Figure 10: FFT use case: comparison between the estimated (Est) and real (Real) power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: Threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: Threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Hence, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
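The two outcomes above can be reproduced with a compact sketch of the selection logic. The Python code below is a simplified reading of Algorithm 1, not its actual listing: regions above the area threshold are candidates for power gating, which is chosen when its estimated overhead is negative and lower than the clock gating one; any other region is clock gated if that yields a negative overhead, and is left always on otherwise. Area shares are taken from Figure 8 and estimated overheads from Tables 8 and 9; the treatment of an above-threshold region with a non-negative power gating overhead is an assumption.

```python
# Area shares (%) from Figure 8; estimated total overheads (%) from Tables 8 and 9.
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
PG_OVERHEAD = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
# Fully combinatorial regions (LR4, LR5, LR6) have no clock gating estimate.
CG_OVERHEAD = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def classify(area_th, area=AREA, pg=PG_OVERHEAD, cg=CG_OVERHEAD):
    """Split regions into (PG set, CG set, always on) for a given area threshold.

    Simplified sketch: power gating is only evaluated above the threshold.
    """
    pg_set, cg_set, always_on = [], [], []
    for lr in sorted(area):
        pg_ov = pg.get(lr) if area[lr] > area_th else None
        cg_ov = cg.get(lr)
        pg_saves = pg_ov is not None and pg_ov < 0
        cg_saves = cg_ov is not None and cg_ov < 0
        if pg_saves and (not cg_saves or pg_ov < cg_ov):
            pg_set.append(lr)
        elif cg_saves:
            cg_set.append(lr)
        else:
            always_on.append(lr)
    return pg_set, cg_set, always_on
```

Calling `classify(5)` and `classify(10)` yields the DAT_5 and DAT_10 partitions described in the list: LR7 moves from the power gated set to the clock gated set when the threshold is raised, and the fully combinatorial regions are never gated.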

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations when gated designs are adopted. (x-axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; y-axis: power; series: Base, CG_full, PG_full, DAT_5, DAT_10.)

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1, respectively. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only to the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. Errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(ISO_{ON})$ and $P(ISO_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: distinctive features of the computational kernels.

Kernel | Actors | T_ON | Functionality | LRs
abs | 1 | 0.03 | Absolute value calculation | (4)
chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13)
cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13)
cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9)
median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13)
min_max | 1 | 0.01 | Maximum/minimum finding | (11)
sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | > Th | PG set | CG set | NA
DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.
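As an illustration of how the kernel data in Table 10 relate to the logic regions, the following minimal Python sketch accumulates, for each LR, the activation time of the kernels that use it. It assumes that the T_ON values are fractions of the overall execution time (they sum to 1.00 when read as 0.03, 0.33, and so on) and that an LR is active whenever one of its kernels runs. The names `KERNELS`, `lr_on_share`, and `LR_SHARE` are hypothetical and are not part of the MDC tool.

```python
# Kernel -> (activation-time share T_ON, logic regions used), as read from Table 10.
# Assumption: T_ON values are fractions of the total execution time.
KERNELS = {
    "abs": (0.03, [4]),
    "chgb": (0.33, [5, 6, 7, 12, 13]),
    "cubic": (0.06, [1, 5, 9, 10, 13]),
    "cubic_conv": (0.09, [5, 7, 8, 9]),
    "median": (0.06, [1, 3, 5, 11, 13]),
    "min_max": (0.01, [11]),
    "sbwlabel": (0.42, [2, 4, 5, 6, 13]),
}


def lr_on_share(kernels):
    """Sum, for every LR, the T_ON of the kernels that activate it."""
    share = {}
    for t_on, lrs in kernels.values():
        for lr in lrs:
            share[lr] = share.get(lr, 0.0) + t_on
    # Round to two decimals to hide floating-point noise.
    return {lr: round(s, 2) for lr, s in sorted(share.items())}


LR_SHARE = lr_on_share(KERNELS)

if __name__ == "__main__":
    for lr, s in LR_SHARE.items():
        print(f"LR{lr}: active for {s:.2f} of the execution time")
```

Under these assumptions, LR5 turns out to be active for most of the execution time, while regions such as LR3, LR8, and LR10 are idle for more than 90% of it, which is the kind of region where a gating strategy has the most room to pay off.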

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving;

(ii) PG_full: the CGR design where all the 13 LRs are power gated;

(iii) CG_full: the CGR design where all the 13 LRs are clock gated;

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1;

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.
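The partition reported in Table 11 (regions above the threshold, PG set, CG set, and not assigned) can be pictured with the rough Python sketch below. It is not Algorithm 1 itself, which is defined earlier in the paper. The selection rule used here is an assumption: power gating is chosen only for regions above the area threshold whose estimated power gating saving is positive and not lower than the clock gating one. The `LogicRegion` fields, the `assign_strategies` function, and all the numbers in `REGIONS` are hypothetical, apart from the 0.65% area of LR5 quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    area_pct: float   # share of the whole system area, in percent
    pg_saving: float  # estimated net saving with power gating (made-up units)
    cg_saving: float  # estimated net saving with clock gating (made-up units)


def assign_strategies(regions, area_threshold_pct):
    """Split regions into power gated, clock gated, and always-on (NA) sets.

    Assumed rule, for illustration only: power gating needs an area above
    the threshold and a saving that is positive and at least as good as
    the clock gating one; otherwise clock gating is used when it helps.
    """
    pg_set, cg_set, na_set = [], [], []
    for lr in regions:
        above_threshold = lr.area_pct > area_threshold_pct
        if above_threshold and lr.pg_saving > 0 and lr.pg_saving >= lr.cg_saving:
            pg_set.append(lr.name)
        elif lr.cg_saving > 0:
            cg_set.append(lr.name)
        else:
            na_set.append(lr.name)
    return pg_set, cg_set, na_set


# Hypothetical figures; only the 0.65% area of LR5 comes from the text.
REGIONS = [
    LogicRegion("LR_big", 12.0, 6.0, 5.5),
    LogicRegion("LR_mid", 7.0, 2.0, 1.8),
    LogicRegion("LR_small", 2.0, 0.9, 0.8),
    LogicRegion("LR5", 0.65, 0.0, 0.0),  # fully combinatorial and very small
]

if __name__ == "__main__":
    for threshold in (5.0, 10.0):
        print(threshold, assign_strategies(REGIONS, threshold))
```

With these made-up inputs, raising the threshold from 5% to 10% moves the mid-sized region from the PG set to the CG set, which mirrors the qualitative difference between DAT_5 and DAT_10 in Table 11.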

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As the reported histograms show, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Table 12: Zoom coprocessor at 90 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step. A hyphen marks entries that are not reported.

LR | PG saving Est (%) | PG saving Real (%) | PG Err (%) | PG Net Err (%) | CG saving Est (%) | CG saving Real (%) | CG Err (%) | CG Net Err (%)
LR1 | -6.322 | -6.331 | 0.15 | 0.18 | -6.307 | -6.313 | 0.09 | 0.17
LR2 | -23.464 | -23.463 | 0.00 | 1.26 | -23.369 | -23.373 | 0.02 | 1.67
LR3 | -6.537 | -6.544 | 0.10 | 1.65 | -6.507 | -6.514 | 0.10 | 1.59
LR4 | - | - | - | - | -0.958 | -0.964 | 0.68 | 1.34
LR5 | - | - | - | - | - | - | - | -
LR6 | - | - | - | - | -0.870 | -0.878 | 0.81 | 1.22
LR7 | - | - | - | - | -1.033 | -1.039 | 0.63 | 0.15
LR8 | -7.062 | -7.058 | 0.05 | 2.02 | -6.987 | -6.986 | 0.01 | 1.83
LR9 | - | - | - | - | -2.436 | -2.443 | 0.27 | 1.55
LR10 | -5.498 | -5.503 | 0.10 | 1.65 | -5.474 | -5.480 | 0.11 | 1.57
LR11 | - | - | - | - | -3.329 | -3.285 | 1.32 | 3.67
LR12 | -4.278 | -4.288 | 0.23 | 1.83 | -4.267 | -4.273 | 0.15 | 1.86
LR13 | - | - | - | - | -0.649 | -0.656 | 1.01 | 1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to the base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (-62.38% for DAT_10 versus -62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering, respectively, power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.
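The error columns can be cross-checked with a short Python sketch. It assumes that the error is the relative difference between the estimated and the real value, expressed as a percentage of the real one, and it reads the clock gating row of LR11 in Table 12 as -3.329 (estimated) and -3.285 (real). The function name `relative_error_pct` is hypothetical.

```python
def relative_error_pct(estimated, real):
    """Relative estimation error, in percent of the real value (assumed definition)."""
    return abs(estimated - real) / abs(real) * 100.0


# Clock gating saving of LR11 as read from Table 12 (rounded to three decimals).
lr11_cg_error = relative_error_pct(-3.329, -3.285)

if __name__ == "__main__":
    print(f"LR11 clock gating error: {lr11_cg_error:.2f}%")
```

The recomputed value is close to, but not identical with, the figure reported in the table for LR11. A plausible explanation is that the table was derived from unrounded power values, so the sketch should be taken as a sanity check rather than as an exact reproduction.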

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial;

(ii) PG_full: the same as in the 90 nm synthesis trial;

(iii) CG_full: the same as in the 90 nm synthesis trial;

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1;

(v) DAT_5: the same as in the 90 nm synthesis trial;

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall savings than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, when the area threshold is lowered, the area of the optimal design and that of the fully power gated one become very similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) * switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch down N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | > Th | PG set | CG set | NA
DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5
DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5

Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step. A hyphen marks entries that are not reported.

LR | PG saving Est (%) | PG saving Real (%) | PG Err (%) | PG Net Err (%) | CG saving Est (%) | CG saving Real (%) | CG Err (%) | CG Net Err (%)
LR1 | -5.686 | -5.699 | 0.23 | 0.47 | -5.507 | -5.501 | 0.13 | 0.53
LR2 | -22.990 | -22.982 | 0.03 | 1.02 | -22.063 | -22.029 | 0.15 | 0.81
LR3 | -6.569 | -6.566 | 0.04 | 1.13 | -6.273 | -6.262 | 0.18 | 0.91
LR4 | -0.946 | -0.944 | 0.15 | 1.82 | -0.919 | -0.917 | 0.22 | 1.52
LR5 | - | - | - | - | - | - | - | -
LR6 | -0.747 | -0.748 | 0.25 | 0.25 | -0.853 | -0.833 | 0.26 | 1.59
LR7 | -0.955 | -0.957 | 0.17 | 1.36 | -0.935 | -0.933 | 0.21 | 1.49
LR8 | -7.464 | -7.460 | 0.06 | 1.36 | -6.724 | -6.714 | 0.16 | 0.66
LR9 | -2.475 | -2.482 | 0.25 | 1.10 | -2.350 | -2.345 | 0.18 | 1.09
LR10 | -5.473 | -5.470 | 0.04 | 1.09 | -5.270 | -5.261 | 0.17 | 0.91
LR11 | -3.422 | -3.361 | 1.79 | 2.58 | -3.257 | -3.205 | 1.63 | 2.04
LR12 | -4.327 | -4.330 | 0.08 | 1.19 | -4.176 | -4.169 | 0.17 | 0.98
LR13 | -0.587 | -0.580 | 1.25 | 3.70 | -0.623 | -0.621 | 0.28 | 1.85

The FFT use case is a really simple design, used only for the development of the power estimation model, and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small size design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
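The two wake-up overhead percentages quoted above follow directly from the 4-cycle switch on/off protocol. A minimal Python sketch is given below, assuming that the overhead is expressed as the ratio between the wake-up cycles and the kernel execution cycles. The names `WAKE_UP_CYCLES` and `wake_up_overhead_pct` are hypothetical.

```python
WAKE_UP_CYCLES = 4  # clock cycles required by each switch on/off transition


def wake_up_overhead_pct(execution_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time (assumed ratio)."""
    return wake_up_cycles / execution_cycles * 100.0


if __name__ == "__main__":
    # Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
    for cycles in (24, 160):
        print(f"{cycles} cycles: {wake_up_overhead_pct(cycles):.1f}% overhead")
```

For 24 cycles the ratio is about 16.7%, that is, almost 17%, while for 160 cycles it drops to 2.5%, in line with the figures given in the text.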

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity.

LR | PG overhead Est | PG overhead Real | Err (%)
LR1 | -30.54 | -26.10 | 17.07
LR2 | -0.19 | -1.99 | 90.22
LR3 | -21.83 | -6.95 | 214.30
LR4 | -0.12 | -0.67 | 80.97
LR5 | -0.06 | -0.59 | 90.11
LR6 | -0.07 | 0.56 | 86.60
LR7 | -15.39 | -2.64 | 482.93
LR8 | -0.38 | -0.02 | 1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design, without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 * k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N * (2 * k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
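The counts above can be reproduced with a small Python sketch. It follows the formulas given in this section (one synthesis and N simulations for the proposed flow, 2 * k + 1 syntheses and N simulations per synthesized design otherwise), skipping the clock gated design of every fully combinatorial region. The function name `design_effort` and its parameter names are hypothetical.

```python
def design_effort(n_functionalities, k_regions, combinatorial_regions=0):
    """Return ((syntheses, simulations) with the flow, (syntheses, simulations) without it)."""
    # Without the flow: baseline + one power gated design per LR
    # + one clock gated design per LR that is not fully combinatorial.
    syntheses = 1 + k_regions + (k_regions - combinatorial_regions)
    exhaustive = (syntheses, n_functionalities * syntheses)
    # With the flow: the baseline synthesis and N simulations on it.
    proposed = (1, n_functionalities)
    return proposed, exhaustive


if __name__ == "__main__":
    print("FFT: ", design_effort(4, 8, combinatorial_regions=3))
    print("Zoom:", design_effort(7, 13, combinatorial_regions=1))
```

For the FFT use case this gives 1 synthesis and 4 simulations against 14 syntheses and 56 simulations, and for the zoom coprocessor 1 synthesis and 7 simulations against 26 syntheses and 182 simulations, matching the numbers reported in the text.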

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps further both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her Ph.D. scholarship (POR FSE, European Social Fund 2007-2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642-649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517-524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365-376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131-1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265-268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38-43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559-560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233-249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141-148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439-442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44-47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230-241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36-48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91-112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834-839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135-152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1-9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296-301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335-345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803-1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100-108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30-37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1-8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216-225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002-1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584-590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1-6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50-63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618-625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30-68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138-1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1-29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1-24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59-66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32-37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238-251, 2015.


Figure 4: Step-by-step example of the enhanced MDC power extension implementing Algorithm 1.

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology and has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experimental results conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

4.1. Evaluation Phase. This section discusses in detail the results obtained considering the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \quad k = 0, 1, \ldots, N-1. \tag{9}$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to compute the outputs.
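The butterfly decomposition described above can be illustrated with a minimal Python sketch. It is a software model only, not the hardware design of the use case: the function names (`twiddle`, `butterfly`, `fft_radix2`, `dft`, `butterfly_count`) are illustrative, and the twiddle factor is assumed to be $w_n^k = e^{-i 2\pi k/n}$, as in the standard radix-2 decimation-in-time formulation.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i*2*pi*k/n) (standard definition, assumed here)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly as in (10): returns (x0 + x1*w, x0 - x1*w)."""
    t = x1 * w
    return x0 + t, x0 - t


def dft(x):
    """Direct DFT as in (9), used only as a reference for checking."""
    n_pts = len(x)
    return [sum(x[n] * twiddle(k * n, n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def fft_radix2(x):
    """Recursive decimation-in-time radix-2 FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n_pts
    for k in range(n_pts // 2):
        out[k], out[k + n_pts // 2] = butterfly(even[k], odd[k], twiddle(k, n_pts))
    return out


def butterfly_count(n_pts):
    """Butterflies in a radix-2 FFT: (N/2) per stage times log2(N) stages."""
    return (n_pts // 2) * (n_pts.bit_length() - 1)
```

For a size-8 input, `butterfly_count(8)` gives the 12 butterflies of the baseline design (three stages of four butterflies), and `fft_radix2` agrees with the direct `dft` up to floating-point rounding.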

Starting from the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations (Base design; power in µW for 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, and 12b - 3 CC).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit external environment requests. For instance, in a battery-operated environment, when the battery level falls below a given threshold, some throughput can be given up to consume less power.
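As an illustration of this kind of run-time trade-off, the following Python sketch picks the fastest FFT configuration that fits a given power budget. The selection policy and the names `CONFIGS`, `total_power`, and `pick_config` are hypothetical and are not part of the MDC flow; the latency and power figures are those reported for the four configurations (Table 5, 90 nm), with total power taken as static plus internal.

```python
# Latency in clock cycles and power in nW for each FFT configuration (Table 5).
CONFIGS = {
    "1b":  {"latency": 24, "static": 1750407, "internal": 1307259},
    "2b":  {"latency": 12, "static": 1752106, "internal": 1539855},
    "4b":  {"latency": 6,  "static": 1757485, "internal": 2020953},
    "12b": {"latency": 3,  "static": 1776056, "internal": 2411257},
}


def total_power(name):
    """Static plus internal power of a configuration, in nW."""
    cfg = CONFIGS[name]
    return cfg["static"] + cfg["internal"]


def pick_config(budget_nw):
    """Return the lowest-latency configuration within the budget, or None.

    Hypothetical policy: among the configurations whose total power does not
    exceed the budget, prefer the one with the shortest latency.
    """
    fitting = [name for name in CONFIGS if total_power(name) <= budget_nw]
    if not fitting:
        return None
    return min(fitting, key=lambda name: CONFIGS[name]["latency"])
```

With a budget of 3.5 mW the sketch selects 2b, trading latency (12 cycles instead of 3) for lower power, while a budget of 4.2 mW allows the baseline 12b design.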

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

| FFT configuration | $T_{ON}$ | LRs | Area [%] | Static [nW] | Internal [nW] |
|---|---|---|---|---|---|
| 1b | 0.42 | (2, 6, 8) | 12.86 | 1750407 | 1307259 |
| 2b | 0.21 | (2, 5, 7, 8) | 20.81 | 1752106 | 1539855 |
| 4b | 0.04 | (2, 3, 4, 7) | 35.28 | 1757485 | 2020953 |
| 12b | 0.33 | (1, 3, 7, 8) | 98.03 | 1776056 | 2411257 |

Table 6: FFT use case: logic regions architectural and functional characteristics.

| | LR1 | LR2 | LR3 | LR4 | LR5 | LR6 | LR7 | LR8 |
|---|---|---|---|---|---|---|---|---|
| $T_{ON}$ | 0.33 | 0.67 | 0.37 | 0.04 | 0.21 | 0.42 | 0.58 | 0.96 |
| iso(e) | 1024 | 1112 | 512 | 2324 | 1948 | 1756 | 256 | 5259 |
| rtn | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| reg | 1024 | 518 | 256 | 0 | 0 | 0 | 128 | 512 |
| iso(r) | 1024 | 1032 | 512 | 2306 | 1926 | 1740 | 256 | 4934 |

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so that power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 8: FFT use case: area percentage per LR (FFT, 90 nm). Area shares: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%.
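The per-region activation times of Table 6 follow directly from the configuration data of Table 5. The Python sketch below reproduces that summation; the dictionary and function names are illustrative, and the input values are those reported in Table 5.

```python
# Activation time of each FFT configuration and the LRs it activates (Table 5).
T_ON_CONFIG = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
ACTIVE_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def lr_activation_times():
    """Sum, for each LR, the activation times of the configurations using it."""
    t_on = {lr: 0.0 for lr in range(1, 9)}
    for cfg, lrs in ACTIVE_LRS.items():
        for lr in lrs:
            t_on[lr] += T_ON_CONFIG[cfg]
    # Round to two decimals, the precision used in Table 6.
    return {lr: round(value, 2) for lr, value in t_on.items()}
```

Calling `lr_activation_times()` returns 0.67 for LR2 (0.42 + 0.21 + 0.04), in line with the worked example in the text, and matches the $T_{ON}$ row of Table 6 for the other regions.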

Figure 9: FFT use case: comparison between the estimated and real power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained by applying (1) and (3), for a prospective power gating implementation, and (4) and (5), considering a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

| LR | PG overhead, Static [%] | PG overhead, Internal [%] | CG overhead, Static [%] | CG overhead, Internal [%] |
|---|---|---|---|---|
| LR1 | −40.87 | −9.47 | 0.00 | −11.33 |
| LR2 | +0.23 | −1.32 | 0.00 | −2.85 |
| LR3 | −10.44 | −2.11 | 0.00 | −2.59 |
| LR4 | −0.08 | +0.10 | n/a | n/a |
| LR5 | +0.08 | +0.13 | n/a | n/a |
| LR6 | +0.11 | +0.17 | n/a | n/a |
| LR7 | −3.44 | −0.39 | 0.00 | −0.79 |
| LR8 | +1.42 | +2.35 | 0.00 | −0.25 |

Table 8: FFT use case: power gating overhead estimation step accuracy.

| LR | PG overhead, Est [%] | PG overhead, Real [%] | PG overhead, Err [%] | Overhead error, Iso [%] | Overhead error, Rtn [%] | Overhead error, Contr [%] | Net, Err [%] |
|---|---|---|---|---|---|---|---|
| LR1 | −25.22 | −25.37 | 0.59 | 5.18 | 4.64 | 0.75 | 1.03 |
| LR3 | −6.28 | −6.22 | 1.07 | 16.34 | 2.97 | 0.76 | 1.26 |
| LR7 | −1.92 | −1.92 | 0.24 | 9.24 | 1.08 | 0.83 | 1.36 |

Table 9: FFT use case: clock gating overhead estimation step accuracy.

| LR | CG overhead, Est [%] | CG overhead, Real [%] | CG overhead, Err [%] | Overhead error, CG cell [%] | Net, Err [%] |
|---|---|---|---|---|---|
| LR1 | −5.73 | −5.77 | 0.06 | 0.83 | 0.18 |
| LR2 | −1.42 | −1.42 | 0.12 | 1.03 | 0.40 |
| LR3 | −1.34 | −1.38 | 0.05 | 0.84 | 0.21 |
| LR7 | −0.39 | −0.39 | 0.05 | 0.96 | 0.22 |
| LR8 | −0.12 | −0.12 | 0.29 | 1.08 | 0.47 |

Figure 10: FFT use case: comparison between the estimated and real power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.
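The relation between the per-contribution overheads of Table 7 and the total power gating overheads of Table 8 can be checked numerically. The Python sketch below is not the paper's model (equations (1) and (3) are defined in an earlier section); it rests on two assumptions that should be read as such: each percentage in Table 7 is relative to the corresponding base contribution, and the base static and internal powers are the $T_{ON}$-weighted averages of the Table 5 values. All names are illustrative.

```python
# Table 5 data: activation time and base power (nW) per FFT configuration.
T_ON = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
STATIC_NW = {"1b": 1750407, "2b": 1752106, "4b": 1757485, "12b": 1776056}
INTERNAL_NW = {"1b": 1307259, "2b": 1539855, "4b": 2020953, "12b": 2411257}

# Table 7 data: (static %, internal %) power gating overhead per region.
PG_OVERHEAD = {
    "LR1": (-40.87, -9.47),
    "LR3": (-10.44, -2.11),
    "LR7": (-3.44, -0.39),
}


def weighted_average(power_nw):
    """Average power over the configurations, weighted by activation time."""
    return sum(T_ON[cfg] * power_nw[cfg] for cfg in T_ON)


def total_overhead(static_pct, internal_pct):
    """Combine static and internal overheads into a total overhead (%).

    Assumption: each percentage refers to its own base contribution, so the
    total is the average of the two, weighted by the base power of each.
    """
    static_base = weighted_average(STATIC_NW)
    internal_base = weighted_average(INTERNAL_NW)
    return ((static_pct * static_base + internal_pct * internal_base)
            / (static_base + internal_base))
```

Under these assumptions, `total_overhead(*PG_OVERHEAD["LR1"])` evaluates to about −25.2%, close to the estimated total of Table 8; LR3 and LR7 give about −6.3% and −1.9%, respectively.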

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. Then, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
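The two selections above can be reproduced with a short Python sketch. It is a simplified reading of the thresholding and comparison steps as described in this section, not a transcription of Algorithm 1 (which is defined in Section 3.2.6). The names `AREA`, `PG_EST`, `CG_EST`, and `select_regions` are illustrative; the data are the area shares of Figure 8 and the estimated total overheads of Tables 8 and 9, with regions lacking a clock gating estimate treated as fully combinatorial.

```python
# Area share (%) of each logic region (Figure 8).
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated total overhead (%) with power gating (Table 8); negative = saving.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
# Estimated total overhead (%) with clock gating (Table 9); regions without an
# entry are fully combinatorial and cannot be clock gated.
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_regions(area_th):
    """Split the regions into a power gated set and a clock gated set."""
    pg_set, cg_set = [], []
    for lr, area in AREA.items():
        cg = CG_EST.get(lr)  # None when clock gating is not applicable
        # Power gating is evaluated only for regions above the area threshold.
        pg = PG_EST.get(lr) if area > area_th else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            pg_set.append(lr)      # power gating saves more than clock gating
        elif cg is not None and cg < 0:
            cg_set.append(lr)      # fall back to clock gating when it saves power
        # otherwise the region receives no power management
    return pg_set, cg_set
```

With `area_th = 5` the sketch returns LR1, LR3, and LR7 for power gating and LR2 and LR8 for clock gating; with `area_th = 10`, LR7 moves to the clock gated set, as described for DAT_10.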

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power. The legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, in Figure 12 the trade-off levels between power and latency (and, in turn, throughput) are illustrated for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different size-8 FFT configurations when gated designs are adopted (Base, CG_full, PG_full, DAT_5, and DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, which consider the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1, respectively. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. Also, errors related to the estimation of state retention and Power Controller overhead are quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(\mathrm{ISO}_{ON})$ and $P(\mathrm{ISO}_{OFF})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (not including the net contribution) with the real measured overhead, including the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.
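As a worked example of how an estimation error of this kind can be computed, the Python sketch below evaluates a relative error between an estimated and a measured overhead. The definition used here (absolute difference divided by the magnitude of the real value) is an assumption, since the exact formula is not stated in this section; the function name is illustrative.

```python
def relative_error_pct(estimated, real):
    """Relative error (%) of an estimate with respect to the real value.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    return abs(estimated - real) / abs(real) * 100.0
```

For LR1 in Table 8 (estimated −25.22%, real −25.37%), `relative_error_pct(-25.22, -25.37)` gives about 0.59%. Other rows may not be reproduced exactly from the tabulated values, which are rounded to two decimals.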

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.

Table 10: Zoom coprocessor use case: computational kernels distinctive features.

| Kernel | Actors | $T_{ON}$ | Functionality | LRs |
|---|---|---|---|---|
| abs | 1 | 0.03 | Absolute value calculation | (4) |
| chgb | 7 | 0.33 | Bilevel/grayscale block checking | (5, 6, 7, 12, 13) |
| cubic | 10 | 0.06 | Linear combination calculation | (1, 5, 9, 10, 13) |
| cubic_conv | 6 | 0.09 | Cubic filter convolution | (5, 7, 8, 9) |
| median | 9 | 0.06 | Median calculation | (1, 3, 5, 11, 13) |
| min_max | 1 | 0.01 | Maximum/minimum finding | (11) |
| sbwlabel | 17 | 0.42 | Edge block checking | (2, 4, 5, 6, 13) |

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. Then, these dataflow kernels have been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As can be seen in the reported histograms, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving, Est [%] | PG saving, Real [%] | PG saving, Err [%] | PG Net, Err [%] | CG saving, Est [%] | CG saving, Real [%] | CG saving, Err [%] | CG Net, Err [%] |
|---|---|---|---|---|---|---|---|---|
| LR1 | −6.322 | −6.331 | 0.15 | 0.18 | −6.307 | −6.313 | 0.09 | 0.17 |
| LR2 | −23.464 | −23.463 | 0.00 | 1.26 | −23.369 | −23.373 | 0.02 | 1.67 |
| LR3 | −6.537 | −6.544 | 0.10 | 1.65 | −6.507 | −6.514 | 0.10 | 1.59 |
| LR4 | n/a | n/a | n/a | n/a | −0.958 | −0.964 | 0.68 | 1.34 |
| LR5 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| LR6 | n/a | n/a | n/a | n/a | −0.870 | −0.878 | 0.81 | 1.22 |
| LR7 | n/a | n/a | n/a | n/a | −1.033 | −1.039 | 0.63 | 0.15 |
| LR8 | −7.062 | −7.058 | 0.05 | 2.02 | −6.987 | −6.986 | 0.01 | 1.83 |
| LR9 | n/a | n/a | n/a | n/a | −2.436 | −2.443 | 0.27 | 1.55 |
| LR10 | −5.498 | −5.503 | 0.10 | 1.65 | −5.474 | −5.480 | 0.11 | 1.57 |
| LR11 | n/a | n/a | n/a | n/a | −3.329 | −3.285 | 1.32 | 3.67 |
| LR12 | −4.278 | −4.288 | 0.23 | 1.83 | −4.267 | −4.273 | 0.15 | 1.86 |
| LR13 | n/a | n/a | n/a | n/a | −0.649 | −0.656 | 1.01 | 1.13 |

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, models are not affected by this simplification.

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are the following:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we have decided to add in this second trial an additional design, DAT_1. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one and determines the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is therefore partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full provides better overall savings than DAT_5 and DAT_10. At 45 nm, designers need to select a very low area threshold in Algorithm 1 to achieve optimal results. DAT_1, which excludes only three LRs from power gating (LR5, LR6, and LR13; see Table 13), saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, as the area threshold is lowered, the area of the optimal design and that of the fully power gated one become very similar.

We can conclude that, as technology scales down, the area thresholding step of the proposed algorithm becomes less beneficial. Still, its presence in the automated flow makes the overall process more robust, since it avoids useless iterations on designs that are not convenient by construction when the technology is less constrained or the gap between dynamic and static consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. Even when the net contribution is neglected in the discussed equations, the models are extremely accurate (the error never exceeds 3.70%).
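As an illustration of how the error columns of Table 14 can be read, the sketch below recomputes the relative error between estimated and real savings. It assumes that the error is defined as the absolute difference normalised by the real value and that the savings are listed with three decimal digits; the function name is hypothetical, and the recomputed figures may differ slightly from the tabulated ones because the inputs are already rounded.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative estimation error, in percent of the real (post-synthesis) value.

    Assumption: error = |estimated - real| / |real| * 100.
    """
    if real == 0:
        raise ValueError("relative error is undefined for a zero reference value")
    return abs(estimated - real) / abs(real) * 100.0


# (estimated, real) power gating savings for three LRs, taken from Table 14.
pg_savings = {
    "LR1": (-5.686, -5.699),
    "LR11": (-3.422, -3.361),
    "LR13": (-0.587, -0.580),
}

for lr, (est, real) in pg_savings.items():
    print(f"{lr}: {relative_error_pct(est, real):.2f}%")
```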

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use-case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not yet considered.

The insertion of header/footer switches (in this work we used header ones as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \ast switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch off $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
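The leakage term described above can be sketched numerically as follows. This is only an illustrative check under stated assumptions: the function names are hypothetical, the numeric values are placeholders rather than library data, and the break-even test assumes that the domain leakage is saved only during the fraction of time in which the domain is off.

```python
def switch_leakage_overhead(p_lkg_sw: float, n_switches: int) -> float:
    """Always-present leakage of the power switches: P_lkg(SW) * switches."""
    if n_switches < 0:
        raise ValueError("the number of switches cannot be negative")
    return p_lkg_sw * n_switches


def gating_pays_off(p_lkg_domain: float, t_on: float,
                    p_lkg_sw: float, n_switches: int) -> bool:
    """Compare the leakage saved by gating with the leakage added by the switches.

    Assumption (not from the paper): the domain leakage is saved only while
    the domain is switched off, i.e. for a fraction (1 - t_on) of the time.
    """
    if not 0.0 <= t_on <= 1.0:
        raise ValueError("t_on must be a fraction between 0 and 1")
    saved = p_lkg_domain * (1.0 - t_on)
    return saved > switch_leakage_overhead(p_lkg_sw, n_switches)


# Placeholder values in nW, chosen only to exercise the functions.
print(switch_leakage_overhead(p_lkg_sw=2.0, n_switches=8))
print(gating_pays_off(p_lkg_domain=500.0, t_on=0.33, p_lkg_sw=2.0, n_switches=8))
```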

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions can be neglected.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step. Savings and errors are percentages; the Net columns report the error obtained when the net term is neglected.

| LR | PG saving Est | PG saving Real | PG Err | PG Net Err | CG saving Est | CG saving Real | CG Err | CG Net Err |
|---|---|---|---|---|---|---|---|---|
| LR1 | -5.686 | -5.699 | 0.23 | 0.47 | -5.507 | -5.501 | 0.13 | 0.53 |
| LR2 | -22.990 | -22.982 | 0.03 | 1.02 | -22.063 | -22.029 | 0.15 | 0.81 |
| LR3 | -6.569 | -6.566 | 0.04 | 1.13 | -6.273 | -6.262 | 0.18 | 0.91 |
| LR4 | -0.946 | -0.944 | 0.15 | 1.82 | -0.919 | -0.917 | 0.22 | 1.52 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | -0.747 | -0.748 | 0.25 | 0.25 | -0.853 | -0.833 | 0.26 | 1.59 |
| LR7 | -0.955 | -0.957 | 0.17 | 1.36 | -0.935 | -0.933 | 0.21 | 1.49 |
| LR8 | -7.464 | -7.460 | 0.06 | 1.36 | -6.724 | -6.714 | 0.16 | 0.66 |
| LR9 | -2.475 | -2.482 | 0.25 | 1.10 | -2.350 | -2.345 | 0.18 | 1.09 |
| LR10 | -5.473 | -5.470 | 0.04 | 1.09 | -5.270 | -5.261 | 0.17 | 0.91 |
| LR11 | -3.422 | -3.361 | 1.79 | 2.58 | -3.257 | -3.205 | 1.63 | 2.04 |
| LR12 | -4.327 | -4.330 | 0.08 | 1.19 | -4.176 | -4.169 | 0.17 | 0.98 |
| LR13 | -0.587 | -0.580 | 1.25 | 3.70 | -0.623 | -0.621 | 0.28 | 1.85 |

The FFT use case is a very simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would meet the condition for neglecting both the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
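The wake-up overhead figures quoted for the zoom coprocessor and for the HEVC interpolation filter follow from the 4-cycle switch-on protocol. A minimal sketch of the arithmetic is given below; the function name is an illustrative assumption, and the overhead is taken as the ratio between the wake-up cycles and the kernel execution cycles.

```python
def wakeup_overhead_pct(kernel_cycles: int, wakeup_cycles: int = 4) -> float:
    """Wake-up cycles as a percentage of the kernel execution cycles."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wakeup_cycles / kernel_cycles


# Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
for name, cycles in (("zoom, fastest kernel", 24), ("HEVC, 8 x 8 block", 160)):
    print(f"{name}: {wakeup_overhead_pct(cycles):.1f}%")
# prints 16.7% and 2.5%
```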

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed using the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models cannot properly determine which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity.

| LR | PG overhead Est | PG overhead Real | Err [%] |
|---|---|---|---|
| LR1 | -30.54 | -26.10 | 17.07 |
| LR2 | -0.19 | -1.99 | 90.22 |
| LR3 | -21.83 | -6.95 | 214.30 |
| LR4 | -0.12 | -0.67 | 80.97 |
| LR5 | -0.06 | -0.59 | 90.11 |
| LR6 | -0.07 | 0.56 | 86.60 |
| LR7 | -15.39 | -2.64 | 482.93 |
| LR8 | -0.38 | -0.02 | 1941.81 |

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with the synthesized baseline CGR design.

Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification possible without adopting the proposed approach arises when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs, 3 of which are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
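The effort figures above can be reproduced with a few lines of code. The sketch below is only a bookkeeping aid: the class and function names are hypothetical, and it assumes that every fully combinatorial LR saves exactly one clock gated synthesis, as stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    syntheses: int
    simulations: int


def effort_with_flow(n_functions: int) -> Effort:
    """Proposed flow: one baseline synthesis and N hardware simulations."""
    return Effort(syntheses=1, simulations=n_functions)


def effort_without_flow(n_functions: int, n_regions: int,
                        n_combinatorial: int = 0) -> Effort:
    """Exhaustive exploration: baseline + k power gated + (k - comb) clock gated."""
    syntheses = 1 + n_regions + (n_regions - n_combinatorial)
    # Each synthesized design is simulated once per functionality.
    return Effort(syntheses=syntheses, simulations=n_functions * syntheses)


if __name__ == "__main__":
    for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom coprocessor", 7, 13, 1)):
        print(name, effort_with_flow(n), effort_without_flow(n, k, comb))
```

With no fully combinatorial regions, `effort_without_flow` reduces to the general $2k + 1$ syntheses and $N(2k + 1)$ simulations.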

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable for accelerating multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for addressing the mapping problem in an automated way. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, determine the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.



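Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies. [Block diagram with inputs In1 to In3, actors A to G, switching boxes SB_0 to SB_2, output Out1, power domain PD1, clock domains CD4 and CD5, the power controller, clock gating cells (CG), state retention and isolation (ISOL) logic, and the power switch connected to VDD.]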

of the clock gating. All the remaining logic, which includes region LR3, is always on.

4. Assessments

In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases that are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology, and it has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario: an image coprocessing unit accelerating a zoom application, implemented on both a 90 nm and a 45 nm CMOS technology in order to assess the robustness of the proposed flow with different technology parameters.

In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their associated flow.

4.1. Evaluation Phase. This section discusses in depth the results obtained for the FFT use case, targeting a 90 nm CMOS technology.

4.1.1. Fast Fourier Transform Algorithm. The Fast Fourier Transform (FFT) is an optimised algorithm for the calculation of the Discrete Fourier Transform (DFT). It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k (n/N)}, \qquad k = 0, 1, \ldots, N-1. \tag{9}$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $M$ stages of size-$r$ DFTs are required, where $N = r^M$. The small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation-in-time variant of the algorithm. When the radix is $r = 2$, the DFTs are called butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + x_1 w_n^k, \qquad X_1 = x_0 - x_1 w_n^k, \tag{10}$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $w_n^k$ are the twiddle factors, defined as

$$w_n^k = e^{-i 2\pi (k/n)}, \tag{11}$$

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors $w_n^k$ are calculated according to (11). [Signal-flow graph with inputs in the order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), outputs X(0) to X(7), and twiddle factors $w_2^0$, $w_4^0$, $w_4^1$, $w_8^0$, $w_8^1$, $w_8^2$, $w_8^3$.]
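To make the relation between the DFT definition, the butterfly equations, and the twiddle factors concrete, the following sketch implements a direct DFT and a recursive radix-2 decimation-in-time FFT built from butterflies. It is a software illustration only, not the hardware design of the paper; all function names are illustrative assumptions, and the twiddle factor is taken as $e^{-i 2\pi k/n}$ for a sub-transform of size $n$.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition (N multiply-accumulates per output)."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / size) for n in range(size))
        for k in range(size)
    ]


def twiddle(k, n):
    """Twiddle factor w_n^k = exp(-i * 2 * pi * k / n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(x0, x1, w):
    """Radix-2 butterfly: X0 = x0 + x1 * w, X1 = x0 - x1 * w."""
    t = x1 * w
    return x0 + t, x0 - t


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("the input length must be a power of two")
    even = fft_radix2(x[0::2])  # size n/2 DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # size n/2 DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, -1, -2, -3]
    worst = max(abs(a - b) for a, b in zip(dft(samples), fft_radix2(samples)))
    print(f"largest deviation between DFT and FFT: {worst:.2e}")
```

For an input of size 8, the recursion evaluates four butterflies at each of its three levels, which corresponds to the 12 butterflies of the design in Figure 6.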

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce its outputs.

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In these variants, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput decreases. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform;

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles;

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles;

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations. [Plot of the Base design power (µW, axis from 3000 to 5000) for the configurations 1b (24 CC), 2b (12 CC), 4b (6 CC), and 12b (3 CC).]

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit the requests of the external environment. For instance, in a battery-operated environment, when the battery level drops below a given threshold, some throughput can be traded for lower power consumption.
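A possible run-time policy for the battery-driven scenario mentioned above is sketched below. It is purely illustrative and not part of the MDC tool: the function name, the battery threshold, and the latency budgets are hypothetical parameters. The only data taken from the text are the latencies of the four configurations and the observation that fewer butterflies imply lower dissipation.

```python
# Latency in clock cycles and number of butterflies of each size-8 FFT configuration.
LATENCY_CYCLES = {"1b": 24, "2b": 12, "4b": 6, "12b": 3}
BUTTERFLIES = {"1b": 1, "2b": 2, "4b": 4, "12b": 12}


def select_configuration(battery_level: float,
                         low_battery_threshold: float = 0.2,
                         nominal_latency: int = 3,
                         relaxed_latency: int = 24) -> str:
    """Pick the configuration with the fewest butterflies that meets the latency budget.

    Assumption: below the battery threshold the latency budget is relaxed,
    so that a smaller and less power-hungry configuration can be chosen.
    """
    budget = relaxed_latency if battery_level < low_battery_threshold else nominal_latency
    feasible = [cfg for cfg, cycles in LATENCY_CYCLES.items() if cycles <= budget]
    if not feasible:
        raise ValueError("no configuration meets the requested latency budget")
    return min(feasible, key=lambda cfg: BUTTERFLIES[cfg])


print(select_configuration(battery_level=0.9))
print(select_configuration(battery_level=0.1))
print(select_configuration(battery_level=0.1, relaxed_latency=6))
```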

MDC identifies 8 different LRs in the CGR system TheLRs activated by each FFT variant are listed in Table 5 whiletheir characteristics are reported in Table 6 In this tablegiven any LR its activation time (119879ON) has been obtainedsumming up the activation times of the FFT configurationsactivating the same region (provided in Table 5) For exampleLR2 is activated by 1b 2b and 4b Its 119879ON is 067 which is thesum of 119879ON(1b) = 042 119879ON(2b) = 021 and 119879ON(4b) = 004413 PowerModelling andHybrid PowerManagement Assess-ment As explained in Section 32 the proposed flow requires

Journal of Electrical and Computer Engineering 17

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934
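The T_ON row of Table 6 can be rederived from Table 5 with a few lines of code. The sketch below only restates the summation rule given in the text; the dictionary layout and the function name are illustrative choices.

```python
# Activation time and activated LRs of each FFT configuration (Table 5).
CONFIGS = {
    "1b":  {"t_on": 0.42, "lrs": (2, 6, 8)},
    "2b":  {"t_on": 0.21, "lrs": (2, 5, 7, 8)},
    "4b":  {"t_on": 0.04, "lrs": (2, 3, 4, 7)},
    "12b": {"t_on": 0.33, "lrs": (1, 3, 7, 8)},
}


def lr_activation_times(configs):
    """For every LR, sum the T_ON of the configurations that activate it."""
    totals = {}
    for cfg in configs.values():
        for lr in cfg["lrs"]:
            totals[lr] = totals.get(lr, 0.0) + cfg["t_on"]
    # Round to two decimals, the precision used in the tables.
    return {lr: round(t, 2) for lr, t in sorted(totals.items())}


T_ON = lr_activation_times(CONFIGS)

if __name__ == "__main__":
    for lr, t in T_ON.items():
        print(f"LR{lr}: {t:.2f}")
```

The result reproduces the T_ON row of Table 6, for example 0.67 for LR2 and 0.96 for LR8.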

Figure 8: FFT use case: area percentage per LR. [Pie chart, FFT 90 nm; area share of each LR: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%; each LR is further split into its sequential (seq) and combinatorial (comb) shares.]

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. Figure 8 also shows that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration. [Bar chart "Power consumption variation after power gating integration"; series: Est, Real; x-axis: LR1 to LR8; y-axis: power variation (%).]

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due, respectively, to power gating and clock gating. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice


Table 7: FFT use case: detailed static and dynamic saving due to power (PG) and clock gating (CG).

       PG overhead           CG overhead
LR     Static    Internal    Static    Internal
LR1    −40.87    −9.47       0.00      −11.33
LR2    +0.23     −1.32       0.00      −2.85
LR3    −10.44    −2.11       0.00      −2.59
LR4    −0.08     0.10        --        --
LR5    0.08      0.13        --        --
LR6    0.11      0.17
LR7    −3.44     −0.39       0.00      −0.79
LR8    1.42      2.35        0.00      −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

       PG overhead                 Overhead error            Net
LR     Est       Real      Err     Iso      Rtn     Contr    Err
LR1    −25.22    −25.37    0.59    5.18     4.64    0.75     1.03
LR3    −6.28     −6.22     1.07    16.34    2.97    0.76     1.26
LR7    −1.92     −1.92     0.24    9.24     1.08    0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

       CG overhead               Overhead       Net
LR     Est      Real     Err     CGcell Err     Err
LR1    −5.73    −5.77    0.06    0.83           0.18
LR2    −1.42    −1.42    0.12    1.03           0.40
LR3    −1.34    −1.38    0.05    0.84           0.21
LR7    −0.39    −0.39    0.05    0.96           0.22
LR8    −0.12    −0.12    0.29    1.08           0.47

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart "Power consumption variation after clock gating integration"; series: Est, Real; x-axis: LR1, LR2, LR3, LR7, LR8; y-axis: power variation (%).]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: Threshold set to 5%. Regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: Threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is also performed on LR7, which results in a negative overhead. Then, the regions to be clock gated are LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
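The two threshold settings above can be replayed with a short sketch. It is not Algorithm 1 itself, only a simplified reading of it under these assumptions: a region is power gated when it is above the area threshold and its estimated power gating overhead is negative and better than the clock gating one; otherwise it is clock gated if clock gating yields a negative overhead; otherwise it is left unmanaged. Area shares are read from Figure 8 (decimal placement inferred from the thresholds discussed in the text), and the estimated overheads come from Tables 8 and 9. All names are hypothetical.

```python
# Area share of each LR in % of the whole system (read from Figure 8).
AREA_PCT = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
            "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}

# Estimated overheads in % (negative means saving). Table 8 reports power
# gating estimates only for LR1, LR3 and LR7; Table 9 has no clock gating
# entry for the fully combinatorial LR4, LR5 and LR6.
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_strategy(area_threshold_pct):
    """Assign each LR to power gating (PG), clock gating (CG), or none (NA)."""
    plan = {"PG": [], "CG": [], "NA": []}
    for lr, area in AREA_PCT.items():
        # Power gating is only evaluated for regions above the threshold.
        pg = PG_EST.get(lr) if area > area_threshold_pct else None
        cg = CG_EST.get(lr)
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            plan["PG"].append(lr)
        elif cg is not None and cg < 0:
            plan["CG"].append(lr)
        else:
            plan["NA"].append(lr)
    return plan


if __name__ == "__main__":
    print("DAT_5 :", select_strategy(5.0))
    print("DAT_10:", select_strategy(10.0))
```

Under these assumptions, the 5% threshold yields LR1, LR3, and LR7 for power gating and LR2 and LR8 for clock gating, while the 10% threshold moves LR7 to the clock gated set, matching the two cases described in the text.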

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.


Figure 11: FFT use case: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart, FFT 90 nm; groups: Static, Internal, Total; y-axis: power (μW); legend: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).]

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, which can only reduce the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. The CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Plot; x-axis: configuration and latency in clock cycles (1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC); y-axis: power (μW), from 1000 to 5000; series: Base, CG_full, PG_full, DAT_5, DAT_10.]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON   Functionality                       LRs
abs          1        0.03   Absolute value calculation          (4)
chgb         7        0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding             (11)
sbwlabel     17       0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design: DAT_5
  > Th:    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

Design: DAT_10
  > Th:    LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to the given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. Then, these dataflow kernels have been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As can be seen in the reported histograms, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

        PG saving                    Net     CG saving                    Net
LR      Est        Real       Err    Err     Est        Real       Err    Err
LR1     −6.322     −6.331     0.15   0.18    −6.307     −6.313     0.09   0.17
LR2     −23.464    −23.463    0.00   1.26    −23.369    −23.373    0.02   1.67
LR3     −6.537     −6.544     0.10   1.65    −6.507     −6.514     0.10   1.59
LR4     --         --         --     --      −0.958     −0.964     0.68   1.34
LR5     --         --         --     --      --         --         --     --
LR6     --         --         --     --      −0.870     −0.878     0.81   1.22
LR7     --         --         --     --      −1.033     −1.039     0.63   0.15
LR8     −7.062     −7.058     0.05   2.02    −6.987     −6.986     0.01   1.83
LR9     --         --         --     --      −2.436     −2.443     0.27   1.55
LR10    −5.498     −5.503     0.10   1.65    −5.474     −5.480     0.11   1.57
LR11    --         --         --     --      −3.329     −3.285     1.32   3.67
LR12    −4.278     −4.288     0.23   1.83    −4.267     −4.273     0.15   1.86
LR13    --         --         --     --      −0.649     −0.656     1.01   1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart; groups: lkg, int, tot; y-axis: power (μW); legend: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).]

base. That is due to the 12 (one for each region but LR5) clock gating cells introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. In summary, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering, respectively, power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead for each design with respect to the Base one. [Bar chart; groups: lkg, int, tot; y-axis: power (μW); legend: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library, and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) * switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch down N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
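Written out, the extension suggested above would take the following form. This is only a sketch: P_static denotes the static estimate already given by (1), which is not repeated here, and the primed symbol is a label introduced for illustration.

```latex
P'_{static} = P_{static} + P_{lkg}(SW) \cdot switches
```

Under this reading, the added term is a constant penalty that does not depend on the activation time of the domain, which is why it is outweighed as soon as the leakage saved by gating the domain exceeds the leakage of its single column of switches.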

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts more than tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design: DAT_1
  > Th:    LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
  PG set:  LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
  CG set:  LR6, LR13
  NA:      LR5

Design: DAT_5
  > Th:    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

Design: DAT_10
  > Th:    LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

        PG saving                    Net     CG saving                    Net
LR      Est        Real       Err    Err     Est        Real       Err    Err
LR1     −5.686     −5.699     0.23   0.47    −5.507     −5.501     0.13   0.53
LR2     −22.990    −22.982    0.03   1.02    −22.063    −22.029    0.15   0.81
LR3     −6.569     −6.566     0.04   1.13    −6.273     −6.262     0.18   0.91
LR4     −0.946     −0.944     0.15   1.82    −0.919     −0.917     0.22   1.52
LR5     --         --         --     --      --         --         --     --
LR6     −0.747     −0.748     0.25   0.25    −0.853     −0.833     0.26   1.59
LR7     −0.955     −0.957     0.17   1.36    −0.935     −0.933     0.21   1.49
LR8     −7.464     −7.460     0.06   1.36    −6.724     −6.714     0.16   0.66
LR9     −2.475     −2.482     0.25   1.10    −2.350     −2.345     0.18   1.09
LR10    −5.473     −5.470     0.04   1.09    −5.270     −5.261     0.17   0.91
LR11    −3.422     −3.361     1.79   2.58    −3.257     −3.205     1.63   2.04
LR12    −4.327     −4.330     0.08   1.19    −4.176     −4.169     0.17   0.98
LR13    −0.587     −0.580     1.25   3.70    −0.623     −0.621     0.28   1.85

phase of the proposed model is a real use case, but it is a small-size design where the execution of the fastest kernels lasts 24 clock cycles. Thus, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
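The two percentages quoted above follow directly from the 4-cycle switch on/off protocol. A minimal sketch, assuming the overhead is expressed as the ratio between the wake-up cycles and the kernel execution cycles (the function name is illustrative):

```python
WAKE_UP_CYCLES = 4  # clock cycles required by each on/off transition


def wake_up_overhead_pct(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    return 100.0 * wake_up_cycles / kernel_cycles


if __name__ == "__main__":
    print(f"zoom, fastest kernel (24 cycles): {wake_up_overhead_pct(24):.1f}%")
    print(f"HEVC 8x8 block (160 cycles):      {wake_up_overhead_pct(160):.1f}%")
```

This gives 16.7% for the 24-cycle zoom kernels and 2.5% for the 160-cycle filtering of an 8 × 8 block.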

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.
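To make the size of these errors concrete, the sketch below recomputes a relative error from the estimated and real overheads of Table 15. It assumes that the Err column is the absolute difference normalized by the real value; this definition is an assumption, and since the inputs are the rounded table entries, the results only approximate the tabulated figures.

```python
def relative_error_pct(estimated, real):
    """Absolute estimation error as a percentage of the real value."""
    return abs(estimated - real) / abs(real) * 100.0


# (estimated, real) power gating overhead in %, taken from Table 15.
TABLE_15 = {
    "LR1": (-30.54, -26.10),
    "LR3": (-21.83, -6.95),
    "LR7": (-15.39, -2.64),
}

if __name__ == "__main__":
    for lr, (est, real) in TABLE_15.items():
        print(f"{lr}: {relative_error_pct(est, real):.2f}%")
```

The recomputed values (about 17%, 214%, and 483%) are in line with the Err column, confirming that estimates obtained without the real switching activity can be off by several times the real overhead.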

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

       PG overhead
LR     Est       Real      Err
LR1    −30.54    −26.10    17.07
LR2    −0.19     −1.99     90.22
LR3    −21.83    −6.95     214.30
LR4    −0.12     −0.67     80.97
LR5    −0.06     −0.59     90.11
LR6    −0.07     0.56      86.60
LR7    −15.39    −2.64     482.93
LR8    −0.38     −0.02     1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only point 1 and point 3. In numbers, it corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 * k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N * (2 * k + 1) hardware simulations are necessary. The only simplification that may be made even without adopting the proposed approach is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
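The counts above can be reproduced with the following sketch, which encodes the 2 * k + 1 rule and the saving for fully combinatorial regions (no clock gated variant is synthesized for them). The function names are illustrative only.

```python
def exhaustive_effort(n_functionalities, n_regions, n_fully_combinatorial=0):
    """Syntheses and simulations needed without the proposed flow.

    One baseline design, one power gated design per LR, and one clock gated
    design per LR that contains sequential logic.
    """
    syntheses = 1 + n_regions + (n_regions - n_fully_combinatorial)
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """The proposed flow needs one synthesis and N simulations."""
    return 1, n_functionalities


if __name__ == "__main__":
    print("FFT :", exhaustive_effort(4, 8, 3), "vs", proposed_effort(4))
    print("Zoom:", exhaustive_effort(7, 13, 1), "vs", proposed_effort(7))
```

For the FFT this returns 14 syntheses and 56 simulations against 1 and 4; for the zoom coprocessor, 26 and 182 against 1 and 7.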

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath are not straightforward and require a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8 (inputs $x(k)$ in bit-reversed order, $k = 0, 4, 2, 6, 1, 5, 3, 7$; outputs $X(n)$, $n = 0, \ldots, 7$). Twiddle factors $w_n^k$ are calculated according to (11).

where $k$ and $n$ are integers depending on the butterfly position in the FFT.

The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $M = 3$, $N = r^M = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design therefore requires three clock periods to produce the outputs.
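As a software illustration of the structure just described, the following minimal sketch runs an iterative radix-2 decimation-in-time FFT and counts the butterflies it executes. It is not the hardware implementation of the paper: the function name is hypothetical, and the twiddle factors use the common convention $e^{-2\pi i j / \text{size}}$, which is an assumption since (11) is not reproduced here.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 DIT FFT; returns (spectrum, number of butterflies)."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n >= 2 and 1 << stages == n, "size must be a power of two"
    # Bit-reversed input ordering (0, 4, 2, 6, 1, 5, 3, 7 for n = 8).
    data = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(n)]
    butterflies = 0
    size = 2
    while size <= n:  # one iteration per stage
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)  # assumed twiddle convention
                a = data[start + j]
                b = data[start + j + half] * w
                data[start + j] = a + b          # upper butterfly output
                data[start + j + half] = a - b   # lower output (the -1 branch)
                butterflies += 1
        size *= 2
    return data, butterflies


spectrum, count = fft_radix2([1, 2, 3, 4, 0, -1, -2, -3])
print(count)  # three stages of four butterflies each
```

For a size-8 input the loop performs three stages of four butterflies, matching the 12 butterflies of the baseline design.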

From the baseline 12-butterfly design, several variants have been derived by decreasing the number of butterflies involved. In this way, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered:

(i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to finalize the transform.

(ii) 4b involves 4 butterflies, for an overall execution latency of 6 cycles.

(iii) 2b involves 2 butterflies, for an overall execution latency of 12 cycles.

(iv) 1b involves 1 single butterfly, for an overall execution latency of 24 cycles.

4.1.2. FFT CGR System Implementation. The abovementioned configurations have been modelled as dataflow networks, and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and power dissipation.

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (y axis: power in μW; x axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; series: Base).

The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations to fit external environment requests. For instance, in a battery operated environment, when the battery level drops below a given threshold, some throughput can be sacrificed to consume less power.
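A run-time policy of this kind could be sketched as follows. The selection rule and the function name are purely hypothetical, since the paper does not prescribe a specific policy. The power figures are the static and internal contributions of Table 5 (net power is not included), and the latencies are those of the four configurations listed above.

```python
# name: (static power [nW], internal power [nW], latency [clock cycles])
CONFIGS = {
    "1b": (1750407, 1307259, 24),
    "2b": (1752106, 1539855, 12),
    "4b": (1757485, 2020953, 6),
    "12b": (1776056, 2411257, 3),
}


def pick_configuration(budget_nw):
    """Return the lowest-latency configuration whose static + internal
    power fits within the budget, or None if no configuration fits."""
    feasible = [
        (latency, name)
        for name, (static, internal, latency) in CONFIGS.items()
        if static + internal <= budget_nw
    ]
    return min(feasible)[1] if feasible else None


# A shrinking power budget progressively trades throughput for power.
for budget in (4_200_000, 3_800_000, 3_500_000, 3_100_000):
    print(budget, pick_configuration(budget))
```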

MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating that region (provided in Table 5). For example, LR2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934

4.1.3. Power Modelling and Hybrid Power Management Assessment. As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline CGR system (without any implemented power saving strategy). From the synthesized design it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether or not power saving techniques lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 8: FFT use case: area percentage per LR (FFT, 90 nm). Area shares: LR1 62.48%, LR2 0.65%, LR3 15.8%, LR4 0.44%, LR5 0.45%, LR6 0.43%, LR7 7.92%, LR8 1.36%. The legend also reports the sequential/combinatorial split of each LR.
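The per-region activation times of Table 6 can be cross-checked against Table 5 with a short script. This is only a sketch of the summation described in Section 4.1.2; the dictionary and function names are hypothetical.

```python
# Activation time and activated LRs of each FFT configuration (Table 5).
T_ON_CONFIG = {"1b": 0.42, "2b": 0.21, "4b": 0.04, "12b": 0.33}
ACTIVE_LRS = {
    "1b": (2, 6, 8),
    "2b": (2, 5, 7, 8),
    "4b": (2, 3, 4, 7),
    "12b": (1, 3, 7, 8),
}


def region_activation_times(t_on, active):
    """T_ON of an LR = sum of the T_ON of the configurations activating it."""
    totals = {}
    for config, regions in active.items():
        for lr in regions:
            totals[lr] = totals.get(lr, 0.0) + t_on[config]
    # Round to two decimals, the precision used in the tables.
    return {lr: round(total, 2) for lr, total in sorted(totals.items())}


print(region_activation_times(T_ON_CONFIG, ACTIVE_LRS))
```

The result reproduces the $T_{ON}$ row of Table 6, for instance 0.67 for LR2 and 0.96 for LR8.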

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration (y axis: power variation in %; x axis: LR1 to LR8; series: Est, Real).

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations obtained by applying (1) and (3) for a prospective power gating implementation, and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG). Values in %.

LR     PG overhead             CG overhead
       Static     Internal     Static     Internal
LR1    -40.87     -9.47        0.00       -11.33
LR2    +0.23      -1.32        0.00       -2.85
LR3    -10.44     -2.11        0.00       -2.59
LR4    -0.08      0.10         -          -
LR5    0.08       0.13         -          -
LR6    0.11       0.17         -          -
LR7    -3.44      -0.39        0.00       -0.79
LR8    1.42       2.35         0.00       -0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

LR     PG overhead                Overhead error           Net
       Est       Real      Err    Iso      Rtn     Contr   Err
LR1    -25.22    -25.37    0.59   5.18     4.64    0.75    1.03
LR3    -6.28     -6.22     1.07   16.34    2.97    0.76    1.26
LR7    -1.92     -1.92     0.24   9.24     1.08    0.83    1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

LR     CG overhead              Overhead      Net
       Est      Real     Err    CGcell Err    Err
LR1    -5.73    -5.77    0.06   0.83          0.18
LR2    -1.42    -1.42    0.12   1.03          0.40
LR3    -1.34    -1.38    0.05   0.84          0.21
LR7    -0.39    -0.39    0.05   0.96          0.22
LR8    -0.12    -0.12    0.29   1.08          0.47

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration (y axis: power variation in %; x axis: LR1, LR2, LR3, LR7, LR8; series: Est, Real).

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
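The selection just described can be mimicked with the short sketch below. It is a simplified reading of the two cases above, not a transcription of Algorithm 1: the function name and the tie-breaking rule are assumptions. The area shares are those of Figure 8, and the estimated overheads are taken from the Est columns of Tables 8 and 9.

```python
# Area share of each LR in % (Figure 8).
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
# Estimated total power gating overhead in % (Table 8, Est column).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
# Estimated total clock gating overhead in % (Table 9, Est column);
# fully combinatorial regions (LR4, LR5, LR6) have no entry.
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}


def select_strategy(area_threshold):
    """Return (power gated, clock gated, unmanaged) lists of LRs."""
    power_gated, clock_gated, unmanaged = [], [], []
    for lr in sorted(AREA):
        cg = CG_EST.get(lr)
        # Power gating is evaluated only for regions above the area threshold.
        pg = PG_EST.get(lr) if AREA[lr] > area_threshold else None
        if pg is not None and pg < 0 and (cg is None or pg < cg):
            power_gated.append(lr)   # PG saves more than CG would
        elif cg is not None and cg < 0:
            clock_gated.append(lr)   # CG still provides a saving
        else:
            unmanaged.append(lr)     # no strategy pays off
    return power_gated, clock_gated, unmanaged


print(select_strategy(5))
print(select_strategy(10))
```

With a 5% threshold the sketch power gates LR1, LR3, and LR7 and clock gates LR2 and LR8; with a 10% threshold LR7 moves to the clock gated set, as in the DAT_5 and DAT_10 cases above.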

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1.

Figure 11: FFT use case: comparison between the base design and the four gated designs (FFT, 90 nm; y axis: power in μW; groups: Static, Internal, Total). Legend shows in brackets the power management area overhead for each design with respect to the Base one: Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesize and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesized designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (-45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of reducing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted (y axis: power in μW; x axis: 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, 12b - 3 CC; series: Base, CG_full, PG_full, DAT_5, DAT_10).

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(\mathrm{ISO_{ON}})$ and $P(\mathrm{ISO_{OFF}})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the net power term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead, which includes the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47%, for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       # actors   T_ON   Functionality                       LRs
abs          1          0.03   Absolute value calculation          (4)
chgb         7          0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10         0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6          0.09   Cubic filter convolution            (5, 7, 8, 9)
median       9          0.06   Median calculation                  (1, 3, 5, 11, 13)
min_max      1          0.01   Maximum/minimum finding             (11)
sbwlabel     17         0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%. DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   Field     LRs
DAT_5    > Th      LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
         PG set    LR1, LR2, LR3, LR8, LR10, LR12
         CG set    LR4, LR6, LR7, LR9, LR11, LR13
         NA        LR5
DAT_10   > Th      LR1, LR8
         PG set    LR1, LR8
         CG set    LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
         NA        LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As the reported histograms show, on average there are approximately two orders of magnitude of difference. Power gating and clock gating prove equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other design (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR | PG saving Est. | PG saving Real | PG Err. | PG Net Err. | CG saving Est. | CG saving Real | CG Err. | CG Net Err.
LR1 | −6.322 | −6.331 | 0.15 | 0.18 | −6.307 | −6.313 | 0.09 | 0.17
LR2 | −23.464 | −23.463 | 0.00 | 1.26 | −23.369 | −23.373 | 0.02 | 1.67
LR3 | −6.537 | −6.544 | 0.10 | 1.65 | −6.507 | −6.514 | 0.10 | 1.59
LR4 | - | - | - | - | −0.958 | −0.964 | 0.68 | 1.34
LR5 | - | - | - | - | - | - | - | -
LR6 | - | - | - | - | −0.870 | −0.878 | 0.81 | 1.22
LR7 | - | - | - | - | −1.033 | −1.039 | 0.63 | 0.15
LR8 | −7.062 | −7.058 | 0.05 | 2.02 | −6.987 | −6.986 | 0.01 | 1.83
LR9 | - | - | - | - | −2.436 | −2.443 | 0.27 | 1.55
LR10 | −5.498 | −5.503 | 0.10 | 1.65 | −5.474 | −5.480 | 0.11 | 1.57
LR11 | - | - | - | - | −3.329 | −3.285 | 1.32 | 3.67
LR12 | −4.278 | −4.288 | 0.23 | 1.83 | −4.267 | −4.273 | 0.15 | 1.86
LR13 | - | - | - | - | −0.649 | −0.656 | 1.01 | 1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. Legend shows in brackets the power management area overhead for each design with respect to the Base one. [Histogram of static (lkg), internal (int), and total (tot) power for Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), and DAT_10 (+3.19%).]

Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead (+6.4%). DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification (the net error stays below 3.7%).
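The error columns can be read as a relative deviation of the estimate from the value measured on the implemented design. The following minimal sketch assumes that definition (the exact formula is not restated here) and assumes that the LR11 clock gating entries of Table 12 read −3.329% (estimated) and −3.285% (real); the function name is illustrative only. Small deviations from the tabulated error are to be expected, since the table entries are rounded.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Percentage error of an estimate with respect to the measured value."""
    if real == 0:
        raise ValueError("the measured value must be non-zero")
    return abs(estimated - real) / abs(real) * 100.0


# LR11, clock gating saving: estimated and real values as read from Table 12
# (the position of the decimal point is an assumption).
lr11_cg_error = relative_error_pct(-3.329, -3.285)
print(f"LR11 CG estimation error: {lr11_cg_error:.2f}%")
```

Under these assumptions the result stays in the neighbourhood of the tabulated LR11 error, which is the largest clock gating error of the 90 nm trial.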

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we assessed the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. Legend shows in brackets the power management area overhead for each design with respect to the Base one. [Histogram of static (lkg), internal (int), and total (tot) power for Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), and DAT_10 (+3.35%).]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.
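The effect of the area threshold on the set of power gating candidates can be illustrated with a small sketch. It is not Algorithm 1 itself: it only mimics the thresholding step, under the assumption that an LR becomes a candidate when its share of the system area exceeds the threshold. Apart from LR5 (0.65% of the system area, as reported for the 90 nm design), the area shares below are hypothetical placeholders, and the function name is illustrative.

```python
def power_gating_candidates(area_share_pct: dict, threshold_pct: float) -> list:
    """Return the LRs whose area share (in percent) exceeds the threshold."""
    return sorted(lr for lr, share in area_share_pct.items() if share > threshold_pct)


# Hypothetical area shares, except LR5 (0.65%) which is taken from the text.
area_share_pct = {"LR1": 12.0, "LR2": 7.5, "LR4": 3.0, "LR5": 0.65}

for threshold in (1, 5, 10):
    candidates = power_gating_candidates(area_share_pct, threshold)
    print(f"threshold {threshold}%: {candidates}")
```

With these placeholder values, lowering the threshold from 10% to 1% progressively enlarges the candidate set, while LR5 is never selected.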

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall savings than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the gap between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even when neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones as reference in this work) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \ast switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application, adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | >Th | PG set | CG set | NA
DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5
DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR | PG saving Est. | PG saving Real | PG Err. | PG Net Err. | CG saving Est. | CG saving Real | CG Err. | CG Net Err.
LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53
LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81
LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91
LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52
LR5 | - | - | - | - | - | - | - | -
LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59
LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49
LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66
LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09
LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91
LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04
LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98
LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85

phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would meet the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
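The two percentages quoted above follow from simple arithmetic. The sketch below assumes that the overhead is computed as the ratio between the 4 wake-up cycles required by the switch on/off protocol and the number of cycles of the fastest kernel; the function and constant names are illustrative only.

```python
WAKE_UP_CYCLES = 4  # cycles required by the switch on/off protocol for one transition


def wake_up_overhead_pct(kernel_cycles: int, wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time expressed as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles * 100.0


zoom_overhead = wake_up_overhead_pct(24)    # fastest zoom kernel
hevc_overhead = wake_up_overhead_pct(160)   # smallest (8 x 8) HEVC interpolation block
print(f"zoom: {zoom_overhead:.1f}%, HEVC: {hevc_overhead:.1f}%")
```

The longer the shortest kernel runs, the smaller the share of time spent waking up its power domain, which is why the overhead becomes negligible for the larger use case.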

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR | PG overhead Est. | PG overhead Real | Err.
LR1 | −30.54 | −26.10 | 17.07
LR2 | −0.19 | −1.99 | 90.22
LR3 | −21.83 | −6.95 | 214.30
LR4 | −0.12 | −0.67 | 80.97
LR5 | −0.06 | −0.59 | 90.11
LR6 | −0.07 | 0.56 | 86.60
LR7 | −15.39 | −2.64 | 482.93
LR8 | −0.38 | −0.02 | 1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2 \ast k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N \ast (2 \ast k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial. This would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
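The counts above can be reproduced with a few lines of code. This is only a bookkeeping sketch of the formulas stated in this section, where fully combinatorial LRs are assumed to be skipped for the clock gating evaluation; function and parameter names are illustrative.

```python
def exhaustive_effort(n_functionalities: int, n_regions: int, n_combinatorial: int = 0):
    """Syntheses and simulations needed when every LR is evaluated by implementation."""
    power_gated = n_regions
    clock_gated = n_regions - n_combinatorial  # combinatorial LRs gain nothing from clock gating
    syntheses = power_gated + clock_gated + 1  # plus the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities: int):
    """Syntheses and simulations needed by the model-based flow (baseline design only)."""
    return 1, n_functionalities


for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom", 7, 13, 1)):
    print(name, "exhaustive:", exhaustive_effort(n, k, comb), "proposed:", proposed_effort(n))
```

With no combinatorial regions the first function reduces to the general $2 \ast k + 1$ syntheses and $N \ast (2 \ast k + 1)$ simulations.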

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow-to-hardware systems composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps further, both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her Ph.D. scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., "FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing," ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, "Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators," ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, "Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case," in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, "A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

FFT configuration   T_ON   LRs            Area [%]   Static [nW]   Internal [nW]
1b                  0.42   (2, 6, 8)      12.86      1750407       1307259
2b                  0.21   (2, 5, 7, 8)   20.81      1752106       1539855
4b                  0.04   (2, 3, 4, 7)   35.28      1757485       2020953
12b                 0.33   (1, 3, 7, 8)   98.03      1776056       2411257

Table 6: FFT use case: logic regions architectural and functional characteristics.

         LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8
T_ON     0.33   0.67   0.37   0.04   0.21   0.42   0.58   0.96
iso(e)   1024   1112   512    2324   1948   1756   256    5259
rtn      1024   518    256    0      0      0      128    512
reg      1024   518    256    0      0      0      128    512
iso(r)   1024   1032   512    2306   1926   1740   256    4934
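The per-region T_ON values of Table 6 appear to follow directly from Table 5: the activation time of an LR equals the sum of the T_ON of the configurations that include it. This relationship is inferred from the tabulated numbers rather than stated in this section, so the following Python sketch should be read as a consistency check under that assumption; the names `CONFIGS` and `lr_on_time` are hypothetical.

```python
# T_ON of each FFT configuration and the logic regions (LRs) it activates,
# transcribed from Table 5.
CONFIGS = {
    "1b": (0.42, {2, 6, 8}),
    "2b": (0.21, {2, 5, 7, 8}),
    "4b": (0.04, {2, 3, 4, 7}),
    "12b": (0.33, {1, 3, 7, 8}),
}


def lr_on_time(configs):
    """Sum, for every LR, the T_ON of the configurations that use it.

    Assumption: an LR is on exactly when one of its configurations runs.
    """
    totals = {}
    for t_on, regions in configs.values():
        for lr in regions:
            totals[lr] = totals.get(lr, 0.0) + t_on
    # Round to two decimals, the precision used in the tables.
    return {f"LR{lr}": round(t, 2) for lr, t in sorted(totals.items())}


LR_T_ON = lr_on_time(CONFIGS)
```

Under this assumption the result reproduces the T_ON row of Table 6, for example 0.67 for LR2, which is active in the 1b, 2b, and 4b configurations.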

Figure 8: FFT use case: area percentage per LR, with the sequential and combinatorial composition of each region (FFT, 90 nm).

a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design it is possible to retrieve the area occupancy and the logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR1: it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so power gating should be a very suitable strategy for this LR. Figure 8 also shows that LR2, LR4, LR5, LR6, and LR8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques actually lead to an effective power saving. Figure 8 also suggests that LR4, LR5, and LR6 cannot benefit from clock gating, since they are fully combinatorial.

Figure 9: FFT use case: comparison between the estimated (Est) and real power variation (%) due to the power gating integration, for LR1 to LR8.

In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due to power gating and clock gating, respectively. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected; the error introduced by neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.

The static and internal power estimations, obtained by applying (1) and (3) for a prospective power gating implementation and (4) and (5) for a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

LR    PG overhead [%]       CG overhead [%]
      Static    Internal    Static    Internal
LR1   −40.87    −9.47       0.00      −11.33
LR2   +0.23     −1.32       0.00      −2.85
LR3   −10.44    −2.11       0.00      −2.59
LR4   −0.08     0.10        n/a       n/a
LR5   0.08      0.13        n/a       n/a
LR6   0.11      0.17        n/a       n/a
LR7   −3.44     −0.39       0.00      −0.79
LR8   1.42      2.35        0.00      −0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

LR    PG overhead [%]           Overhead error [%]       Net [%]
      Est      Real     Err     Iso     Rtn     Contr    Err
LR1   −25.22   −25.37   0.59    5.18    4.64    0.75     1.03
LR3   −6.28    −6.22    1.07    16.34   2.97    0.76     1.26
LR7   −1.92    −1.92    0.24    9.24    1.08    0.83     1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

LR    CG overhead [%]         Overhead [%]   Net [%]
      Est     Real    Err     CGcell Err     Err
LR1   −5.73   −5.77   0.06    0.83           0.18
LR2   −1.42   −1.42   0.12    1.03           0.40
LR3   −1.34   −1.38   0.05    0.84           0.21
LR7   −0.39   −0.39   0.05    0.96           0.22
LR8   −0.12   −0.12   0.29    1.08           0.47

Figure 10: FFT use case: comparison between the estimated (Est) and real power variation (%) due to the clock gating integration, for LR1, LR2, LR3, LR7, and LR8.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing a saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve a power saving if implemented with power gating. In this second case, the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
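The selection just described can be sketched in a few lines of Python. Algorithm 1 itself is defined in Section 3.2.6 and is not reproduced here, so this is only a minimal sketch of the behaviour reported for the two thresholds, not the authors' implementation. The function name `select_strategies` and the data layout are hypothetical; the area shares are transcribed from Figure 8 and the estimated overheads from Tables 8 and 9. The fallback to clock gating for an above-threshold region whose power gating estimate is not convenient is an assumption.

```python
# Area share of each LR (% of the whole design, Figure 8) and estimated total
# power overheads in % (Tables 8 and 9); a negative overhead is a saving.
AREA = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR7": -0.39, "LR8": -0.12}
# LR4, LR5 and LR6 are fully combinatorial, so they have no CG estimate.


def select_strategies(area_threshold):
    """Return (power gated LRs, clock gated LRs) for an area threshold in %."""
    pg_set, cg_set = [], []
    for lr, area in AREA.items():
        cg = CG_EST.get(lr)  # None for fully combinatorial regions
        if area > area_threshold:
            pg = PG_EST.get(lr)  # only tabulated for LR1, LR3 and LR7
            # Power gate only if it saves power and beats clock gating.
            if pg is not None and pg < 0 and (cg is None or pg < cg):
                pg_set.append(lr)
                continue
        # Assumed fallback: clock gate whenever the estimate shows a saving.
        if cg is not None and cg < 0:
            cg_set.append(lr)
    return pg_set, cg_set


dat_5 = select_strategies(5.0)
dat_10 = select_strategies(10.0)
```

With these inputs the sketch returns LR1, LR3, and LR7 as power gated and LR2 and LR8 as clock gated for the 5% threshold, and moves LR7 to the clock gated set for the 10% threshold, matching the two cases above.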

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs in terms of static, internal, and total power (μW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences are visible between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10). Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-size-fits-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption (μW) trade-off for the 4 different 8-size FFT configurations (1b: 24 CC; 2b: 12 CC; 4b: 6 CC; 12b: 3 CC) when gated designs are adopted. Curves: Base, CG_full, PG_full, DAT_5, DAT_10.

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, the power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, because the static and internal values of P(ISO_ON) and P(ISO_OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (not comprehensive of the net contribution) with the real measured overhead, comprehensive of the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.
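The error metric itself is not spelled out in this section. A minimal Python sketch, assuming that Err is the absolute difference between estimated and real overhead relative to the real one, is given below; the function name `relative_error` is hypothetical. Because the tables list rounded overheads, recomputing an error from the printed values may not reproduce every tabulated entry exactly.

```python
def relative_error(estimated, real):
    """Estimation error as a percentage of the measured (real) overhead.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    return abs(estimated - real) / abs(real) * 100.0


# LR1 power gating overhead from Table 8: estimated -25.22%, measured -25.37%.
lr1_pg_error = relative_error(-25.22, -25.37)
```

For LR1 this assumed definition gives about 0.59%, in line with the Err value reported in Table 8.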

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit, accelerating a zoom application, has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.

Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON   Functionality                      LRs
abs          1        0.03   Absolute value calculation         (4)
chgb         7        0.33   Bilevel/grayscale block checking   (5, 6, 7, 12, 13)
cubic        10       0.06   Linear combination calculation     (1, 5, 9, 10, 13)
cubic_conv   6        0.09   Cubic filter convolution           (5, 7, 8, 9)
median       9        0.06   Median calculation                 (1, 3, 5, 11, 13)
min_max      1        0.01   Maximum/minimum finding            (11)
sbwlabel     17       0.42   Edge block checking                (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                         PG set                           CG set                                                 NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13   LR1, LR2, LR3, LR8, LR10, LR12   LR4, LR6, LR7, LR9, LR11, LR13                         LR5
DAT_10   LR1, LR8                                    LR1, LR8                         LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13   LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes, for each kernel, the composition (in terms of number of dataflow actors), the activation time, and the main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that in the zoom coprocessor it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts the static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that in this scenario the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, on average there are more than two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving static power. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR     PG saving [%]               Net [%]   CG saving [%]               Net [%]
       Est       Real      Err     Err       Est       Real      Err     Err
LR1    −6.322    −6.331    0.15    0.18      −6.307    −6.313    0.09    0.17
LR2    −23.464   −23.463   0.00    1.26      −23.369   −23.373   0.02    1.67
LR3    −6.537    −6.544    0.10    1.65      −6.507    −6.514    0.10    1.59
LR4    n/a       n/a       n/a     n/a       −0.958    −0.964    0.68    1.34
LR5    n/a       n/a       n/a     n/a       n/a       n/a       n/a     n/a
LR6    n/a       n/a       n/a     n/a       −0.870    −0.878    0.81    1.22
LR7    n/a       n/a       n/a     n/a       −1.033    −1.039    0.63    0.15
LR8    −7.062    −7.058    0.05    2.02      −6.987    −6.986    0.01    1.83
LR9    n/a       n/a       n/a     n/a       −2.436    −2.443    0.27    1.55
LR10   −5.498    −5.503    0.10    1.65      −5.474    −5.480    0.11    1.57
LR11   n/a       n/a       n/a     n/a       −3.329    −3.285    1.32    3.67
LR12   −4.278    −4.288    0.23    1.83      −4.267    −4.273    0.15    1.86
LR13   n/a       n/a       n/a     n/a       −0.649    −0.656    1.01    1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs in terms of static (lkg), internal (int), and total (tot) power (μW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are the following:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs in terms of static (lkg), internal (int), and total (tot) power (μW). The legend shows in brackets the power management area overhead of each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add in this second trial an additional design, DAT_1. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected on the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes from power gating only 3 LRs, saves up to 61.84% of the total Base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are pretty similar.
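The effect of the internal-to-static ratio on the total saving can be illustrated with a small Python sketch. It assumes that the total variation is the average of the static and internal variations weighted by their baseline shares, with the net power ignored as in the models; the function name `total_variation` and the percentages fed to it are purely illustrative and are not measurements from the paper.

```python
def total_variation(static_var, internal_var, internal_to_static):
    """Total power variation (%) from the static and internal variations (%).

    internal_to_static is the baseline internal/static power ratio.
    Assumption: total = static + internal, net power ignored.
    """
    weight_static = 1.0 / (1.0 + internal_to_static)
    weight_internal = internal_to_static / (1.0 + internal_to_static)
    return weight_static * static_var + weight_internal * internal_var


# Illustrative values only: same internal saving, two different static savings.
gaps = {}
for ratio in (100.0, 10.0):
    strong = total_variation(-50.0, -62.0, ratio)  # large static saving
    weak = total_variation(-15.0, -62.0, ratio)    # small static saving
    gaps[ratio] = weak - strong                    # gap in percentage points
```

With a ratio of 100 the two hypothetical designs differ by roughly a third of a percentage point in total power, while with a ratio of 10 the gap grows to about three points, which is consistent with the observation that static savings weigh more on the total at 45 nm.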

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) * switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch off N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   >Th                                                                PG set                                                 CG set                                                 NA
DAT_1    LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13     LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12    LR6, LR13                                              LR5
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13                          LR1, LR2, LR3, LR8, LR10, LR12                         LR4, LR6, LR7, LR9, LR11, LR13                         LR5
DAT_10   LR1, LR8                                                           LR1, LR8                                               LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13   LR5

Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR     PG saving [%]               Net [%]   CG saving [%]               Net [%]
       Est       Real      Err     Err       Est       Real      Err     Err
LR1    −5.686    −5.699    0.23    0.47      −5.507    −5.501    0.13    0.53
LR2    −22.990   −22.982   0.03    1.02      −22.063   −22.029   0.15    0.81
LR3    −6.569    −6.566    0.04    1.13      −6.273    −6.262    0.18    0.91
LR4    −0.946    −0.944    0.15    1.82      −0.919    −0.917    0.22    1.52
LR5    n/a       n/a       n/a     n/a       n/a       n/a       n/a     n/a
LR6    −0.747    −0.748    0.25    0.25      −0.853    −0.833    0.26    1.59
LR7    −0.955    −0.957    0.17    1.36      −0.935    −0.933    0.21    1.49
LR8    −7.464    −7.460    0.06    1.36      −6.724    −6.714    0.16    0.66
LR9    −2.475    −2.482    0.25    1.10      −2.350    −2.345    0.18    1.09
LR10   −5.473    −5.470    0.04    1.09      −5.270    −5.261    0.17    0.91
LR11   −3.422    −3.361    1.79    2.58      −3.257    −3.205    1.63    2.04
LR12   −4.327    −4.330    0.08    1.19      −4.176    −4.169    0.17    0.98
LR13   −0.587    −0.580    1.25    3.70      −0.623    −0.621    0.28    1.85
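The wake-up overhead percentages quoted in Section 4.3 for the zoom coprocessor and for the HEVC interpolation example can be reproduced with the short Python sketch below. It assumes the overhead is the ratio between the 4 wake-up cycles of the switch on/off protocol and the kernel execution cycles; the names `WAKE_UP_CYCLES` and `wake_up_overhead` are hypothetical.

```python
WAKE_UP_CYCLES = 4  # clock cycles required by the switch on/off protocol [11]


def wake_up_overhead(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time.

    Assumption: overhead = wake-up cycles / kernel execution cycles.
    """
    return 100.0 * wake_up_cycles / kernel_cycles


zoom_overhead = wake_up_overhead(24)    # fastest zoom kernel, 24 cycles
hevc_overhead = wake_up_overhead(160)   # 8 x 8 HEVC interpolation block
```

Under this assumption the fastest zoom kernel pays about 16.7% (the "almost 17%" worst case), while the 160-cycle HEVC block pays 2.5%.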

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR    PG overhead [%]
      Est      Real     Err
LR1   −30.54   −26.10   17.07
LR2   −0.19    −1.99    90.22
LR3   −21.83   −6.95    214.30
LR4   −0.12    −0.67    80.97
LR5   −0.06    −0.59    90.11
LR6   −0.07    0.56     86.60
LR7   −15.39   −2.64    482.93
LR8   −0.38    −0.02    1941.81

Our flow requires only steps (1) and (3). In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N(2k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
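The effort counts above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the MDC tool: the names `Effort`, `effort_with_flow`, and `effort_exhaustive` are hypothetical, and the only inputs are the number of functionalities, the number of LRs, and the number of fully combinatorial LRs (which need no clock gated synthesis).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    syntheses: int
    simulations: int


def effort_with_flow(n_functionalities: int) -> Effort:
    """Proposed flow: one baseline synthesis, one simulation per functionality."""
    return Effort(syntheses=1, simulations=n_functionalities)


def effort_exhaustive(n_functionalities: int, n_regions: int,
                      n_combinatorial: int = 0) -> Effort:
    """Exhaustive exploration: one power gated design per LR, one clock gated
    design per non-combinatorial LR, plus the baseline design; every
    synthesized design is simulated once per functionality."""
    power_gated = n_regions
    clock_gated = n_regions - n_combinatorial
    syntheses = power_gated + clock_gated + 1
    return Effort(syntheses=syntheses,
                  simulations=n_functionalities * syntheses)


if __name__ == "__main__":
    # (name, N, k, fully combinatorial LRs) as stated in the text
    for name, n, k, comb in [("FFT", 4, 8, 3), ("zoom", 7, 13, 1)]:
        print(name, effort_with_flow(n), effort_exhaustive(n, k, comb))
```

With `n_combinatorial = 0` the exhaustive count reduces to the general $2k + 1$ syntheses and $N(2k + 1)$ simulations.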

5. Conclusions

In this paper we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling automated mapping problems. In our studies we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps further both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-fit-to-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R Hartenstein ldquoA decade of reconfigurable computing avisionary retrospectiverdquo in Proceedings of the Design Automa-tion and Test in Europe Conference and Exhibition (DATE rsquo01)pp 642ndash649 March 2001

[2] P Meloni G Tuveri L Raffo et al ldquoSystem adaptivity andfault-tolerance in NoC-based MPSoCs the MADNESS projectapproachrdquo in Proceedings of the 15th Euromicro Conference onDigital System Design (DSD rsquo12) pp 517ndash524 2012

[3] H Esmaeilzadeh E Blem R St Amant K Sankaralingamand D Burger ldquoDark silicon and the end of multicore scalingrdquoin Proceedings of the International Symposium on ComputerArchitecture pp 365ndash376 San Jose Calif USA 2011

[4] M B Taylor ldquoIs dark silicon useful Harnessing the four horse-men of the coming dark silicon apocalypserdquo in Proceedings ofthe 49th Annual Design Automation Conference (DAC rsquo12) pp1131ndash1136 San Francisco Calif USA June 2012

[5] F Oboril J Ewert and M B Tahoori ldquoHigh-resolution onlinepower monitoring for modernmicroprocessorsrdquo in Proceedingsof the Conference on Design Automation and Test in Europe pp265ndash268 March 2015

[6] Synopsys ldquoAdvanced low power techniquesrdquo httpgoogl2FqEjf

[7] S Herbert and D Marculescu ldquoAnalysis of dynamic volt-agefrequency scaling in chip-multiprocessorsrdquo in Proceedingsof the International Symposium on Low Power Electronics andDesign (ISLPED rsquo07) pp 38ndash43 August 2007

[8] S Eyerman and L Eeckhout ldquoFine-grained DVFS using on-chip regulatorsrdquo Transactions on Architecture and Code Opti-mization vol 8 no 1 article 1 2011

[9] M Arora S Manne Y Eckert I Paul N Jayasena and D MTullsen ldquoA comparison of core power gating strategies imple-mented in modern hardwarerdquo in Proceedings of the Conferenceon Measurement and Modeling of Computer Systems pp 559ndash560 2014

26 Journal of Electrical and Computer Engineering

[10] B Jeff ldquoAdvances in bigLITTLE technology for power andenergy savingsrdquo ARMWhite Paper 2012

[11] Power Forward Initiative A Practical Guide to Low PowerDesign 2009

[12] F Palumbo N Carta D Pani P Meloni and L Raffo ldquoThemulti-dataflow composer tool generation of on-the-fly recon-figurable platformsrdquo Journal of Real-Time Image Processing vol9 no 1 pp 233ndash249 2014

[13] N Carta C Sau F Palumbo D Pani and L Raffo ldquoA coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer toolrdquo in Proceedings of the 7th Conference onDesign andArchitectures for Signal and Image Processing (DASIPrsquo13) pp 141ndash148 October 2013

[14] N Carta C Sau D Pani F Palumbo and L Raffo ldquoA coarse-grained reconfigurable approach for low-power spike sortingarchitecturesrdquo in Proceedings of the 6th International IEEEEMBS Conference on Neural Engineering (NER rsquo13) pp 439ndash442 San Diego Calif USA November 2013

[15] D Pani F Usai L Citi and L Raffo ldquoReal-time processing oftfLIFEneural signals on embeddedDSPplatforms a case studyrdquoin Proceedings of the 5th International IEEEEMBS Conferenceon Neural Engineering (NER rsquo11) pp 44ndash47 Cancun MexicoApril 2011

[16] N Carta P Meloni G Tuveri D Pani and L Raffo ldquoA customMPSoC architecture with integrated power management forreal-time neural signal decodingrdquo IEEE Journal on Emergingand Selected Topics in Circuits and Systems vol 4 no 2 pp 230ndash241 2014

[17] F Palumbo C Sau and L Raffo ldquoCoarse-grained reconfigura-tion dataflow-based power managementrdquo IET Computers andDigital Techniques vol 9 no 1 pp 36ndash48 2015

[18] F Palumbo T Fanni C Sau and PMeloni ldquoPower-awarness incoarse-grained reconfigurable multi-functional architectures adataflow based strategyrdquo Journal of Signal Processing Systems2016

[19] D Pani and L Raffo ldquoSelf-coordinated on-chip parallel com-puting a swarm intelligence approachrdquo in Parallel and Dis-tributed Computational Intelligence pp 91ndash112 Springer BerlinGermany 2010

[20] M Yan Z Yang L Liu and S Li ldquoProDFA accelerating domainapplications with a coarse-grained runtime reconfigurablearchitecturerdquo in Proceedings of the 18th IEEE InternationalConference on Parallel andDistributed Systems (ICPADS rsquo12) pp834ndash839 December 2012

[21] S M Carta D Pani and L Raffo ldquoReconfigurable coprocessorfor multimedia application domainrdquo Journal of VLSI SignalProcessing Systems for Signal Image and Video Technology vol44 no 1-2 pp 135ndash152 2006

[22] V V Kumar and J Lach ldquoHighly flexible multimode digitalsignal processing systems using adaptable components andcontrollersrdquo EURASIP Journal on Applied Signal Processing vol2006 no 1 Article ID 079595 pp 1ndash9 2006

[23] B Mei S Vernalde D Verkest H De Man and R LauwereinsldquoExploiting loop-level parallelism on coarse-grained reconfig-urable architectures using modulo schedulingrdquo in Proceedingsof the Design Automation and Test in Europe Conference andExhibition (DATE rsquo03) pp 296ndash301 March 2003

[24] F Palumbo D Pani L Raffo and S Secchi ldquoA surface tensionand coalescence model for dynamic distributed resources allo-cation in massively parallel processors on-chiprdquo in NatureInspired Cooperative Strategies for Optimization (NICSO 2007)

vol 129 of Studies in Computational Intelligence pp 335ndash345Springer Berlin Germany 2007

[25] G Ansaloni K Tanimura L Pozzi and N Dutt ldquoIntegratedkernel partitioning and scheduling for coarse-grained recon-figurable arraysrdquo IEEE Transactions on Computer-Aided Designof Integrated Circuits and Systems vol 31 no 12 pp 1803ndash18162012

[26] C C de Souza AM Lima G Araujo andN BMoreano ldquoThedatapath merging problem in reconfigurable systems complex-ity dual bounds and heuristic evaluationrdquo ACM Journal ofExperimental Algorithmics vol 10 no 22 Article ID 11806132005

[27] R Giorgi and A Scionti ldquoA scalable thread scheduling co-processor based on data-flow principlesrdquo Future GenerationComputer Systems vol 53 pp 100ndash108 2015

[28] L Verdoscia R Vaccaro and R Giorgi ldquoA clockless computingsystem based on the static dataflow paradigmrdquo in Proceedings ofthe 4th Workshop on Data-Flow Execution Models for ExtremeScale Computing (DFM rsquo14) pp 30ndash37 Edmonton CanadaAugust 2014

[29] C Sau L Fanni P Meloni L Raffo and F Palumbo ldquoReconfig-urable coprocessors synthesis in the MPEG-RVC domainrdquo inProceedings of the International Conference on ReConFigurableComputing and FPGAs (ReConFig rsquo15) pp 1ndash8 IEEE RivieraMaya Mexico December 2015

[30] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo in Proceedingsof the 15th Euromicro Conference on Digital System Design (DSDrsquo12) pp 216ndash225 September 2012

[31] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo Microproces-sors and Microsystems vol 37 no 8 pp 1002ndash1019 2013

[32] Y Zhang J Roivainen and A Mammela ldquoClock-gating inFPGAs a novel and comparative evaluationrdquo in Proceedings ofthe 9th Euromicro Conference onDigital SystemDesign Architec-tures Methods and Tools p 590 584 Dubrovnik CroatiaAugust-September 2006

[33] E Bezati S Casale-Brunet M Mattavelli and J W JanneckldquoCoarse grain clock gating of streaming applications in pro-grammable logic implementationsrdquo in Proceedings of the Elec-tronic System Level Synthesis Conference pp 1ndash6 2014

[34] Silicon Integration Initiative Si2 Common Power FormatSpecificationTM-Version 21 2014

[35] M Shafique L Bauer and J Henkel ldquoAdaptive energy manage-ment for dynamically reconfigurable processorsrdquo IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 33 no 1 pp 50ndash63 2014

[36] J Yi and J Kim ldquoPower modeling for digital circuits with clockgatingrdquo IEICE Electronics Express vol 12 no 24 Article ID20150817 2015

[37] H Xu R Vemuri and W-B Jone ldquoRun-time active leakagereduction by power gating and reverse body biasing an energyviewrdquo in Proceedings of the 26th IEEE International Conferenceon Computer Design (ICCD rsquo08) pp 618ndash625 IEEE October2008

[38] K Datta A Mukherjee G Cao et al ldquoCASPER embeddingpower estimation and hardware-controlled powermanagementin a cycle-accurate micro-architecture simulation platform formany-core multi-threading heterogeneous processorsrdquo Journalof Low Power Electronics and Applications vol 2 no 1 pp 30ndash68 2012

Journal of Electrical and Computer Engineering 27

[39] A Chhabra H Rawat M Jain et al ldquoFALPEM framework forarchitectural-level power estimation and optimization for largememory sub-systemsrdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 34 no 7 pp 1138ndash1142 2015

[40] S Li J H Ahn R D Strong J B Brockman DM Tullsen andN P Jouppi ldquoThe McPAT framework for multicore and many-core architectures simultaneously modeling power area andtimingrdquoACMTransactions on Architecture and Code Optimiza-tion vol 10 no 1 article 5 pp 1ndash29 2013

[41] D Zoni andW Fornaciari ldquoModeling DVFS and power-gatingactuators for cycle-accurate NoC-based simulatorsrdquo ACM Jour-nal on Emerging Technologies in Computing Systems vol 12 no3 article 27 pp 1ndash24 2015

[42] C Sau L Raffo F Palumbo E Bezati S Casale-Brunet andM Mattavelli ldquoAutomated design flow for coarse-grained reco-nfigurable platforms an RVC-CAL multi-standard decoderuse-caserdquo in Proceedings of the 14th International Conference onEmbedded Computer Systems Architectures Modeling and Sim-ulation (SAMOS rsquo14) pp 59ndash66 Agios Konstantinos GreeceJuly 2014

[43] Open RVC-CAL Compiler httporccsourceforgenet[44] Cadence Low Power in Encounter RRTL Compiler Product

Version 141 July 2014[45] J W Cooley and J W Tukey ldquoAn algorithm for the machine

calculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[46] Z Hu A Buyuktosunoglu V Srinivasan V Zyuban H Jacob-son and P Bose ldquoMicroarchitectural techniques for powergating of execution unitsrdquo in Proceedings of the InternationalSymposium on Low Power Electronics and Design (ISLPED rsquo04)pp 32ndash37 IEEE August 2004

[47] C M Diniz M Shafique S Bampi and J Henkel ldquoA recon-figurable hardware architecture for fractional pixel interpola-tion in high efficiency video codingrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol34 no 2 pp 238ndash251 2015


Table 7: FFT use case: detailed static and dynamic saving due to power (PG) and clock gating (CG).

LR     PG overhead [%]          CG overhead [%]
       Static     Internal      Static     Internal
LR1    -40.87     -9.47         0.00       -11.33
LR2    +0.23      -1.32         0.00       -2.85
LR3    -10.44     -2.11         0.00       -2.59
LR4    -0.08      0.10          -          -
LR5    0.08       0.13          -          -
LR6    0.11       0.17
LR7    -3.44      -0.39         0.00       -0.79
LR8    1.42       2.35          0.00       -0.25

Table 8: FFT use case: power gating overhead estimation step accuracy.

LR     PG overhead [%]               Overhead error [%]            Net [%]
       Est       Real      Err       Iso       Rtn       Contr     Err
LR1    -25.22    -25.37    0.59      5.18      4.64      0.75      1.03
LR3    -6.28     -6.22     1.07      16.34     2.97      0.76      1.26
LR7    -1.92     -1.92     0.24      9.24      1.08      0.83      1.36

Table 9: FFT use case: clock gating overhead estimation step accuracy.

LR     CG overhead [%]               Overhead [%]     Net [%]
       Est       Real      Err       CGcell Err       Err
LR1    -5.73     -5.77     0.06      0.83             0.18
LR2    -1.42     -1.42     0.12      1.03             0.40
LR3    -1.34     -1.38     0.05      0.84             0.21
LR7    -0.39     -0.39     0.05      0.96             0.22
LR8    -0.12     -0.12     0.29      1.08             0.47

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration. [Bar chart "Power consumption variation after clock gating integration"; y-axis: power variation (%); x-axis: LR1, LR2, LR3, LR7, LR8; series: Est, Real.]

that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR.

Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation:

(i) DAT_5: threshold set to 5%. The regions with area above 5% are LR1, LR3, and LR7, so the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving when clock gated are LR2 and LR8, since LR4, LR5, and LR6 are fully combinatorial. Thus, clock gating will be implemented only on LR2 and LR8.

(ii) DAT_10: threshold set to 10%. Only LR1 and LR3 are above the area threshold and, as for DAT_5, they both achieve power saving if implemented with power gating strategies. In this second case the clock gating overhead estimation step is performed also on LR7, which results in a negative overhead. The regions to be clock gated are then LR2, LR7, and LR8, while LR4, LR5, and LR6 are again discarded.
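The two threshold scenarios can be retraced with a small sketch of the selection logic as it is described in the text. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the function name `select_strategies` is hypothetical, the area percentages are invented placeholders chosen only to respect the thresholds mentioned above (the text states just that LR1 exceeds 60% of the area), and a region rejected for power gating is assumed to fall back to the clock gating evaluation. The overhead estimates are the Est values read from Tables 8 and 9; `None` marks a fully combinatorial region with no clock to gate.

```python
from typing import Dict, Optional, Set, Tuple


def select_strategies(
    area_pct: Dict[str, float],
    pg_overhead: Dict[str, float],
    cg_overhead: Dict[str, Optional[float]],
    area_threshold: float,
) -> Tuple[Set[str], Set[str]]:
    """Return the (power gated, clock gated) sets of logic regions."""
    power_gated: Set[str] = set()
    clock_gated: Set[str] = set()
    for lr, area in area_pct.items():
        cg = cg_overhead.get(lr)  # None: fully combinatorial region
        if area > area_threshold:
            pg = pg_overhead[lr]
            # Power gate only if it saves power (negative overhead) and
            # saves more than a prospective clock gating implementation.
            if pg < 0 and (cg is None or pg < cg):
                power_gated.add(lr)
                continue
        # Remaining regions: clock gate only when it yields a saving.
        if cg is not None and cg < 0:
            clock_gated.add(lr)
    return power_gated, clock_gated


# Hypothetical area shares [%]; only their position w.r.t. 5% and 10% matters.
AREA = {"LR1": 64.0, "LR2": 4.0, "LR3": 15.0, "LR4": 2.0,
        "LR5": 2.0, "LR6": 2.0, "LR7": 8.0, "LR8": 3.0}
# Estimated overheads [%] (Est columns of Tables 8 and 9).
PG_EST = {"LR1": -25.22, "LR3": -6.28, "LR7": -1.92}
CG_EST = {"LR1": -5.73, "LR2": -1.42, "LR3": -1.34, "LR4": None,
          "LR5": None, "LR6": None, "LR7": -0.39, "LR8": -0.12}

if __name__ == "__main__":
    for threshold in (5.0, 10.0):
        pg_set, cg_set = select_strategies(AREA, PG_EST, CG_EST, threshold)
        print(threshold, sorted(pg_set), sorted(cg_set))
```

Under these assumptions the sketch returns the same partitions discussed for DAT_5 and DAT_10; power gating estimates are only looked up for regions above the threshold, mirroring the fact that the estimation step is skipped for the others.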

To assess the proposed flow, five designs have been assembled:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where power gating is applied blindly to all the regions.

(iii) CG_full: the CGR design where clock gating is applied blindly to all the regions.

(iv) DAT_5: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 5% in Algorithm 1.

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the AreaThreshold to 10% in Algorithm 1.

Figure 11: FFT use case (90 nm): comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one. [Bar chart of static, internal, and total power for Base, CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), and DAT_10 (+1.97%).]

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences between the entirely power gated design (PG_full) and the hybrid clock and power gated ones (DAT_5 and DAT_10) are visible. Nevertheless, DAT_5 achieves the largest power saving (-45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, shows that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present about half of the area overhead of PG_full. CG_full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than those of DAT_5 and DAT_10).

For the sake of completeness, Figure 12 illustrates the trade-off levels between power and latency (and, in turn, throughput) for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations when gated designs are adopted. [Plot of power for Base, CG_full, PG_full, DAT_5, and DAT_10 over the configurations 1b - 24 CC, 2b - 12 CC, 4b - 6 CC, and 12b - 3 CC.]

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed in Tables 8 and 9, which consider, respectively, the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. Errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, resp.). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(\mathrm{ISO}_{\mathrm{ON}})$ and $P(\mathrm{ISO}_{\mathrm{OFF}})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case estimations are even more accurate: the error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column Net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (not including the net contribution) with the real measured overhead, including the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.

Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel       Actors   T_ON    Functionality                       LRs
abs          1        0.03    Absolute value calculation          (4)
chgb         7        0.33    Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic        10       0.06    Linear combination calculation      (1, 5, 9, 10, 13)
cubic_conv   6        0.09    Cubic filter convolution            (5, 7, 8, 9)
median       9        0.06    Median calculation                  (1, 3, 5, 11, 13)
min_max      1        0.01    Maximum/minimum finding             (11)
sbwlabel     17       0.42    Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design   > Th                                        PG set                           CG set                                                  NA
DAT_5    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13   LR1, LR2, LR3, LR8, LR10, LR12   LR4, LR6, LR7, LR9, LR11, LR13                          LR5
DAT_10   LR1, LR8                                    LR1, LR8                         LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13    LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels' composition (in terms of number of dataflow actors), activation time, and main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, on average there are approximately two orders of magnitude of difference. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, even if power gated, it could not provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR      PG saving [%]                  Net [%]   CG saving [%]                  Net [%]
        Est        Real       Err      Err       Est        Real       Err      Err
LR1     -6.322     -6.331     0.15     0.18      -6.307     -6.313     0.09     0.17
LR2     -23.464    -23.463    0.00     1.26      -23.369    -23.373    0.02     1.67
LR3     -6.537     -6.544     0.10     1.65      -6.507     -6.514     0.10     1.59
LR4     -          -          -        -         -0.958     -0.964     0.68     1.34
LR5     -          -          -        -         -          -          -        -
LR6     -          -          -        -         -0.870     -0.878     0.81     1.22
LR7     -          -          -        -         -1.033     -1.039     0.63     0.15
LR8     -7.062     -7.058     0.05     2.02      -6.987     -6.986     0.01     1.83
LR9     -          -          -        -         -2.436     -2.443     0.27     1.55
LR10    -5.498     -5.503     0.10     1.65      -5.474     -5.480     0.11     1.57
LR11    -          -          -        -         -3.329     -3.285     1.32     3.67
LR12    -4.278     -4.288     0.23     1.83      -4.267     -4.273     0.15     1.86
LR13    -          -          -        -         -0.649     -0.656     1.01     1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one. [Bar chart of static (lkg), internal (int), and total (tot) power for Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), and DAT_10 (+3.19%).]

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power consumption. When power gating is applied, there is always a benefit in terms of static power consumption. The DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher saving results.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario when considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (-62.38% for DAT_10 versus -62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering, respectively, power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation). PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.
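As a sanity check on how the Err columns can be read, the sketch below recomputes a relative estimation error for a few Table 12 rows. It rests on two assumptions that the text does not state explicitly: that the error is the relative deviation $|\mathrm{Est} - \mathrm{Real}| / |\mathrm{Real}|$ expressed in percent, and that the saving values carry three decimal digits (e.g., -6.322). The names `ROWS`, `relative_error_pct`, and `recompute` are hypothetical.

```python
# (region, strategy) -> (estimated saving, real saving, reported error),
# all in percent, as read from Table 12.
ROWS = {
    ("LR1", "PG"): (-6.322, -6.331, 0.15),
    ("LR12", "PG"): (-4.278, -4.288, 0.23),
    ("LR4", "CG"): (-0.958, -0.964, 0.68),
    ("LR11", "CG"): (-3.329, -3.285, 1.32),
    ("LR13", "CG"): (-0.649, -0.656, 1.01),
}


def relative_error_pct(estimated: float, real: float) -> float:
    """Relative deviation of the estimate from the real value, in percent."""
    return abs(estimated - real) / abs(real) * 100.0


def recompute(rows):
    """Recompute the error of every row from its Est and Real entries."""
    return {key: relative_error_pct(est, real)
            for key, (est, real, _reported) in rows.items()}


if __name__ == "__main__":
    for key, value in recompute(ROWS).items():
        reported = ROWS[key][2]
        print(key, f"recomputed={value:.2f}", f"reported={reported:.2f}")
```

Under these assumptions the recomputed values stay within about 0.1 percentage points of the reported ones; the residual gap is compatible with the table entries being rounded before the error is recomputed here.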

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs.

(i) Base: the same as in the 90 nm synthesis trial.
(ii) PG_full: the same as in the 90 nm synthesis trial.
(iii) CG_full: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology. Comparison between the base design and the five gated designs in terms of leakage (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the area threshold to 1% in Algorithm 1.
(v) DAT_5: the same as in the 90 nm synthesis trial.
(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add to this second trial an additional design, DAT_1. By setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is therefore partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full provides better overall savings than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve optimal results. DAT_1, which excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, as the area threshold is lowered, the area of the optimal design and that of the fully power gated one become quite similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm becomes less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.
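The role of the area threshold can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the selection rule below (power gate a region only if its area share exceeds the threshold and its estimated power gating saving is negative, otherwise fall back to clock gating for sequential regions) is an assumption inferred from the ">Th", "PG set", "CG set", and "NA" columns of Tables 11 and 13, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    """Hypothetical per-region record; field names are illustrative only."""
    name: str
    area_pct: float     # share of the total design area, in percent
    pg_delta: float     # estimated power change with power gating (negative = saving)
    cg_delta: float     # estimated power change with clock gating (negative = saving)
    sequential: bool    # False for fully combinatorial regions


def assign_strategy(regions, area_threshold_pct):
    """Split regions into power gated (PG), clock gated (CG), and not assigned (NA)."""
    plan = {"PG": [], "CG": [], "NA": []}
    for lr in regions:
        if lr.area_pct > area_threshold_pct and lr.pg_delta < 0:
            plan["PG"].append(lr.name)      # large enough and power gating pays off
        elif lr.sequential and lr.cg_delta < 0:
            plan["CG"].append(lr.name)      # clock gating still saves dynamic power
        else:
            plan["NA"].append(lr.name)      # left in the always-on domain
    return plan


# Invented example values, not taken from the paper.
regions = [
    LogicRegion("A", 40.0, -20.0, -15.0, True),
    LogicRegion("B", 3.0, -2.0, -1.5, True),
    LogicRegion("C", 8.0, 0.5, -0.6, True),
    LogicRegion("D", 2.0, 0.2, 0.0, False),
]

plan_5 = assign_strategy(regions, area_threshold_pct=5.0)
plan_1 = assign_strategy(regions, area_threshold_pct=1.0)
print(plan_5)
print(plan_1)
```

Lowering the threshold from 5% to 1% moves the small region "B" from the clock gated set to the power gated set, which mirrors the way DAT_1 power gates more regions than DAT_5.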

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).
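As a quick check of how such errors can be read, the sketch below computes a relative estimation error. The formula (absolute difference divided by the real value) is an assumption, since the error definition is not restated here; the input values are the LR1 power gating entries of Table 14, and the function name is hypothetical.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative estimation error in percent (assumed definition)."""
    if real == 0:
        raise ZeroDivisionError("real value must be non-zero")
    return abs(estimated - real) / abs(real) * 100.0


# LR1, power gating saving: estimated -5.686, real -5.699 (Table 14).
err_lr1 = relative_error_pct(-5.686, -5.699)
print(f"{err_lr1:.2f}%")  # 0.23%
```

Under this assumed definition the result agrees with the 0.23 reported for LR1 in the table.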

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \cdot \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch off $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
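The break-even argument above can be sketched numerically. The switch term follows the product given in the text; the saved-leakage model (domain leakage times the fraction of time the domain is off, with ideal gating) and every numeric value are assumptions made only for illustration.

```python
def switch_leakage(p_lkg_sw: float, n_switches: int) -> float:
    """Always-present leakage of the power switches: P_lkg(SW) * switches."""
    return p_lkg_sw * n_switches


def gating_pays_off(p_lkg_domain: float, off_fraction: float,
                    p_lkg_sw: float, n_switches: int) -> bool:
    """True if the leakage saved while the domain is off exceeds the switch leakage.

    Assumes ideal gating: no residual leakage in the gated domain while it is off.
    """
    saved = p_lkg_domain * off_fraction
    return saved > switch_leakage(p_lkg_sw, n_switches)


# Invented values (microwatts): one column of 10 header switches, a domain
# leaking 20 uW that stays off two thirds of the time.
overhead = switch_leakage(p_lkg_sw=0.05, n_switches=10)
worth_it = gating_pays_off(p_lkg_domain=20.0, off_fraction=0.67,
                           p_lkg_sw=0.05, n_switches=10)
print(overhead, worth_it)
```

With a single column of switches the overhead stays small compared with the leakage of a sizeable domain, which is the situation described in the paragraph above.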

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | >Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |

Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step.

| LR | PG saving Est. | PG saving Real | PG Err. (%) | PG Net Err. (%) | CG saving Est. | CG saving Real | CG Err. (%) | CG Net Err. (%) |
|---|---|---|---|---|---|---|---|---|
| LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53 |
| LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81 |
| LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91 |
| LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59 |
| LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49 |
| LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66 |
| LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09 |
| LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91 |
| LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04 |
| LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98 |
| LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85 |

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
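The two wake-up overhead figures quoted above follow from dividing the 4-cycle wake-up time by the kernel execution time. A minimal sketch of that arithmetic (the function name is hypothetical; the 4-cycle default comes from the switch on/off protocol described in Section 4.3):

```python
def wakeup_overhead(kernel_cycles: int, wakeup_cycles: int = 4) -> float:
    """Wake-up time as a fraction of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wakeup_cycles / kernel_cycles


zoom_overhead = wakeup_overhead(24)    # fastest zoom kernel: 24 clock cycles
hevc_overhead = wakeup_overhead(160)   # smallest 8 x 8 HEVC block: 160 cycles
print(f"{zoom_overhead:.1%}")  # 16.7%
print(f"{hevc_overhead:.1%}")  # 2.5%
```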

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity.

| LR | PG overhead Est. | PG overhead Real | Err. (%) |
|---|---|---|---|
| LR1 | −30.54 | −26.10 | 17.07 |
| LR2 | −0.19 | −1.99 | 90.22 |
| LR3 | −21.83 | −6.95 | 214.30 |
| LR4 | −0.12 | −0.67 | 80.97 |
| LR5 | −0.06 | −0.59 | 90.11 |
| LR6 | −0.07 | 0.56 | 86.60 |
| LR7 | −15.39 | −2.64 | 482.93 |
| LR8 | −0.38 | −0.02 | 1941.81 |

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
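The effort counts above can be reproduced with a short sketch. The function names are hypothetical; the formulas are the ones stated in the text, with the fully combinatorial regions removed from the clock gating evaluations.

```python
def exhaustive_effort(n_functionalities: int, n_regions: int,
                      n_combinatorial: int = 0) -> tuple[int, int]:
    """Syntheses and simulations needed without the proposed flow."""
    pg_designs = n_regions                      # one power gated design per LR
    cg_designs = n_regions - n_combinatorial    # no clock gating for combinatorial LRs
    syntheses = pg_designs + cg_designs + 1     # plus the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities: int) -> tuple[int, int]:
    """Syntheses and simulations needed with the proposed flow."""
    return 1, n_functionalities


fft = exhaustive_effort(n_functionalities=4, n_regions=8, n_combinatorial=3)
zoom = exhaustive_effort(n_functionalities=7, n_regions=13, n_combinatorial=1)
print(fft, proposed_effort(4))    # (14, 56) (1, 4)
print(zoom, proposed_effort(7))   # (26, 182) (1, 7)
```

With no fully combinatorial regions the first function reduces to the general $2k + 1$ syntheses and $N(2k + 1)$ simulations.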

5. Conclusions

In this paper we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow-to-hardware system composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-fit-to-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 22, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™-Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter® RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Figure 11: FFT use case at 90 nm: comparison between the base design and the four gated designs in terms of static, internal, and total power. The legend shows in brackets the power management area overhead for each design with respect to the Base one: CG_full (+0.03%), PG_full (+4.66%), DAT_5 (+2.18%), DAT_10 (+1.97%).

(v) DAT_10: derived with the proposed automated flow, capable of hybrid power and clock gating support, setting the area threshold to 10% in Algorithm 1.

These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesize and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT_5 and DAT_10.

The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesized designs. Since LR1 occupies more than 60% of the design area and is mainly combinatorial, only small differences between the entirely power gated design (PG_full) and the hybrid clock and power gating ones (DAT_5 and DAT_10) are visible. Nevertheless, DAT_5 achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG_full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.

The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT_5 and DAT_10 present half of the area overhead of PG_full. CG_full data confirm that clock gating has a very small impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT_5 and DAT_10).

For the sake of completeness, in Figure 12 the trade-off levels between power and latency (and, in turn, throughput) are illustrated for all the considered designs. The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations (1b, 24 CC; 2b, 12 CC; 4b, 6 CC; 12b, 3 CC) when gated designs are adopted. Plotted designs: Base, CG_full, PG_full, DAT_5, DAT_10.

4.1.4. Accuracy and Errors. The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, respectively considering the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1. Each row of these tables reports the estimation errors with respect to the real consumption of the baseline CGR system where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.

Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of $P(\mathrm{ISO}_{\mathrm{ON}})$ and $P(\mathrm{ISO}_{\mathrm{OFF}})$ are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.

Table 9 gives an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.

Tables 8 and 9 also report the errors caused by omitting the power net term (column net (Err)) in (3) and (5). This error is obtained by comparing the estimated overhead (not including the net contribution) with the real measured overhead, including the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (1.36% for LR7) than in the clock gating one (at most 0.47% for LR8), since power gating requires more additional logic to be implemented.

4.2. Validation Phase. In order to validate the proposed approach, a second use case has been assessed, targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit accelerating a zoom application has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.


Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Kernel      Actors  T_ON  Functionality                     LRs
abs         1       0.03  Absolute value calculation        (4)
chgb        7       0.33  Bilevel/grayscale block checking  (5, 6, 7, 12, 13)
cubic       10      0.06  Linear combination calculation    (1, 5, 9, 10, 13)
cubic_conv  6       0.09  Cubic filter convolution          (5, 7, 8, 9)
median      9       0.06  Median calculation                (1, 3, 5, 11, 13)
min_max     1       0.01  Maximum/minimum finding           (11)
sbwlabel    17      0.42  Edge block checking               (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%; NA stands for not assigned and includes those LRs that are placed in the always on domain.

Design  >Th                                        PG set                          CG set                                                NA
DAT_5   LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13  LR1, LR2, LR3, LR8, LR10, LR12  LR4, LR6, LR7, LR9, LR11, LR13                        LR5
DAT_10  LR1, LR8                                   LR1, LR8                        LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13  LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application scales an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.
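As an illustrative sketch (not part of the MDC flow), the kernel-to-LR mapping of Table 10 can be turned into a per-region activity estimate. It assumes that the T_ON values are fractions of the overall execution time and that kernels never run concurrently, so that the activity of an LR is the sum of the T_ON values of the kernels using it. The dictionary layout and the function name are hypothetical.

```python
# Kernel data as read from Table 10: (activation time T_ON, LRs used).
# Assumption: T_ON is the fraction of execution time in which a kernel is active.
KERNELS = {
    "abs": (0.03, {4}),
    "chgb": (0.33, {5, 6, 7, 12, 13}),
    "cubic": (0.06, {1, 5, 9, 10, 13}),
    "cubic_conv": (0.09, {5, 7, 8, 9}),
    "median": (0.06, {1, 3, 5, 11, 13}),
    "min_max": (0.01, {11}),
    "sbwlabel": (0.42, {2, 4, 5, 6, 13}),
}


def lr_active_fraction(kernels):
    """Sum, for every LR, the T_ON of the kernels that use it."""
    active = {}
    for t_on, lrs in kernels.values():
        for lr in lrs:
            active[lr] = active.get(lr, 0.0) + t_on
    # Round to two decimals, the precision of the table entries.
    return {lr: round(frac, 2) for lr, frac in sorted(active.items())}


if __name__ == "__main__":
    for lr, frac in lr_active_fraction(KERNELS).items():
        print(f"LR{lr}: active {frac:.2f}, idle {1 - frac:.2f}")
```

Under these assumptions, regions that are idle for most of the time are the natural candidates for gating, while a region shared by most kernels is almost always active.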

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.
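The composition reported in Table 11 can be cross-checked with a few set operations. The following is a minimal sketch, assuming that the PG, CG, and NA sets partition the 13 LRs and that only LRs above the area threshold are candidates for power gating; the data layout and the `check_design` helper are hypothetical and not part of the proposed flow.

```python
ALL_LRS = set(range(1, 14))  # LR1 .. LR13

# Composition of the hybrid designs as read from Table 11 (90 nm).
DESIGNS = {
    "DAT_5": {
        "above_threshold": {1, 2, 3, 6, 8, 10, 12, 13},
        "PG": {1, 2, 3, 8, 10, 12},
        "CG": {4, 6, 7, 9, 11, 13},
        "NA": {5},
    },
    "DAT_10": {
        "above_threshold": {1, 8},
        "PG": {1, 8},
        "CG": {2, 3, 4, 6, 7, 9, 10, 11, 12, 13},
        "NA": {5},
    },
}


def check_design(design):
    """Return the LRs that passed the area threshold but were not power gated."""
    pg, cg, na = design["PG"], design["CG"], design["NA"]
    # The three sets must be pairwise disjoint and cover all 13 LRs.
    assert pg | cg | na == ALL_LRS
    assert len(pg) + len(cg) + len(na) == len(ALL_LRS)
    # Assumption: only LRs above the area threshold can be power gated.
    assert pg <= design["above_threshold"]
    return sorted(design["above_threshold"] - pg)


if __name__ == "__main__":
    for name, design in DESIGNS.items():
        print(name, "above threshold but clock gated:", check_design(design))
```

With the Table 11 data, LR6 and LR13 pass the 5% threshold in DAT_5 but end up in the clock gated set, whereas in DAT_10 every region above the threshold is power gated.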

Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, the difference is on average more than two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR    PG saving (%)             Net (%)  CG saving (%)             Net (%)
      Est      Real     Err     Err      Est      Real     Err     Err
LR1   −6.322   −6.331   0.15    0.18     −6.307   −6.313   0.09    0.17
LR2   −23.464  −23.463  0.00    1.26     −23.369  −23.373  0.02    1.67
LR3   −6.537   −6.544   0.10    1.65     −6.507   −6.514   0.10    1.59
LR4   -        -        -       -        −0.958   −0.964   0.68    1.34
LR5   -        -        -       -        -        -        -       -
LR6   -        -        -       -        −0.870   −0.878   0.81    1.22
LR7   -        -        -       -        −1.033   −1.039   0.63    0.15
LR8   −7.062   −7.058   0.05    2.02     −6.987   −6.986   0.01    1.83
LR9   -        -        -       -        −2.436   −2.443   0.27    1.55
LR10  −5.498   −5.503   0.10    1.65     −5.474   −5.480   0.11    1.57
LR11  -        -        -       -        −3.329   −3.285   1.32    3.67
LR12  −4.278   −4.288   0.23    1.83     −4.267   −4.273   0.15    1.86
LR13  -        -        -       -        −0.649   −0.656   1.01    1.13

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. Bars are grouped by static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).

Base. This is due to the 12 clock gating cells (one for each region but LR5) introduced in the always on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead: +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead: +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages of the considered domains are evaluated for power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.
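The error figures can be re-derived from the estimated and real savings. The sketch below assumes that the Err column is the relative error of the estimate with respect to the real value, which is not stated explicitly here; the `relative_error` helper is hypothetical. Since the table entries are rounded, recomputed errors may differ from the published ones by a few hundredths of a percentage point.

```python
def relative_error(est, real):
    """Percentage error of an estimated saving with respect to the real one."""
    return abs(est - real) / abs(real) * 100.0


# (LR, estimated PG saving, real PG saving), in percent, as read from Table 12.
PG_ROWS = [
    ("LR1", -6.322, -6.331),
    ("LR2", -23.464, -23.463),
    ("LR3", -6.537, -6.544),
    ("LR8", -7.062, -7.058),
    ("LR10", -5.498, -5.503),
    ("LR12", -4.278, -4.288),
]

if __name__ == "__main__":
    for lr, est, real in PG_ROWS:
        print(f"{lr}: {relative_error(est, real):.2f}%")
```

All the recomputed power gating errors stay below the 0.3% bound quoted in the text.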

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. The implemented designs are listed below:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.


Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. Bars are grouped by static (lkg), internal (int), and total (tot) power. The legend shows in brackets the power management area overhead of each design with respect to the Base one: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, its presence in the automated flow makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and it is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \ast \mathit{switches}$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%; NA stands for not assigned and includes those LRs that are placed in the always on domain.

Design  >Th                                                             PG set                                               CG set                                                NA
DAT_1   LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13  LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12  LR6, LR13                                             LR5
DAT_5   LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13                       LR1, LR2, LR3, LR8, LR10, LR12                       LR4, LR6, LR7, LR9, LR11, LR13                        LR5
DAT_10  LR1, LR8                                                        LR1, LR8                                             LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13  LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR    PG saving (%)             Net (%)  CG saving (%)             Net (%)
      Est      Real     Err     Err      Est      Real     Err     Err
LR1   −5.686   −5.699   0.23    0.47     −5.507   −5.501   0.13    0.53
LR2   −22.990  −22.982  0.03    1.02     −22.063  −22.029  0.15    0.81
LR3   −6.569   −6.566   0.04    1.13     −6.273   −6.262   0.18    0.91
LR4   −0.946   −0.944   0.15    1.82     −0.919   −0.917   0.22    1.52
LR5   -        -        -       -        -        -        -       -
LR6   −0.747   −0.748   0.25    0.25     −0.853   −0.833   0.26    1.59
LR7   −0.955   −0.957   0.17    1.36     −0.935   −0.933   0.21    1.49
LR8   −7.464   −7.460   0.06    1.36     −6.724   −6.714   0.16    0.66
LR9   −2.475   −2.482   0.25    1.10     −2.350   −2.345   0.18    1.09
LR10  −5.473   −5.470   0.04    1.09     −5.270   −5.261   0.17    0.91
LR11  −3.422   −3.361   1.79    2.58     −3.257   −3.205   1.63    2.04
LR12  −4.327   −4.330   0.08    1.19     −4.176   −4.169   0.17    0.98
LR13  −0.587   −0.580   1.25    3.70     −0.623   −0.621   0.28    1.85

phase of the proposed model is a real use case, but it is a small size design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block and the wake-up time overhead is just 2.5% of the total execution time.
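The two percentages can be reproduced with a one-line computation. This is a minimal sketch, assuming that the overhead is the ratio between the 4 wake-up cycles required by the switch on/off protocol and the execution cycles of the kernel; the function and constant names are hypothetical.

```python
WAKE_UP_CYCLES = 4  # cycles required by the switch on/off protocol [11]


def wake_up_overhead(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    return wake_up_cycles / kernel_cycles * 100.0


print(f"zoom, fastest kernel (24 cycles): {wake_up_overhead(24):.1f}%")  # 16.7%
print(f"HEVC 8 x 8 block (160 cycles):    {wake_up_overhead(160):.1f}%")  # 2.5%
```

The overhead shrinks linearly with the kernel execution time, which is why it becomes negligible for kernels lasting hundreds of cycles.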

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR    PG overhead (%)
      Est      Real     Err
LR1   −30.54   −26.10   17.07
LR2   −0.19    −1.99    90.22
LR3   −21.83   −6.95    214.30
LR4   −0.12    −0.67    80.97
LR5   −0.06    −0.59    90.11
LR6   −0.07    0.56     86.60
LR7   −15.39   −2.64    482.93
LR8   −0.38    −0.02    1941.81

optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
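The counts above follow directly from $N$, $k$, and the number of fully combinatorial LRs. The following sketch reproduces them; the function names are hypothetical, and the formula simply encodes the counting rule described in this section (one power gated design per LR, one clock gated design per sequential LR, plus the baseline, each simulated once per functionality).

```python
def exhaustive_effort(n_funcs, k_lrs, n_combinatorial=0):
    """Syntheses and simulations needed without the proposed flow."""
    # k power gated designs, one clock gated design per sequential LR,
    # plus the baseline design.
    syntheses = k_lrs + (k_lrs - n_combinatorial) + 1
    # Every synthesized design is simulated once per functionality.
    simulations = n_funcs * syntheses
    return syntheses, simulations


def proposed_effort(n_funcs):
    """Syntheses and simulations needed by the proposed flow."""
    return 1, n_funcs


print("FFT: ", exhaustive_effort(4, 8, 3), "versus", proposed_effort(4))
print("zoom:", exhaustive_effort(7, 13, 1), "versus", proposed_effort(7))
```

For the two use cases, the sketch returns 14 syntheses and 56 simulations for the FFT and 26 syntheses and 182 simulations for the zoom coprocessor, against 1 synthesis and 4 or 7 simulations with the proposed flow.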

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for tackling the mapping problem in an automated way. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow to hardware systems composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps further both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to the Sardinia Regional Government for supporting her Ph.D. scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo, et al., "System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, "High-resolution online power monitoring for modern microprocessors," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, "Advanced low power techniques," http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, "Fine-grained DVFS using on-chip regulators," Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, "A comparison of core power gating strategies implemented in modern hardware," in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, "Advances in big.LITTLE technology for power and energy savings," ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, "The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms," Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, "A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool," in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP '13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, "A coarse-grained reconfigurable approach for low-power spike sorting architectures," in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER '13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, "Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study," in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER '11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, "A custom MPSoC architecture with integrated power management for real-time neural signal decoding," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, "Coarse-grained reconfiguration: dataflow-based power management," IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, "Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy," Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, "Self-coordinated on-chip parallel computing: a swarm intelligence approach," in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, "ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture," in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, "Reconfigurable coprocessor for multimedia application domain," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, "Highly flexible multimode digital signal processing systems using adaptable components and controllers," EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, "A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, "Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, "The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation," ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, "A clockless computing system based on the static dataflow paradigm," in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM '14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, "Reconfigurable coprocessors synthesis in the MPEG-RVC domain," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino, et al., "ASAM: automatic architecture synthesis and application mapping," in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD '12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino, et al., "ASAM: automatic architecture synthesis and application mapping," Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: a novel and comparative evaluation," in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, "Coarse grain clock gating of streaming applications in programmable logic implementations," in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, "Adaptive energy management for dynamically reconfigurable processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, "Power modeling for digital circuits with clock gating," IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, "Run-time active leakage reduction by power gating and reverse body biasing: an energy view," in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD '08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao, et al., "CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors," Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain, et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.

Table 10: Zoom coprocessor use case: computational kernels and their distinctive features.

Kernel        # actors   T_ON   Functionality                       LRs
abs           1          0.03   Absolute value calculation          (4)
chgb          7          0.33   Bilevel/grayscale block checking    (5, 6, 7, 12, 13)
cubic         10         0.06   Linear combination calculation      (1, 5, 9, 10, 13)
cubic conv    6          0.09   Cubic filter convolution            (5, 7, 8, 9)
median        9          0.06   Median calculation                  (1, 3, 5, 11, 13)
min max       1          0.01   Maximum/minimum finding             (11)
sbwlabel      17         0.42   Edge block checking                 (2, 4, 5, 6, 13)

Table 11: Zoom coprocessor at 90 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design: DAT_5
  >Th:     LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

Design: DAT_10
  >Th:     LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5

4.2.1. CGR System of the Zoom Coprocessor. The zoom application is meant to scale an image according to a given zooming factor. Missing pixels of the zoomed image are calculated by adaptively interpolating the neighbouring values. The original application is a C program. It has been executed and profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the composition of the kernels (in terms of number of dataflow actors), their activation time, and their main functionality. These dataflow kernels have then been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application.

Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that, in the zoom coprocessor, it is not necessary to retain the status of any kernel when switching among them. This means that, when applying power gating, no retention cells are needed in the identified regions.

4.2.2. Zoom Coprocessor: Validation Results at 90 nm CMOS Technology. This section discusses the results achieved in the zoom coprocessor scenario, using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows:

(i) Base: the baseline CGR design, without any power saving.

(ii) PG_full: the CGR design where all the 13 LRs are power gated.

(iii) CG_full: the CGR design where all the 13 LRs are clock gated.

(iv) DAT_5: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1.

(v) DAT_10: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1.

Please refer to Table 11 for the composition of DAT_5 and DAT_10.
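The partition reported in Table 11 (>Th, PG set, CG set, NA) can be illustrated with a minimal sketch. Algorithm 1 itself is not reproduced in this section, so the decision rule below is only an assumption inferred from the table columns: regions above the area threshold are candidates for power gating, sequential regions that are not power gated fall back to clock gating, and everything else stays in the always-on domain. The `LogicRegion` type, the `classify` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogicRegion:
    name: str
    area_pct: float       # share of the whole system area, in percent (hypothetical)
    combinational: bool   # True if the region contains no sequential cells
    pg_saving: float      # estimated power gating saving (positive = power saved)
    cg_saving: float      # estimated clock gating saving (positive = power saved)


def classify(regions, area_threshold_pct):
    """Split regions into (PG set, CG set, not assigned).

    Assumed rule, not the paper's Algorithm 1: a region is power gated only
    if it is above the area threshold and power gating is estimated to be
    at least as good as clock gating.
    """
    pg_set, cg_set, not_assigned = [], [], []
    for lr in regions:
        above_threshold = lr.area_pct > area_threshold_pct
        if above_threshold and lr.pg_saving > 0 and lr.pg_saving >= lr.cg_saving:
            pg_set.append(lr.name)
        elif not lr.combinational and lr.cg_saving > 0:
            cg_set.append(lr.name)
        else:
            not_assigned.append(lr.name)
    return pg_set, cg_set, not_assigned


# Hypothetical regions, for illustration only.
regions = [
    LogicRegion("LR_a", 12.0, False, 6.3, 6.2),
    LogicRegion("LR_b", 6.0, False, 0.5, 0.9),
    LogicRegion("LR_c", 3.0, False, 1.0, 0.9),
    LogicRegion("LR_d", 0.65, True, 0.1, 0.0),
]
print(classify(regions, area_threshold_pct=5.0))
```

With a larger threshold, fewer regions qualify as power gating candidates and more of them fall back to clock gating, which is the qualitative difference between the two hybrid designs.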

Figure 13 depicts the static, internal, and total power consumption of each considered design. In this case, the differences among CG_full, PG_full, DAT_5, and DAT_10 are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one: as the reported histograms show, on average they differ by roughly two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT_5 design and in the DAT_10 one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what we already found for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT_5 consumes less than any other design (62.61% of saving), while the CG_full design is the least beneficial (62.29% of saving).

Focusing on the static histograms, as expected, the CG_full design introduces a small overhead with respect to Base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. [Histogram of leakage (lkg), internal (int), and total (tot) power for Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), and DAT_10 (+3.19%).] The legend shows in brackets the power management area overhead of each design with respect to the Base one.

Table 12: Zoom coprocessor at 90 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step.

        PG saving (%)                         CG saving (%)
LR      Est.      Real      Err.   Net err.   Est.      Real      Err.   Net err.
LR1     −6.322    −6.331    0.15   0.18       −6.307    −6.313    0.09   0.17
LR2     −23.464   −23.463   0.00   1.26       −23.369   −23.373   0.02   1.67
LR3     −6.537    −6.544    0.10   1.65       −6.507    −6.514    0.10   1.59
LR4     -         -         -      -          −0.958    −0.964    0.68   1.34
LR5     -         -         -      -          -         -         -      -
LR6     -         -         -      -          −0.870    −0.878    0.81   1.22
LR7     -         -         -      -          −1.033    −1.039    0.63   0.15
LR8     −7.062    −7.058    0.05   2.02       −6.987    −6.986    0.01   1.83
LR9     -         -         -      -          −2.436    −2.443    0.27   1.55
LR10    −5.498    −5.503    0.10   1.65       −5.474    −5.480    0.11   1.57
LR11    -         -         -      -          −3.329    −3.285    1.32   3.67
LR12    −4.278    −4.288    0.23   1.83       −4.267    −4.273    0.15   1.86
LR13    -         -         -      -          −0.649    −0.656    1.01   1.13

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages of the considered domains are evaluated for power gating (power gating overhead estimation) and for clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3%, and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error due to neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not significantly affected by this simplification.
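As a quick sanity check of the error columns, the sketch below recomputes the estimation error from an estimated and a real saving. The error definition used here (absolute difference over the real value, in percent) is an assumption, since the formula is not stated in this section; the function name is hypothetical. The sample values are the LR1 and LR11 clock gating entries of Table 12, read with three decimal digits, so the results may differ from the table in the last digit because the printed values are rounded.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative estimation error, in percent, of an estimated saving.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    if real == 0:
        raise ValueError("real value must be non-zero")
    return abs(estimated - real) / abs(real) * 100.0


# (estimated, real) clock gating savings for two regions of Table 12.
samples = {"LR1": (-6.307, -6.313), "LR11": (-3.329, -3.285)}

for name, (est, real) in samples.items():
    print(f"{name}: {relative_error_pct(est, real):.2f}%")
```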

4.2.3. Zoom Coprocessor: Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. The implemented designs are the following:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. [Histogram of leakage (lkg), internal (int), and total (tot) power for Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), and DAT_10 (+3.35%).] The legend shows in brackets the power management area overhead of each design with respect to the Base one.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected on the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please also note that, lowering the area threshold, the area of the optimal design and that of the fully power gated one become pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, its presence in the automated flow makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models, targeting the 45 nm CMOS technology, is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and it is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as P_lkg(SW) × switches, where P_lkg(SW) is the static power consumption of the considered power switch, as reported in the technology library, and switches is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of N switches can be used to switch off N horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the N horizontal virtual power stripes of a power domain, only one column of N power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
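The leakage term described above can be sketched numerically. The first function is the P_lkg(SW) × switches product from the text; the second, which compares it with the leakage saved while the domain is off, is an illustrative assumption (it idealizes the gated domain as leaking nothing when off) and is not one of the paper's equations. Function names and all numbers are hypothetical.

```python
def switch_leakage_overhead(p_lkg_sw: float, switches: int) -> float:
    """Always-present leakage of the power switches: P_lkg(SW) * switches."""
    if switches < 0:
        raise ValueError("switches must be non-negative")
    return p_lkg_sw * switches


def net_leakage_saving(domain_leakage: float, off_fraction: float,
                       p_lkg_sw: float, switches: int) -> float:
    """Leakage saved while the domain is off, minus the switch overhead.

    Idealization (assumption): the gated domain leaks nothing when off.
    A positive result means the switches pay for themselves.
    """
    if not 0.0 <= off_fraction <= 1.0:
        raise ValueError("off_fraction must be in [0, 1]")
    saved = domain_leakage * off_fraction
    return saved - switch_leakage_overhead(p_lkg_sw, switches)


# Hypothetical values in arbitrary power units, for illustration only.
print(net_leakage_saving(domain_leakage=100.0, off_fraction=0.9,
                         p_lkg_sw=0.5, switches=8))
```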

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions can be neglected.

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design: DAT_1
  >Th:     LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
  PG set:  LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
  CG set:  LR6, LR13
  NA:      LR5

Design: DAT_5
  >Th:     LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

Design: DAT_10
  >Th:     LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5

Table 14: Zoom coprocessor at 45 nm CMOS technology: accuracy of the power gating overhead estimation step and of the clock gating overhead estimation step.

        PG saving (%)                         CG saving (%)
LR      Est.      Real      Err.   Net err.   Est.      Real      Err.   Net err.
LR1     −5.686    −5.699    0.23   0.47       −5.507    −5.501    0.13   0.53
LR2     −22.990   −22.982   0.03   1.02       −22.063   −22.029   0.15   0.81
LR3     −6.569    −6.566    0.04   1.13       −6.273    −6.262    0.18   0.91
LR4     −0.946    −0.944    0.15   1.82       −0.919    −0.917    0.22   1.52
LR5     -         -         -      -          -         -         -      -
LR6     −0.747    −0.748    0.25   0.25       −0.853    −0.833    0.26   1.59
LR7     −0.955    −0.957    0.17   1.36       −0.935    −0.933    0.21   1.49
LR8     −7.464    −7.460    0.06   1.36       −6.724    −6.714    0.16   0.66
LR9     −2.475    −2.482    0.25   1.10       −2.350    −2.345    0.18   1.09
LR10    −5.473    −5.470    0.04   1.09       −5.270    −5.261    0.17   0.91
LR11    −3.422    −3.361    1.79   2.58       −3.257    −3.205    1.63   2.04
LR12    −4.327    −4.330    0.08   1.19       −4.176    −4.169    0.17   0.98
LR13    −0.587    −0.580    1.25   3.70       −0.623    −0.621    0.28   1.85

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
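The two wake-up overhead percentages quoted above follow from dividing the 4-cycle transition time by the kernel execution time. The short sketch below reproduces that arithmetic; the function name is hypothetical, and treating the overhead as a plain ratio of cycle counts is an assumption that matches the figures in the text.

```python
WAKE_UP_CYCLES = 4  # clock cycles per on/off transition, as stated in Section 4.3


def wake_up_overhead_pct(kernel_cycles: int,
                         wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


for label, cycles in [("zoom, fastest kernel", 24), ("HEVC 8x8 block", 160)]:
    print(f"{label}: {wake_up_overhead_pct(cycles):.1f}%")
```

The 24-cycle case gives roughly 16.7%, which the text rounds to "almost 17%", while the 160-cycle case gives exactly 2.5%.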

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: accuracy of the power gating overhead estimation step using reports generated without the real switching activity.

        PG overhead (%)
LR      Est.      Real     Err.
LR1     −30.54    −26.10   17.07
LR2     −0.19     −1.99    90.22
LR3     −21.83    −6.95    214.30
LR4     −0.12     −0.67    80.97
LR5     −0.06     −0.59    90.11
LR6     −0.07     0.56     86.60
LR7     −15.39    −2.64    482.93
LR8     −0.38     −0.02    1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design, without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N(2k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, applies when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
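The synthesis and simulation counts above can be reproduced with a few lines of arithmetic. The sketch below encodes the counting rule described in the text (one power gated design per LR, one clock gated design per LR that is not fully combinatorial, plus the baseline, each simulated N times); the function names are hypothetical.

```python
def exhaustive_effort(n_functionalities: int, n_regions: int,
                      n_combinatorial: int = 0):
    """(syntheses, simulations) needed without the proposed flow."""
    if not 0 <= n_combinatorial <= n_regions:
        raise ValueError("n_combinatorial must be between 0 and n_regions")
    power_gated = n_regions
    clock_gated = n_regions - n_combinatorial  # combinatorial LRs skip clock gating
    syntheses = power_gated + clock_gated + 1  # +1 for the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities: int):
    """(syntheses, simulations) needed by the proposed flow: baseline only."""
    return 1, n_functionalities


# FFT: N = 4, k = 8, 3 combinatorial LRs; zoom: N = 7, k = 13, 1 combinatorial LR.
for label, n, k, comb in [("FFT", 4, 8, 3), ("zoom", 7, 13, 1)]:
    print(label, proposed_effort(n), exhaustive_effort(n, k, comb))
```

With no combinatorial regions, the first function reduces to the general 2k + 1 syntheses and N(2k + 1) simulations.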

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate, customizing the datapath, is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo, et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.

[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino, et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino, et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E Bezati S Casale-Brunet M Mattavelli and J W JanneckldquoCoarse grain clock gating of streaming applications in pro-grammable logic implementationsrdquo in Proceedings of the Elec-tronic System Level Synthesis Conference pp 1ndash6 2014

[34] Silicon Integration Initiative Si2 Common Power FormatSpecificationTM-Version 21 2014

[35] M Shafique L Bauer and J Henkel ldquoAdaptive energy manage-ment for dynamically reconfigurable processorsrdquo IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 33 no 1 pp 50ndash63 2014

[36] J Yi and J Kim ldquoPower modeling for digital circuits with clockgatingrdquo IEICE Electronics Express vol 12 no 24 Article ID20150817 2015

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.

[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter® RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy. All values are percentages; "-" marks entries that are not reported.

LR      PG saving                     Net       CG saving                     Net
        Est.      Real      Err.      Err.      Est.      Real      Err.      Err.
LR1     −6.322    −6.331    0.15      0.18      −6.307    −6.313    0.09      0.17
LR2     −23.464   −23.463   0.00      1.26      −23.369   −23.373   0.02      1.67
LR3     −6.537    −6.544    0.10      1.65      −6.507    −6.514    0.10      1.59
LR4     -         -         -         -         −0.958    −0.964    0.68      1.34
LR5     -         -         -         -         -         -         -         -
LR6     -         -         -         -         −0.870    −0.878    0.81      1.22
LR7     -         -         -         -         −1.033    −1.039    0.63      0.15
LR8     −7.062    −7.058    0.05      2.02      −6.987    −6.986    0.01      1.83
LR9     -         -         -         -         −2.436    −2.443    0.27      1.55
LR10    −5.498    −5.503    0.10      1.65      −5.474    −5.480    0.11      1.57
LR11    -         -         -         -         −3.329    −3.285    1.32      3.67
LR12    −4.278    −4.288    0.23      1.83      −4.267    −4.273    0.15      1.86
LR13    -         -         -         -         −0.649    −0.656    1.01      1.13

[Figure 13: bar chart of the leakage (lkg), internal (int), and total (tot) power of each design; vertical axis: power, from 0 to 12000. Legend: Base, CG_full (+1.73%), PG_full (+6.40%), DAT_5 (+4.55%), DAT_10 (+3.19%).]

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one.

base. That is due to the 12 clock gating cells (one for each region but LR5) introduced in the always-on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT_5 saving is slightly higher than the PG_full one, both being over 51%. DAT_10 is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT_5 and DAT_10 (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.

In terms of area occupancy, reported in the legend of Figure 13, the PG_full design is the one with the largest overhead, +6.4%. DAT_5, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG_full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT_5 constitutes the optimal solution for the zoom coprocessor scenario considering a 90 nm technology. DAT_10, which is less beneficial than DAT_5 in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT_10 versus −62.29% for PG_full) but a smaller area overhead (+3.19% for DAT_10 versus +6.4% for PG_full).

Table 12 reports the estimation error of the proposed automated hybrid power management design flow when the power saving percentages for the considered domains are evaluated, considering power gating (power gating overhead estimation) and clock gating (clock gating overhead estimation), respectively. PG saving errors are always below 0.3% and CG saving ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. For both power and clock gating implementations, Table 12 also reports the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.
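As a small worked example of how the Err. columns of Table 12 can be read, the sketch below computes the relative error of an estimated saving with respect to the real one. The formula is an assumption on our side (the table does not spell it out), and the function name `relative_error_pct` is hypothetical. The sample values are the LR1 power gating entries and the LR11 clock gating entries of Table 12, read here as −6.322%/−6.331% and −3.329%/−3.285%.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative error of an estimate with respect to the real value, in percent.

    Assumed definition: |estimated - real| / |real| * 100.
    """
    if real == 0:
        raise ValueError("real value must be non-zero")
    return abs(estimated - real) / abs(real) * 100.0


# (estimated saving, real saving), both in percent, taken from Table 12.
samples = {
    "LR1, power gating": (-6.322, -6.331),
    "LR11, clock gating": (-3.329, -3.285),
}

for name, (est, real) in samples.items():
    print(f"{name}: {relative_error_pct(est, real):.2f}%")
```

Under this assumed definition the results land close to the tabulated errors (0.15% and 1.32%); small differences are expected, since the tabulated savings are themselves rounded.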

4.2.3. Zoom Coprocessor Validation Results at 45 nm CMOS Technology. In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS technology. The idea is to assess what changes when the ratio between the static and the dynamic consumption is varied. Here follows the list of the implemented designs:

(i) Base: the same as in the 90 nm synthesis trial.

(ii) PG_full: the same as in the 90 nm synthesis trial.

(iii) CG_full: the same as in the 90 nm synthesis trial.

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the area threshold to 1% in Algorithm 1.

(v) DAT_5: the same as in the 90 nm synthesis trial.

(vi) DAT_10: the same as in the 90 nm synthesis trial.

[Figure 14: bar chart of the leakage (lkg), internal (int), and total (tot) power of each design; vertical axis: power, from 0 to 2500. Legend: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).]

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add in this second trial an additional design, DAT_1. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected on the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when lowering the area threshold, the area of the optimal design and that of the fully power gated one are pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow, its presence makes the overall process more robust, avoiding useless iterations on designs that are not convenient by construction, when the technology is not so constrained or the ratio between static and dynamic consumption is larger.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).
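To make the link between the estimates of Table 14 and the DAT_1 composition of Table 13 more tangible, the following sketch assigns each region to the technique with the larger estimated saving. This per-region comparison is only our reading of the two tables, not a restatement of Algorithm 1; the names `ESTIMATES` and `partition` are hypothetical. LR5 is left out because Table 14 reports no estimate for it (Table 13 lists it as NA), and all the remaining regions are assumed to pass the 1% area threshold, as Table 13 indicates.

```python
# Estimated savings from Table 14 (45 nm), in percent of the base power:
# region -> (power gating estimate, clock gating estimate).
ESTIMATES = {
    "LR1": (-5.686, -5.507), "LR2": (-22.990, -22.063),
    "LR3": (-6.569, -6.273), "LR4": (-0.946, -0.919),
    "LR6": (-0.747, -0.853), "LR7": (-0.955, -0.935),
    "LR8": (-7.464, -6.724), "LR9": (-2.475, -2.350),
    "LR10": (-5.473, -5.270), "LR11": (-3.422, -3.257),
    "LR12": (-4.327, -4.176), "LR13": (-0.587, -0.623),
}


def partition(estimates):
    """Split regions into a power gating set and a clock gating set.

    Assumed rule: a region goes to the PG set when its estimated power
    gating saving is larger (more negative) than its clock gating one.
    """
    pg_set, cg_set = [], []
    for region, (pg_saving, cg_saving) in estimates.items():
        if pg_saving < cg_saving:
            pg_set.append(region)
        else:
            cg_set.append(region)
    return pg_set, cg_set


pg_set, cg_set = partition(ESTIMATES)
print("PG set:", ", ".join(pg_set))
print("CG set:", ", ".join(cg_set))
```

With these inputs, the assumed rule places LR6 and LR13 in the clock gating set and the other ten regions in the power gating set, which agrees with the DAT_1 row of Table 13.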

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \ast \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch down $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe supplies power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%; DAT_5: area threshold 5%; DAT_10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

Design DAT_1
  > Th:    LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13
  PG set:  LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12
  CG set:  LR6, LR13
  NA:      LR5

Design DAT_5
  > Th:    LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13
  PG set:  LR1, LR2, LR3, LR8, LR10, LR12
  CG set:  LR4, LR6, LR7, LR9, LR11, LR13
  NA:      LR5

Design DAT_10
  > Th:    LR1, LR8
  PG set:  LR1, LR8
  CG set:  LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13
  NA:      LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy. All values are percentages; "-" marks entries that are not reported.

LR      PG saving                     Net       CG saving                     Net
        Est.      Real      Err.      Err.      Est.      Real      Err.      Err.
LR1     −5.686    −5.699    0.23      0.47      −5.507    −5.501    0.13      0.53
LR2     −22.990   −22.982   0.03      1.02      −22.063   −22.029   0.15      0.81
LR3     −6.569    −6.566    0.04      1.13      −6.273    −6.262    0.18      0.91
LR4     −0.946    −0.944    0.15      1.82      −0.919    −0.917    0.22      1.52
LR5     -         -         -         -         -         -         -         -
LR6     −0.747    −0.748    0.25      0.25      −0.853    −0.833    0.26      1.59
LR7     −0.955    −0.957    0.17      1.36      −0.935    −0.933    0.21      1.49
LR8     −7.464    −7.460    0.06      1.36      −6.724    −6.714    0.16      0.66
LR9     −2.475    −2.482    0.25      1.10      −2.350    −2.345    0.18      1.09
LR10    −5.473    −5.470    0.04      1.09      −5.270    −5.261    0.17      0.91
LR11    −3.422    −3.361    1.79      2.58      −3.257    −3.205    1.63      2.04
LR12    −4.327    −4.330    0.08      1.19      −4.176    −4.169    0.17      0.98
LR13    −0.587    −0.580    1.25      3.70      −0.623    −0.621    0.28      1.85

phase of the proposed model is a real use case, but it is a small size design, where the execution of the fastest kernels lasts 24 clock cycles. Then, the wake-up time overhead is, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
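The two overhead figures quoted above follow from a simple ratio, sketched here for convenience. The function name `wake_up_overhead_pct` is hypothetical, and the ratio is taken with respect to the kernel execution cycles, which is the reading that matches the quoted percentages; the 4-cycle wake-up cost is the one required by the switch on/off protocol [11].

```python
WAKE_UP_CYCLES = 4  # clock cycles needed for each on/off transition [11]


def wake_up_overhead_pct(kernel_cycles: int,
                         wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return wake_up_cycles / kernel_cycles * 100.0


# Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
for label, cycles in (("zoom, fastest kernel", 24), ("HEVC, 8 x 8 block", 160)):
    print(f"{label}: {wake_up_overhead_pct(cycles):.1f}%")
```

The first case gives roughly 16.7% (the "almost 17%" of the text), the second 2.5%.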

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; therefore, in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity. All values are percentages.

LR      PG overhead
        Est.      Real      Err.
LR1     −30.54    −26.10    17.07
LR2     −0.19     −1.99     90.22
LR3     −21.83    −6.95     214.30
LR4     −0.12     −0.67     80.97
LR5     −0.06     −0.59     90.11
LR6     −0.07     0.56      86.60
LR7     −15.39    −2.64     482.93
LR8     −0.38     −0.02     1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, it corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N(2k + 1)$ hardware simulations are necessary. The only simplification that may be done, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
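The effort counts above can be reproduced with a few lines of code. The sketch below is only a bookkeeping aid: the names `Effort`, `effort_with_flow`, and `effort_without_flow` are hypothetical, and the model simply counts one synthesis per design and $N$ simulations per synthesized design, skipping the clock gated variant of fully combinatorial regions as described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    syntheses: int
    simulations: int


def effort_with_flow(n_functions: int) -> Effort:
    """Proposed flow: baseline synthesis plus one simulation per functionality."""
    return Effort(syntheses=1, simulations=n_functions)


def effort_without_flow(n_functions: int, n_regions: int,
                        n_combinatorial: int = 0) -> Effort:
    """Exhaustive evaluation: baseline, one PG design per region, and one CG
    design per region that is not fully combinatorial."""
    designs = 1 + n_regions + (n_regions - n_combinatorial)
    return Effort(syntheses=designs, simulations=n_functions * designs)


# FFT: N = 4, k = 8, 3 combinatorial LRs; zoom: N = 7, k = 13, 1 combinatorial LR.
for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom", 7, 13, 1)):
    print(name, effort_with_flow(n), effort_without_flow(n, k, comb))
```

For the FFT this yields 14 syntheses and 56 simulations against 1 and 4 with the flow; for the zoom coprocessor, 26 and 182 against 1 and 7, as reported above.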

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable for accelerating multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective in addressing automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow to hardware systems composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding the designers towards the definition of the optimal power management support. It is more efficient than the previous blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, p. 590, 584, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™-Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H Xu R Vemuri and W-B Jone ldquoRun-time active leakagereduction by power gating and reverse body biasing an energyviewrdquo in Proceedings of the 26th IEEE International Conferenceon Computer Design (ICCD rsquo08) pp 618ndash625 IEEE October2008

[38] K Datta A Mukherjee G Cao et al ldquoCASPER embeddingpower estimation and hardware-controlled powermanagementin a cycle-accurate micro-architecture simulation platform formany-core multi-threading heterogeneous processorsrdquo Journalof Low Power Electronics and Applications vol 2 no 1 pp 30ndash68 2012

Journal of Electrical and Computer Engineering 27

[39] A Chhabra H Rawat M Jain et al ldquoFALPEM framework forarchitectural-level power estimation and optimization for largememory sub-systemsrdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 34 no 7 pp 1138ndash1142 2015

[40] S Li J H Ahn R D Strong J B Brockman DM Tullsen andN P Jouppi ldquoThe McPAT framework for multicore and many-core architectures simultaneously modeling power area andtimingrdquoACMTransactions on Architecture and Code Optimiza-tion vol 10 no 1 article 5 pp 1ndash29 2013

[41] D Zoni andW Fornaciari ldquoModeling DVFS and power-gatingactuators for cycle-accurate NoC-based simulatorsrdquo ACM Jour-nal on Emerging Technologies in Computing Systems vol 12 no3 article 27 pp 1ndash24 2015

[42] C Sau L Raffo F Palumbo E Bezati S Casale-Brunet andM Mattavelli ldquoAutomated design flow for coarse-grained reco-nfigurable platforms an RVC-CAL multi-standard decoderuse-caserdquo in Proceedings of the 14th International Conference onEmbedded Computer Systems Architectures Modeling and Sim-ulation (SAMOS rsquo14) pp 59ndash66 Agios Konstantinos GreeceJuly 2014

[43] Open RVC-CAL Compiler httporccsourceforgenet[44] Cadence Low Power in Encounter RRTL Compiler Product

Version 141 July 2014[45] J W Cooley and J W Tukey ldquoAn algorithm for the machine

calculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[46] Z Hu A Buyuktosunoglu V Srinivasan V Zyuban H Jacob-son and P Bose ldquoMicroarchitectural techniques for powergating of execution unitsrdquo in Proceedings of the InternationalSymposium on Low Power Electronics and Design (ISLPED rsquo04)pp 32ndash37 IEEE August 2004

[47] C M Diniz M Shafique S Bampi and J Henkel ldquoA recon-figurable hardware architecture for fractional pixel interpola-tion in high efficiency video codingrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol34 no 2 pp 238ndash251 2015



22 Journal of Electrical and Computer Engineering

Figure 14: Zoom coprocessor at 45 nm CMOS technology. Comparison between the base design and the five gated designs. The legend shows in brackets the power management area overhead of each design with respect to the Base one. [Bar chart: vertical axis, power; bar groups lkg, int, and tot; legend: Base, CG_full (+1.82%), PG_full (+6.90%), DAT_1 (+6.02%), DAT_5 (+5.18%), DAT_10 (+3.35%).]

(iv) DAT_1: the CGR design where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 1% in Algorithm 1;

(v) DAT_5: the same as in the 90 nm synthesis trial;

(vi) DAT_10: the same as in the 90 nm synthesis trial.

Targeting a smaller technology, and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we decided to add an additional design, DAT_1, in this second trial. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT_1, DAT_5, and DAT_10.

The power consumption, in terms of static, internal, and total contributions, is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is therefore partially reflected in the total one. Technology scaling and the different static versus dynamic power ratio are such that PG_full is capable of providing better overall saving results than DAT_5 and DAT_10. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT_1, which basically excludes from power gating only 3 LRs (LR5, LR6, and LR13 in Table 13), saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, when the area threshold is lowered, the area of the optimal design and that of the fully power gated one become pretty similar.

We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm is less beneficial. Still, in the automated flow its presence makes the overall process more robust: it avoids useless iterations on designs that are not convenient by construction, when the technology is not so constrained or dynamic consumption dominates static consumption by a larger factor.

The accuracy of the proposed models targeting the 45 nm CMOS technology is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).
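The composition reported in Table 13 can be cross-checked against the estimates in Table 14. The sketch below is only an illustration and is not Algorithm 1 itself. It assumes a simplified rule (the function name assign_strategies and the rule are assumptions of this example): an LR above the area threshold is power gated when its estimated power gating saving is larger than its clock gating one, every other LR with a clock gating estimate is clock gated, and an LR with no estimate is left in the always-on domain. The savings are the Est. columns of Table 14, read here as percentages of the base power, and the sets above threshold are the >Th column of Table 13.

```python
# Estimated savings from Table 14 (45 nm zoom coprocessor), read as percentages
# of the base power; negative values mean that power is saved.
# LR5 has no entry in Table 14, so it is absent from both dictionaries.
PG_EST = {1: -5.686, 2: -22.990, 3: -6.569, 4: -0.946, 6: -0.747, 7: -0.955,
          8: -7.464, 9: -2.475, 10: -5.473, 11: -3.422, 12: -4.327, 13: -0.587}
CG_EST = {1: -5.507, 2: -22.063, 3: -6.273, 4: -0.919, 6: -0.853, 7: -0.935,
          8: -6.724, 9: -2.350, 10: -5.270, 11: -3.257, 12: -4.176, 13: -0.623}

# LRs above the area threshold (">Th" column of Table 13).
ABOVE_THRESHOLD = {
    "DAT_1": {1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13},
    "DAT_5": {1, 2, 3, 6, 8, 10, 12, 13},
    "DAT_10": {1, 8},
}
ALL_LRS = set(range(1, 14))


def assign_strategies(above_threshold):
    """Hypothetical selection rule (an assumption, not the paper's Algorithm 1).

    Returns the (power gated, clock gated, not assigned) sets of LR indices.
    """
    pg, cg, na = set(), set(), set()
    for lr in ALL_LRS:
        pg_est, cg_est = PG_EST.get(lr), CG_EST.get(lr)
        pg_wins = pg_est is not None and (cg_est is None or pg_est < cg_est)
        if lr in above_threshold and pg_wins:
            pg.add(lr)   # power gating is estimated to save more
        elif cg_est is not None:
            cg.add(lr)   # fall back to clock gating
        else:
            na.add(lr)   # no estimate available: always-on domain
    return pg, cg, na


for design, above in ABOVE_THRESHOLD.items():
    pg, cg, na = assign_strategies(above)
    print(design, "PG:", sorted(pg), "CG:", sorted(cg), "N/A:", sorted(na))
```

Under these assumptions, the rule returns the PG, CG, and N/A sets listed in Table 13 for DAT_1, DAT_5, and DAT_10. LR6 and LR13 end up clock gated even when they are above the threshold, because their estimated clock gating saving is the larger one.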

4.3. Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process, and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.

The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead and (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.

The leakage-related contribution is fixed by the technology library and is always present, regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{\mathrm{lkg}}(\mathrm{SW}) \ast \mathit{switches}$, where $P_{\mathrm{lkg}}(\mathrm{SW})$ is the static power consumption of the considered power switch, as reported in the technology library, and $\mathit{switches}$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $N$ switches can be used to switch off $N$ horizontal virtual power stripes of the power grid. If we assume that only one vertical real power stripe is used to supply power to the $N$ horizontal virtual power stripes of a power domain, only one column of $N$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.

The dynamic power contribution is only relevant when intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is not relevant either. Thus, in designs with low switching rates, these two overhead contributions could be neglected.

The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation


Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT_1: area threshold 1%. DAT_5: area threshold 5%. DAT_10: area threshold 10%. N/A stands for not assigned and includes those LRs that are placed in the always-on domain.

Design | >Th | PG set | CG set | N/A
DAT_1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5
DAT_5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5
DAT_10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

LR | PG saving Est. | PG saving Real | PG saving Err. (%) | PG Net Err. (%) | CG saving Est. | CG saving Real | CG saving Err. (%) | CG Net Err. (%)
LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53
LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81
LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91
LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52
LR5 | — | — | — | — | — | — | — | —
LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59
LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49
LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66
LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09
LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91
LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04
LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98
LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85

phase of the proposed model is a real use case, but it is a small size design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
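The two percentages quoted above follow from dividing the 4-cycle wake-up time by the kernel execution time. A minimal sketch of this arithmetic is given below; the function name wake_up_overhead is an assumption of this example, and the overhead is taken relative to the kernel execution cycles, which is the reading consistent with the figures in the text.

```python
WAKE_UP_CYCLES = 4  # clock cycles per on/off transition, as required by the protocol in [11]


def wake_up_overhead(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a percentage of the kernel execution time."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return 100.0 * wake_up_cycles / kernel_cycles


# Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
for name, cycles in (("zoom, fastest kernel", 24), ("HEVC, 8 x 8 block", 160)):
    print(f"{name}: {wake_up_overhead(cycles):.1f}% wake-up overhead")
```

With 24 cycles the ratio is about 16.7%, that is, the "almost 17%" worst case of the zoom coprocessor; with 160 cycles it is 2.5%.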

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $k$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.
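As a rough check of the errors in Table 15, the sketch below recomputes them as the relative deviation of the estimate from the real value. This definition of the error is an assumption of this example (the metric is not spelled out in this section), and the three rows are copied from Table 15 read with two decimal digits.

```python
def estimation_error(estimated, real):
    """Relative estimation error, in percent, with respect to the real value."""
    return 100.0 * abs(estimated - real) / abs(real)


# (estimated overhead, real overhead, error reported in Table 15)
TABLE_15_ROWS = {
    "LR1": (-30.54, -26.10, 17.07),
    "LR3": (-21.83, -6.95, 214.30),
    "LR7": (-15.39, -2.64, 482.93),
}

for lr, (est, real, reported) in TABLE_15_ROWS.items():
    print(f"{lr}: recomputed {estimation_error(est, real):.2f}%, reported {reported:.2f}%")
```

The recomputed values agree with the reported ones to within a fraction of a percentage point; the residual difference is compatible with the rounding of the tabulated overheads.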

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

LR | PG overhead Est. | PG overhead Real | Err. (%)
LR1 | −30.54 | −26.10 | 17.07
LR2 | −0.19 | −1.99 | 90.22
LR3 | −21.83 | −6.95 | 214.30
LR4 | −0.12 | −0.67 | 80.97
LR5 | −0.06 | −0.59 | 90.11
LR6 | −0.07 | 0.56 | 86.60
LR7 | −15.39 | −2.64 | 482.93
LR8 | −0.38 | −0.02 | 1941.81

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system;

(4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only steps (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2 \ast k + 1$ syntheses ($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N \ast (2 \ast k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $k = 8$ LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
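The counts above can be reproduced with a few lines of code. The sketch below is only illustrative; the function names exhaustive_effort and proposed_effort are assumptions of this example. It subtracts one synthesis for every fully combinatorial LR, which needs no clock gated variant.

```python
def exhaustive_effort(n_functionalities, n_regions, n_combinatorial=0):
    """Syntheses and simulations needed when the proposed flow is NOT adopted.

    One power gated design per LR, one clock gated design per non-combinatorial
    LR, plus the baseline; every synthesized design is simulated once per
    functionality.
    """
    syntheses = 2 * n_regions + 1 - n_combinatorial
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """Proposed flow: baseline synthesis only, simulated once per functionality."""
    return 1, n_functionalities


for name, n, k, comb in (("FFT", 4, 8, 3), ("zoom", 7, 13, 1)):
    print(name, "exhaustive:", exhaustive_effort(n, k, comb),
          "proposed:", proposed_effort(n))
```

For the FFT this gives 14 syntheses and 56 simulations against 1 and 4; for the zoom coprocessor, 26 and 182 against 1 and 7.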

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are as suitable to accelerate multifunctional applications as they are potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for automating such mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow to hardware systems composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work makes some steps further, both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter® RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Page 23: Research Article Modelling and Automated …downloads.hindawi.com/journals/jece/2016/4237350.pdfResearch Article Modelling and Automated Implementation of Optimal Power Saving Strategies

Journal of Electrical and Computer Engineering 23

Table 13: Zoom coprocessor at 45 nm CMOS technology: characterization of the hybrid clock and power gated designs achieved with the proposed automated flow. DAT1: area threshold 1%; DAT5: area threshold 5%; DAT10: area threshold 10%. NA stands for not assigned and includes those LRs that are placed in the always-on domain.

| Design | > Th | PG set | CG set | NA |
|---|---|---|---|---|
| DAT1 | LR1, LR2, LR3, LR4, LR6, LR7, LR8, LR9, LR10, LR11, LR12, LR13 | LR1, LR2, LR3, LR4, LR7, LR8, LR9, LR10, LR11, LR12 | LR6, LR13 | LR5 |
| DAT5 | LR1, LR2, LR3, LR6, LR8, LR10, LR12, LR13 | LR1, LR2, LR3, LR8, LR10, LR12 | LR4, LR6, LR7, LR9, LR11, LR13 | LR5 |
| DAT10 | LR1, LR8 | LR1, LR8 | LR2, LR3, LR4, LR6, LR7, LR9, LR10, LR11, LR12, LR13 | LR5 |


Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

| LR | PG saving Est | PG saving Real | PG saving Err | PG Net Err | CG saving Est | CG saving Real | CG saving Err | CG Net Err |
|---|---|---|---|---|---|---|---|---|
| LR1 | −5.686 | −5.699 | 0.23 | 0.47 | −5.507 | −5.501 | 0.13 | 0.53 |
| LR2 | −22.990 | −22.982 | 0.03 | 1.02 | −22.063 | −22.029 | 0.15 | 0.81 |
| LR3 | −6.569 | −6.566 | 0.04 | 1.13 | −6.273 | −6.262 | 0.18 | 0.91 |
| LR4 | −0.946 | −0.944 | 0.15 | 1.82 | −0.919 | −0.917 | 0.22 | 1.52 |
| LR5 | - | - | - | - | - | - | - | - |
| LR6 | −0.747 | −0.748 | 0.25 | 0.25 | −0.853 | −0.833 | 0.26 | 1.59 |
| LR7 | −0.955 | −0.957 | 0.17 | 1.36 | −0.935 | −0.933 | 0.21 | 1.49 |
| LR8 | −7.464 | −7.460 | 0.06 | 1.36 | −6.724 | −6.714 | 0.16 | 0.66 |
| LR9 | −2.475 | −2.482 | 0.25 | 1.10 | −2.350 | −2.345 | 0.18 | 1.09 |
| LR10 | −5.473 | −5.470 | 0.04 | 1.09 | −5.270 | −5.261 | 0.17 | 0.91 |
| LR11 | −3.422 | −3.361 | 1.79 | 2.58 | −3.257 | −3.205 | 1.63 | 2.04 |
| LR12 | −4.327 | −4.330 | 0.08 | 1.19 | −4.176 | −4.169 | 0.17 | 0.98 |
| LR13 | −0.587 | −0.580 | 1.25 | 3.70 | −0.623 | −0.621 | 0.28 | 1.85 |
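Read together, Tables 13 and 14 suggest a simple per-region decision rule, sketched below in Python. This is only an illustrative reading of the tables, not the authors' algorithm. It assumes that a more negative estimate means a larger power reduction, that power gating is considered only for regions above the area threshold, and that a region with no applicable technique stays in the always-on domain. The `Region` type, the `assign` function, and all `area_pct` values are hypothetical. The areas are chosen only to be consistent with the "> Th" column of Table 13.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    name: str
    area_pct: float           # hypothetical share of the design area, in percent
    pg_est: Optional[float]   # estimated PG figure (more negative = more saving, assumed)
    cg_est: Optional[float]   # estimated CG figure; None if clock gating does not apply


def assign(regions: List[Region], threshold_pct: float) -> Tuple[List[str], List[str], List[str]]:
    """Split regions into PG set, CG set, and not-assigned (always-on) set."""
    pg_set, cg_set, not_assigned = [], [], []
    for r in regions:
        # Power gating is a candidate only above the area threshold (assumption).
        pg_ok = r.pg_est is not None and r.pg_est < 0 and r.area_pct > threshold_pct
        cg_ok = r.cg_est is not None and r.cg_est < 0
        if pg_ok and (not cg_ok or r.pg_est <= r.cg_est):
            pg_set.append(r.name)
        elif cg_ok:
            cg_set.append(r.name)
        else:
            not_assigned.append(r.name)
    return pg_set, cg_set, not_assigned


# Estimates taken from Table 14; areas are made-up placeholders.
regions = [
    Region("LR1", 12.0, -5.686, -5.507),
    Region("LR4", 3.0, -0.946, -0.919),
    Region("LR5", 0.5, None, None),
    Region("LR6", 6.0, -0.747, -0.853),
]

for th in (1.0, 5.0):
    print(th, assign(regions, th))
```

With these placeholder areas, the sketch places LR1 in the PG set, LR6 in the CG set, and LR5 in the not-assigned set for both thresholds, while LR4 moves from the PG set to the CG set when the threshold rises from 1% to 5%, which matches the corresponding rows of Table 13.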

phase of the proposed model is a real use case, but it is a small-size design where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time. If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would meet the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves 2-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block, and the wake-up time overhead is just 2.5% of the total execution time.
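The two percentages above can be checked with a short calculation. The sketch below assumes a wake-up time of 4 clock cycles. This value is not stated in this section and is inferred only from the two reported percentages, so it should be read as a working assumption. The function name is hypothetical.

```python
def wakeup_overhead_pct(wakeup_cycles: int, execution_cycles: int) -> float:
    """Wake-up time as a percentage of the kernel execution time."""
    return 100.0 * wakeup_cycles / execution_cycles


WAKEUP_CYCLES = 4  # assumption inferred from the reported percentages

# Smallest kernel of the use case discussed above: 24 clock cycles.
print(f"{wakeup_overhead_pct(WAKEUP_CYCLES, 24):.1f}%")
# Smallest HEVC interpolation block (8 x 8 pixels): 160 clock cycles.
print(f"{wakeup_overhead_pct(WAKEUP_CYCLES, 160):.1f}%")
```

Under this assumption the overhead shrinks roughly in proportion to the kernel length, which is why it becomes negligible for longer-running kernels.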

4.4. Advantages of the Proposed Approach. Considering a CGR system implementing N different functionalities and partitioned into k different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and N hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed because the real switching activity is essential for correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed using the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models cannot properly determine which LRs would actually benefit from power gating.
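To give a feeling for the size of the mismatch, the sketch below recomputes a relative error for two Table 15 entries. The formula |Est − Real| / |Real| is an assumption, since this section does not define how the Err column is obtained, and it reproduces the tabulated values only approximately. The function name is hypothetical, and the numeric inputs are the Table 15 values for LR3 and LR7.

```python
def relative_error_pct(estimated: float, real: float) -> float:
    """Relative estimation error in percent, with respect to the real value."""
    return 100.0 * abs(estimated - real) / abs(real)


# (estimated, real) PG overhead without real switching activity, from Table 15.
samples = {
    "LR3": (-21.83, -6.95),   # Table 15 reports Err = 214.30
    "LR7": (-15.39, -2.64),   # Table 15 reports Err = 482.93
}

for name, (est, real) in samples.items():
    print(name, f"{relative_error_pct(est, real):.1f}")
```

Errors of several hundred percent are far larger than the gap between the power gating and clock gating estimates of a region, so a selection based on such estimates would not be reliable.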

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy using reports generated without the real switching activity.

| LR | PG overhead Est | PG overhead Real | Err |
|---|---|---|---|
| LR1 | −30.54 | −26.10 | 17.07 |
| LR2 | −0.19 | −1.99 | 90.22 |
| LR3 | −21.83 | −6.95 | 214.30 |
| LR4 | −0.12 | −0.67 | 80.97 |
| LR5 | −0.06 | −0.59 | 90.11 |
| LR6 | −0.07 | 0.56 | 86.60 |
| LR7 | −15.39 | −2.64 | 482.93 |
| LR8 | −0.38 | −0.02 | 1941.81 |

In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to:

(1) synthesize the baseline design without any power management support;

(2) synthesize one power gated design and one clock gated design for each LR;

(3) perform N different hardware simulations for the baseline design to retrieve the real switching activity of the system;

(4) perform N different hardware simulations for each power gated and clock gated design to retrieve their real switching activity;

(5) compare each power gated design and clock gated design, in the different operating conditions, with respect to the synthesized baseline CGR design.

Our flow requires only points 1 and 3. In numbers, it corresponds to one single synthesis and N hardware simulations. On the contrary, without our approach, 2 * k + 1 syntheses (k for power gating evaluation, k for clock gating evaluation, plus the baseline one) and N * (2 * k + 1) hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, is when a given region is fully combinatorial: this would save the effort related to its prospective clock gating evaluation.

Dealing with the presented use cases, for the FFT (Section 4.1) there are N = 4 different functionalities and k = 8 LRs. Among the latter, 3 are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), N = 7 and k = 13, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline designs) and 182 hardware simulations.
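The counts above follow directly from N, k, and the number of fully combinatorial regions. The short Python sketch below reproduces them. The function and parameter names are illustrative and do not belong to the MDC tool.

```python
from typing import Tuple


def exhaustive_effort(n_functionalities: int, n_regions: int,
                      n_combinatorial: int = 0) -> Tuple[int, int]:
    """(syntheses, simulations) needed when every candidate design is built.

    One power gated design per LR, one clock gated design per LR that is not
    fully combinatorial, plus the baseline; every synthesized design is then
    simulated once per functionality.
    """
    syntheses = n_regions + (n_regions - n_combinatorial) + 1
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities: int) -> Tuple[int, int]:
    """(syntheses, simulations) with the proposed flow: baseline design only."""
    return 1, n_functionalities


# FFT use case: N = 4, k = 8, 3 fully combinatorial LRs.
print(exhaustive_effort(4, 8, 3), proposed_effort(4))
# Zoom coprocessor: N = 7, k = 13, 1 fully combinatorial LR.
print(exhaustive_effort(7, 13, 1), proposed_effort(7))
```

With `n_combinatorial = 0` the first function reduces to the general 2 * k + 1 syntheses and N * (2 * k + 1) simulations given in the text.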

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate and customizing the datapath is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for tackling automated mapping problems. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow-to-hardware system composition, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work takes some steps forward both in the MPEG-RVC field, to which the MDC tool belongs, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (P.O.R. F.S.E., European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’01), pp. 642–649, March 2001.

[2] P. Meloni, G. Tuveri, L. Raffo et al., “System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 517–524, 2012.

[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the International Symposium on Computer Architecture, pp. 365–376, San Jose, Calif, USA, 2011.

[4] M. B. Taylor, “Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse,” in Proceedings of the 49th Annual Design Automation Conference (DAC ’12), pp. 1131–1136, San Francisco, Calif, USA, June 2012.

[5] F. Oboril, J. Ewert, and M. B. Tahoori, “High-resolution online power monitoring for modern microprocessors,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 265–268, March 2015.

[6] Synopsys, “Advanced low power techniques,” http://goo.gl/2FqEjf.

[7] S. Herbert and D. Marculescu, “Analysis of dynamic voltage/frequency scaling in chip-multiprocessors,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’07), pp. 38–43, August 2007.

[8] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” Transactions on Architecture and Code Optimization, vol. 8, no. 1, article 1, 2011.

[9] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, and D. M. Tullsen, “A comparison of core power gating strategies implemented in modern hardware,” in Proceedings of the Conference on Measurement and Modeling of Computer Systems, pp. 559–560, 2014.


[10] B. Jeff, “Advances in big.LITTLE technology for power and energy savings,” ARM White Paper, 2012.

[11] Power Forward Initiative, A Practical Guide to Low Power Design, 2009.

[12] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image Processing, vol. 9, no. 1, pp. 233–249, 2014.

[13] N. Carta, C. Sau, F. Palumbo, D. Pani, and L. Raffo, “A coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer tool,” in Proceedings of the 7th Conference on Design and Architectures for Signal and Image Processing (DASIP ’13), pp. 141–148, October 2013.

[14] N. Carta, C. Sau, D. Pani, F. Palumbo, and L. Raffo, “A coarse-grained reconfigurable approach for low-power spike sorting architectures,” in Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering (NER ’13), pp. 439–442, San Diego, Calif, USA, November 2013.

[15] D. Pani, F. Usai, L. Citi, and L. Raffo, “Real-time processing of tfLIFE neural signals on embedded DSP platforms: a case study,” in Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering (NER ’11), pp. 44–47, Cancun, Mexico, April 2011.

[16] N. Carta, P. Meloni, G. Tuveri, D. Pani, and L. Raffo, “A custom MPSoC architecture with integrated power management for real-time neural signal decoding,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 2, pp. 230–241, 2014.

[17] F. Palumbo, C. Sau, and L. Raffo, “Coarse-grained reconfiguration: dataflow-based power management,” IET Computers and Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.

[18] F. Palumbo, T. Fanni, C. Sau, and P. Meloni, “Power-awarness in coarse-grained reconfigurable multi-functional architectures: a dataflow based strategy,” Journal of Signal Processing Systems, 2016.

[19] D. Pani and L. Raffo, “Self-coordinated on-chip parallel computing: a swarm intelligence approach,” in Parallel and Distributed Computational Intelligence, pp. 91–112, Springer, Berlin, Germany, 2010.

[20] M. Yan, Z. Yang, L. Liu, and S. Li, “ProDFA: accelerating domain applications with a coarse-grained runtime reconfigurable architecture,” in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’12), pp. 834–839, December 2012.

[21] S. M. Carta, D. Pani, and L. Raffo, “Reconfigurable coprocessor for multimedia application domain,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 44, no. 1-2, pp. 135–152, 2006.

[22] V. V. Kumar and J. Lach, “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1, Article ID 079595, pp. 1–9, 2006.

[23] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’03), pp. 296–301, March 2003.

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net.

[44] Cadence, Low Power in Encounter RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


24 Journal of Electrical and Computer Engineering

Table 14 Zoom coprocessor at 45 nm CMOS technology power gating overhead estimation step and clock gating overhead estimation stepaccuracy

LR PG saving Net CG saving NetEst Real Err Err Est Real Err Err

LR1 minus5686 minus5699 023 047 minus5507 minus5501 013 053LR2 minus22990 minus22982 003 102 minus22063 minus22029 015 081LR3 minus6569 minus6566 004 113 minus6273 minus6262 018 091LR4 minus0946 minus0944 015 182 minus0919 minus0917 022 152LR5 mdash mdash mdash mdash mdash mdash mdash mdashLR6 minus0747 minus0748 025 025 minus0853 minus0833 026 159LR7 minus0955 minus0957 017 136 minus0935 minus0933 021 149LR8 minus7464 minus7460 006 136 minus6724 minus6714 016 066LR9 minus2475 minus2482 025 110 minus2350 minus2345 018 109LR10 minus5473 minus5470 004 109 minus5270 minus5261 017 091LR11 minus3422 minus3361 179 258 minus3257 minus3205 163 204LR12 minus4327 minus4330 008 119 minus4176 minus4169 017 098LR13 minus0587 minus0580 125 370 minus0623 minus0621 028 185

phase of the proposed model is a real use case but it is a smallsize design where the execution of the fastest kernels lasts 24clock cyclesThen the wake-up time overhead is in the worstcase almost 17 of the total execution time If we consider abigger and more complex real use case such as interpolationfiltering for motion compensation in High Efficiency VideoCoding [47] we would achieve the condition for neglectingthe dynamic power consumption of the sleep transistorsand the wake-up time overhead This application involves 2-dimensional filters working on subblocks of pictures belong-ing to the same video sequence The smallest block cor-responding to the fastest execution time has 8 times 8 pixelsIn this case 160 overall cycles are required to filter the wholeblock and the wake-up time overhead is just 25 of the totalexecution time

44 Advantages of the Proposed Approach Considering aCGR system implementing 119873 different functionalities andpartitioned into 119896 different LRs the proposed selectionalgorithm based on the power models embodied in (1) (3)(4) and (5) requires the synthesis of the baseline CGR design(without any power saving strategy applied) and119873 hardwaresimulations each one running a different functionality (ieexecuting the different DPNs provided as input to the MDCtool) The hardware simulations are needed since the realswitching activity is essential for correct dynamic powerestimation Table 15 targeting the FFT scenario and thepower gating implementation depicts the estimated and realpower overhead when estimations are performed adoptingthe default synthesis reports (without taking into account thereal switching activity) The estimation errors are extremelyhigh when the switching activity is neglected therefore theproposed models are not capable of properly determiningwhich LRs would actually benefit of power gating

In order to understand the advantages of the proposedapproach let us compute the effort needed to determine the

Table 15 FFT use case at 90 nm CMOS technology power gatingoverhead estimation step accuracy using reports generated withoutthe real switching activity

LR PG overheadEst Real Err

LR1 minus3054 minus2610 1707LR2 minus019 minus199 9022LR3 minus2183 minus695 21430LR4 minus012 minus067 8097LR5 minus006 minus059 9011LR6 minus007 056 8660LR7 minus1539 minus264 48293LR8 minus038 minus002 194181

optimal saving strategy for each region if our flow is notadopted It is required to

(1) synthesize the baseline design without any powermanagement support

(2) synthesize one power gated design and one clockgated design for each LR

(3) perform 119873 different hardware simulations for thebaseline design to retrieve the real switching activityof the system

(4) perform 119873 different hardware simulations for eachpower gated and clock gated design to retrieve theirreal switching activity

(5) compare each power gated design and clock gateddesign in the different operating conditions withrespect to the synthesized baseline CGR design

Our flow requires only point 1 and point 3 In numbers it cor-responds to one single synthesis and119873 hardware simulationsOn the contrary not using our approach 2 lowast 119896 + 1 synthesis

Journal of Electrical and Computer Engineering 25

(119896 for power gating evaluation 119896 for clock gating evaluationplus the baseline one) and119873lowast(2lowast119896+1) hardware simulationsare necessaryThe only simplification that may be done evenwithout adopting the proposed approach is when a givenregion is fully combinatorial This would save the effortrelated to its perspective clock gating evaluation

Dealing with the presented use cases for the FFT (Sec-tion 41) there are 119873 = 4 different functionalities and 119896 = 8LRs Among these latter 3 are fully combinatorial The pro-posed approach required one synthesis and 4 hardware sim-ulations rather than 14 synthesis (8 power gated LRs 5 clockgated LRs and the baseline design) and 56 hardware simula-tions (4 for each synthesized design) Considering the zoomcoprocessor (Section 42)119873 = 7 and 119896 = 13 with only 1 fullycombinatorial LR The proposed approach required one syn-thesis and 7 hardware simulations rather than 26 synthesis(13 power gated 12 clock gated and the baseline designs) and182 hardware simulations

5 Conclusions

In this paper we addressed the problem of power manage-ment in coarse-grained reconfigurable (CGR) systems Suchsystems are as suitable to accelerate multifunctional appli-cations as potentially energy inefficient In fact on a CGRsubstrate while a particular task is executed the resources notinvolved in the computation may potentially waste preciouspower if not properlymanagedOn top of that these architec-tures are also characterized by an intrinsic design difficultymapping several applications on the same substrate cus-tomizing the datapath is not so straightforward and requiresa deep knowledge of the target applications Dataflowmodelsof computation turned out to be very efficient for the develop-ment of automated mapping problems In our studies wehave exploited a dataflow-based approach to define a com-plete design suite for multifunctional CGR systems theMulti-Dataflow Composer (MDC) tool Besides automati-cally managing dataflow to hardware systems compositionMDC also supports the automated characterization of powerand clock gated platforms

The proposed work makes some steps further both in theMPEG-RVC field which the MDC tool belongs to and inthe definition of optimal power management strategies forCGR designs In this paper we have presented an automatedmethodology capable of estimating prior to any physicalimplementation the effectiveness and costs that power gatingor clock gating would have when implemented on top ofthe functional logic regions constituting a CGR systemThis methodology is based on static and dynamic powerestimation models that in a separate manner for each logicregion in the CGR design are capable of determining theoverhead of clock gating and power gating on the basis of thefunctional technological and architectural parameters of thebaseline CGR system These models and the correspondingestimation algorithm are applicable in any CGR scenario andare currently integrated in the MDC tool improving its basicfunctionality In fact MDC was normally applying a one-fit-to-all user-specified power reduction technique either clock

or power gating without any warranty of its effectiveness onthe different identified regions

By considering two different scenarios and adoptingdifferent ASIC technologies our assessments proved that theenhanced MDC flow is capable of guiding the designerstowards the definition of the optimal power managementsupport It is more efficient than the previous blindly appliedmethodology and the proposed models turned out to beextremely accurate Finally as demonstrated in Section 44the new flow drastically reduces the number of designs to besynthesized and simulated leading to saving both designereffort and computational time

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Governmentfor supporting her PhD scholarship (POR FSE EuropeanSocial Fund 2007ndash2013 Axis IV Human Resources)

References

[1] R Hartenstein ldquoA decade of reconfigurable computing avisionary retrospectiverdquo in Proceedings of the Design Automa-tion and Test in Europe Conference and Exhibition (DATE rsquo01)pp 642ndash649 March 2001

[2] P Meloni G Tuveri L Raffo et al ldquoSystem adaptivity andfault-tolerance in NoC-based MPSoCs the MADNESS projectapproachrdquo in Proceedings of the 15th Euromicro Conference onDigital System Design (DSD rsquo12) pp 517ndash524 2012

[3] H Esmaeilzadeh E Blem R St Amant K Sankaralingamand D Burger ldquoDark silicon and the end of multicore scalingrdquoin Proceedings of the International Symposium on ComputerArchitecture pp 365ndash376 San Jose Calif USA 2011

[4] M B Taylor ldquoIs dark silicon useful Harnessing the four horse-men of the coming dark silicon apocalypserdquo in Proceedings ofthe 49th Annual Design Automation Conference (DAC rsquo12) pp1131ndash1136 San Francisco Calif USA June 2012

[5] F Oboril J Ewert and M B Tahoori ldquoHigh-resolution onlinepower monitoring for modernmicroprocessorsrdquo in Proceedingsof the Conference on Design Automation and Test in Europe pp265ndash268 March 2015

[6] Synopsys ldquoAdvanced low power techniquesrdquo httpgoogl2FqEjf

[7] S Herbert and D Marculescu ldquoAnalysis of dynamic volt-agefrequency scaling in chip-multiprocessorsrdquo in Proceedingsof the International Symposium on Low Power Electronics andDesign (ISLPED rsquo07) pp 38ndash43 August 2007

[8] S Eyerman and L Eeckhout ldquoFine-grained DVFS using on-chip regulatorsrdquo Transactions on Architecture and Code Opti-mization vol 8 no 1 article 1 2011

[9] M Arora S Manne Y Eckert I Paul N Jayasena and D MTullsen ldquoA comparison of core power gating strategies imple-mented in modern hardwarerdquo in Proceedings of the Conferenceon Measurement and Modeling of Computer Systems pp 559ndash560 2014

26 Journal of Electrical and Computer Engineering

[10] B Jeff ldquoAdvances in bigLITTLE technology for power andenergy savingsrdquo ARMWhite Paper 2012

[11] Power Forward Initiative A Practical Guide to Low PowerDesign 2009

[12] F Palumbo N Carta D Pani P Meloni and L Raffo ldquoThemulti-dataflow composer tool generation of on-the-fly recon-figurable platformsrdquo Journal of Real-Time Image Processing vol9 no 1 pp 233ndash249 2014

[13] N Carta C Sau F Palumbo D Pani and L Raffo ldquoA coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer toolrdquo in Proceedings of the 7th Conference onDesign andArchitectures for Signal and Image Processing (DASIPrsquo13) pp 141ndash148 October 2013

[14] N Carta C Sau D Pani F Palumbo and L Raffo ldquoA coarse-grained reconfigurable approach for low-power spike sortingarchitecturesrdquo in Proceedings of the 6th International IEEEEMBS Conference on Neural Engineering (NER rsquo13) pp 439ndash442 San Diego Calif USA November 2013

[15] D Pani F Usai L Citi and L Raffo ldquoReal-time processing oftfLIFEneural signals on embeddedDSPplatforms a case studyrdquoin Proceedings of the 5th International IEEEEMBS Conferenceon Neural Engineering (NER rsquo11) pp 44ndash47 Cancun MexicoApril 2011

[16] N Carta P Meloni G Tuveri D Pani and L Raffo ldquoA customMPSoC architecture with integrated power management forreal-time neural signal decodingrdquo IEEE Journal on Emergingand Selected Topics in Circuits and Systems vol 4 no 2 pp 230ndash241 2014

[17] F Palumbo C Sau and L Raffo ldquoCoarse-grained reconfigura-tion dataflow-based power managementrdquo IET Computers andDigital Techniques vol 9 no 1 pp 36ndash48 2015

[18] F Palumbo T Fanni C Sau and PMeloni ldquoPower-awarness incoarse-grained reconfigurable multi-functional architectures adataflow based strategyrdquo Journal of Signal Processing Systems2016

[19] D Pani and L Raffo ldquoSelf-coordinated on-chip parallel com-puting a swarm intelligence approachrdquo in Parallel and Dis-tributed Computational Intelligence pp 91ndash112 Springer BerlinGermany 2010

[20] M Yan Z Yang L Liu and S Li ldquoProDFA accelerating domainapplications with a coarse-grained runtime reconfigurablearchitecturerdquo in Proceedings of the 18th IEEE InternationalConference on Parallel andDistributed Systems (ICPADS rsquo12) pp834ndash839 December 2012

[21] S M Carta D Pani and L Raffo ldquoReconfigurable coprocessorfor multimedia application domainrdquo Journal of VLSI SignalProcessing Systems for Signal Image and Video Technology vol44 no 1-2 pp 135ndash152 2006

[22] V V Kumar and J Lach ldquoHighly flexible multimode digitalsignal processing systems using adaptable components andcontrollersrdquo EURASIP Journal on Applied Signal Processing vol2006 no 1 Article ID 079595 pp 1ndash9 2006

[23] B Mei S Vernalde D Verkest H De Man and R LauwereinsldquoExploiting loop-level parallelism on coarse-grained reconfig-urable architectures using modulo schedulingrdquo in Proceedingsof the Design Automation and Test in Europe Conference andExhibition (DATE rsquo03) pp 296ndash301 March 2003

[24] F Palumbo D Pani L Raffo and S Secchi ldquoA surface tensionand coalescence model for dynamic distributed resources allo-cation in massively parallel processors on-chiprdquo in NatureInspired Cooperative Strategies for Optimization (NICSO 2007)

vol 129 of Studies in Computational Intelligence pp 335ndash345Springer Berlin Germany 2007

[25] G Ansaloni K Tanimura L Pozzi and N Dutt ldquoIntegratedkernel partitioning and scheduling for coarse-grained recon-figurable arraysrdquo IEEE Transactions on Computer-Aided Designof Integrated Circuits and Systems vol 31 no 12 pp 1803ndash18162012

[26] C C de Souza AM Lima G Araujo andN BMoreano ldquoThedatapath merging problem in reconfigurable systems complex-ity dual bounds and heuristic evaluationrdquo ACM Journal ofExperimental Algorithmics vol 10 no 22 Article ID 11806132005

[27] R Giorgi and A Scionti ldquoA scalable thread scheduling co-processor based on data-flow principlesrdquo Future GenerationComputer Systems vol 53 pp 100ndash108 2015

[28] L Verdoscia R Vaccaro and R Giorgi ldquoA clockless computingsystem based on the static dataflow paradigmrdquo in Proceedings ofthe 4th Workshop on Data-Flow Execution Models for ExtremeScale Computing (DFM rsquo14) pp 30ndash37 Edmonton CanadaAugust 2014

[29] C Sau L Fanni P Meloni L Raffo and F Palumbo ldquoReconfig-urable coprocessors synthesis in the MPEG-RVC domainrdquo inProceedings of the International Conference on ReConFigurableComputing and FPGAs (ReConFig rsquo15) pp 1ndash8 IEEE RivieraMaya Mexico December 2015

[30] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo in Proceedingsof the 15th Euromicro Conference on Digital System Design (DSDrsquo12) pp 216ndash225 September 2012

[31] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo Microproces-sors and Microsystems vol 37 no 8 pp 1002ndash1019 2013

[32] Y Zhang J Roivainen and A Mammela ldquoClock-gating inFPGAs a novel and comparative evaluationrdquo in Proceedings ofthe 9th Euromicro Conference onDigital SystemDesign Architec-tures Methods and Tools p 590 584 Dubrovnik CroatiaAugust-September 2006

[33] E Bezati S Casale-Brunet M Mattavelli and J W JanneckldquoCoarse grain clock gating of streaming applications in pro-grammable logic implementationsrdquo in Proceedings of the Elec-tronic System Level Synthesis Conference pp 1ndash6 2014

[34] Silicon Integration Initiative Si2 Common Power FormatSpecificationTM-Version 21 2014

[35] M Shafique L Bauer and J Henkel ldquoAdaptive energy manage-ment for dynamically reconfigurable processorsrdquo IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 33 no 1 pp 50ndash63 2014

[36] J Yi and J Kim ldquoPower modeling for digital circuits with clockgatingrdquo IEICE Electronics Express vol 12 no 24 Article ID20150817 2015

[37] H Xu R Vemuri and W-B Jone ldquoRun-time active leakagereduction by power gating and reverse body biasing an energyviewrdquo in Proceedings of the 26th IEEE International Conferenceon Computer Design (ICCD rsquo08) pp 618ndash625 IEEE October2008

[38] K Datta A Mukherjee G Cao et al ldquoCASPER embeddingpower estimation and hardware-controlled powermanagementin a cycle-accurate micro-architecture simulation platform formany-core multi-threading heterogeneous processorsrdquo Journalof Low Power Electronics and Applications vol 2 no 1 pp 30ndash68 2012

Journal of Electrical and Computer Engineering 27

[39] A Chhabra H Rawat M Jain et al ldquoFALPEM framework forarchitectural-level power estimation and optimization for largememory sub-systemsrdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 34 no 7 pp 1138ndash1142 2015

[40] S Li J H Ahn R D Strong J B Brockman DM Tullsen andN P Jouppi ldquoThe McPAT framework for multicore and many-core architectures simultaneously modeling power area andtimingrdquoACMTransactions on Architecture and Code Optimiza-tion vol 10 no 1 article 5 pp 1ndash29 2013

[41] D Zoni andW Fornaciari ldquoModeling DVFS and power-gatingactuators for cycle-accurate NoC-based simulatorsrdquo ACM Jour-nal on Emerging Technologies in Computing Systems vol 12 no3 article 27 pp 1ndash24 2015

[42] C Sau L Raffo F Palumbo E Bezati S Casale-Brunet andM Mattavelli ldquoAutomated design flow for coarse-grained reco-nfigurable platforms an RVC-CAL multi-standard decoderuse-caserdquo in Proceedings of the 14th International Conference onEmbedded Computer Systems Architectures Modeling and Sim-ulation (SAMOS rsquo14) pp 59ndash66 Agios Konstantinos GreeceJuly 2014

[43] Open RVC-CAL Compiler httporccsourceforgenet[44] Cadence Low Power in Encounter RRTL Compiler Product

Version 141 July 2014[45] J W Cooley and J W Tukey ldquoAn algorithm for the machine

calculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[46] Z Hu A Buyuktosunoglu V Srinivasan V Zyuban H Jacob-son and P Bose ldquoMicroarchitectural techniques for powergating of execution unitsrdquo in Proceedings of the InternationalSymposium on Low Power Electronics and Design (ISLPED rsquo04)pp 32ndash37 IEEE August 2004

[47] C M Diniz M Shafique S Bampi and J Henkel ldquoA recon-figurable hardware architecture for fractional pixel interpola-tion in high efficiency video codingrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol34 no 2 pp 238ndash251 2015

($k$ for power gating evaluation, $k$ for clock gating evaluation, plus the baseline one) and $N \times (2k + 1)$ hardware simulations are necessary. The only simplification that may be made, even without adopting the proposed approach, arises when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation.

For the FFT use case (Section 4.1), there are $N = 4$ different functionalities and $k = 8$ LRs, 3 of which are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). For the zoom coprocessor (Section 4.2), $N = 7$ and $k = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline design) and 182 hardware simulations.
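The counts reported above can be reproduced with a few lines of arithmetic. The following Python sketch is only an illustration: the function and parameter names are assumptions introduced here and are not part of the MDC tool. It assumes, as stated in the text, that each LR needs one power gated design, each noncombinatorial LR needs one clock gated design, and every synthesized design is simulated once per functionality.

```python
def exhaustive_cost(n_functionalities, n_regions, n_combinatorial=0):
    """Designs to synthesize and simulate without the estimation models.

    n_functionalities: N, number of functionalities in the CGR system
    n_regions:         k, number of logic regions (LRs)
    n_combinatorial:   LRs that are fully combinatorial (no clock gating)
    All names are hypothetical and chosen for this illustration only.
    """
    power_gated = n_regions                    # one design per LR
    clock_gated = n_regions - n_combinatorial  # sequential LRs only
    syntheses = power_gated + clock_gated + 1  # plus the baseline design
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_cost(n_functionalities):
    """With the proposed flow: baseline synthesis, one simulation per functionality."""
    return 1, n_functionalities


if __name__ == "__main__":
    for name, n, k, comb in [("FFT", 4, 8, 3), ("zoom", 7, 13, 1)]:
        print(name, exhaustive_cost(n, k, comb), proposed_cost(n))
```

With no combinatorial regions, `exhaustive_cost` reduces to the general $2k + 1$ syntheses and $N \times (2k + 1)$ simulations.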

5. Conclusions

In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications, but they are also potentially energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate, and customizing the datapath accordingly, is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very effective for automating this mapping. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing the composition of dataflow specifications into hardware systems, MDC also supports the automated characterization of power and clock gated platforms.

The proposed work advances both the MPEG-RVC field, to which the MDC tool belongs, and the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, determine the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all, user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.

By considering two different scenarios and adopting different ASIC technologies, our assessments showed that the enhanced MDC flow is capable of guiding designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Tiziana Fanni is grateful to Sardinia Regional Government for supporting her PhD scholarship (POR FSE, European Social Fund 2007–2013, Axis IV Human Resources).

References

[1] R Hartenstein ldquoA decade of reconfigurable computing avisionary retrospectiverdquo in Proceedings of the Design Automa-tion and Test in Europe Conference and Exhibition (DATE rsquo01)pp 642ndash649 March 2001

[2] P Meloni G Tuveri L Raffo et al ldquoSystem adaptivity andfault-tolerance in NoC-based MPSoCs the MADNESS projectapproachrdquo in Proceedings of the 15th Euromicro Conference onDigital System Design (DSD rsquo12) pp 517ndash524 2012

[3] H Esmaeilzadeh E Blem R St Amant K Sankaralingamand D Burger ldquoDark silicon and the end of multicore scalingrdquoin Proceedings of the International Symposium on ComputerArchitecture pp 365ndash376 San Jose Calif USA 2011

[4] M B Taylor ldquoIs dark silicon useful Harnessing the four horse-men of the coming dark silicon apocalypserdquo in Proceedings ofthe 49th Annual Design Automation Conference (DAC rsquo12) pp1131ndash1136 San Francisco Calif USA June 2012

[5] F Oboril J Ewert and M B Tahoori ldquoHigh-resolution onlinepower monitoring for modernmicroprocessorsrdquo in Proceedingsof the Conference on Design Automation and Test in Europe pp265ndash268 March 2015

[6] Synopsys ldquoAdvanced low power techniquesrdquo httpgoogl2FqEjf

[7] S Herbert and D Marculescu ldquoAnalysis of dynamic volt-agefrequency scaling in chip-multiprocessorsrdquo in Proceedingsof the International Symposium on Low Power Electronics andDesign (ISLPED rsquo07) pp 38ndash43 August 2007

[8] S Eyerman and L Eeckhout ldquoFine-grained DVFS using on-chip regulatorsrdquo Transactions on Architecture and Code Opti-mization vol 8 no 1 article 1 2011

[9] M Arora S Manne Y Eckert I Paul N Jayasena and D MTullsen ldquoA comparison of core power gating strategies imple-mented in modern hardwarerdquo in Proceedings of the Conferenceon Measurement and Modeling of Computer Systems pp 559ndash560 2014

26 Journal of Electrical and Computer Engineering

[10] B Jeff ldquoAdvances in bigLITTLE technology for power andenergy savingsrdquo ARMWhite Paper 2012

[11] Power Forward Initiative A Practical Guide to Low PowerDesign 2009

[12] F Palumbo N Carta D Pani P Meloni and L Raffo ldquoThemulti-dataflow composer tool generation of on-the-fly recon-figurable platformsrdquo Journal of Real-Time Image Processing vol9 no 1 pp 233ndash249 2014

[13] N Carta C Sau F Palumbo D Pani and L Raffo ldquoA coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer toolrdquo in Proceedings of the 7th Conference onDesign andArchitectures for Signal and Image Processing (DASIPrsquo13) pp 141ndash148 October 2013

[14] N Carta C Sau D Pani F Palumbo and L Raffo ldquoA coarse-grained reconfigurable approach for low-power spike sortingarchitecturesrdquo in Proceedings of the 6th International IEEEEMBS Conference on Neural Engineering (NER rsquo13) pp 439ndash442 San Diego Calif USA November 2013

[15] D Pani F Usai L Citi and L Raffo ldquoReal-time processing oftfLIFEneural signals on embeddedDSPplatforms a case studyrdquoin Proceedings of the 5th International IEEEEMBS Conferenceon Neural Engineering (NER rsquo11) pp 44ndash47 Cancun MexicoApril 2011

[16] N Carta P Meloni G Tuveri D Pani and L Raffo ldquoA customMPSoC architecture with integrated power management forreal-time neural signal decodingrdquo IEEE Journal on Emergingand Selected Topics in Circuits and Systems vol 4 no 2 pp 230ndash241 2014

[17] F Palumbo C Sau and L Raffo ldquoCoarse-grained reconfigura-tion dataflow-based power managementrdquo IET Computers andDigital Techniques vol 9 no 1 pp 36ndash48 2015

[18] F Palumbo T Fanni C Sau and PMeloni ldquoPower-awarness incoarse-grained reconfigurable multi-functional architectures adataflow based strategyrdquo Journal of Signal Processing Systems2016

[19] D Pani and L Raffo ldquoSelf-coordinated on-chip parallel com-puting a swarm intelligence approachrdquo in Parallel and Dis-tributed Computational Intelligence pp 91ndash112 Springer BerlinGermany 2010

[20] M Yan Z Yang L Liu and S Li ldquoProDFA accelerating domainapplications with a coarse-grained runtime reconfigurablearchitecturerdquo in Proceedings of the 18th IEEE InternationalConference on Parallel andDistributed Systems (ICPADS rsquo12) pp834ndash839 December 2012

[21] S M Carta D Pani and L Raffo ldquoReconfigurable coprocessorfor multimedia application domainrdquo Journal of VLSI SignalProcessing Systems for Signal Image and Video Technology vol44 no 1-2 pp 135ndash152 2006

[22] V V Kumar and J Lach ldquoHighly flexible multimode digitalsignal processing systems using adaptable components andcontrollersrdquo EURASIP Journal on Applied Signal Processing vol2006 no 1 Article ID 079595 pp 1ndash9 2006

[23] B Mei S Vernalde D Verkest H De Man and R LauwereinsldquoExploiting loop-level parallelism on coarse-grained reconfig-urable architectures using modulo schedulingrdquo in Proceedingsof the Design Automation and Test in Europe Conference andExhibition (DATE rsquo03) pp 296ndash301 March 2003

[24] F Palumbo D Pani L Raffo and S Secchi ldquoA surface tensionand coalescence model for dynamic distributed resources allo-cation in massively parallel processors on-chiprdquo in NatureInspired Cooperative Strategies for Optimization (NICSO 2007)

vol 129 of Studies in Computational Intelligence pp 335ndash345Springer Berlin Germany 2007

[25] G Ansaloni K Tanimura L Pozzi and N Dutt ldquoIntegratedkernel partitioning and scheduling for coarse-grained recon-figurable arraysrdquo IEEE Transactions on Computer-Aided Designof Integrated Circuits and Systems vol 31 no 12 pp 1803ndash18162012

[26] C C de Souza AM Lima G Araujo andN BMoreano ldquoThedatapath merging problem in reconfigurable systems complex-ity dual bounds and heuristic evaluationrdquo ACM Journal ofExperimental Algorithmics vol 10 no 22 Article ID 11806132005

[27] R Giorgi and A Scionti ldquoA scalable thread scheduling co-processor based on data-flow principlesrdquo Future GenerationComputer Systems vol 53 pp 100ndash108 2015

[28] L Verdoscia R Vaccaro and R Giorgi ldquoA clockless computingsystem based on the static dataflow paradigmrdquo in Proceedings ofthe 4th Workshop on Data-Flow Execution Models for ExtremeScale Computing (DFM rsquo14) pp 30ndash37 Edmonton CanadaAugust 2014

[29] C Sau L Fanni P Meloni L Raffo and F Palumbo ldquoReconfig-urable coprocessors synthesis in the MPEG-RVC domainrdquo inProceedings of the International Conference on ReConFigurableComputing and FPGAs (ReConFig rsquo15) pp 1ndash8 IEEE RivieraMaya Mexico December 2015

[30] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo in Proceedingsof the 15th Euromicro Conference on Digital System Design (DSDrsquo12) pp 216ndash225 September 2012

[31] L Jozwiak M Lindwer R Corvino et al ldquoASAM automaticarchitecture synthesis and application mappingrdquo Microproces-sors and Microsystems vol 37 no 8 pp 1002ndash1019 2013

[32] Y Zhang J Roivainen and A Mammela ldquoClock-gating inFPGAs a novel and comparative evaluationrdquo in Proceedings ofthe 9th Euromicro Conference onDigital SystemDesign Architec-tures Methods and Tools p 590 584 Dubrovnik CroatiaAugust-September 2006

[33] E Bezati S Casale-Brunet M Mattavelli and J W JanneckldquoCoarse grain clock gating of streaming applications in pro-grammable logic implementationsrdquo in Proceedings of the Elec-tronic System Level Synthesis Conference pp 1ndash6 2014

[34] Silicon Integration Initiative Si2 Common Power FormatSpecificationTM-Version 21 2014

[35] M Shafique L Bauer and J Henkel ldquoAdaptive energy manage-ment for dynamically reconfigurable processorsrdquo IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 33 no 1 pp 50ndash63 2014

[36] J Yi and J Kim ldquoPower modeling for digital circuits with clockgatingrdquo IEICE Electronics Express vol 12 no 24 Article ID20150817 2015

[37] H Xu R Vemuri and W-B Jone ldquoRun-time active leakagereduction by power gating and reverse body biasing an energyviewrdquo in Proceedings of the 26th IEEE International Conferenceon Computer Design (ICCD rsquo08) pp 618ndash625 IEEE October2008

[38] K Datta A Mukherjee G Cao et al ldquoCASPER embeddingpower estimation and hardware-controlled powermanagementin a cycle-accurate micro-architecture simulation platform formany-core multi-threading heterogeneous processorsrdquo Journalof Low Power Electronics and Applications vol 2 no 1 pp 30ndash68 2012

Journal of Electrical and Computer Engineering 27

[39] A Chhabra H Rawat M Jain et al ldquoFALPEM framework forarchitectural-level power estimation and optimization for largememory sub-systemsrdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 34 no 7 pp 1138ndash1142 2015

[40] S Li J H Ahn R D Strong J B Brockman DM Tullsen andN P Jouppi ldquoThe McPAT framework for multicore and many-core architectures simultaneously modeling power area andtimingrdquoACMTransactions on Architecture and Code Optimiza-tion vol 10 no 1 article 5 pp 1ndash29 2013

[41] D Zoni andW Fornaciari ldquoModeling DVFS and power-gatingactuators for cycle-accurate NoC-based simulatorsrdquo ACM Jour-nal on Emerging Technologies in Computing Systems vol 12 no3 article 27 pp 1ndash24 2015

[42] C Sau L Raffo F Palumbo E Bezati S Casale-Brunet andM Mattavelli ldquoAutomated design flow for coarse-grained reco-nfigurable platforms an RVC-CAL multi-standard decoderuse-caserdquo in Proceedings of the 14th International Conference onEmbedded Computer Systems Architectures Modeling and Sim-ulation (SAMOS rsquo14) pp 59ndash66 Agios Konstantinos GreeceJuly 2014

[43] Open RVC-CAL Compiler httporccsourceforgenet[44] Cadence Low Power in Encounter RRTL Compiler Product

Version 141 July 2014[45] J W Cooley and J W Tukey ldquoAn algorithm for the machine

calculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[46] Z Hu A Buyuktosunoglu V Srinivasan V Zyuban H Jacob-son and P Bose ldquoMicroarchitectural techniques for powergating of execution unitsrdquo in Proceedings of the InternationalSymposium on Low Power Electronics and Design (ISLPED rsquo04)pp 32ndash37 IEEE August 2004

[47] C M Diniz M Shafique S Bampi and J Henkel ldquoA recon-figurable hardware architecture for fractional pixel interpola-tion in high efficiency video codingrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol34 no 2 pp 238ndash251 2015

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 26: Research Article Modelling and Automated …downloads.hindawi.com/journals/jece/2016/4237350.pdfResearch Article Modelling and Automated Implementation of Optimal Power Saving Strategies

26 Journal of Electrical and Computer Engineering

[10] B Jeff ldquoAdvances in bigLITTLE technology for power andenergy savingsrdquo ARMWhite Paper 2012

[11] Power Forward Initiative A Practical Guide to Low PowerDesign 2009

[12] F Palumbo N Carta D Pani P Meloni and L Raffo ldquoThemulti-dataflow composer tool generation of on-the-fly recon-figurable platformsrdquo Journal of Real-Time Image Processing vol9 no 1 pp 233ndash249 2014

[13] N Carta C Sau F Palumbo D Pani and L Raffo ldquoA coarse-grained reconfigurable wavelet denoiser exploiting the multi-dataflow composer toolrdquo in Proceedings of the 7th Conference onDesign andArchitectures for Signal and Image Processing (DASIPrsquo13) pp 141ndash148 October 2013

[14] N Carta C Sau D Pani F Palumbo and L Raffo ldquoA coarse-grained reconfigurable approach for low-power spike sortingarchitecturesrdquo in Proceedings of the 6th International IEEEEMBS Conference on Neural Engineering (NER rsquo13) pp 439ndash442 San Diego Calif USA November 2013

[15] D Pani F Usai L Citi and L Raffo ldquoReal-time processing oftfLIFEneural signals on embeddedDSPplatforms a case studyrdquoin Proceedings of the 5th International IEEEEMBS Conferenceon Neural Engineering (NER rsquo11) pp 44ndash47 Cancun MexicoApril 2011

[16] N Carta P Meloni G Tuveri D Pani and L Raffo ldquoA customMPSoC architecture with integrated power management forreal-time neural signal decodingrdquo IEEE Journal on Emergingand Selected Topics in Circuits and Systems vol 4 no 2 pp 230ndash241 2014

[17] F Palumbo C Sau and L Raffo ldquoCoarse-grained reconfigura-tion dataflow-based power managementrdquo IET Computers andDigital Techniques vol 9 no 1 pp 36ndash48 2015

[18] F Palumbo T Fanni C Sau and PMeloni ldquoPower-awarness incoarse-grained reconfigurable multi-functional architectures adataflow based strategyrdquo Journal of Signal Processing Systems2016

[19] D Pani and L Raffo ldquoSelf-coordinated on-chip parallel com-puting a swarm intelligence approachrdquo in Parallel and Dis-tributed Computational Intelligence pp 91ndash112 Springer BerlinGermany 2010

[20] M Yan Z Yang L Liu and S Li ldquoProDFA accelerating domainapplications with a coarse-grained runtime reconfigurablearchitecturerdquo in Proceedings of the 18th IEEE InternationalConference on Parallel andDistributed Systems (ICPADS rsquo12) pp834ndash839 December 2012

[21] S M Carta D Pani and L Raffo ldquoReconfigurable coprocessorfor multimedia application domainrdquo Journal of VLSI SignalProcessing Systems for Signal Image and Video Technology vol44 no 1-2 pp 135ndash152 2006

[22] V V Kumar and J Lach ldquoHighly flexible multimode digitalsignal processing systems using adaptable components andcontrollersrdquo EURASIP Journal on Applied Signal Processing vol2006 no 1 Article ID 079595 pp 1ndash9 2006

[23] B Mei S Vernalde D Verkest H De Man and R LauwereinsldquoExploiting loop-level parallelism on coarse-grained reconfig-urable architectures using modulo schedulingrdquo in Proceedingsof the Design Automation and Test in Europe Conference andExhibition (DATE rsquo03) pp 296ndash301 March 2003

[24] F. Palumbo, D. Pani, L. Raffo, and S. Secchi, “A surface tension and coalescence model for dynamic distributed resources allocation in massively parallel processors on-chip,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), vol. 129 of Studies in Computational Intelligence, pp. 335–345, Springer, Berlin, Germany, 2007.

[25] G. Ansaloni, K. Tanimura, L. Pozzi, and N. Dutt, “Integrated kernel partitioning and scheduling for coarse-grained reconfigurable arrays,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.

[26] C. C. de Souza, A. M. Lima, G. Araujo, and N. B. Moreano, “The datapath merging problem in reconfigurable systems: complexity, dual bounds and heuristic evaluation,” ACM Journal of Experimental Algorithmics, vol. 10, no. 2.2, Article ID 1180613, 2005.

[27] R. Giorgi and A. Scionti, “A scalable thread scheduling co-processor based on data-flow principles,” Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.

[28] L. Verdoscia, R. Vaccaro, and R. Giorgi, “A clockless computing system based on the static dataflow paradigm,” in Proceedings of the 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM ’14), pp. 30–37, Edmonton, Canada, August 2014.

[29] C. Sau, L. Fanni, P. Meloni, L. Raffo, and F. Palumbo, “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig ’15), pp. 1–8, IEEE, Riviera Maya, Mexico, December 2015.

[30] L. Jozwiak, M. Lindwer, R. Corvino, et al., “ASAM: automatic architecture synthesis and application mapping,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 216–225, September 2012.

[31] L. Jozwiak, M. Lindwer, R. Corvino, et al., “ASAM: automatic architecture synthesis and application mapping,” Microprocessors and Microsystems, vol. 37, no. 8, pp. 1002–1019, 2013.

[32] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: a novel and comparative evaluation,” in Proceedings of the 9th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 584–590, Dubrovnik, Croatia, August-September 2006.

[33] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “Coarse grain clock gating of streaming applications in programmable logic implementations,” in Proceedings of the Electronic System Level Synthesis Conference, pp. 1–6, 2014.

[34] Silicon Integration Initiative, Si2 Common Power Format Specification™, Version 2.1, 2014.

[35] M. Shafique, L. Bauer, and J. Henkel, “Adaptive energy management for dynamically reconfigurable processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 1, pp. 50–63, 2014.

[36] J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE Electronics Express, vol. 12, no. 24, Article ID 20150817, 2015.

[37] H. Xu, R. Vemuri, and W.-B. Jone, “Run-time active leakage reduction by power gating and reverse body biasing: an energy view,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 618–625, IEEE, October 2008.

[38] K. Datta, A. Mukherjee, G. Cao, et al., “CASPER: embedding power estimation and hardware-controlled power management in a cycle-accurate micro-architecture simulation platform for many-core multi-threading heterogeneous processors,” Journal of Low Power Electronics and Applications, vol. 2, no. 1, pp. 30–68, 2012.


[39] A. Chhabra, H. Rawat, M. Jain, et al., “FALPEM: framework for architectural-level power estimation and optimization for large memory sub-systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.

[40] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “The McPAT framework for multicore and manycore architectures: simultaneously modeling power, area, and timing,” ACM Transactions on Architecture and Code Optimization, vol. 10, no. 1, article 5, pp. 1–29, 2013.

[41] D. Zoni and W. Fornaciari, “Modeling DVFS and power-gating actuators for cycle-accurate NoC-based simulators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 12, no. 3, article 27, pp. 1–24, 2015.

[42] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated design flow for coarse-grained reconfigurable platforms: an RVC-CAL multi-standard decoder use-case,” in Proceedings of the 14th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS ’14), pp. 59–66, Agios Konstantinos, Greece, July 2014.

[43] Open RVC-CAL Compiler, http://orcc.sourceforge.net/.

[44] Cadence, Low Power in Encounter® RTL Compiler, Product Version 14.1, July 2014.

[45] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[46] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED ’04), pp. 32–37, IEEE, August 2004.

[47] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.


Page 27: Research Article Modelling and Automated …downloads.hindawi.com/journals/jece/2016/4237350.pdfResearch Article Modelling and Automated Implementation of Optimal Power Saving Strategies

Journal of Electrical and Computer Engineering 27

[39] A Chhabra H Rawat M Jain et al ldquoFALPEM framework forarchitectural-level power estimation and optimization for largememory sub-systemsrdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 34 no 7 pp 1138ndash1142 2015

[40] S Li J H Ahn R D Strong J B Brockman DM Tullsen andN P Jouppi ldquoThe McPAT framework for multicore and many-core architectures simultaneously modeling power area andtimingrdquoACMTransactions on Architecture and Code Optimiza-tion vol 10 no 1 article 5 pp 1ndash29 2013

[41] D Zoni andW Fornaciari ldquoModeling DVFS and power-gatingactuators for cycle-accurate NoC-based simulatorsrdquo ACM Jour-nal on Emerging Technologies in Computing Systems vol 12 no3 article 27 pp 1ndash24 2015

[42] C Sau L Raffo F Palumbo E Bezati S Casale-Brunet andM Mattavelli ldquoAutomated design flow for coarse-grained reco-nfigurable platforms an RVC-CAL multi-standard decoderuse-caserdquo in Proceedings of the 14th International Conference onEmbedded Computer Systems Architectures Modeling and Sim-ulation (SAMOS rsquo14) pp 59ndash66 Agios Konstantinos GreeceJuly 2014

[43] Open RVC-CAL Compiler httporccsourceforgenet[44] Cadence Low Power in Encounter RRTL Compiler Product

Version 141 July 2014[45] J W Cooley and J W Tukey ldquoAn algorithm for the machine

calculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[46] Z Hu A Buyuktosunoglu V Srinivasan V Zyuban H Jacob-son and P Bose ldquoMicroarchitectural techniques for powergating of execution unitsrdquo in Proceedings of the InternationalSymposium on Low Power Electronics and Design (ISLPED rsquo04)pp 32ndash37 IEEE August 2004

[47] C M Diniz M Shafique S Bampi and J Henkel ldquoA recon-figurable hardware architecture for fractional pixel interpola-tion in high efficiency video codingrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol34 no 2 pp 238ndash251 2015
