A Hybrid Energy-Estimation Technique for Extensible Processors. Fei, Y.; Ravi, S.; Raghunathan, A.; Jha, N.K. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 23, Issue 5, Pages 652-664, May 2004.


A Hybrid Energy-Estimation Technique for Extensible Processors

Fei, Y.; Ravi, S.; Raghunathan, A.; Jha, N.K.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Volume: 23  Issue: 5

Pages: 652-664

May 2004

112/04/18 A Hybrid Energy-Estimation Technique for Extensible Processor 2/24

Abstract

In this paper, we present an efficient and accurate methodology for estimating the energy consumption of application programs running on extensible processors. Extensible processors, which are getting increasingly popular in embedded system design, allow a designer to customize a base processor core through instruction set extensions. Existing processor energy macromodeling techniques are not applicable to extensible processors, since they assume that the instruction set architecture as well as the underlying structural description of the micro-architecture remain fixed. Our solution to the above problem is a hybrid energy macromodel suitably parameterized to estimate the energy consumption of an application running on the corresponding application-specific extended processor instance, which incorporates any custom instruction extension. Such a characterization is facilitated by careful selection of macromodel parameters/variables that can capture both the functional and structural aspects of the execution of a program on an extensible processor.


Abstract (cont.)

Another feature of the proposed energy characterization flow is the use of regression analysis to build the macromodel. Regression analysis allows for in-situ characterization, thus allowing arbitrary test programs to be used during macromodel construction. We validated the proposed methodology by characterizing the energy consumption of a state-of-the-art extensible processor (Tensilica’s Xtensa). We used the macromodel to analyze the energy consumption of several benchmark applications with custom instructions. The mean absolute error in the macromodel estimates is only 3.3%, when compared to the energy values obtained by a commercial tool operating on the synthesized register-transfer level (RTL) description of the custom processor. Our approach achieves an average speedup of three orders of magnitude over the commercial RTL energy estimator. Our experiments show that the proposed methodology also achieves good relative accuracy, which is essential in energy optimization studies. Hence, our technique is both efficient and accurate.


Outline

• What’s the problem
• Introduction & related work
• Extensible processor energy macromodel requirements
• Proposed energy estimation methodology
• Experimental results and evaluation
• Conclusions


What’s the Problem

• The existing processor energy estimation framework is impractical for the energy optimization done in the ASIP design cycle
  - The extension to the base processor ISA is not fixed
  - The number of configurations/extensions is large
• Energy optimization studies need a fast and accurate energy estimate of an application running on an extensible processor for each candidate configuration


Related Work

Structural macromodeling

• Characterizes the energy consumption of the processor's constituent hardware modules

$$E = \sum_i E_{m_1,i}(\text{bit transition}) + \sum_i E_{m_2,i}(\text{bit transition}) + \cdots + \sum_i E_{m_k,i}(\text{bit transition})$$

where $E_{m_1,i}(\text{bit transition})$ denotes the energy per access of module 1.

• Advantage: high accuracy
• Disadvantages:
  1) Low efficiency (RTL simulation of a processor is extremely slow)
  2) Requires an RTL hardware description of the processor
• Suitable for energy estimation of a processor core


Related Work (cont.)

Instruction-level macromodeling

• Characterizes the energy consumption of each instruction of the processor

$$E = E_{IC_1} \cdot Cyc_{IC_1} + E_{IC_2} \cdot Cyc_{IC_2} + E_{IC_3} \cdot Cyc_{IC_3} + \cdots + E_{IC_k} \cdot Cyc_{IC_k}$$

where $E_{IC_1}$ denotes the average energy consumed by instruction class 1 and $Cyc_{IC_1}$ denotes the number of cycles taken by instruction class 1.

• The energy coefficient $E_{IC_1}$ is acquired by actual measurement of a chip implementation
• Advantage: high efficiency (uses an ISS to yield the energy estimate)
• Disadvantages:
  1) Low accuracy
  2) Requires an actual chip implementation, which is infeasible for power tradeoff studies early in the design cycle
• Suitable for energy estimation of software on a fixed processor architecture
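The instruction-level formula can be illustrated with a minimal Python sketch. The class names, the coefficient values, and the function name `instruction_level_energy` are hypothetical, chosen only for illustration; real coefficients would come from measurement or characterization.

```python
# Hypothetical per-cycle energy coefficients (arbitrary units) for three
# instruction classes. These are illustrative values, not measured data.
ENERGY_PER_CYCLE = {"arith": 1.0, "load": 1.5, "branch": 2.0}


def instruction_level_energy(cycles_per_class, energy_per_cycle=ENERGY_PER_CYCLE):
    """Return E = sum over classes k of E_ICk * Cyc_ICk."""
    return sum(energy_per_cycle[cls] * cyc for cls, cyc in cycles_per_class.items())


# Cycle counts per class, as an instruction set simulator (ISS) might report them.
cycles = {"arith": 100, "load": 40, "branch": 10}
total = instruction_level_energy(cycles)
print(total)
```

Because the estimate needs only per-class cycle counts, it can be produced from an ISS run without any RTL simulation, which is where the efficiency of this approach comes from.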


Related Work (cont.)

Statistical analysis and prediction macromodeling

• Energy coefficients are calculated with regression analysis to build the macromodel

$$E_i = C_1 M_{1,i} + C_2 M_{2,i} + \cdots + C_k M_{k,i} + \Delta_i, \quad i = 1, 2, \ldots, n$$

where the total energy consumption $E_i$ is the dependent variable, the macromodel parameters $M_{1,i}, \ldots, M_{k,i}$ are the independent variables, and $\Delta_i$ denotes the inaccuracy.

• A set of given observations $(E_i, M_{1,i}, \ldots, M_{k,i})$, $i = 1, 2, \ldots, n$, is used to find the best-fitting energy coefficients $C_1, C_2, \ldots, C_k$
• Energy macromodel generation

$$\hat{E} = \hat{C}_1 M_1 + \hat{C}_2 M_2 + \cdots + \hat{C}_k M_k$$

where $\hat{C}_1, \ldots, \hat{C}_k$ denote the estimates of the energy coefficients, $\hat{E}$ denotes the estimate of the total energy consumption, and the macromodel parameters $M_1, \ldots, M_k$ are observable during instruction set simulation (ISS).
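As a worked form of the fitting step, and assuming ordinary least squares is the regression criterion (the slide does not name the exact method), the coefficient estimates minimize the sum of squared inaccuracies. Here $M$ is taken to be the $n \times k$ matrix whose row $i$ holds $M_{1,i}, \ldots, M_{k,i}$, and $E$ the column vector of the $n$ observed energies; the closed form holds only when $M^{\top} M$ is invertible.

```latex
\hat{C} = \arg\min_{C} \sum_{i=1}^{n} \Bigl( E_i - \sum_{j=1}^{k} C_j M_{j,i} \Bigr)^{2}
\quad\Longrightarrow\quad
\hat{C} = (M^{\top} M)^{-1} M^{\top} E
```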


Paper Overview and Contributions

Hybrid energy macromodeling

• Instruction-level macromodeling for the base processor
• Structural macromodeling for the custom hardware extensions
• Regression macromodeling for energy characterization

Contributions

• Energy consumption can be determined simply by instruction set simulation
• Combines the efficiency of instruction-level approaches with the accuracy of structural approaches
• Only needs the custom instruction descriptions
• Doesn't require the custom processor to be synthesized
• This is the only work that evaluates the energy/performance tradeoff among candidate custom instructions for an extensible processor early in the design cycle


Extensible Xtensa Processor

• Xtensa’s ISA consists of a basic set of instructions plus a set of configurable and extensible options
• Extensibility is achieved by specifying application-specific functionality through custom instructions
  - The behavior of a custom instruction is described using the TIE (Tensilica Instruction Extension) language
  - TIE is independent of the processor’s pipeline
  - Only the semantics of the instructions need to be described, as if they consisted of only combinational logic
• The TIE compiler automatically derives
  - The hardware implementation of the custom instructions
  - The corresponding software development kit for the configuration: ANSI C/C++ compiler, linker, assembler, debugger, and cycle-accurate instruction set simulator (ISS)


Example Containing Three Custom Instructions

• user register statement: specifies the custom state register and indices
• iclass statement: defines a new instruction class with one or multiple custom instructions
• semantic statement: describes the behavior of the instruction class
• schedule statement (used for multiple-cycle instructions): schedules the operation sequence of the custom instruction
  - Needs ars and art at the beginning of the first cycle
  - Needs ACCU at the beginning of the second cycle
  - Produces the new ACCU at the end of the second cycle


Partial Architecture of an Extended Processor

• Augmented with custom hardware to implement three custom instructions: MULT, MAC, and CUS
• MULT and MAC perform their functionality using shared custom hardware (which depends on the base processor operand buses)
  - A multiplier (X), a multiplexer (MUX1), and an adder (+1)
• CUS accesses custom registers CR0…CR2 (which is independent of the base processor operand buses)

[Figure: partial architecture of the extended processor; labels include temp1, temp2, and ACCU]


Snapshot of Dynamic Execution of a Program

• For each instruction, the top horizontal bar lists the sequence of processor events dictated by its execution
• The bottom bar depicts the side effects in either the base processor or the custom hardware
  - Execution of the base processor instruction add activates custom hardware (X, MUX1, +1) in the second cycle
  - Execution of the custom instructions (I2 and I3) activates base processor hardware (ALU) in the second cycle
  - The side effects occur because the custom hardware and the ALU of the base processor share the same operand buses


Different Factors of the Energy Macromodel

• Energy consumed by base processor instructions on the base processor core
• Energy dependency on inter-instruction correlation and other nonideal features (such as stalls, cache misses, etc.)
• Energy consumed by custom instructions on the custom hardware
  - Only the custom hardware computation energy
  - The second box in the top bar of I2, I3, I4
• Interplay between the base processor and the custom hardware
  - Active energy of custom hardware owing to base processor instructions: computation side effect in the EXE stage (the bottom bar of instruction I1)
  - Active energy of base processor hardware owing to custom instructions: computation side effect in the EXE stage (the bottom bar of instructions I2 and I3)
  - Involvement of the base processor in other pipeline stages: the RdReg, Wait, WrReg, and WrCR events in the top bar of instructions I2, I3, I4


Extensible Processor Energy Estimation Flowchart

Characterization Flow

• Constructing the macromodel template: $E = E_0 X_0 + E_1 X_1 + \cdots + E_n X_n$
  - Expresses the energy consumption (dependent variable) as a function of the characteristic parameters (independent variables)
  - $E_0, \ldots, E_n$ are constants called energy coefficients
  - $X_1, \ldots, X_n$ are chosen from both the instruction-level and structural domains
• The test program suite incorporates custom instructions to cover all the custom HW library components
• Regression analysis requires knowledge of both the dependent variable and the independent variables
  - Steps 3-7 repeat for all the test programs
• Regression analysis finds the estimates of the energy coefficients (energy macromodel construction complete)


Extensible Processor Energy Estimation Flowchart

Estimation Flow

• Step 9 gathers the instruction-level macromodel parameter values
  - Instruction-level execution statistics
• Step 10 gathers the structural macromodel parameter values
  - The activation of custom hardware
• The parameter values are fed to the energy macromodel to yield the energy estimate
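The final step of the estimation flow can be sketched in Python as follows, assuming the fitted macromodel is stored as a mapping from parameter names to energy coefficients. The function `estimate_energy`, the parameter names, and every numeric value are hypothetical and serve only to show how the two kinds of parameter values are combined.

```python
def estimate_energy(coefficients, instr_params, struct_params):
    """Feed instruction-level and structural parameter values to a linear macromodel."""
    params = {**instr_params, **struct_params}
    missing = set(coefficients) - set(params)
    if missing:
        raise KeyError(f"missing macromodel parameters: {sorted(missing)}")
    return sum(coefficients[name] * params[name] for name in coefficients)


# Hypothetical fitted coefficients (arbitrary energy units).
coefficients = {"Cyc_arith": 2.0, "Cyc_ld": 3.0, "Cyc_1": 0.5}
# Step 9 (sketch): instruction-level execution statistics.
instr_params = {"Cyc_arith": 10, "Cyc_ld": 4}
# Step 10 (sketch): cycles in which a custom hardware category is active.
struct_params = {"Cyc_1": 8}

energy = estimate_energy(coefficients, instr_params, struct_params)
print(energy)
```

Raising an error on a missing parameter is a design choice of this sketch: silently treating it as zero would hide an incomplete statistics-gathering step.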


Energy Macromodel Template Generation

The macromodel template combines two linear terms:
- $E_{ins}$ is a linear function of instruction-level parameters and depicts the energy consumed on the base processor
- $E_{struc}$ is a linear function of structural parameters and depicts the energy consumed on the custom hardware

Instruction-level macromodel parameters

• Reflect the usage of the base processor core due to either base processor or custom instructions
• Energy components of the base processor core:
  1) Energy of the base processor owing to base processor instructions. $E_{arith}, \ldots, E_{br\_utk}$ represent the average energy consumption of each instruction class, and $Cyc_{arith}, \ldots, Cyc_{br\_utk}$ represent the number of cycles taken by each instruction class.
  2) Energy due to inter-instruction correlation and other nonideal features. The macromodel parameters $Num_{i}, \ldots, Num_{interlock}$ denote the number of times each nonideal case occurs.
  3) Energy consumption in the base processor imposed by custom instructions (energy consumption in the four pipeline stages other than the EXE stage). The macromodel parameter $Cyc_{side\_tie}$ accounts for the number of cycles taken by all custom instructions.

$$E_{ins} = E_{arith} Cyc_{arith} + E_{ld} Cyc_{ld} + E_{st} Cyc_{st} + E_{j} Cyc_{j} + E_{br\_tk} Cyc_{br\_tk} + E_{br\_utk} Cyc_{br\_utk} + E_{i} Num_{i} + E_{d} Num_{d} + E_{uncache} Num_{uncache} + E_{interlock} Num_{interlock} + E_{side\_tie} Cyc_{side\_tie}$$
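The Python sketch below shows one way the instruction-level parameter values might be accumulated from a simulator trace and then combined into $E_{ins}$. The trace record layout, the event names, the reading of `d` as a data-cache miss, and all coefficient values are assumptions made for illustration; an actual ISS reports statistics in its own format.

```python
from collections import Counter

# Hypothetical trace records: (instruction class, cycles taken, event or None).
TRACE = [
    ("arith", 1, None),
    ("ld", 1, "d"),             # "d" is assumed here to mean a data-cache miss
    ("br_tk", 3, None),         # taken branch
    ("tie", 2, None),           # custom (TIE) instruction
    ("arith", 1, "interlock"),
]


def instruction_level_parameters(trace):
    """Accumulate Cyc_* and Num_* macromodel parameter values from a trace."""
    params = Counter()
    for cls, cycles, event in trace:
        # Cycles of custom instructions are charged to the side-effect parameter.
        key = "Cyc_side_tie" if cls == "tie" else f"Cyc_{cls}"
        params[key] += cycles
        if event is not None:
            params[f"Num_{event}"] += 1
    return params


def e_ins(params, coeffs):
    """Sum coefficient * parameter value over the observed parameters."""
    return sum(coeffs[name] * value for name, value in params.items())


# Hypothetical energy coefficients (arbitrary units).
COEFFS = {"Cyc_arith": 1.0, "Cyc_ld": 2.0, "Cyc_br_tk": 1.5,
          "Cyc_side_tie": 0.5, "Num_d": 10.0, "Num_interlock": 4.0}

params = instruction_level_parameters(TRACE)
energy = e_ins(params, COEFFS)
print(dict(params), energy)
```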


Energy Macromodel Template Generation

Structural macromodel parameters

• Reflect the usage of the custom hardware extensions due to either base processor or custom instructions
• The macromodel parameters $Cyc_1, \ldots, Cyc_{10}$ denote the number of cycles in which each custom hardware component category is active
• The energy coefficients $E_1, \ldots, E_{10}$ represent the average energy consumption of each custom hardware component category
• Energy components of the custom hardware extensions:
  1) Custom functional blocks are activated when custom instructions execute
  2) Custom functional blocks can also be activated when base processor instructions are running, because the side effect of sharing the same operand buses still affects the custom hardware
• Dynamic resource usage analysis of the execution trace identifies the activated custom functional blocks (HW components) for each instruction
• The custom hardware energy consumption is expressed as:

$$E_{struc} = E_1 Cyc_1 + E_2 Cyc_2 + E_3 Cyc_3 + \cdots + E_{10} Cyc_{10}$$

Note: the structural macromodel parameters should cover all the components present in the custom hardware library (10 component categories in this paper).
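Dynamic resource usage analysis can be sketched in Python as a pass over the execution trace that counts active cycles per component category. The activation table, the category names, and the coefficients are hypothetical. The sketch also assumes, as a simplification, that each instruction activates its components for exactly one cycle.

```python
from collections import Counter

# Hypothetical mapping from instruction mnemonic to the custom hardware
# component categories it activates. The base processor instruction "add"
# appears here because shared operand buses can activate custom hardware.
ACTIVATES = {
    "add": ["multiplier", "mux", "adder"],
    "mult": ["multiplier", "mux"],
    "mac": ["multiplier", "mux", "adder"],
    "load": [],
}


def structural_cycles(trace, activates=ACTIVATES):
    """Count, per component category, the cycles in which it is active."""
    cycles = Counter()
    for mnemonic in trace:
        cycles.update(activates.get(mnemonic, []))
    return cycles


def structural_energy(cycles, energy_per_cycle):
    """Return E_struc = sum over categories j of E_j * Cyc_j."""
    return sum(energy_per_cycle[cat] * n for cat, n in cycles.items())


trace = ["load", "add", "mult", "mac"]
cycles = structural_cycles(trace)
# Hypothetical energy per active cycle for each category (arbitrary units).
e_struc = structural_energy(cycles, {"multiplier": 4.0, "mux": 1.0, "adder": 2.0})
print(dict(cycles), e_struc)
```

Keeping the activation table separate from the counting loop means a new custom instruction only adds a table entry, provided its components belong to categories that were already characterized.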


Macromodel Fitting Through Regression Analysis

• Determining the energy coefficients in the macromodel template
• Solving the linear matrix equation

$$M_{n \times 21} \, C_{21 \times 1} = E_{n \times 1}$$

  - $E$ is an $n \times 1$ column vector holding the energy consumption data of the $n$ test programs
  - $M$ is an $n \times 21$ matrix holding the corresponding values of the macromodel parameters
  - $C$ is the energy coefficient vector corresponding to $\{E_{arith}, E_{ld}, E_{st}, E_{j}, E_{br\_tk}, E_{br\_utk}, E_{i}, E_{d}, E_{uncache}, E_{interlock}, E_{side\_tie}, E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8, E_9, E_{10}\}$
• Regression yields the estimate $\hat{C}$ of the energy coefficient vector $C$ such that the mean square error is minimized ($\hat{E}$ denotes the estimate of the total energy consumption $E$)
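A minimal pure-Python sketch of the fitting step follows, assuming ordinary least squares solved through the normal equations. The slides do not state which solver the authors used, so the function `least_squares` and the small data set (2 parameters rather than 21) are illustrative assumptions.

```python
def least_squares(M, E):
    """Solve the normal equations (M^T M) C = M^T E by Gauss-Jordan elimination."""
    n, k = len(M), len(M[0])
    # Augmented matrix [M^T M | M^T E], one row per coefficient.
    A = [[sum(M[r][i] * M[r][j] for r in range(n)) for j in range(k)]
         + [sum(M[r][i] * E[r] for r in range(n))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("M^T M is singular: parameters are not independently excited")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data: 4 test programs, 2 macromodel parameters.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_C = [2.0, 3.0]
# Noise-free energies generated from the known coefficients.
E = [sum(c * m for c, m in zip(true_C, row)) for row in M]
C_hat = least_squares(M, E)
print(C_hat)
```

The singularity check reflects a practical requirement on the test program suite: there must be at least as many test programs as coefficients, and together they must exercise every parameter, otherwise the coefficients cannot be separated.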


Energy Coefficients of the Xtensa Processor

Energy consumption for each base processor instruction category per cycle

Energy consumption for side-effect per cycle

Energy consumption for execution-time effects per miss/per-interlock

Energy consumption for different custom hardware components per cycle


Absolute Accuracy Examination

Application Energy Estimates

• The maximum estimation error is 8.5%
• The average absolute error is only 3.3%
• The proposed energy estimation methodology is very fast
  - WattWatcher needs several more hours for energy estimation (RTL description generation + RTL simulation + power estimation using WattWatcher)
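The two accuracy figures quoted above are standard summary statistics, and the Python sketch below shows how such numbers are typically computed from macromodel estimates and reference values. The energy values are hypothetical and are not data from the paper.

```python
def percent_errors(estimates, references):
    """Signed percentage error of each estimate against its reference value."""
    return [100.0 * (est - ref) / ref for est, ref in zip(estimates, references)]


def mean_absolute_error(errors):
    """Average of the absolute percentage errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical energy values (arbitrary units), one per benchmark application.
reference = [100.0, 200.0, 400.0]   # e.g. from an RTL-level power estimator
estimate = [105.0, 190.0, 400.0]    # e.g. from the macromodel

errors = percent_errors(estimate, reference)
mae = mean_absolute_error(errors)
max_err = max(abs(e) for e in errors)
print(errors, mae, max_err)
```

Taking absolute values before averaging matters: positive and negative errors would otherwise cancel and make the model look more accurate than it is.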


Absolute Accuracy Examination (cont.)

• Energy consumption due to custom hardware can be significant
• The accuracy of the macromodel is high for both the base processor and the custom hardware


Relative Accuracy Examination

• Our macromodel achieves good relative accuracy
• The proposed energy estimation methodology offers high relative accuracy with low effort (no custom processor generation, no RTL simulation)
• Therefore, it is highly suitable for energy optimization studies


Conclusions

• Presented an efficient and accurate energy estimation methodology for extensible processors
  - High efficiency comes from requiring only instruction-set-simulation-based analysis of the application
  - High accuracy comes from dynamic analysis of the custom hardware usage pattern
• Although it speeds up energy estimation, it still has good absolute accuracy (the average absolute error is only 3.3%) and also achieves high relative accuracy