smartdpm: machine learning-based dynamic power …mason.gmu.edu/~spudukot/files/jolpe18.pdf · p....

IP 129174252250 On Tue 20 Aug 2019 224713Copyright American Scientific Publishers

Delivered by Ingenta

Copyright copy 2018 American Scientific PublishersAll rights reservedPrinted in the United States of America

Journal ofLow Power Electronics

Vol 14 460ndash474 2018

SmartDPM Machine Learning-Based Dynamic PowerManagement for Multi-Core Microprocessors

P D Sai Manoj1lowast Axel Jantsch2 and Muhammad Shafique31Department of Electrical and Computer Engineering George Mason University Fairfax VA 22030 USA

2Institute for Computer Technology TU Wien Vienna 1040 Austria3Institute for Informatics TU Wien Vienna 1040 Austria

(Received 3 September 2018 Accepted 18 September 2018)

To address the power management challenge in multi-core microprocessors we present alightweight machine learning based dynamic power management (SmartDPM ) scheme in which thevoltagendashfrequency levels of the cores are dynamically adjusted along with online learning basedworkload prediction in an observer-controller loop To enable scalability our SmartDPM employs aper-application autonomous power management policy in which online machine learning principlesare employed for predicting the workload and capturing sporadic variations under the constraintsof accurate yet lightweight Further applications are assigned appropriate voltagendashfrequency leveltowards an efficient power management The learning helps in dynamically reducing predictionerror Compared to the non-DVFS implementation SmartDPM achieves nearly 35 power savingand nearly 15 higher power savings on average compared to the existing machine learning basedpower management schemes for a microprocessor with up to 32-cores

Keywords Multi-Core Microprocessor Power Management Machine Learning DynamicVoltage Frequency Scaling Prediction Control Theory Online Learning ScalabilityEnergy Efficiency

1 INTRODUCTIONMulti-core microprocessors evolved as a result of scal-ing down of transistor nodes and innovations in com-puter architecture designs Increase in the number of coresin multi-core microprocessors made it feasible to enjoythe benefits of multi-threaded processing such as higherthroughput and many more performance improvementsHowever this improved performance and benefits comesat the cost of increased power consumption Neverthelessthe microprocessor comprises of multiple units cores areone of the major power consuming units encountering thedifficulty to meet the power budget demands especiallyunder unpredictable run-time scenarios of varying work-loads for different concurrently executing applications1ndash14

This calls for an efficient and adaptive power managementin multi-core microprocessorsDynamic voltage and frequency scaling (DVFS)11315ndash21

and dynamic power management (DPM)2223 have provento be effective power management techniques for

lowastAuthor to whom correspondence should be addressedEmail saimanojp2013ieeeorg

microprocessors Dynamic power management refers toselective shutdown of system components while DVFSrefers to scaling down of voltage and frequency levels forunder-utilized cores thereby reducing the dynamic powerconsumption DPM though efficient in terms of power sav-ings shutting down and turning-on a core causes addi-tional energy and latency penalty324ndash27 However this isworthwhile when the idle interval is longer than a cer-tain threshold2628ndash30 In this work we consider using theDVFS methodology for power management in multi-coremicroprocessors because of its lower overhead and widerapplicability especially for applications which are compu-tationally expensive Furthermore our work is orthogonaland can be employed in systems that perform both DVFSand DPM like current multi-core microprocessors Thecharacteristics of the workload running on core(s) have adirect impact on the power consumption of a micropro-cessor As such considering the facets of the workloadto perform DVFS is effective131630ndash35 Power management(DVFS) can be performed considering different parame-ters such as worst-case execution time of the task36 tem-perature37 workload behavior3435 Predicting or knowing

460 J Low Power Electron 2018 Vol 14 No 4 1546-1998201814460015 doi101166jolpe20181576



Manoj et al SmartDPM Machine Learning-Based Dynamic Power Management for Multi-Core Microprocessors

the workload characteristics under unpredictable hardwarefactors (such as cache misses on-chip network contentionand so on) aid in effective DVFS at run-time The desiredproperties of the predictor for predicting the workloadbehavior and to perform DVFS could be outlined as fol-lows accurate yet lightweight in nature28 effectively cap-ture sporadic variations adapt to the variations continuousself-optimization as self-optimization is important to learnover time and low complexity with reduced number ofcomputations The rationale to have a lightweight predictoris to avoid additional computational overheads caused andalleviate the processing delays in the prediction Predic-tion with the aid of machine learning has been proven tobe effective along with capturing the workload behaviorsand underlying variations3839 Along the same lines wealso adapt the machine learning to predict the workload(s)characteristics in this work The considered workload char-acteristic for DVFS in this work is the power trace of theworkload(s)

11 Motivational Case StudyThe challenge of adapting to workload variations canbe effectively addressed by learning workload behaviorAs large number of relevant machine learning algorithmsand heuristic approaches are available for prediction it iscritical to choose a suitable technique that facilitates min-imizing the prediction errors by learning the underlyingfactors to predict the sporadic variations The selection cri-teria for the predictor is to be fairly accurate robust andlightweight in terms of computations and complexity Inorder to select an appropriate learning technique we per-form a case study and evaluate the performance of differ-ent prediction algorithms This case study is performed inMatlab Workload power traces are obtained from Sniper-Sim40 with applications running on a single-core micro-processor of Nehalem architecture (In-depth details ofexperimental setup is presented in Section 61) AverageRMSE for online learning based linear prediction neuralnetworks and support vector machine based prediction isindicated in Figure 1 The Y -axis represents the RMSEerror and the executed benchmark is marked on X-axis ofFigure 1

Comparison of root-mean square error (RMSE) forSPLASH-2 benchmarksrsquo workload (power trace) predic-tion with different prediction methods is illustrated inFigure 1 The employed prediction methods for compari-son are modified covariance based prediction (Mod Cov)linear prediction with online learning (Linear Pred) tra-ditional neural networks (Neural NW) and support vec-tor machine (SVM) For the linear predictor an order of4 is used The order is choosen by experimenting thepredictor with different orders and also the complexityinvolved Similarly for neural network based prediction asingle-hidden layer with 10 nodes is employed Levenberg-Marquardt backpropagation based training is used for neu-ral network

0

02

04

06

08

1

12

14

fft barnes cholesky fmm radix raytrace

Mod Cov Linear Pred Neural NW SVM

RMSE of NN depends on initialization slower than linear predictor

SVM has least error but resource hungry and much slower than linear predictor

LP (053)

NN (057)

SVM (012)

RM

SE

Fig 1 RMSE of workload power trace prediction using covariancelinear prediction neural network and support vector machine techniques(average RMSE is marked) X-axis represent benchmark

From Figure 1 one can observe that the neural networkhas lower RMSE for some of the benchmarks and can befurther reduced by modifying the neural network archi-tecture However it is experienced during the simulationsthat the accuracy or RMSE of the neural network highlyvaries depending on the initial state ie initialization ofthe weights Online learning based linear predictor out-performs modified covariance in terms of average RMSESupport vector machine (SVM) based regression outper-forms both neural network and online learning based lin-ear predictor implementations and has a minimal averageRMSE However the runtime for SVM based predictionis more than 10times higher than the neural networks and lin-ear prediction The complexity of a technique or a methodcan as well be observed experimentally from its runtimeas the involved computations determine the runtime of atechnique The runtime for Mod Cov Linear Pred Neu-ral NW and SVM is 0026 s 0012 s 0021 s and 093 srespectivelyAs the two main desired characteristics for the pre-

dictor in this work are (a) ability to learn work-loads and capture their behaviormdashthis can be observedfrom the RMSE (b) lightweight (low complexity) andfast learning predictormdashruntime is one of the factorsthat represents this The linear predictor with onlinelearning fairly performs well and has smaller averageRMSE Additionally the linear predictor with onlinelearning requires less computations and is faster com-pared to SVM and neural networks Depending on thisanalysis and to satisfy the constraint of accurate yetlightweight prediction we have chosen linear predictorwith online learning for the purpose of predicting work-loads in SmartDPM It needs to be noted that the RMSEmight be different if the parameters of the predictorare modified such as prediction order number of lay-ers in neural network However more number of lay-ers in a neural network will increase the complexityand cost for employing it for individual core will besignificant

J Low Power Electron 14 460ndash474 2018 461



SmartDPM Machine Learning-Based Dynamic Power Management for Multi-Core Microprocessors Manoj et al

12 Associated Research ChallengesOther than the basic challenges such as voltagendashfrequencylevel changing latency and accuracy some of the mainchallenges to be addressed for efficient power managementusing workload prediction arebull Learning and adapting to workload variations Work-load(s) of an application often varies with time underlyingfactors and computations can result in sporadic variationsin workload As such it is of vital importance to capturethese variations and learn the behavior of workload adaptto it before performing the prediction of workloadsbull Robustness of the prediction It is critical to performprediction that is robust to the variations such as sporadicbehaviors in the workload and initial value in the pre-diction Additionally a small computation time is oftendesired

13 ContributionsWe propose an online learning based adaptive dynamicpower management (SmartDPM) technique for multi-core microprocessors SmartDPM performs the predictionof workload in an observer-controller loop manner dur-ing run-time with online learning The utilized observer-controller loop predictor continuously keeps the weightsupdated based on the prediction errors thus ensuring bothaccuracy and robustness to the errors Once the predic-tion of the workload is carried out corresponding volt-age and frequency levels are assigned towards effectivepower management The assignment is based on the uti-lized power model The main contributions of this workcan be outlined as followsbull A run-time adaptive dynamic power management formulti-core microprocessor with online learning basedpredictionbull A run-time workload prediction with an accurate yetlightweight online learning observer-controller loopbull Based on the learned workload characteristics andthe predicted workload characteristics voltagendashfrequencypairs are assigned to the application

The proposed SmartDPM achieves a power savingof nearly 35 15 and 14 compared to non-DVFS approaches similar to space-time multiplexing8

and SVM38 on average for a microprocessor with up to32-cores running SPLASH-2 and Parsec benchmarks

14 Paper OrganizationThe rest of this paper is organized as follows A review ofrelevant existing works on power management is presentedin Section 2 The system model and the application modelare presented in Section 3 Section 4 presents the systemarchitecture for SmartDPM with the problem formulationThe prediction technique and clustering used for Smart-DPM are described in Section 5 Simulation results andcomparisons are presented in Section 6 with conclusionsdrawn in Section 7

2 LITERATURE REVIEWIn the past few years some works started exploring learn-ing based techniques for prediction under a centralizedpredictor Regression11134142 is one of the most popu-lar methods to predict the workloads which fits a poly-nomial to the measuredobserved values and extrapolatefuture values Regression of different metrics such as LLCmiss or hit rates power trace of the workloads memoryaccesses are utilized to perform DVFS A comprehensivesurvey on machine learning based power management ispresented in Ref [1] We review some of the relevantworks belowIn Ref [34] the workload for the next frame is

estimated using linear-in-parameters model in which theoutput for the next frame is estimated as a linear combi-nation of previous workloads (frames) and a number ofsystem parameter values Similarly a linear regression isadopted in Ref [43] to estimate the workloads and per-form DVFS accordingly A gradient descent method basedupdating frequency by learning the workload using lin-ear regression considering the observations from perfor-mance counters power sensors and measured latencies isproposed in Ref [44] A space-time multiplexing (STM)based power management with auto-regressive movingaverage (ARIMA) for predicting the workloads and asingular value decomposition for clustering and voltagendashfrequency level assignment is proposed in Refs [8 13]Different techniques to reliably predict workloads such asLast-value predictor history table and ARIMA are pro-posed in Ref [38] along with an qualitative analysis Somemethods also make use of other workload characteristicssuch as deadlines memory accesses and so on to performDVFSIn Ref [36] a linear regression is used to predict mem-

ory accesses per cycle and CPU cycles per instruction(CPI) Based on the ratio of predicted CPI with on-chipaccess and overall CPI frequency scaling is performed forpower management Similarly in Ref [45] a deadline foreach task or application is considered to perform powermanagement Based on the slack time generated in the cur-rent time slot and the worst case execution of the task theoperating frequency for the application or task is scaledIn addition to workload timing behavior memory accesspatterns are as well utilized to perform power manage-ment A voltage scaling approach based on the obtaineddynamic events such as cache hit or miss rate and memory-access counts at runtime from performance modeling unit(PMU) are used to determine the optimal frequency forperformance constraint is proposed in Ref [46] Updatingfrequency levels at variable intervals for power manage-ment is proposed in Ref [19] This method makes use ofidle time between two consecutive cycles and increases ordecreases frequency levels based on that Two registers andcounters are employed to keep track of idle times and storethe values for comparing the intervals Apart from use

462 J Low Power Electron 14 460ndash474 2018




of regression techniques using linear regression wavelet-based power management scheme making use of wavelettransform to predict the signal and further based on thepredicted values a threshold dynamic power managementis carried out47 The characteristics of workload used forpower management greatly varies However it is clearlyevident that regression in one form or other are consideredto estimate the workloads to perform power managementIt needs to be noted that the complexity and achieved accu-racy varies with the applied regression technique and withapplication

On the other hand advancements in machine learningenabled adapting machine learning techniques for predic-tion purpose48 A Bayesian workload predictor with clas-sification and policy generation is proposed in Ref [49]This uses iterative loops with dynamic programming withcost function objective A model-free reinforcement learn-ing for dynamic power management with Bayesian pre-dictor for workload estimation is proposed in Ref [6]Based on the predicted workload and using reinforcementlearning power management is carried out In a simi-lar manner a Q-learning based approach considering theclock frequencies lowest frequency to meet deadline CPUutilization rate for the current task as states with tun-ing as frequency and voltage as the action is proposedin Ref [50] The Q-learning power management is alsopresented in Refs [16 31 51ndash53] A two level hierarchi-cal power management scheme is proposed in Ref [54]This method is performed at two levels component-levellocal power manager and system-level global power man-ager The component-level power management followsa pre-specified power management policy and is fixedwhereas the system-level power manager employs tempo-ral difference learning on semi-Markov decision processas the model-free reinforcement learning technique andit is specifically optimized for a heterogeneous applica-tion pool In addition to traditional learning approachesdeep learning is also employed for power management inmulti-core and many-core microprocessors A deep learn-ing approach for workload prediction and adaptive powerscaling is proposed in Ref [11] Here statistical relation-ships are exploited on the data obtained from hardwarecounters to predict the periods of low instruction through-put and the frequency voltage are decreased to save powerA hierarchical sparse coding was adopted to capture com-plicated signature patterns over time which has shown pre-diction performance improvement compared to regressionand heuristic approaches This sparse coding is succeededby support vector machine (SVM) regression Based on thetrue positives and accuracy DVFS is performed A supportvector regression based regression followed by classifica-tion is used for DVFS in Ref [38] A comparison of neu-ral networks and regression methods is presented39 whichconcludes that the neural networks outperform regressionbased prediction when the data used for training is smaller

whereas both of them perform similarly when fair amountof training data is available As such it can be observedthat power management (DVFS) is performed with the aidof various machine learning techniques ranging from semi-supervised learning (reinforcement learning) to deep learn-ing and supervised learning techniques Learning basedapproaches are more robust and efficient however theresources and computation time it requires is high Fur-thermore most of the existing machine learning basedworks are supervised with offline learning whereas theheuristics based predictors based works lack accuracy andself-learning

21 Novelty Over State-of-the-Art TechniquesIn addition to the trade-off analysis presented inSection 11 we present the similarities and deviationsfrom the existing power management techniques in theliterature Similar to most of the existing techniquesour proposed power management technique SmartDPMalso follows two-step procedure prediction and voltagendashfrequency (VF) assignment The main differences couldbe outlined as follows unlike most of the power manage-ment schemes which utilizes offline learning SmartDPMperforms predication with an observer-controller loop uti-lizing online learning to capture underlying variations inthe workload(s) The utilized observer-controller loop pre-diction technique is accurate and lightweight faster com-pared to other techniques that makes power managementfeasible during runtime with online learning This is evi-dently clarified with the aid of RMSE metric and runtimein Sections 11 and 634

3 SYSTEM MODEL31 Hardware Architecture ModelWe consider a homogeneous multi-core processor com-prising of N cores C = C1C2 CN Due to varyingworkloads different cores execute at different frequenciesin order to ensure proper execution There exists a max-imum operating frequency level fmax for every possibleoperating voltage V The frequencies of a core can be var-ied between fmin to fmax and the corresponding voltagesbetween vmin and vmax The cores operating at higher VFlevels consume more power when executing the applica-tion Furthermore similar to Ref [55] we assume thatperformance of the processor core is higher when runningat a higher VF level

32 Application ModelWe consider a mixture of single-threaded and multi-threaded applications in this work and each core executesone thread In Figure 2 different shades on cores representdifferent applications running on them The distribution ofapplications are not uniform ie different applications canrun on different number of cores depending on the number





Application-level power trace(s)

X1(1) X1(n)

Xm(1) Xm(n)

-

Application monitor

App1(VF) Appm(VF)

^ ^

App1 App2 App3 App4 App5 App6

Multi-core microprocessor

Prediction

Error

Update

Xm(n+1) X1(n+1)

Smar

tDPM

Observercontroller loop

predictor

VF allocator

Fig 2 Overview of proposed SmartDPM in a multi-core microprocessor

of threads Each of the application comprises of multipletasks as such a task requires w clock-cycles for execu-tion From the application perspective we assume that theapplications are multi-threaded and can be executed in par-allel similar to Ref [3] Each core can execute one threadWe assume that the total number of threads executing aresmaller or equivalent to number of cores To avoid theassociated overheads especially for multi-threaded appli-cations we employ per-application power manager ratherthan system-level or per-core power manager To obtainper-application power or energy trace we sum the powertraces of the cores on which the application is currentlyexecuting

33 Power ModelThe total power consumption of a core comprises of staticand dynamic power The static power is dominantly due tothe leakage currents and varies exponentially with thresh-old voltage The dynamic power consumption is due tothe switching activities inside the core As such the totalpower consumption5657 when operating at voltage V andfrequency f is modeled as

PV f = Pstatic+PDynamic = I0eminusVthVT V +CV 2f (1)

Here I0 and are technology parameters VT is thethermal voltage Vth is the threshold voltage repre-sents the switching activity factors and C is the averagecapacitance

4 SYSTEM ARCHITECTURE41 System OverviewWe present an overview of the system along with differ-ent components and their corresponding models that areutilized in this work The main components of the Smart-DPM system are cores voltagendashfrequency regulators andconnectivity between cores

An Intel Xeon microarchitecture58 core model is cho-sen to build the system QuickPath interface is usedfor point-to-point connection for cores External compo-nents such as DRAM are connected with the help ofbuses The system is supported with DRAM memory con-troller(s) for memory access In the recent years therehas been a surge of interest to build on-chip integratedswitching voltagendashfrequency regulators59ndash61 To enable thesupport for power management with SmartDPM stan-dard on-chip voltagendashfrequency regulators that are usedin industrial processor like Xeon X5550 are embed-ded into the system The latency involved in changingvoltage and frequency levels is less than 2 s5859 Itneeds to be noted that the proposed SmartDPM is notbound to any specific core architecture ie the proposedwork can also be deployed in microprocessors with other(micro-)architectures The system architecture is describedbelow

42 System ArchitectureThe system architecture to perform SmartDPM basedpower management is presented in Figure 2 The compo-nents used in the system inherits the model descriptionin Section 41 The microprocessor comprises of multi-ple cores with private L1 L2 instruction and data cachesand shared L3 cache(s) The microprocessor is equippedwith on-chip DRAM controller for DRAM access Inaddition to the traditional core and caches system isequipped with an Application monitor unit that moni-tors the workload statistics of each application execut-ing on the microprocessor core(s) The workload statisticsrefer to the power traces of the workload when executedon the core(s) which are then provided to the Smart-DPM unit Each SmartDPM unit performs prediction ofthe workload in two stages modeling and optimizationin an observer-controller loop fashion Further based onthe predicted workload a suitable voltagendashfrequency (VF)





pair is decided and assigned to the core(s) running thecorresponding application towards an efficient power man-agement These voltagendashfrequency values are set and pro-vided to cores with the aid of on-chip power convertersAs continuous prediction adds computational overheadsthe prediction is performed at regular intervals oftime

A multi-core microprocessor running different applica-tions equipped with SmartDPM power management isdepicted in Figure 2 Shading over cores in Figure 2 rep-resent different applications An application refers to aprogram that runs on the core(s) of the microprocessorFor instance multi-threaded applications such as Parseclsquoblackscholesrsquo requires multiple cores to run As suchto avoid the associated overheads especially for multi-threaded applications we employ per-application powermanager that autonomously learns and updates the predic-tion model depending on the variations in the workloadand further assign the voltage and frequency to the core(s)for an effective power management

43 Associated Research QuestionsOther than the basic challenges such as voltagendashfrequencylevel changing latency some of the main challenges to beaddressed for an efficient power management using work-load prediction arebull How to learn and adapt to workload variationsWorkload(s) of an application(s) often varies with timeUnderlying factors and computations can result in sporadicvariations in workload behavior As such it is of vitalimportance to effectively learn the behavior of workloadby capturing the underlying variations adapt the model incase of deviationsbull How to learn fast In addition to learning and adaptingto workload variations the amount of time it needs to learnand adapt also impacts the performancebull Robustness of the prediction It is critical to performprediction that is robust to the variations such as sporadicbehaviors in the workload and initial value

44 Problem StatementThe power consumption of a core depends on the work-load that it is running For instance a core running a com-putationally intensive application consumes more powercompared to the core which executes computationally lessintensive and lightweight workload Based on the posedresearch questions and the presented system architecturewe define the problem as follows

In a multi-core microprocessor the workload(s) needs tobe learned along with its underlying variations and furtherpredict the behavior so that the core(s) can be providedwith appropriate voltage and frequency levels towardsan effective power management under the constraints of

achieving a targeted performance This could be mathe-matically defined as

minPTotal

ST i xinminus xin1 lt rarr 0

ii vjn fj nlarr xin

iii vL le vopt le vH

iv Thr ge Thrmin

(2)

The main objective of this work is to minimize the over-all power consumption PTotal without violating the perfor-mance requirements Here the performance is measured asthroughput The throughput setting is provided externallyto the system in this work This objective has to be fulfilledsubject to an accurate prediction of workload(s) as givenin (i) of (2) it is desired to have a small prediction error between the predicted workload xin and the actual work-load xin for application i for time instant n Secondlybased on the predicted workload xin at a phase n thean optimal voltage vjn and frequency fjn levels haveto be assigned to the core(s) running workload i withoutoverpowering the core(s) as given in (ii) of (2) and thevoltage level assigned to the application needs to be suf-ficient ie the core(s) must be supplied with a minimalvoltage to keep the application running as given in (iii) of(2) along with achieving a performance (throughput Thr)better than the lower bound (Thrmin) In addition to theabove constraints as mentioned in the previous sectionsthe power manager should not incur large overheadsTo predict and learn the workload behavior the predic-

tion needs to be based on the previous inputs (observer)To capture and adapt to the sporadic variations in the work-load behavior a feedback has to be provided to the predic-tor to update the model As such the problem of predictioncan be mathematically stated as

xn= zxnminus1 xnminus2 xnminusp (3)

where xnminusp is the workload at nminusp-th time instantand z denotes the observer-controller loop based pre-diction function

5 SMARTDPM ONLINE LEARNING-BASEDDYNAMIC POWER MANAGEMENT

Our proposed multi-core microprocessor power manage-ment scheme SmartDPM efficiently manages the systempower under the constraints of lightweight yet low errorin the workload prediction (with the aid of observer-controller loop) and a minimized overhead This is per-formed in two steps In the first step an observer-controllerloop based prediction is performed to learn and predict theworkload(s) at an application-level granularity under theconstraints of low prediction error and lightweight Fur-ther depending on the application and the predicted work-load characteristics the voltagendashfrequency (VF) pairs are





assigned to the corresponding core(s) that run the work-load towards minimizing the overall power consumptionunder the constraints of performance Predicting at regu-lar time-intervals is adopted in order to reduce the com-putational overheads A detailed description of the wholeprocess is presented below

51 Learning and Prediction of WorkloadAt a more higher abstract level observer-controller loopbased predictor as a part of SmartDPM is shown inFigure 2 enclosed in a zoomed-out rectangle A moredetailed architecture of the observer-controller loop basedpredictor employed in SmartDPM is depicted in Figure 3As continuous prediction of workload characteristics mightcause additional overheads we predict the workload char-acteristic ie power trace at regular time intervals Pleasenote that the workload characteristic here refers to thepower trace when the application is run on the core(s)The workload prediction in SmartDPM uses a observer-controller loop prediction equipped with an online learningto learn and predict the workload effectively by captur-ing the underlying variations in real-time Additionally theweights of the predictor are updated continuously depend-ing on the prediction error The prediction mechanism min-imizes the prediction errors in a least-square sense Theonline learner continuously learns the workloads runningon the core(s) and adapts the weights in order to capturethe variations if the prediction error increases This couldbe explained in two phases modeling phase (observer) andoptimization phase (controller to minimize the predictionerror)Phase 113 Modeling (Observer) Towards predicting the

workloads (at phase-level) the first step is to build a modelthat effectively represents the workload A linear model isinitially built to estimate the workload at the phase-leveland perform prediction The model is built consideringthe workload values at previous phases under the con-straint of minimizing the prediction errors The predictionerrors aid in adapting the model by learning the variationsmore effectively and capture the variations effectively Thismodeling phase can be seen as learning phase where themodel is built so as to estimate the workload effectivelyAs shown in Figure 3 the observer monitors the workload

Microprocessor

Learning based prediction

Linear model

Optimization

Observer

VF regulator

RM

SE

Workload

Predicted workload

VF pairai

xi(n)

xi(n+1)^

Fig 3 An observer-controller loop based workload prediction used forSmartDPM

of the applications and provides the observed values to ini-tially build a model The built linear model for predictioncan be given as

xin = a1xinminus1+a2xinminus2+middotmiddotmiddot+apxinminusp+

=psum

j=1

ajxinminusj+ (4)

here aj j = 12 p are the coefficients p represent-ing the order of prediction ie number of previous sam-ples considered for the prediction xin represents theworkload at phase n for application i represents theprediction error and the predicted value is indicated asxin The utilized modeling and learning phase is linearin nature and hence fasterAs the workloads may deviate from the built model

under sporadic events it is necessary to update the builtlinear model with time without overhead To capture suchsporadic variations the model building phase is succeededby an optimization phase with an objective to minimizethe prediction errors along with updating the built modelwith timePhase 213 Optimization (Controller) The model built in

the modeling phase is efficient in predicting the work-loads based on the trained data However workloads couldexperience sporadic variations during the runtime due tounderlying factors compared to the data learned during themodeling phase In order to capture those variations in theworkload and predict more precisely using (4) the modelneeds to be updated if the workloads change ie the coef-ficients (ai) need to be tuned if the prediction is not closeto the actual values This is performed by updating theweights in order to minimize the prediction error ie tosatisfy auto-correlation criterion As such the choice ofcoefficients is made by

psumi

aiRjminus i=minusRj (5)

for 1le j le p and Rj is the autocorrelation for the signalgiven by

Rj = Exinxinminus j (6)

where Exin denotes the expected value of workloadxinThis optimization phase can be seen as updating of the

prediction model based on the observed prediction errorsand deviation of the predicted and actual values This opti-mization phase helps to capture underlying factors thatcontribute to sporadic variations

52 VoltagendashFrequency AssignmentOnce the workload is efficiently predicted for an appli-cation the corresponding core(s) must be provided withthe voltagendashfrequency (VF) levels that meet the work-load requirement This ensures that the core(s) running





the workload is neither overpowered or underpowered torun the application workload The VF assignment to theapplication based on the predicted workload (power) ispresented in this section We chose four VF levels in thiswork as the chosen levels support the considered IntelNehalem microarchitecture58 However the same tech-nique could be extended to more levels depending on therequirements

SmartDPM has a set of pre-defined voltagendashfrequencypairs (VF levels) for which the corresponding providablepower is calculated based on model in (1) and based onthe predicted workload value the corresponding applicationis assigned with one of the suitable VF levels A Maha-lanobis distance metric based clustering is employed todetermine the cluster (pre-defined VF level based on thepower model) to which the application will be assignedThe clustering is done by considering the suitability of thepredicted power trace for an application and the power thatcan be supplied when a VF level is assigned The Maha-lanobis distance between two vectors can be given as

dx y=radicxminus ySminus1xminus y (7)

where x and y are the vectors between which the Maha-lanobis distance is calculated and S is the covariancematrix In our case the vectors are the predicted workloadsand the formed centroid for four voltagendashfrequency levelsie the power that can be provided when run at particularVF level The closer the Mahalanobis distance similar thevectors The application (workload) will be assigned withthe VF level to which it has smaller Mahalanobis distanceThe same clustering technique could be utilized when con-sidering additional multiple metrics It needs to be notedthat when the performance is not met with the assignedVF settings the VF of the next higher level are assignedto the application

The multi-core power management using our proposedSmartDPM is presented in Algorithm 1 Initially theworkloads are monitored by the observer for few phasesinstances Then a linear model is built based on the previ-ous p values as given in Line 1 of Algorithm 1 Furtheran online learning is employed to update the coefficientsand minimize the prediction in case of variations as inLine 4 of Algorithm 1 Further based on the predictedworkload values the Mahalanobis distance metric is usedto map the application (workload) to one of the four VFlevels for which the providable powers are defined basedon the model represented by ck In Line 5 q representsthe number of VF levels The VF level with smaller dis-tance is assigned to the predicted workload of the appli-cation as described in Line 5ndash11 of Algorithm 1 For thebrevity the performance requirement is not presented inthe Algorithm however when the performance constraintis not met the VF settings of the next higher-level is setuntil the performance is met

Algorithm 1 (Proposed SmartDPM Based PowerManagement)Input Workloads (W = W1W2 ) supported voltagendashFrequency (VF) pairs (C = c1 c2 cq)Output Minimized power consumption with VF Scaling1 Workload Wi = xi1 xi2 xin2 for j = p to n do3 xij=

sumpk=1 ak lowastxijminusk

4sump

i aiRjminus i=minusRj (Find ai)5 for k = 1 to q do

6 dxij ck=radicxijminus ckS

minus1xijminus ck

7 if dxij ck=min then8 Wijlarr ck9 else10 k = k+111 end if12 end for13 end for

6 RESULTS AND DISCUSSIONFirst we present the simulation settings and our tool flowfor evaluating the efficacy of our SmartDPM techniquefollowed by achieved power savings

61 System SettingsThe proposed SmartDPM power management schemeis implemented in the SniperSim multi-core simulator40

which is a parallel interval-accurate and high-speed times86simulator The maximum voltage and frequency levels are12 V and 266 GHz respectively In simulations weuse four voltagendashfrequency levels for power managementthat are supported by standard Nehalem microachitecturebased cores (12 V 266 GHz) (11 V 18 GHz) (10 V15 GHz) and (09 V 10 GHz) However this could bemodified depending on the simulation environment and theutilized cores A standard switching time for the voltagendashfrequency regulator (less than 2 s) is considered in thesimulations as reported in Ref [58] Additional details onthe configuration of microprocessor core and other com-ponents are presented in Table I In order to validate theperformance of SmartDPM simulations are run with Par-sec62 and SPLASH-263 benchmarks The number of coresare varied from 1 to 32 A machine learning predictorwith observer controller loop based predictor of order 4 as

Table I Overview of core configuration

Item Description Value

Microprocessor core Frequency (Max) 266 GHzVoltage (Max) 12 VTechnology node 22 nm

L1-I cache 32 KBL1-D cache 32 KBL2 cache 256 KB

L3-cache 8 MB





described in Section 51 is employed to learn and predictworkloads

62 Tool Flow for SmartDPMThe tool flow to perform SmartDPM in Sniper multi-coresimulator is presented in Figure 4 The Sniper multi-coresimulator is initialized with the configurations such asnumber of cores VF levels applications to be run andso on For an unbiased and better evaluation of Smart-DPM applications are assigned to the core(s) in a ran-dom manner Next the workload statistics at applicationlevel granularity are monitored from the simulator and areprovided to the SmartDPM power manager The predictorbuilds the model during the training phase based on thetraining data and to minimize the root mean square error(RMSE) of the predicted values During the test phase orpost-training phase the model will be updated with the aidof online learning in an observer-controller loop mannerso as to adapt to the workload variations as described inSection 51 Based on the predicted workload(s) the cor-responding voltage and frequency levels are assigned tothe core(s) that are running the corresponding applicationsLastly McPAT64 is used to obtain the power consumptionstatistics of the microprocessor To perform Non-DVFSthe workloads can be run on similar settings without invok-ing the DVFS

63 SmartDPM PerformanceIn this section we present the power savings in a multi-core system with SmartDPM and compare the achievedpower savings with non-DVFS technique and other simi-lar proposed approaches such as use of traditional linearregression4344 SVM regression based power manage-ment38 and space-time multiplexing (STM) based powermanagement8 For a fair comparison these techniques arere-implemented as described in Refs [8 38 43 44] andadapted for application based power management Beforepresenting the performance of SmartDPM we illustrate theperformance of SmartDPM when performed at differentgranularities and the associated overheads

631 Power Management at Different GranularitiesThe proposed SmartDPM focuses on power managementat application level However it is also possible to employ

Snipersim

Learn and predict power traces

Map predicted power to VF level

McPATAssign VF level(s) to

core(s) running application

application Power Traces for

Fig 4 Experimental tool flow to perform SmartDPM in SniperSim

SmartDPM at lower granularity (core-level) and highergranularity (system-level or per-chip level) As a casestudy we present the impact of power management at dif-ferent granularity levels for a 4-core microprocessor Foranalysis multi-threaded applications are chosen based onthe manner the workloads are distributed among coresThe two workloads categories chosen are (a) tightly cou-pled workload (b) loosely coupled workload Here tightlycoupled workload indicates that the workloads of an appli-cation are evenly distributed among multiple cores andloosely coupled workload indicates that the workload of anapplication is unevenly distributed among multiple coresThe power consumption at three different granular-

ity levels for a microprocessor running multi-threadedapplication(s) is shown in Figure 5 Following are theobservationsbull For loosely coupled multi-threaded applications appli-cation level power management has better power savingscompared to system level if the applications are uncorre-lated ie applications are dissimilarbull If the workloads are loosely coupled and correlatedie similar workloads system-level and application-levelpower management achieve similar power savingsbull In case of single multi-threaded application distributedamong all the cores irrespective of granularity the powermanagement achieves similar performance if the applica-tion is tightly coupledbull For a loosely coupled application system-leveland application level power management has similarperformance

Per-core power management has better power savingshowever this adds additional overheads such as monitor-ing power regulators for each of the core System levelpower management has smaller overhead and reducedpower saving compared to per-core power managementPer-application level power management has performancein-between per-core and system-level power managementAs running multiple applications is much realistic onmulti-core microprocessor per-application based power

0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)

Core-level PM Application-level System-level

Single ApplicationMultiple Applications (Loosely coupled)

Uncorrelated

Loosely coupled

Tightly coupled

Correlated

Fig 5 Power savings with SmartDPM at different granularity levels





0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted

Fig 6 An observer-controller loop based workload prediction forSPLASH-2 Raytrace benchmark

management is a better choice for power management tohave power savings with smaller overheads

632 Workload PredictionWorkload prediction at phase level is one of the key stepsin the proposed SmartDPM Prediction of the SPLASH-2Raytrace benchmarkrsquos power trace with time is depictedin Figure 6 As it can be clearly observed that the pre-dicted power trace follows the original trace even undersome of the sporadic variations such as at 11 ms TheRMSE is 021 The number of previous inputs con-sidered for prediction in observer-controller loop is setto 4 It can be observed that the sudden variations suchas spikes and peaks are well captured with the use ofobserver-controller loop based prediction On an aver-age the RMSE for all the benchmarks is 049 TheRMSE varies with benchmark For the employed bench-marks a max RMSE error of 088 is observed which issmaller compared to maximum RMSE error obtained withother prediction techniques such as covariance and neuralnetworks

633 Power Saving with SmartDPMAs each of the applications has different power con-sumption we present the normalized power saving withSmartDPM Figure 7(a) shows the normalized powerconsumption of different SPLASH-2 benchmarks that run

No DVFS ndash1 Trad Lin ndash065 Proposed ndash064 SVM ndash089 STM ndash081

0

02

04

06

08

1

12

exp 1 exp2 exp3 exp4

Nor

mal

ized

pow

er

No DVFS Trad lin Proposed SVM STM

Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e

No DVFS Trad Lin Proposed SVM STM

Proposed 022


(b)

Fig 7 Performance comparison of non-DVFS SmartDPM neural network and SVM based power management in a single-core microprocessor(a) Power savings and (b) runtime

on a single-core microprocessor whose configuration ispresented in Table I In Figures 7(a) 8(a) and 9(a) X-axisrepresents the application that is run on microprocessorand Y -axis represent the normalized power consumptionFor a single-core microprocessor an average power

saving of 36 is observed compared to non-DVFSimplementation as shown in Figure 7(a) Similarly thepower consumption by employing SmartDPM on a dual-core microprocessor compared to non-DVFS technique isdepicted in Figure 8(a) In case of dual-core microproces-sors where SPLASH-2 and Parsec benchmarks are ran-domly executed in different experiments nearly 36 ofpower saving on average compared to non-DVFS imple-mentation is observedOn an average 34 power saving is observed for micro-

processors with up to 32-cores running SPLASH-2 andParsec benchmarks on it as illustrated in Figure 9(a)

634 ComparisonTo evaluate the effectiveness of SmartDPM we comparethe achieved power savings of SmartDPM with other pro-posed works Despite the fact that there exist many workson power management we implemented few works suchas Refs [8 38 43 44] (with minor adaptations such aspower management at application level) for a fair compari-son The rationale for choosing those works are as followsIn Ref [8] prediction of workload using auto-regressivemoving average (ARIMA) and a singular value decom-position (SVD) based voltagendashfrequency level assignmentis carried out which is close to our work Machinelearning equipped power management is proposed inRef [38] where support vector machine (SVM) basedregression for predicting workloads and SVM classifierbased voltagendashfrequency level assignment is employedThe sparse encoding is not implemented as the data is notas large as that in the original work A linear regressionwith offline learning or modeling based workload predic-tion and VF level assignment is utilized in Refs [43 44]Similar resemblances can be observed from other existingworks as well Our proposed methodology as well fol-lows a two step process for power management prediction





0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)

Fig 8 Performance comparison of non-DVFS SmartDPM neural network and SVM based power management in a dual-core microprocessor(a) Power savings and (b) runtime

of workload in an online learning observer-controller loopmanner and voltagendashfrequency level assignment by clus-tering of data The updating of predictor model basedon the observation and the voltagendashfrequency assignmentmethodology with Mahalanobis distance primarily differ-entiates our work from other works The comparison isprovided in Figures 7ndash9 In the legend lsquoProposedrsquo lsquoNoDVFSrsquo lsquoTrad Linrsquo lsquoSVMrsquo and lsquoSTMrsquo represents theachieved performance (power or runtime) with SmartDPMnon-DVFS traditional linear regression based power man-agement SVM and space-time multiplexing based powermanagement techniques respectivelyFor a single-core microprocessor SmartDPM consumes

nearly 36 less than that of non-DVFS implementationwhereas linear regression based power managementconsumes 35 less than non-DVFS implementationsimilarly space-time multiplexing and SVM based powermanagement techniques consume 11 and 19 lesspower compared to the non-DVFS technique As such itcan be derived that linear regression and proposed Smart-DPM based power management performs nearly similar incase of single-core microprocessors for the experimentedbenchmarks Nearly 24 and 16 additional power sav-ing is achieved with SmartDPM compared to SVM andSTM based power management techniques respectivelyin a single-core microprocessor

0

02

04

06

08

1

12

1 cores 2 cores 4 cores 8 cores 16 cores 32 cores

Nor

mal

ized

pow

er

No DVFS Trad Lin Proposed SVM STMProposed

066


(a)


0

02

04

06

08

1

12

1 core 2 core 4 core 8 core 16 core 32 core

Nor

mal

ized

tim

e

No DVFS Trad Lin Proposed SVM STMProposed 074

(b)

Fig 9 Performance comparison of non-DVFS SmartDPM neural network and SVM based power management in a multi-core microprocessor(a) Power savings and (b) runtime

For multi-core microprocessor simulations benchmarks(mixture of Parsec and SPLASH-2) are assigned to coreswith SmartDPM deployed for power management A com-parison of power consumption in a 2-core microproces-sor with randomly assigned benchmarks is presented inFigure 8 For a dual-core microprocessor proposed Smart-DPM has nearly 20 and 12 more power saving com-pared to SVM and STM based approaches respectivelyHowever the linear regression performs similar to that ofSmartDPMSimilarly we compare the power savings of Smart-

DPM for more number of cores Compared to simi-lar implementation as space-time multiplexing (STM) inRef [8] proposed method achieves nearly 14 higherpower saving on average compared to proposed methodsupport vector machine based implementation similar toRef [38] has nearly 14 lower power savings on aver-age for 32 cores and compared to linear regression basedapproach in Ref [43] a power saving of nearly 2 isobtained However it can be observed from Figure 9(a)that the linear regression and proposed method performssimilarly for smaller number of cores However for micro-processors with large number of cores such as 16 coresor 32 cores SmartDPM outperforms linear regression interms of power saving showing that proposed technique





is scalable for many-core systems with large number ofcores

The achieved power savings with SmartDPM dependshighly on the distribution of application(s) in micro-processor A tightly coupled core-application distributionachieves higher power savings similar to per-core DVFSie when an multi-threaded application is partitioned andevenly distributed among cores the power saving is higherthan unbalanced distribution As one more instance inan 8-core microprocessor we have observed a saving ofnearly 30 compared to non-DVFS when the cores runsimilar workloads whereas to analyze the impact of dis-tributing application in an unbalanced manner is carriedout a power saving of only 8 is observed In the demon-strated results especially in multi-core simulations appli-cations are distributed among cores randomly to analyzethe power savings with SmartDPM under no controlledapplication distribution among cores

635 Overhead AnalysisThe proposed SmartDPM power management scheme iseffective in minimizing the overall system power How-ever the power management with SmartDPM involvessome overhead that are (a) additional computationsneeded for predicting the workload characteristics(b) Updating the prediction model (c) voltage frequencyassignment based on the predicted workload character-istics These overheads can be justified as follows Pre-diction of workload helps to assign the VF level to thecore(s) running the application thereby avoiding overpow-ering of cores Updating the model leads to learning theworkload effectively and predict the workloads accuratelyduring sporadic variations as well As the predictionsare performed at phase-level granularity the computationsrequired for prediction and assigning the VF levels aresmaller compared to continuous prediction and changingthe assigned VF levels

We provide an analysis of the proposed method interms of its impact on runtime Similar to power sav-ings we compare the overhead as well with other worksAs each application has different runtime we present theruntime for different benchmarks on single-core dual-coreand multi-core in a normalized manner The runtime forsingle-core microprocessor dual-core microprocessor andmulti-core microprocessor running SPLASH-2 and Parsecbenchmarks are depicted in Figures 7(b) 8(b) and 9(b)respectively

On average the proposed method SmartDPM requiresnearly 10 more time than non-DVFS technique 3lower time than linear regression based power manage-ment 23 lower runtime compared to SVM based powermanagement and 5 higher runtime than STM basedimplementation for microprocessors up to 32 cores whichcan be seen in Figure 9(b) As the non-DVFS method-ology involves no additional computations it is obvious

that the non-DVFS is faster than SmartDPM but at thecost of additional power As seen in motivational casestudy that SVM requires more time for prediction thesame is reflected in the runtime measurements As tra-ditional linear regression and proposed SmartDPM useslinear models they have similar runtime however due tobetter convergence and adaptations SmartDPM is fasterand has better power savings even for large number ofcores In case of STM the clustering is performed rela-tively simpler compared to the proposed work though theprediction takes similar time As such the SmartDPM has5 more runtime on average than STM based power man-agement techniqueAs a summary the proposed SmartDPM tech-

nique performs efficient per-application power manage-ment utilizing accurate yet lightweight online learningobserver-controller loop predictor with less overheadIt outperforms some of the existing power managementtechniques

7 CONCLUSIONSmartDPM with online learning observer-controller loopbased workload prediction at phase-level to perform powermanagement in a multi-core microprocessor is proposed inthis paper SmartDPM performs power management at anapplication level and utilizes lightweight linear predictorwith continuous feedback of prediction error to learn theunderlying patterns in the workload behavior Based on thepredicted workload characteristics voltage and frequencypairs are assigned to the applications A power saving ofnearly 34 on average is achieved with SmartDPM com-pared to non-DVFS technique and nearly 15 more powersaving on average compared to other power managementtechniques for a microprocessor with up to 32-cores

References1 S Pagani P D S Manoj A Jantsch and J Henkel Machine learn-

ing for power energy and thermal management on multi-core pro-cessors A survey IEEE Transactions on Computer Aided Systemsof Integrated Circuits and Systems (2018) pp 1ndash17

2 H Khdr S Pagani Eacutericles Sousa V Lari A Pathania F HannigM Shafique J Teich and J Henkel Power density-aware resourcemanagement for heterogeneous tiled multicores IEEE Transactionson Computers 66 488 (2017)

3 S Pagani A Pathania M Shafique J Chen and J Henkel Energyefficiency for clustered heterogeneous multicores IEEE Trans onParallel and Distributed Systems 28 1315 (2017)

4 P D S Manoj J Lin S Zhu Y Yin X Liu X Huang C SongW Zhang M Yan Z Yu and H Yu A scalable network-on-chipmicroprocessor with 25D integrated memory and accelerator IEEETransactions on Circuits and Systems I13 Regular Papers 64 1432(2017)

5 M U K Khan M Shafique A Gupta T Schumann and J HenkelPower-efficient load-balancing on heterogeneous computing plat-forms Design Automation Test in Europe Conference Exhibition(DATE) (2016)

6 Y Wang and M Pedram Model-free reinforcement learning andbayesian classification in system-level power management IEEETrans on Computers 65 3713 (2016)





7 M Shafique D Gnad S Garg and J Henkel Variability-aware darksilicon management in on-chip many-core systems Design Automa-tion Test in Europe Conference Exhibition (DATE) (2015)

8 P D S Manoj H Yu and K Wang 3D many-core microproces-sor power management by space-time multiplexing based demand-supply matching IEEE Trans on Computers 64 3022 (2015)

9 J Lin S Zhu Z Yu D Xu P D S Manoj and H Yu A scalableand reconfigurable 25D integrated multicore processor on siliconinterposer IEEE Custom Integrated Circuits Conf (2015)

10 S Pagani H Khdr W Munawar J Chen M Shafique M Liand J Henkel TSP Thermal safe powermdashEfficient power budgetingfor many-core systems in dark silicon International Conference onHardwareSoftware Codesign and System Synthesis (CODES+ISSS)(2014)

11 S J Tarsa A P Kumar and H T Kung Workload prediction foradaptive power scaling using deep learning IEEE Int Conf on ICDesign Technology (2014)

12 P D S Manoj H Yu Y Shang C S Tan and S K Lim Reli-able 3-D clock-tree synthesis considering nonlinear capacitive TSVmodel with electrical-thermal-mechanical coupling IEEE Trans onComputer-Aided Design of Integrated Circuits and Systems 32 1734(2013)

13 P D S Manoj K Wang and H Yu Peak power reduc-tion and workload balancing by space-time multiplexing baseddemand-supply matching for 3D thousand-core microprocessorACMEDACIEEE Design Automation Conf (2013)

14 H Esmaeilzadeh E Blem R St Amant K Sankaralingam andD Burger Dark silicon and the end of multicore scaling Int Sym-posium on Computer Architecture (2011)

15 M Salehi M K Tavana S Rehman F Kriebel M ShafiqueA Ejlali and J Henkel DRVS Power-efficient reliability man-agement through dynamic redundancy and voltage scaling undervariations IEEEACM International Symposium on Low Power Elec-tronics and Design (ISLPED) (2015)

16 H Hantao P D S Manoj D Xu H Yu and Z Hao Reinforce-ment learning based self-adaptive voltage-swing adjustment of 25DIOs for many-core microprocessor and memory communicationIEEEACM Int Conf on Computer-Aided Design (ICCAD) (2014)

17 D-C Juan S Garg J Park and D Marculescu Learning-based powerperformance optimization for many-core systemswith extended-range voltagefrequency scaling IEEE Trans onComputer-Aided Design of Integrated Circuits and Systems 35 1318(2016)

18 J Howard S Dighe S R Vangal G Ruhl N Borkar S JainV Erraguntla M Konow M Riepen M Gries G Droege T Lund-Larsen S Steibl S Borkar V K De and R Van Der WijngaartA 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling IEEE J ofSolid-State Circuits 46 173 (2011)

19 M E Salehi M Samadi M Najibi A Afzali-Kusha M Pedramand S M Fakhraie Dynamic voltage and frequency schedulingfor embedded processors considering powerperformance tradeoffsIEEE Trans on Very Large Scale Integration Systems 19 1931(2011)

20 P Choudhary and D Marculescu Power management of volt-agefrequency island-based systems using hardware based methodsIEEE Trans on Very Large Scale Integration Systems 17 427 (2009)

21 A Miyoshi C Lefurgy E Van Hensbergen R Rajamony andR Rajkumar Critical power slope Understanding the runtime effectsof frequency scaling Int Conf on Supercomputing (2002)

22 R David P Bogdan and R Marculescu Dynamic power manage-ment for multicores Case study using the intel SCC IEEEIFIP IntConf on VLSI and System-on-Chip (2012)

23 M Ghasemazar and M Pedram Variation aware dynamic powermanagement for chip multiprocessor architectures Design Automa-tion Test in Europe (2011)

24 M Shafique A Ivanov B Vogel and J Henkel Scalable powermanagement for on-chip systems with malleable applications IEEETransactions on Computers 65 3398 (2016)

25 H Bokhari H Javaid M Shafique J Henkel and S ParameswaranMalleable NoC Dark silicon inspired adaptable network-on-chipDesign Automation Test in Europe Conference Exhibition (DATE)(2015)

26 G Chen K Huang and A Knoll Energy optimization for real-time multiprocessor system-on-chip with optimal DVFS and DPMcombination ACM Trans Embed Comput Syst 13 1111 (2014)

27 M Shafique L Bauer and J Henkel Adaptive energy manage-ment for dynamically reconfigurable processors IEEE Transactionson Computer-Aided Design of Integrated Circuits and Systems 33 50(2014)

28 H Lee M Shafique and M A Al Faruque Low-overhead aging-aware resource management on embedded GPUs Design Automa-tion Conference (2017)

29 A Pathania H Khdr M Shafique T Mitra and J Henkel Scalableprobabilistic power budgeting for many-cores Design AutomationTest in Europe Conference Exhibition (DATE) (2017)

30 M Shafique M U K Khan O Tuumlfek and J HenkelEnAAM Energy-efficient anti-aging for on-chip video memoriesACMEDACIEEE Design Automation Conference (DAC) (2015)

31 D Xu P D S Manoj H Huang N Yu and H Yu An energy-efficient 25D through-silicon interposer io with self-adaptive adjust-ment of output-voltage swing IEEEACM International Symposiumon Low Power Electronics and Design (ISLPED) (2014)

32 M Shafique M U K Khan and J Henkel Power efficient andworkload balanced tiling for parallelized high efficiency video cod-ing IEEE International Conference on Image Processing (ICIP)(2014)

33 S S Wu K Wang P D S Manoj T Y Ho M Yu and H YuA thermal resilient integration of many-core microprocessors andmain memory by 25D TSI IOs Design Automation Test in EuropeConference Exhibition (DATE) (2014)

34 B Dietrich S Nunna D Goswami S Chakraborty and M GriesLMS-based low-complexity game workload prediction for DVFSIEEE Int Conf on Computer Design (2010)

35 K Choi R Soma and M Pedram Dynamic voltage and frequencyscaling based on workload decomposition Int Symp on Low PowerElectronics and Design (2004)

36 K Choi R Soma and M Pedram Fine-grained dynamic voltageand frequency scaling for precise energy and performance tradeoffbased on the ratio of off-chip access to on-chip computation timesIEEE Trans on Computer-Aided Design of Integrated Circuits andSystems 24 18 (2005)

37 J S Lee K Skadron and S W Chung Predictive temperature-aware DVFS IEEE Trans on Computers 59 127 (2010)

38 M Zaman A Ahmadi and Y Makris Workload characterizationand prediction A pathway to reliable multi-core systems IEEE IntOn-Line Testing Symp (2015)

39 D B L Bong J Y B Tan and K C Lai Application of mul-tilayer perceptron with backpropagation algorithm and regressionanalysis for long-term forecast of electricity demand A comparisonInt Conf on Electronic Design (2008)

40 T E Carlson W Heirman S Eyerman I Hur and L EeckhoutAn evaluation of high-level mechanistic core models ACM Transon Architecture and Code Optimization 11 281 (2014)

41 P D S Manoj K Wang H Huang and H Yu Smart IOsA data-pattern aware 25D interconnect with space-time multiplex-ing ACMIEEE International Workshop on System Level Intercon-nect Prediction (SLIP) (2015)

42 R Zamani and A Afsahi Adaptive estimation and prediction ofpower and performance in high performance computing ComputerSciencemdashResearch and Development 25 177 (2010)





43 B Rountree D K Lowenthal M Schulz and B R de SupinskiPractical performance prediction under dynamic voltage frequencyscaling Int Green Computing Conference and Workshops (2011)

44 S Yang R A Shafik G V Merrett E Stott J M Levine J Davisand B M Al-Hashimi Adaptive energy minimization of embeddedheterogeneous systems using regression-based learning Int W onPower and Timing Modeling Optimization and Simulation (2015)

45 S Lee and T Sakurai Run-time power control scheme using soft-ware feedback loop for low-power real-time applications Asia-Pacific Design Automation Conf (2000)

46 A Weissel and F Bellosa Process cruise control-event-driven clockscaling for dynamic power management Int Conf for Compil-ers Architectures and Synthesis for Embedded Systems (CASES)(2002)

47 A Abbasian S Hatami A Afzali-Kusha and M PedramWavelet-based dynamic power management for nonstationary ser-vice requests ACM Trans Des Autom Electron Syst 13 131(2008)

48 G Tesauro R Das H Chan J O Kephart C Lefurgy D WLevine and F Rawson Managing power consumption and perfor-mance of computing systems using reinforcement learning Int Confon Neural Information Processing Systems (2007)

49 H Jung and M Pedram Supervised learning based power man-agement for multicore processors IEEE Trans on Computer-AidedDesign of Integrated Circuits and Systems 29 1395 (2010)

50 F Islam and M Lin A framework for learning based DVFS tech-nique selection and frequency scaling for multi-core real-time sys-tems IEEE Int Conf on Embedded Software and Systems (2015)

51 D Xu N Yu H Huang P D S Manoj and H Yu Q-learningbased voltage-swing tuning and compensation for 25D memory-logic integration IEEE Design and Test 35 91 (2018)

52 D Xu N Yu P D S Manoj K Wang H Yu and M YuA 25-D memory-logic integration with data-pattern-aware memorycontroller IEEE Design Test 32 1 (2015)

53 P D S Manoj H Yu H Huang and D Xu A Q-learning basedself-adaptive IO communication for 25D integrated many-coremicroprocessor and memory IEEE Trans on Computers 65 1185(2016)

P D Sai ManojP D Sai Manoj is a research assistant professor at George Mason University (GMU) Fairfax VA United States Prior to joiningGMU Dr Manoj worked as a post-doctoral research fellow at TU Wien Austria He received his PhD in Electrical and ElectronicEngineering from Nanyang Technological University Singapore in 2015 Dr Manoj has organized multiple special sessions at premiervenues such as DAC ICCAD ESWEEK and served on TPC of various conferences His research interests are in adversarial learningneuromorphic computing hardware accelerators cyber-security for embedded processors security in Internet of Things networksmachine learning for on-chip data processing and self-aware SoC design He is a recipient of A Richard Newton Young ResearchFellow Awardrsquo in DAC 2013 He is a Member of the IEEE

Axel JantschAxel Jantsch is professor of Systems on Chip at TU Wien Vienna Austria He received his PhD degree from TU Wien Austria in1992 Between 2002 and 2014 he was professor in Electronic Systems design at KTH Stockholm Sweden He has published over 300papers in international conferences and journals and 6 books as author or co-editor and he has served on numerous program andorganization committees He has organized many special sessions tutorials and workshops at international conferences His currentresearch interests include Dependability robustness and self-awareness in embedded systems He is a Member of the IEEE

54 Y Wang M Triki X Lin A C Ammari and M Pedram Hierar-chical dynamic power management using model-free reinforcementlearning Int Symp on Quality Electronic Design (ISQED) (2013)

55 M Salehi M Shafique F Kriebel S Rehman M K TavanaA Ejlali and J Henkel dsReliM Power-constrained reliability man-agement in dark-silicon many-core chips under process variationsInt Conf on HardwareSoftware Codesign and System Synthesis(CODES+ISSS) (2015)

56 D Brooks R P Dick R Joseph and L Shang Power thermaland reliability modeling in nanometer-scale microprocessors IEEEMicro 27 49 (2007)

57 A Ejlali B M Al-Hashimi and P Eles Low-energy standby-sparing for hard real-time systems IEEE Trans on Computer-AidedDesign of Integrated Circuits and Systems 31 329 (2012)

58 R Singhal Inside intelreg core microarchitecture (nehalem) IEEE HotChips Symp (2008)

59 W Kim M S Gupta G Wei and D Brooks System level analysisof fast per-core DVFS using on-chip switching regulators Int Sympon High Performance Computer Architecture (2008)

60 S Abedinpour B Bakkaloglu and S Kiaei A multi-stage inter-leaved synchronous buck converter with integrated output filter ina 018spl muSiGe process in IEEE Int Solid State Circuits Conf(2006)

61 P Hazucha G Schrom J Hahn B A Bloechel P Hack G EDermer S Narendra D Gardner T Karnik V De and S BorkarA 233-MHz 80ndash87 efficient four-phase DCndashDC converter uti-lizing air-core inductors on package IEEE J of Solid-State Circuits40 838 (2005)

62 C Bienia S Kumar J P Singh and K Li The PARSEC benchmarksuite Characterization and architectural implications Int Conf onParallel Architectures and Compilation Techniques (2008)

63 S C Woo M Ohara E Torrie J P Singh and A Gupta TheSPLASH-2 programs Characterization and methodological consid-erations SIGARCH Comput Archit News 23 24 (1995)

64 S Li J H Ahn R D Strong J B Brockman D M Tullsen andN P Jouppi McPAT An integrated power area and timing model-ing framework for multicore and manycore architectures IEEEACMInt Symp on Microarchitecture (2009)





Muhammad ShafiqueMuhammad Shafique is a full professor of Computer Architecture and Robust Energy-Efficient Technologies (CARE-Tech) at theEmbedded Computing Systems Group Institute of Computer Engineering Faculty of Informatics Vienna University of Technology (TUWien) since November 2016 He received his PhD in Computer Science from Karlsruhe Institute of Technology (KIT) Germany inJanuary 2011 Dr Shafique has given several Invited Talks Tutorials and Keynotes He has also organized many special sessions atpremier venues (like DAC ICCAD DATE and ESWeek) and served as the Guest Editor for IEEE Design and Test Magazine (DampT) andIEEE Transactions on Sustainable Computing (T-SUSC) He has served as the TPC co-Chair of ESTIMedia and LPDC General Chairof ESTIMedia and Track Chair at DATE and FDL He has served on the program committees of numerous prestigious IEEEACMconferences including ICCAD ISCA DATE CASES ASPDAC and FPL His research interests are in computer architecture power-and energy-efficient systems robust computing dependability and fault-tolerance hardware security emerging computing trends likeNeuromorphic and Approximate Computing neurosciences emerging technologies and nanosystems self-learning and intelligentsystems FPGAs MPSoCs and embedded systems Dr Shafique received the 2015 ACMSIGDA Outstanding New Faculty Award aswell as several best paper awards and nomination in premier conferences He is a senior member of IEEE and a member of ACMSIGDA SIGARCH SIGBED and HiPEAC





the workload characteristics under unpredictable hardwarefactors (such as cache misses on-chip network contentionand so on) aid in effective DVFS at run-time The desiredproperties of the predictor for predicting the workloadbehavior and to perform DVFS could be outlined as fol-lows accurate yet lightweight in nature28 effectively cap-ture sporadic variations adapt to the variations continuousself-optimization as self-optimization is important to learnover time and low complexity with reduced number ofcomputations The rationale to have a lightweight predictoris to avoid additional computational overheads caused andalleviate the processing delays in the prediction Predic-tion with the aid of machine learning has been proven tobe effective along with capturing the workload behaviorsand underlying variations3839 Along the same lines wealso adapt the machine learning to predict the workload(s)characteristics in this work The considered workload char-acteristic for DVFS in this work is the power trace of theworkload(s)

11 Motivational Case StudyThe challenge of adapting to workload variations canbe effectively addressed by learning workload behaviorAs large number of relevant machine learning algorithmsand heuristic approaches are available for prediction it iscritical to choose a suitable technique that facilitates min-imizing the prediction errors by learning the underlyingfactors to predict the sporadic variations The selection cri-teria for the predictor is to be fairly accurate robust andlightweight in terms of computations and complexity Inorder to select an appropriate learning technique we per-form a case study and evaluate the performance of differ-ent prediction algorithms This case study is performed inMatlab Workload power traces are obtained from Sniper-Sim40 with applications running on a single-core micro-processor of Nehalem architecture (In-depth details ofexperimental setup is presented in Section 61) AverageRMSE for online learning based linear prediction neuralnetworks and support vector machine based prediction isindicated in Figure 1 The Y -axis represents the RMSEerror and the executed benchmark is marked on X-axis ofFigure 1

Comparison of root-mean square error (RMSE) forSPLASH-2 benchmarksrsquo workload (power trace) predic-tion with different prediction methods is illustrated inFigure 1 The employed prediction methods for compari-son are modified covariance based prediction (Mod Cov)linear prediction with online learning (Linear Pred) tra-ditional neural networks (Neural NW) and support vec-tor machine (SVM) For the linear predictor an order of4 is used The order is choosen by experimenting thepredictor with different orders and also the complexityinvolved Similarly for neural network based prediction asingle-hidden layer with 10 nodes is employed Levenberg-Marquardt backpropagation based training is used for neu-ral network

0

02

04

06

08

1

12

14

fft barnes cholesky fmm radix raytrace

Mod Cov Linear Pred Neural NW SVM

RMSE of NN depends on initialization slower than linear predictor

SVM has least error but resource hungry and much slower than linear predictor

LP (053)

NN (057)

SVM (012)

RM

SE

Fig 1 RMSE of workload power trace prediction using covariancelinear prediction neural network and support vector machine techniques(average RMSE is marked) X-axis represent benchmark

From Figure 1 one can observe that the neural networkhas lower RMSE for some of the benchmarks and can befurther reduced by modifying the neural network archi-tecture However it is experienced during the simulationsthat the accuracy or RMSE of the neural network highlyvaries depending on the initial state ie initialization ofthe weights Online learning based linear predictor out-performs modified covariance in terms of average RMSESupport vector machine (SVM) based regression outper-forms both neural network and online learning based lin-ear predictor implementations and has a minimal averageRMSE However the runtime for SVM based predictionis more than 10times higher than the neural networks and lin-ear prediction The complexity of a technique or a methodcan as well be observed experimentally from its runtimeas the involved computations determine the runtime of atechnique The runtime for Mod Cov Linear Pred Neu-ral NW and SVM is 0026 s 0012 s 0021 s and 093 srespectivelyAs the two main desired characteristics for the pre-

dictor in this work are (a) ability to learn work-loads and capture their behaviormdashthis can be observedfrom the RMSE (b) lightweight (low complexity) andfast learning predictormdashruntime is one of the factorsthat represents this The linear predictor with onlinelearning fairly performs well and has smaller averageRMSE Additionally the linear predictor with onlinelearning requires less computations and is faster com-pared to SVM and neural networks Depending on thisanalysis and to satisfy the constraint of accurate yetlightweight prediction we have chosen linear predictorwith online learning for the purpose of predicting work-loads in SmartDPM It needs to be noted that the RMSEmight be different if the parameters of the predictorare modified such as prediction order number of lay-ers in neural network However more number of lay-ers in a neural network will increase the complexityand cost for employing it for individual core will besignificant




























X1(1) X1(n)

Xm(1) Xm(n)

-

Application monitor

App1(VF) Appm(VF)

^ ^



Prediction

Error

Update

Xm(n+1) X1(n+1)

Smar

tDPM


predictor

VF allocator



















minPTotal


ii vjn fj nlarr xin


iv Thr ge Thrmin

(2)














Microprocessor


Linear model

Optimization

Observer

VF regulator

RM

SE

Workload

Predicted workload

VF pairai

xi(n)

xi(n+1)^




=psum

j=1

ajxinminusj+ (4)




psumi


















4sump



minus1xijminus ck









L3-cache 8 MB









Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)



























































































































X1(1) X1(n)

Xm(1) Xm(n)

-

Application monitor

App1(VF) Appm(VF)

^ ^



Prediction

Error

Update

Xm(n+1) X1(n+1)

Smar

tDPM


predictor

VF allocator



















minPTotal


ii vjn fj nlarr xin


iv Thr ge Thrmin

(2)














Microprocessor


Linear model

Optimization

Observer

VF regulator

RM

SE

Workload

Predicted workload

VF pairai

xi(n)

xi(n+1)^




=psum

j=1

ajxinminusj+ (4)




psumi


















4sump



minus1xijminus ck









L3-cache 8 MB









Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)










































































































minPTotal


ii vjn fj nlarr xin


iv Thr ge Thrmin

(2)














Microprocessor


Linear model

Optimization

Observer

VF regulator

RM

SE

Workload

Predicted workload

VF pairai

xi(n)

xi(n+1)^




=psum

j=1

ajxinminusj+ (4)




psumi


















4sump



minus1xijminus ck









L3-cache 8 MB









Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)







































































































Microprocessor


Linear model

Optimization

Observer

VF regulator

RM

SE

Workload

Predicted workload

VF pairai

xi(n)

xi(n+1)^




=psum

j=1

ajxinminusj+ (4)




psumi


















4sump



minus1xijminus ck









L3-cache 8 MB









Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)











































































































4sump



minus1xijminus ck









L3-cache 8 MB









Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)








































































































Snipersim










0

5

10

15

20

25

30

exp1 exp2 exp3 exp4

Pow

er (

W)



Uncorrelated

Loosely coupled

Tightly coupled

Correlated






0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)




































































































0 1 2 3 4 5 6 7

Time (ms) times104

2

4

6

8

10

12

Pow

er (

W)

OriginalPredicted






0

02

04

06

08

1

12


Nor

mal

ized

pow

er


Proposed 064

(a)

0

02

04

06

08

1

12


Nor

mal

ized

tim

e


Proposed 022


(b)










0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)




































































































0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

pow

er


Proposed 064


(a)

0

02

04

06

08

1

12

exp1 exp2 exp3 exp4

Nor

mal

ized

tim

e


Proposed 059


(b)




0

02

04

06

08

1

12


Nor

mal

ized

pow

er


066


(a)


0

02

04

06

08

1

12


Nor

mal

ized

tim

e


(b)

































































































smartdpm: machine learning-based dynamic power …mason.gmu.edu/~spudukot/files/jolpe18.pdf · p....

Documents