
A Dynamic Programming Framework for DVFS-based Energy-Efficiency in Multicore Systems

Shervin Hajiamini, Behrooz Shirazi, Aaron Crandall, Hassan Ghasemzadeh

Abstract—Per-core Dynamic Voltage and Frequency (V/F) Scaling (DVFS) is a well-known methodology for achieving energy efficiency in multicore systems. Heuristic DVFS techniques provide fast, suboptimal V/F predictions, while Dynamic Programming (DP) methods solve smaller sub-problems iteratively and use their outcomes to evaluate V/F levels globally, but at the cost of overhead delays. We propose an efficient DP framework based on the Viterbi algorithm, which uses the Energy-Delay Product (EDP) as its objective function to predict the best V/F levels from applications' profiled information, minimizing energy consumption and execution time. Experimental results show that our framework outperforms heuristics under the EDP criterion and provides near-optimal solutions when maximizing energy saving is as, or more, important than minimizing the execution time penalty. Across several benchmarks, our proposed algorithm provides a 12 to 75% improvement in EDP compared to heuristic methods. Furthermore, using a Pareto frontier to evaluate the solutions of the algorithms under study, we demonstrate that our framework's energy-time solution is on average only 9% worse than the optimal solution. In addition, we show that our dynamic programming solution is 3 to 18% closer to a theoretical lower bound than the studied heuristic methods.

Index Terms—Dynamic voltage and frequency scaling, Energy efficiency, Dynamic programming, The Viterbi algorithm, Pareto frontier.


1 INTRODUCTION

Modern large-scale computing systems, such as data centers and High Performance Computing (HPC) clusters, massively integrate multicore chips in their system design. Such systems are severely constrained by power and cooling costs as they try to meet the computing needs of emerging extreme-scale (or exascale) applications. The power efficiency and timing constraints of high-performance multicore chips become more critical as the number of cores in a single chip increases.

Dynamic Voltage/Frequency (V/F) Scaling (DVFS) remains one of the most effective methods for adjusting processing cores' V/F levels to match an application's performance/energy-efficiency goals: increasing cores' V/F levels during compute-intensive phases of the application and lowering them when the cores are less utilized. Effective DVFS techniques rely on predictions of future application or system behavior (e.g., prediction of future busy/idle cycles) to make correct adjustments to the V/F levels now.

Existing heuristics-based core-level DVFS methods perform V/F level predictions based on the current system state without examining all possible solutions. Although greedy techniques are fast, they do not in general produce globally optimal solutions. Dynamic Programming (DP) techniques address this limitation by solving sub-problems sequentially using previously solved smaller sub-problems. At each stage, the sub-problems' solution values are stored to avoid re-solving the same sub-problems in the next stage. At the final stage, a DP technique moves backward through the sub-problems' solutions, stage by stage, to obtain the optimal solution [23].

This paper proposes a fast DP framework based on the Viterbi algorithm [30] to find optimal per-core V/F levels over an application's execution phases at runtime. The paper targets application-specific kernel codes or benchmarks that are executed periodically, either as embedded applications or as core segments of larger applications. To this end, the workloads of these applications are extensively analyzed by optimization methods at compile-time to obtain a more accurate picture of system status at runtime. The proposed technique uses the Energy-Delay Product (EDP) metric [1] as its cost function to determine the energy consumption and execution time tradeoff at each application execution phase. Our Viterbi-based DP (VDP) algorithm has polynomial time complexity and low implementation overhead, particularly when the number of V/F levels is less than the number of the application's execution phases.

This paper presents the following:

• A Viterbi-based Dynamic Programming algorithm is proposed to perform per-core DVFS that trades off the performance and energy consumption goals of applications. The VDP algorithm uses EDP as its objective function to predict the best V/F levels that maximize the cores' energy savings while minimizing application execution time. The VDP algorithm's states are defined using the application's profiled energy and time data.

• The proposed VDP algorithm's performance is compared to a greedy algorithm, an ondemand governor, and a feedback controller method. Using EDP as the metric to evaluate these algorithms, experimental results demonstrate that VDP outperforms the heuristics under study by an average of 12 to 75%.

• Since comparing the VDP performance to many

•The authors are with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163. Email: {shervin.hajiamini, shirazi, acrandal}@wsu.edu, [email protected]



other state-of-the-art DVFS algorithms is not practical, the optimality of VDP and the other algorithms under study is evaluated using a Pareto frontier. We demonstrate that the VDP algorithm has the closest performance to its corresponding solution on the Pareto frontier curve (on average 9% across the experimented benchmarks). Furthermore, we show that the VDP algorithm's solutions are on average 3 to 18% closer to a theoretical lower-bound energy-time solution (best timing and best energy efficiency) than the other studied heuristics.

This paper is structured as follows. Section 2 gives an overview of algorithms related to the ones discussed in this paper. Section 3 presents a model for executing the experimented applications; it also defines an application interval as well as the tasks executed per interval in this execution model. Section 4 explains a strategy for profiling data from the executed applications. Based on the profiled data, this section defines 1) the problem objectives and 2) the system states used by the algorithms to perform DVFS. Section 5 gives detailed descriptions of VDP and the other heuristics against which VDP is evaluated. Section 6 provides analyses of the runtime and the amount of memory consumed by the algorithms. Section 7 explains the setup of the states, the experimented applications, and the simulation toolsets. Section 8 compares VDP to the heuristics and discusses the optimality of the various solutions obtained. Concluding remarks and future work are discussed in Section 9.

2 RELATED WORK

The application of Dynamic Power Management (DPM) techniques has been extensively studied in multicore systems, aiming to scale cores' V/F levels during an application run to trade reduced static/dynamic energy consumption for an acceptable execution time. To achieve this goal, a wide range of system characteristics has been studied and evaluated, including, but not limited to, memory access delays [2], [3], slack reclamation [4], [5], [37], task characterization and scheduling [6], [10], application phase detection [8], [9], compiler-based instruction set optimizations [10], [11], and communication bandwidth [38].

Using these parameters, various optimization algorithms have been proposed to balance power/energy consumption and execution time, such as linear programming [12], simulated annealing [13], genetic algorithms [14], game theory-based [15], and machine learning techniques [16]. Among the large body of literature in the DPM/DVFS arena, we mainly review predictive, feedback control-based, and dynamic programming techniques, four of which are implemented and compared in this paper.

Predictive techniques predict the future system state either greedily, based on the current system state, or by considering the history of system states. Ioannou et al. [17] devise a DVFS methodology at the granularity of multiple core domains that tracks recurrent, Message Passing Interface-based application phases. According to this methodology, when an application phase is re-encountered at runtime, the cores' V/F levels are scaled up or down depending on performance degradation. If the performance degradation caused by running this phase at a lower V/F level at its previous occurrence does not exceed a threshold, the V/F levels are decreased; otherwise, they are scaled up.

Spiliopoulos et al. [2] predict cores' execution times based on measuring memory latency cycles (due to last-level cache misses) that scale with the cores' frequencies. Using the memory cycles to measure the application performance loss, [2] evaluates the cores' energy efficiency, using EDP, under multiple V/F levels. Lai et al. [18] propose a profile-guided DVFS based on the CPU-intensiveness of application phases, measured by instructions per cycle and bus utilization. This profile is used by an analytically derived power-performance model, which finds suitable V/F levels at domain-level granularity that minimize EDP. Isci et al. [9] design a lookup table that performs phase predictions at runtime. The phase prediction table, which guides DVFS, is constructed by categorizing the memory bus transaction to retired instruction ratio (the phase pattern) into multiple levels. A history window of previous patterns and their corresponding predictions is stored in the table to be used when consecutive similar patterns are found at runtime.

Feedback control approaches adjust the cores' voltage and frequency at every control interval based on the difference between the current system performance and a target point. David et al. [19] develop a feedback-based algorithm that controls the multicore tiles' V/F levels in the Intel Single-chip Cloud Computer (SCC) platform. Their feedback controller adjusts the tiles' V/F levels considering a reference queue occupancy level and the inter-VFI communication workload, which is measured by message arrival/service rates. Leva et al. [20] integrate a thermal controller with a power/performance controller to regulate per-core temperature. This event-driven controller is invoked to improve system performance and reduce power consumption overhead only when either a temperature limit or a timeout is exceeded.

Kim et al. [21] dynamically tune a Voltage/Frequency Island-based (VFI) system's V/F levels with a feedback controller. The control variable defined in this work is a metric that captures average core and link utilizations within the VFIs. This metric, whose value falls into one of several pre-defined ranges and is mapped to a V/F level in a lookup table, is augmented by a tracking error (between the V/F level predicted for the next time interval and the observed metric for that interval) to increase the controller's prediction accuracy. Lukefahr et al. [22] propose a controller operating between two cores, one high-performance and the other energy-efficient, combined in a heterogeneous dual-core processor to maximize energy saving within a performance loss constraint. The control decision to switch between these cores is based on an estimate of the performance loss vs. energy waste tradeoff in the next time interval.

Dynamic Programming (DP) techniques overcome local decision-making limitations, whether the predictions are made greedily or are compared to expected outcomes to enhance future predictions. To approximate optimal


solutions, DP techniques use the principles of overlapping subproblems, optimal substructures, and memoization [23]. Niu et al. [24] propose a DP algorithm for the task voltage assignment problem on a Directed Acyclic Graph (DAG). The task voltage assignments are determined by traversing the DAG bottom-up, where a subgraph's feasible solution is defined as the one that probabilistically minimizes the subgraph's energy consumption and satisfies an execution time constraint.

Zhong et al. [25] formulate a voltage scheduling problem that minimizes the system energy consumption and solve it with an exact DP in polynomial time. For each V/F level, they define a state to be the tasks' cumulative utilizations and energy consumptions on a partial path up to the current task. For any two states, the one with higher energy consumption and utilization is pruned by the dominance relation. The DP's backtracking phase constructs optimal V/F level sequences starting from a state with maximum cumulative utilization.

Jung et al. [26] apply a value iteration algorithm as a DP that uses the Bellman equation to minimize a system state (power-delay product) cost. This cost is defined as the sum of the instantaneous energy consumption caused by applying an action (V/F level change) to the current system state and the discounted next-state energy consumption given the state transition probabilities. Kang et al. [27] transform the cores' V/F level decision problem into finding the optimal number of cache banks per core to maximize performance without violating temperature constraints. They use a DP whose sub-problem solutions correspond to cache bank-to-core allocations that maximize total system throughput (instructions per second).

3 WORKLOAD EXECUTION MODEL

3.1 Assumptions

Workloads, which will be explained in more detail in Section 7.3, consist of multithreaded benchmarks. Based on modern memory system architecture, we assume a shared memory (e.g., L2 cache or main memory) is accessed by cores to facilitate inter-core data traffic at runtime [28]. All execution runs reported in this paper are performed on the benchmarks' parallel section, known as the Region Of Interest (ROI).

3.2 Impact of Communication on Tasks' Execution Times

The benchmarks experimented with in this paper utilize a data-parallel parallelization model. In this model, the benchmark input data is distributed among cores, where each core, in parallel with the others, executes the program instructions to process a different subset of the data (a task). For communicating data among cores during benchmark execution, the underlying architecture provides shared memory as mentioned in Section 3.1. Accessing the data through the shared memory causes memory access delays. These delays, which are treated as communication delays, are embedded in the execution times of the tasks, as will be explained in Section 3.3. Our previous work [39] extensively studied the impact of the communication delays on system energy efficiency, so it is not discussed here. However, it should be noted that such delays affect the execution lengths of the benchmark tasks, which consist of computation and communication (memory access) periods. This paper utilizes the varied execution lengths of the tasks per application interval to slow down cores running small tasks and speed up the ones running larger tasks, thereby optimizing the energy efficiency for the entire application runtime.

3.3 Execution Phase Definition

The benchmark ROI consists of a set of execution intervals (or phases), where in each interval a set of parallel threads runs simultaneously on multiple cores. All threads in an interval synchronize using synchronization operations (e.g., locks and barriers), as demonstrated in Fig. 1, which shows a three-phase execution of the RADIX sort application during its ROI. Note that the execution times of threads running within the same interval/phase differ, depending on (1) the input data and its size and (2) the memory access delays at runtime. This is shown in Fig. 2, where, for example, during interval t1, τ2,1 takes the longest time to complete while τ1,1 and τr,1 complete their executions earlier in that interval. Thus, the interval time or length is defined by the execution time of the slowest thread during that interval. It should be noted that the interval lengths vary from one another, owing to the nature of the computations and communications during each interval. For example, computing the first key rank, shown as the second phase in RADIX (Fig. 1), requires more inter-core data transfers (memory reads/writes) than the other two phases, which have more computations and local data transfers.

Fig. 1. Three-phase execution run of the RADIX benchmark (count key block value occurrences in each processor; compute the first key rank with a particular block value; compute the overall keys' rank), with cores c1…cN synchronizing at barriers.

3.4 Task Set Definition

By defining the execution run of a thread in an interval as a task, the benchmark's parallel computational phases can be modeled by a task set T = {tj | 1 ≤ j ≤ p}, as shown in Fig. 2, where tj = {τi,j, 1 ≤ i ≤ r} denotes a set of subtasks τi,j that execute on cores ci, 1 ≤ i ≤ r, in interval tj, and whose execution times may include memory access delays for data exchange among the subtasks through the shared memory. In Fig. 2, during an interval, gray portions show the computation periods of cores executing tasks and black portions show the cores' idle periods, representing overheads caused by inter-core synchronizations at the end of each interval. In our algorithm, we take advantage of these variations in idle periods to improve energy efficiency and performance by slowing down cores that execute tasks with longer idle periods and speeding up cores with tasks that have shorter idle periods within any given interval. A characteristic of this approach is that dependencies among the tasks are automatically preserved by the


synchronization constructs. The amount of variation among the idle periods in an application depends on the cores' computational workloads in each of the application's execution intervals, which is partially impacted by memory access (i.e., inter-core communication) delays.

4 PRELIMINARIES

4.1 Application Profiling

The energy-efficiency algorithms studied in this work rely on three application parameters: tasks' execution times, tasks' energy consumption, and cores' utilization. A task's execution time during an application phase is the time that a core executes the task before reaching a barrier (shaded boxes in Fig. 2). Energy consumption of a task is the core's power usage over the task's execution time. Busy utilization during an application phase is defined as the ratio of a core's busy cycles while executing the task to the total cycles (sum of busy and idle cycles) in the execution phase.

The above three parameters are obtained by profiling the application. The profiling of a benchmark/application collects the execution time, energy consumption, and busy utilization parameters for each application phase in the benchmark, for each possible V/F level. It is noted that even though profiling is a time-consuming process (large overhead), it is carried out only once per application and only at compile-time (prior to the actual application execution). Therefore, this overhead is not considered in the complexity of the energy-efficiency algorithms.

4.2 Compile-time vs. Runtime V/F Tuning

Our proposed optimization framework follows an optimize-once, execute-many-times realization model. This means that the framework is a good fit for kernel code or embedded applications whose V/F levels can be optimized once at compile-time and then applied many times over. Thus, for these applications, the following steps are taken to realize the DVFS performed by all the energy-efficiency algorithms studied in this paper:

1. At compile-time

i. Profile the application and obtain the execution time, energy consumption, and busy utilization parameters.

ii. Apply an energy-efficiency algorithm (e.g., VDP, feedback, or ondemand governor) using the profiled parameters to predict the best V/F levels for each application interval and each core. Some of these algorithms use local information (e.g., the ondemand governor) and some, like VDP, use global information to perform this static (compile-time) optimization. Once the best predicted V/F levels (per core and per interval) are obtained, they are stored in a look-up table to be used at runtime.

2. At runtime

At each application interval, the system (e.g., OS or microkernel) consults the look-up table obtained in step 1 to fetch each core's V/F level and issues it to the corresponding voltage regulators with minimal runtime overhead.

4.3 Problem Definition

This paper aims at determining the cores' V/F levels per application phase/interval to minimize the application's execution time (makespan) and total energy consumption, i.e., formulations (1) and (2), respectively:

\min \sum_{j} \max_{i} \Big( \sum_{l} d_{i,j,l} \, x_{i,j,l} \Big), \quad \forall t_j, \; l \in VFL \qquad (1)

\min \sum_{j} \sum_{i} \sum_{l} e_{i,j,l} \, x_{i,j,l}, \quad \forall t_j, \; l \in VFL \qquad (2)

where di,j,l and ei,j,l in (1) and (2) denote the execution time (delay) and energy consumption of executing subtask τi,j under V/F level l in interval tj, respectively. xi,j,l is a decision variable that indicates whether the subtask τi,j is executed under V/F level l during interval tj. The max(.) function in (1) determines the interval's length based on the slowest core's execution time among all cores in that interval. Of note, VFL is a set of discrete V/F levels, i.e., VFL = [1, L], where L is the maximum V/F level.
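To make the formulation concrete, the following minimal Python sketch (not the authors' code) evaluates objectives (1) and (2) for a given per-core, per-interval V/F assignment; the arrays d, e, and assignment are hypothetical stand-ins for the profiled data described in Section 4.1.

import numpy as np

# Minimal sketch: evaluate objectives (1) and (2) for one V/F assignment.
# Assumed hypothetical inputs:
#   d[i, j, l], e[i, j, l] -- profiled delay and energy of subtask tau_{i,j}
#                             on core i in interval j at V/F level l
#   assignment[i, j]       -- chosen V/F level index for core i in interval j
def makespan_and_energy(d, e, assignment):
    r, p, L = d.shape                       # cores, intervals, V/F levels
    total_time, total_energy = 0.0, 0.0
    for j in range(p):
        # Delay of each core in interval j under its assigned level
        delays = [d[i, j, assignment[i, j]] for i in range(r)]
        total_time += max(delays)           # interval length = slowest core, as in (1)
        total_energy += sum(e[i, j, assignment[i, j]] for i in range(r))  # as in (2)
    return total_time, total_energy

# Toy usage with random "profiled" data (3 cores, 4 intervals, 6 V/F levels)
rng = np.random.default_rng(0)
d = rng.uniform(0.01, 0.05, (3, 4, 6))
e = rng.uniform(0.5, 2.0, (3, 4, 6))
assignment = rng.integers(0, 6, (3, 4))
print(makespan_and_energy(d, e, assignment))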

4.4 State Definition

To evaluate V/F level predictions for future benchmark intervals, we use two state definitions. The first state definition is based on per-task energy consumption and execution time measured over multiple V/F levels. This state definition is used by the proposed VDP and greedy algorithms. The second state definition is based on per-task computational workload (or per-core utilization in an interval), which is used by the ondemand governor and feedback controller. The algorithmic implementations of the feedback controller and ondemand governor presented in this paper leverage cores' past utilizations to predict future states. Therefore, the core utilization-based state definition is a reasonable choice for these heuristics.

As previously mentioned, each core's V/F levels are determined independently of the V/F levels of the other cores. Hence, for the rest of this paper, the index i, which corresponds to core ci, is not shown in equations and algorithmic descriptions.

Energy/Time-based state definition: The state set is defined as follows:

S = \{ s_j \mid \forall t_j \} \qquad (3)

s_j = \{ (e_{j,l}, d_{j,l}) \mid \tau_{i,j}, \; \forall l \} \qquad (4)

where S in (3) is the system state set. sj in (4) denotes a state defined by the energy consumption and execution time of task τi,j in interval tj under V/F level l. Intuitively, for

Fig. 2. The task set example generated by executing a p-phase application on r cores


each interval j, state sj has multiple instances, where each instance corresponds to a (ej,l, dj,l) pair obtained at a particular V/F level l. Thus, the number of states (instances) for each interval j depends on the number of V/F levels in VFL. Since the cores execute tasks with unique computational characteristics in an interval, each core has different states, i.e., (ej,l, dj,l) pairs, than the other cores within and across the application's intervals.

Utilization-based state definition: This state definition is based on per-core utilizations over all the intervals measured under multiple V/F levels.

S = \{ s_l \mid \forall l \} \qquad (5)

s_l = \{ \bar{u}_j \mid \forall t_j \} \qquad (6)

\bar{u}_j = \frac{\sum_{l} u_{j,l}}{|VFL|}, \quad \forall t_j \qquad (7)

s_l \cap s_{l'} = \varnothing, \quad \forall l, l' \in VFL, \; l \ne l' \qquad (8)

where, in (6), sl, i.e., the state defined for V/F level l, is

a set of average core utilizations ūj. For each interval tj, the average core utilization over the V/F levels l, ūj, is computed as shown in (7). Equation (8) indicates that the states are non-overlapping.

The boundaries of the states defined in (6) are determined by K-means clustering [29]. K-means is a well-known learning algorithm that partitions a data set into multiple clusters such that each cluster contains a subset of similar data according to a distortion metric. This paper uses K-means clustering to form clusters of core utilizations by executing the following steps iteratively:

s_l = \big\{ \bar{u}_j : \; \| \bar{u}_j - \mu_{s_l} \|^2 \le \| \bar{u}_j - \mu_{s_{l'}} \|^2, \; \forall s_{l'} \in S \big\} \qquad (9)

\mu_{s_l} = \frac{\sum_{\bar{u}_j \in s_l} \bar{u}_j}{|s_l|} \qquad (10)

where, in (9), ūj is assigned to the cluster (state) sl whose centroid (cluster average value), μs, has the smallest squared Euclidean distance to ūj compared with the centroids of the other states s'l. This step is followed by (10), which computes the average of all the ūj assigned in the previous step (9). It should be noted that the produced clusters do not overlap and each cluster is associated with a V/F level, where the clusters with the minimum and maximum centroids are associated with the lowest and highest V/F levels (l = 1 and l = L in VFL), respectively.
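As an illustration of how the utilization-based states can be built, the sketch below implements a plain K-means over per-interval average utilizations and maps the ordered centroids to V/F levels. It is a simplified, assumption-based example (one core, 1-D utilizations), not the toolchain used in the paper.

import numpy as np

# Cluster per-interval average utilizations (eq. 7) into |VFL| non-overlapping
# states (eqs. 9-10) with plain K-means, then map clusters to V/F levels by
# centroid order (lowest centroid -> lowest level).
def utilization_states(u_bar, num_levels, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = rng.choice(u_bar, size=num_levels, replace=False)
    for _ in range(iters):
        # Assignment step (eq. 9): nearest centroid by squared distance
        labels = np.argmin((u_bar[:, None] - centroids[None, :]) ** 2, axis=1)
        # Update step (eq. 10): centroid = mean of the assigned utilizations
        for k in range(num_levels):
            if np.any(labels == k):
                centroids[k] = u_bar[labels == k].mean()
    order = np.argsort(centroids)
    level_of_cluster = {int(c): rank + 1 for rank, c in enumerate(order)}
    return centroids, np.array([level_of_cluster[int(k)] for k in labels])

# Toy usage: 16 intervals of average utilization, 6 V/F levels
u_bar = np.random.default_rng(1).uniform(0.1, 1.0, 16)
centroids, levels = utilization_states(u_bar, num_levels=6)
print(levels)  # predicted V/F level (1..6) per interval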

5 ALGORITHM DESCRIPTIONS

This section describes the VDP algorithm and the three heuristic algorithms that solve the per-core DVFS problem discussed in Section 4.3, i.e., (1)-(2). It should be noted that the algorithms' pseudocodes presented here are executed per core. Therefore, the benchmarks' total energy consumption and execution time are obtained after all the cores complete the execution of the tasks assigned to them.

5.1 Viterbi-based Dynamic Programming (VDP)

VDP is based on the Viterbi algorithm [30], a dynamic programming technique that operates on a trellis diagram (Fig. 3) spanned over P steps (intervals), where in each step j the system state sj is in one of L states. Viterbi operates on the trellis in two phases: forward and backward. In the forward phase, in every step j and for every state l, the objective function value (e.g., EDP in our study) is computed for all state sequences starting from a state in the first step and ending in the l-th state at the j-th step, and the corresponding objective value is stored. After reaching the last step (P), Viterbi uses these locally stored objective values to backtrack the trellis and find the state sequence that provides the best global solution.

Given a core's state sj-1 in the current interval j-1 and the possible states sj in the next interval j, the cost (C) of the path that consists of a state sequence starting from s1 and leading to sj through sj-1 is computed as follows:

C(s_1 \rightarrow s_{j-1}, s_j) = f^{\,n}\big(E_{j-1,l'}, \; \Delta(e_{j-1,l'}, e_{j,l}), \; e_{j,l}\big) \cdot f^{\,m}\big(D_{j-1,l'}, \; \Delta(d_{j-1,l'}, d_{j,l}), \; d_{j,l}\big), \quad \forall t_j, \; l, l' \in VFL \qquad (11)

where the symbol → indicates the state sequence from s1 to sj-1 on the trellis. Ej-1,l' and Dj-1,l' are the cumulative energy consumption and execution time (corresponding to the state sequence) up to state sj-1 at V/F level l'. The symbol ∆ denotes the instantaneous transition energy and time costs between states sj-1 and sj at V/F levels l' and l, respectively. f(.) is used to add up and normalize the energy and time variables mentioned above. The integer parameters n and m weigh the energy consumption and execution time portions of the cost function (11).

Having computed the path costs that cross through all the states sj-1 and sj in intervals j-1 and j, respectively, the state sj-1 on the path with the minimum cost is recorded for each state sj:

s^{*}_{j-1} = \underset{s_{j-1}}{\arg\min} \; C(s_1 \rightarrow s_{j-1}, s_j), \quad \forall s_j \qquad (12)

where s*j-1 is the state, among all states sj-1 (corresponding to V/F levels l'), that minimizes the cost function of a state sequence up to interval j. The best state sequence (V/F levels) is obtained after determining the state sP, in the P-th (last) interval, with minimum cost and moving backward on the trellis:

s^{best}_{j-1} = b\big(s^{best}_{j}\big) \qquad (13)

where sbestj is the best state (V/F level) in interval j and b(.) is the backtracking function, which saves the previously visited states s*j-1 whose cost function values (11) are the minimum among the other states on a path from t1 to tj-1.

The VDP algorithm's pseudocode is described in Algorithm 1. The algorithm's variables are explained in Lines 1-4. The states' cumulative energy consumption and execution time are initialized (Line 5). For the remaining intervals, the algorithm computes, for each state, the cumulative energy consumption and execution time up to the current interval (Lines 6-9). The algorithm records the state in the previous interval that led to the minimum cost (weighted EDP) for the state in the current interval (Line 10) and updates the cumulative energy consumption and execution time for that state (Line 11). Finally, the algorithm determines the state in the last interval with the minimum weighted EDP (Line 14) and backtracks to find the optimal state sequence (Lines 15-17).

Fig. 3. Trellis diagram for L states and P steps

Algorithm 1. VDP pseudocode

1. e1,L, d1,L: energy and time in the first interval under the highest V/F level L.
2. Ej,l, Dj,l: cumulative energy and time up to a state in interval j at V/F level l.
3. E*j,l, D*j,l: cumulative energy and time with minimum weighted EDP.
4. s*j: the best state in interval j.
5. E1,l = e1,L, D1,l = d1,L
6. for tj from t2 to tp do:
7.     for each sj do:
8.         Ej,l = E*j-1,l' + ej,l + ∆(ej-1,l', ej,l)
9.         Dj,l = D*j-1,l' + dj,l + ∆(dj-1,l', dj,l)
10.        b(sj) = argmin over sj-1 of ((Ej,l)^n · (Dj,l)^m)
11.        E*j-1,l = Ej,b(sj), D*j-1,l = Dj,b(sj)
12.    end for
13. end for
14. sbest_p = argmin over s of ((Ep,l)^n · (Dp,l)^m)
15. for tj from tp to t2 do:
16.    sbest_j-1 = b(sbest_j)
17. end for
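For readers who prefer running code, the following Python sketch mirrors Algorithm 1 for a single core under stated assumptions: profiled arrays e[j, l] and d[j, l], optional transition-overhead functions, and the weighted EDP cost (E^n · D^m) from (11). It is an illustrative re-implementation, not the authors' MATLAB code.

import numpy as np

# Viterbi-style forward pass over a trellis of |VFL| states per interval,
# minimizing the weighted EDP, followed by backtracking (eqs. 11-13).
def vdp_one_core(e, d, n=1, m=1, de=lambda a, b: 0.0, dd=lambda a, b: 0.0):
    P, L = e.shape
    E = np.zeros((P, L))
    D = np.zeros((P, L))
    back = np.zeros((P, L), dtype=int)             # backpointers b(s_j)
    E[0, :], D[0, :] = e[0, L - 1], d[0, L - 1]    # init at the highest V/F level
    for j in range(1, P):
        for l in range(L):
            # Candidate cumulative costs coming from every previous state l'
            cand_E = [E[j - 1, lp] + e[j, l] + de(lp, l) for lp in range(L)]
            cand_D = [D[j - 1, lp] + d[j, l] + dd(lp, l) for lp in range(L)]
            edp = [(cE ** n) * (cD ** m) for cE, cD in zip(cand_E, cand_D)]
            best = int(np.argmin(edp))             # as in eq. (12)
            back[j, l] = best
            E[j, l], D[j, l] = cand_E[best], cand_D[best]
    # Pick the final state with minimum weighted EDP, then backtrack (eq. 13)
    levels = [int(np.argmin((E[P - 1] ** n) * (D[P - 1] ** m)))]
    for j in range(P - 1, 0, -1):
        levels.append(int(back[j, levels[-1]]))
    return levels[::-1]                            # V/F level index per interval

# Toy usage: 8 intervals, 6 V/F levels with random "profiled" data
rng = np.random.default_rng(2)
e = rng.uniform(0.5, 2.0, (8, 6))
d = rng.uniform(0.01, 0.05, (8, 6))
print(vdp_one_core(e, d, n=1, m=1))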

Metric for VDP Cost Function

Generally speaking, any DVFS algorithm, including the VDP proposed in this paper, aims at scaling the V/F levels of cores at runtime to optimize the desired objectives. Typical DVFS algorithms, whether they are used as prototypes for empirical studies or implemented in real embedded systems (e.g., Linux governors), use core workloads to predict V/F levels. To use core workloads as a metric for V/F level decisions, one has to carefully define workload thresholds for selecting suitable V/F levels while avoiding unnecessarily frequent V/F level switching. This paper demonstrated this limitation when constructing the lookup table (Section 4.4) for the feedback and ondemand algorithms. To address this limitation, VDP uses our objective parameters, execution time and energy consumption ((1)-(2)), as the V/F selection criteria for its cost function. To account for the simultaneous impact of energy on time and vice versa, these parameters are combined into a well-known metric, the Energy-Delay Product (EDP) [1]. EDP is used to measure tradeoffs between energy reduction and performance loss on the time-energy Pareto frontier. It should be noted that utilizing energy and time as the criteria for V/F level decisions provides sensible outcomes only when their values are accurately estimated. As such, this paper uses the profiling strategy discussed in Section 4.1 to estimate the energy usage and execution time of the tasks run by the cores under all V/F levels per application interval.

5.2 Greedy

This algorithm's policy is based on a simplified version of the Viterbi algorithm, i.e., it only considers the cores' current states to make future state predictions (with no backtracking or global optimization). The state representation defined in this algorithm is based on (4), and EDP is used to evaluate the energy consumption and execution time tradeoff over all the states.

Algorithm 2 describes the proposed greedy algorithm. The variables and procedure in Algorithm 2 are similar to the ones in Algorithm 1, except that after the initialization (Algorithm 1, Line 5), for each interval tj, the algorithm greedily chooses the state s*j whose EDP is minimum among all states sj in interval j (Line 6). Therefore, the cost of the existing path (state sequence) is only updated by the energy and time of the predicted states s*j (Line 7).

Algorithm 2. Greedy pseudocode

1. for tj from t2 to tp do:
2.     for each sj do:
3.         Ej,l = E*j-1,l' + ej,l + ∆(ej-1,l', ej,l)
4.         Dj,l = D*j-1,l' + dj,l + ∆(dj-1,l', dj,l)
5.     end for
6.     s*j = argmin over sj of ((Ej,l)^n · (Dj,l)^m)
7.     E*j,l = Es*j,l, D*j,l = Ds*j,l
8. end for
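A corresponding sketch of the greedy policy is given below; it reuses the same assumed inputs as the VDP sketch above and, for brevity, omits the transition overheads ∆.

import numpy as np

# At each interval, pick the V/F level that minimizes the EDP of the path so
# far, without backtracking (a simplification of Algorithm 2).
def greedy_one_core(e, d, n=1, m=1):
    P, L = e.shape
    cum_E, cum_D = e[0, L - 1], d[0, L - 1]        # start at the highest level
    levels = [L - 1]
    for j in range(1, P):
        cand_E = cum_E + e[j, :]                   # cumulative energy per level
        cand_D = cum_D + d[j, :]                   # cumulative delay per level
        best = int(np.argmin((cand_E ** n) * (cand_D ** m)))
        cum_E, cum_D = cand_E[best], cand_D[best]
        levels.append(best)
    return levels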

5.3 Feedback Controller

In this algorithm, the prediction error is incorporated into the objective function to prevent unnecessary V/F level changes caused by short-term workload variations at runtime. The core utilization-based state definition (explained in Section 4.4) is used to construct a time-invariant lookup table of state and unique V/F level pairs. We deploy the well-known Exponentially Weighted Moving Average (EWMA) algorithm [31], which implements the feedback controller, to predict the core utilization in the next interval based on the core utilization in the current interval while accounting for the prediction error.

Algorithm 3 describes the feedback controller's pseudocode. Lines 1-6 explain the algorithm's variables. Line 7 performs the core utilization prediction for the second interval. Using the weighted sum of the actual core utilization in the current interval, which is obtained based on the state (and corresponding V/F level) predicted in the previous interval, and the core utilization prediction history, the core utilization for the next interval is predicted (Line 9). After that, the state in which the predicted core utilization falls is determined and its corresponding V/F level is obtained from the lookup table (Line 10). In case the predicted core utilization does not fall into any of the states, the state whose centroid is the closest to the predicted core utilization is selected for the next interval (Lines 11-13).
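A compact sketch of this EWMA-based prediction is shown below; it assumes hypothetical inputs (a per-interval utilization trace and the state centroids from Section 4.4) and collapses Lines 10-12 of Algorithm 3 into a single nearest-centroid lookup.

import numpy as np

# Predict next-interval utilization by EWMA, then map the prediction to a
# V/F level through the utilization-state centroids (sorted so index 0 maps
# to the lowest V/F level).
def feedback_levels(util, centroids, w=0.2):
    pred = util[0]                                 # u*_2 = u_1
    levels = []
    for j in range(1, len(util)):
        pred = w * util[j] + (1.0 - w) * pred      # EWMA prediction u*_{j+1}
        # Pick the state whose centroid is closest to the predicted utilization
        state = int(np.argmin(np.abs(np.asarray(centroids) - pred)))
        levels.append(state + 1)                   # V/F level index in [1, L]
    return levels

# Toy usage: a utilization trace and 6 evenly spaced state centroids
util = [0.9, 0.7, 0.4, 0.5, 0.85, 0.3]
print(feedback_levels(util, centroids=np.linspace(0.1, 0.95, 6), w=0.2))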

5.4 Ondemand Governor

This algorithm is based on the ondemand governor used in the


Algorithm 3. Feedback Controller pseudocode

1. w ∈ [0,1]: parameter that weighs the importance of the current core utilization vs. the prediction history.
2. ū1: core utilization in the first interval at the highest V/F level L.
3. ū*2: predicted core utilization for the second interval.
4. ū*j+1: expected core utilization computed for the next interval j+1.
5. μ_sl: centroid (cluster's average core utilization) for state sl.
6. s*l: predicted state
7. ū*2 = ū1
8. for tj from t2 to tp do:
9.     ū*j+1 = ū*j + w·(ūj - ū*j) = w·ūj + (1 - w)·ū*j
10.    s*l = the state sl ∈ S such that ū*j+1 ∈ sl
11.    if no such sl exists then:
12.        s*l = argmin over sl of |ū*j+1 - μ_sl|
13.    end if
14. end for

Linux kernel [40], which adjusts the V/F levels of a core according to the current core workload. To account for the variation of the core workload across the application intervals, the governor used in this paper also utilizes workloads in the previous intervals for adjusting the V/F levels. Technically, this algorithm predicts the core's utilization for the next interval based on the average of the core's utilizations in the current and previous intervals. The lookup table constructed in Section 5.3 is also used by the ondemand governor to find the correct state and select the corresponding V/F level for the next interval.

The pseudocode in Algorithm 4 describes the ondemand governor algorithm. The algorithm's variables are explained in Lines 1-4. Line 5 initializes the history level array, which is used by the algorithm to obtain the actual core utilization based on the previously predicted V/F levels. Lines 7-12 predict the core utilization (and its corresponding V/F level) for the next interval based on the average of the actual core utilizations in the previous intervals, whose values depend on the V/F levels predicted in those intervals. Line 13 adds the corresponding V/F level to the history level array to be used for the next V/F prediction.
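The sketch below illustrates this moving-average prediction under the same assumptions as the feedback sketch above; hi controls how many past utilizations are averaged.

import numpy as np

# Predict the next interval's utilization as the average of the last `hi`
# observed utilizations, then map it to a state/V-F level by nearest centroid
# (a simplified stand-in for the lookup in Algorithm 4, Lines 9-12).
def ondemand_levels(util, centroids, hi=1):
    centroids = np.asarray(centroids)
    levels = [len(centroids)]                      # first interval at level L
    for j in range(1, len(util)):
        window = util[max(0, j - hi):j]            # last `hi` actual utilizations
        pred = float(np.mean(window))              # u*_{j+1}
        state = int(np.argmin(np.abs(centroids - pred)))
        levels.append(state + 1)
    return levels

# Toy usage with the same trace and centroids as before
util = [0.9, 0.7, 0.4, 0.5, 0.85, 0.3]
print(ondemand_levels(util, centroids=np.linspace(0.1, 0.95, 6), hi=2))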

6 ALGORITHMS TIME/SPACE COMPLEXITY

As mentioned in Section 5.1, the VDP algorithm consists of two phases: in the first (forward) phase, shown in (11), the costs of state sequences on the trellis are computed, and in the second (backward) phase, shown in (13), the trellis is backtracked to produce the optimal state sequence. The backward phase takes constant time because backpointers are used to track the states with optimum cost in each interval. The VDP algorithm's forward phase computes the EDP |VFL|² times in every interval. Thus, for N intervals the time complexity is O(|VFL|² N). Since determining the minimal-cost path requires |VFL|² comparison pairs and |VFL|

Algorithm 4. Ondemand governor pseudocode

1. hi (history interval): number of past intervals.
2. hl(tk): stores the predicted V/F level for interval tk.
3. ūk: actual core utilization for interval k.
4. U (cumulative core utilization): stores core utilizations for hi intervals.
5. hl(t1) = L
6. for tj from t2 to tp do:
7.     U = Σ from k = j-hi to j of ūk   (ūk is the actual utilization based on hl(tk))
8.     ū*j+1 = U / hi
9.     s*l = the state sl ∈ S such that ū*j+1 ∈ sl
10.    if no such sl exists then:
11.        s*l = argmin over sl of |ū*j+1 - μ_sl|
12.    end if
13.    Find the V/F level l that corresponds to s*l and add it to hl(tk).
14. end for

comparison outcomes, b(sj), are stored for the states in every interval, VDP's space complexity is O(|VFL|² + |VFL|·N).

The greedy algorithm is a simpler version of VDP where the EDP is computed |VFL| times in every interval with respect to the state selected in the previous interval; hence its time complexity is O(|VFL|·N). The greedy algorithm's space complexity is O(|VFL|) since only |VFL| comparison pairs are evaluated in an interval.

Both the feedback controller's and the ondemand governor's time complexities are linear in the number of intervals, and their space complexities are constant. The history interval (hi) in the ondemand governor remains fixed during the algorithm's runtime, so it does not impact the space complexity.

VDP’s Implementation Efficiency

The VDP framework is a static (pre-runtime) V/F level assignment algorithm: the V/F levels are predicted by VDP at compile-time and put in a look-up table for access during runtime. The previous section explained that VDP's time complexity is polynomial and scales with the number of V/F levels and the number of intervals. The algorithm is efficiently implemented in MATLAB, with its worst-case execution time being less than 10 seconds.

7 EXPERIMENTAL SETUP

7.1 States Configuration

The following 6 V/F levels, VFL = [1, 6], are associated with the states defined based on the cores' energy/time or utilization: (0.5 V, 1.25 GHz), (0.6 V, 1.5 GHz), (0.7 V, 1.75 GHz), (0.8 V, 2.0 GHz), (0.9 V, 2.25 GHz), (1.0 V, 2.5 GHz). These V/F levels, which are equally spaced, lie within a nominal V/F level range, used by our power and performance simulators, over which the power and performance values have a linear relationship. Using a frequency higher than 2.5 GHz only provides marginal improvements in execution times in our study and thus is not considered. Using a voltage lower than 0.5 V is not supported by our power simulator's configuration setup since it increases the leakage power consumption.

The pseudocodes of the algorithms are presented in Section 5. The actual source codes are available here: https://gitlab.eecs.wsu.edu/Shervin/DVFS_Algorithms

We use [32] to compute the V/F level switching time/energy overheads, which are on the order of 0.5 microseconds/1 microjoule, respectively. These time/energy overheads, although very small and often negligible, are still accounted for in our algorithm evaluations. For the heuristics explained in Section 5, the state values are ordered in such a way that the first and last states, which correspond to the energy and time pairs or lookup table entries, are associated with the lowest and highest V/F levels, respectively.

7.2 Emulation Setup

GEM5 [33], an industry- and academic-standard emulation environment for fine-grain computer system evaluations, is used to assess and compare the performance of the proposed VDP and greedy methods against the ondemand and feedback control-based algorithms. Using the 65nm technology node, GEM5 simulates a full system of 64 homogeneous Alpha cores arranged in an 8×8 mesh topology. Ruby [33] is used as the memory model to provide private 64 KB L1 instruction and data caches and a shared 8 MB (128 KB per core) L2 cache. McPAT [34] is used to obtain per-core energy consumption in each interval.

7.3 Benchmarks

Our heuristics' energy efficiencies are measured by running five workloads chosen from the SPLASH-2 and PARSEC benchmark suites [35], [36]. These benchmark suites consist of realistic applications whose runtime characteristics vary from floating-point, computation-bound (e.g., FFT) to data-sharing, memory-bound (e.g., CANNEAL) benchmarks. Hence, the experimental results shown here are likely generalizable to applications with varying computational intensity. All benchmarks, whose number of execution phases (intervals) varies between 6 and 16, are run with large input sizes to better evaluate the effects of the considered V/F levels on per-core energy consumption and execution time across the intervals. Table 1 lists the problem size and application domain of these benchmarks.

8 PERFORMANCE EVALUATION

This section evaluates the energy efficiency of the algorithms discussed in Section 5 based on the system energy consumption and execution time tradeoffs. First, the algorithms' energy/time efficiencies are evaluated and compared using EDP as the performance measure. Then, we measure the extent to which the algorithms' average energy-time solutions are close to the corresponding optimal solutions, defined on the benchmarks' Pareto frontiers, and to a theoretical lower-bound optimal solution.

8.1 Energy-Delay Product (EDP) Comparison

Fig. 4 shows the algorithms' EDP outcomes. The EDPs are normalized to the no-DVFS EDP, where the task sets' energy consumptions and execution times are obtained at the highest V/F level, (1.0 V, 2.5 GHz). Thus, smaller normalized EDP values reflect better overall performance.

For each algorithm in Fig. 4, four bars are presented; the three that correspond to solving our optimization problem with different input configurations are discussed below. The fourth bar represents the simple average of the EDP values for the other three configurations.

For the VDP and greedy algorithms, we use E^n·D^m as the objective function for optimizing the energy and time parameters. Thus, the integer factors n and m are used to assign more weight to either energy or time, respectively. For example, m > n indicates that minimizing the execution time, perhaps at the cost of higher energy consumption, is more important to the user. The degree of importance of one parameter over the other is determined by the integer values assigned to m and n. Since m and n are exponents, it is not necessary to use large values for them, as the effect of larger values quickly diminishes beyond a factor of 2. Thus, in our studies we assign either 1 or 2 to m and n.

For the feedback controller, the input configuration w specifies the contribution of the prediction history over the past intervals relative to the core utilization in the current interval when predicting the core utilization for the next interval. For example, w < 0.5 indicates that the prediction history is weighted more heavily than the current core utilization.

The input configuration of the ondemand governor determines the number of past intervals (the history interval) considered to predict the next interval's core utilization (i.e., hi = 2 means predicting the core utilization based on the past two intervals).

Observations:

The following are some observations about the bar charts in Fig. 4.

VDP and Greedy
When considering equal weight for energy and time (m = n = 1), VDP provides better (lower) EDP compared to Greedy across all the benchmarks. Among the configurations, m < n (energy reduction is more important than time improvement) provides the lowest EDP for VDP and Greedy. In terms of the average EDP, VDP has superior performance compared to all the heuristics,

TABLE 1. BENCHMARKS

Benchmark | Application domain                               | Problem size
FFT       | Signal processing                                | 1,048,576 data points
CANNEAL   | Routing cost with simulated annealing            | 200,000 elements
RADIX     | Integer sort                                     | 4,194,304 integers, 1024 radix
LU        | Dense matrix computation                         | 1024×1024 matrix, 16×16 block
WATER     | Measure forces and potentials in water molecules | 8000 molecules


reflecting the fact that VDP performs more computations than the other heuristics to achieve its better performance (Section 6).

Feedback controller
The feedback controller generally achieves lower EDP when the prediction history outweighs the current core utilization (w < 0.5). Interestingly, for this configuration, the feedback controller's performance is comparable to the "average case" EDPs of VDP and Greedy. This indicates that the accumulation of trends in the prediction history is more effective than the current core utilization for predicting the core utilization (or V/F level) in the next interval. For the w < 0.5 and w > 0.5 configurations, this paper uses w = 0.2 and w = 0.8, respectively. Experimentation with other values below/above 0.5 provides outcomes similar to the weights mentioned above.

Ondemand governor
Fig. 4 shows that the ondemand governor's performance is steady for different history intervals (hi = 1, 2, and 3). This indicates that the most recent core utilization (hi = 1) provides sufficient information, with a smaller memory footprint, for predicting the next V/F levels. Of note, using hi = 1 in our ondemand governor is analogous to the real implementation of this governor in the Linux kernel. Compared to the other algorithms, the poorer performance of the governor is due to 1) its simple predictive model and 2) the variation in core utilization from one interval to the next, which makes core utilization a poor choice as a V/F level selection metric.

VDP vs. heuristics
To quantify the relative performance of VDP against the other three heuristics, Table 2 computes the ratio between the "average case" EDP obtained by each heuristic and that of VDP. As demonstrated in Table 2, on average, VDP's performance is 12%, 25%, and 75% better than Greedy, Feedback, and the Ondemand governor, respectively.

8.2 Optimality of the Outcomes

The previous section demonstrated the algorithms' relative performances under different configurations. For comparing our heuristics against state-of-the-art algorithms, we realized that it is not practical to consider all of those algorithms in our comparison study. Furthermore, a state-of-the-art algorithm may or may not improve the EDP at the same scale as others, which causes inconsistencies when comparing the performances of our heuristics to those of state-of-the-art algorithms. Instead, we measure the extent to which our heuristics' performances are close to the optimal solutions, which provides a reliable way of evaluating our heuristics independently of the optimality of other algorithms. This section presents a two-fold comparative study, which trades off the energy consumption and execution time solutions between our algorithms and the optimal solutions. First, the EDPs for the average energy-time solutions, discussed in Section 8.1, are compared to the equivalent optimal solutions on the Pareto frontier. Then, to find the overall performance of these algorithms, the distances of their solutions from a theoretical lower-bound solution in the energy-time search space are computed.

Fig. 5 demonstrates an example of our approach for comparing the average energy-time solutions, over the configurations discussed earlier in this section, provided by our DVFS algorithms and their corresponding optimal solutions. In this figure, the black markers represent the

Fig. 4. The algorithms' average energy-time solutions for (a) FFT, (b) CANNEAL, (c) RADIX, (d) LU, and (e) WATER. [Each panel shows the normalized EDP (relative to no-DVFS) for VDP and Greedy under m = 2n, m = n, m = 0.5n, and their average; Feedback under w < 0.5, w = 0.5, w > 0.5, and average; and Ondemand under hi = 1, 2, 3, and average.]

TABLE 2. PERCENTAGE OF AVERAGE-CASE NORMALIZED EDPS OVER AND ABOVE THE EDP OF THE AVERAGE VDP CASES

Algorithm | FFT | CAN | RAD | LU | WAT | Average
Greedy    |  12 |   7 |  12 | 22 |   5 | 12
Feedback  |  36 |  25 |  35 | 25 |   5 | 25
Ondemand  | 120 |  38 |  70 | 89 |  60 | 75


energy-time solutions for our heuristics. For an algorithm, for example VDP, its corresponding best solution is a point on the no-DVFS Pareto frontier that has the same performance degradation as VDP but saves the most energy (square white marker in Fig. 5). In this figure, the energy-time solution of the ondemand governor is shown for hi = 1 (i.e., the current core utilization is used for predicting the V/F levels), similar to the design of the ondemand governor used in the Linux kernel. Of note, similar comparisons between the algorithms and the optimal energy-time solutions can be performed when the optimal solutions are defined to be points on the Pareto frontier that have the best performance (fastest execution times) subject to consuming the same energy as the algorithms under consideration. Fig. 6 shows the algorithms' energy efficiencies when their average EDPs are normalized to their respective optimal solutions on the Pareto frontier curve for a given time constraint. For example, considering VDP in Fig. 5, for a 0.253-second execution delay, VDP yields 11.69 joules of energy usage. For the same execution time (0.253 seconds), the best energy efficiency we can possibly obtain is around 8.89 joules (on the Pareto curve). Thus, for LU, the EDP of the VDP solution is 18.97% worse than the optimal. Fig. 6 shows that, on average, VDP achieves the closest performance to its optimal solution (1.09x compared to optimal) among the algorithms studied here. This indicates that, for the same time penalty as the optimal solution, VDP obtains the best energy saving. In contrast, the ondemand governor shows the worst performance, degrading its optimal solution's EDP by almost 2x, which corroborates the ondemand governor's high EDP outcomes shown in Fig. 4.

Fig. 6 suggests that, for the LU benchmark, VDP clearly dominates the other algorithms. The reason is that LU, compared to the other benchmarks, exhibits more variation among the cores' utilizations. The dynamic programming nature of VDP leverages such variations to efficiently scale the cores' V/F levels across the intervals with respect to their workloads.

To quantify how close the algorithms' energy-time solutions are to optimal, we use a measure that indicates the percentage of similarity between an algorithm's EDP and the EDP of the theoretically best (ideal) solution, i.e., the case in which the application could theoretically run at the highest speed with the lowest energy consumption. The theoretical ideal solution is obtained by taking the least execution time and the least energy consumption from the two solutions that lie on the upper-left and lower-right corners of the Pareto frontier, respectively. For example, in Fig. 5 for LU, the best theoretical solution results in 0.162 seconds of execution time at a cost of 6.452 Joules.

S_{alg,best} = \frac{1}{1 + D(EDP_{alg}, EDP_{best})}    (14)

D(EDP_{alg}, EDP_{best}) = \sqrt{\sum_{c} \left( EDP_{alg}^{c} - EDP_{best} \right)^{2}}    (15)

The similarity measure in (14) and (15) is defined based on the Euclidean distance between the EDPs of the algorithms and the EDP of the corresponding best theoretical solution. Here, S_{alg,best} ∈ [0, 1] is the degree of similarity between an algorithm (alg) and the best solution (best). D(EDP_{alg}, EDP_{best}) denotes the Euclidean distance of the best solution's EDP (EDP_{best}) from the EDP of the algorithm (EDP_{alg}^{c}) whose energy-time solution is obtained using an input configuration c (e.g., w < 0.5, w = 0.5, and w > 0.5 for the feedback controller). As seen in (14), the similarity measure is inversely proportional to the Euclidean distance, indicating that an algorithm's energy efficiency is more similar (a higher percentage) to the best solution when the EDPs of the corresponding solutions have a lower distance.
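
For concreteness, (14) and (15) can be implemented in a few lines. The sketch below reflects our reading of the measure: the ideal EDP is formed from the least execution time and least energy on the Pareto frontier, and EDP_{alg}^{c} is one EDP per input configuration c. The variable names are ours, and the frontier mixes the two corner values quoted above with placeholder points.

import math

def ideal_edp(frontier):
    # Theoretical best (ideal) solution: combine the least execution time and
    # the least energy found on the Pareto frontier (its upper-left and
    # lower-right corners) and form their EDP.
    best_time = min(t for t, _ in frontier)
    best_energy = min(e for _, e in frontier)
    return best_time * best_energy

def similarity(edps_per_config, edp_best):
    # Eq. (15): Euclidean distance between the algorithm's EDPs (one value per
    # input configuration c) and the ideal EDP.
    distance = math.sqrt(sum((edp_c - edp_best) ** 2 for edp_c in edps_per_config))
    # Eq. (14): similarity lies in [0, 1] and shrinks as the distance grows.
    return 1.0 / (1.0 + distance)

# Hypothetical usage with placeholder values.
frontier_lu = [(0.162, 21.0), (0.20, 14.0), (0.25, 9.0), (0.32, 6.452)]
edp_best = ideal_edp(frontier_lu)          # 0.162 s * 6.452 J
vdp_edps = [1.1, 0.95, 1.2]                # e.g., one EDP per configuration
print(f"S_alg,best = {similarity(vdp_edps, edp_best):.3f}")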

Fig. 6. The algorithms' EDPs normalized to their respective optimal solutions for (a) FFT, (b) CANNEAL, (c) RADIX, (d) LU, and (e) WATER.


Fig. 5. Computing optimal solutions on LU's Pareto frontier, which correspond to the algorithms' outcomes.

[Fig. 5 plot: average energy usage (Joules) vs. average execution time (seconds) for LU, with points for VDP, Greedy, Feedback, and ondemand.]


In contrast, an algorithm has a lower percentage of similarity to the best solution when the EDPs of the corresponding solutions have a larger distance.

Table 3 shows the similarities of the algorithms' EDPs to the best-solution EDP, computed by (14). We observe that, on average, VDP is 3, 7, and 18% more similar to the best solution than the greedy, feedback controller, and ondemand heuristics, respectively. Furthermore, Table 3 suggests that, among the heuristics, greedy's similarity outcomes are the closest to VDP's. The degrees of similarity of the algorithms to the best solution shown in Table 3 match the relative energy efficiencies (EDPs) of these algorithms to one another as shown in Fig. 4 and Fig. 6.
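
The percentages quoted above follow directly from the average similarity values in Table 3. The short check below reproduces them under our reading, namely as differences of similarity degrees expressed in percentage points; the dictionary layout is ours.

# Check: average similarity degrees from Table 3 and their gaps to VDP.
avg_similarity = {"VDP": 0.537, "Greedy": 0.510, "Feedback": 0.466, "Ondemand": 0.356}

for name in ("Greedy", "Feedback", "Ondemand"):
    gap = (avg_similarity["VDP"] - avg_similarity[name]) * 100.0
    # Prints roughly 3, 7, and 18, matching the text.
    print(f"VDP is {gap:.0f}% more similar to the best solution than {name}")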

9 CONCLUSION AND FUTURE WORK

This paper proposes a Dynamic Programming (DP) framework, based on the Viterbi algorithm, to achieve fine-grain, per-core energy-time tradeoff analysis. This technique globally optimizes per-core V/F levels by minimizing the cores' energy consumption and execution times. The performance of the framework is compared to a faster version of the VDP algorithm (Greedy) and two other heuristic algorithms (Feedback and Ondemand). The EDP results show that, on average, our proposed VDP algorithm outperforms Greedy by 12%, Feedback by 25%, and Ondemand by 75%.

Furthermore, the results show that the VDP algorithm performs best when maximizing energy saving is as, or more, important than minimizing the execution time penalty. Considering the optimal solution corresponding to each algorithm on the Pareto frontier, the results show that VDP comes closest to its corresponding optimal solution among the algorithms studied.

As discussed in Section 4.2, the algorithms presented in this paper perform compile-time DVFS and are applicable to applications with a specific execution model (Fig. 1). Furthermore, these algorithms are applied to a system with a moderate number of homogeneous cores, as explained in Section 7.2. To further extend these algorithms, future work will address the following points: 1) studying applications with different execution models, 2) analyzing the scalability and energy efficiency of the algorithms with respect to larger system sizes, 3) integrating the algorithms with OS kernels to perform runtime DVFS, and 4) evaluating the performance of the algorithms on systems with heterogeneous cores.

REFERENCES

[1] V. W. Freeh et al., “Analyzing the Energy-Time Trade-Off in High-Performance Computing Applications,” IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 6, pp. 835-848, Jun. 2007.

[2] V. Spiliopoulos, S. Kaxiras, and G. Keramidas, “Green Governors: A Framework for Continuously Adaptive DVFS,” Green Computing Conference and Workshops (IGCC), 2011.

[3] Q. Deng, D. Meisner, A. Bhattacharjee, T. F. Wenisch, and R. Bianchini, “CoScale: Coordinating CPU and Memory System DVFS in Server Systems,” 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 143-154, Dec. 2012.

[4] N. Rizvandi, J. Taheri, and A. Y. Zomaya, “Some Observations on Optimal Frequency Selection in DVFS-based Energy Consumption Minimization,” Journal of Parallel and Distributed Computing, vol. 71, pp. 1154-1164, Aug. 2011.

[5] D. Li, B. R. de Supinski, M. Schulz, K. Cameron, and D. S. Nikolopoulos, “Hybrid MPI/OpenMP Power-Aware Computing,” IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1-12, Apr. 2010.

[6] J. Liu, Q. Zhuge, S. Gu, J. Hu, G. Zhu, and E. H. M. Sha, “Minimizing System Cost with Efficient Task Assignment on Heterogeneous Multicore Processors Considering Time Constraint,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 8, pp. 2101-2113, Aug. 2014.

[7] J. Cong and K. Gururaj, “Energy Efficient Multiprocessor Task Scheduling Under Input-dependent Variation,” Proceedings of the Conference on Design, Automation and Test in Europe, pp. 411-416, 2009.

[8] J. D. Booth, J. Kotra, H. Zhao, M. Kandemir, and P. Raghavan, “Phase Detection with Hidden Markov Models for DVFS on Many-Core Processors,” IEEE 35th International Conference on Distributed Computing Systems (ICDCS), pp. 185-195, 2015.

[9] C. Isci, G. Contreras, and M. Martonosi, “Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management,” 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), pp. 359-370, Dec. 2006.

[10] T. Lu, B. Shirazi, and P. Pande, “A Dynamic, Compiler Guided DVFS Mechanism to Achieve Energy-Efficiency in Multi-core Processors,” Sustainable Computing: Informatics and Systems (SUSCOM), vol. 12, pp. 1-9, Dec. 2016.

[11] Q. Wu et al., “Dynamic-Compiler-Driven Control for Microprocessor Energy and Performance,” IEEE Micro, vol. 26, no. 1, pp. 119-129, Jan. 2006.

[12] K. Tarplee, R. Friese, A. A. Maciejewski, H. J. Siegel, and E. K. P. Chong, “Energy and Makespan Tradeoffs in Heterogeneous Computing Systems Using Efficient Linear Programming Techniques,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, pp. 1633-1646, 2016.

[13] K. Chakraborty and S. Roy, “Topologically Homogeneous Power-Performance Heterogeneous Multicore Systems,” Design, Automation & Test in Europe, pp. 1-6, Mar. 2011.

[14] F. Kong, P. Tao, S. Yang, and X. Zhao, “Genetic Algorithm Based Idle Length Prediction Scheme for Dynamic Power Management,” Proceedings of the Multiconference on Computational Engineering in Systems Applications, pp. 1437-1443, Oct. 2006.

[15] I. Mansouri, C. Jalier, F. Clermidy, P. Benoit, and L. Torres, “Implementation Analysis of a Dynamic Energy Management Approach Inspired by Game-Theory,” IEEE Annual Symposium on VLSI, pp. 422-427, Jul. 2010.

[16] N. AbouGhazaleh et al., “Integrated CPU and L2 Cache Voltage Scaling Using Machine Learning,” Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 41-50, 2007.

[17] N. Ioannou, M. Kauschke, M. Gries, and M. Cintra, “Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer,” International Conference on Parallel Architectures and Compilation Techniques, pp. 131-142, Oct. 2011.

[18] Z. Lai, K. T. Lam, C. L. Wang, and J. Su, “PoweRock: Power Modeling and Flexible Dynamic Power Management for Many-Core Architectures,” IEEE Systems Journal, vol. 11, no. 2, pp. 600-612, Jun. 2017.

[19] R. David, P. Bogdan, and R. Marculescu, “Dynamic Power Management for Multicores: Case Study Using the Intel SCC,” IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 147-152, Oct. 2012.

TABLE 3. SIMILARITIES OF HEURISTICS’ EDPS TO THE BEST SOLUTION

EDP ACROSS BENCHMARKS

FFT CAN RAD LU WAT Average

VDP 0.549 0.541 0.547 0.529 0.520 0.537

Greedy 0.518 0.520 0.511 0.460 0.502 0.510

Feedback 0.439 0.461 0.481 0.450 0.500 0.466

Ondemand 0.307 0.427 0.367 0.320 0.358 0.356


[20] A. Leva, F. Terraneo, I. Giacomello, and W. Fornaciari, “Event-Based Power/Performance-Aware Thermal Management for High-Density Microprocessors,” IEEE Transactions on Control Systems Technology, vol. 99, pp. 1-16, 2017.

[21] R. G. Kim, W. Choi, Z. Chen, P. P. Pande, D. Marculescu, and R. Marculescu, “Wireless NoC and Dynamic VFI Codesign: Energy Efficiency Without Performance Penalty,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, pp. 2488-2501, 2016.

[22] A. Lukefahr et al., “Composite Cores: Pushing Heterogeneity Into a Core,” Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 317-328, 2012.

[23] R. E. Bellman, Dynamic Programming, Dover Publications, 2003, pp. 1-340.

[24] Niu, C. Liu, Y. Gao, and M. Qiu, “Energy Efficient Task Assignment with Guaranteed Probability Satisfying Timing Constraints for Embedded Systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 8, pp. 2043-2052, Aug. 2014.

[25] X. Zhong and C. Z. Xu, “System-Wide Energy Minimization for Real-Time Tasks: Lower Bound and Approximation,” IEEE/ACM International Conference on Computer Aided Design, pp. 516-521, Nov. 2006.

[26] H. Jung and M. Pedram, “Supervised Learning Based Power Management for Multicore Processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 9, pp. 1395-1408, Sep. 2010.

[27] K. Kang, J. Jung, S. Yoo, and C. M. Kyung, “Maximizing Throughput of Temperature-Constrained Multi-Core Systems with 3D-Stacked Cache Memory,” International Symposium on Quality Electronic Design, pp. 1-6, Mar. 2011.

[28] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72-81, 2008.

[29] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[30] K. E. G. Magnusson, J. Jaldén, P. M. Gilbert, and H. M. Blau, “Global Linking of Cell Tracks Using the Viterbi Algorithm,” IEEE Transactions on Medical Imaging, vol. 34, no. 4, pp. 911-929, Apr. 2015.

[31] itl.nist.gov, EWMA Control Charts, http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm (accessed 20 August 2017).

[32] J. Murray, T. Lu, P. Wettin, P. Pande, and B. Shirazi, “Dual-Level DVFS-Enabled Millimeter-Wave Wireless NoC Architectures,” J. Emerg. Technol. Comput. Syst., vol. 10, no. 4, Jun. 2014.

[33] N. Binkert et al., “The GEM5 Simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, Aug. 2011.

[34] S. Li, J. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” Proceedings of the International Symposium on Microarchitecture, pp. 469-480, Dec. 2009.

[35] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” ISCA, vol. 23, no. 2, pp. 24-36, May 1995.

[36] C. Bienia, Benchmarking Modern Multiprocessors, Ph.D. dissertation, Princeton University, Princeton, NJ, Jan. 2011.

[37] H. Kimura, M. Sato, Y. Hotta, T. Boku, and D. Takahashi, “Empirical Study on Reducing Energy of Parallel Programs Using Slack Reclamation by DVFS in a Power-Scalable High Performance Cluster,” IEEE International Conference on Cluster Computing, pp. 1-10, 2006.

[38] G. D. Costa and J. Pierson, “DVFS Governor for HPC: Higher, Faster, Greener,” Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 533-540, 2015.

[39] S. Hajiamini, B. Shirazi, A. Crandall, H. Ghasemzadeh, and C. Cain, “Impact of Cache Voltage Scaling on Energy-Time Pareto Frontier in Multicore Systems,” Elsevier SUSCOM, pp. 54-65, 2018.

[40] D. Brodowski, “Linux CPUFreq Governors,” https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt, Apr. 2017 (accessed 25 Nov. 2018).

Shervin Hajiamini is a PhD candidate in Computer Science at Washington State University. His research interest is energy-efficient design of optimization techniques for partitioned multicore systems. He is a student member of the IEEE.

Behrooz A. Shirazi is a Professor and the Director of the Community Health Analytics Project at Washington State University (WSU). He served as the Director of the School of Electrical Engineering and Computer Science at WSU from 2005 to 2016. Dr. Shirazi has conducted research in the areas of pervasive computing, health analytics, green/energy-efficient computing, and high-performance computing over the recent years. He is currently serving as the Editor-in-Chief for the Sustainable Computing: Informatics and Systems journal. He is the principal founder of the IEEE Symposium on Parallel and Distributed Processing (later joined with IPPS to form IPDPS); a co-founder of the IEEE International Conference on Pervasive Computing and Communications (PerCom); and a co-founder of the International Green and Sustainable Computing Conference (IGSC).

Hassan Ghasemzadeh received the B.Sc. degree from Sharif University of Technology, Tehran, Iran, the M.Sc. from the University of Tehran, Tehran, Iran, and the Ph.D. from the University of Texas at Dallas, Richardson, TX, in 1998, 2001, and 2010, respectively, all in Computer Engineering. He was on the faculty of Azad University from 2003 to 2006, where he served as Founding Chair of the Computer Science and Engineering Department at the Damavand branch, Tehran, Iran. He spent the academic year 2010-2011 as a Postdoctoral Fellow at the West Wireless Health Institute, La Jolla, CA. He was a Research Manager at the UCLA Wireless Health Institute from 2011 to 2013. Currently, he is an Assistant Professor in Electrical Engineering and Computer Science at Washington State University, Pullman, WA. The focus of his research is on algorithm design and system-level optimization of embedded and pervasive systems with applications in healthcare and wellness.

Aaron Crandall is a Clinical Associate Professor of computer science at Washington State University's (WSU) School of Electrical Engineering & Computer Science. He is a founding member of the Center for Advanced Studies in Adaptive Systems (WSU CASAS). Dr. Crandall has conducted research on artificial intelligence and human factors in gerontechnology and senior care technologies. He serves on several smart home / intelligent environment journals, such as Sensors, Pervasive and Mobile Computing, Intelligent Environments, and Digital Health Summit. Dr. Crandall has founded several technology startups and leads undergraduate STEM programs with grants from the US National Institutes of Health (NIH), Naval Sea Systems Command (NAVSEA), and the National Aeronautics and Space Administration (NASA).