
Cluster Comput (2016) 19:615–631
DOI 10.1007/s10586-016-0544-2

Power-aware Bag-of-Tasks scheduling on heterogeneous platforms

George Terzopoulos1 · Helen D. Karatza1

Received: 14 December 2014 / Revised: 31 January 2016 / Accepted: 1 February 2016 / Published online: 23 February 2016
© Springer Science+Business Media New York 2016

Abstract Energy preservation is very important nowadays. A large number of applications in science, engineering, astronomy and business analytics are classified as Bag-of-Tasks (BoT) applications. A BoT is a collection of independent tasks that do not communicate with each other during execution. BoT scheduling has been extensively studied from a performance point of view. In this paper, we address the problem of energy-efficient BoT scheduling in a heterogeneous environment with the twin objectives of minimizing finish time and energy consumption. Specifically, we extend two performance-oriented scheduling policies, Min–Min and Max–Min, and propose power-aware centralized scheduling policies that incorporate a dynamic voltage/frequency scaling mechanism and can power unneeded computing nodes of a heterogeneous cluster on and off using dynamic power management. Additionally, to evaluate the system under a more realistic workload, high-priority tasks with and without time constraints are also submitted. A series of simulation experiments shows that we can achieve significant energy savings without significantly affecting the execution of BoTs and high-priority tasks. Additional experiments on a real system also confirmed the effectiveness of our policies.

Keywords Energy · Simulation · Cluster · Heterogeneous · Bag-of-Tasks · DVFS · DPM


1 Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

1 Introduction

Energy consumption of large-scale systems has become a significant concern for several reasons. It was estimated that data centers would receive a $7.4 billion electric bill for 2011 [1]. Additionally, computer usage accounts for 2% of anthropogenic CO2 emissions. As stated by Google, a Google search may generate about 0.2 g of carbon dioxide, and in 2012 there were 1.2 trillion Google searches [2]. Furthermore, cooling equipment can account for up to 50% of the total energy consumption in some commercial servers [3]. High energy consumption leads to higher temperatures and a higher failure rate of components. Commodity components fail at an annual rate of 2–3% [4], thus a petaflop system of about 12,000 nodes will sustain hardware failures once every 24 h. All of the above points to the necessity of finding ways to reduce energy consumption in large-scale systems.

As a response to global concerns about energy conservation and environmental control, several leading organizations in the computer industry supported a common standard for the configuration interface between computer hardware and software. The advanced configuration and power interface (ACPI) is a power management and configuration standard developed by Intel, Microsoft and Toshiba in 1999. ACPI allows the operating system to control the amount of power each device is given. Dynamic voltage/frequency scaling (DVFS) and dynamic power management (DPM) are the two most popular techniques for dynamically reducing energy consumption. While DVFS dynamically scales the supply voltage/frequency level of the device, DPM performs selective shutdown of system components. All modern processors support DVFS, and their clock frequency can be decreased. DVFS reduces power consumption due to the quadratic relation between power consumption and the operating voltage of a CMOS circuit. Through ACPI, the operating system can use DVFS to control the operating frequency of the system's processors and lower the clock speed during times when applications do not require the full processor clock speed. ACPI defines the CPU performance states (P-states) as the capability of a processor to switch between different supported operating frequencies. The number of P-states is processor specific. P0 is the highest performance state, with P1 to Pn being successively lower performance states, up to an implementation-specific limit of n no greater than 16. Higher P-state numbers represent slower processor speeds and lower power consumption. A processor in the P2 state will run slower and use less power than a processor running in the P1 state. An Intel Pentium M processor, for example, has 6 performance levels with frequencies ranging from 600 to 1600 MHz and power consumption from 6 to 25 W.

Additionally, with ACPI, the operating system can use DPM to reduce motherboard, processor and peripheral device power needs by not activating devices until they are needed. Although ACPI supports low-power sleep states in which a device saves the most energy since it is almost shut down, there is a wake-up latency trade-off before the device is operational again. Usually, components are turned off after a fixed amount of idle time, while more advanced methods try to predict the expected idle time using past data in order to estimate the future status of a system. Modern processors can benefit from DPM since they support several sleep power modes called C-states, starting from C0. The higher the C number, the deeper the CPU sleep mode: the CPU consumes less energy, but it needs more time to wake up and become 100% operational.

Bag-of-Tasks (BoT) applications are very common in large-scale systems. A BoT is a parallel application whose tasks are independent and do not need to communicate with each other during execution. Each task of a BoT application can have different computational needs. Examples of BoT applications include computer imaging, computational biology, parameter sweeps, fractal calculations, data mining and Monte Carlo simulations. The significance of these applications is very high. It is estimated that up to 96% of CPU time in grids and up to 70% in parallel systems is consumed by BoT applications [5].

Nowadays, heterogeneous computing architectures are gaining researchers' attention. Data center deployments can be heterogeneous due to upgrade cycles, replacement of failed components or by design. For mobile devices, a popular heterogeneous computing architecture is ARM's big.LITTLE architecture. ARM big.LITTLE processing is a power-optimization technology where high-performance CPUs are combined with more efficient CPUs to deliver peak performance capacity and higher sustained performance at significantly lower average power [6]. ARM claims that this architecture can save 75% of CPU energy in low to moderate performance scenarios, and can increase performance by 40% in highly threaded workloads. Big.LITTLE technology takes advantage of the fact that the usage pattern for smartphones and tablets is dynamic. Periods of high-processing-intensity tasks alternate with typically longer periods of low-processing-intensity tasks. The same applies to large-scale systems, since there are periods of peak load and periods of low load.

In this paper, we propose energy-efficient scheduling for BoT applications on a heterogeneous power-aware cluster system that utilizes DVFS-capable processors. A heterogeneous cluster is composed of computing elements of different architecture, performance and energy consumption. Specifically, the cluster consists of high-performance processors and low-power processors. Scheduling in such an environment must balance performance and energy consumption. Extensive simulation experiments and tests on a real system are used to evaluate the proposed scheduling policies.

The rest of this paper is organized as follows. In Sect. 2, related work and the contribution of our work are described, while Sect. 3 contains information about the workload submitted to the system. An overview of the proposed scheduling policies and energy-saving mechanisms is provided in Sect. 4. In Sect. 5, the system and power model used for the simulation experiments are presented, while in Sect. 6, simulation results are presented. In Sect. 7, experiments on a real hardware platform and their results are described. Finally, a summary, conclusions and future directions are discussed in Sect. 8.

2 Related work

Energy efficiency is a topic much investigated at the moment, and researchers are studying different environments, covering various aspects and workloads [7–10]. DVFS is a well-known technique, usually applied when independent real-time tasks are executed [11–13]. These tasks have a deadline constraint, which expresses that a task is required to compute its results within some deadline. Real-time tasks can be classified as either hard or soft depending on the consequences of a task missing its deadline. A hard real-time task must produce its results within certain predefined time bounds, or else its results are worthless and the consequences may even be catastrophic to the system. Systems executing hard real-time tasks are typically safety-critical. Soft real-time tasks also have time bounds associated with them, although they are less critical and the usefulness of their results degrades after their deadline.

Additionally, other studies use DPM techniques in order to save energy. DPM performs selective shutdown of system components that are idle or underutilized. However, employing such techniques also incurs a performance loss due to the overhead incurred when components wake up from sleep mode. An effective DPM policy must predict a component's next idle period and deactivate the component. DVFS and DPM are usually applied together for maximizing energy savings [11,14,15].

BoT applications are parallel applications whose tasks are independent of each other. Scheduling of these applications is a topic that has been studied in various computing environments such as grids and clusters. BoT scheduling in a grid environment is studied in [16–19]. Due to the dynamic nature of grids, information about the whole system typically changes over time and resources are heterogeneous. Resources are provisioned onto and removed from the grid on an ongoing basis. While resources come and go on the grid, clusters typically contain a static number of processors and resources. Additionally, clusters are usually physically contained in the same complex in a single location and have extremely low network latency. BoT scheduling in a cluster environment is studied in [20–22].

In the literature, many applications are modelled as a directed acyclic graph (DAG). Each node in a DAG represents a task and the edges represent dependencies between the tasks that constrain the order in which tasks are executed. Energy-aware DAG scheduling has received much attention from the research community in recent years [23,24]. BoTs and DAGs are entirely different and represent different types of applications. The main difference is that BoTs have no dependencies, while DAGs have dependencies and hence significant communication overhead in order to keep track of the dependencies.

In BoT scheduling, most works focus on the computational demand of BoTs [25], while others also consider communication overheads [26,27]. Some research efforts have addressed different aspects of BoT applications, like cost. For example, in [28] the authors propose a budget-constrained scheduler, BaTS, that schedules BoTs so that the completion time is minimized for a given budget. Energy-efficient BoT scheduling has recently gained researchers' attention. Power-aware scheduling of BoTs with real-time requirements is studied in [22], where a homogeneous DVFS-enabled cluster executing BoTs is considered. A homogeneous cluster consists of processing elements (PEs) that provide identical processing performance and power consumption. In [21], the authors propose a power-aware scheduling algorithm for homogeneous clusters where virtual machines are dynamically provided for executing cluster jobs. Their goal is to minimize processor power dissipation by scaling down processor frequencies without drastically increasing the overall virtual machine execution time. A BoT power-aware scheduling policy for heterogeneous clusters is studied in [20]. The authors assume that the power consumption of a machine is a multiple of its execution rate. In their case, the faster the machine, the more energy it consumes, which is an oversimplified assumption. Most of their experiments do not consider DVFS, and machines can only be in a busy state or in an idle state.

In scheduling, most decentralized approaches have each computing node obtain and maintain only partial information locally, so that decisions can be made without the need to maintain complete and accurate information about the whole system. For large-scale platforms, like grids, a decentralized approach is more suitable since resource availability changes over time. For smaller platforms like clusters, though, it may be more realistic to make decisions based upon complete and reliable knowledge of the system's status. A decentralized solution for energy-efficient BoT scheduling is proposed in a previous work [29]. Specifically, a DVFS mechanism called adaptive DVFS (AD-DVFS) is proposed in order to save energy during the execution of BoT applications and high-priority tasks. This mechanism was applied by each computing node. AD-DVFS chooses the processor speed depending on the completion percentage of the BoT. The basic principle of this mechanism is that if none of the BoT's subtasks have completed their execution, then a task can be executed at very low speeds. Conversely, when the completion percentage of the BoT is high, a subtask should be executed at a high speed in order not to delay the entire BoT. This mechanism managed to save up to 13% energy with small performance degradation in a heterogeneous environment where BoTs and high-priority tasks are executed. In that work, traditional performance-oriented scheduling policies such as Min–Min and Max–Min, proposed in [30], were applied by the central resource scheduler. These policies are extensively used in the literature for BoT scheduling. Min–Min and Max–Min assign tasks to processing elements based on the earliest completion time (ECT). Due to the heterogeneity of the system, the completion time of a task differs from machine to machine. Min–Min gives priority to the task with the minimum ECT, while Max–Min gives priority to the task with the maximum ECT. With Min–Min, small tasks are scheduled first, in contrast to Max–Min, where large tasks are scheduled first. These performance-oriented algorithms mainly focus on the makespan of a BoT application, i.e. the total length of the schedule, and not on energy efficiency.

While this decentralized version proved to be effective and saved up to 13% energy, in this paper we propose four centralized scheduling policies that extend Min–Min and Max–Min and take into account the cluster's heterogeneity and the energy consumption of computing nodes. These policies are best suited for smaller platforms like clusters, where the central resource scheduler has complete knowledge of the system's status. The policies use DVFS in order to scale down processors' speed and a DPM mechanism in order to deactivate computing nodes when they are not needed and reactivate them when they are. The main contributions of this paper are the following:

• We propose centralized energy-efficient algorithms for BoT scheduling that extend the performance-oriented policies Min–Min and Max–Min and incorporate DVFS. Additionally, a DPM mechanism is applied in order to power unneeded computing nodes on and off for energy preservation. The importance of BoT tasks lies in the fact that they are dominant in most systems. Although energy efficiency is a hot topic nowadays, to the best of our knowledge there are no studies in the literature that extend the Min–Min and Max–Min algorithms with DVFS and DPM mechanisms in order to schedule BoT applications in a more energy-efficient way.

• A cluster environment is selected since the proposed scheduling policies are centralized and it is assumed that the central resource scheduler makes decisions based upon complete and reliable knowledge of the system's status. Clusters are often studied in the literature [3,12,20–22] since they can be used as underlying basic components for other environments such as grids and clouds. Since most real-world clusters consist of machines with different characteristics, in terms of both performance and energy consumption, we focus on heterogeneous clusters. The system consists of high-performance processors and low-power processors. A scheduling policy can take advantage of this heterogeneity and achieve high performance when needed and low power consumption when performance is not needed.

• The proposed algorithms are evaluated with simulation on two different systems with varying heterogeneity (low and high). The system with low heterogeneity consists of two types of processors, while the system with high heterogeneity consists of three types of processors. Furthermore, the proposed policies are tested on a heterogeneous real hardware platform in order to verify the simulation results.

• The workload submitted to the system consists of BoTs and high-priority tasks, as in our previous work [29], while other studies assume only BoTs. Furthermore, various ratios between BoTs and high-priority tasks are considered. This mixed workload represents real workloads more accurately, since only a fraction of the jobs submitted to large-scale distributed systems are BoTs.

• Compared to our previous work [29], where we studied a decentralized DVFS mechanism combined with traditional performance-oriented scheduling with Min–Min and Max–Min, in this paper we propose four centralized scheduling policies that extend Min–Min and Max–Min using energy efficiency criteria. While our previous work can be applied to systems where the central resource scheduler does not have complete knowledge of the system's status, such as grids, the energy-efficient scheduling policies proposed in this paper can be applied to systems like clusters, where the scheduler is aware of the system's status.

3 Workload

The submitted workload consists of two types of tasks: BoTs and high-priority tasks. Depending on the number of subtasks, BoTs can be small or large. High-priority tasks can be either real-time tasks or tasks without time constraints. We study both types separately. In the first case, we measure the deadline miss ratio for real-time tasks, while in the second case we study the total time that tasks spend in the system. We assume that all tasks submitted are CPU-bound, which facilitates the calculation of the energy consumption.

The Pareto distribution [31] is used in this study for modeling BoT arrivals. This distribution is a skewed, heavy-tailed distribution that is used to describe social, scientific, geophysical, actuarial and many other types of observable phenomena. The authors in [5] argue, after studying a series of real data traces, that the Pareto distribution should be used for modeling BoT arrivals, hence we assume that our workload is based on real data. They also concluded that a fraction of jobs (40–89%) submitted to large-scale systems are BoTs. Because the Pareto distribution is heavy-tailed, the arrivals of BoTs will be bursty: there will be many small values for the inter-arrival time of tasks and only a few large values. Since only a fraction of the tasks submitted to large-scale systems are BoTs, we consider that the load also consists of high-priority tasks. Exploiting the conclusions from [5], in our experiments various ratios between BoTs and high-priority tasks are considered (60–40, 50–50 and 40–60%).
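As an illustration of the bursty arrival pattern described above, the following minimal sketch draws Pareto-distributed inter-arrival times by inverse-transform sampling. The function names and the shape, scale and seed values are assumptions chosen for illustration only; they are not the parameters used in this study.

```python
import random

def pareto_interarrival(rng, shape, scale):
    """Draw one Pareto(shape, scale) inter-arrival time by inverse-transform sampling."""
    u = 1.0 - rng.random()              # u lies in (0, 1], so the division below is safe
    return scale / (u ** (1.0 / shape))

def bot_arrival_times(n, shape=1.5, scale=1.0, seed=42):
    """Return n cumulative BoT arrival times.

    shape, scale and seed are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    clock, arrivals = 0.0, []
    for _ in range(n):
        clock += pareto_interarrival(rng, shape, scale)
        arrivals.append(clock)
    return arrivals
```

Most sampled gaps lie close to the scale parameter, while occasional very large gaps appear, which is the heavy-tailed behaviour the text refers to.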

4 Scheduling policies and energy-saving mechanisms

BoTs are scheduled according to four energy-efficient policies that are based on and extend the performance-oriented Min–Min and Max–Min policies. Min–Min and Max–Min focus on the total length of the schedule and do not take energy consumption into account. Min–Min begins by scheduling the task that changes the expected machine ready time status by the least amount, while Max–Min first assigns the task that has the maximum ECT. These two policies are often used in studies as a reference, or in order to compare results [17,18,25]. Our proposed policies work in two phases: in Phase 1, each task of a BoT is assigned to a processor, while in Phase 2, the P-states at which tasks will be executed are determined. The basic principle of our proposed policies is that each task is scheduled and executed at a processor speed that would not delay the execution of the rest of the BoT's subtasks.

Two of the proposed policies are greedy policies and try to save as much energy as possible. These two policies find the subtask that will finish its execution last and execute the rest of the tasks at lower speeds by using DVFS. The two greedy policies are presented in Algorithms 1 and 2.

Since slowing the execution of a BoT will affect the tasks waiting in processors' queues and future arriving tasks, we also propose two more conservative policies that execute a BoT's subtasks based on the average completion time of the BoT. The conservative policies are presented in Algorithms 3 and 4.

Algorithm 1. Greedy Energy-Efficient Min-Min (GrEMinMin) Policy
Phase 1
1. For each task of a BoT determine its earliest completion time (ECT) over all computing nodes
2. Over all tasks, find the task with the minimum ECT and assign the task to the computing node that gives this completion time
3. Re-calculate the ECT over all computing nodes for the rest of the tasks
4. Iterate until all tasks are scheduled
Phase 2
1. Find the highest value of ECT (HECT) in the BoT
2. Select the lowest P-state of each processor, so that each task of a BoT completes its execution before HECT

Algorithm 2. Greedy Energy-Efficient Max-Min (GrEMaxMin) Policy
Phase 1
1. For each task of a BoT determine its earliest completion time (ECT) over all computing nodes
2. Over all tasks, find the task with the maximum ECT and assign the task to the computing node that gives this completion time
3. Re-calculate the ECT over all computing nodes for the rest of the tasks
4. Iterate until all tasks are scheduled
Phase 2
1. Find the highest value of ECT (HECT) of the BoT
2. Select the lowest P-state of each processor, so that each task of a BoT completes its execution before HECT

Algorithm 3. Conservative Energy-Efficient Min-Min (ConEMinMin) Policy
Phase 1
1. For each task of a BoT determine its earliest completion time (ECT) over all computing nodes
2. Over all tasks, find the task with the minimum ECT and assign the task to the computing node that gives this completion time
3. Re-calculate the ECT over all computing nodes for the rest of the tasks
4. Iterate until all tasks are scheduled
Phase 2
1. Find the average value of ECT (AECT) of the BoT
2. Select the lowest P-state of each processor, so that each task of a BoT completes its execution before AECT

Algorithm 4. Conservative Energy-Efficient Max-Min (ConEMaxMin) Policy
Phase 1
1. For each task of a BoT determine its earliest completion time (ECT) over all computing nodes
2. Over all tasks, find the task with the maximum ECT and assign the task to the computing node that gives this completion time
3. Re-calculate the ECT over all computing nodes for the rest of the tasks
4. Iterate until all tasks are scheduled
Phase 2
1. Find the average value of ECT (AECT) of the BoT
2. Select the lowest P-state of each processor, so that each task of a BoT completes its execution before AECT
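Phase 1, which is common to Algorithms 1–4, can be sketched as follows. This is a minimal illustration and not the simulator used in this study: the function name, the input format and the assumption that all nodes start with empty queues are ours, and ties are broken by task index purely for determinism.

```python
def schedule_phase1(service_times, node_rates, pick="min"):
    """Assign every task of one BoT to a node (Phase 1 of Algorithms 1-4).

    service_times: task service demands at service rate 1.
    node_rates:    service rate of each node at P0 (hypothetical input format).
    pick:          "min" for Min-Min ordering, "max" for Max-Min ordering.
    Returns a list of (task_index, node_index, completion_time) in scheduling order.
    """
    ready = [0.0] * len(node_rates)          # assumption: all nodes start idle
    unscheduled = set(range(len(service_times)))
    plan = []
    while unscheduled:                        # Step 4: iterate until all tasks are scheduled
        # Steps 1 and 3: (re)compute the ECT of each remaining task over all nodes.
        ect = {}
        for t in unscheduled:
            ect[t] = min((ready[n] + service_times[t] / node_rates[n], n)
                         for n in range(len(node_rates)))
        # Step 2: choose the task with the minimum (or maximum) ECT.
        chooser = min if pick == "min" else max
        task = chooser(unscheduled, key=lambda t: (ect[t][0], t))
        finish, node = ect[task]
        ready[node] = finish                  # the chosen node is busy until `finish`
        unscheduled.remove(task)
        plan.append((task, node, finish))
    return plan
```

For three tasks with service times 1, 2 and 4 on two nodes with rates 1 and 0.5, the Min–Min ordering places all tasks on the fast node and finishes at time 7, whereas the Max–Min ordering schedules the largest task first and finishes at time 5.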

High-priority tasks are scheduled according to the join the shortest queue (JSQ) policy and are always executed at the highest P-state, P0. The JSQ policy is presented in Algorithm 5.

Algorithm 5. JSQ Policy
1. Find the processor with the minimum number of tasks in the queue
2. If more than one processor from Step 1 has the minimum number of tasks, select one of them randomly, else select the processor from Step 1
3. Schedule the task to the selected processor
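The steps of Algorithm 5 can be sketched in a few lines. The function name and the representation of the system as a list of queue lengths are assumptions made for illustration.

```python
import random

def join_shortest_queue(queue_lengths, rng=random):
    """Return the index of the processor that receives the next high-priority task."""
    shortest = min(queue_lengths)                                   # Step 1
    candidates = [i for i, q in enumerate(queue_lengths) if q == shortest]
    return rng.choice(candidates)                                   # Step 2: random tie-break
```

The caller then appends the task to the queue of the returned processor (Step 3).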

A DPM mechanism proposed in a previous work [15] is applied. This mechanism is only applied dynamically when the load of the system is low, since it would not be efficient to power off an idle processor when the load of the system is relatively high and all the other processors are busy. In general, shutting down processors may save energy, but the other active processors of the system take on the extra workload and work at higher speeds. If there are idle processors in a cluster, some of them can be put into a low-power sleep state.

An example of how the GrEMinMin and ConEMinMin policies operate compared to Min–Min when a BoT consisting of 7 tasks is executed is shown in Table 1. We consider that processors in this example have 10 P-states with service rates varying from 0.1 (P9) to 1 (P0) and that P-states are evenly distributed. For example, the service rate at P1 is 0.9 and at P2 is 0.8. The tasks' service times are [1, 1.4, 1.8, 2, 1.6, 1.2, 1.5]. HECT is the maximum value of [1, 1.4, 1.8, 2, 1.6, 1.2, 1.5], i.e. 2, while AECT is the average value, i.e. 1.5. According to Min–Min, all tasks are executed at P0. It is apparent that GrEMinMin executes most tasks at lower P-states than ConEMinMin.

Table 1 Example of GrEMinMin and ConEMinMin compared to Min–Min

| Task | Min–Min: execution time (P-state) | ECT | GrEMinMin (HECT = 2): execution time (P-state) | ConEMinMin (AECT = 1.5): execution time (P-state) |
|---|---|---|---|---|
| Task 1 | 1 (P0) | 1 | 2 (P5) | 1.43 (P3) |
| Task 2 | 1.4 (P0) | 1.4 | 2 (P3) | 1.4 (P0) |
| Task 3 | 1.8 (P0) | 1.8 | 2 (P1) | 1.8 (P0) |
| Task 4 | 2 (P0) | 2 | 2 (P0) | 2 (P0) |
| Task 5 | 1.6 (P0) | 1.6 | 2 (P2) | 1.6 (P0) |
| Task 6 | 1.2 (P0) | 1.2 | 2 (P4) | 1.5 (P2) |
| Task 7 | 1.5 (P0) | 1.5 | 1.88 (P2) | 1.5 (P0) |
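The Phase 2 selection behind Table 1 can be reproduced with the short sketch below. It assumes, as in the example, 10 evenly spaced P-states, and it treats a task that cannot meet the target even at P0 as staying at P0, which is our reading of the ConEMinMin column. The function names and the small floating-point tolerance are ours.

```python
EPS = 1e-9  # tolerance for floating-point comparisons

def service_rate(p_state, n_states=10):
    """Evenly spaced service rates as in the example: P0 -> 1.0, ..., P9 -> 0.1."""
    return (n_states - p_state) / n_states

def select_p_states(ect, target, n_states=10):
    """For each task, pick the slowest P-state that still finishes by `target`.

    Tasks that cannot meet `target` at any slower state stay at P0.
    Returns a list of (p_state, execution_time) pairs.
    """
    result = []
    for t in ect:
        chosen = 0
        for p in range(n_states - 1, 0, -1):        # try P9 first, then P8, ...
            if t / service_rate(p, n_states) <= target + EPS:
                chosen = p
                break
        result.append((chosen, t / service_rate(chosen, n_states)))
    return result

ect = [1, 1.4, 1.8, 2, 1.6, 1.2, 1.5]     # task service times from the example
hect = max(ect)                            # greedy target (GrEMinMin)
aect = sum(ect) / len(ect)                 # conservative target (ConEMinMin)
greedy = select_p_states(ect, hect)
conservative = select_p_states(ect, aect)
```

Under these assumptions, `greedy` selects P5, P3, P1, P0, P2, P4 and P2 for Tasks 1 to 7, and `conservative` selects P3 for Task 1, P2 for Task 6 and P0 for the remaining tasks, in agreement with Table 1.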

5 Simulation setup

For the simulation experiments we consider that the heterogeneous cluster system consists of 24 computing nodes and that each node is equipped with one processor. Each computing node maintains a local queue in which tasks queue up for execution and a voltage regulator that adjusts the processor's operating frequency based on a DVFS mechanism. Tasks submitted by users arrive at a central resource scheduler and from there they are dispatched to the computing nodes of the system based on a scheduling policy. It must be noted that the scheduler has complete knowledge about the tasks that have arrived, but not about future tasks which have not yet arrived. The scheduling policies proposed in this study also dictate the P-state of the processor that will execute each task. A general view of the system used is depicted in Fig. 1. Two variations of the system are studied through simulation. The first variation has high heterogeneity and consists of 3 types of processors:

• Eight high-performance and high-power processors. High-end servers belong to this category. The specifications of an AMD Opteron 2.6 GHz processor are considered.

• Eight low-performance and low-power processors. Mobile processors belong to this category. The specifications of an Intel® Pentium® M 1.6 GHz processor are considered.

• Eight medium-performance and low-power processors. Recent low-power processors achieve medium performance compared to high-end servers, with lower power consumption. The specifications of a VIA C7-M 2 GHz processor are considered.

The second variation of the system has low heterogeneity and consists of two types of processors:

• Twelve high-performance and high-power AMD Opteron 2.6 GHz processors.

• Twelve low-performance and low-power Intel® Pentium® M 1.6 GHz processors.

Fig. 1 System model

In general, two components determine the power consumption in a CMOS circuit: static and dynamic power consumption. CMOS components such as processors have very low static power consumption, which is the result of leakage current. Since the leakage power is not affected by the clock rate of processors and the workload applied to the system, it is usually not taken into account by most early research on power-aware computing. The dynamic power consumption is dominant in our case and is given by Eq. 1:

$$P_{dynamic} = C_{eff}\, f\, V_{dd}^{2} \qquad (1)$$

where $C_{eff}$ is the effective switching capacitance, $f$ the operating frequency and $V_{dd}$ the supply voltage [32]. The importance of the DVFS technique is apparent from Eq. 1, since lowering the CPU clock frequency and the supply voltage will result in lower system energy consumption due to the quadratic relation between the CPU power consumption and the voltage.
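To make the effect of Eq. 1 concrete, the sketch below compares two hypothetical operating points. The normalized values and the assumption that the supply voltage is lowered in the same proportion as the frequency are ours and are used for illustration only.

```python
def dynamic_power(c_eff, freq, v_dd):
    """Eq. 1: P_dynamic = C_eff * f * V_dd^2."""
    return c_eff * freq * v_dd ** 2

# Hypothetical, normalized operating points: frequency and voltage both scaled to 80%.
full = dynamic_power(c_eff=1.0, freq=1.0, v_dd=1.0)
scaled = dynamic_power(c_eff=1.0, freq=0.8, v_dd=0.8)
ratio = scaled / full    # about 0.512, i.e. 0.8 ** 3
```

Under this assumption, a 20% reduction in speed roughly halves the dynamic power, which is why running non-critical tasks at lower P-states can pay off.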

The total amount of energy required to execute all tasks is calculated by examining the time that each processor spends in each P-state, hence the total energy is given by Eq. 2:

$$E = P \cdot t \qquad (2)$$
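One way to read Eq. 2 is to apply it per P-state and add up the contributions of a processor. The sketch below does this for the AMD Opteron power values listed in Table 2; the function name, the residency times and the use of seconds as the time unit are hypothetical.

```python
# Power draw in watts per state for the AMD Opteron, taken from Table 2.
OPTERON_WATTS = {"P0": 95, "P1": 90, "P2": 76, "P3": 65, "P4": 55, "P5": 32, "idle": 15}

def energy_joules(residency_seconds, watts=OPTERON_WATTS):
    """Apply Eq. 2 (E = P * t) to each state and sum the contributions."""
    return sum(watts[state] * t for state, t in residency_seconds.items())

# Hypothetical residency: 10 s at P0, 20 s at P3 and 5 s idle.
example = energy_joules({"P0": 10, "P3": 20, "idle": 5})   # 950 + 1300 + 75 = 2325 J
```

Summing this quantity over all processors of the cluster gives the total energy under this reading.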

The power consumption of the processors is assigned based on their model and the power consumption specifications provided by their manufacturers [33–35]. The power consumption at each P-state for every processor used in the simulation experiments is shown in Table 2.

Table 2 P-states and power consumption

| Frequency (GHz) | AMD Opteron: P-state, power (W) | Intel Pentium M: P-state, power (W) | VIA C7-M: P-state, power (W) |
|---|---|---|---|
| 2.6 | P0, 95 | – | – |
| 2.4 | P1, 90 | – | – |
| 2.2 | P2, 76 | – | – |
| 2.0 | P3, 65 | – | P0, 20 |
| 1.8 | P4, 55 | – | P1, 18 |
| 1.6 | – | P0, 25 | P2, 15 |
| 1.4 | – | P1, 17 | P3, 13 |
| 1.2 | – | P2, 13 | – |
| 1.0 | P5, 32 | P3, 10 | P4, 10 |
| 0.8 | – | P4, 8 | P5, 7 |
| 0.6 | – | P5, 6 | P6, 6 |
| 0.4 | – | – | P7, 5 |
| Idle | 15 | 5 | 0.1 |

The maximum number of processors that can be in a sleep state with DPM depends on the cluster's size. In our case the maximum number for the cluster is four processors, since we observed through simulation experiments that this number achieves the best results. Processors are put to sleep for a fixed amount of time and the wake-up transition overhead is taken into account. Since the central resource scheduler is aware of the system's load, when a processor becomes idle the scheduler decides, based on the system's status, whether or not to deactivate it. As shown in [36], the timing overhead from sleep to wake-up and vice versa is usually very small for contemporary microprocessors if they are properly designed for power awareness, and it can be safely ignored in simulation frameworks. The energy overhead, however, must be taken into account, and in our case it is 483 µJ, as considered in [37].

The cluster is connected through a high-speed local network. A perfect network with zero latency is considered and there is no communication among tasks. The same assumption is also made in other studies [16,25]. Furthermore, the overhead of the frequency transition in processors is not considered since it takes a negligible amount of time [17]. In general, the power supplied to a server is utilized by its various components, including the CPU, memory, hard disks, network interface cards and other devices. In [38], the authors came to the conclusion that the main contributor to the power consumption of a server is the processor (37%). Additionally, in [39] the authors concluded that in most cluster systems, CPUs can consume 35–50% of a cluster node's total power. Most works on energy-efficient scheduling assume CPU-intensive tasks. Since DVFS and DPM are power-saving techniques mainly used at the processor level, other devices are not considered in this study. This also applies to a large number of studies [11,40]. Because processors are major contributors to the system's energy consumption and CPU-intensive BoTs and high-priority tasks are submitted to the system, the focus of this study is on processors' energy consumption.

The simulation model was created in the C++ programming language and the discrete event simulation technique was used to evaluate the proposed scheduling policies. According to this technique, the operation of the system is described as a discrete sequence of events in time. Every event occurs at a particular instant in time and changes the state of the system. Between consecutive events there is no change in the system’s state, and the simulation clock jumps forward to the time of the next event. There are two types of events that can occur in the system: task arrivals and task departures. A task arrival happens when a task arrives at the central resource scheduler, while a task departure occurs when a task is successfully executed and departs from the system. It must be noted that much of the work done on energy efficiency over the years has been done with custom simulators or by extending simulators that support scheduling but not energy efficiency. Repeated simulation experiments with different seeds of random numbers ensure a 95 % confidence level for simulation results. The benefit of the simulation technique is that it is possible to run experiments and combine various scenarios in a very short time and without cost.

Table 3 Processors’ service rate

P-state   AMD Opteron   Intel Pentium M   VIA C7-M
P0        1             0.615             0.769
P1        0.923         0.538             0.692
P2        0.846         0.461             0.615
P3        0.769         0.384             0.538
P4        0.692         0.307             0.384
P5        0.384         0.230             0.304
P6        –             –                 0.230
P7        –             –                 0.153
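The event-driven loop described above can be illustrated with a minimal sketch. It is not the authors’ C++ simulator: it assumes a single FIFO processor, exponential inter-arrival times and U(0.5, 1.5) service times, and the names `simulate`, `push` and `waiting` are hypothetical. Python is used only for brevity.

```python
import heapq
import random
from collections import deque


def simulate(num_tasks, arrival_rate, seed=1):
    """Single FIFO processor driven by task-arrival and task-departure events."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    rng = random.Random(seed)
    events = []        # min-heap of (event time, tie-breaker, kind)
    counter = 0        # tie-breaker so events with equal times stay ordered

    def push(when, kind):
        nonlocal counter
        heapq.heappush(events, (when, counter, kind))
        counter += 1

    waiting = deque()  # arrival times of queued tasks
    in_service = None  # arrival time of the task currently executing
    arrived = departed = 0
    total_time = 0.0
    clock = 0.0

    push(rng.expovariate(arrival_rate), "arrival")
    while departed < num_tasks:
        # The clock jumps directly to the next event; nothing changes in between.
        clock, _, kind = heapq.heappop(events)
        if kind == "arrival":
            arrived += 1
            if arrived < num_tasks:
                push(clock + rng.expovariate(arrival_rate), "arrival")
            if in_service is None:
                in_service = clock
                push(clock + rng.uniform(0.5, 1.5), "departure")
            else:
                waiting.append(clock)
        else:  # departure
            departed += 1
            total_time += clock - in_service
            if waiting:
                in_service = waiting.popleft()
                push(clock + rng.uniform(0.5, 1.5), "departure")
            else:
                in_service = None
    return total_time / departed, clock, departed


mean_time, end_clock, executed = simulate(1000, arrival_rate=0.5, seed=7)
print(f"mean time in system: {mean_time:.3f}, tasks executed: {executed}")
```

Running the function with several different seeds and averaging the results mirrors, in a very reduced form, the repeated-experiment approach described in the text.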

The workload submitted consists of BoTs and high-priority tasks. High-priority tasks can be either real-time or without time-constraints. Regarding their size, BoT applications can be small, with size uniformly distributed (U(2, 4)) with mean value 3, or large, with size uniformly distributed (U(5, 9)) with mean value 7. Each task in a BoT has a service time uniformly distributed (U(0.1, 0.3)) with mean value 0.2. High-priority tasks have a service time uniformly distributed (U(0.5, 1.5)) with mean value 1. The Pareto distribution is used for modeling the inter-arrival times of BoTs and the exponential distribution for high-priority tasks. The arrival rate for small BoTs, large BoTs and high-priority tasks is λ1, λ2 and λ3 respectively. Since the cluster’s processors are heterogeneous regarding performance, we consider that the service rate for processors with the highest performance (AMD Opteron) at the highest P-state P0 is 1. For other states and other processors the service rate is calculated relative to this value, depending on their clock speed, as depicted in Table 3. The service rate of each processor is determined based on the values provided by manufacturers presented in Table 2. Service rate depends on the speed of the processor that executes a task. As an example, a task that has service time 0.2 and is executed by a processor at performance state P3 with service rate 0.5 will finish its execution in 0.4 simulation time units. This is equivalent to other studies that use million instructions per second (MIPS) units to model the service rate of a processor and million instructions (MI) to model the task’s length. In this case, a task with length 500 MI will be executed in 0.5 s by a processor with 1000 MIPS speed.
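The relation between service time and service rate in the example above is a simple division. The sketch below reproduces both worked examples from the text; the function name `execution_time` is illustrative, and the dictionary holds only a few values copied from Table 3.

```python
# A few service rates from Table 3 (relative to AMD Opteron at P0).
SERVICE_RATE = {
    ("AMD Opteron", "P0"): 1.0,
    ("Intel Pentium M", "P0"): 0.615,
    ("VIA C7-M", "P0"): 0.769,
}


def execution_time(service_demand, service_rate):
    """Time needed to execute a task of the given demand at the given rate."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    return service_demand / service_rate


# Example from the text: service time 0.2 at service rate 0.5.
print(execution_time(0.2, 0.5))      # 0.4 simulation time units

# Equivalent MI / MIPS formulation: 500 MI on a 1000 MIPS processor.
print(execution_time(500, 1000))     # 0.5 s

# The same task on the slower Intel Pentium M at P0 takes longer.
print(execution_time(0.2, SERVICE_RATE[("Intel Pentium M", "P0")]))
```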

The maximum service rate of the system is achieved when all computing nodes are at the highest P-state, P0. The system is evaluated for low (30 %) and high (60 %) load. Simulation parameters are summarized in Table 4. In order for our results to be more accurate, extensive simulation experiments were conducted and each result presented in this study is derived from 100 simulation experiments with different seeds of random numbers. Each simulation experiment ends when a total number of 6,400,000 BoTs and high-priority tasks are executed. The exact number of each type depends on the ratio between them. For example, a 60–40 % ratio translates into 3,840,000 BoTs and 2,560,000 high-priority task executions for a simulation experiment.

Table 4 Simulation parameters

Task arrival rate                            Small BoTs: λ1
                                             Large BoTs: λ2
                                             High-priority tasks: λ3
Task size                                    Small BoTs: U(2, 4)
                                             Large BoTs: U(5, 9)
                                             High-priority tasks: 1
Service demand                               Small BoTs: BoT size * U(0.1, 0.3)
                                             Large BoTs: BoT size * U(0.1, 0.3)
                                             High-priority tasks: U(0.5, 1.5)
Ratio between BoTs and high-priority tasks   60–40, 50–50, 40–60 %
Arrival rates evaluated                      60–40 % Low load {λ1=1.8, λ2=1.8, λ3=2.4}
                                             50–50 % Low load {λ1=1.5, λ2=1.5, λ3=3}
                                             40–60 % Low load {λ1=1.2, λ2=1.2, λ3=3.6}
                                             60–40 % High load {λ1=3.6, λ2=3.6, λ3=4.8}
                                             50–50 % High load {λ1=3, λ2=3, λ3=6}
                                             40–60 % High load {λ1=2.4, λ2=2.4, λ3=7.2}
Scheduling policies                          High-priority tasks: JSQ
                                             BoTs: GrEMinMin, ConEMinMin, GrEMaxMin, ConEMaxMin

6 Simulation results and analysis

In this section the simulation results are presented. The scheduling policies proposed in this paper extend the Min–Min and Max–Min policies. A comparison of the proposed policies with Min–Min and Max–Min is presented. Furthermore, a scenario where Min–Min and Max–Min are applied with AD-DVFS is also included in the simulation results. In summary, simulation results include:

• Proposed scheduling policies (GrEMaxMin, ConEMaxMin, GrEMinMin, ConEMinMin) with DVFS and DPM

• Min–Min and Max–Min without the application of any energy-saving mechanisms

• Min–Min and Max–Min with the application of AD-DVFS
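For readers unfamiliar with the two baseline heuristics, the following is a minimal sketch of the classic performance-oriented Min–Min and Max–Min policies on processors with different service rates. It does not include the DVFS or DPM extensions proposed in this paper, and the names `schedule`, `ready` and `pick_largest` are hypothetical.

```python
def schedule(task_demands, service_rates, pick_largest=False):
    """Classic Min-Min (default) or Max-Min (pick_largest=True) heuristic.

    Returns ({task index: (processor index, finish time)}, makespan).
    """
    ready = [0.0] * len(service_rates)      # time each processor becomes free
    remaining = dict(enumerate(task_demands))
    assignment = {}
    while remaining:
        # For every unscheduled task, find its earliest completion time
        # and the processor that achieves it.
        best = {}
        for task, demand in remaining.items():
            best[task] = min(
                (ready[p] + demand / rate, p)
                for p, rate in enumerate(service_rates)
            )
        # Min-Min picks the task with the smallest such completion time,
        # Max-Min the task with the largest.
        chooser = max if pick_largest else min
        task = chooser(best, key=lambda t: best[t][0])
        finish, proc = best[task]
        assignment[task] = (proc, finish)
        ready[proc] = finish
        del remaining[task]
    return assignment, max(ready)


# Three tasks on two identical processors (service rate 1).
_, makespan_min_min = schedule([1, 2, 3], [1.0, 1.0])
_, makespan_max_min = schedule([1, 2, 3], [1.0, 1.0], pick_largest=True)
print(makespan_min_min)   # 4.0
print(makespan_max_min)   # 3.0
```

In this small example Max–Min finishes earlier because the longest task is placed first, which illustrates why the two heuristics can behave differently on the same bag of tasks.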

This section is divided into subsections for better presentation of the simulation results. In Sect. 6.1 we present the energy savings achieved by our policies, while in Sect. 6.2 performance and its degradation (tasks’ time in the system and deadline miss ratio of real-time tasks) are described. DVFS stretches the execution time of tasks, so utilization rises and the idle time of processors shrinks. DPM, in turn, takes advantage of the idle time by deactivating processors to save energy. Since DVFS and DPM are in general two competing power-saving mechanisms, in Sect. 6.3 the idle time of processors is presented.

6.1 Energy savings

For the system with high heterogeneity, compared to Min–Min, energy savings with GrEMinMin are up to 20.4 %, in contrast to 12.6 % achieved by AD-DVFS (with Min–Min). As expected, GrEMinMin results in greater energy savings than ConEMinMin (18.7 %) due to the greedy nature of the algorithm. Furthermore, compared to Max–Min, energy savings with GrEMaxMin are up to 20.8 %, in contrast to 12.9 % achieved by AD-DVFS (with Max–Min), and ConEMaxMin achieves 19.1 % savings. Energy savings are depicted in Fig. 2. When the load is low, energy savings are higher because the arrival rate is lower and processors are underutilized. In this case, processors can execute tasks at lower speeds or they can be put to sleep.

The same conclusions can be drawn about conservative and greedy policies in the case where the heterogeneity of the system is low. Simulation results presented in Fig. 3 suggest that lower heterogeneity achieves lower energy reduction. Energy savings with GrEMinMin are up to 19.3 % while with ConEMinMin savings are up to 17.6 %. GrEMaxMin achieves 19.7 % energy reduction while ConEMaxMin achieves 18 %. A related study [41] also showed that energy savings were higher when the heterogeneity level of the system increased. In our case, the system with higher heterogeneity is more “green” and results in lower energy consumption, since some of the energy-consuming AMD Opteron processors are replaced with energy-efficient VIA C7-M processors. For both architectures, policies based on Max–Min (GrEMaxMin, ConEMaxMin) achieve slightly higher energy savings than those based on Min–Min (GrEMinMin, ConEMinMin).

Fig. 2 Energy savings compared to a Min–Min and b Max–Min for high heterogeneity (Real-time tasks)

Fig. 3 Energy savings compared to a Min–Min and b Max–Min for low heterogeneity (Real-time tasks)

Results show that the proposed policies save more energy than performance-oriented Min–Min and Max–Min and energy-oriented AD-DVFS. Energy savings are higher when the load is low (30 %). This is caused by the DPM mechanism. When the load is low, the idle time of processors is exploited to a large degree by DPM and processors are deactivated more often, so more energy is saved. When the load is high, although DVFS still saves energy, DPM cannot exploit the idle time of processors to the same degree since they are busy most of the time.

Simulation results indicate that the ratio between BoTs and high-priority tasks affects energy savings. For example, a 60–40 % ratio achieves lower energy savings than a 40–60 % ratio. In the first case, more tasks are BoT applications that have higher arrival rates and power-saving mechanisms cannot save much energy. In the second case, most of the tasks are high-priority and they are executed at the highest speed. BoTs’ arrival rate is lower and power-saving mechanisms can exploit large idle time periods. In the case of BoTs submitted with high-priority tasks without time-constraints, similar conclusions are obtained. Again, compared to our previous work with AD-DVFS, energy savings are significantly higher. In summary, conservative policies outperform AD-DVFS, while higher system heterogeneity results in higher energy savings.

6.2 Performance

While power-management techniques such as DVFS and DPM save energy, executing tasks at lower speeds with DVFS causes further delay of tasks waiting in processors’ queues. Additionally, when processors are put into a sleep state with DPM, the overall service rate of the system is reduced and this may cause severe performance degradation in times of high load. Performance degradation in the case of BoTs submitted with high-priority real-time tasks can be seen from the deadline miss ratio of real-time tasks and from the time that a BoT spends in the system until it is successfully executed. While our four proposed scheduling policies outperform traditional policies such as Min–Min and Max–Min and AD-DVFS in terms of energy preservation, results show that in terms of performance the four policies achieve lower performance degradation than AD-DVFS and comparable performance to Min–Min and Max–Min. It must be noted that energy gains and performance loss depend on the nature of the energy-saving mechanisms. For example, a DVFS mechanism that executes tasks at very low speeds theoretically saves more energy but results in greater performance degradation. In contrast, a more conservative DVFS mechanism that takes into account the status and the load of the system and does not always execute tasks at the lowest speed can achieve lower energy gains with little or no performance degradation. The same conclusion can be drawn for DPM. A DPM mechanism that deactivates a processor as soon as it becomes idle would save more energy, although it would result in great performance loss due to its greedy nature. Our proposed scheduling policies aim to save energy with insignificant or zero performance loss.

For both variations of the system, similar results were produced because both variations have almost the same processing capability and service rate. In the low heterogeneity variation, the service rate is 12 × 1 + 12 × 0.615 = 19.38, while in the high heterogeneity variation the service rate is 8 × 1 + 8 × 0.615 + 8 × 0.769 = 19.07. Results presented in this section are derived from simulation experiments on the high heterogeneity variation of the system.
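The aggregate service rates quoted above can be reproduced directly from the P0 rates in Table 3. The sketch below also relates them to the arrival rates of Table 4, under the assumption (not stated explicitly in the text) that load is the offered work per time unit, computed from mean BoT sizes and mean service times, divided by the aggregate P0 service rate. The function names are illustrative.

```python
# (number of processors, service rate at P0) for each system variation.
LOW_HETEROGENEITY = [(12, 1.0), (12, 0.615)]
HIGH_HETEROGENEITY = [(8, 1.0), (8, 0.615), (8, 0.769)]


def aggregate_service_rate(nodes):
    """Sum of the P0 service rates of all processors."""
    return sum(count * rate for count, rate in nodes)


def offered_work(lambda1, lambda2, lambda3):
    """Mean work arriving per time unit (assumed definition)."""
    small_bots = lambda1 * 3 * 0.2    # mean size 3, mean task service time 0.2
    large_bots = lambda2 * 7 * 0.2    # mean size 7, mean task service time 0.2
    high_priority = lambda3 * 1.0     # mean service time 1
    return small_bots + large_bots + high_priority


capacity = aggregate_service_rate(HIGH_HETEROGENEITY)
print(round(aggregate_service_rate(LOW_HETEROGENEITY), 2))   # 19.38
print(round(capacity, 2))                                    # 19.07

# 60-40 % ratio from Table 4, low and high load.
print(round(offered_work(1.8, 1.8, 2.4) / capacity, 2))      # 0.31
print(round(offered_work(3.6, 3.6, 4.8) / capacity, 2))      # 0.63
```

Under this assumed definition the computed values are close to the nominal 30 % and 60 % load levels.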

The deadline miss ratio for real-time tasks is depicted in Tables 5 and 6, while the total time spent in the system for small and large BoTs is depicted in Tables 7, 8, 9, and 10. We observe that greedy energy-saving approaches (GrEMinMin, GrEMaxMin), compared to more conservative approaches (ConEMinMin, ConEMaxMin), result in greater performance degradation that is not justified by the extra energy savings that are earned, especially for high load. When a BoT is executed according to a greedy policy, although energy can be saved, tasks waiting in processors’ local queues are delayed more. Since tasks arrive according to the Pareto distribution, arrivals are bursty and this makes performance degradation even higher. Results show that large BoTs spend more time in the system than small BoTs, because it is more likely that a single task delays the entire BoT. Policies based on Max–Min (GrEMaxMin, ConEMaxMin) achieve slightly higher energy savings compared to those based on Min–Min (GrEMinMin, ConEMinMin), although results indicate a slightly higher deadline miss ratio for real-time tasks.

Table 5 Deadline miss ratio (Min–Min)

                    Low load                            High load
                    60–40 (%)   50–50 (%)   40–60 (%)   60–40 (%)   50–50 (%)   40–60 (%)
Without DVFS-DPM    0.00        0.00        0.00        0.01        0.01        0.02
AD-DVFS             0.00        0.00        0.00        0.27        0.45        0.76
GrEMinMin           0.00        0.00        0.00        0.21        0.38        0.62
ConEMinMin          0.00        0.00        0.00        0.13        0.24        0.39

Table 6 Deadline miss ratio (Max–Min)

                    Low load                            High load
                    60–40 (%)   50–50 (%)   40–60 (%)   60–40 (%)   50–50 (%)   40–60 (%)
Without DVFS-DPM    0.00        0.00        0.00        0.01        0.01        0.02
AD-DVFS             0.00        0.00        0.00        0.31        0.51        0.90
GrEMaxMin           0.00        0.00        0.00        0.23        0.40        0.73
ConEMaxMin          0.00        0.00        0.00        0.15        0.27        0.41

Table 7 Small BoTs, time in the system (Min–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    0.26    0.26    0.26    0.27    0.28    0.28
AD-DVFS             0.46    0.47    0.48    0.56    0.62    0.70
GrEMinMin           0.39    0.41    0.42    0.47    0.51    0.59
ConEMinMin          0.33    0.34    0.35    0.38    0.39    0.40

Table 8 Large BoTs, time in the system (Min–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    0.31    0.31    0.31    0.35    0.35    0.35
AD-DVFS             0.78    0.78    0.80    0.88    0.91    0.96
GrEMinMin           0.59    0.60    0.61    0.72    0.77    0.79
ConEMinMin          0.43    0.44    0.44    0.48    0.49    0.51

Table 9 Small BoTs, time in the system (Max–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    0.25    0.25    0.26    0.26    0.26    0.27
AD-DVFS             0.46    0.47    0.48    0.55    0.61    0.69
GrEMaxMin           0.38    0.41    0.41    0.46    0.49    0.58
ConEMaxMin          0.33    0.33    0.34    0.37    0.38    0.38

Table 10 Large BoTs, time in the system (Max–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    0.27    0.27    0.28    0.28    0.28    0.29
AD-DVFS             0.61    0.61    0.63    0.78    0.85    0.91
GrEMaxMin           0.55    0.56    0.57    0.62    0.65    0.67
ConEMaxMin          0.38    0.38    0.39    0.45    0.47    0.48

Table 11 Time in the system for tasks without time-constraints (Min–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    1.34    1.35    1.36    1.37    1.40    1.43
AD-DVFS             1.36    1.38    1.40    1.50    1.61    1.75
GrEMinMin           1.36    1.38    1.39    1.48    1.59    1.69
ConEMinMin          1.35    1.37    1.38    1.45    1.51    1.62
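The size of the performance gap between greedy and conservative policies can be read off the tables with a short calculation. The sketch below uses only the high-load values of Table 8 (large BoTs, Min–Min); the helper `relative_increase` is a hypothetical name, and the choice of Table 8 is just an example.

```python
# Time in the system for large BoTs under high load (Table 8),
# for the 60-40, 50-50 and 40-60 ratios respectively.
HIGH_LOAD_TABLE_8 = {
    "Without DVFS-DPM": [0.35, 0.35, 0.35],
    "AD-DVFS": [0.88, 0.91, 0.96],
    "GrEMinMin": [0.72, 0.77, 0.79],
    "ConEMinMin": [0.48, 0.49, 0.51],
}


def relative_increase(policy, baseline, table=HIGH_LOAD_TABLE_8):
    """Fractional increase of `policy` over `baseline` for each ratio."""
    return [(p - b) / b for p, b in zip(table[policy], table[baseline])]


greedy_vs_conservative = relative_increase("GrEMinMin", "ConEMinMin")
print([f"{x:.0%}" for x in greedy_vs_conservative])
```

For these values the greedy policy keeps large BoTs in the system roughly 50 to 57 % longer than the conservative one.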

It is observed that performance loss is greater for high load since the energy-saving mechanisms cause extra delay for real-time tasks, causing them to miss their deadlines. In contrast, for low load, there is no rise in the deadline miss ratio of real-time tasks since the increase in the delay time is negligible. The ratio between BoTs and real-time tasks affects performance since a 40–60 % ratio translates into more real-time tasks, and thus a higher arrival rate of real-time tasks. A higher arrival rate yields a higher deadline miss ratio for high-priority real-time tasks.

In the case where the high-priority tasks submitted are without time-constraints, the time spent by these tasks in the system is a good indicator of the performance and throughput of the system. Both conservative policies (ConEMinMin, ConEMaxMin) result in significantly lower times than AD-DVFS and achieve comparable performance to Min–Min and Max–Min. Furthermore, greedy approaches (GrEMinMin, GrEMaxMin), as a result of their greedy nature regarding energy consumption, outperform AD-DVFS in terms of performance, although their performance is significantly lower than that of conservative policies. Time spent in the system for high-priority tasks without time-constraints is presented in Tables 11 and 12.

6.3 Utilization: idle time

In large-scale systems there are periods of high utilization due to bursty arrivals and there are periods where the system is underutilized. During underutilized periods, by using DPM, unneeded computing nodes can be deactivated since their processing power is not needed. In general, DPM exploits the idle time of processors and turns it into sleep time. On the contrary, DVFS stretches the execution time of tasks and minimizes idle time. In this section the processors’ idle time is studied. Since AD-DVFS does not incorporate DPM, it is interesting to see the effects of our proposed policies on processors’ idle time.

Table 12 Time in the system for tasks without time-constraints (Max–Min)

                    Low load                High load
                    60–40   50–50   40–60   60–40   50–50   40–60
Without DVFS-DPM    1.34    1.36    1.37    1.38    1.40    1.43
AD-DVFS             1.36    1.38    1.40    1.52    1.63    1.77
GrEMaxMin           1.36    1.38    1.39    1.49    1.60    1.71
ConEMaxMin          1.35    1.37    1.38    1.45    1.53    1.63

Fig. 4 Idle time reduction compared to a Min–Min and b Max–Min for high heterogeneity

For both variations of the system, a comparison of idle time for all scenarios shows that DPM effectively takes advantage of the idle time and saves energy. Since both variations of the system have similar processing capability, idle time is similar in both cases. In Fig. 4, idle time for high heterogeneity is depicted. It is apparent that the reduction in the idle time due to DPM is higher when the load is low. Bursty arrivals in the system translate into long idle periods that DPM can exploit.

Furthermore, conservative policies result in higher idle times than greedy policies, since tasks are executed at higher speeds and finish execution at earlier times. Additionally, for high load, policies based on Max–Min result in slightly lower idle times than those based on Min–Min. Similar conclusions can be drawn from simulation results obtained in the case where BoTs and high-priority tasks without time-constraints are submitted.

7 Experiments on a real hardware platform

In order to validate the simulation results, the proposed policies were evaluated on a real system. The system is heterogeneous and consists of 16 laptops and a server. There are 8 laptops equipped with Intel Core i3-4000m processors and the other 8 laptops are equipped with AMD Athlon II M300 processors. All laptops are equipped with 4 GB RAM. Both processors support DVFS and can be put to sleep, while the Intel Core i3-4000m supports Hyper-Threading. The server is equipped with an Intel Core i7-4790K processor. The Intel Core i3-4000m is a dual core processor with 4 threads and supports 16 P-states from 800 MHz to 2.40 GHz, while the AMD Athlon II M300 is a dual core processor with 2 threads and 3 P-states from 800 MHz to 2.00 GHz. Computing nodes are connected through a 100 Mbps local area network and communication overhead among the clients and the server is less than 1 ms.

A client-server application was written in Java for evaluation purposes. In this application, the server generates tasks that need to be executed by the computing nodes and applies the proposed scheduling policies in order to determine the computing nodes that will execute each task. Tasks are considered to be CPU-intensive and for this reason a routine that finds the first prime numbers is used. In order to determine the service rate of the processors in each P-state, various tests were conducted. The resulting computing time needed by each processor to find the first 20,000, 30,000 and 40,000 primes at each P-state is shown in Table 13.
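A CPU-intensive task of this kind can be sketched as follows. The authors’ routine was written in Java and its exact algorithm is not given, so this trial-division version in Python is only an assumed stand-in; `first_primes` and `timed_ms` are hypothetical names.

```python
import time


def first_primes(n):
    """Return the first n prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        is_prime = True
        for p in primes:
            if p * p > candidate:      # no divisor up to sqrt(candidate)
                break
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes


def timed_ms(n):
    """Wall-clock time in milliseconds needed to find the first n primes."""
    start = time.perf_counter()
    first_primes(n)
    return (time.perf_counter() - start) * 1000.0


print(first_primes(10))    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(f"{timed_ms(20000):.0f} ms for the first 20,000 primes")
```

Varying `n` changes the task size, which is how a single routine can emulate tasks with different service demands.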

In order to match the size of the tasks used in the simulation experiments, the number of primes calculated varies.


Table 13 Time (ms) for the calculation of prime numbers

Clock frequency (GHz)   Intel Core i3-4000m          AMD Athlon II M300
                        20,000   30,000   40,000     20,000   30,000   40,000
2.40                    120      218      335        –        –        –
2.30                    123      225      350        –        –        –
2.20                    129      236      367        –        –        –
2.10                    135      247      381        –        –        –
2.00                    142      260      400        199      375      584
1.90                    148      273      421        –        –        –
1.80                    155      288      444        –        –        –
1.70                    164      306      470        –        –        –
1.50                    188      347      537        –        –        –
1.40                    202      371      573        285      535      837
1.30                    218      400      617        –        –        –
1.20                    234      433      668        –        –        –
1.10                    255      471      728        –        –        –
1.00                    281      519      799        –        –        –
0.90                    310      576      893        –        –        –
0.80                    350      649      1000       499      935      1460
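One plausible way to turn the measurements of Table 13 into relative service rates is to normalise each time against the fastest configuration. The paper does not state the normalisation it used, so taking the Intel Core i3-4000m at 2.40 GHz as the reference is an assumption here, and `relative_rates` is a hypothetical helper. Only the 20,000-prime column and a few frequencies are included.

```python
# Time (ms) to find the first 20,000 primes, copied from Table 13.
INTEL_MS = {2.40: 120, 2.00: 142, 1.40: 202, 0.80: 350}
AMD_MS = {2.00: 199, 1.40: 285, 0.80: 499}

# Assumed reference: the fastest measured configuration (rate 1).
REFERENCE_MS = INTEL_MS[2.40]


def relative_rates(times_ms, reference_ms=REFERENCE_MS):
    """Service rate per frequency, relative to the reference time."""
    return {freq: reference_ms / t for freq, t in times_ms.items()}


for name, times in (("Intel Core i3-4000m", INTEL_MS),
                    ("AMD Athlon II M300", AMD_MS)):
    rates = relative_rates(times)
    print(name, {f: round(r, 3) for f, r in rates.items()})
```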

Fig. 5 Energy savings compared to a Min–Min and b Max–Min

Furthermore, tasks’ execution time is significantly higher than the network latency, so that latency does not affect the results produced. All computing nodes of the system are running Ubuntu 14.04 LTS. In Ubuntu, CPU frequency scaling is implemented in the kernel and governors are power schemes for the CPU. In our experiments, the userspace governor is used, after enabling the acpi-cpufreq driver, in order to run the CPU at the frequencies specified by the server. Several services running in the background were disabled in order to limit the effect of other tasks running during the evaluation. To measure power consumption based on the time that processors spend at each P-state, PowerTOP [42], a Linux tool released by Intel to diagnose issues with power consumption and power management, was used. PowerTOP can provide details about CPU power consumption and the time spent in CPU sleep states and P-states.
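The paper does not show how the clients applied the frequencies received from the server. A minimal sketch, assuming the standard Linux cpufreq sysfs files `scaling_governor` and `scaling_setspeed` (the latter takes a frequency in kHz when the userspace governor is active) and root privileges, could look as follows. The function name and the `sysfs_root` parameter are hypothetical; the parameter exists only so the code can be tried against a scratch directory.

```python
import os


def set_cpu_frequency(cpu, freq_khz, sysfs_root="/sys/devices/system/cpu"):
    """Select the userspace governor for one CPU and request a frequency."""
    base = os.path.join(sysfs_root, f"cpu{cpu}", "cpufreq")
    # Switch to the userspace governor so the frequency can be set manually.
    with open(os.path.join(base, "scaling_governor"), "w") as f:
        f.write("userspace")
    # Request the target frequency (in kHz).
    with open(os.path.join(base, "scaling_setspeed"), "w") as f:
        f.write(str(freq_khz))


# Example (requires root on a machine with cpufreq support):
# set_cpu_frequency(0, 800000)   # run CPU 0 at 800 MHz
```

Each logical CPU has its own `cpufreq` directory, so a client would typically call the function once per CPU.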

Both BoTs and real-time tasks with deadlines were submitted to the system. The system was evaluated for low (30 %) and high (60 %) load as in the simulation experiments. Results show that energy savings of the proposed policies are comparable to results produced by simulation for the low heterogeneity system, since the real system consists of two types of processors. Despite the fact that network latency and processors’ transition time during P-state changes were not taken into account in the simulation settings, real experiments show that the proposed policies achieve slightly higher energy savings. This is probably because Intel Core i3-4000m processors support 16 P-states, in contrast to the processors used in the simulation experiments, so the overhead of tasks being executed at a higher speed than needed is reduced. As depicted in Fig. 5, energy savings with GrEMinMin are up to 20.1 % while with ConEMinMin savings are up to 18.5 %. GrEMaxMin achieves 20.6 % energy reduction while ConEMaxMin achieves 18.9 %.

Table 14 Deadline miss ratio (Min–Min)

                    Low load                            High load
                    60–40 (%)   50–50 (%)   40–60 (%)   60–40 (%)   50–50 (%)   40–60 (%)
Without DVFS-DPM    0.00        0.00        0.00        0.01        0.01        0.01
AD-DVFS             0.00        0.00        0.00        0.24        0.41        0.69
GrEMinMin           0.00        0.00        0.00        0.19        0.35        0.58
ConEMinMin          0.00        0.00        0.00        0.12        0.22        0.35

Table 15 Deadline miss ratio (Max–Min)

                    Low load                            High load
                    60–40 (%)   50–50 (%)   40–60 (%)   60–40 (%)   50–50 (%)   40–60 (%)
Without DVFS-DPM    0.00        0.00        0.00        0.01        0.01        0.01
AD-DVFS             0.00        0.00        0.00        0.28        0.47        0.82
GrEMaxMin           0.00        0.00        0.00        0.21        0.37        0.68
ConEMaxMin          0.00        0.00        0.00        0.14        0.24        0.35

Regarding performance, experiments showed that performance is kept at satisfactory levels and is slightly improved compared to simulation results. In Tables 14 and 15, the deadline miss ratio for real-time tasks is depicted. In conclusion, experimentation on a real system showed that the proposed policies perform even better than in simulation experiments, due to the different architecture of the real system.

8 Summary and conclusions

In this paper we study energy-efficient scheduling for BoTs in a heterogeneous cluster environment. We propose policies that are based on and extend Min–Min and Max–Min. We show that these traditional policies, which have been used extensively in the literature, can be extended in order to include energy-efficiency criteria. The workload submitted to the system consists of BoTs and high-priority tasks. BoTs arrive at the system according to the Pareto distribution since studies show that in large-scale systems this distribution is best suited. The system was evaluated for low and high load and the ratio between BoTs and high-priority tasks varied from 40–60 to 60–40 %.

A series of simulation experiments were conducted along with experiments on a real system and useful conclusions were produced. Two types of policies are proposed in this study: conservative and greedy. Greedy policies (GrEMinMin, GrEMaxMin) achieved greater energy savings, up to 20.6 % in real experiments, due to their greedy nature, although performance was degraded. Nevertheless, compared to AD-DVFS, greedy policies behaved better even under high load. On the other hand, conservative policies (ConEMinMin, ConEMaxMin) achieved comparable performance to Min–Min and Max–Min and great energy savings, up to 18.9 % in real experiments, and outperformed AD-DVFS in every scenario. There is always a trade-off between performance and energy when energy-saving mechanisms are used. In our case, the goal is to execute BoTs more efficiently without affecting much the execution of high-priority tasks. Greedy policies tend to execute BoTs at lower speeds. As a result, for high load the average time in the system of a BoT was almost 50 % more than in the case where conservative policies were used. This led to an increase of 50 % in the deadline miss ratio for real-time tasks compared to conservative policies. Since greedy versions of our algorithms tend to save more energy than conservative versions but achieve significantly lower performance, we believe that it is preferable to apply conservative versions rather than greedy ones, since they save energy without sacrificing much performance. Thus, a conservative scheduling approach like ConEMinMin or ConEMaxMin is best suited since it balances performance with energy efficiency.

Regarding the DPM mechanism, since a heavy-tailed distribution is used for BoT arrivals, there will be large idle periods and small periods of heavy load in the system. These idle periods are exploited by the DPM mechanism. For low load, DPM proved to be very effective and significant amounts of energy were saved. For high load it is more difficult to save energy and the deactivation of a computing node could lead to severe performance loss since the overall service rate of the system is reduced. The dynamic DPM mechanism used in this study assesses whether the system is under heavy load and decides whether or not the deactivation of a computing node would benefit the system’s performance and energy consumption.


Energy savings achieved by our proposed policies can be translated into cost savings. To calculate cost savings, we took into account conclusions from study [38] and the fact that for a high-performance server with 300 W, the annual energy consumption is around 2628 kWh and the annual power cost of the server is around $263 [40]. Based on the above, ConEMinMin and ConEMaxMin result in 13.1–18.9 % energy reduction depending on the load, and for the real system used in this study this translates into $229–$331 annual savings. On large-scale systems with more computing nodes, cost savings would be even higher. Besides the cost benefits, lower power consumption results in lower temperature of computing systems, which further increases the system’s reliability and leads to lower cooling costs.
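The per-server figures quoted above follow from simple arithmetic, sketched below. Only the 300 W, 2628 kWh and $263 values come from the text; the electricity price is derived from them rather than stated in the paper, and the calculation of the $229–$331 range is not reproduced because the text does not give all of its inputs.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours, ignoring leap years


def annual_energy_kwh(power_watts):
    """Energy used in one year by a device drawing constant power."""
    return power_watts * HOURS_PER_YEAR / 1000


energy = annual_energy_kwh(300)
implied_price = 263 / energy   # dollars per kWh implied by the quoted cost

print(energy)                    # 2628.0 kWh
print(round(implied_price, 2))   # 0.1 dollars per kWh
```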

Finally, from experiments, it is apparent that the architecture of the system affects the performance and the energy savings produced. Additionally, higher heterogeneity of the system improves performance since tasks are executed on more suitable resources. Beyond that, a large number of processor P-states can also assist in executing tasks at the speed requested by the server.

In future work, we believe that it is feasible to combine the two approaches (greedy and conservative). Some studies indicate that users can select the performance penalty that is acceptable to them and then the system can decide which policy and what energy-saving mechanism will be selected. Thus, in some execution environments, one might be willing to tolerate the extra performance penalty induced by greedy policies for larger energy savings. For example, a 5 % performance penalty may be acceptable in exchange for about 5 % more energy savings. In other words, if users accept a higher performance penalty, then the system could save more energy by applying a greedy policy, especially for low load. In conclusion, in order to maximize performance and minimize energy consumption the system may apply a conservative or a greedy policy depending on the performance penalty that is acceptable to the system administrator or the end-users.

Because processors are the main contributors to energy consumption in a computer system, in this paper we studied only processors’ energy consumption and focused on its minimization. Since there are several studies in the literature that use both DVFS and DPM in a system with processors and other peripheral devices, we intend to include additional peripheral devices in the system that have stand-by or off modes and can also benefit from DPM. In order to do this, a different type of workload must be submitted to the system, or a mixed workload that consists of tasks demanding not only CPU but also other resources such as HDD and WLAN. Thus, our next work will include BoTs and other types of tasks that are scheduled in a system with peripheral devices using DVFS and DPM schemes. Additionally, mobile devices with variations in performance and energy consumption will be added to the system in order to increase its heterogeneity and evaluate our proposed scheduling policies in such an environment. Load balancing techniques are important in a heterogeneous environment in order to balance the load among the system’s resources so that there are no underutilized and overutilized resources at the same time. In order to evenly redistribute the load among computing nodes, we intend to propose load balancing techniques based on the energy efficiency of the computing nodes in future work.

References

1. Cademartori, H.: Green computing beyond the data cen-ter. http://www.powersavesoftware.com/Download/PS_WP_GreenComputing_EN.pdf (2007). Accessed 10 Dec 2014

2. Google. Data center efficiency. http://www.google.com/about/datacenters/efficiency/. Accessed 10 Dec 2014

3. Rajamani, K., Lefurgy, C.: On evaluating request-distributiondchemes for saving energy in server clusters. In: Proceedings ofthe International Symposium on Performance Analysis of Systemsand Software (ISPASS), pp. 111–122 (2003)

4. Patterson, D.A., Hennessy, J.L.: Computer Architecture: A Quan-titative Approach, 3rd edn. Morgan Kaufmann Publishers, SanFancisco (2003)

5. Tran, M., Wolters, L.: Towards a profound analysis of Bags-of-Tasks in parallel systems and their performance impact. In:Proceedings of the 20th International Symposium on High Perfor-mance Distributed Computing (HPDC ’11), pp. 111–122 (2011)

6. ARM. Big.LITTLE Processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php Accessed 10 Dec2014

7. Bessis, N., Sotiriadis, S., Pop, F., Cristea, V.: Using a novelmessage-exchanging optimization (MEO) model to reduce energyconsumption in distributed systems. Simul. Model. Pract. Theory39, 104–120 (2013)

8. Castañé, G., Núñez, A., Llopis, P., Carretero, J.:E-mc2: A for-mal framework for energy modelling in cloud computing. Simul.Model. Pract. Theory 39, 56–75 (2013)

9. Du, Z., Fan, W., Chai, Y., Chen, Y.: Priori information and slid-ing window based prediction algorithm for energy-efficient storagesystems in cloud. Simul. Model. Pract. Theory 39, 3–19 (2013)

10. Quarati, A., Clematis, A., Galizia, A., D’Agostino, D.: Hybridclouds brokering: business opportunities, QoS and energy-savingissues. Simul. Model. Pract. Theory 39, 121–134 (2013)

11. Ma,Y.,Gong,B., Sugihara,R.,Gupta,R.: Energy-efficient deadlinescheduling for heterogeneous systems. J. Parallel Distrib. Comput.72(12), 1725–1740 (2012)

12. Terzopoulos, G., Karatza, H.D.: Dynamic voltage scaling schedul-ing on power-aware clusters under power constraints. In: Pro-ceedings of the 17th IEEE/ACM International Symposium onDistributed Simulation and Real TimeApplications (DS-RT 2013),pp. 72–78 (2013)

13. Terzopoulos, G., Karatza, H.D.: Energy efficient real-time hetero-geneous cluster scheduling with node replacement due to failures.J. Supercomput. 68(2), 867–889 (2014)

14. Rafique, M.M., Ravi, N., Cadambi, S., Butt, A.R., Chakradhar, S.:Power management for heterogeneous clusters: An experimentalstudy. In: Proceedings of Green Computing Conference andWork-shops (IGCC 2011), pp. 1–8 (2011)

15. Terzopoulos, G., Karatza, H.D.: Performance evaluation andenergy consumption of a real-time heterogeneous grid systemusingDVS and DPM. Simul. Model. Pract. Theory 36, 33–43 (2013)

123

http://www.powersavesoftware.com/Download/PS_WP_GreenComputing_EN.pdf

http://www.powersavesoftware.com/Download/PS_WP_GreenComputing_EN.pdf

http://www.google.com/about/datacenters/efficiency/

http://www.google.com/about/datacenters/efficiency/

http://www.arm.com/products/processors/technologies/biglittleprocessing.php

http://www.arm.com/products/processors/technologies/biglittleprocessing.php

Cluster Comput (2016) 19:615–631 631

16. Iosup, A., Sonmez, O., Anoep, S., Epema, D.: The performanceof Bags-of-Tasks in large-scale distributed systems. In: Proceed-ings of the 17th International Symposium on High PerformanceDistributed Computing (HPDC ‘08), pp. 97–108 (2008)

17. Weng, C., Lu, X.: Heuristic scheduling for Bag-of-Tasks applications in combination with QoS in the computational grid. Future Gener. Comput. Syst. 21(2), 271–280 (2005)

18. Liu, C., Baskiyar, S.: A general distributed scalable grid scheduler for independent tasks. J. Parallel Distrib. Comput. 69(3), 307–314 (2009)

19. Nesmachnow, S., Dorronsoro, B., Pecero, J.E., Bouvry, P.: Energy-aware scheduling on multicore heterogeneous grid computing systems. J. Grid Comput. 11(4), 653–680 (2013)

20. Al-Daoud, H., Al-Azzoni, I., Down, D.G.: Power-aware linear programming based scheduling for heterogeneous computer clusters. Future Gener. Comput. Syst. 28(5), 745–754 (2012)

21. Laszewski, G., Wang, L., Younge, A.J., He, X.: Power-aware scheduling of virtual machines in DVFS-enabled clusters. In: Proceedings of IEEE International Conference on Cluster Computing and Workshops, pp. 1–10 (2009)

22. Kim, K.H., Buyya, R., Kim, J.: Power aware scheduling of Bag-of-Tasks applications with deadline constraints on DVS-enabled clusters. In: Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pp. 541–548 (2007)

23. Baskiyar, S., Abdel-Kader, R.: Energy aware DAG scheduling on heterogeneous systems. Cluster Comput. 13(4), 373–383 (2010)

24. Mei, J., Li, K., Li, K.: Energy-aware task scheduling in heterogeneous computing environments. Cluster Comput. 17(2), 537–550 (2014)

25. Da Silva, F.A.B., Senger, H.: Improving scalability of Bag-of-Tasks applications running on master-slave platforms. J. Parallel Comput. 35(2), 57–71 (2009)

26. Casanova, H., Gallet, M., Vivien, F.: Non-clairvoyant scheduling of multiple Bag-of-tasks applications. In: Proceedings of the 16th International Euro-Par Conference, pp. 168–179 (2010)

27. Da Silva, F.A.B., Carvalho, S., Hruschka, E.R.: A scheduling algorithm for running Bag-of-Tasks data mining applications on the grid. In: Proceedings of the 10th International Euro-Par Conference, pp. 254–262 (2004)

28. Oprescu, A., Kielmann, T.: Bag-of-Tasks scheduling under budget constraints. In: Proceedings of Cloud Computing Technology and Science (CloudCom), pp. 351–359 (2010)

29. Terzopoulos, G., Karatza, H.D.: Bag-of-Task scheduling on power-aware clusters using a DVFS-based mechanism. In: Proceedings of the 10th Workshop on High-Performance, Power-Aware Computing (HPPAC 2014), in Conjunction with the 28th IEEE International Parallel & Distributed Processing Symposium (2014)

30. Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.F.: Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. J. Parallel Distrib. Comput. 59(2), 107–131 (1999)

31. Weisstein, E.W.: Pareto distribution. http://mathworld.wolfram.com/ParetoDistribution.html. Accessed 16 May 2015

32. Weste, N.H.E., Eshraghian, K.: Principles of CMOS VLSI design. Addison Wesley, Boston (1993)

33. AMD. Power and cooling in the data center. http://www.amd.com/Documents/34146A_PC_WP_en.pdf. Accessed 10 Dec 2014

34. Intel. Enhanced Intel SpeedStep technology for the Intel Pentium M processor white paper. ftp://download.intel.com/design/network/papers/30117401.pdf. Accessed 10 Dec 2014

35. VIA. PowerSaver™ Technology. http://www.via.com.tw/en/initiatives/greencomputing/powersaver.jsp. Accessed 10 Dec 2014

36. Zhu, Y., Mueller, F.: DVSleak: combining leakage reduction and voltage scaling in feedback EDF scheduling. In: Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (LCTES ’07), pp. 31–40 (2007)

37. Chen, J.J., Kuo, T.W.: Procrastination determination for periodic real-time tasks in leakage-aware dynamic voltage scaling systems. In: Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design (ICCAD ’07), pp. 289–294 (2007)

38. Fan, X., Weber, W.D., Barroso, L.A.: Power provisioning for a warehouse-sized computer. In: Proceedings of the 34th Annual International Symposium on Computer Architecture, pp. 13–23 (2007)

39. Valentini, G.L., Lassonde, W., Khan, S.U., Min-Allah, N., Madani, S.A., Li, J., Zhang, L., et al.: An overview of energy efficiency techniques in cluster computing systems. Cluster Comput. 16(1), 3–15 (2013)

40. Chen, J.J., Huang, K., Thiele, L.: Power management schemes for heterogeneous clusters under quality of service requirements. In: Proceedings of the 2011 ACM Symposium on Applied Computing (SAC ’11), pp. 546–553 (2011)

41. Zhu, X., He, C., Li, K., Qin, X.: Adaptive energy-efficient scheduling for real-time tasks on DVS-enabled heterogeneous clusters. J. Parallel Distrib. Comput. 72(6), 751–763 (2012)

42. PowerTOP. https://01.org/powertop. Accessed 16 May 2015

George Terzopoulos is a Ph.D. student at the Department of Informatics of the Aristotle University of Thessaloniki, Greece. His research interests include simulation, performance evaluation and energy consumption of large-scale distributed systems.

Helen D. Karatza is a Professor at the Department of Informatics of the Aristotle University of Thessaloniki, Greece. Her research interests include performance evaluation of parallel and distributed systems, scheduling and simulation.
