


Robust Performance-Based Resource Provisioning Using a Steady-State Model for Multi-Objective Stochastic Programming

Kyle M. Tarplee, Senior Member, IEEE, Anthony A. Maciejewski, Fellow, IEEE, and Howard Jay Siegel, Fellow, IEEE

Abstract—Cloud computing has enabled entirely new business models for high-performance computing. Having a dedicated local high-performance computer is still an option for some, but more are turning to cloud computing resources to fulfill their high-performance computing needs. With cloud computing it is possible to tailor your computing infrastructure to perform best for your particular type of workload by selecting the correct number of machines of each type. This paper presents an efficient algorithm to find the best set of computing resources to allocate to the workload. This research is applicable to users provisioning cloud computing resources and to data center owners making purchasing decisions about physical hardware. Studies have shown that cloud computing machines have measurable variability in their performance. Some of the causes of performance variability include small changes in architecture, location within the datacenter, and neighboring applications consuming shared network resources. The proposed algorithm models the uncertainty in the computing resources and the variability in the tasks in a many-task computing environment to find a robust number of machines of each type necessary to process the workload. In addition, reward rate, cost, failure rate, and power consumption can be optimized, as desired, to compute Pareto fronts.

Index Terms—planning; stochastic programming; vector optimization; high performance computing; scheduling; resource management; bag-of-tasks; many-task computing; energy-aware; heterogeneous computing; linear programming


1 INTRODUCTION

Some high-performance computing (HPC) users are turning to cloud providers to complete their work due to the potential cost effectiveness and/or ease of use of cloud computing. The ability to provision hardware on-demand from a pre-defined set of different machine types, known as instance types, is very powerful. In fact, a proof of concept cluster was built by Amazon Web Services from their high performance instance types composed of over 26,000 cores with nodes connected via 10G Ethernet. This cluster ranked 101 on the Top 500 list for November 2014 [1]. The cloud has been successfully employed to process HPC jobs for actual scientific experiments [2]. Recent studies have shown that the performance of small and medium virtual clusters can compete with physical hardware [3]. Cloud infrastructure as a service (IaaS) providers [4] charge for the amount of time a virtual machine, known as an instance, is allocated (idle or active). This means that it is advantageous to terminate some or all instances once the workload has been processed. Leaving instances idle in the cloud is usually not cost effective. Once a new set of work needs to be processed, the decision of what instance types to start can be reevaluated each time, considering the size and composition of the workload and the current prices of the available instance types. Selecting the ideal number of instances of each instance type a user needs is challenging.

• K. M. Tarplee is with the Department of Physical Sciences and Engineering at Anderson University and the Department of Electrical and Computer Engineering at Colorado State University.

• A. A. Maciejewski is with the Department of Electrical and Computer Engineering at Colorado State University.

• H. J. Siegel is with the Department of Electrical and Computer Engineering and the Department of Computer Science at Colorado State University.

The approach to provisioning computational resources given in this paper not only applies to cloud resource provisioning but also to selecting physical machines to purchase for use within HPC systems. The goal for provisioning HPC systems is to determine how to originally select or upgrade a system in such a way that maximizes the performance of the resultant system while meeting specific requirements that often include a budget constraint.

The instance types available in the cloud have widely varying capabilities, by design, so that users can choose the resources that best match their workload and in doing so minimize the cost. For example, there is no need to provision high memory instance types if the workload does not require large amounts of memory. The cost for the smaller memory instance type will often be significantly less and provide nearly identical performance assuming all else is equal. Within a single IaaS provider, instance types vary in the amount of memory, number and type of CPUs, disk capacity and performance, and network performance. All of these properties of instance types affect the performance of the workload executing on the instances [5]. Due to the availability of heterogeneous resources, IaaS is inherently a heterogeneous computing (HC) system.

This paper focuses on bag-of-tasks or many-task computing (MTC) workloads composed of a large number of independent tasks. Each task is processed on a single machine. Bag-of-tasks workloads are commonly run on MTC systems [6].

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCC.2016.2608345

Copyright (c) 2016 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].


In MTC and high-throughput computing (HTC), the usual goal is to maximize the number of completed tasks or jobs per unit time. In this research, a steady-state model of MTC is presented and used to formulate a linear optimization problem that can be used to optimize the number of tasks completed per unit time. In this work, types of tasks are assigned different rewards for completion. The reward rate is the reward earned per unit time by the system. In some situations, maximizing solely the reward rate is not desirable. Sometimes a conflicting objective such as the upgrade cost should be optimized along with the reward rate. The optimization problem we pose has multiple objectives from which any subset can be chosen, or new objectives added, as needed. The four objectives in the multi-objective optimization problem are (1) the amount of reward earned per unit time, called the reward rate, (2) the net cost to upgrade the system, called the upgrade cost, (3) the number of machine failures per unit time, called the failure rate, and (4) the total machine power consumption.

When making a purchasing decision not all the information is necessarily available nor is the information perfectly accurate. For example, arrival rates of tasks or the performance of the machines might not be known perfectly. In fact, studies have shown that instances of the same instance type can vary significantly in performance as discussed in Section 2. Often only the distributional assumption can be made. That is, the probability distribution of key parameters is known but the actual value of the parameter is unknown. This uncertainty is incorporated in our steady-state model. A multi-objective stochastic programming problem formulation is presented that incorporates the uncertainty in the parameters. Stochastic programming techniques are applied to this provisioning problem to handle uncertainty.

In summary, the contributions of this paper are:

1) the formulation of an energy-aware steady-state model for MTC,

2) the design of a linear optimization problem for resource provisioning (in the cloud or physical hardware procurement),

3) a model of uncertainty and a procedure for fitting this model,

4) a stochastic programming formulation that combines the steady-state model and the uncertainty model,

5) an orthogonal weighted sum algorithm for generating Pareto fronts that illustrates potential trade-offs between conflicting objectives, and

6) a performance evaluation of the stochastic programming formulation.

The rest of this paper is organized as follows. First, some related work is given in Section 2. The steady-state model of MTC is in Section 3. Section 4 presents the model of uncertainty and Section 5 details the stochastic programming formulation. Quantitative measures for comparing the stochastic programming solution to other approaches are given in Section 6. Section 7 describes the heuristics implemented for comparison purposes. A technique to solve multi-objective stochastic programs is in Section 8. In Section 9, the simulation results are presented. Section 10 concludes this study and presents some ideas for future work.

2 RELATED WORK

Scheduling resources on the cloud is not new. An interesting technique was presented in [7] that uses ordinal optimization to approximately solve a multi-objective optimization problem. The approach schedules tasks to virtual clusters in the cloud. Scheduling preemptable tasks on cloud resources has been studied recently as well [8]. Their approach uses information from the actual execution times to improve the subsequent resource allocations. The HARNESS project is currently designing algorithms and implementing software to provision resources on the cloud that benefit from highly heterogeneous resources such as hardware accelerators and SSDs [9]. The uncertainty-aware scheduling approach presented here could one day be incorporated into such tools to improve the provisioning of resources.

A case study of using the cloud for HPC applications is in [10]. They show that the performance degradation due to virtualization is low but the networking performance can become a performance bottleneck if one is not careful.

The issue of HPC cluster reliability is addressed in [11]. This paper forms a bi-objective optimization problem to schedule tasks onto the cluster that has machines that vary in reliability. The author tries to minimize the maximum of all task completion times, known as the makespan, and maximize the reliability. The reliability measure described in Section 3 is similar to this measure.

We consider, in part, minimizing power consumption due to the explosive growth of power consumption in data centers and HPC systems over recent years [12]. Energy usage is becoming a major operating cost, so algorithms must consider it from the start.

A technique to automatically scale the number of resources up or down based on the dynamic arrival of workloads is presented in [13]. Their technique utilizes the inherent heterogeneity in the cloud offerings to select instance types. In a closely related paper [14], automatic scaling is addressed. The paper models the uncertainty in the execution times of the tasks but does not consider the heterogeneous aspects of IaaS.

Task scheduling is NP-hard so reasonable approximations to model the problem, and scalable algorithms for its solution, are sought [15]. To design a scalable algorithm we use a steady-state scheduling algorithm within the resource provisioning problem. A motivation for using steady-state scheduling is given in [16]. Our steady-state model is inspired by the Linear Programming Affinity Scheduling (LPAS) algorithm [17].

The performance variation within the cloud, from instance to instance, is high compared to traditional hardware as studied in [18]–[20]. In [18], the measured coefficient of variation (CoV) was up to 24 % for Amazon Web Service's EC2 instances. They also showed that the distribution of instance performance can be bi-modal. Co-location of instances on the same physical hardware can cause variation in performance of up to 2.5 times [20]. Another surprising finding is that I/O contention between VMs can cause up to a factor of 5 in performance degradation [19]. In [21], some uncertainty in the performance of an instance was taken into account when determining both a set of instance types and task schedule via a particle swarm optimization problem.


Due to this inherent uncertainty in cloud instance performance and task arrivals, the stochastic nature of the problem cannot be ignored. Stochastic programming is the approach we take to rigorously account for the stochastic nature of the problem [22]. Stochastic programming has been used for capacity expansion in telecommunications systems [23], which is a similar application to computational resource provisioning.

3 STEADY-STATE MODEL

Often there are millions of tasks and thousands of machines in large scale scheduling problems. To build scheduling and planning algorithms that have manageable run times for large problems, we use a steady-state formulation of the problem. This formulation focuses on the behavior of the system on average and avoids considering the scheduling of each task onto a particular machine. The steady-state model allows our algorithm's computational complexity to be independent of the number of tasks and machines in the problem. To build a compact steady-state model, we assume that the tasks of the workload can be grouped into a relatively small number of task types. All tasks belonging to a task type have similar run time and power consumption properties. This is often the case in scientific workloads. For instance, Monte Carlo simulations consist of a large number of tasks in the workload that can usually all be considered a single task type. Machines (or instances in the cloud) have a similar natural grouping called machine types (or instance types in the cloud). Task types and machine types will be used to reduce the computational complexity and enable a steady-state formulation of the problem.

The following steady-state model can be used for either determining which instance types to launch for cloud resource provisioning or determining which physical machines to purchase. Let there be $T$ task types and $M$ machine types. Let $ETC_{ij}$ be the estimated time to compute (ETC) a task of type $i$ running on a machine of type $j$. Likewise let $APC_{ij}$ be the average dynamic power consumption of a task of type $i$ running on a machine of type $j$. The static power consumption of a machine of type $j$ is given by $APC_{\emptyset j}$. The ETC and APC matrices are commonly used in scheduling applications for HC systems. These matrices are often found by benchmarking the tasks. To model machines being turned off when not in use, let $APC_{\emptyset j} = 0$.

This steady-state model can be used for either cloud provisioning or physical hardware purchasing by correctly defining the cost of the machines. Let $\beta^B_j$ be the buying price or cost of a machine of type $j$. For the physical hardware purchasing problem, the cost is the total cost of ownership of the hardware (likely on a depreciation schedule) including support and maintenance. The cost for a cloud instance is its cost per unit time that a user pays for running an instance of type $j$. Let $\beta^S_j$ be the selling price of a machine of type $j$. The selling price is only applicable to the purchasing of physical hardware. Typically $\beta^B_j > \beta^S_j \geq 0$.

Cloud IaaS providers often limit the number of instances a user can have without prior approval. Let $M^{cur}_j$, $M^{min}_j$, and $M^{max}_j$ be the current, minimum, and maximum number of machines of type $j$, respectively. Let $M^{min}$ and $M^{max}$ be the overall minimum and maximum number of machines, respectively. These parameters can be used to require the solution to adhere to those types of restrictions. When purchasing physical hardware these parameters may map to restrictions on the number of rack units available. Let $M^B_j$ and $M^S_j$ be the number of machines of type $j$ to buy and sell, respectively. The total number of machines of type $j$ is then $M_j = M^{cur}_j + M^B_j - M^S_j$. Due to the buying price being higher than the selling price, it makes no sense to buy and sell the same machine type.

Each task that is completed earns a reward based on the task type. Let $r_i$ be the reward earned for completing a task of type $i$. The number of tasks of type $i$ arriving per unit time is given by $\lambda_i$.

To compute the reward rate, failure rate, and power consumption, one must have a schedule that maps tasks to machines. In the steady state, it is sufficient to know the fraction of time a task of type $i$ is running on a machine of type $j$. Let $p_{ij}$ be this fraction of time. The matrix $p$ is referred to as the schedule as it denotes the fraction of time each task type will be running on each machine type.

Sometimes it is useful to control the system failure rate. Machine failures when not executing a task are ignored as they have little consequence. Let $\nu_j$ be the failure rate of a machine of type $j$; then the overall system failure rate is $\sum_i \sum_j \nu_j M_j p_{ij}$.
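The failure-rate expression above is a simple weighted sum. A minimal sketch, using entirely hypothetical per-hour failure rates and machine counts, shows the computation with NumPy broadcasting:

```python
import numpy as np

# Hypothetical example: T = 2 task types, M = 3 machine types.
nu = np.array([0.001, 0.004, 0.002])   # nu_j: failure rate of each machine type (failures/hour)
M_j = np.array([10, 5, 8])             # M_j: machines of each type
p = np.array([[0.6, 0.2, 0.1],         # p[i, j]: fraction of time a machine of
              [0.3, 0.5, 0.4]])        # type j spends on tasks of type i

# Overall system failure rate: sum_i sum_j nu_j * M_j * p_ij.
# nu * M_j broadcasts across the rows of p before summing every entry.
failure_rate = np.sum(nu * M_j * p)
```

Here `nu * M_j` is computed once per machine type and weighted by the column sums of the schedule, matching the double sum in the text.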

The optimization problem to determine the number of machines to buy and sell (i.e., $M^B$ and $M^S$) and the resultant schedule (i.e., $p$) when maximizing the reward rate is given by:

$$\underset{M^B,\,M^S,\,p}{\text{maximize}} \quad \sum_i r_i \sum_j \frac{1}{ETC_{ij}} M_j p_{ij} \tag{1a}$$

subject to:

$$\forall j \quad M^B_j \geq 0, \quad M^S_j \geq 0 \tag{1b}$$
$$\forall i \quad \sum_j \frac{1}{ETC_{ij}} M_j p_{ij} \leq \lambda_i \tag{1c}$$
$$\forall j \quad \sum_i p_{ij} \leq 1 \tag{1d}$$
$$\forall i, j \quad p_{ij} \geq 0 \tag{1e}$$

The optimization problem in (1) will maximize the reward earned per unit time, namely the reward rate. The number of machines to buy and sell is required to be nonnegative, (1b). The arrival rate constraint is given by (1c). The machine utilization constraint, (1d), ensures that machines work no more than 100 % of the time on processing tasks. The last constraint, (1e), ensures that the fraction of time a machine is processing tasks is nonnegative.

The optimization problem in (1) has a non-linear objective (1a) and a non-linear constraint (1c) due to the terms that contain $M_j p_{ij}$. Both $M_j$ and $p_{ij}$ are decision variables in the optimization problem. Solving non-linear optimization problems is considerably more computationally expensive than solving linear optimization problems. Fortunately, this non-linear problem can be transformed into an equivalent linear optimization problem. One can replace $M_j p_{ij}$ with a single variable, $\hat{p}_{ij}$. The variable $\hat{p}_{ij}$ can be interpreted as the


effective number of machines of type $j$ that are running tasks of type $i$. The constraint, (1d), can be rewritten as

$$\forall j \quad \sum_i \hat{p}_{ij} \leq M_j \tag{2}$$

because $M_j \geq 0$ and $p_{ij} \geq 0$. After adding the objectives (e.g., cost, failure rate, and power) and constraints, and converting the non-linear problem in (1) to a linear optimization problem, the steady-state model based optimization problem becomes

$$\underset{M^B,\,M^S,\,\hat{p}}{\text{minimize}} \quad \begin{bmatrix} -\sum_i r_i \sum_j \frac{1}{ETC_{ij}} \hat{p}_{ij} \\ \sum_j M^B_j \beta^B_j - \sum_j M^S_j \beta^S_j \\ \sum_i \sum_j \nu_j \hat{p}_{ij} \\ \sum_i \sum_j APC_{ij} \hat{p}_{ij} + \sum_j APC_{\emptyset j} M_j \end{bmatrix} \tag{3a}$$

subject to:

$$\forall j \quad M^{min}_j \leq M_j \leq M^{max}_j \tag{3b}$$
$$M^{min} \leq \sum_j M_j \leq M^{max} \tag{3c}$$
$$\forall j \quad M^B_j \geq 0, \quad M^S_j \geq 0 \tag{3d}$$
$$\forall i \quad \sum_j \frac{1}{ETC_{ij}} \hat{p}_{ij} \leq \lambda_i \tag{3e}$$
$$\forall j \quad \sum_i \hat{p}_{ij} \leq M_j \tag{3f}$$
$$\forall i, j \quad \hat{p}_{ij} \geq 0 \tag{3g}$$
$$\sum_j M^B_j \beta^B_j - \sum_j M^S_j \beta^S_j \leq \beta \tag{3h}$$
$$\sum_i \sum_j \nu_j \hat{p}_{ij} \leq \nu^{max} \tag{3i}$$
$$\sum_i \sum_j APC_{ij} \hat{p}_{ij} + \sum_j APC_{\emptyset j} M_j \leq P^{max} \tag{3j}$$

In (3), all the objectives are to be minimized. The first objective in (3a) is the negative of the reward rate. If $\forall i \; r_i = 1$, the reward rate objective function is the number of tasks completed per unit time, also known as the throughput of the system. The second objective is the upgrade cost. This is the cost of purchasing machines minus the cost of selling machines. The third objective is the system failure rate. The last objective is the power consumption of the system including static power consumption.

The constraints (3b) and (3c) limit the number of machines of each type and in total, respectively. The constraints in the original problem correspond to (3d) to (3g). It is often beneficial to include bounds on the objective functions in multi-objective optimization problems. The linear optimization problem (3) has three additional constraints corresponding to a bound on the upgrade cost with budget $\beta$, a failure rate bound $\nu^{max}$, and a power consumption bound $P^{max}$ as constraints (3h) to (3j).

The optimization problem in (3) is a linear programming problem [24]. Single objectives of this problem can be quickly solved with either the simplex algorithm or an interior point algorithm. A discussion of how to solve the multi-objective problem is presented in Section 8.
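A minimal sketch of solving the single-objective reward-rate case is below. It is not the paper's implementation: it uses SciPy's `linprog` as one off-the-shelf LP solver, fixes the machine counts $M_j$ (no buying or selling), and uses entirely hypothetical ETC, reward, and arrival-rate values; only the rate constraint (3e) and utilization constraint (3f) are enforced.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: T = 2 task types, M = 2 machine types.
ETC = np.array([[2.0, 4.0],    # ETC[i, j]: hours per task of type i on machine type j
                [3.0, 1.5]])
r   = np.array([5.0, 8.0])     # reward per completed task of each type
lam = np.array([20.0, 30.0])   # arrival rate (tasks/hour) of each task type
M_j = np.array([10, 12])       # fixed number of machines of each type (no buy/sell)

T, M = ETC.shape
# Decision variables: p_hat[i, j] flattened row-major (effective machines of
# type j running tasks of type i).  linprog minimizes, so negate the reward.
c = -(r[:, None] / ETC).ravel()

# Constraint (3e): for each i, sum_j p_hat_ij / ETC_ij <= lambda_i.
A_rate = np.zeros((T, T * M))
for i in range(T):
    A_rate[i, i * M:(i + 1) * M] = 1.0 / ETC[i]
# Constraint (3f): for each j, sum_i p_hat_ij <= M_j.
A_util = np.zeros((M, T * M))
for j in range(M):
    A_util[j, j::M] = 1.0

res = linprog(c, A_ub=np.vstack([A_rate, A_util]),
              b_ub=np.concatenate([lam, M_j]), bounds=(0, None))
reward_rate = -res.fun
```

With these numbers both machine types end up devoted to the second task type, which pays the most reward per machine-hour; the arrival-rate constraints are slack and the utilization constraints bind.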

The computational complexity of optimally solving a single objective of the linear programming problem in (3) is the square of the number of constraints multiplied by the number of decision variables [24]. The number of non-trivial constraints is $4M + 4$ and the number of decision variables is $M + M + TM$. Therefore, solving the linear program via the simplex algorithm would have a computational complexity of $O\big((4M + 4)^2 (TM + 2M)\big)$. The computational complexity does not depend on the number of machines or tasks. Instead the complexity depends only on the number of types of machines and types of tasks used in the model.
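The count above can be made concrete. A small helper, assuming the (constraints² × variables) rule of thumb stated in the text, shows that the estimate depends only on the number of task and machine *types*:

```python
def simplex_complexity(T, M):
    """Rough operation-count estimate for one objective of problem (3),
    following the (constraints^2 x variables) rule of thumb from the text."""
    constraints = 4 * M + 4        # non-trivial constraints
    variables = M + M + T * M      # M^B, M^S, and the T*M schedule variables
    return constraints ** 2 * variables

# Whether the workload has a thousand or a million tasks, only the number of
# types matters: e.g. 5 task types and 10 machine types.
cost = simplex_complexity(T=5, M=10)   # (4*10 + 4)^2 * (10 + 10 + 50) = 135520
```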

4 PARAMETER UNCERTAINTY MODEL

4.1 Overview

Uncertainty is a fact of life. Few parameters are known perfectly, so employing (3) is difficult because the effect of the unknown parameters on the solution is hard to ascertain. A model of the uncertainty in the parameters is needed to rigorously find and evaluate the solution to (3). A major source of uncertainty is the execution time and power consumption of the tasks running on the various machines. Subsection 4.2 describes a model that decomposes ETC and APC into parts. These parts are used in Subsection 4.3 to model the randomness in the system due to uncertainty in execution time.

4.2 Execution Time and Power Consumption Models

This model decomposes ETC into a linear combination of abstract computational operations. As we will see, these abstract operations need not map to physical (machine) instructions.

Let the number of abstract operation types be $L$, which is as small as possible to permit a sufficiently accurate model. Let $\eta_{il}$ be the number of abstract operations of type $l$ necessary to complete a task of type $i$. Let $\tau_{lj}$ be the time to complete one abstract operation of type $l$ on a machine of type $j$. Then $ETC_{ij} = \sum_l \eta_{il} \tau_{lj}$. In matrix form this is

$$ETC = \eta \tau. \tag{4}$$

This model can represent task heterogeneity and machine heterogeneity within the ETC matrix. For $L > 1$, the model allows for arbitrary task machine affinity [25]. This minimal model is a mixture model that has the necessary properties for characterizing the execution time characteristics of an HC system. The model splits the ETC into two components. The first component is $\eta$, which represents the size of the tasks in terms of operations and is only dependent on the task types' properties. The second component $\tau$ is a property of only the machine types. This decomposition is useful in representing the components of ETC as correlated random variables. In practice an accurate model can be found with a small value for $L$.

The APC matrix can be decomposed similarly. Let $\psi_{lj}$ be the dynamic energy to execute an abstract operation of type $l$ on a machine of type $j$. The energy of type $l$ abstract operations is then $\eta_{il} \psi_{lj}$. The total energy can be computed as $\sum_l \eta_{il} \psi_{lj}$. The total average dynamic power consumption is

$$APC_{ij} = \frac{\sum_l \eta_{il} \psi_{lj}}{\sum_l \eta_{il} \tau_{lj}}. \tag{5}$$


Equation (5) can be represented in matrix form, where division is element-wise, as

$$APC = \frac{\eta \psi}{\eta \tau}. \tag{6}$$

Generally only ETC and APC are available so the above model parameters, namely $\eta$, $\tau$, and $\psi$, must be derived. These three parameter matrices have all nonnegative elements. The first step is then to compute the nonnegative matrix factorization (NNMF) of ETC to find $\eta$ and $\tau$ [26]. The NNMF is similar to the singular value decomposition but the NNMF produces nonnegative matrices for the decomposition. The energy can be written as $APC * ETC = \eta \psi$ so we can approximate $\psi$ via least squares. In the E3 environment, described in Subsection 9.4, the relative errors from this approach for ETC and APC are 0.8 % and 1.5 %, respectively.
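The two-step fit above can be sketched as follows. This is not the paper's code: it uses scikit-learn's `NMF` as one concrete NNMF implementation and plain least squares (`numpy.linalg.lstsq`) for $\psi$, on synthetic matrices built from a known $\eta$, $\tau$, $\psi$ so the reconstruction error can be checked.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
T, M, L = 6, 4, 2                       # task types, machine types, abstract op types

# Synthetic ground truth so the fit can be verified.
eta_true = rng.uniform(1, 10, (T, L))   # operations of each abstract type per task
tau_true = rng.uniform(0.1, 1, (L, M))  # seconds per abstract operation per machine
psi_true = rng.uniform(0.5, 5, (L, M))  # joules per abstract operation per machine
ETC = eta_true @ tau_true               # equation (4)
APC = (eta_true @ psi_true) / ETC       # element-wise division, as in (6)

# Step 1: NNMF of ETC yields eta and tau (up to scaling/permutation).
model = NMF(n_components=L, init="nndsvda", max_iter=2000, tol=1e-8)
eta = model.fit_transform(ETC)
tau = model.components_

# Step 2: APC * ETC = eta @ psi, so recover psi by least squares.
psi, *_ = np.linalg.lstsq(eta, APC * ETC, rcond=None)

rel_err_etc = np.linalg.norm(eta @ tau - ETC) / np.linalg.norm(ETC)
rel_err_apc = np.linalg.norm((eta @ psi) / (eta @ tau) - APC) / np.linalg.norm(APC)
```

Because the synthetic ETC here is exactly rank $L$, both relative errors come out very small; real benchmarked matrices would behave more like the 0.8 % and 1.5 % figures the paper reports.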

Recall that for a task of type $i$ running on a machine of type $j$ the run time is $\sum_l \eta_{il} \tau_{lj}$. Each term in the summation is the time that a particular abstract operation $l$ is running. These abstract computational operations are found by decomposing the ETC matrix with a given number of abstract operation types $L$. These abstract operations represent resource bound operations within a computing system including CPU operations, disk and/or network I/O operations, and memory. The algorithm provides no mapping from abstract operation to an actual operation or a set of operations; however, the algorithm does provide the correlation among task type run times.

4.3 Parameter Distributions

The dominant sources of uncertainty are generally from the arrival rates λ, ETC, and APC parameters. If the probability distributions of the ETC and APC model are known, then they can be used directly. If instead only the CoV of ETC is known, then the following procedure can be used to estimate the variances of the elements of τ. Here we assume that all the uncertainty is in the machines and not in the tasks, as was discussed in [18].

The mean and variance of the ETC entries are

\[ \mathrm{E}[\mathrm{ETC}_{ij}] = \sum_l \eta_{il}\,\mathrm{E}[\tau_{lj}] \tag{7} \]

\[ \mathrm{Var}[\mathrm{ETC}_{ij}] = \sum_l \eta_{il}^2\,\mathrm{Var}[\tau_{lj}] \tag{8} \]

assuming all the elements of τ are independent. Using the definition of CoV, then substituting from (8), and finally squaring both sides, we have the following three equations:

\[ \sqrt{\mathrm{Var}[\mathrm{ETC}_{ij}]} = \mathrm{CoV}_{ij}\,\mathrm{E}[\mathrm{ETC}_{ij}] \tag{9} \]

\[ \sqrt{\sum_l \eta_{il}^2\,\mathrm{Var}[\tau_{lj}]} = \mathrm{CoV}_{ij}\,\mathrm{E}[\mathrm{ETC}_{ij}] \tag{10} \]

\[ \sum_l \eta_{il}^2\,\mathrm{Var}[\tau_{lj}] = \mathrm{CoV}_{ij}^2\,\mathrm{E}^2[\mathrm{ETC}_{ij}] \,. \tag{11} \]

Converted to matrix form this is

\[ \eta^2 \sigma^2 = \mathrm{CoV}^2 * \mathrm{ETC}^2 \tag{12} \]

where the squares are element-wise. One can then compute σ² via least squares. For the E3 environment described in Subsection 9.4, the relative error of this approach is 0.8% for the variance of τ.
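The least-squares recovery of the variances in (12) can be sketched on synthetic data (the shapes and values below are made up for illustration; in practice only the CoV and the mean ETC would be given):

```python
import numpy as np

# Hypothetical shapes: T task types, L abstract operations, M machine types.
rng = np.random.default_rng(1)
T, L, M = 4, 2, 3
eta = rng.random((T, L))
var_tau_true = rng.random((L, M))          # Var[tau]; unknown in practice
ETC_mean = eta @ rng.random((L, M)) + 1.0  # stand-in for E[ETC]

# Build a consistent CoV from (8) and (9), then form the right side of (12).
var_ETC = (eta ** 2) @ var_tau_true        # element-wise squares of eta, per (8)
CoV = np.sqrt(var_ETC) / ETC_mean          # definition of CoV, per (9)
rhs = (CoV ** 2) * (ETC_mean ** 2)         # CoV^2 * ETC^2, element-wise

# Solve eta^2 sigma^2 = rhs for sigma^2 (= Var[tau]) by least squares.
var_tau_est, *_ = np.linalg.lstsq(eta ** 2, rhs, rcond=None)
print(np.allclose(var_tau_est, var_tau_true))  # True: the system is exactly consistent
```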

Nearly any probability distribution can be used within the stochastic programming formulation to model uncertainty. Even multi-modal distributions can be used. The parameters in the steady-state model are averages of execution times and arrival rates. There are no parameters that model temporal changes in arrival rate, as it is a steady-state model. Thus, for our simulations we used uniform distributions for all uncertain parameters. Each uniform distribution is specified by a mean and a variance. To ensure the mean is preserved and that parameters only take on positive values, the variance was capped as necessary.
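A minimal sketch of one such capping rule, assuming a symmetric uniform on [mean − h, mean + h] (so the variance is h²/3 and positivity requires variance ≤ mean²/3); the paper does not spell out its exact rule beyond preserving the mean and positivity:

```python
import numpy as np

def positive_uniform(mean, variance, size, rng):
    """Uniform samples with the given mean; variance capped so samples
    stay nonnegative.  A uniform on [mean - h, mean + h] has variance
    h^2 / 3, so h = sqrt(3 v) and positivity needs v <= mean^2 / 3."""
    v = min(variance, mean ** 2 / 3.0)  # cap the variance, preserve the mean
    h = np.sqrt(3.0 * v)
    return rng.uniform(mean - h, mean + h, size)

rng = np.random.default_rng(0)
# Requested variance 1.0 exceeds the cap 1/3, so the cap binds here.
x = positive_uniform(mean=1.0, variance=1.0, size=100_000, rng=rng)
print(x.min() >= 0.0, abs(x.mean() - 1.0) < 0.01)
```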

5 STOCHASTIC PROGRAMMING

Stochastic linear programming is an extension of linear programming, where some of the coefficients in the objective and the constraints are random variables. For more information on stochastic programming see [27].

The particular stochastic program we use is the recourse problem (RP), given in standard form as

\[
\begin{aligned}
& \underset{x}{\text{minimize}} && c^T x + \mathrm{E}_\xi\left[Q(x,\xi)\right] \\
& \text{subject to:} && Ax = b, \quad x \ge 0 \\
& \text{where:} && Q(x,\xi) = \min_{y}\; q(\xi)^T y(\xi) \\
& \text{such that:} && T(\xi)\,x + W(\xi)\,y(\xi) = h(\xi), \quad y(\xi) \ge 0
\end{aligned}
\tag{13}
\]

where ξ is a random vector representing the uncertain parameters.

For the RP in (13), the first stage decision variable x is a flattened version of MB and MS. The second stage decisions y are flattened versions of the schedule p. The constraints of the steady-state model that contain random parameters or elements of p, namely (3e) to (3g), (3i), and (3j), are represented by T, W, and h. The constraints without any random variables, and thus no dependence on the scenarios, (3b) to (3d) and (3h), define A and b in (13). The objective coefficients are separated in a similar way. The coefficients that are deterministic are incorporated into c and the coefficients that are random are incorporated into q.

This linear RP is similar to a linear program except for the expectation of the value function, Q(x, ξ), in the objective. The RP in (13) is known as a two-stage RP. The optimization problem finds the optimal x that minimizes the sum of the linear function c^T x and the expectation of Q(x, ξ).1 The second stage optimization problem finds the optimal y given a fixed realization of ξ. The random vector y is known as the recourse decision vector. This RP finds a robust solution for x in the sense that the objective value will on average be minimal when the optimal value of x is used. The solution x is robust to unknown values of the parameters ξ. The vector x is often referred to as a strategy of the RP.

There are many ways to solve stochastic programs. Wewill only describe a rather versatile approach to solving

1. This formulation of the value function is risk neutral. Risk-seeking and risk-averse formulations are also possible.


large scale stochastic programs that utilizes sample average approximation (SAA) to build the deterministic equivalent program (DEP).2 The primary issue with stochastic programming is accurately computing the expectation in the objective function. The SAA approach uses many samples of ξ to compute the expectation as a sample average. Realizations of ξ are called scenarios. The process of creating scenarios is known as scenario generation. When using SAA, generating a reasonably small set of representative scenarios is important. Let there be K scenarios. For scenario k, let the probability of occurring be given by p_k. With SAA, the scenarios are generated by randomly sampling ξ according to its distribution, thus all the samples are equally probable. The expectation operator is linear and is applied to a linear function, namely q^T y(ξ). This lends itself to a linear program that is much larger, yet is equivalent to the stochastic program (in the limit as K → ∞) given in (13). The DEP is given by

\[
\begin{aligned}
& \underset{x,\,y_k}{\text{minimize}} && c^T x + p_1 q_1^T y_1 + \cdots + p_K q_K^T y_K \\
& \text{subject to:} && Ax = b \\
& && T_1 x + W_1 y_1 = h_1 \\
& && \qquad\vdots \\
& && T_K x + W_K y_K = h_K \\
& && x,\, y_1, \cdots, y_K \ge 0
\end{aligned}
\tag{14}
\]

Each scenario (i.e., realization of ξ) defines a set of matrices T_k and W_k, and vectors q_k and h_k. Each scenario also introduces a new vector of decision variables y_k into the problem.

The SAA is an unbiased estimator of the mean. In practice, it converges to the mean quickly in K. The DEP can have a large number of variables and constraints. For very large problems, a technique called the L-shaped method can be used to exploit the block structure of the constraint matrix to distribute the work of solving the linear program to many nodes [28].
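To illustrate the DEP construction, the toy two-stage problem below (a newsvendor-style example of our own, not the paper's provisioning model) samples K scenarios and solves the resulting single large LP, which has exactly the block structure of (14): one coupling column for x plus one block per scenario.

```python
import numpy as np
from scipy.optimize import linprog

# Toy two-stage recourse problem solved via its DEP.  First stage: buy x
# units at unit cost 1.  Then demand d_k is realized and we sell
# y_k <= min(x, d_k) at unit price 2.  Objective: cost minus expected revenue.
rng = np.random.default_rng(0)
K = 200                                   # number of SAA scenarios
d = rng.uniform(0.0, 10.0, K)             # sampled demands, equally probable
p = np.full(K, 1.0 / K)

c = np.concatenate(([1.0], -2.0 * p))     # variables: x, then y_1 .. y_K
A_ub = np.hstack([-np.ones((K, 1)), np.eye(K)])   # coupling rows: y_k - x <= 0
b_ub = np.zeros(K)
bounds = [(0, None)] + [(0, dk) for dk in d]      # 0 <= y_k <= d_k

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_star = res.x[0]   # first stage decision; lands near the sample median demand
```

For this cost/price pair the optimal order quantity is the 0.5 quantile of demand, so with demand uniform on [0, 10] the solution should sit near 5.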

The problem is broken into two coupled decisions. The first is what to provision or purchase, namely MB and MS. Then the random variables in the problem are realized and the second decision, known as the recourse decision, can be made. For this work, the random variables are the arrival rates, execution times, and power consumption, but virtually any other parameter in the model can be converted to a random variable. The recourse decision involves selecting the schedule p that is optimal for the actual arrival rates, execution times, and power consumption of the tasks.

6 VALUE OF INFORMATION

To provide insight into different decision making approaches, we will evaluate other approaches besides the standard RP. Let z(x, ξ) be the objective value using the optimal second stage decision given the first stage decision x and a particular scenario ξ.

The wait-and-see (WS) solutions are found by waitinguntil ξ is realized then computing the optimal solution.

2. The cloud resource provisioning problem is a two-stage stochastic program with relatively complete recourse [27].

Formally, the objective value of the WS solution is given by

\[ \mathrm{WS} = \mathrm{E}_\xi\left[\min_x z(x,\xi)\right]. \tag{15} \]

This objective value is generally unachievable because it requires perfect information about the random variable. Thus, it provides a lower bound on the problem that cannot be attained in practice.

The RP is found by solving (13) or (14). The objective value of the RP is given by

\[ \mathrm{RP} = \min_x \mathrm{E}_\xi\left[z(x,\xi)\right]. \tag{16} \]

This objective value is achievable and is the optimal strategyor solution to the problem.

An optimization problem that is often used in lieu of the RP is the expected value problem. This problem is also known as the mean value problem (MVP) because it uses only the means of all parameters to pose the optimization problem. Specifically, the objective value for the MVP is

\[ \mathrm{EV} = \min_x z\left(x, \mathrm{E}_\xi[\xi]\right). \tag{17} \]

Let x_EV be the optimal solution to the MVP in (17). To compare this solution to the WS and RP solutions, one has to take the expected value over ξ of using x_EV. Specifically, this is

\[ \mathrm{EEV} = \mathrm{E}_\xi\left[z(x_{EV}, \xi)\right]. \tag{18} \]

This equation uses the optimal second stage decision while fixing the first stage to the potentially suboptimal decision x_EV, which is based only on the mean values of the parameters.

There are two standard measures used to compare these three standard approaches. The first is the expected value of perfect information (EVPI), defined as EVPI = RP − WS ≥ 0. The second is the value of the stochastic solution (VSS), defined as VSS = EEV − RP ≥ 0. The EVPI is the amount the objective value for the RP, on average, would be lowered (i.e., improved) if the random vector ξ were known perfectly. VSS is the expected improvement in the objective value over the MVP if the uncertainty in the parameters is handled properly by using the RP.
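The quantities WS, RP, EEV, EVPI, and VSS can be illustrated on a tiny two-scenario recourse problem (the numbers below are made up; `z` plays the role of the objective value for a fixed scenario):

```python
import numpy as np

# Tiny recourse problem: buy x at unit cost 1; demand d is 2 or 8 with equal
# probability; sell min(x, d) at unit price 3.  z(x, d) is the objective.
demands = np.array([2.0, 8.0])
probs = np.array([0.5, 0.5])

def z(x, d):
    return x - 3.0 * np.minimum(x, d)

xs = np.linspace(0.0, 10.0, 10001)   # brute-force search over x

WS = np.dot(probs, [z(xs, d).min() for d in demands])        # perfect info
RP = (probs @ np.vstack([z(xs, d) for d in demands])).min()  # robust solution
x_EV = float(probs @ demands)        # MVP: optimize for the mean demand
EEV = float(probs @ z(x_EV, demands))  # MVP decision under true expectation

EVPI = RP - WS
VSS = EEV - RP
print(round(WS, 4), round(RP, 4), round(EEV, 4),
      round(EVPI, 4), round(VSS, 4))   # -10.0 -7.0 -5.5 3.0 1.5
```

Both inequalities hold as expected: WS ≤ RP ≤ EEV, so EVPI and VSS are nonnegative.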

7 TRADITIONAL HEURISTIC STRATEGIES

Traditional strategies for purchasing hardware or provisioning cloud resources usually involve selecting the single machine type that has the “best” desired property. Often the price and performance are used to select the machine to purchase. For comparison, we define two common heuristics followed by one heuristic specific to our problem formulation. For each heuristic, the machine type to purchase, j∗, is selected by

\[
\begin{aligned}
\text{H1:} \quad & j^* = \arg\max_j \sum_i \frac{1}{\mathrm{ETC}_{ij}} \\
\text{H2:} \quad & j^* = \arg\max_j \frac{1}{\beta^B_j} \sum_i \frac{1}{\mathrm{ETC}_{ij}} \\
\text{H3:} \quad & j^* = \arg\max_j \frac{1}{\beta^B_j} \sum_i \lambda_i\, r_i\, \frac{1}{\mathrm{ETC}_{ij}}
\end{aligned}
\tag{19}
\]

The three heuristics will be referred to as H1, H2, and H3. For the selected machine type, the next step is to find the maximum number of machines that do not violate any


constraints (such as budget and power). The strategy is simply to let \(M^B_{j^*}\) equal the maximum number of machines of that type that is feasible.

The first heuristic, H1, selects the machine that performs the best across all task types. Heuristic H2 uses the price and performance ratio to select the machine to purchase. Heuristic H3 weights the machine performance by the arrival rate and the reward for each task type. There are clear limitations of these heuristics in terms of flexibility and performance. H1 and H2 do not consider workload or differing reward between task types. None of the heuristics incorporate reliability or the option to potentially sell machines. The heuristics only select one machine type to purchase. As the results in Section 9 will show, typically the best solution is found by combining multiple machine types to match the load. None of the heuristics account for uncertainty in the parameters.
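The selection rules in (19) transcribe directly into a few lines of numpy. The sketch below evaluates them on the E1 ETC from Table 1 with unit costs, arrival rates, and rewards (the E1 parameters from Subsection 9.2); the function and variable names are ours.

```python
import numpy as np

def select_machine_type(ETC, beta, lam, r):
    """Heuristics H1-H3 from (19); ETC is task types x machine types.

    H1: best raw performance summed over task types.
    H2: performance per unit purchase cost beta.
    H3: performance weighted by arrival rate lam and reward r, per unit cost.
    """
    perf = 1.0 / ETC                       # per-task performance (rate-like)
    h1 = np.argmax(perf.sum(axis=0))
    h2 = np.argmax(perf.sum(axis=0) / beta)
    h3 = np.argmax((lam * r) @ perf / beta)
    return int(h1), int(h2), int(h3)

# E1 environment: 2 task types, 3 machine types (Table 1).
ETC = np.array([[1.0, 3.0, 100.0],
                [100.0, 3.0, 1.1]])
beta = np.array([1.0, 1.0, 1.0])   # purchase cost per machine type
lam = np.array([1.0, 1.0])         # mean arrival rates
r = np.array([1.0, 1.0])           # reward per task type
print(select_machine_type(ETC, beta, lam, r))   # (0, 0, 0): machine type 1
```

All three heuristics pick machine type 1 (index 0), consistent with the E1 results in Table 2.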

These strategies simply define the first stage of the problem. They do not describe how to schedule the tasks after such a purchasing strategy is executed. To provide a fair comparison, we use the optimal schedule from the MVP given that this particular strategy was chosen. Then we compute the expected value of the objective using (18) with x_EV replaced by the purchasing decision made by the heuristic.

8 MULTI-OBJECTIVE STOCHASTIC PROGRAMMING

8.1 Introduction

The RP derived from the base problem in (3) is a multi-objective stochastic programming problem. In this section, we extend the scalar stochastic programming problem described in Section 5 to the multi-objective case [29], [30].

Multi-objective optimization is challenging because there is usually no single solution that is superior to all others. Instead, there is a set of superior feasible solutions that are referred to as the non-dominated solutions [31]. When all objectives are to be minimized, a feasible solution x1 dominates a feasible solution x2 when

\[ \forall d \;\; f_d(x_1) \le f_d(x_2) \quad\text{and}\quad \exists d \;\; f_d(x_1) < f_d(x_2) \tag{20} \]

where f_d(·) is the dth objective function. Feasible solutions that are dominated are generally of little interest because one can always find a better solution from the non-dominated set. The non-dominated solutions, also known as outcomes and efficient points, compose the Pareto front.
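The dominance test in (20) translates directly into a simple (quadratic-time) filter; the objective values below are made up for illustration:

```python
import numpy as np

def non_dominated(F):
    """Mask of the non-dominated rows of F (rows = solutions, columns =
    objectives, all minimized), per the dominance definition in (20)."""
    n = F.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i: j is <= everywhere and strictly < somewhere
            if i != j and np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                keep[i] = False
                break
    return keep

# Made-up objective values for five candidate solutions (two objectives).
F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0], [2.0, 2.0]])
print(non_dominated(F).astype(int))   # [1 1 0 1 1]: only row 2 is dominated
```

Note that the two identical rows [2, 2] do not dominate each other, since neither is strictly better in any objective.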

There are many techniques for solving multi-objective optimization problems. For linear optimization problems, there are two primary approaches. The first is Benson’s algorithm, which iteratively refines the Pareto front [32], [33]. The second is a technique that converts the multi-objective problem into a set of scalar optimization problems through a process called scalarization. There are many scalarization techniques, but most are specializations of Pascoletti-Serafini scalarization [34], such as the weighted sum algorithm. We use the weighted sum algorithm, described in Subsection 8.2, to compute the Pareto front for the multi-objective stochastic program.

8.2 Weighted Sum Algorithm

The weighted sum algorithm forms the positive convex combination of the objectives and then sweeps the weights to generate the Pareto front [35]. The optimization problem in (14) is linear and convex, thus by Theorem 3.15 in [36] the weighted sum algorithm can find all of the non-dominated solutions.

Consider a D-objective optimization problem. Let ω be the weight vector used to combine the objectives. To avoid a degenerate objective function, exclude ω = 0 from the set of weights by imposing the somewhat arbitrary constraint that \(\sum_{i=1}^{D} \omega_i = 1\). The vector ω is in a (D − 1)-dimensional linear subspace of \(\mathbb{R}^D\).

The first step in the weighted sum algorithm is to compute the utopia and nadir points. Let the optimal solution vector for objective d be \(q_d = \arg\min f_d(\cdot)\). The dth elements of the utopia and nadir points are then computed as

\[ \forall d \quad y^{\mathrm{utopia}}_d = \min\big(f_d(q_1), \ldots, f_d(q_D)\big) = f_d(q_d) \tag{21} \]

\[ \forall d \quad y^{\mathrm{nadir}}_d = \max\big(f_d(q_1), \ldots, f_d(q_D)\big). \tag{22} \]

In other words, the utopia point is the best possible value for all objectives and the nadir point is the worst possible value from optimizing each objective individually. These two vectors are used to normalize the objective functions to better span the space. The utopia and nadir points also remove all units from the objective, making the scalarized objective unitless. The normalized objective function is defined as

\[ \bar{f}(x) = \frac{f(x) - y^{\mathrm{nadir}}}{y^{\mathrm{nadir}} - y^{\mathrm{utopia}}} \tag{23} \]

where the division is taken to be element-wise.

The second step in the weighted sum algorithm is traditionally performed by recursively combining the objectives while ensuring that the weights sum to one [35]. Let ω¹ = 1; then the recursion is

\[ \forall d \in \{2, 3, \cdots, D\} \quad \omega^d = \begin{bmatrix} \alpha_{d-1}\,\omega^{d-1} \\ 1 - \alpha_{d-1} \end{bmatrix} \tag{24} \]

The final weight vector is \(\omega^D \in \mathbb{R}^D\). To sweep the space, each α_d is varied uniformly from 0 to 1. This approach produces duplicate weight vectors and performs non-uniform sampling in the subspace defined by \(\sum_{i=1}^{D}\omega_i = 1\).

The orthogonal weighted sum algorithm finds an orthonormal basis (i.e., spanning set) for the null space of \(\mathbf{1}_D\) and sweeps independently in the (D − 1)-dimensional space defined by the basis vectors. Weight vectors with any negative components are dropped. This sweeps the subspace uniformly; therefore, it does not prefer any objective to any other. To ensure the whole space is swept, one must sweep each \(\alpha_d\) from \(-\Delta\) to \(+\Delta\), where \(\Delta = \sqrt{1 - \tfrac{1}{D}}\).

The third step in the weighted sum algorithm is to solve the optimization problem for each weight computed in step two. Specifically, for each l solve

\[ \min_x\; \omega_l^T \bar{f}(x) \tag{25} \]

to find the Pareto front.

To compare the recursive and the orthogonal sweeping methods, Fig. 1 shows the weights in the subspace where the weights sum to one. The recursive algorithm in Fig. 1(a)


uses more samples in the bottom right corner than it does elsewhere. That region is no more important than the rest of the space, so it is wasted sampling. Using roughly the same number of points, Fig. 1(b) more uniformly distributes the samples. The orthogonal weighted sum algorithm was used to generate the Pareto fronts in Section 9.

Fig. 1: Comparison of the recursive and the orthogonal weights drawn in the 2D plane that they span: (a) recursive; (b) orthogonal. The recursive algorithm unnecessarily puts more samples in the bottom right quadrant while the orthogonal algorithm more uniformly spans the space, thus making better use of a limited number of samples.
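A sketch of the orthogonal weight sweep, assuming the sweep is centered at the uniform weight (1/D, …, 1/D) on the simplex (the paper does not state the affine offset explicitly; the function name and grid size are ours):

```python
import numpy as np

def orthogonal_weights(D, steps):
    """Sweep a (D-1)-dimensional grid in the null space of 1_D around the
    centroid (1/D, ..., 1/D), keeping only nonnegative weight vectors."""
    ones = np.ones((1, D))
    # Orthonormal basis for the null space of 1_D via the SVD.
    _, _, Vt = np.linalg.svd(ones)
    basis = Vt[1:]                        # (D-1) x D, rows orthonormal
    delta = np.sqrt(1.0 - 1.0 / D)        # sweep each alpha in [-delta, delta]
    alphas = np.linspace(-delta, delta, steps)
    grids = np.meshgrid(*[alphas] * (D - 1), indexing="ij")
    coords = np.stack([g.ravel() for g in grids], axis=1)
    W = coords @ basis + 1.0 / D          # every row sums to one
    return W[np.all(W >= 0.0, axis=1)]    # drop negative components

W = orthogonal_weights(D=3, steps=21)
print(bool(np.allclose(W.sum(axis=1), 1.0)), bool((W >= 0).all()))  # True True
```

Because the grid in the null-space coordinates is uniform, the surviving weight vectors cover the simplex uniformly, which is the property Fig. 1(b) illustrates.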

8.3 Confidence Regions

While solving a single stochastic programming problem, it is useful to know the quality of the solution. To increase the quality of the solutions, the number of generated scenarios K can be increased, but only at the cost of increased run time of the linear programming solver.

Confidence intervals are often used as a measure of the quality of these types of algorithms. For multi-objective optimization, a multi-dimensional confidence region is necessary. By the central limit theorem, the sample mean will converge to the multivariate normal distribution as K → ∞. Let \(\bar{y}\) and S be the sample mean and unbiased covariance matrix of the samples of Q(x, ξ), respectively. Then the 100(1 − α)% confidence region of the true mean, µ, is defined by

\[ (\bar{y} - \mu)^T S^{-1} (\bar{y} - \mu) \le \frac{(n-1)\,p}{(n-p)\,n}\, F_{p,n-p}(1-\alpha) \tag{26} \]

where \(F_{p,n-p}(\cdot)\) is the inverse CDF of the F-ratio distribution with p numerator and n − p denominator degrees of freedom [37]. The boundary of the confidence region is an ellipse in 2D and an ellipsoid in 3D. In 1D, (26) collapses

down to regular confidence intervals. Confidence regions will be plotted in Section 9.
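The membership test in (26) is a Hotelling T²-style region and can be sketched with scipy (the samples below are synthetic; n and the distribution are our choices):

```python
import numpy as np
from scipy.stats import f

def in_confidence_region(samples, mu, alpha=0.05):
    """Is mu inside the 100(1-alpha)% confidence region of the mean of the
    sample rows, per the ellipsoid in (26)?"""
    n, p = samples.shape
    ybar = samples.mean(axis=0)
    S = np.cov(samples, rowvar=False)          # unbiased sample covariance
    d = ybar - mu
    lhs = d @ np.linalg.solve(S, d)
    rhs = (n - 1) * p / ((n - p) * n) * f.ppf(1.0 - alpha, p, n - p)
    return bool(lhs <= rhs)

rng = np.random.default_rng(0)
samples = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(200, 2))
print(in_confidence_region(samples, mu=np.array([1.0, 2.0])))
```

A point far from the sample mean, such as (10, 10) here, falls well outside the region, while the sample mean itself is always inside it.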

9 RESULTS

9.1 Overview

Three very different environments are used to analyze the behavior of the proposed algorithms for resource provisioning. The heuristic-based algorithms H1, H2, and H3, and the RP that uses stochastic programming are compared. The first environment is a small example used to illustrate the behavior of the algorithms. The second environment is a larger environment. The third environment was built based on benchmark data. A complete description of the source code, system parameters, and simulation results is available in the supplementary material to aid in reproducibility. The data is provided as CSV and JSON files with further details available in the accompanying README.txt file.

The steady-state model and the scenario generation are written in C++. To generate the DEP in (14), the Coin-OR Stochastic Modeling interface (SMI) is used [38]. The underlying linear programming problem is solved with the Coin-OR Linear Programming (CLP) solver [39]. CLP is a high quality open-source solver written in C++. All the simulations were run on an Apple MacBook Pro Mid 2014, 2.2 GHz Intel Core i7. The solver is single threaded, so timing results are for one core.

To compare MVP, RP, and WS against the heuristics described in Section 7, one must be able to compute all the objectives, including reward rate. The reward rate is a function of the steady-state schedule. To allow the heuristics to perform as well as possible, we use the optimal schedule from the steady-state model by solving a linear programming problem where MB is fixed by the heuristic and MS = 0. When computing the expected objective values, the optimal schedule is used for each scenario.

9.2 Highly Heterogeneous Environment (E1)

This environment is composed of two task types and three machine types. The ETC matrix is given in Table 1. Machine types 1 and 3 are special purpose machines and machine type 2 is a general purpose machine. Machines of type 1 can execute tasks of type 1 rapidly but are slow to process tasks of type 2. The reverse is true for the other special purpose machine type (i.e., type 3).

TABLE 1: ETC for E1

     M1    M2   M3
T1   1     3    100
T2   100   3    1.1

The cost for each type of machine is one per unit time (e.g., a $1/hour instance on AWS EC2 for a particular instance type). One task of each task type arrives (on average) every time unit. There are no pre-existing machines in the environment. The power and failure rate constraints are inactive. The only uncertainties in this environment are the arrival rates of the tasks. The arrival rate for tasks of type 1 is uniform from 0 to 2. For tasks of type 2, the arrival rate is uniform from 1 − 0.547 to 1 + 0.547. The budget is set to 3.5, so a total of 3.5 machines can be purchased.


Table 2 shows the number of machines of each type that the algorithms chose. All three heuristics select machine type 1 to buy because it has the highest machine performance (19). Machine type 3 is the other special purpose machine, but it has a slightly lower machine performance and therefore is not selected by the heuristics. The MVP and RP both select the special purpose machines, but they differ on the quantity of the machines.

TABLE 2: Solutions for E1

     H1    H2    H3    MVP   RP
M1   3.5   3.5   3.5   1.0   1.9
M2   0     0     0     0     0
M3   0     0     0     2.5   1.6

Fig. 2 shows the performance of the three heuristics (H1, H2, and H3), MVP, RP, and WS. Fig. 2(a) shows the reward rates when using the mean value of the parameters for evaluating the reward rate based on the model in Section 3. The WS solution is not available because there is one solution (e.g., buying strategy) for each scenario and not one solution for all scenarios. This is the same reason why the WS algorithm is not realizable but can be used as an upper bound on performance. The heuristics all achieve a reward rate of about 1.0 because they can only efficiently process tasks of type 1, the reward per task is 1.0, and the mean arrival rate is 1.0. The MVP and RP purchase machines of different types and achieve the maximum achievable reward rate of 2.0.

Fig. 2(b) shows the expected value of the reward rate. This is the average reward one could expect if the given algorithm were employed to select the machines to purchase. All the solutions use the full budget that was allotted. The VSS and EVPI are also shown in the figure. The MVP and RP perform much better than the heuristics because the heuristics only choose one special purpose machine. The RP performs 13.8% better than the MVP, as indicated by the VSS. This is due to the RP selecting more of machine type 1, because task type 1 has a much larger uncertainty in the arrival rate compared to task type 2. In this example the EVPI is very small, indicating that having fully realized parameters would not improve the solution any further than what the RP already found.

The budget has a strong influence on the expected reward rate. Fig. 3 shows the expected reward rate for the algorithms for different values of the budget. The results are averaged over ten Monte Carlo trials with 3,000 scenarios each. With no budget, there is no possibility of any reward. As the budget increases, the reward increases until the maximum achievable reward rate of 2.0 is reached. The heuristics flatten out at around 1.0 because they only process tasks of type 1. As the budget increases, the heuristics buy more machines of type 1, but those machines provide no additional value. The MVP starts hitting a limit on the reward rate once it has enough budget to purchase all the necessary machines to run the average number of tasks. The MVP does not consider the uncertainty in the arrival rates and so has no means of being robust against this uncertainty. The RP, on the other hand, fully considers the uncertainty in the arrival rates to make the best use of the budget.

Fig. 2: Reward rates for various solutions for the E1 environment: (a) shows the reward rate computed with the mean of the parameters; (b) shows the expected reward rate over all uncertainty in the parameters.

Fig. 3: Expected reward rate for different budgets for the E1 environ-ment: The reward rate asymptotes at 2.0 with RP approaching WS. Theother algorithms never reach the maximum reward rate.

9.3 Medium Sized Problem (E2)

Environment E2 has ten task types and five machine types. The number of abstract operations is set to two. This environment is a representative system and workload used to show some interesting properties of the different approaches. In this environment, the arrival rates, η, τ, and ψ are all stochastic with known means and a CoV of 100%. There are ten machines of type five in the initial environment that can be retained or sold by the algorithms. The ETC matrix is given in Table 3.


Fig. 4: Confidence interval (a) and algorithm run time (b) versus the number of scenarios for the E2 environment: The quality of the solution is again acceptable after only a relatively small number of scenarios.

TABLE 3: ETC for E2

      M1    M2     M3     M4     M5
T1    5     10     10     101    30
T2    15    30     30     300    90
T3    50    101    11     1010   303
T4    50    101    100    1010   303
T5    505   1010   1001   10100  3030
T6    15    30     12     300    90
T7    55    110    110    1100   330
T8    17    34     16     340    102
T9    6     11     10     110    33
T10   5     10     10     100    30

To understand how the stochastic programming approach scales, Fig. 4 shows the confidence in the solutions and the run time for solving the RP for various numbers of scenarios. The results shown in Fig. 4 are the average of ten Monte Carlo trials. The one-sided confidence interval in Fig. 4(a) shows that a ±1.2% confidence can be obtained after about 8,000 scenarios. It takes only about four minutes to compute that solution. These algorithms are meant to be used offline to aid in purchasing decisions, so these run times are reasonable.

The solutions from the different algorithms are shown in Table 4. The heuristics only select a single machine type to buy and retain all 10 machines of type five. H2 and H3 both decided on the same machine to purchase, so they have identical results in Fig. 5. The MVP solution chose mostly machines of type three, but some type 2 machines were selected. The MVP solution decided to sell all 10 of the existing type five machines, indicated by the minus sign. The RP picks mostly type 2 machines, but some of types 1 and 3. Only 6.7 of the existing type five machines were sold.

TABLE 4: Solutions for E2

     H1     H2     H3     MVP     RP
M1   31.8   0      0      0       10.0
M2   0      67.3   67.3   8.5     26.9
M3   0      0      0      32.0    11.5
M4   0      0      0      0       0
M5   0      0      0      -10.0   -6.7

Similar to Fig. 2, the raw objective value and expected objective value are shown in Fig. 5 for the E2 environment with 8,000 scenarios. When considering the performance of each algorithm using the mean of the parameters, Fig. 5(a), the MVP performs slightly better than all other algorithms. It even appears to outperform the RP solution. This is misleading because the mean of the parameters is not a good measure of the expected performance. It is only one possible realization of the parameters of the problem.3 Many more realizations or scenarios are possible that are completely ignored in this measure; however, this is what is commonly used by practitioners. A better measure is the true expected reward rate shown in Fig. 5(b). Due to the steady-state model and the stochastic model presented in Sections 3 and 4, computing the expected reward rate is easily accomplished. The RP’s expected reward rate is over 13% (given by VSS) higher than the MVP and the heuristics. If the parameters could be perfectly known at the time of making the decision, then an additional 26.2% (given by EVPI) improvement can be expected. The diversity in the machine types for the MVP and the RP allowed the resultant systems to better match the workload characteristics. In the case of the RP, the workload was matched not only for the mean parameters but for nearly all possible realizations of the parameters. It is not clear how one could modify the heuristic based algorithms to better match the workload.

Many scenarios are possible in this environment. The RP is guaranteed to be the best performing solution on average, but that does not imply it is the best for each possible scenario. Fig. 6 illustrates this by plotting the probability distribution of the relative improvement that RP has over the other algorithms (i.e., H1, H2, H3, MVP). The width of the glyphs represents the normalized probability density of the relative increase in reward rate. The RP can outperform the H1 heuristic by up to 300%. For some scenarios, RP can also underperform H1 by 40%. However, on average, RP produces significantly higher reward rates, as seen by the positive mean of the distribution and by Fig. 5(b).

9.4 Benchmark Based Environment (E3)

This environment is based on a set of five benchmark applications that were run on nine different types of hardware. The execution time and power consumption were recorded for these systems [40]. Not all the task types represent traditional HPC workloads, but the benchmarks serve as a good baseline of tasks that utilize the servers in vastly different ways. The servers used include AMD and Intel

3. This is assuming the mean parameters have non-zero probability density. For multimodal or discrete distributions, the mean value of the parameters might not even be realizable.


Fig. 5: Reward rates for various solutions for the E2 environment: (a) shows the reward rate computed with the mean of the parameters; (b) shows the expected reward rate over all uncertainty in the parameters.

Fig. 6: Distributions of the relative improvement of RP over the other algorithms for the E2 environment: The RP for certain scenarios can perform as much as 300% better than the heuristics, but RP also performs worse for some scenarios by as much as 40%.

architectures ranging from four to twelve cores. The method in [41] was then used to increase the number of task types to ten. This environment has nine machine types and ten task types that define the ETC and APC matrices. The ETC matrix is shown in Table 5. The algorithm in Subsection 4.2 was used to generate η, τ, and ψ. Based on [18], the CoV for the ETC elements was taken to be 25%. The algorithm in Subsection 4.3 is used to generate the variance of τ.

The number of abstract operations, L, is set to three. The arrival rates have a known mean with a CoV of 25%. There is no power constraint in this environment. The budget is $400,000, with the average machine costing $3,183, thus hundreds of machines are provisioned by the algorithms.
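The scenario-generation step described above (parameters with a known mean and a 25% CoV) can be sketched as follows. The gamma distribution and all numbers here are illustrative assumptions, since this description fixes only the mean and CoV of each parameter:

```python
import numpy as np

def sample_scenarios(mean, cov, n_scenarios, rng):
    """Draw i.i.d. scenario realizations for each parameter.

    Each entry of `mean` gets an independent gamma-distributed sample
    with the requested coefficient of variation (std/mean). The gamma
    family is a stand-in; only the mean and CoV are specified.
    For a gamma distribution, CoV = 1/sqrt(shape), so:
      shape = 1/cov**2, scale = mean*cov**2  (=> shape*scale = mean)
    """
    shape = 1.0 / cov**2
    scale = np.asarray(mean) * cov**2
    # Result shape: (n_scenarios, *mean.shape)
    return rng.gamma(shape, scale, size=(n_scenarios,) + np.shape(mean))

rng = np.random.default_rng(0)
etc_mean = np.array([[57.0, 28.0], [98.0, 50.0]])  # toy 2x2 ETC slice
scenarios = sample_scenarios(etc_mean, 0.25, 3000, rng)
print(scenarios.mean(axis=0))                       # close to etc_mean
print(scenarios.std(axis=0) / scenarios.mean(axis=0))  # close to 0.25
```

Each scenario is then one realization of the ETC matrix (and, analogously, of the arrival rates) fed to the deterministic equivalent program.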

TABLE 5: ETC for E3

      M1   M2   M3   M4   M5   M6   M7   M8   M9
T1    57   28   72   45   41   19   27   28   26
T2    98   50  120   77   70   37   49   49   45
T3   463  303  362  342  314  311  303  290  264
T4   165  113  113  120  111  122  114  108   98
T5   167   91  185  129  118   74   90   88   81
T6   162   87  185  125  114   68   85   84   77
T7    45   22   57   36   33   15   22   22   20
T8    57   28   74   45   41   18   27   27   25
T9    59   36   54   44   41   34   36   35   32
T10   39   22   41   30   27   19   21   21   19

Fig. 7(a) shows the one-sided confidence interval for different numbers of scenarios. The results shown in Fig. 7(a) are the average of ten Monte Carlo trials. At 1,500 scenarios, the error in the reward rate is under ±1%. Fig. 7(b) shows the corresponding run times for solving RP, which takes less than a minute in all cases. Due to the (at least) quadratic growth of the run time w.r.t. the number of scenarios, it is important to use as few scenarios as possible. At 3,000 scenarios, the DEP is rather large, with 63,011 constraints (i.e., rows) and 270,018 decision variables (i.e., columns). The constraint matrix in the DEP is very sparse, so solving this large linear programming problem only consumes about 180 MB of memory.
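The ±1% figure above is the kind of quantity produced by a normal confidence interval on the per-scenario reward rates. A minimal sketch, using made-up stand-in data rather than the paper's results:

```python
import numpy as np

def relative_ci_halfwidth(samples, z=1.645):
    """Half-width of a one-sided 95% normal confidence interval on
    the mean, expressed relative to the sample mean (0.01 = +/-1%)."""
    samples = np.asarray(samples, dtype=float)
    sem = samples.std(ddof=1) / np.sqrt(len(samples))  # standard error
    return z * sem / samples.mean()

rng = np.random.default_rng(1)
# Stand-in for per-scenario optimal reward rates (mean 100, std 25).
rewards = rng.normal(100.0, 25.0, size=1500)
print(f"relative error: {relative_ci_halfwidth(rewards):.2%}")
```

Because the half-width shrinks only as 1/sqrt(n) while the DEP solve time grows at least quadratically in n, there is a clear incentive to stop adding scenarios once the interval is tight enough.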

For these simulations, MVP, RP, and WS are all trying to maximize the reward rate, but as a secondary objective they are also trying to reduce the cost. This is accomplished by weighting the reward rate with 1.0 and the upgrade cost objective with a small positive constant. Fig. 8 gives the expected negative of the reward rate for 3,000 scenarios. The upgrade cost is indicated below the labels at the bottom of the graph.

Table 6 shows the solution for each of the algorithms. All the proposed algorithms can be solved with integer constraints on these variables; however, this comes at a huge computational cost. Even though the numbers of machines are fractional, a simple rounding policy can easily be applied to the solutions in Table 6 when integers are required, with negligible effect on the reward rate.
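One plausible rounding policy of this kind (floor every count, then restore the largest fractional parts while the budget allows) is sketched below. The per-type prices and the policy itself are illustrative assumptions, not the paper's procedure:

```python
import math

def round_solution(fractional, prices, budget):
    """Round fractional machine counts down, then greedily restore the
    machine types with the largest fractional remainders while the
    total purchase cost stays within budget."""
    counts = {m: math.floor(x) for m, x in fractional.items()}
    spent = sum(prices[m] * n for m, n in counts.items())
    # Consider machine types by descending fractional remainder.
    for m in sorted(fractional, key=lambda k: fractional[k] - counts[k],
                    reverse=True):
        if fractional[m] > counts[m] and spent + prices[m] <= budget:
            counts[m] += 1
            spent += prices[m]
    return counts, spent

# RP column of Table 6, with hypothetical per-type prices.
frac = {"M3": 4.3, "M6": 12.7, "M7": 4.5, "M8": 73.8, "M9": 1.1}
prices = {"M3": 2500, "M6": 3000, "M7": 2800, "M8": 3200, "M9": 4000}
counts, spent = round_solution(frac, prices, budget=400_000)
print(counts, spent)
```

With hundreds of machines provisioned, changing any single count by one alters capacity by well under a percent, which is why rounding barely moves the reward rate.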

TABLE 6: Solutions for E3

      H1     H2     H3     MVP    RP
M1    0      0      0      0      0
M2    0      168.8  168.8  0      0
M3    0      0      0      0      4.3
M4    0      0      0      0      0
M5    0      0      0      0      0
M6    121.2  0      0      0      12.7
M7    0      0      0      0      4.5
M8    0      0      0      100.0  73.8
M9    0      0      0      0      1.1

All the heuristics spend the entire budget and still perform worse than MVP and RP. The MVP solution does not use all the available budget. In fact, it only used $291K of the budget. This is because the MVP solution does not provision for the uncertainty in the arrival rates of tasks nor in the


Fig. 7: Confidence interval (a) and algorithm run time (b) versus the number of scenarios for the E3 environment: Computing an accurate solution is fast for this larger problem even though the number of machines being provisioned and scheduled is large.

machine performance. In the MVP formulation, there is no benefit to buying more machines once the workload can be handled; hence, it does not use all the available budget. The MVP has extra degrees of freedom to improve the solution but has no practical way of determining how to best select the machines. From the MVP perspective, the algorithm is already achieving maximal reward. The RP takes the uncertainty of the arrival rates and the machines into account and can achieve the same performance as the MVP solution, but at a lower cost of $274K. The RP solution uses many different types of machines to reduce the cost and still achieve the optimal performance. The upgrade cost under the WS solution is the cost one would pay on average if one could select the machines after knowing precisely the execution time of each task type on all machine types and the arrival rates of the tasks.
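The WS/RP/MVP comparison above is conventionally summarized by the EVPI and VSS gap measures listed in the nomenclature. A minimal sketch with made-up numbers, not the paper's results:

```python
def evpi_and_vss(ws, rp, eev):
    """Standard stochastic-programming gap measures (maximization form):
      EVPI = WS - RP : what perfect foresight of the scenario is worth.
      VSS  = RP - EEV: what solving the full recourse problem gains over
                       fixing the mean-value (MVP) first-stage decision.
    ws  : expected objective under wait-and-see (clairvoyant) decisions
    rp  : optimal expected objective of the recourse problem
    eev : expected objective of the MVP solution under uncertainty
    """
    return ws - rp, rp - eev

# Illustrative expected reward rates.
evpi, vss = evpi_and_vss(ws=105.0, rp=100.0, eev=92.0)
print(evpi, vss)  # 5.0 8.0
```

In the E3 results, the analogous gap shows up as cost rather than reward: RP matches MVP's expected reward rate but at a lower upgrade cost.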

9.5 Pareto Fronts

Pareto fronts are useful tools to quantify the trade-off between conflicting objectives. The weighted sum algorithm described in Subsection 8.2 is used to generate the Pareto front for the reward rate and power consumption objectives. Fig. 9 shows the Pareto front for the E3 environment with 1,000 scenarios. This took only eight minutes to compute. The solutions found by the weighted sum algorithm are blue dots. The 95% confidence regions are shown as red ellipses centered at the blue dots. The confidence regions are relatively small. The light blue shaded region is the feasible objective space for this problem. Objective values outside

Fig. 8: Expected reward rates for various solutions for the E3 environment: The cost for each solution is indicated at the bottom of the figure. RP ties with MVP for the best performance, but with a lower-cost solution.

Fig. 9: Reward rate and power consumption Pareto front and feasible objective space for the E3 environment: The relatively small 95% confidence ellipses are shown in red. The blue shaded region is the feasible region of the objective space. The blue dots are solutions that the weighted sum algorithm found.

this feasible space are not possible due to one or more of the constraints in the problem. The feasible region is computed by solving the RP with

ω = (cos θ, sin θ)^T    (27)

for values of θ from 0 to 2π. More details on computing the feasible regions for linear programming problems are available in [42].
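A sketch of this θ-sweep on a toy bi-objective LP follows. The C, A, and b data are illustrative, and SciPy's linprog stands in for the Coin-OR CLP solver used in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def sweep_objective_space(C, A_ub, b_ub, n_angles=64):
    """Trace the boundary of the feasible objective space of a
    bi-objective LP by minimizing w^T C x for w = (cos t, sin t) over
    t in [0, 2*pi), as in Eq. (27). C maps decisions x to the two
    objective values. Angles whose weights are both non-negative give
    Pareto-optimal points; the remaining angles outline the rest of
    the feasible region's boundary."""
    boundary = []
    for t in np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(t), np.sin(t)])
        res = linprog(w @ C, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * C.shape[1])
        if res.success:
            boundary.append(C @ res.x)  # point in objective space
    return np.array(boundary)

# Toy problem: f1 = -x1 (maximize "reward" x1), f2 = x2 (minimize
# "power" x2), subject to x1 + x2 <= 4, x1 <= 3, x2 <= 3, x >= 0.
C = np.array([[-1.0, 0.0], [0.0, 1.0]])
A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])
pts = sweep_objective_space(C, A, b)
print(pts.min(axis=0), pts.max(axis=0))
```

Each solve is an ordinary LP, so the full sweep costs one LP per angle; this is why generating the whole feasible region remains tractable even for the large scenario-expanded DEP.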

10 CONCLUSIONS AND FUTURE WORK

Stochastic programming is a powerful tool that can be applied to make robust decisions in the midst of the inherent uncertainty in computing systems, in both IaaS provider clouds and traditional environments. The linear steady-state model and representative stochastic model enable the use of an efficient two-stage stochastic program for solving the machine provisioning problem. The new algorithms were compared to heuristic-based algorithms in a few different environments. The heuristic approaches tend to perform poorly when considering their average performance. The


RP produces the best quality solution on average compared to the heuristics and the MVP. The inherent uncertainty in the execution times and arrival rates necessitates algorithms that can incorporate random variables and their statistics. The RP can be used to reduce the upgrade cost for a computing system by exploiting the uncertainty in the environment while maintaining the optimal level of performance. The multi-objective optimization problem can be used to quickly generate Pareto fronts.

In the future, we would like to adapt this model to include aspects of spot pricing and the uncertainty surrounding the price of the instances. We would also like to adapt these concepts to existing cloud provisioning tools and computational models. Risk-averse formulations are also possible that maximize the expected reward rate while minimizing the variance of the reward rate.

ACKNOWLEDGMENTS

This work was supported by Numerica Corporation, the National Science Foundation (NSF) under grants CNS-0905399 and CCF-1302693, and by the Colorado State University George T. Abell Endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. A special thanks to Ryan Friese and Bhavesh Khemka for their valuable comments.

NOMENCLATURE

Symbol     Description
T          Number of task types
M          Number of machine types
i          Task type
j          Machine type
β^B_j      Buying price (cost) of a machine of type j
β^S_j      Selling price of a machine of type j
M^cur_j    Current number of machines of type j
M^min_j    Minimum number of machines of type j desired
M^max_j    Maximum number of machines of type j desired
M^B_j      Number of machines of type j to buy
M^S_j      Number of machines of type j to sell
M_j        Total number of machines of type j after buying and selling
r_i        Reward for completing a task of type i
λ_i        Arrival rate: number of tasks of type i arriving per second
p_ij       Schedule: fraction of time tasks of type i run on machines of type j
ν_j        Failure rate: number of machines of type j that fail per second
L          Number of abstract computational operations
η_il       Number of abstract operations of type l in a task of type i
τ_lj       Run time [s] per abstract operation of type l on a machine of type j
ETC_ij     Run time [s] of a task of type i on a machine of type j
ψ_lj       Dynamic energy [J] consumed by an abstract operation of type l on a machine of type j
CoV_ij     Coefficient of variation for a task of type i on a machine of type j
ξ          Random vector
x          First-stage decision variable vector
y          Second-stage decision variable vector
Q(x, ξ)    Value function
WS         Wait-and-see problem
RP         Recourse problem
MVP        Mean value problem
EVPI       Expected value of perfect information
VSS        Value of the stochastic solution

Kyle M. Tarplee is a Senior Member of the IEEE. He is an Assistant Professor of Engineering at Anderson University. Previously, he worked for Numerica Corporation for ten years developing multi-target tracking algorithms for the Department of Defense. He has been the principal investigator on several projects. He has a PhD in Electrical Engineering from Colorado State University (CSU), and a BSEE and Masters from the University of California at San Diego (UCSD). Further details at: https://www.engr.colostate.edu/∼ktarplee and http://www.anderson.edu/science-engineering/faculty/engineering/tarplee.

Anthony A. Maciejewski received the BSEE, MS, and PhD degrees from The Ohio State University in 1982, 1984, and 1987. From 1988 to 2001, he was a professor of Electrical and Computer Engineering at Purdue University, West Lafayette. He is currently a Professor and Department Head of Electrical and Computer Engineering at Colorado State University. He is a Fellow of the IEEE. A complete vita is available at: https://www.engr.colostate.edu/∼aam

Howard Jay Siegel was appointed the Abell Endowed Chair Distinguished Professor of Electrical and Computer Engineering at Colorado State University (CSU) in 2001, where he is also a Professor of Computer Science. From 1976 to 2001, he was a professor at Purdue University, West Lafayette. He is an IEEE Fellow and an ACM Fellow. He received BS degrees from MIT, and the PhD degree from Princeton. A complete vita is available at: https://www.engr.colostate.edu/∼hj


REFERENCES

[1] (October). [Online]. Available: http://www.top500.org/system/178321

[2] I. Zinno, L. Mossucca, S. Elefante, C. D. Luca, V. Casola, O. Terzo, F. Casu, and R. Lanari, "Cloud computing for earth surface deformation analysis via spaceborne radar imaging: A case study," IEEE Transactions on Cloud Computing, vol. 4, no. 1, pp. 104–118, Jan 2016.

[3] E. Roloff, F. Birck, M. Diener, A. Carissimi, and P. O. A. Navaux, "Evaluating high performance computing on the windows azure platform," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, June 2012, pp. 803–810.

[4] J. Gibson, R. Rondeau, D. Eveleigh, and Q. Tan, "Benefits and challenges of three cloud computing service models," in Fourth International Conference on Computational Aspects of Social Networks (CASoN), 2012, pp. 198–205.

[5] W. Lloyd, S. Pallickara, O. David, J. Lyon, M. Arabi, and K. Rojas, "Performance implications of multi-tier application deployments on infrastructure-as-a-service clouds: Towards performance modeling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1254–1264, 2013.

[6] A. Iosup and D. Epema, "Grid computing workloads," IEEE Internet Computing, vol. 15, no. 2, pp. 19–26, March 2011.

[7] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, vol. 37, pp. 309–320, 2014.

[8] J. Li, M. Qiu, Z. Ming, G. Quan, X. Qin, and Z. Gu, "Online optimization for scheduling preemptable tasks on IaaS cloud systems," Journal of Parallel and Distributed Computing, vol. 72, no. 5, pp. 666–677, 2012.

[9] J. G. F. Coutinho, O. Pell, E. O'Neill, P. Sanders, J. McGlone, P. Grigoras, W. Luk, and C. Ragusa, "HARNESS project: Managing heterogeneous computing resources for a cloud platform," in Reconfigurable Computing: Architectures, Tools, and Applications. Springer, 2014, pp. 324–329.

[10] Q. He, S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, "Case study for running HPC applications in public clouds," in 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 395–401.

[11] E. Jeannot, E. Saule, and D. Trystram, "Optimizing performance and reliability on heterogeneous parallel systems: Approximation algorithms and heuristics," Journal of Parallel and Distributed Computing, vol. 72, no. 2, pp. 268–280, 2012.

[12] J. Koomey, "Growth in data center electricity use 2005 to 2010," The New York Times, vol. 49, no. 3, 2011.

[13] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2011, pp. 1–12.

[14] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski, "Cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2012, pp. 1–11.

[15] K. Jansen and L. Porkolab, "Improved approximation schemes for scheduling unrelated parallel machines," Mathematics of Operations Research, vol. 26, no. 2, pp. 324–338, 2001. [Online]. Available: http://dx.doi.org/10.1287/moor.26.2.324.10559

[16] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, "Steady-state scheduling on heterogeneous clusters," International Journal of Foundations of Computer Science, vol. 16, no. 2, pp. 163–194, 2005.

[17] I. Al-Azzoni and D. G. Down, "Linear programming-based affinity scheduling of independent tasks on heterogeneous computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 19, pp. 1671–1682, 2008.

[18] J. Schad, J. Dittrich, and J.-A. Quiane-Ruiz, "Runtime measurements in the cloud: Observing, analyzing, and reducing variance," VLDB Endowment, vol. 3, no. 1-2, pp. 460–471, Sep. 2010. [Online]. Available: http://dx.doi.org/10.14778/1920841.1920902

[19] M. Rehman and M. Sakr, "Initial findings for provisioning variation in cloud computing," in IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), Nov 2010, pp. 473–479.

[20] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in 8th USENIX Conference on Operating Systems Design and Implementation. Berkeley, CA, USA: USENIX Association, 2008.

[21] M. Rodriguez and R. Buyya, "Deadline based resource provisioning and scheduling algorithm for scientific workflows on clouds," IEEE Transactions on Cloud Computing, vol. 2, no. 2, pp. 222–235, April 2014.

[22] S. Ahmed, A. J. King, and G. Parija, "A multi-stage stochastic integer programming approach for capacity expansion under uncertainty," Stochastic Programming E-Print Series, http://www.speps.org, 2001.

[23] M. Riis and J. Lodahl, "A bicriteria stochastic programming model for capacity expansion in telecommunications," Mathematical Methods of Operations Research, vol. 56, no. 1, pp. 83–100, 2002.

[24] D. Bertsimas and J. Tsitsiklis, Introduction to Linear Optimization, 1st ed. Athena Scientific, 1997.

[25] A. M. Al-Qawasmeh, A. A. Maciejewski, R. G. Roberts, and H. J. Siegel, "Characterizing task-machine affinity in heterogeneous computing environments," in International Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), Heterogeneity in Computing Workshop. IEEE Computer Society, May 2011, pp. 34–44.

[26] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.

[27] J. R. Birge and F. Louveaux, Introduction to Stochastic Programming, ser. Operations Research and Financial Engineering. Springer, 2011.

[28] R. M. Van Slyke and R. Wets, "L-shaped linear programs with applications to optimal control and stochastic programming," SIAM Journal on Applied Mathematics, vol. 17, no. 4, pp. 638–663, 1969.

[29] R. B. Franca, E. C. Jones, C. N. Richards, and J. P. Carlson, "Multi-objective stochastic supply chain modeling to evaluate tradeoffs between profit and quality," International Journal of Production Economics, vol. 127, no. 2, pp. 292–299, 2010.

[30] J. Teghem, D. Dufrane, M. Thauvoye, and P. Kunsch, "STRANGE: An interactive method for multi-objective linear programming under uncertainty," European Journal of Operational Research, vol. 26, pp. 65–82, 1986.

[31] V. Pareto, Cours d'economie Politique. Lausanne: F. Rouge, 1896.

[32] H. Benson, "An outer approximation algorithm for generating all efficient extreme points in the outcome set of a multiple objective linear programming problem," Journal of Global Optimization, vol. 13, no. 1, pp. 1–24, 1998.

[33] A. Lohne, Vector Optimization with Infimum and Supremum, ser. Vector Optimization. Berlin, Heidelberg: Springer, 2011.

[34] G. Eichfelder, Adaptive Scalarization Methods in Multiobjective Optimization. Springer, 2008.

[35] I. Y. Kim and O. De Weck, "Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation," Structural and Multidisciplinary Optimization, vol. 31, no. 2, pp. 105–116, 2006.

[36] M. Ehrgott, Multicriteria Optimization. Springer Science & Business Media, 2006.

[37] V. Chew, "Confidence, prediction, and tolerance regions for the multivariate normal distribution," Journal of the American Statistical Association, vol. 61, no. 315, pp. 605–617, 1966.

[38] (2015, January) Coin-OR SMI. [Online]. Available: https://projects.coin-or.org/Smi

[39] (2015, January) Coin-OR CLP. [Online]. Available: https://projects.coin-or.org/Clp

[40] (2013, May) Intel core i7 3770k power consumption, thermal. [Online]. Available: http://openbenchmarking.org/result/1204229-SU-CPUMONITO81

[41] R. Friese, B. Khemka, A. A. Maciejewski, H. J. Siegel, G. A. Koenig, S. Powers, M. Hilton, J. Rambharos, G. Okonski, and S. W. Poole, "An analysis framework for investigating the trade-offs between system performance and energy consumption in a heterogeneous computing environment," in 27th International Parallel and Distributed Processing Symposium Workshops and Phd Forum (IPDPSW), Heterogeneity in Computing Workshop. IEEE Computer Society, 2013, pp. 19–30.

[42] T. Huynh, C. Lassez, and J.-L. Lassez, "Practical issues on the projection of polyhedral sets," Annals of Mathematics and Artificial Intelligence, vol. 6, no. 4, pp. 295–315, 1992.
