Elastic Scheduling in HPC Resource Management Systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Feng Liu
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Professor Jon Weissman
December 2018
© Feng Liu 2018
ALL RIGHTS RESERVED
Acknowledgements
I would like to express my utmost gratitude to my advisor, Professor Jon Weissman.
During the past 6 years and 4 months, there were countless occasions when I couldn't
deliver results on time. He showed tremendous patience and guided me through the
difficulties. He gave me valuable advice and hints and tolerated my mistakes. He spent
enormous effort advising me. I appreciate all of our discussions, separate meetings,
emails, and collaborative writings.
I would like to express my great gratitude and appreciation to Dr. Kate Keahey from
Argonne National Lab. She directed me with her incredible insights. She insists on
high standards of work, emphasizes details, and keeps the big picture in view. I have
gained a great deal of experience working under her direction, which will help me
become much more productive in my future career.
The work presented in Chapter 2 and Chapter 4 was sponsored by the Department
of Energy under the AIMES (Abstractions and Integrated Middleware for Extreme-Scale
Science) project (DE-FG02-12ER26115, DE-SC0008617, DE-SC0008651). Thanks to
the participants of the AIMES project: Shantenu Jha, Matteo Turilli, Andre Merzky,
Daniel S. Katz, Michael Wilde, Zhao Zhang, and Yadu Nand Babuji.
I would like to thank Dr. Pierre Riteau for the initial implementation of the Balancer
service in Chapter 3. The work presented in that chapter was supported by the U.S.
Department of Energy under DOE-LAB-14-1003 and by the NSF under award
NSF-1443080. Results presented in the chapter were obtained using the Chameleon
testbed supported by the National Science Foundation.
Finally, I would like to thank my parents. They always support me and encourage
me to pursue my goals.
Dedication
To my parents.
Abstract
High Performance Computing (HPC) aggregates the power of computer clusters to
tackle large problems, empowering science. HPC resource scheduling today faces
multiple challenges. Firstly, most HPC clusters are managed by queue-based batch
systems, whose schedulers maximize application run-time efficiency while sacrificing
response time and sometimes utilization. Secondly, HPC clusters reserved for on-demand
data analysis operate at low utilization. Thirdly, multiple heterogeneous and dynamic
HPC resources greatly complicate resource scheduling for distributed applications.
To solve these problems, this thesis presents several elastic scheduling approaches.
Elasticity means the ability to dynamically allocate resources based on workloads.
Elasticity is commonly supported in the cloud but lacking in HPC. Our approaches
include new scheduling algorithms and implementations of those algorithms as services.
Our services leverage existing techniques and are non-invasive, meaning that they
minimize changes to user interfaces.
We address the first problem using Elastic Job Bundling (EJB), a technique that
dynamically transforms a large batch job into multiple smaller subjobs so that the
subjobs will start early on immediately available resources. Simulation results show
that our approach reduces application mean turnaround time by up to 48%, reduces
resource fragmentation by up to 59%, and reduces priority inversions by 20%.
We address the second problem using Balancer, a technique that combines and
dynamically moves nodes between an on-demand cluster and a batch cluster. Our
results show that for a real-life scenario, our approach reduces the current investment
in the on-demand cluster by 82% while at the same time improving the mean batch wait
time by 8x.
We address the third problem using Bundle, a resource abstraction that represents
heterogeneous resource capacities and capabilities in a uniform way. We implement
Bundle as a service on 10+ heterogeneous HPC resources and use it to draw insights
about those resources.
Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction
1.1 Resource Scheduling in HPC Clusters
1.2 Challenges in HPC Resource Scheduling
1.3 Elasticity in HPC
1.4 Summary of Research Contributions
1.4.1 Adaptive Resource Request
1.4.2 Dynamic Resource Negotiation
1.4.3 Dynamic Resource Bundle
1.5 Outline

2 Elastic Job Bundling: An Adaptive Resource Request Strategy for Large-Scale Parallel Applications
2.1 Introduction
2.2 Elastic Job Bundling (EJB)
2.2.1 The Formation of Elastic Jobs
2.2.2 Running Applications on Elastic Jobs
2.2.3 Taming Unpredictability
2.2.4 Assumptions and Limitations
2.3 EJB Scheduling Algorithm
2.3.1 TargetJobArrivalEvent Handler
2.3.2 IdleJobSlotsAvailableEvent Handler
2.3.3 SubjobStartEvent Handler
2.3.4 TargetAppCompleteEvent Handler
2.4 Implementation
2.5 Trace-driven Simulation
2.6 Evaluation
2.6.1 Performance Metrics
2.6.2 Improving Elastic Job Turnaround Time
2.6.3 Migration Behavior
2.6.4 Multiple Elastic Jobs
2.7 Discussion
2.8 Related Work
2.8.1 Moldable Jobs
2.8.2 Malleable Jobs
2.9 Conclusion

3 Dynamically Negotiating Capacity Between On-demand and Batch Clusters
3.1 Introduction
3.2 Approach
3.2.1 Leases
3.2.2 Architecture
3.2.3 Algorithms
3.2.4 Implementation
3.3 Experimental Evaluation
3.3.1 Evaluating a Real-Life Scenario
3.3.2 Evaluating Balancer Algorithms
3.3.3 Elasticity Analysis
3.4 Related Work
3.5 Conclusion

4 The Bundle Service for Elastic Resource Scheduling in HPC Environments
4.1 Introduction
4.2 The Bundle Abstraction
4.3 Implementation
4.4 Experiments
4.4.1 HPC Cluster Workload Characterization
4.4.2 Grid Nodes Performance Heterogeneity
4.4.3 Grid Network Performance Variation
4.5 Related Work
4.6 Conclusion

5 Conclusion
5.1 Research Contributions
5.1.1 Elasticity for Parallel Batch System
5.1.2 Elasticity for On-demand and Batch Hybrid System
5.1.3 Elastic Resource Abstraction for Heterogeneous Resources
5.2 Future Research Directions
5.2.1 Improving Elasticity by Using Container Based Solutions
5.2.2 Unified Batch and On-demand Scheduler
5.2.3 Node-level Resource Partitioning and Sharing

References
List of Tables

2.1 Upper bounds on processors (Pmax) and runtime (Rmax) of the two types of immediately backfillable job slots.
2.2 Traces used in our simulation.
2.3 Increase ('+') or decrease ('-') percentages of mean wait, run, and turnaround time of elastic jobs compared to target jobs' baseline results.
2.4 Summary of migration-related statistics.
2.5 Before-and-after comparison: confidence intervals are calculated at the 95% confidence level.
2.6 Fragmentation: np is the average number of idle processors, % is the percentage of idle processors in the cluster.
3.1 Experimental results for the most challenging week: 24,177 batch jobs and 141 on-demand requests are submitted in each experiment. Wait times are measured in minutes and reserve values are given in nodes. For the dynamic case, the on-demand and batch utilization refer to the portions of utilization coming from on-demand and batch requests respectively.
4.1 BundleAgent supported platforms.
4.2 BundleAPI.
List of Figures

1.1 The process of (1) user submits resource request, (2) job scheduling, and (3) application execution in an HPC cluster.
1.2 A real-world week-long bursty workload from the Argonne National Lab's APS cluster.
2.1 Overview of Elastic Job Bundling.
2.2 Illustration of elastic jobs.
2.3 Mapping a parallel application's processes to an elastic job, including progress measurement.
2.4 Finding immediately usable resources under EASY backfilling.
2.5 Three cases for subjob submission in a waiting elastic job.
2.6 EJB system architecture.
2.7 Performance measurement of six NPB programs under over-subscription. All programs are compiled with problem size CLASS=C, with NPROCS=100 (bt, sp) and 128 (ft, is, lu, mg). Different problem sizes or NPROCS follow the same pattern.
2.8 Elastic job's overall performance and variations.
2.9 Sensitivity analysis of Omax, λ, and ∆.
2.10 Decision tree for elastic job selection.
2.11 Bounded slowdown: side-by-side view before and after EJB is added, grouped into elastic, non-elastic, and all jobs.
2.12 Linear regressions of tw over job size before and after EJB is added. Proximity to the x-axis indicates fairness.
2.13 Changing utilization: EJB is more resistant under high utilization.
3.1 High-level architecture.
3.2 Performance results of the Basic algorithm, five batch workloads and six on-demand workloads, R = 0, W = 0.
3.3 Performance results of the Basic algorithm with static reserve, three batch workloads, ρ = 10% on-demand workloads.
3.4 Performance results of the Hint algorithm, H = 15 min or H = 30 min, three batch workloads, six on-demand workloads.
3.5 Performance results of the Predictive algorithm, three batch workloads, six on-demand workloads.
3.6 Measurements of elasticity based on U77 and ρ = 20% workloads.
4.1 Overview of the Bundle layer.
4.2 An overview of the Bundle architecture; blue shaded components comprise the Bundle software.
4.3 Visualization of a month-long workload of the TACC Stampede HPC cluster.
4.4 Compute node performance clustering.
4.5 Network performance.
Chapter 1
Introduction
High Performance Computing, or HPC, generally refers to the practice of aggregating
the power of a cluster of computers to tackle large problems beyond the capacity of a
single node. Compared to general-purpose computing, HPC achieves high performance
by co-designing high-end computing, storage, and networking. From hurricane and
earthquake predictions to solving global hunger challenges, never before has HPC been
more crucial to empowering science for the benefit of humanity.
HPC allows scientists to simulate bigger models represented at finer scales at faster
speeds. Take the research area of earthquakes for example. Large earthquakes cause
devastating damage to human society. With the major advances occurring in HPC,
the ability to simulate the complex processes associated with major earthquakes helps
scientists push the frontiers of studies aimed at reducing seismic damage. For
instance, a work [1] published in 2017 ran large-scale nonlinear earthquake simulations
on Sunway TaihuLight – the world's fastest supercomputer in that year. The work
achieved over 15% of the supercomputer's peak performance, with the extreme cases
demonstrating a sustained performance of over 18.9 Pflops, enabling the simulation of
the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.
Not limited to scientific computing, the increasing adoption of HPC technologies
also advances innovation in emerging areas such as Big Data, the Internet of Things (IoT),
and Machine Learning (ML). One compelling example is AeroFarms, an IoT/ML-driven
vertical farm that controls plants' operating and growing environment at fine
granularity. The vertical farm gathers data on every factor from moisture and nutrients
to light and oxygen, and sends the data to HPC facilities optimized for machine learning.
HPC technologies enable complex decision-making, such as real-time quality control that
relies on diverse types of data. Empowered by the deep integration of IoT/HPC/ML,
the farm is capable of using up to 95% less water than traditional field farming and
significantly improving annual productivity [2].
Driven by real-world HPC applications, both academia and industry have invested
great effort in building the next generation of HPC systems, i.e., exascale computing [3],
which targets one exaflops (10^18 floating-point operations per second) by 2021 [4].
However, most emphasis has been placed on architecture and infrastructure, while HPC
resource scheduling has not evolved much compared to a decade ago. Almost all HPC
clusters today are managed by queue-based batch schedulers (e.g., SLURM, PBS, TORQUE)
to arbitrate resource sharing among applications. This type of scheduling aims to
maximize application run-time efficiency while sacrificing response time and sometimes
utilization, which hinders the ability of HPC systems to serve the needs of HPC users
under different workload patterns or to support richer use scenarios. In order to
support these needs, this thesis explores alternative approaches which balance the
aforementioned performance goals, namely efficiency, utilization, and response time.
1.1 Resource Scheduling in HPC Clusters
As Figure 1.1 displays, in an HPC cluster, when a user wants to run an application,
e.g. a set of tasks, the user would need to submit a resource request to the cluster’s
scheduler. The request essentially specifies (a) number of processors, and (b) estimated
runtime needed to execute the tasks. The scheduler responds to the resource request by
granting a temporary ownership of resources, taking place between a well-defined start
time and end time. Such a temporary ownership is defined as a job, which is the basic
scheduling unit in an HPC resource scheduler. The distinction between a job and an
application is that the former can be seen as a lease on the resources needed by the latter.
Figure 1.1: The process of (1) user submits resource request, (2) job scheduling, and (3) application execution in an HPC cluster.

From a resource efficiency perspective, a direct method to achieve high performance
for an application is to give the job corresponding to the application exclusive ownership
of resources for a continuous period of time – a strategy commonly known as space share,
as opposed to time share (or multi-tenancy), which allows multiple applications to
concurrently operate in a shared environment. Another important assumption with space
share is that an independent processor is assigned to every task within an application,
such that space sharing is realized at the task/processor level.
A batch scheduler arbitrates space share by managing a queue of jobs. It controls
the order in which jobs start running according to scheduling policies and resource
availability. A job can start only when its entire resources are allocated in the
processors × time 'shape', such that the application can execute without interruption
until it completes or the job runtime expires. This job model can be characterized as
'rigid-job' and 'all-or-nothing'. Rigid-job means that a job's size and duration are
not flexible. All-or-nothing describes the requirement that the scheduler allocate the
resources all at once; otherwise the job will have to keep waiting.
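To make the rigid-job, all-or-nothing model concrete, the following minimal Python sketch (our illustration; the function and variable names are hypothetical, not part of any scheduler's API) checks whether a job's full processors × time shape is available:

    def can_start(job_procs, job_runtime, free_procs, free_until, now):
        """All-or-nothing admission: a rigid job J = (P, R) starts only if
        P processors are free for its entire runtime R."""
        return (job_procs <= free_procs and          # enough processors now...
                now + job_runtime <= free_until)     # ...and for the whole duration

    # A 64-processor, 3600s job cannot start while only 32 processors are free,
    # no matter how long those 32 stay idle -- it keeps waiting.
    print(can_start(64, 3600, free_procs=32, free_until=float("inf"), now=0))  # False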
Finally, another noteworthy trend in the HPC world is that in recent years, a new
generation of scientific equipment and devices, ranging from advanced light sources to
geographic information systems (GIS), has become capable of generating large volumes of
experimental data that require rapid analysis. Unlike traditional batch applications,
these new applications are more time-sensitive, meaning that they require on-demand
resource availability for data analytics rather than tolerating long wait times. Moreover,
fast turnaround time for data processing enables fast iterations of scientific experiments,
which ultimately accelerates the speed of scientific discovery. To serve this new class of
applications, many HPC system operators are building on-demand service frameworks
that support fast resource access.
1.2 Challenges in HPC Resource Scheduling
To understand the challenges in HPC resource scheduling, we need to analyze the trade-
offs made by existing approaches. In order to achieve high performance, the design of
HPC resource scheduling systems makes trade-offs between efficiency, utilization, and
response time.
First of all, HPC scheduling trades off response time for high utilization and effi-
ciency. Because of the high cost of building and operating HPC clusters, high utilization
is usually the major scheduling concern, which means that the job queues are usually
oversubscribed by many users' job requests. As a result, job requests usually have to
wait a long time (hours to days) in queues until they can be started. Moreover, as de-
scribed in the previous section, HPC job scheduling applies rigid-job and all-or-nothing
– two strategies that exacerbate long job waiting times, since it is difficult to find suffi-
cient resources that are concurrently available for a given period of time, especially for
large jobs with long runtimes.
A recent study [5] conducted on 20 representative HPC clusters over the past five
years clearly reflected the long wait time issue. While the backlog of jobs shows
substantial fluctuation over time, the demand for resources, measured in core-years,
consistently exceeded the capacity of the existing resources. Despite the wide variation
in job wait times among the 20 clusters, large jobs have much longer wait times compared
to small jobs.
Second of all, the pursuit of high efficiency in HPC sometimes hinders utilization.
This type of trade-off can also be explained by the limitation of rigid jobs. When jobs'
sizes are rigid and predetermined at resource request time, they are unable to utilize
spontaneously available resources. If there are no other jobs to fill the idle resources,
those resources will be wasted, thus lowering utilization. At the large scale of today's
HPC clusters, a 1-2% loss in utilization could mean a significant waste of computational
power, energy, and money.
Figure 1.2: A real-world week-long bursty workload from the Argonne National Lab's APS cluster.
Third of all, HPC infrastructures operated to support on-demand availability trade
off utilization for low response time. Resources are over-provisioned according to peak
usage such that there are always free resources to fulfill on-demand requests. As a
real-world example shown in Figure 1.2, the APS cluster is a production on-demand
cluster operated in Argonne National Lab (ANL). This example shows a typical pattern
of on-demand workload: highly bursty but low in average utilization.
Based on the above analysis, we can summarize the challenges that HPC scheduling
needs to address as follows:
• Challenge 1: Batch jobs, especially large ones, tend to suffer long waiting time,
which hinders the usability of HPC systems;
• Challenge 2: Rigidity limits jobs’ opportunities to use spontaneous availability,
thus impairing utilization;
• Challenge 3: The support for on-demand availability significantly undermines
utilization.
This thesis addresses all three challenges by developing new elastic scheduling
methodologies – a strategy commonly implemented in cloud computing – to improve HPC
resource scheduling.
1.3 Elasticity in HPC
In the context of cloud computing, elasticity can be defined as “the ability of a system to
adapt to workload changes by automatically provisioning and deprovisioning computing
resources, such that at each point in time the available resources match the current
demand as closely as possible” [6]. In order to satisfy the conditions of elasticity, it
would not only require systems to quickly provision resources on demand, but also
require applications to be able to adapt to the resources available at run-time. However,
neither condition generally holds in the realm of HPC.
• Firstly, nearly all HPC facilities are operated by queue-based batch systems that
do not support on-demand resource provisioning, partly because preemption is
not allowed.
• Secondly, many HPC applications feature parallel computing tasks which require
a fixed number of processors throughout their runtime. This rigidity has greatly
limited the ability of HPC applications to dynamically adapt to system resource
availability at run-time. Breaking this limitation would be a solid step towards
bringing some degree of elasticity to the current HPC resource management paradigm.
Achieving elasticity in HPC systems would help address the key challenges listed
in the previous section. Specifically, if we can enable HPC applications to dynamically
adapt to available resources, they can take the opportunity to utilize fragmented
resources such that they can start sooner and make progress faster, which ameliorates
challenges 1 and 2 at the same time. Moreover, if HPC systems can support resource
provisioning in an on-demand manner, thus addressing challenge 3, it will make HPC
systems very appealing to users who not only desire high performance but also benefit
from cloud-like resource provisioning – the best of both worlds.
1.4 Summary of Research Contributions
This thesis presents my efforts in studying the methods of, and building systems
using, elastic scheduling approaches. At a high level, the thesis takes a problem-driven
approach: identifying problems by observing what users and system administrators are
facing in day-to-day operations. In the context of the thesis, users refer to those who run
applications on HPC systems, including scientific researchers, engineers, and students
from academic institutions.
At a high level, our approaches share several commonalities. We do not assume
that users have expert-level programming skills. We also minimize the changes to
user interfaces: the changes made by our solutions are transparent to users. Therefore,
we chose to implement our methods in the middleware layer – on top of existing systems.
With respect to evaluation, we adopt a trace-driven approach: we validate our
approaches using real-world traces collected from production HPC facilities.
1.4.1 Adaptive Resource Request
In today’s batch queue HPC cluster systems, the user submits a job requesting a fixed
number of processors. The system will not start the job until all of the requested re-
sources become available simultaneously. When cluster workload is high, large jobs
will experience long waiting times due to this policy. To solve this problem, we propose
a new approach that dynamically decomposes a large job into smaller ones to reduce
waiting time, and lets the application expand across multiple subjobs while continuously
achieving progress. This approach has three benefits: (i) application turnaround time
is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our
approach does not depend on job queue time prediction but exploits available backfill
opportunities. Simulation results have shown that our approach reduces application
mean turnaround time by up to 48%, reduces resource fragmentation by up to 59%,
and reduces priority inversions by 20% [7].
1.4.2 Dynamic Resource Negotiation
The recent improvements in experimental devices, ranging from light sources to sensor-
based deployments for Smart Cities projects, lead to the need to analyze more data
on-demand so that they can be effectively used in the management of the experimental
or observational cycle. This means that small, dedicated analysis clusters used by
many experimental communities are no longer sufficient and their users are increasingly
looking to expand their capacity by integrating HPC resources into their workflow. This
presents a challenge: how can we provide on-demand execution within HPC clusters
operated mostly as batch?
Our answer to this question is the design and evaluation of the Balancer: a service
that dynamically moves nodes between an on-demand cluster configured with cloud
technology (e.g., OpenStack) and a batch cluster configured with a batch scheduler (e.g.,
Torque) with changes to support on-demand resource reclamation. We propose
three algorithms for moving nodes between on-demand and batch partitions and evaluate
them experimentally both in the context of real-life traces representing two years of a
specific institutional need, and via experiments in the context of synthetic traces that
capture generalized characteristics of potential batch and on-demand traces. Our results
for the real-life scenario show that our approach reduces the current investment in on-
demand infrastructure by 82% while at the same time improving the mean batch wait
time almost by an order of magnitude (8x) [8].
1.4.3 Dynamic Resource Bundle
Large-scale distributed scientific applications are concurrently scheduled on multiple
HPC resources which are diverse in architectures and interfaces, and temporally vari-
ant in performance and workload. Executing an application on multiple heterogeneous
and dynamic resources is difficult due to the complexity of choosing resources and dis-
tributing the application's tasks over them, especially when the users and the developers
of the application lack a good understanding of the general properties and performance
of the dynamic resources. This thesis addresses this issue by devising uniform resource
abstractions called Resource Bundles, implementing them in middleware, and conducting
experimental evaluations to show the benefits of our methodology.
The abstractions represent characterizations of resource capacities and capabilities.
We collected resource information over a year of 10 diverse HPC clusters of XSEDE and
NERSC, and thousands of distributed servers of OSG. These resource characterizations
offer useful insights on how distributed applications should be coupled with multiple
heterogeneous resources [9].
1.5 Outline
Chapter 2 will present Elastic Job Bundling, a service that dynamically determines
resource requests and elastically executes parallel applications in existing HPC clusters.
Chapter 3 will present Balancer, a service that dynamically negotiates resources be-
tween a batch scheduler and an on-demand scheduler. Chapter 4 will present Bundle,
a service that supports elastic coupling between distributed applications and heteroge-
neous resources. Chapter 5 will conclude and discuss future research directions.
Chapter 2
Elastic Job Bundling: An
Adaptive Resource Request
Strategy for Large-Scale Parallel
Applications
2.1 Introduction
Scientific research today in areas such as fluid dynamics and climate modeling is largely
dependent on simulations which have large computational needs [10]. Parallel computers
are commonly used to address such problems of ever increasing scale [11]. With the rapid
growth of scientific parallel programs designed to execute simultaneously on hundreds to
thousands of processors, swiftly provisioning a large number of processors has become
more challenging.
Massively parallel supercomputers have long been the most popular platform for
executing large-scale scientific applications. Due to the high cost of these machines,
users usually space-share them by submitting individual job requests to the batch queue
system. Each job request contains the number of desired processors P and a run time
estimate R. Once a job is scheduled, it gains exclusive use of the P processors until
it finishes before R, or is killed when R expires.
Mapping each application's resource request to a P × R shape is convenient for users
to specify and simplifies batch scheduler design. However, this rigid scheme may also
cause the following problems: (i) when system workload is high, it is difficult to find
enough free processors for large jobs which leads to long waiting time; and (ii) when
most jobs are large, a comparatively small number of free processors cannot be efficiently
utilized, since these fragments are unusable for any waiting job. Giving higher priorities
to large jobs will not solve these problems, particularly in the event that the workload
is dominated by large jobs.
In this work, we propose a new technique addressing the queue waiting problem
called Elastic Job Bundling. When a large job of size P × R is waiting in the queue, we
decompose it into several smaller subjobs of size Px × Rx (Px < P) to reduce wait time.
This technique then manages the time overlap of subjobs to allow the application to
continuously execute and make progress.
In contrast to prior approaches such as [12, 13, 14], our technique: 1) does not require
any changes to the batch scheduler, 2) does not depend on queue time prediction, and
3) does not require any changes to the application (e.g. moldability or malleability).
We evaluate our approach using real-world workloads. Preliminary results reveal
that our approach:
• on average reduces target job waiting & turnaround time by up to 69% & 48%
respectively;
• on average reduces system-wide job waiting & turnaround time by up to 39% &
27% respectively;
• promotes fairness in terms of waiting time between large and small jobs;
• lowers system fragmentation by up to 59%.
2.2 Elastic Job Bundling (EJB)
Elastic job bundling (EJB) is a software layer that operates between parallel application
end-users and HPC batch systems (see Figure 2.1). The goal of EJB is to reduce the
turnaround time of parallel applications, especially those that demand a large number of
processors. EJB accepts ordinary job requests and transforms them into multiple smaller
subjobs which can start earlier than the original job. Applications initially start running
on these smaller subjobs with degraded performance due to over-subscription. During
run time, the application will dynamically expand onto processors subsequently acquired
by EJB through additional subjob requests, as more resources become available.

Figure 2.1: Overview of Elastic Job Bundling.
2.2.1 The Formation of Elastic Jobs
Traditionally, one parallel application A is bound to a single job J , with fixed processors
P and run time estimation R, which can be expressed as A ↦ J = (P, R). A batch
scheduler will either allocate all of the P × R resource or keep the job waiting. This
all-or-nothing job scheduling strategy can lead to inefficiency. Consider the example in
Figure 2.2a. Because all job requests are rigid, the three jobs experience long waiting
time despite the presence of many idle processors. Intuitively, by changing the “shapes”
of the waiting jobs in a way that they can adapt to the dynamic workload, we can not
only reduce queue waiting time, but also improve resource utilization.
EJB implements this idea as follows: EJB treats a job request sent to it as a target
job Jt, and the application bound to Jt as the target application, At ↦ Jt = (Pt, Rt).
EJB tries to improve Jt’s turnaround time by first decomposing Jt into several smaller
subjobs Jx = (Px, Rx), x = 1, . . . , n, Px < Pt. For example, if the jobs in Figure 2.2a
were submitted to EJB, it would treat those jobs as target jobs and decompose them to
smaller subjobs which can start much earlier and increase utilization (Figure 2.2b).
Figure 2.2: Illustration of elastic jobs. (a) Rigid monolithic jobs experience long waiting time; job priority J1 > J2 > J3. (b) Subjob decomposition: Jxy are subjobs decomposed from monolithic job Jx. (c) Elastic job composition: Jex is the elastic job corresponding to monolithic job Jx.
Second, EJB “bundles” the resource allocations from the independent subjobs to
create an integrated malleable job Je = (J1, J2, . . . , Jn), called an elastic job, as Fig-
ure 2.2c shows. Third, EJB runs the target application continuously on the elastic job,
At ↦ Je, which will be discussed in the next section. At any point in time, the number
of total processors allocated to Je will be ≤ Pt since we maintain Pt processes of At at
all times.
A subjob looks like an ordinary parallel job to the batch scheduler. The prefix
“sub” is only meant to articulate a composition relationship between subjobs and the
integrated elastic job. The notation introduced in this section is summarized as
follows:
• target job: Jt = (Pt, Rt);
• target application: At;
• subjob: Jx = (Px, Rx), x = 1, . . . , n;
• elastic job: Je = (J1, J2, . . . , Jn).
2.2.2 Running Applications on Elastic Jobs
When running a target application on an elastic job, the number of processors allocated
to all concurrently running subjobs can change. The total duration of an elastic job
can be divided into intervals. The number of processors in each interval stays the same.
However, EJB does not change the number of parallel processes in the application.
Instead, EJB adapts the target application to the elastic job through over-subscription
and migration. Thus, the application structure or logic need not change.
Given At ↦ Jt = (Pt, Rt), we know that At has Pt processes. By running At
exclusively on Jt, with one process per processor, the run time of At is Rt:

pAt(Pt) = Rt    (2.1)

Suppose At is compute-bound with a balanced workload, which is typical of many SPMD
applications. Under over-subscription, At is run on q processors, q < Pt. Under an even
distribution, each processor is time-shared by up to ⌈Pt/q⌉ processes, where each process
on the same processor is given an equal share of the CPU. In this case, the expected
performance degradation is proportional to ⌈Pt/q⌉, such that:

pAt(q) = ∆ · ⌈Pt/q⌉ · Rt    (2.2)

where ∆ is a penalty factor that models the severity of the performance degradation due to
over-subscription. In the ideal case, ∆ = 1 indicates that the performance degradation
is linearly proportional to the degree of over-subscription.
Obviously, one processor can only support a limited number of processes for over-
subscription, due to memory constraints or context switching cost. We denote by Omax
an upper bound on the degree of over-subscription. For simplicity in this study, we
assume Omax to be the same for different applications, such that q ∈ [⌈Pt/Omax⌉, Pt]. Our
technique is applicable to more complex degradation models or to differing values of
Omax, but these are the subject of future work.
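For illustration, the degradation model of Equation (2.2), together with the bound on q, can be rendered in a few lines of Python (a sketch under the stated assumptions, not EJB's implementation):

    import math

    def p_At(q, P_t, R_t, delta=1.0, O_max=8):
        """Estimated runtime of a P_t-process application on q processors
        (Equation 2.2): each processor time-shares up to ceil(P_t/q) processes."""
        assert math.ceil(P_t / O_max) <= q <= P_t    # q in [ceil(Pt/Omax), Pt]
        return delta * math.ceil(P_t / q) * R_t

    # A 100-process application taking 800s on 100 processors is expected to
    # take about 3200s on 25 processors (4x over-subscription, delta = 1).
    print(p_At(25, P_t=100, R_t=800))  # 3200.0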
When a new subjob Jx is added to Je, EJB migrates a subset of At’s running
processes to Jx’s processors, lowering the degree of over-subscription. Before a running
subjob Jy terminates, EJB must migrate all the processes running in Jy to Je’s other
continuing subjobs. This type of cross-subjob migration can be performed in bulk, such
that all the migrated processes are migrated concurrently. At stops making progress
during migration intervals. We first assume that each bulk migration interval has a
fixed maximum duration of λ seconds. To evaluate the impact of migration cost, we
can vary λ.
Je has two types of intervals: RUN and MIGRATE. Suppose there are l intervals
in Je. In interval k (k = 1, . . . , l), Je has qk concurrently usable processors and interval
length=Lk. Now we model At’s progress on Je as:
∑_k Lk / pAt(qk) = 100%, where k = 1, . . . , l and k.type = RUN    (2.3)
Equation (2.3) can be understood as follows: the completion of At requires that the
accumulated progress made by every interval sums to 100%. Based on Equation (2.3),
we can: (i) estimate At's progress at any point during its run time, (ii) estimate the
time it takes to achieve a certain amount of progress, and (iii) given At’s current progress
and upcoming intervals, estimate At’s completion time.
Figure 2.3: Mapping a parallel application's processes to an elastic job, including progress measurement.

We demonstrate this progress model in Figure 2.3. In this example, a 4-process At is
submitted with Jt = (4, 800s). Je contains 3 subjobs: J1 = (1, 1460s), J2 = (1, 1060s),
and J3 = (2, 440s). Je has 7 intervals, the durations of which are marked below the
interval number in the figure. In interval 1, since only J1 is running, Je has 1 processor.
Each of the 4 processes of At over-subscribes the same processor in a time-shared manner,
such that At makes progress at a rate of 1/4. By the end of interval 1, when J2 starts
running, At's progress is 12.5%. With J2's 1 processor added, Je has 2 processors.
Interval 2 is a MIGRATE interval. Suppose its length λ = 20s, within which processes 3
and 4 are migrated to J2. Then in interval 3, At makes progress at a rate of 1/2. By the
end of interval 3, when J3 starts running, At's progress is 37.5%. Since J3 terminates
before At's completion time, 2 MIGRATE intervals, 4 and 6, are added. Processes 2 and 4,
which were migrated to J3 in interval 4, must be migrated back to J1 and J2 in interval
6 before J3 terminates. There is no over-subscription in interval 5, by the end of which
At's progress is 87.5%. Interval 7 is the last interval, by the end of which At's progress
is 100%; At then completes. Je's runtime is the summation of its intervals: 1460s.
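The interval accounting in this example can be replayed directly from Equation (2.3). The following Python sketch (ours, for illustration only) reproduces the progress figures above:

    import math

    def replay_progress(P_t, R_t, intervals, delta=1.0):
        """Sum L_k / p_At(q_k) over RUN intervals (Equation 2.3);
        MIGRATE intervals contribute no progress."""
        progress = 0.0
        for kind, q_k, L_k in intervals:
            if kind == "RUN":
                p = delta * math.ceil(P_t / q_k) * R_t   # p_At(q_k), Equation (2.2)
                progress += L_k / p
        return progress

    # The Figure 2.3 example: Jt = (4, 800s); seven intervals totaling 1460s.
    ivals = [("RUN", 1, 400), ("MIGRATE", 2, 20), ("RUN", 2, 400),
             ("MIGRATE", 4, 20), ("RUN", 4, 400), ("MIGRATE", 2, 20),
             ("RUN", 2, 200)]
    print(replay_progress(4, 800, ivals))   # 1.0, i.e. 100% progress
    print(sum(L for _, _, L in ivals))      # 1460, Je's total runtime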
2.2.3 Taming Unpredictability
EJB needs to control the sizes of subjobs to enable them to be scheduled early, and
to ensure that they overlap in run time to allow for migration. However, accurate
queue wait time prediction is known to be a difficult problem despite many efforts in
this area [15, 16, 17, 18, 19, 20, 21, 22]. We address this challenge by controlling the
shape P × R of the subjobs such that they can be immediately scheduled to run on the
fragmented idle resources. For example, production schedulers such as TORQUE [23] or
SLURM [24] are capable of providing information on immediately available resources
through user interfaces such as showbf or slurmbf.
Figure 2.4: Finding immediately usable resources under EASY backfilling. (a) The EASY backfilling algorithm. (b) Immediately backfillable job slots: type-I and type-II.

Table 2.1: Upper bounds on processors (Pmax) and runtime (Rmax) of the two types of immediately backfillable job slots.

  Slot type    | Pmax             | Rmax
  Type-I slot  | extra processors | unlimited
  Type-II slot | free processors  | Tshadow − Tnow

At first glance, one may think that it would be difficult to find sufficient idle
resources, especially on HPC clusters that are often over-committed. However, we argue
that a large factor contributing to job waiting time is the shape of the queued
jobs, as in the example given in Figure 2.2a. Due to its wide deployment, we present
how EJB can work with EASY backfilling [25] and later evaluate its performance.
We now briefly revisit the EASY backfilling algorithm. Each time the scheduling
algorithm runs, EASY tries to maximize utilization at that point of time, while only
guaranteeing the start time of the first job in the queue. The example in Figure 2.4a
shows that at time "now", three jobs are waiting and the number of available free
processors is less than the number required by the 1st job. EASY first loops over the
running jobs in the
order of their expected termination time, until the available processors are sufficient for
the 1st job, when the 1st job is guaranteed to start. EASY calls this time the shadow
time Tshadow. If the available processors at Tshadow exceed the processors required by the
1st job, the surplus processors are called extra. As a second step, EASY finds backfillable jobs
according to the condition that they do not delay the 1st job. In our example, both
the 2nd and 3rd job do not satisfy this condition, so they will keep waiting. If any
lower-priority job satisfies the backfill condition, they will be selected as backfill jobs to
start immediately, and they may add unbounded delay to the 2nd and 3rd job in our
example.
Figure 2.4b shows the upper bound, in both the processor and time dimensions, of the
shape of backfillable jobs. Table 2.1 lists these upper bounds, which can be spatially
imagined as slots with height = Pmax and width = Rmax; we call these immediately
backfillable job slots. There are two types. In a type-I slot, Pmax = extra processors,
and there is no upper bound on Rmax. In a type-II slot, Pmax = free processors. Simply
speaking, jobs submitted to fill a type-I slot can run on a smaller number of processors
with unlimited runtime, while jobs submitted to fill a type-II slot can run on a larger
number of processors, but with limited runtime.
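To illustrate how these slots could be derived from scheduler state (the kind of information that showbf or slurmbf expose), consider the following sketch; the data structures are our own simplification, not an actual scheduler API:

    def backfillable_slots(free, running, first_job_procs, now):
        """Compute the type-I and type-II immediately backfillable job slots
        under EASY backfilling. `running` is a list of (procs, expected_end)."""
        avail = free
        for procs, end in sorted(running, key=lambda r: r[1]):
            avail += procs
            if avail >= first_job_procs:          # the 1st queued job can start here
                shadow_time = end
                extra = avail - first_job_procs   # surplus at the shadow time
                # Type-I: up to `extra` processors, unlimited runtime.
                # Type-II: up to `free` processors, must end before the shadow time.
                return {"type1": {"Pmax": extra, "Rmax": float("inf")},
                        "type2": {"Pmax": free, "Rmax": shadow_time - now}}
        return None  # the first job can already start; no shadow time exists

    slots = backfillable_slots(free=16, running=[(32, 500), (64, 900)],
                               first_job_procs=100, now=0)
    print(slots)  # type-I: Pmax=12, unlimited; type-II: Pmax=16, Rmax=900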
2.2.4 Assumptions and Limitations
In summary, we made the following assumptions for EJB:
1. EJB targets the optimization of large tightly-coupled (such as MPI) parallel ap-
plications. Embarrassingly parallel applications, or bags of tasks, are comparatively easier to
schedule, since they do not require co-scheduled subjobs, nor cross-subjob migra-
tions.
2. Target applications are compute-bound, and not memory-bound. Otherwise, a
large memory footprint will prohibit processor over-subscription.
3. In this work, we assume that the underlying batch scheduler runs the EASY
backfilling algorithm, without additional priority control policies.
2.3 EJB Scheduling Algorithm
EJB runs a heuristic event-driven scheduling algorithm executed at four types of events:
1. TargetJobArrivalEvent : A target job is submitted to the EJB scheduler (EJB-
sched).
2. IdleJobSlotsAvailableEvent : New idle job slots become available.
3. SubjobStartEvent : A subjob starts running.
4. TargetAppCompleteEvent : The target application has run to completion.
The time at which the event happens is called Tnow.
2.3.1 TargetJobArrivalEvent Handler
When EJB-sched receives a job request At 7→ Jt = (Pt, Rt), EJB-sched first needs to
check if the shape Pt × Rt can be scheduled immediately by the batch scheduler. For
this purpose, EJB-sched submits a special subjob J0 = (Pt, Rt) to the batch scheduler.
If J0 starts running immediately, then we are done. Otherwise, EJB-sched creates a
new Je for At. EJB initializes Je as follows:
• Status: Je.Stat = WAIT ;
• Maximum processors needed: Je.Pmax = Pt;
• Currently usable processors: Je.Pcurrent = 0;
• List of subjobs: Je.SubjobList = [J0];
• List of available intervals: an interval is a data structure having the following
information:
– Type - [RUN |MIGRATE],
– Processors - concurrently usable processors,
– StartT ime - when the interval starts,
– Duration - how long the interval lasts,
– SubjobList - subjobs running during the interval,
– MigrationP lan - valid for MIGRATE interval,
initially Je.IntervalList = [empty];
Figure 2.5: Three cases for subjob submission in a waiting elastic job.
• Current progress of At: At.Progress = 0%;
• At's estimated completion time: At.Tc = ∞.
J0 functions as a placeholder in the batch queue.
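For concreteness, the per-elastic-job state listed above could be rendered as Python dataclasses (a sketch mirroring the field list, not the actual EJB code):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Interval:
        type: str                      # "RUN" or "MIGRATE"
        processors: int                # concurrently usable processors q_k
        start_time: float              # when the interval starts
        duration: float                # how long the interval lasts
        subjob_list: List[str] = field(default_factory=list)
        migration_plan: Optional[dict] = None   # only for MIGRATE intervals

    @dataclass
    class ElasticJob:
        stat: str = "WAIT"             # Je.Stat
        p_max: int = 0                 # Je.Pmax = Pt
        p_current: int = 0             # Je.Pcurrent
        subjob_list: List[str] = field(default_factory=list)   # starts as [J0]
        interval_list: List[Interval] = field(default_factory=list)
        progress: float = 0.0          # At.Progress
        t_c: float = float("inf")      # At.Tc, estimated completion time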
2.3.2 IdleJobSlotsAvailableEvent Handler
When Je.Stat = WAIT
EJB-sched checks whether the available type-I slot S1 and/or type-II slot S2 are big
enough to run the entire At. There are three cases to check when satisfying this
condition, as Figure 2.5 shows: (i) submit one subjob which fits S1, (ii) submit one
subjob which fits S2, and (iii) submit two subjobs to fit S1 and S2 respectively.
EJB-sched estimates At.Tc, if feasible, for each of the three cases, and then submits
the subjobs that produce the shortest estimated completion time. If none of the three
cases is met, EJB-sched does nothing.
Case 1: if ∃S1 and S1.Pmax ≥ Pt/Omax, then submit subjob J1 = (P1, L1) with
P1 = Pt / ⌈Pt / S1.Pmax⌉ and L1 = pA(P1), giving At.Tc = Tnow + L1.

Case 2: if ∃S2 and S2.Pmax ≥ Pt/Omax, then running on P1 = Pt / ⌈Pt / S2.Pmax⌉
processors gives L1 = pA(P1). If L1 < S2.Rmax, then J1 = (P1, L1) and At.Tc = Tnow + L1.

Case 3: if (i) ∃ both S1 and S2, (ii) S1.Pmax ≥ Pt/Omax, and (iii)
Pt / ⌈Pt / S2.Pmax⌉ > Pt / ⌈Pt / S1.Pmax⌉, EJB-sched may simultaneously submit two
subjobs such that At will (i) run on both subjobs in L1, (ii) migrate processes from J2
to J1 in L2, and (iii) resume in L3 while running only on J1, such that:

• P1 = Pt / ⌈Pt / S1.Pmax⌉,
• P2 = Pt / ⌈Pt / S2.Pmax⌉ − P1,
• L1 = S2.Rmax − λ,
• L2 = λ,
• L3 = (100% − L1 / pA(P1 + P2)) · pA(P1),
• J1 = (P1, L1 + L2 + L3),
• J2 = (P2, L1),
• At.Tc = Tnow + L1 + L2 + L3.
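The decision among these cases is simply an argmin over the feasible completion-time estimates; a small sketch of the selection step (the numbers are made up for illustration):

    def best_schedule(candidates):
        """Pick the case (1-3) with the earliest estimated completion time.
        `candidates` maps a case name to (At_Tc, subjobs), or to None when
        that case's feasibility conditions fail."""
        feasible = {k: v for k, v in candidates.items() if v is not None}
        if not feasible:
            return None                  # none of the three cases is met: do nothing
        return min(feasible.items(), key=lambda kv: kv[1][0])

    choice = best_schedule({
        "case1": (5200.0, [("J1", 25, 3200)]),           # (At.Tc, [(name, P, R)])
        "case2": None,                                    # e.g. type-II slot too short
        "case3": (4300.0, [("J1", 25, 2300), ("J2", 25, 980)]),
    })
    print(choice[0])  # "case3" -- the earliest estimated completion time wins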
When Je.Stat = RUNNING
EJB-sched checks whether adding more resources to Je could advance At.Tc. This is
not always true because in order to increase speedup after adding more processors, Je
needs to pay the price of migration. EJB-sched will only decide to allocate more subjobs
to Je when the benefit outweighs the cost. EJB-sched needs to evaluate at most three
possible schedules based on resource availability and the current status of Je. Basically,
Je.IntervalList will be updated with the newly available processors. EJB-sched can
then re-estimate the new At.Tc according to the updated Je.IntervalList:
Case 1: submit a new subjob Jx = (Px, Rx) with Rx = new At.Tc − Tnow. This case
applies for a type-I available job slot, or a type-II slot when the slot’s Rmax is sufficiently
long. This case instantly triggers a migration in which processes in existing subjobs are
partially migrated to Jx. All the subsequent intervals will increase their qk by Jx.Px.
Case 2: submit a new subjob Jx = (Px, Rx) with Rx < new At.Tc − Tnow. This case
applies for a type-II job slot with small Rmax. Besides triggering an instant expansion
migration, this case will also schedule a shrinkage migration before Jx terminates. The
Processors (qk) associated with every interval in Je.IntervalList between Tnow and
Tnow + Rx will be incremented by Jx.Px.
Case 3: submit two new subjobs Jx = (Px, Rx) and Jx+1 = (Px+1, Rx+1), Jx will run
until the recalculated completion time and Jx+1 will terminate earlier. This case is a
combination of cases 1 and 2.
Many fine-grained optimizations, such as combining or removing migration intervals, are
considered in our algorithm. For brevity, we omit how the shapes of Jx and Jx+1
are determined and how At.Tc is recalculated; please refer to [26] for details. Based
on the evaluation results in the above cases, EJB-sched will choose the schedule that
can produce the earliest completion time. Whenever new resources become available,
EJB-sched will call this event handler unless Je has reached full parallelism Je.Pmax.
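The benefit-versus-cost test reduces to comparing completion-time estimates with and without the migration penalty. A simplified sketch using the Equation (2.2) model (our illustration; the real algorithm evaluates the full interval list):

    import math

    def should_expand(P_t, R_t, progress, q_now, q_new, mig_cost, delta=1.0):
        """Expand only if paying `mig_cost` seconds of migration still yields
        an earlier completion than continuing on the current q_now processors."""
        p = lambda q: delta * math.ceil(P_t / q) * R_t    # p_At(q)
        t_stay = (1.0 - progress) * p(q_now)
        t_move = mig_cost + (1.0 - progress) * p(q_new)
        return t_move < t_stay

    # With 50% of the work left, doubling 2 -> 4 processors easily covers a
    # 120s migration; near completion, the same migration is not worth it.
    print(should_expand(4, 800, progress=0.5,  q_now=2, q_new=4, mig_cost=120))  # True
    print(should_expand(4, 800, progress=0.98, q_now=2, q_new=4, mig_cost=120))  # False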
2.3.3 SubjobStartEvent Handler
When a subjob starts, EJB-sched performs process migration as scheduled. However, if
the place holder job J0 starts, based on A’s current progress, EJB-sched has the options
of (i) migrating all running processes to J0, or (ii) continuing execution on existing
subjobs and cancel J0, or (iii) restarting At on J0 and discarding currently achieved
progress. EJB-sched will choose the option which can produce earliest completion time.
2.3.4 TargetAppCompleteEvent Handler
When the application terminates earlier than the projected finish time At.Tc, EJB-sched
will cancel all running subjobs. EJB-sched will also cancel J0 if it is still in queue.
2.4 Implementation
Figure 2.6 presents the architecture of the EJB system. At a high level, the EJB system
consists of two parts: the EJB Manager and the EJB Worker. The EJB manager can be
launched on any machine which has a connection to the HPC cluster’s front node. Users
of the EJB system can submit job requests to EJB-sched through an interface similar to
the batch submission. EJB-sched runs the scheduling algorithm described in the last
section. EJB-sched interacts with the batch scheduler only through ordinary calls such
as show job queue status, submit job, and cancel job. In order to control and
manage the elastic jobs, scheduling operations for all elastic jobs are placed on an
Operation Queue. There are two types of Operations: launch, which submits the
application with a computed degree of over-subscription, and migrate-dest, which
decides two things:
the group of processes that will be migrated, and the destination subjob that will receive
the processes. The EJB Controller is in charge of sending these operations to the EJB
Workers running in each subjob. This mode of operation can be seen as similar to the
pilot job [27], in which a resource is first acquired by a pilot job, and then tasks are
scheduled into that resource. In our case, when a subjob starts, the EJB worker will
direct the target application to perform the scheduled operations in that subjob. There
will be only a small number of messages sent between the EJB Controller and EJB
Worker throughout an elastic job’s lifecycle.
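Sketched in Python, the two operation types might be represented as simple records placed on the Operation Queue (hypothetical structures, not the actual implementation):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Launch:
        """Submit the application on a subjob with a computed degree
        of over-subscription."""
        subjob: str
        oversubscription: int          # processes per processor

    @dataclass
    class MigrateDest:
        """Decide which group of processes migrates and which
        destination subjob receives them."""
        processes: List[int]
        dest_subjob: str

    # The EJB Controller pops operations off the queue and forwards each one
    # to the EJB Worker running inside the relevant subjob.
    op_queue = [Launch("J1", 4), MigrateDest([3, 4], "J2")]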
Over-subscription is supported by most MPI implementations, including OpenMPI,
MPICH, and their derivatives. For example, the OpenMPI run-time environment detects
over-subscription and sets MPI processes to degraded mode, which means that they
yield processors when idle. In order to validate our performance degradation model
of Equation (2.2), we use the NAS Parallel Benchmarks (NPB [28]) and measure the
over-subscription performance using the FutureGrid [29] testbed.
Figure 2.6: EJB system architecture.

Figure 2.7: Performance measurement of six NPB programs under over-subscription (normalized runtime versus degree of over-subscription, with reference lines ∆ = 1 and ∆ = 1.57). All programs are compiled with problem size CLASS=C, with NPROCS=100 (bt, sp) and 128 (ft, is, lu, mg). Different problem sizes or NPROCS follow the same pattern.

In Figure 2.7, NPB programs of fixed problem size and amount of parallelism were
run on fewer processors to produce over-subscription. We measure the end-to-end execu-
tion time at different over-subscription levels. The measured times are compared against
one process per core. We can observe sub-linear (ft, is), linear (bt), and super-linear
(lu, mg, sp) performance degradations in Figure 2.7. Our key finding is that degradation
correlates with scalability. For example, the two programs with super-linear degradation,
lu and sp, also achieve super-linear speedup in our experimental cluster. ft and is, both
show sub-linear degradation and also show sub-linear speedup. bt shows both linear
speedup and degradation. The only exceptional case is mg, which shows slightly sub-
linear speedup, but super-linear degradation. For this type of parallel program, a more
conservative estimate of ∆ is needed.
Another key point is that the maximum number of processes on a single physical
processor is limited by memory size. For example, a parallel application will be killed
if its memory usage exceeds the physical memory size. Memory-bound parallel
applications require out-of-core techniques such as [30] to be able to run with EJB. The
actual benefits in this situation depend on what level of over-subscription is workable
and how big the performance degradation is.
To enable migration, we use DMTCP [31], a user-level checkpoint/restart tool for
parallel applications including MPI. DMTCP requires neither re-compilation of the
application nor system privileges. Migration includes three steps: global checkpointing,
moving checkpoint images, and restart.
DMTCP supports checkpointing by adding a checkpoint management thread to ev-
ery process at start-up time. During checkpointing, all application processes are simul-
taneously suspended and the checkpoint images are written to disk by the DMTCP
checkpoint management threads. For compute nodes equipped with a shared file system
(as on FutureGrid), explicitly moving the image is not required. For restart, each EJB
Worker is directed to resume its own group of suspended processes, thus completing the
migration.
Thus on FutureGrid, the migration time is predominantly spent on checkpointing,
which is determined by the parallel application’s total memory usage and disk-write
speed. For example, in our experimental cluster, checkpointing an NPB bt program of
size C and 100 processes takes about 30 seconds, generating in total 3GB checkpoint
images at 100MB/s. On clusters supporting parallel I/O, the migration speed could be
greatly accelerated. Ultimately, the migration time grows sub-linearly with application
scale represented by number of processes [32]. For example, when the application scale
increases to 2000 processes, the checkpoint time increases to 280s. In subsection 2.6.2,
we run a sensitivity analysis varying the migration cost from 60 to 600 seconds.
2.5 Trace-driven Simulation
We simulated EJB using logs of real parallel workloads from production systems [33] to
assess feasibility. Table 2.2 lists the 4 selected traces used by our simulation. These 4
traces have been widely used by previous studies of parallel job scheduling algorithms.
Our simulator is based on PYSS [34] – an event-based scheduler simulator developed
by the Parallel Systems Lab at Hebrew University.

Table 2.2: Traces used in our simulation

Log Files                 CPUs    Jobs      Duration   Uti%
CTC-SP2-1996-3.1-cln      338     77,222    7/96-5/97  85.2%
SDSC-SP2-1998-4.1-cln     128     59,725    4/98-4/00  83.5%
SDSC-BLUE-2000-4.1-cln    1,152   243,314   4/00-1/03  76.7%
KTH-SP2-1996-2            100     28,489    9/96-8/97  70.4%

In order to emulate how EJB works in practice, the simulator's EasyBackfillScheduler,
which functions as the cluster batch scheduler, is kept unchanged. Job traces contain
both job walltime and runtime: the former is the user-estimated run length; the latter
is the application's actual run length recorded after it terminates, with walltime ≥
runtime. In simulation, the job's actual runtime is unknown to EJB-sched, which
calculates the projected completion time based on the job's walltime. However, the
simulator keeps track of the actual progress based on the runtime, and triggers the
TargetAppCompleteEvent once the actual progress is 100% (see Equation (2.3)).
In theory, any job in the trace can be submitted to EJB-sched. Nevertheless, jobs
requesting only a few processors cannot be further optimized through over-subscription.
If they experience long waiting times, that indicates a truly high system workload,
and our approach cannot find free slots under this condition. We set the minimum
P of an eligible elastic target job to 8. We set the following default values: the
maximum degree of over-subscription Omax = 8, the migration duration λ = 120 s,
and the performance degradation factor ∆ = 1. Note that λ should vary with
application scale; 120 s is a conservative value, corresponding to the cost of
checkpointing an application with 400 processes [32], which is the upper bound of
application size in all of our traces. We use this conservative value to ensure the
allotted migration time is sufficient, preventing unfinished migrations. In practice,
the algorithm should take a varying λ as input, determined by application size.
Furthermore, each trace's first 1% of jobs, as well as jobs that terminate after the
last job arrival, are excluded from the performance analysis. This is a commonly used
technique to reduce the impact of warm-up and cool-down effects.
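A minimal sketch of this filter, assuming job records sorted by submission time with
submit_time and end_time fields, might look as follows:

    # Sketch: drop warm-up and cool-down jobs from a trace before analysis.
    def filter_for_analysis(jobs):
        warmup = len(jobs) // 100            # skip the first 1% of jobs
        last_arrival = jobs[-1].submit_time  # cool-down boundary
        return [j for j in jobs[warmup:] if j.end_time <= last_arrival]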
2.6 Evaluation
We evaluate EJB through a series of experiments based on simulation. Our baseline for
comparison is a system scheduler that runs EASY Backfilling only. Overall, the results
reveal the following performance benefits of EJB:
• elastic job performance is significantly improved;
• non-elastic job performance is either not impacted or slightly improved;
• system fragmentation is reduced;
• fairness between jobs of different sizes is promoted.
We start by carefully choosing appropriate performance metrics (Section 2.6.1). We
then measure how elastic jobs are improved (Section 2.6.2) and characterize migration
behavior and its overhead (Section 2.6.3). Finally, we study cluster-wide performance
when co-scheduling many elastic jobs together (Section 2.6.4).
2.6.1 Performance Metrics
The elastic job's turnaround time (tt) is measured from when the target job is
submitted to EJB-sched to the point when the target application completes, which is
also when all subjobs terminate. Dividing the baseline tt by the elastic job's tt gives
the speedup of turnaround time:

    S_{tt} = \frac{\text{baseline } tt}{\text{elastic job } tt}    (2.4)
The elastic job’s waiting time (tw) is measured from the target job’s submission to the
start of the first subjob of the elastic job. The elastic job’s run time (tr) is measured
from the time the first subjob belonging to the elastic job starts, to the time the elastic
job's last subjob terminates. The elastic job's bounded slowdown (Slo) is defined as

    S_{lo} = \frac{\text{elastic job } tt}{\text{baseline } tr}    (2.5)
Notice that we don’t use elastic job tr in calculating slowdown, for the reason that
slowdown should be compared against the runtime on a dedicated system, without
over-subscription and migration. Bounded slowdown substitutes a job's baseline tr
with 10 s when tr ≤ 10 s; this prevents super-short jobs from generating very large
slowdown values.

Table 2.3: Increase ('+') or decrease ('-') percentages of the mean wait, run, and
turnaround time of elastic jobs compared to the target jobs' baseline results.

                        percentage change of mean
trace   target jobs   tw        tr        tt [95% conf. interval]
CTC     16,167        -50.4%    +29.3%    -33.9%  [-34.7%, -33.1%]
SDSC    14,329        -68.9%    +57.6%    -48.4%  [-49.5%, -47.3%]
BLUE    64,090        -59.8%    +26.9%    -36.1%  [-36.7%, -35.6%]
KTH     4,399         -66.5%    +34.9%    -37.8%  [-39.7%, -35.9%]
AVG                   -61.4%    +37.2%    -39.1%
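As an illustration, these per-job metrics reduce to a few lines of Python; the following
is a minimal sketch (times assumed in seconds), not the simulator's actual code.

    # Sketch of the per-job metrics defined above (times in seconds).
    def speedup_tt(baseline_tt, elastic_tt):
        # Equation (2.4): speedup of turnaround time.
        return baseline_tt / elastic_tt

    def bounded_slowdown(elastic_tt, baseline_tr, bound=10.0):
        # Equation (2.5) with the bound applied: use at least 10 s for the
        # baseline run time so super-short jobs do not produce huge values.
        return elastic_tt / max(baseline_tr, bound)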
We measure system fragmentation as the average number of idle processors in the
cluster while the batch queue is not empty. This measurement excludes the period
when all the jobs in the cluster have received resources, yet there are still unallocated
processors. For example, if jobs never wait, then system fragmentation will always be 0,
independent of idle processors. As another example, if the scheduler is able to perfectly
fill all resources with jobs, then the system fragmentation is also 0.
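Concretely, given periodic samples of the cluster state, this fragmentation measure
could be computed as in the following sketch (the sampling representation is an
assumption for illustration):

    # Sketch: fragmentation as the mean number of idle processors over
    # the instants when at least one job is waiting in the queue.
    def fragmentation(samples):
        # 'samples' is an assumed iterable of (idle_procs, queue_len)
        # pairs taken at regular intervals over the simulated period.
        idle_while_waiting = [idle for idle, qlen in samples if qlen > 0]
        if not idle_while_waiting:
            return 0.0  # jobs never waited, or resources were perfectly filled
        return sum(idle_while_waiting) / len(idle_while_waiting)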
2.6.2 Improving Elastic Job Turnaround Time
As a first step towards evaluating EJB’s performance, we isolate EJB’s impact on a
single target job by simulating one elastic job in each run of our simulator. We then
compare the elastic job performance against the baseline of the target job. Target jobs
are all the jobs which have P ≥ 8 and baseline tw > 0. Note that we do not need to
know tw accurately, but simply whether it is non-zero; we can determine this by
observing whether J0 starts immediately. As Table 2.3 shows, we simulated ≥ 100,000
such jobs across the four traces combined. Figure 2.8a provides a clearer visual view
of how turnaround time has been improved.
Elastic jobs' mean tt is 39.1% lower than the baseline value, with variations between
traces. As expected, EJB results in significantly shorter tw (61.4% lower) at the
expense of longer tr (37.2% higher) due to over-subscription and migration. Detailed
distributions of Stt are depicted in Figure 2.8b, which shows that most target jobs
benefit from being elastic. Some exceptionally well-performing jobs complete 1,000
times faster than before, a quarter of the target jobs' tt are unchanged, and < 3% of
the elastic jobs experience worse results.

Figure 2.8: Elastic jobs' overall performance and variations. (a) Side-by-side view of
how turnaround time improves by transforming target jobs into elastic jobs (mean job
turnaround time in seconds, split into tw and tr, baseline vs. elastic, per trace).
(b) Cumulative distribution functions (CDFs) of the speedup of turnaround time (Stt)
of all elastic jobs, per trace (log scale).
Next, we perform a sensitivity analysis on Omax, λ, and ∆. Figure 2.9 shows the
results for the CTC trace only, as the other traces reveal similar trends. First, in
Figure 2.9a we vary Omax ∈ {2, 4, 8, 16, 32, 64}: the larger Omax, the greater the
benefit of EJB, although once Omax reaches 16, further increases bring no evident
performance gains.

Table 2.4: Summary of migration-related statistics: mean number of subjobs and bulk
migrations per elastic job, migration duration as a percentage of elastic job run time,
and resource overhead as a percentage of processor × time.

trace   subjobs   migrations   migration duration   resource overhead
CTC     3.3       2.1          8.5%                 19.7%
SDSC    2.8       1.7          5.6%                 16.4%
BLUE    2.7       1.5          8.6%                 18.2%
KTH     2.7       1.6          6.4%                 24.1%
Second, we evaluate whether the performance improvements are sensitive to the
migration cost. In Figure 2.9b, we vary λ from 1 to 10 minutes. The performance is
not very sensitive to the migration cost. This can be explained by the fact that the
average number of migrations is small and the target job's tt (e.g., 9 hours on average
in the CTC workload) is much larger than the migration time.
In Figure 2.9c, we vary ∆ from 0.8 to 2.0. The elastic job mean tt is not very
sensitive to degradation. This is due to (i) the fact that many elastic jobs are able to
find enough processors in the later stages of their life cycle, eliminating the
over-subscription overhead afterwards, and (ii) the fact that tw still accounts for a
considerable proportion of tt even with EJB.
2.6.3 Migration Behavior
Table 2.4 characterizes elastic job overhead with respect to the number of migrations.
The subjobs column shows that on average each elastic job consists of about 3 subjobs,
and conducts bulk cross-subjob migrations approximately twice. Actually, more than
60% of the elastic jobs contain more than one subjob, and around 40% of the elastic
jobs have experienced at least one bulk migration. In very rare cases, the number of
subjobs and migrations can reach > 20. This shows that the performance gain of EJB
is not only a result of moldability, but also the result of migrations.
The migration duration column shows that migration durations on average account
for 5 − 8% of an elastic job’s run time. Furthermore, extra CPU resources may be
spent due to migration and over-subscription. The resource overhead column shows
that elastic jobs have a 16-24% resource overhead, measured in processor × time. A
main factor contributing to this is inaccuracy in the tr estimates. Based on
user-provided trs, the EJB algorithm may decide that it is beneficial to perform
additional migrations; however, the real tr of these elastic jobs may be much shorter,
making those migrations unnecessary. To address this issue in future work, we could
adopt an approach similar to [21], estimating tr more accurately from historical job
information and making migration decisions based on the adjusted tr.

Figure 2.9: Sensitivity analysis of Omax, λ, and ∆ (CTC trace): elastic job mean tt
percent reduction as a function of (a) the maximum degree of over-subscription,
(b) the migration cost, and (c) the performance degradation factor.
Figure 2.10: Decision tree for elastic job selection: a candidate job requesting ≥ 8
processors and experiencing > 0 wait time becomes an elastic job; all other jobs remain
non-elastic.
2.6.4 Multiple Elastic Jobs
In Section 2.6.2, we analyzed how EJB impacts single job performance. In this section,
we try to understand the comprehensive performance impact when many elastic jobs
coexist, in effect competing for resources with each other and with other non-elastic
jobs. The following simulations are meant to emulate real-world conditions when users
arbitrarily submit job requests to EJB-sched.
The impacts of EJB are measured on (i) elastic jobs, (ii) non-elastic jobs, and (iii) all
jobs. Determining the impact by measuring how jobs perform differently after
introducing EJB can be tricky: each job's performance in terms of tt or tr depends
largely on the background workload during its lifetime, and from a single job's
perspective, that background workload could be totally different if EJB were deployed.
We solve this dilemma by applying a statistical method called Before-and-After
Comparison [35], which is designed to evaluate whether the performance change from
adding new features to a system is statistically significant. In our context, the method
works as follows: for each workload, we run the simulation twice, before and after
enabling EJB. Then, for each performance metric, we have a pair of results
corresponding to each job's before and after case. Next, we calculate a confidence
interval for the mean of the differences of the paired values. If this confidence interval
does not include zero, then we can conclude with a certain confidence that there is a
statistically significant difference before and after introducing EJB.
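A minimal sketch of this paired confidence-interval computation, using SciPy (the
per-job metric arrays are assumed inputs), might look as follows:

    # Sketch: confidence interval for the mean of paired differences,
    # as used by the Before-and-After comparison.
    import numpy as np
    from scipy import stats

    def paired_ci(before, after, confidence=0.95):
        diffs = np.asarray(after) - np.asarray(before)
        mean = diffs.mean()
        sem = stats.sem(diffs)  # standard error of the mean difference
        half = sem * stats.t.ppf((1 + confidence) / 2, len(diffs) - 1)
        return (mean - half, mean + half)  # excludes 0 => significant change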
First, we simulate an extreme condition by submitting all jobs that request at least
8 processors to EJB-sched. Table 2.5 shows the Before-and-After comparison results.
Notice that the number of elastic jobs is different from that in Table 2.3. We use the
decision tree in Figure 2.10 to determine which jobs will become elastic.
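The decision tree amounts to a two-condition predicate; a sketch (parameter names
are illustrative):

    # Sketch of Figure 2.10: a candidate job becomes elastic only if it is
    # wide enough to exploit over-subscription and actually has to wait.
    def becomes_elastic(processors, wait_time, min_p=8):
        return processors >= min_p and wait_time > 0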
Table 2.5: Before-and-after comparison; confidence intervals are calculated at the 95%
confidence level.

Workload  Job type     Jobs     Mean tt before/after (change %), conf. interval   Mean tw before/after (change %), conf. interval   Mean tr before/after (change %)
CTC       elastic      21,035   40,276 / 35,512 (-11.8%), (-5,091, -4,437)        31,695 / 25,863 (-18.4%), (-6,165, -5,498)        8,581 / 9,649 (+12.4%)
CTC       non-elastic  59,123   19,268 / 17,874 (-7.2%), (-1,503, -1,285)         7,008 / 5,614 (-19.9%), (-1,503, -1,285)          12,260 / —
CTC       all          76,446   25,049 / 22,728 (-9.3%), (-2,442, -2,201)         13,801 / 11,186 (-18.9%), (-2,737, -2,493)        11,248 / 11,542 (+2.6%)
SDSC      elastic      18,790   58,235 / 40,888 (-29.8%), (-17,880, -16,813)      48,468 / 29,001 (-40.2%), (-20,016, -18,917)      9,767 / 11,887 (+21.7%)
SDSC      non-elastic  40,333   10,589 / 8,676 (-18.1%), (-2,052, -1,775)         5,412 / 3,499 (-35.3%), (-2,052, -1,775)          5,177 / —
SDSC      all          59,123   25,731 / 18,913 (-26.5%), (-7,021, -6,615)        19,096 / 11,604 (-39.2%), (-7,701, -7,282)        6,636 / 7,309 (+10.2%)
BLUE      elastic      135,302  16,863 / 14,881 (-11.8%), (-2,067, -1,899)        12,015 / 9,626 (-19.9%), (-2,475, -2,302)         4,848 / 5,254 (+8.4%)
BLUE      non-elastic  105,560  4,096 / 3,186 (-22.2%), (-941, -878)              1,109 / 199 (-82.0%), (-941, -878)                2,987 / —
BLUE      all          240,862  11,268 / 9,756 (-13.4%), (-1,562, -1,463)         7,235 / 5,495 (-24.1%), (-1,791, -1,690)          4,033 / 4,261 (+5.7%)
KTH       elastic      5,811    32,457 / 25,632 (-21.0%), (-7,347, -6,302)        22,137 / 13,673 (-38.2%), (-9,010, -7,918)        10,320 / 11,959 (+15.9%)
KTH       non-elastic  22,392   11,523 / 11,565 (+0.4%), (-56, 141)               2,982 / 3,024 (+1.4%), (-56, 141)                 8,541 / —
KTH       all          28,203   15,836 / 14,463 (-8.7%), (-1,509, -1,236)         6,929 / 5,218 (-24.7%), (-1,853, -1,568)          8,907 / 9,245 (+3.8%)
All jobs submitted to EJB-sched are candidate jobs. EJB-sched only transforms a
candidate job into an elastic job when the job's original shape cannot be started
immediately. The increase in the number of elastic jobs (e.g., from 16,167 in Table 2.3
to 21,035 in Table 2.5 for CTC) indicates that when we saturate the cluster with
elastic jobs, a greater number of jobs are identified as eligible for elasticity: the mutual
influence between elastic jobs causes jobs that were originally inelastic (because
tw = 0) to become elastic. Nevertheless, the mean turnaround time of elastic jobs is
significantly reduced.
Table 2.5 shows that for all the workloads except KTH, wide use of EJB not only
results in shorter tt for elastic jobs, but surprisingly improves the response time of
non-elastic jobs, and the improvement is statistically significant. For KTH, elastic jobs
are also significantly faster than before. Non-elastic jobs in KTH are on average 0.4%
slower after EJB is applied; however, this degradation is not statistically significant,
since its confidence interval (−56, 141) crosses zero.
The performance results measured by bounded slowdown, shown in Figure 2.11, are
consistent with the turnaround time results in Table 2.5. The maximum slowdown
(too large to be shown in the graph) experienced by the most unlucky job also
decreases. [36] indicates that the mean turnaround time and the mean slowdown are
dominated by long and short jobs, respectively; thus EJB is not biased toward either
type of job. In fact, we observe that large jobs with short tr benefit greatly from EJB.
These jobs previously suffered long waiting times due to the height of their original
shape; EJB enables them to start earlier, hence they complete
in less time.

Figure 2.11: Bounded slowdown: side-by-side view before and after EJB is added,
grouped into elastic, non-elastic, and all jobs (per trace).
To evaluate how EJB promotes fairness, we performed a linear regression of all jobs'
tw over job size (Figure 2.12). We admit that tw does not have a strictly linear
correlation with job size; however, the trend is that larger jobs tend to wait longer.
Indeed, large jobs are known to suffer more than small jobs under scheduling policies
that optimize mean tt or slowdown [37]. Comparing the slopes of the regression lines
before and after EJB is added, the slope of tw under EJB is flatter, indicating less
sensitivity to processor size (i.e., more fairness). We have also measured the total
number of priority inversions, which drops by about 20% when EJB is applied,
providing further evidence of fairness.
Table 2.6 shows the measurement of fragmentation as defined in Section 2.6.1. The
result shows that with EJB, average system utilization is higher when there are jobs
in the queue, indicating that EJB uses idle processors to help queued jobs start sooner.
Finally, in Figure 2.13 we measure EJB performance while synthetically decreasing
or increasing system utilization by changing the jobs' arrival rate. The results show
that when cluster utilization is low, EJB performs similarly to batch scheduling; in
clusters with high utilization, however, EJB performs significantly better.
Figure 2.12: Linear regressions of tw (seconds) over job size (number of processors)
before and after EJB is added; a regression line closer to the x-axis indicates greater
fairness.
Table 2.6: Fragmentation: np is the average number of idle processors; % is the
percentage of idle processors in the cluster.

Trace   before           after            change
        np      %        np      %
CTC     33.7    10.0%    22.4    6.6%     -34.0%
SDSC    13.8    10.8%    5.9     4.6%     -57.4%
BLUE    129.5   11.2%    53.8    4.7%     -58.0%
KTH     16.0    16.0%    6.6     6.6%     -58.8%
2.7 Discussion
We have shown that EJB reduces large job turnaround time with minimal impact on
small jobs. We attribute this interesting phenomenon to EJB, in effect, homogenizing
the system workload by decomposing large jobs into smaller ones. Compared to larger
jobs, smaller jobs allow schedulers to allocate resources more quickly and improve load
balance [38]. Ultimately the performance improvement comes from reduced system
fragmentation: when workload is high, EJB lowers the average size of jobs; when
workload is low, EJB generates additional jobs to exploit idle resources.
Figure 2.13: Changing utilization: EJB is more resistant to high utilization (mean tt
vs. utilization, with and without EJB, per trace).

Another point worth discussing is: on an EJB-ready HPC cluster, when should EJB
be activated? Our view is that EJB can be dynamically switched on/off according to
system workload. Users can be given the option of specifying whether they would like
to pay slightly more resource quota in return for faster turnaround time. When the
batch queue length exceeds a certain threshold, the administrator could decide to
enable EJB to reduce wait time.
2.8 Related Work
Characterized by different patterns of resource usage, parallel jobs are categorized
into three types. Rigid jobs require a fixed number of processors. Moldable jobs can
be executed on several processor sizes; the actual number of processors is determined
at the start and never changes. Malleable jobs may change the number of processors
during execution. Bringing flexibility to parallel jobs to adapt them to system workload
has been extensively studied. The essentials of these studies are twofold: first, the
mechanism that allows a parallel job to use a different number of processors; second,
the scheduling strategy employed, such as moldability or malleability. This section
briefly compares EJB with several representative approaches.
2.8.1 Moldable Jobs
Cirne's work in [12, 13] relies on applications being moldable and job waiting times
being predictable to improve moldable job turnaround time. It chooses the job size
that is expected to produce the shortest tw + tr. The merit of this approach is that
it requires no system changes. Nonetheless, estimating job waiting time can be very
error-prone. Also, many applications are not moldable; for example, some applications
can only be decomposed into restricted degrees of parallelism, such as powers of two.
Moreover, by definition, moldable jobs cannot grow to a larger resource footprint to
gain further speedup even when free resources become available after the job starts
running.
Commercial cluster schedulers like Moab support moldable job requests, in which
the user provides several options for job size and walltime; the scheduler chooses
whichever option can be met first. This is similar to our approach, but the application
must be moldable, and migration to enable expansion of parallelism is not supported.
We evaluated this situation by setting the migration cost to infinity, and the
performance was shown to be inferior to EJB due to the lack of adaptation to
additional resources.
2.8.2 Malleable Jobs
Malleable (or adaptive) jobs have the attractive property that they dynamically adapt
to system workload [39]. ReSHAPE [40, 41] is a framework that supports dynamically
changing the number of processors of iterative, structured (2-D decomposition)
applications, expanding or shrinking the processor count according to the system
workload to select the number of processors that yields the best efficiency. The merit
of this work is the implementation of a library capable of dynamically mapping data
onto different numbers of processors; however, the user needs to insert primitives into
the code to indicate resizing points, whereas our approach does not require application
modification. Tightly-coupled malleable applications are difficult to implement and
require runtime support at the system level. Utrera et al. [14] proposed a job
scheduling strategy based on virtual malleability: processes within the same node can
be over-subscribed to use fewer processors, such that the freed processors can be
allocated to queued jobs. However, their approach is based on the assumption that they can
deploy their own scheduler to control the cluster, while our approach does not require
any change to the system scheduler. Also, the within-node migration approach cannot
expand a running application onto other available physical nodes.
2.9 Conclusion
We have presented elastic job bundling (EJB), a new resource allocation strategy for
large parallel applications. EJB decouples the one-to-one binding between parallel appli-
cations and jobs, such that one application can run simultaneously on multiple smaller
jobs. By transforming one large job into multiple smaller ones, faster turnaround time
is possible, especially on HPC clusters with high workload. We simulated our
algorithm using real-world job traces and showed that EJB can (i) reduce target jobs'
mean turnaround time by up to 48%, (ii) reduce system-wide mean job turnaround
time by up to 27%, and (iii) reduce system fragmentation by up to 59%. We have also
presented an implementation that realizes this approach.
We have made the EJB code available on GitHub [42], so that anyone interested
can obtain the complete algorithm code and reproduce our experimental results.
Chapter 3
Dynamically Negotiating
Capacity Between On-demand
and Batch Clusters
3.1 Introduction
The recent improvements in experimental devices, ranging from light sources to sensor-
based deployments, lead not only to the generation of ever larger data volumes but to the
need to support time-sensitive execution that can be used effectively in the management
of experiments, observations, or other activities requiring quick response turnaround.
This means that the small, dedicated analysis clusters used by many experimental
communities are no longer sufficient, and their users are increasingly looking to expand
their capacity by integrating high performance computing (HPC) resources into their
workflows. This presents a challenge: how can we provide on-demand execution within
HPC clusters, which today are operated mostly in batch mode?
The inspiration for this project was provided by scientists from the Advanced Photon
Source (APS) at the Argonne National Laboratory (ANL). APS is currently operating
a cluster dedicated to experiment support: the execution of jobs run on the cluster has
to be completed in the shortest time possible; thus the need for dedicated resources.
However, as the experiments increasingly require greater processing power, an interest
arose in using HPC resources so long as they can be provisioned on demand in a cost-
effective manner and with environments suitable to APS computations. This conflicts
with the modus operandi of HPC resources today, which are usually available via batch
schedulers maximizing utilization and thus amortization of expensive resources, and do
not provide environment management. In this chapter, we propose a solution to this
use case.
This chapter presents the design and evaluation of the Balancer: a service that dy-
namically moves nodes between an on-demand cluster configured with cloud technology
(in our case OpenStack) and an on-availability cluster configured with a batch sched-
uler (in our case Torque) as the need for on-demand availability changes. The ability
to integrate commodity, generally used technologies was an important requirement of
our design. Another requirement was to make it as non-invasive as possible, i.e., to
not kill or checkpoint running batch jobs as we have done in [43], or rely on specialized
adjustments to scheduling policies of the existing tools. We propose three different al-
gorithms for moving nodes between the on-demand and batch partitions and evaluate
them, first experimentally in the context of real-life traces representing two years of a
specific institutional need, and then via experiments in the context of synthetic traces
that capture generalized characteristics of potential batch and on-demand traces.
Our results, based on a real-life scenario, show first that combining capacities and
workloads of on-demand and batch clusters can provide sufficient capacity to satisfy all
on-demand requests while reducing the dedicated portion of the cluster by 82%, im-
proving the mean batch wait time almost by an order of magnitude (8x), and improving
the overall utilization as well. Second, we show that in the general case we can
non-invasively support bursty on-demand workloads of up to 10% of the capacity of
the cluster they share with the batch workload, achieving higher combined utilization.
In summary, this chapter makes the following contributions:
• We describe an architecture and implementation for dynamic non-invasive resource
reassignment between two systems: a system providing resources on-demand and
a system providing resources based on availability, that balances their respective
objectives in terms of the number of satisfied on-demand requests and utilization.
• We propose three algorithms for balancing resources in this context: the Basic
algorithm providing a baseline of our systems, the Hint algorithm that models the
behavior where experimental users can register the upcoming need for on-demand
cycles, and the Predictive algorithm for cases where such advance notice is not
possible.
• We evaluate these algorithms for different Balancer behaviors, to understand what
workloads we can successfully balance under this system, using two years of traces
from an experimental and a mid-scale cluster at Argonne. We show that we can not
only support the existing use case with dedicated resources significantly reduced,
but also scale the bursty on-demand workload to up to 10% of the capacity of the
cluster it shares with the batch workload in a non-invasive way.
3.2 Approach
The inspiration for this project was provided by scientists from the Advanced Photon
Source (APS) at the Argonne National Laboratory (ANL). APS provides a facility for
experiments in many scientific domains. To support them, it operates a cluster dedicated
to experimental analytics; the execution of jobs on this cluster is typically critical and
has to be completed in the shortest time possible, and thus jobs typically run on
demand. Since the cluster is only used periodically, it is not well utilized. Recent
experiments demonstrated
great utility of using HPC resources on demand; however, those HPC resources are
typically managed by batch schedulers that ensure good utilization required to amortize
the cost of expensive resources but do not ensure on-demand access in a cost-effective
manner.
Infrastructure-as-a-Service cloud technologies, such as OpenStack, have been a pop-
ular solution for on-demand access as they also provide environment management via the
deployment of virtual machines (VMs) or containers. We propose to use those existing
cloud technologies and provide a system that will combine them with HPC schedulers
in a non-invasive way, by arbitrating resource assignment between them. Specifically,
the system will meet the following objectives:
• Inject on-demand and environment management for the on-demand resources into
batch clusters such that we can schedule as many on-demand requests as possible
with as little impact on utilization as possible (i.e., maximizing utilization while
minimizing the number of rejected on-demand requests).
• Provide a solution in terms of existing commodity frameworks for both on-demand
and batch, such as OpenStack or Torque, such that the user’s interface to those
systems does not change and the changes to the systems themselves are minimal
though flexible.
• The solution should be minimally invasive in terms of interference with the normal
operation of the batch scheduler. We will not e.g., kill or checkpoint/snapshot jobs
in order to make room for on-demand requests as we have done in [43] or rely on
the availability of specialized queues with smaller sized jobs that can be used for
backfilling [7].
3.2.1 Leases
To explain our approach, we will use the concept of a lease, defined as a temporary
ownership of resources, taking place between a well-defined start time and end time.
In this chapter, we will differentiate leases based on their start time; the end time may
be bounded (e.g., assumed or specified to last a specific amount of time) or unbounded
(used until terminated by an event).
We define two types of leases, one reflecting the concern of users who are interested
in controlling the start time of their computations, and the other reflecting the concern
of the providers, who are interested in optimizing the utilization of their resources:
• An on-demand lease starts within a window of time W after the request has
been made and may or may not have a defined end time. Time W is typically
understood to be short, e.g., under a minute or two, and may comprise actions
such as virtual machine deployment and boot. This startup time can be arbitrary
and can include some system management, e.g., terminating jobs in order to make
room for a lease. On-demand requests are the most common type of request in
compute clouds and are implemented by all major cloud providers.
• An on-availability lease starts whenever the provider makes resources available for
the lease. Examples of on-availability leases include resource assignments given
out by a batch scheduler, high throughput leases implemented by systems such
as SETI@home [44], or spot pricing leases implemented by Amazon EC2 [45].
Since the lease may not (and generally does not) start immediately, the request is
typically placed on a queue; the provider selects it for resource allocation based
on a variety of concerns that generally favor increasing utilization but may also
take other factors into account (e.g., EC2 spot pricing).
3.2.2 Architecture
Our approach is to soft-partition nodes in a large cluster into two scheduling pools, an
on-demand pool and an on-availability pool, and to implement a mechanism that will
dynamically move nodes from one pool to the other to maximize our objectives.
In keeping with our assumptions, both resource managers (on-demand and on-
availability) are independent of each other; nodes in the on-demand pool are managed
by an on-demand resource manager (ODRM) while nodes in the batch pool are managed
by an on-availability resource manager (OARM). The users of each resource manager
use their respective interfaces to request resources; they are not affected by the presence
of the other resource manager, except as by having some requests rejected or delayed
due to changes in resource availability.
Our architecture (Figure 3.1) consists of a service, called the Balancer, which nego-
tiates adjustments in the respective sizes of the on-demand and on-availability resource
pools with ODRM and OARM. Implementing the Balancer as a separate service, dis-
tinct from both resource managers, allows us to implement bilateral negotiation and
also to implement the system with minimal changes to both resource managers. The
Balancer understands the status of each node in the whole resource, as well as whether
at any given moment they belong to the ODRM or OARM pool. However, the Balancer
only manages which nodes belong to which pool; the scheduling decisions are left to
ODRM and OARM. The boundaries between the pools are re-evaluated by the Bal-
ancer on an ongoing basis, negotiating with each scheduler for the availability of nodes
in the respective clusters.
The on-demand pool contains a group of nodes called the reserve R, which may be
set to zero. The reserve represents nodes that cannot be moved to the batch pool and
is intended to ensure that the system has up-front capacity to schedule resources when
on-demand requests come in.

Figure 3.1: High-Level Architecture

Otherwise, the division between the on-demand and on-
availability parts of the cluster is fluid and constantly re-evaluated. Another parameter
of the system is the time window W , which defines how long the system can wait before
scheduling an on-demand request.
In the context of this chapter, the Balancer implements a simple one-way negotia-
tion which requests nodes from the OARM as needed; nodes from the ODRM are only
contributed by the resource manager itself. Under the current assumptions, execution
on the nodes in the on-availability pool has to be finished before they are contributed to
the Balancer; this means that the Balancer’s request for nodes from OARM to be con-
tributed in the allotted time may be unsuccessful. Ultimately, this negotiation protocol
can be extended to implement more complex constraints.
The interaction with OARM and ODRM takes place via the following interfaces.
Balancer Interface:
• request nodes(int n): request n additional nodes from the Balancer (the decision
of which specific nodes from the on-availability pool should be made available to
the ODRM is made by the Balancer)
• release nodes(node list): release specific nodes to the Balancer
• update nodes(node states): attempts to update status of specific nodes. Can
return an error if the status update is incorrect. This interface is used by OARM
when: (1) it attempts to run a job on a node it believes is in the OARM pool
and calls this interface to avoid a race condition with Balancer reclaiming this
node for ODRM at the same time; and (2) it finishes executing a job, giving the
opportunity for the Balancer to reclaim it if needed.
OARM Interface:
• reclaim node(nodename): reclaim a specific node, identified by its hostname,
from the OARM pool to the Balancer
• restore node(nodename): restore a specific node, identified by its hostname, from
Balancer to OARM
• get status of all nodes(): returns a list of data structures that for each node
describes if it is busy executing a job, free, or offline, and if it is executing a job
what is the remaining wall time.
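For illustration, these two interfaces could be captured in the following Python sketch;
the signatures mirror the lists above, while the concrete types are assumptions.

    # Sketch of the negotiation interfaces between the Balancer and OARM.
    from abc import ABC, abstractmethod

    class BalancerInterface(ABC):
        @abstractmethod
        def request_nodes(self, n: int) -> list:
            """Request n additional nodes; the Balancer picks which ones."""

        @abstractmethod
        def release_nodes(self, node_list: list) -> None:
            """Return specific nodes to the Balancer."""

        @abstractmethod
        def update_nodes(self, node_states: dict) -> None:
            """Report node status changes; may signal an invalid update."""

    class OARMInterface(ABC):
        @abstractmethod
        def reclaim_node(self, nodename: str) -> None:
            """Take the named node out of the OARM pool for on-demand use."""

        @abstractmethod
        def restore_node(self, nodename: str) -> None:
            """Return the named node to the OARM pool."""

        @abstractmethod
        def get_status_of_all_nodes(self) -> list:
            """Per-node status (busy/free/offline, remaining wall time)."""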
3.2.3 Algorithms
Basic algorithm
The objective of the Basic algorithm is to implement a simple mechanism whereby
the Balancer requests nodes from the OARM as on-demand requests come in and uses
reserve as well as wait time to “pad” availability. Algorithm 1 shows the pseudo-code.
At any point in time, a node is in one of four states: OD_Reserve, OD_Alloc,
OA_Idle, or OA_Busy.
R is the number of nodes that are statically reserved for the ODRM pool. When
an on-demand request comes in, the Balancer allocates nodes from: (1) OD_Reserve
nodes, (2) OA_Idle nodes, and (3) OA_Busy nodes whose jobs finish before time W.
A request is rejected if the Balancer cannot allocate n nodes before time W, or
immediately when W = 0.
Input: R (default = 0), W (default = 0)

Function request_nodes(n):
    nr ← number of nodes currently in OD_Reserve state
    ni ← number of nodes in OA_Idle state
    if nr ≥ n then
        allocate n OD_Reserve nodes
        change node state to OD_Alloc
        return node_list
    else if nr + ni ≥ n then
        reclaim_nodes(n − nr)
        change node state to OD_Alloc
        return node_list
    else if W = 0 then
        return Rejection
    else
        reclaim_nodes(ni)
        wait for W seconds
        foreach received update_nodes message do
            reclaim_nodes(1)
            if n nodes can be allocated then
                change node state to OD_Alloc
                return node_list
        if W expires then
            return Rejection
            reclaimed nodes are kept in OD_Reserve state for I seconds
            before release to the OARM pool

Function release_nodes(node_list):
    foreach node in node_list do
        change node state OD_Alloc → OD_Reserve
        if node is not statically reserved then
            keep node in OD_Reserve state for I seconds before release
            to the OARM pool

Algorithm 1: Basic algorithm
Hint algorithm
The Hint algorithm is a refinement of our original attempt at the Basic algorithm
and reflects the fact that in experimental communities it is often possible to determine
resource need within a short time (15-30 minutes), though it may not always be possible
to pinpoint it to a particular time days in advance. This allows us to implement a
dynamic reserve (i.e., a reserve that changes according to the situation).
Both functions request nodes and release nodes stay the same as in the Basic
algorithm. The Balancer introduces another interface, add reserve(H, N), directing
the Balancer to add N extra reserve nodes before time H. We essentially parameterize
the Hint algorithm by two parameters: the time H of the "advance notice" or "hint",
and the number of nodes N requested by the user or a third party. Note that no nodes
are statically reserved in the Hint algorithm; any node that has been in the
OD_Reserve state for more than I seconds is released to the OARM pool.
Predictive algorithm
The Predictive algorithm is another refinement of the Basic algorithm that also
implements the notion of a dynamic reserve, but for situations where advance notice is
not possible. The Predictive algorithm is run by an out-of-band predictor that invokes
the add reserve interface on behalf of the users. The predictor collects historical data
from the Balancer and uses this history to predict future on-demand requests.
Our Predictive algorithm is based on three observations about the arrival times of
on-demand requests. Firstly, the arrival times follow a strong diurnal pattern, which
can be explained by the interactive nature of the APS workload. Secondly, the arrival
times show moderate correlation between adjacent weeks: if there is a burst of
requests during several hours of one week, a similar burst is likely during the same
hours of the next week. Thirdly, there are sometimes bursts of requests during the
same hours of consecutive days. Our Predictive algorithm is described as follows:
The predictor divides each day into four 6-hour slots. At the end of each time slot,
the predictor queries the Balancer for how many nodes were requested during the time
slot. At the beginning of each time slot, the predictor invokes add reserve(0, N) to
reserve N nodes, where N is estimated from the peak number of nodes requested
during the same time slot of the last month, week, and day.
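One plausible reading of this estimate, as a sketch (the history accessor is a
hypothetical helper):

    # Sketch: reserve estimate for an upcoming 6-hour slot as the peak
    # number of nodes requested in the same slot of the previous month,
    # week, and day.
    def estimate_reserve(history, slot):
        return max(history.peak_requested(slot, days_ago=d) for d in (30, 7, 1))

The predictor would then call add reserve(0, estimate_reserve(history, slot)) at the
beginning of each slot.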
3.2.4 Implementation
Our implementation of the Balancer is configured to work with the Torque resource
manager [46] and the Maui cluster scheduler [47], used as OARM, and OpenStack (with
the KVM hypervisor [48]) as ODRM. It consists of a simple web service developed using
the Flask Python micro web framework, separately from either OpenStack or Torque. It
offers an HTTP endpoint capable of receiving resource requests as well as notifications
of resource status changes. To move nodes between the on-demand and on-availability
pools, the Balancer enables or disables them in Torque using the pbsnodes command
with arguments -o to disable or -c to enable.
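In Python, this amounts to shelling out to pbsnodes, e.g.:

    # Sketch: moving a node between pools by toggling it in Torque.
    import subprocess

    def disable_in_torque(node):
        # Mark the node offline so Torque stops scheduling onto it,
        # effectively lending it to the on-demand pool.
        subprocess.run(["pbsnodes", "-o", node], check=True)

    def enable_in_torque(node):
        # Clear the offline flag, returning the node to the batch pool.
        subprocess.run(["pbsnodes", "-c", node], check=True)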
The update nodes interface is implemented for Torque nodes as prologue and
epilogue scripts, which are triggered when a job starts and ends execution, respectively
(whether successfully or not). These notifiers make HTTP requests to the Balancer in
order to update its record of resource status, i.e., of nodes available for stealing. No
other changes were required to integrate Torque into the system.
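A minimal sketch of such an epilogue notifier follows; the Balancer URL and payload
shape are assumptions (Torque passes the job id as the script's first argument):

    #!/usr/bin/env python
    # Sketch of a Torque epilogue notifier: report job completion to the
    # Balancer so this node becomes a candidate for reclaiming.
    import socket
    import sys

    import requests

    BALANCER_URL = "http://balancer.example:5000/update_nodes"  # hypothetical

    if __name__ == "__main__":
        job_id = sys.argv[1]
        requests.post(BALANCER_URL, json={"node": socket.gethostname(),
                                          "job_id": job_id,
                                          "state": "job_finished"})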
In order to make OpenStack work with the Balancer, we made small modifications
to the OpenStack implementation: the scheduler (Nova) requests more resources from
the Balancer if it does not have enough available for scheduling the virtual machines
requested by on-demand users (using request nodes), and resources are released to the
Balancer when instances are terminated (using release nodes). We also had to fix
concurrency issues in the scheduler when using large wait times, which can make many
independent resource requests block and then resume execution at the same time.
3.3 Experimental Evaluation
We conduct our experimental evaluation in two stages. We first evaluate our approach
using the Basic algorithm in the context of a real-life scenario defined by two years
worth of traces reflecting the needs of on-demand and batch jobs at the Argonne National
Laboratory; this gives us insight into realistic demand and submission patterns. Second,
we use synthetic traces to generalize the problem and evaluate and compare the three
algorithms we formulated.
Our overall experimental methodology consisted of emulating the actual runs by
submitting traces of on-demand and batch requests to OpenStack and Torque configu-
rations respectively, on a cluster managed by the Balancer. The OpenStack submissions
use a mechanism called FakeDriver which, instead of launching a real VM, generates
the suitable internal events that track resource consumption. The Torque submissions
use a "sleep" script that runs for the duration of the job walltime.
3.3.1 Evaluating a Real-Life Scenario
To evaluate our approach we first ask the question: how would it fare under existing
shared on-demand/on-availability workloads in real-life computational centers? To an-
swer this question, we combined both workloads and resources of two systems used at
the Argonne National Laboratory (ANL). The on-demand side is represented by work-
loads ran on a small cluster in the Advanced Photon Source (APS) used for analytics
supporting real-time experiments; hence the need for immediate execution. The batch
side is represented by a general purpose mid-scale batch cluster in Laboratory Comput-
ing Resource Center (LCRC). Given this context, a more specific version of our question
is: if we combined both the on-demand/on-availability workloads and the resources cur-
rently executing those workloads under our approach, what advantages or disadvantages
would we observe?
To create a combined APS/LCRC workload we combined two years worth of job
execution traces from APS and LCRC (between 2013-10-06 and 2015-09-05). We first
mapped the job execution trace from APS onto on-demand VM deployment requests in
OpenStack as follows. At any APS job start/stop event, we evaluated how many one-
core jobs should be running and how many 16-core VMs would be needed to support
them, assuming that jobs would be tightly packed. If more or fewer VMs would be
needed as a result of a job start/stop request, an on-demand VM deployment request or
termination event would be generated. We then combined the APS on-demand trace and
the LCRC batch trace using the same start time for both, such that the VM deployment
requests are submitted to OpenStack and batch job requests are submitted to Torque.
To create a combined APS/LCRC cluster we proceeded as follows. The LCRC
cluster comprises 304 homogeneous 16-core nodes. The APS cluster consists of 57
heterogeneous multi-core nodes amounting to a total of 1092 cores. Since cores represent
the main scheduling concern in our experiment, we modeled APS capacity as 68 16-
core nodes (a total of 1088 cores, close to the actual 1092 capacity). The combined
APS/LCRC cluster is thus modeled as 304 + 68 = 372 16-core nodes.
We now set out to replay the combined APS/LCRC workload on a model of the
APS/LCRC combined cluster. Since we could not replay two years worth of traces in
real-time, we scaled down the experiment in space and time. To scale it in space, we
created an experimental environment that mapped each of the 372 combined cluster
nodes onto a Docker container, each with a unique hostname and IP address, connected
by an overlay network. An additional container represented the controller node. We
deployed the Docker containers on the Chameleon testbed [49] version 53, using the 24-
core 128 GiB RAM Xeon Haswell compute nodes, such that 24 containers were mapped
to each node. To scale the experiment in time, we mapped hours to minutes (i.e.,
accelerated 60x). Finally, we eliminated the ramp-up effect by preloading the cluster
with running jobs.
This still left us with a potentially very long experiment, so instead of replaying
two years worth of traces we focused on one week that would represent the greatest
challenge to our system. In the case of the batch trace, we defined "challenging" as low
average node availability across 60-second periods, measured every second. In the case
of the on-demand trace, we defined "challenging" as high total resource usage coming
from on-demand requests, calculated as a sum over the product of the time used by a
job and the number of cores on which the job was running. We picked the week with
the highest sum of usage and inverse of availability.
We now ran the experiments using traces from the most challenging week reflecting
the modifications above, such that the modified APS trace was submitted to OpenStack
and the LCRC trace was submitted to Torque. We measured the following qualities:
• Average utilization, defined as usage over time
• Mean batch wait time, defined as the time between when the job is submitted and
when it starts running
• Number of on-demand rejections, or reject rate, calculated as the ratio between
number of rejections and number of requests
Table 3.1 summarizes the results of this experiment in both static and dynamic
configurations. The shaded column in the static section reflects the existing scenario in
which the APS and LCRC clusters are separate: the LCRC cluster has 1002.8 minutes
mean batch wait time and the APS cluster has no rejections. A hypothetical scenario
where 100% of the combined resources are devoted to batch workload shows that the
lower bound of batch wait time for this trace is 122.5 min.
In the dynamic section of the table, we see the results of seven scenarios reflecting
different combinations of parameters R and W . We notice that the utilization of the
combined cluster improves by 4.8 to 5.6% across all dynamic scenarios, with mean
batch wait time decreasing by 85 to 88%; this is due to the fact that we can now utilize
the previously idle nodes of the dedicated on-demand cluster. However, there are 30
on-demand rejections when we choose R = 0 and W = 0; we can decrease them by
increasing either one of the parameters or both. From a practical perspective, the most
interesting observation is that the challenging week yielded no rejections for R = 12
nodes which corresponds to roughly 18% of the on-demand cluster: this means that
under the Basic algorithm, we could reduce our investment in hardware for the on-
demand cluster by 82% and still have all on-demand requests satisfied. Further, the
mean batch wait time under this scenario is almost the same as the lower bound for the
combined cluster established in the static column; this brings significant benefits to the
batch side as well.
Another scenario with no rejections occurs for R = 6 and W = 10; this means that
an on-demand request would execute within 10 minutes which is a relatively long wait
time in the context of this use case. This has been deemed not useful for our problem
formulation and thus we don’t explore on-demand wait time further in our experiments.
Table 3.1: Experimental results for the most challenging week: 24,177 batch jobs and
141 on-demand requests are submitted in each experiment. Wait times are measured
in minutes and reserve values are given in nodes. For the dynamic case, the on-demand
and batch utilization refer to the portion of utilization coming from on-demand and
batch requests, respectively.

                             Static (Baseline)       Dynamic
Dedicated batch nodes        372     304     0       --     --     --     --     --     --     --
Dedicated on-demand nodes    0       68      372     --     --     --     --     --     --     --
W (wait window, min)         --      --      --      0      5      10     0      5      10     0
R (reserve, nodes)           --      --      --      0      0      0      6      6      6      12
Combined utilization         84.4%   80.1%   1.25%   84.9%  85.7%  85.7%  85.3%  85.3%  85.3%  85.3%
Batch utilization            84.4%   78.8%   NA      84.5%  84.4%  84.4%  84.0%  84.1%  84.0%  84.0%
On-demand utilization        NA      1.25%   1.25%   0.38%  1.25%  1.25%  1.25%  1.25%  1.25%  1.25%
Batch wait time (min)        122.5   1002.8  NA      122.0  147.0  147.0  150.0  140.6  150.4  130.0
Rejections                   141     0       0       30     3      3      1      1      0      0
3.3.2 Evaluating Balancer Algorithms
We next asked the question: how does our system perform in a generalized scenario?
What would happen if the on-demand or on-availability workloads were different—
larger or composed of a different mix of applications than in our real-life scenario?
In general, we sought to discover the relationship between the cluster capacity, the
type of workload, and configuration parameters or Balancer algorithms we would need
to employ to accommodate the on-demand workload while running the on-availability
workload undisturbed. We answered these questions by generating synthetic traces
representing both on-demand and batch workloads and running experiments with those
traces. In order to preserve continuity with our real-life experiments, we continue to
use the 372-node cluster as a base and each experiment represents one week.
Generating Synthetic Batch Workloads
We create five synthetic batch workloads as follows:
The Mainstream workload represents the “mainstream” workload condition in
the LCRC cluster. The workload is derived by randomly sampling 1% of all the jobs in
the LCRC traces. We retain the node number, walltime, and runtime of each job. Each
job’s submission time is calculated as the time offset from the beginning of the week
which the job is selected from, so that they add up to one week’s worth of submissions.
Since the Mainstream workload has a lower utilization than the real-life workload
described in Section 3.3.1 (66.5% versus 78.8%), we also generated workloads with
higher utilizations of 77% and 88%; we name the original U66-Main (U66 for short)
and the others U77-Main and U88-Main, respectively. The higher-utilization workloads
are generated by injecting additional jobs into the U66-Main workload.
The Wide workload (U66-Wide) is designed to model a workload composed of rel-
atively large parallel jobs. We derive the Wide workload directly from the mainstream
workload (U66-Main) by doubling the number of nodes of each job and randomly remov-
ing approximately half of the jobs to maintain close to the same aggregate utilization.
The Narrow workload (U66-Narrow) is designed to represent a workload com-
posed of small parallel jobs. To generate it, we split each job from the mainstream
workload (U66-Main) into two smaller jobs, each requiring half the number of nodes
(thus the utilization stays the same).
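The derivations of the Wide and Narrow workloads from U66-Main can be sketched
as follows (representing jobs as dicts with a 'nodes' field is an assumption; timing
fields are carried over unchanged):

    # Sketch: deriving the Wide and Narrow workloads from U66-Main.
    import random

    def widen(jobs):
        # Double each job's width and keep roughly half the jobs,
        # leaving aggregate utilization approximately unchanged.
        return [dict(j, nodes=j["nodes"] * 2) for j in jobs
                if random.random() < 0.5]

    def narrow(jobs):
        # Split each job into two half-width jobs (utilization unchanged).
        halves = []
        for j in jobs:
            half = max(1, j["nodes"] // 2)
            halves += [dict(j, nodes=half), dict(j, nodes=half)]
        return halves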
Generating Synthetic On-demand Workloads
Similar to the synthetic batch workloads, we create synthetic on-demand workloads by
abstracting workload patterns from real-life traces. In particular, we seek to preserve
their burstiness, corresponding to periods when an APS experiment occurs and causes
demand for time-sensitive computation. We thus reuse the VM leases' submission
times and durations from the challenging week. Since the utilization (denoted by ρ) of
the challenging week's on-demand workload is 1.25%, we achieve higher utilizations by
multiplying the number of nodes in each lease by 2x, 4x, 8x, 16x, and 24x; thus the ρ
of the synthetic workloads equals 2.5%, 5%, 10%, 20%, and 30%, respectively. These
synthetic workloads preserve the burstiness of the real-life trace while exerting much
higher pressure on the Balancer. For example, when ρ = 30%, the peak arrival rate of
on-demand requests is 264 requests per minute.
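The scaling itself is a one-liner over the lease trace, sketched below (lease records as
dicts is an assumption):

    # Sketch: scale the on-demand trace's utilization by multiplying each
    # lease's node count; submission times and durations (and hence
    # burstiness) are preserved.
    def scale_leases(leases, k):
        return [dict(lease, nodes=lease["nodes"] * k) for lease in leases]

    # k in (2, 4, 8, 16, 24) yields rho of about 2.5%, 5%, 10%, 20%, 30%.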
Result Analysis of the Basic Algorithm
The experiments are similar to experiments in the previous section; we use various com-
binations of traces submitted to OpenStack and Torque respectively, having preloaded
the cluster with running jobs to mitigate the ramp-up effect. Figure 3.2 shows the
performance results of running the five batch workloads with six on-demand workloads
(x-axis) with the Basic algorithm (zero reserve). We skipped combinations for which
the sum of batch and on-demand workloads exceeds the capacity of the cluster.
Figure 3.2 shows that rejection rates are influenced by both the shape of batch jobs
in the trace and their density (i.e., batch utilization). While for the U66-Narrow trace
we don’t see rejections with ρ as high as 10%, this threshold drops to 5% as jobs become
wider, and the rejection rate stays firmly above zero for every ρ value for other traces.
Mean batch wait time follows a similar pattern as both smaller jobs and less utilization
make it easier for batch jobs to be scheduled. Our explanation for how batch job shape
affects performance is that it is easier for the batch scheduler to schedule narrower jobs
than wider ones. Thus jobs finish earlier such that more space will be left open when
on-demand requests arrive.
Figure 3.2: Performance results of the Basic algorithm for five batch workloads and six
on-demand workloads (x-axis: ρ, %), R = 0, W = 0: (a) reject rate (%), (b) batch wait
time (min), (c) combined utilization (%).
The utilization patterns strongly follow the utilization of the batch traces, although
all go up slightly as more on-demand jobs are added. Without any additional
configuration, our approach is thus able to support on-demand workloads demanding less than
10% of cluster capacity, depending on the shape and utilization of batch jobs. To put
this number in perspective, ρ from our real-life scenario was an order of magnitude
lower.
To make the Basic algorithm work for larger on-demand workloads, we need to use
the static reserve. We thus rerun the Basic algorithm with increasing R and observe the
trend of the rejection rate dropping. Note that since performance is determined mainly
by utilization rather than job shape, we use only the mainstream batch workload for
the rest of this chapter.
Figure 3.3 illustrates what happens for ρ = 10%. We see that a relatively small
increase in the value of R (30 or 60 nodes, depending on on-demand trace density) can
decrease the number of rejections by a significant factor. However, to reduce rejections
to or near zero, we need a reserve of 120 nodes, roughly a third of the cluster. The
negative effect of reserving more nodes is that batch job performance becomes significantly
worse (exponentially worse for ρ > 10%). This is reflected in the combined utilization,
which goes down with increased reserve for batch traces requiring higher capacity. A
high static reserve is thus a very expensive solution for accommodating on-demand
workloads higher than 10%; to find a better solution, we turn to the Hint algorithm.
Result analysis of the Hint algorithm
To run experiments with the Hint algorithm, we used a program that simulates a user
notification to the Balancer in advance of actual request arrival. We used two values
for this advance notice: 15 minutes and 30 minutes (H15 and H30, respectively).
Recall that our traces follow a real-time experimental pattern in which a user would be
able to make such a notification.
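We do not reproduce the Balancer's actual notification interface here; a minimal sketch
of what such a hint amounts to, with hypothetical names, is:

    import time

    def send_hint(balancer, nodes_needed, lead_time_min):
        # Hypothetical call: tell the Balancer that `nodes_needed` nodes
        # will be requested roughly `lead_time_min` minutes from now, so it
        # can start assembling a dynamic reserve of that size.
        balancer.register_hint(nodes=nodes_needed,
                               expected_arrival=time.time() + lead_time_min * 60)

    # H30 scenario: notify 30 minutes ahead of a 40-node on-demand burst.
    # send_hint(balancer, nodes_needed=40, lead_time_min=30)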
Figure 3.4 shows the rejection rates for the Hint algorithm. With ρ = 10% and given
a 30-minute hint, we get zero rejection rates for U66 and U77 and near zero (< 1%)
rejection rate for U88. With a slightly shorter advance notice of 15 minutes, we get
near zero (< 1%) rejection rate for U66 and low rejection rates (less than 4%) for U77
and U88. In comparison, the Basic algorithm evaluated in the previous section needed
[Figure 3.3 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: R (reserve nodes); series: U88, U77, U66.]
Figure 3.3: Performance results of the Basic algorithm with static reserve, three batch
workloads, ρ = 10% on-demand workloads.
a static reserve of 120 nodes, i.e., almost a third of the cluster, to achieve the same
rejection rate. A relatively accurate but short-term estimate of resource need can thus
be used to activate a dynamic reserve that is effectively equal to a static reserve of
120 nodes. Since a static reserve typically means purchasing and operating a cluster set
aside for on-demand experimental support, this observation has significant potential for
creating on-demand capacity.
At the same time, the impact on the batch workload is much lower: for U66, mean
batch wait time at R = 120 is over 9 hours, while the same measurement for a hint of
30 minutes is merely 50 minutes. This is because the dynamic reserve implemented by
the Hint algorithm acquires the nodes only when they are known to be needed and for
as long as they are needed. To understand how effective it is, we looked at how much
time nodes spent in the reserved state without being used. With the Basic algorithm at
R = 120, this time is 837,546 minutes, whereas in the H30 case it is only 34,695 minutes,
a reduction of 96%. This significantly increases the flexibility of the system, as nodes
are free to be allocated to the most pressing tasks.
Another benefit of the Hint algorithm is that it improves combined utilization; most
importantly, the combined utilization goes up rather than down, as in the case of a high
reserve. In particular, batch utilization stays approximately the same, meaning that
adding the on-demand workload did not hurt batch performance overall. The increased
utilization comes from more on-demand workload being scheduled, e.g., the biggest boost
occurs at (ρ = 30%, H30), where on-demand utilization increases by 5.1% compared to
(U66, ρ = 30%, Basic algorithm). This is also the first time the combined utilization goes
above 85% (for U66 and ρ = 30%), demonstrating that we can indeed better reconcile
the concerns of on-demand and on-availability workloads.
Result analysis of the Predictive algorithm
Sometimes, it is impossible to get a reliable estimate of an incoming bursty workload.
In those situations, we apply a heuristic algorithm as described in section 3.2.3 to
predictively adjust the reserve. To evaluate our algorithm, we first ran the predictor
offline using historical data and then ran live experiments using the predictor's output.
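Section 3.2.3 gives the actual heuristic; as a rough, illustrative stand-in, a reserve sized
from a moving average of recently observed on-demand demand captures the flavor of
such a predictor (all names below are ours):

    from collections import deque

    class MovingAverageReserve:
        # Illustrative stand-in for the Predictive algorithm: size the
        # dynamic reserve from a moving average of recent on-demand demand,
        # inflated by a headroom factor to absorb bursts.
        def __init__(self, window=12, headroom=1.5):
            self.samples = deque(maxlen=window)   # e.g. 12 five-minute bins
            self.headroom = headroom              # over-reserve factor

        def observe(self, nodes_requested):
            self.samples.append(nodes_requested)

        def reserve_size(self):
            if not self.samples:
                return 0
            avg = sum(self.samples) / len(self.samples)
            return int(avg * self.headroom)

The headroom factor is exactly the kind of over-reservation discussed below: it lowers
the reject rate at the cost of holding more nodes idle.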
Figure 3.5 shows that the Predictive algorithm performs as well as the Hint algorithm
in terms of reject rates. When the on-demand workload is low (<5%), batch wait
[Figure 3.4 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: ρ (%); series: U88, U77, and U66, each with H30 and H15.]
Figure 3.4: Performance results of the Hint algorithm, H = 15 min or H = 30 min,
three batch workloads, six on-demand workloads.
[Figure 3.5 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: ρ (%); series: U88, U77, U66 under the Predictive algorithm.]
Figure 3.5: Performance results of the Predictive algorithm, three batch workloads, six
on-demand workloads.
time achieved by the Predictive algorithm is comparable to that of the Hint algorithm.
However, when the on-demand workload becomes higher, batch wait time is 1-4 times
longer than in the Hint algorithm (H = 30 min). This can be explained by the fact that,
without additional information, the predictor can only approximately estimate on-demand
request arrival times. To lower the reject rate, our Predictive algorithm over-reserves
nodes, indicating that it is significantly less efficient at estimating when the nodes will
be needed.
Figure 3.5 also shows that the differences in utilization compared to the Hint al-
gorithm are relatively small for low ρ and are driven primarily by batch utilization
and over-reserved nodes. Thus, with larger ρ and consequently more time spent in
reserve, the increase in on-demand utilization is not big enough to offset the drop in
batch utilization. Unless there is a predictor that can accurately predict user behavior
and make precise estimates of on-demand request arrival, the Predictive algorithm does
not perform as well as the Hint algorithm in balancing batch and on-demand performance,
even though it performs better than the Basic algorithm.
3.3.3 Elasticity Analysis
We extend our evaluation by modeling our Balancer approach under the elastic schedul-
ing framework. The Balancer algorithms (Basic + reserve, Hint, Predictive) aim at pro-
visioning resources in a batch-dominant cluster to match the demand of on-demand
workloads. To capture the precision of elasticity, we propose the following definitions
and metrics:
• ∑O is the accumulated amount of overprovisioned resources, e.g., reserved but
unused
• C is the number of resources in the entire system
• o = ∑O / C is the percentage of resources that have been overprovisioned (for the
on-demand workload) in the system
• F is the number of on-demand resource allocation requests that have been rejected
• f is the percentage of on-demand requests that have been rejected
Figure 3.6: Measurements of elasticity based on the U77 and ρ = 20% workloads.
The elasticity of a Balancer algorithm can be evaluated by the two metrics {o, f}. The
goal is to minimize both metrics, such that an algorithm satisfies all on-demand requests
without over-reserving resources. Out of the available workload combinations, we use one
combination (U77 batch + ρ = 20% on-demand) to demonstrate how our algorithms
achieve elasticity. The results are shown in Figure 3.6. First, the baseline algorithm
(static partitioning) results in f = 0, but at the cost of over-reserving resources at more
than 11% of the cluster's capacity. Then, for the Basic algorithm, starting from R = 0
(the point depicted at the upper-left corner of the graph), increasing R gradually reduces
f at the cost of increasing o. In essence, R can be seen as the knob we use to tune the
trade-off between f and o. Finally, the Hint algorithm performs nearly as well as the
optimal algorithm, with both f and o close to zero, which represents the best result the
Balancer approach can achieve.
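Computing the two metrics from experiment logs is straightforward; a minimal sketch,
assuming ∑O and C are accumulated in the same units (e.g., node-minutes over the
experiment), is:

    def elasticity_metrics(sum_o, capacity, rejected, submitted):
        # o: fraction of the system's resources held in reserve but unused
        #    (sum_o and capacity in the same units, e.g. node-minutes)
        # f: fraction of on-demand requests that were rejected
        o = sum_o / capacity
        f = rejected / submitted
        return o, f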
3.4 Related Work
Several research groups have been exploring the suitability of the cloud environment
for HPC applications in terms of virtualizing HPC execution (e.g. Palacios [50]), en-
abling a cloud interface for grid computing (e.g. Globus [51], Magellan [52]), and using
on-demand/on-availability leases [53, 54]. Other work has focused on combining HPC
and cloud systems in a hybrid environment to enable cloud bursting of HPC workload
from HPC clusters to the public cloud to meet deadline constraints of HPC applica-
tions [55, 56, 57]. To reduce the cost of using public clouds, a number of groups have
proposed using cheaper yet unreliable spot instances for executing HPC applications
[58, 59, 60, 61, 62, 63]. Spot instances suffer from volatility due to
price fluctuation and some work has proposed prediction methods to calculate statistical
availability guarantees [63]. Our work differs from the hybrid cloud paradigm in two
ways. First, it bursts on-demand applications to HPC clusters in a controlled fashion
to meet the requirements of both on-demand and HPC batch applications. Second,
unpredictable start times are avoided by reclaiming resources from the on-availability
HPC cluster to convert them to on-demand resources.
In the realm of executing mixed workloads, cluster and data center operators have
configured resource schedulers to improve facility utilization through co-scheduling of
multiple batch and latency-critical workloads. In the HPC community, prior efforts such
as Marshall et al. [43] target improving utilization of private IaaS clouds by opportunisti-
cally backfilling VMs on idle nodes which are not in use by on-demand requests, allowing
HTC workloads to run on backfilled VMs. The backfilled VMs do not support start time
constraints and are preemptible, unlike our work. SpeQuloS [64] explores providing QoS
for executing Bag-of-Tasks applications on opportunistic grids or cloud spot instances.
More recent works [65, 66] aim to improve the value of reclaimed cloud capacity by
providing Service Level Objectives (SLOs) or guarantees for their use. TR-spark [67]
proposed a big-data analysis framework customized to exploit transient cloud servers
for Spark applications. Other systems have addressed the problem of reconciling con-
flicts incurred at finer-grained resource sharing (e.g., Heracles [68] and Morpheus [69]).
Elastic schedulers such as [70] enable slots to be shared across applications to meet
their SLOs and improve utilization. In this environment, slots can be taken away from
loosely-coupled applications at run-time (e.g. Hadoop), but this is not applicable to
the HPC environment. Through combining batch and on-demand requests, the Bal-
ancer also achieves utilization improvements but within the operating constraints of an
HPC environment where batch applications are first-class citizens. The Balancer does
not preempt running batch jobs to enable on-demand job execution. Additionally, due
to the node-exclusive requirement of most HPC applications (unlike in a data center
environment), the Balancer does not consider node-level sharing between batch and
on-demand workloads, thus performance conflict is not a major concern in our work.
Finally, our work differs from prior work in that it enables resources to be dynami-
cally shared between batch and on-demand schedulers. Thus, it is not another cluster
scheduler that manages only the flow of jobs, but it also manages the flow of resources
from one class of service to another. Mesos [71] is most similar to our work. Mesos
adopts a two-level scheduling model in which (1) the master offers available resources
to frameworks (e.g., Torque), and each framework can either accept or reject an offer,
and (2) each framework scheduler schedules its own tasks onto the accepted resources.
The Mesos
master plays a similar role as the Balancer in cross-framework resource allocation. How-
ever, unlike the Balancer, Mesos does not support time-bounded resource allocation nor
performance-aware resource reclamation. Both mechanisms are critical in the Balancer’s
target environment. Similarly, Google’s Borg [72] and open-source Kubernetes enable
the co-scheduling of mixed workloads, but do not adhere to the specific constraints in
our HPC environment, namely, that batch jobs are the main tenant and run exclusively
on allocated nodes. Thus, the Balancer must operate with fewer degrees of freedom than
these general-purpose schedulers. For this reason, we opted to design a new scheduling
system targeted to HPC and on-demand environments.
3.5 Conclusion
We proposed a model reconciling the needs of on-demand and batch workloads within
one system in a non-invasive way, i.e., by operating on cycle stealing rather than disrupt-
ing job execution. The model consists of a lightweight Balancer service that dynamically
arbitrates resource usage between an on-demand and on-availability scheduling frame-
work and can be adapted to existing technologies, such as OpenStack or Kubernetes for
on-demand, or Torque or Slurm for batch.
Based on a real-life scenario representing two years’ worth of on-demand and batch
workloads at Argonne National Laboratory, we demonstrated that by using our model
on existing resources we could reduce the current investment in on-demand infrastruc-
ture by 82%, while at the same time improving the mean batch wait time almost by an
order of magnitude (8x). By exploring how our model behaves under various configura-
tions and workloads, we found that it performs best in scenarios where the on-demand
workload represents less than 10% of the overall capacity (our real-life usage example
needed only 1.25%). When trying to increase this limit, we found that a relatively short
(15 to 30 minutes) advance notice of resource need is as effective as placing a static
reservation on a third of the cluster, which has significant implications for resource
usage and cost. In cases where it is not possible to obtain such advance notice, a sim-
ple prediction algorithm provides a reasonable compromise, yielding near-zero rejection
rates with acceptable resource usage.
Chapter 4
The Bundle Service for Elastic
Resource Scheduling in HPC
Environments
4.1 Introduction
Large-scale scientific projects rely on distributed applications combining the resources of
multiple HPC infrastructures. These applications are designed to draw on the power of
heterogeneous computational, storage, and networking architectures, utilize different
geographical locations, and exploit the varying performance and availability of diverse
platforms. However, these design goals are extremely difficult to fully achieve due to
the complexity of aggregating the different resource scheduling mechanisms and interfaces
of the underlying infrastructures in a way that is transparent to users and application
developers.
Various HPC resources are heterogeneous in their architectures and interfaces, are
optimized for specific applications, and enforce tailored usage and fairness policies. In
conjunction with temporal variation of demand, this introduces resource dynamism, e.g.,
time-varying availability, queue time, load, storage space, and network latency. Hid-
ing these factors from an application executing on multiple heterogeneous and dynamic
resources is difficult due to the complexity of choosing resources and distributing the
application’s tasks over them.
The AIMES 1 project (DE-FG02-12ER26115, DE-SC0008617, DE-SC0008651) ad-
dresses the above limitations by integrating abstractions representing distributed ap-
plications, resources, and execution processes into a pilot-based middleware. The mid-
dleware provides a platform on which distributed applications can be specified and
executed on multiple resources and under different configurations.
In this chapter, we present our contribution to the AIMES project: a study of
devising resource abstractions and dedicated services that characterize resource capacity
and capabilities and use this information to improve resource scheduling decisions in
dynamic and heterogeneous HPC environments. We implement the abstraction and
service in middleware, and use experimental evaluation to show the benefits of our
methodology:
• Our resource abstraction uniformly and consistently describes the core properties
of distributed computing resources.
• Our dedicated service draws insights on how to make better scheduling decisions.
Section 4.2 discusses the abstraction we defined. Section 4.3 discusses the imple-
mentation of the abstraction into a service. Section 4.4 presents the experimental eval-
uations. In Section 4.5, we provide a brief summary of related work. Section 4.6 reviews
the presented work.
4.2 The Bundle Abstraction
Scientific applications deployed in distributed, heterogeneous environments rely on accu-
rate and comprehensive resource characterization to guide their resource couplings from
end to end. Conceptually, resource couplings involve two aspects: the static coupling
and the dynamic coupling. The coupling is static when users select resources based on
the characteristics including capacity, performance, policies, and cost. Frequently, this
process depends on the user's knowledge and past experience, and the decisions are made
on an ad hoc basis. On the other hand, the coupling is dynamic when a user or software
1 An Integrated Middleware Framework to Enable Extreme Collaborative Science
[Figure 4.1 appears here: applications sit atop the resource abstractions (resource
representation and a resource interface with query, monitor, and discover operations),
which span compute (with memory), network, and storage resources across local
clusters, HPC clusters, and clouds.]
Figure 4.1: Overview of the Bundle layer.
monitors resource status and triggers resource adjustments when needed. This proce-
dure is not routinely practiced due to resource heterogeneity and dynamism. Both
static and dynamic resource couplings require systematic ways of constructing resource
characterizations.
Prior works [73, 74] have shown that resource abstraction is a powerful methodology
for resource characterization and resource discovery. We propose the Bundle abstraction
(or, in short, Bundle) to bridge applications and heterogeneous resources via uniform
resource characterizations.
The core concept of the Bundle abstraction is the Resource Bundle, which contains
an integrated group of resources. A Resource Bundle can include multiple resource
types, including compute, storage, and network. A Resource Bundle does not own its
resource components – any resource component may be shared across multiple Resource
Bundles. A Resource Bundle provides a convenient handle for aggregated query and
monitoring.
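As a concrete illustration (the field names below are ours, not those of the
implementation in Section 4.3), a Resource Bundle can be modeled as a named grouping
of references to shared components:

    from dataclasses import dataclass, field

    @dataclass
    class ResourceBundle:
        # Illustrative model: a bundle references shared components rather
        # than owning them, so the same compute queue, storage system, or
        # network link can appear in several bundles at once.
        name: str
        compute: list = field(default_factory=list)
        storage: list = field(default_factory=list)
        network: list = field(default_factory=list)

    queue = ["stampede/normal"]                     # a shared component handle
    b1 = ResourceBundle("bundle-a", compute=queue)
    b2 = ResourceBundle("bundle-b", compute=queue)  # shares the same queue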
Bundle comprises two parts: (1) Resource Representation and (2) Resource Inter-
face, which are depicted in Figure 4.1. Resource representation characterizes hetero-
geneous resources with a large degree of uniformity, thus hiding complexity. Resource
representation models resources across three basic categories: compute, network, and
storage. Given that memory is mostly assigned together with processors, Bundle treats
memory as an attribute of the compute resource. Measurements that are meaningful
across multiple platforms are identified in each category. For example, the property
“setup time” of a compute resource means queue wait time on an HPC cluster or virtual
machine startup latency on a cloud [75].
The resource interface exposes information about resource availability and capa-
bilities via an API. Two query modes are supported: on-demand and predictive. The
on-demand mode offers real-time measurements, while the predictive mode offers fore-
casts based on historical measurements of resource utilization rather than queue waiting
time, which is extremely hard to predict accurately [21, 15, 22].
The resource interface exposes three types of interface: querying, monitoring, and
discovering. The query interface uses end-to-end measurements to organize resource
information. For example, the query interface can be used to inquire how long it would
take to transfer a file from one location to a resource and vice versa. Although file
transfer times are difficult to estimate [76], proper tools [77] are capable of providing
estimates within an order of magnitude, which are still useful.
The monitoring interface can be used to inquire about resource state and to choose
system events for which to receive notifications. For example, performance variation
within a cluster can be monitored so that when the average performance has dropped
below a certain threshold for a certain period, subscribers of such an event will be
notified. This may trigger subsequent scheduling decisions such as adding more resources
to the application.
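A minimal sketch of such a subscription, with hypothetical API names, might look as
follows:

    def on_perf_drop(event):
        # React to the notification, e.g. ask the application scheduler
        # for more resources on another platform.
        print("average performance below threshold on", event["resource"])

    # Hypothetical subscription: fire when average performance stays below
    # 80% of nominal for 10 minutes (names are assumptions, not the API):
    # monitor.subscribe(event="avg_perf_below", threshold=0.8,
    #                   duration_min=10, callback=on_perf_drop)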
The discovery interface, which is future work, will let the user request resources
based on abstract requirements so that a tailored bundle can be created. A language
for specifying resource requirements is being developed. This concept has been shown to
be successful for storage aggregates in the Tiera project [74], where resource capacities
and resource policies are specified in a compact notation. A similar concept of a
resource-matching language has also been adopted in related works [78, 79, 73].
[Figure 4.2 appears here: applications and a workload manager access the Bundle API;
a Bundle Manager aggregates Resource Bundles from BundleAgents deployed on HPC
clusters, workstations, grids, and clouds.]
Figure 4.2: An overview of the Bundle architecture; blue shaded components comprise
the Bundle software.
Table 4.1: BundleAgent supported platforms

  FutureGrid testbed clusters:  India, Xray, Hotel, Sierra, Alamo
  XSEDE HPC clusters:           Stampede, Trestles, Gordon, Blacklight
  NERSC HPC clusters:           Hopper
  OSG grids:                    Most of the open sites
4.3 Implementation
Bundle is implemented as a loosely-coupled distributed software system consisting of
four components (see Figure 4.2). The first component is BundleAgent, which is de-
signed to work with individual platforms. BundleAgent collects the configuration of
each platform and constantly monitors the dynamic status of each platform. Bundle
software deploys a BundleAgent instance on each platform. There are two types of
BundleAgents: the LocalBundleAgent and the RemoteBundleAgent. A LocalBundleAgent
runs inside the resource it monitors. If local deployment is prohibited by policy, a
RemoteBundleAgent is deployed on a remote site. We have implemented BundleAgent
on an expanded set of resources across multiple organizations including XSEDE [80],
NERSC, and OSG [81] (Table 4.1).
The second component in the Bundle architecture is the BundleManager, which controls
Table 4.2: BundleAPI

  General
    get_list       Get a list of all the accessible resources.
    get_config     Get the current configuration of a resource, including
                   number of nodes, queues, and policy constraints.
  Compute Resources
    get_workload   Get the real-time workload of a platform, including
                   node and job status.
  Network Resources
    get_bandwidth  Get the bandwidth of the network connection between
                   two distributed resources.
and aggregates information from BundleAgents. The BundleManager maintains historical
data and performs various kinds of data processing.
The third component in the Bundle architecture is the ResourceBundle, which combines
a group of resources that are used together to execute an application. A ResourceBundle
describes resources using properties that are commonly applicable across heterogeneous
platforms. For example, HPC clusters are partitioned into different job queues, where
each queue manages a certain number of nodes, whereas in a grid, nodes are partitioned
into different sites. So when queried for configuration and capacity, the ResourceBundle
organizes information for both HPC and grid platforms based on queue/site partitions.
The same abstraction is used to describe other resource properties such as resource
acquisition time, compute power, and data transfer time.
The fourth component of Bundle is the BundleAPI, a group of general interfaces that
support uniform interactions and operations on the underlying resources. Table 4.2 lists
the major interfaces supported by the BundleAPI. The design goal of the BundleAPI is
to be useful yet general enough for heterogeneous resources.
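A short sketch of how a scheduler might use these calls (only the call names come from
Table 4.2; the argument and return shapes are assumptions):

    def survey(bundle):
        # `bundle` stands for a client handle to the Bundle service.
        resources = bundle.get_list()
        for r in resources:
            cfg = bundle.get_config(r)      # nodes, queues, policy constraints
            load = bundle.get_workload(r)   # real-time node and job status
        # end-to-end bandwidth between two of the discovered resources
        return bundle.get_bandwidth(resources[0], resources[1])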
Figure 4.3: Visualization of a month-long workload of the TACC Stampede HPC cluster.
4.4 Experiments
We quantitatively and qualitatively analyze both static and dynamic resource infor-
mation along the following three aspects: (1) HPC cluster workload variation (subsec-
tion 4.4.1), (2) grid compute node performance heterogeneity (subsection 4.4.2), and
(3) wide-area network performance (subsection 4.4.3).
4.4.1 HPC Cluster Workload Characterization
For an HPC cluster that comprises homogeneous compute nodes, the main source of
dynamism is workload variation. For a large-scale HPC cluster concurrently shared by
many users, the workload variation is created by the aggregated usage of all the
applications. As a result, one must observe all the queues and jobs over long periods
to gain a good understanding of the HPC cluster. In other words, a snapshot of the
HPC cluster will not provide adequate information for drawing useful insights on how to
improve scheduling decisions, such as choosing the cluster on which to schedule an
application at a certain time such that the expected wait time is shorter.
Figure 4.3 demonstrates the workload characterization with one month of data col-
lected by a BundleAgent running on TACC's Stampede HPC cluster [82]. In 2014,
Stampede was the world's 7th largest supercomputer. The upper graph shows the wait
times of every job submitted during the month. Each job is represented by a solid circle,
with the radius representing the job's number of processors. The x coordinate of a job's
circle represents the job's submit time, and the y coordinate represents the job's wait
time. The lower graph displays workload intensity, measured by the combined node
hours requested by all the jobs submitted per hour.
The data reveals two phenomena: (1) the temporal correlation of wait times among
jobs submitted at adjacent times, and (2) the skew in job wait time distributions. Unlike
what we had originally expected, the job wait times show clear patterns instead of
randomness. The wait times form multiple peaks: job wait times quickly increase until
reaching the highest value of a peak, then gradually decrease until they return to normal
or hit the next peak. Combined with the workload intensity curves, we observe that the
wait-time peaks appear after workload bursts. Also, we observe many large jobs whose
wait times are significantly shorter compared to other jobs submitted at similar times.
This phenomenon indicates that the Stampede cluster prioritizes large jobs.
Based on the above observations, we draw the following conclusions regarding how to
make intelligent scheduling decisions on this HPC cluster. Firstly, to avoid excessively
long job wait times, workload intensity measured by the combined CPU hours requested
during the last several hours (e.g., 1 to 4 hours) is a good indicator: when that number
exceeds a certain threshold, job wait times will quickly increase. Secondly, because this
cluster prioritizes large jobs, when large jobs arrive in the queue, they will significantly
increase the overall job wait time expectations. Application schedulers can subscribe
with Bundle to receive event notifications when multiple large jobs are detected in the
queues, and use the events to de-prioritize this cluster when making scheduling decisions.
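A minimal sketch of the first indicator, assuming a job log of (submit time, nodes,
requested hours) records and a site-specific alarm threshold (all names are ours):

    def intensity(jobs, now_hr, window_hr=4):
        # jobs: iterable of (submit_hour, nodes, requested_hours) requests.
        # Returns node-hours requested in the last `window_hr` hours -- the
        # indicator suggested above; the threshold is site-specific.
        return sum(n * h for (t, n, h) in jobs
                   if now_hr - window_hr <= t <= now_hr)

    jobs = [(0.5, 64, 2.0), (3.2, 1024, 1.0), (3.9, 16, 8.0)]
    congested = intensity(jobs, now_hr=4.0) > 1500   # illustrative threshold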
4.4.2 Grid Nodes Performance Heterogeneity
For grids that comprise heterogeneous compute nodes, scheduling an application's tasks
on different groups of nodes will result in significant differences in the application's per-
formance. Bundle characterizes the performance of all the compute nodes within a grid,
so that the performance information can be provided to application schedulers to improve
scheduling quality. Specifically, modern grid job schedulers such as HTCondor [83]
allow users to specify the sites or nodes on which they want their applications to be
scheduled. The grid scheduler will guarantee the user requirements during the match-
making [84] process. However, most grid users cannot leverage this mechanism to select
suitable resources due to a lack of node information.
Bundle sends probes to all the compute nodes in OSG [81]. The probes collect con-
figuration information and run benchmarks on each compute node. Then we aggregate
the performance information by clustering the nodes into groups such that the nodes in
each group have similar performance. We chose the HINT benchmark [85] to measure
compute-node CPU/memory performance. The HINT benchmark measures multiple
aspects of a compute node's performance, including processor speed, precision, usable
memory size, and bandwidth. HINT suits our needs for two reasons. Firstly, the HINT
results can be summarized by a single number, "Net QUIPS" (Quality Improvement Per
Second). Bundle uses this number as a first-order measure of performance. Users can
query Bundle to return a list of nodes that produce at least a certain Net QUIPS value.
Secondly, HINT measures QUIPS as a function of time. The performance of multiple
nodes can be plotted in the same graph, so that nodes can be compared using their
entire QUIPS curves. For example, Figure 4.4 shows the QUIPS curves of 13 arbitrary
compute nodes in the UC site, one of the largest sites of OSG. The performance of each
node is reflected in a descending curve; the higher the measurements, the stronger the
performance. The graph shows that the 13 nodes can be clustered into three groups.
For example, given any one node, Bundle is able to find all the similar nodes with
comparable performance. Application schedulers can use this grouping information
to schedule tasks to a cluster of similar nodes with known performance. Predictable
or known performance is key to the success of many quality-of-service (QoS)-aware grid
schedulers [86, 87, 88].
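A crude sketch of the grouping step, using only the scalar Net QUIPS values rather
than the full curves (the names and tolerance are illustrative, not our implementation's):

    def group_by_quips(nodes, tolerance=0.15):
        # nodes: list of (name, net_quips) pairs. Nodes whose Net QUIPS
        # lie within `tolerance` (relative) of a group's smallest member
        # are placed in the same group.
        groups = []
        for name, q in sorted(nodes, key=lambda x: x[1]):
            if groups and q <= groups[-1][0][1] * (1 + tolerance):
                groups[-1].append((name, q))
            else:
                groups.append([(name, q)])
        return groups

    nodes = [("c1", 10.2), ("c2", 10.9), ("c3", 21.5), ("c4", 33.0)]
    print(group_by_quips(nodes))   # -> three groups: {c1, c2}, {c3}, {c4}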
4.4.3 Grid Network Performance Variation
Multi-site grids like OSG provide distributed facilities owned by different organizations.
For example, OSG today has more than a hundred active computational sites worldwide.
It is common practice for a user to routinely select several sites for regular use. A
[Figure 4.4 appears here: QUIPS curves (MQUIPS, HINT benchmark) versus time
(10 ns to 100 s) for 13 compute nodes (uct2-* and iut2-*) in the UC site, clustered
within the same domain based on CPU performance.]
Figure 4.4: Compute node performance clustering
major factor for scheduling to specific sites is cross-site network heterogeneity: some
sites are geographically closer to the user, or they have faster connection speeds to a
shared storage system. In OSG, an intermediate file system called Stash provides the
storage for large input/output data. First, the user uploads input data to Stash. Next,
the data is transferred to the compute node once a task is scheduled, and results are
sent back to Stash. In this 2-hop scenario, the network bandwidth between Stash and
each compute node is a determining factor for the application's performance.
In this experiment, we measure the network bandwidth between Stash and compute
nodes in OSG. As mentioned above, BundleAgent sends probes to every compute node.
The probes measure network bandwidth using iPerf [89]. Figure 4.5 shows the distri-
butions of network bandwidths across 9 different sites. The average bandwidths of these
sites range from 3 Mbit/s to 957 Mbit/s, a difference of more than 300x. Our results
also show large network performance variation between different nodes of the same site.
Bundle measures both cross-site and intra-site network heterogeneity. This information
is key to the performance of inter-domain grid schedulers such as [89].
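A minimal sketch of the per-site aggregation, assuming raw iPerf measurements arrive
as (site, node, bandwidth) records (names are illustrative):

    from statistics import mean

    def site_bandwidth(probes):
        # probes: list of (site, node, mbit_per_s) iPerf measurements.
        per_site = {}
        for site, node, bw in probes:
            per_site.setdefault(site, []).append(bw)
        # Mean per site, plus min/max to expose intra-site variation too.
        return {s: (mean(v), min(v), max(v)) for s, v in per_site.items()}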
4.5 Related Work
Scheduling distributed applications to multiple HPC resources is a well-known research
problem. For example, the I-Way [90], Legion [91], Globus [92], and HTCondor [93]
[Figure 4.5 appears here: distributions of network bandwidth (Mbit/s) between
OSG-Connect Stash and nodes of nine OSG sites (AGLT2, BNL-ATLAS, Crane, MWT2,
NWICG_NDCMS, UCD, UCSDT2, UTA_SWT2, Cinvestav).]
Figure 4.5: Network bandwidth between OSG-Connect Stash and nodes of different
OSG sites
frameworks integrated existing tools to run distributed applications on multiple re-
sources. The abstraction and service presented in this chapter differ from the resource
representation layers in these frameworks by assuming multiple resources with dynamic
availability and diverse capabilities.
The Bundle service is built upon related work. The service leverages some of the
work done in information collection [73], resource discovery [94, 79], and resource char-
acterization as it relates to job wait time prediction [22] and its difficulties [21]. For
the job wait time characterization, our work takes an alternative approach. Instead
of trying to predict the exact wait time, our approach predicts high-level trends using
temporal and workload correlations.
4.6 Conclusion
Elastically scheduling a distributed application requires information about the applica-
tion requirements, and resource availability and capabilities. This information is used
to choose a suitable set of resources on which to run the application executable(s) and
a suitable scheduling of the application tasks. When considering multiple resources and
distributed applications, bringing together application- and resource-level information
requires specific abstractions that have to uniformly and consistently describe the core
properties of distributed applications, those of computing resources, and those of the
execution of the former on the latter.
The Bundle abstraction bridges applications and diverse resources via uniform resource
characterizations. Despite the diverse interfaces and platforms, the core properties of
distributed computing resources, spanning compute, storage, and network, are used to
uniformly and consistently describe diverse resources. Bundle provides simple APIs
that are valid across different resource platforms.
We implemented the abstraction as a dedicated service and integrated the service
with other components of the AIMES middleware. The Bundle service is deployed on
10+ heterogeneous platforms to capture the dynamism of distributed resources. We ran
multiple experiments to evaluate the Bundle service from three different aspects: the
ability to characterize workload variations, node performance heterogeneity, and network
variations. We use the data to draw insights on how to improve scheduling decisions in
dynamic and heterogeneous HPC resources.
Chapter 5
Conclusion
HPC clusters offer large-scale computational resources to scientific applications that
solve large-scale problems. Very few scientific applications run on dedicated resources.
Instead, queue-based batch systems are devised to manage how expensive resources are
shared among users.
Under the fundamental assumption that every application should achieve the highest
performance, a batch scheduler exclusively assigns a fixed amount of resources to each
application for its entire runtime. Although this policy maximizes resource efficiency, it
leads to long queue waiting times for applications. It also creates resource fragmentation,
which lowers system utilization, wasting money and energy. These problems become
more critical when many large parallel applications wait for resource allocations
concurrently.
Different from traditional batch jobs, a new class of data-driven scientific applica-
tions requires on-demand resources. To support these needs, HPC facilities reserve dedi-
cated clusters for on-demand requests. Because of the highly bursty and low-on-average
workload pattern, the on-demand clusters suffer from low utilization.
To mitigate these three problems, this thesis adopts and adapts the elastic schedul-
ing strategy, which is commonly practiced in Cloud Computing but rarely in HPC.
We explore different trade-offs between performance, utilization, and response time, and
contribute new techniques to HPC resource scheduling.
5.1 Research Contributions
We contribute by devising new approaches that leverage existing, widely used tech-
nologies. Our new approaches focus on services that interact with both users and
existing resource management systems (RMS). The services are implemented in nonin-
vasive ways, meaning that they do not seek to change the user's interface to the RMS.
Our new approaches employ new heuristic and predictive algorithms. Our algorithms
provide tunable knobs, which make them flexible across different workloads and user
preferences.
5.1.1 Elasticity for Parallel Batch System
Based on the insights from our long-term HPC resource characterizations, we found
that rigid job scheduling creates long wait times and undermines utilization. Inspired by
the Cloud's elastic scheduling, we developed a technique called Elastic Job Bundling (EJB)
to break the rigid coupling between application and job. EJB is a user-level service
functioning as a proxy between the user and an existing batch system. EJB receives
original job requests and transforms large job requests into multiple smaller subjobs.
Using smaller subjobs, EJB dynamically requests resources based on immediate
availability, and uses the available resources to run large parallel applications with
downgraded performance. EJB trades off application performance for lower response
time and higher utilization.
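Chapter 2 gives the full algorithm; the following fragment only sketches the core split,
with assumed names:

    def split_request(total_nodes, free_now):
        # Illustrative EJB-style split: the first subjob takes what is
        # immediately available via backfill; the remainder is requested
        # later as one or more additional subjobs.
        if free_now <= 0:
            return [total_nodes]    # nothing free: fall back to one job
        first = min(total_nodes, free_now)
        rest = total_nodes - first
        return [first] + ([rest] if rest else [])

    print(split_request(256, 96))   # -> [96, 160]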
We proposed and formally defined the concept of an elastic job. We devised an
approach to run a tightly-coupled parallel application on multiple elastic jobs. This
approach leverages existing techniques, including processor over-subscription and process
migration. Furthermore, we proposed relying on the existing batch scheduler's backfilling
mechanism to acquire immediately available resources, such that our approach can
control the 'shape' and the start time of subjobs.
We designed the elastic scheduling algorithm, a heuristic event-driven scheduling
algorithm triggered by periodic free-resource detection and subjob callback functions.
We proposed the architecture of the EJB service software and ran experiments to
validate our over-subscription and migration models.
Finally, we ran simulations with production traces to evaluate our approach. Simu-
lation results show that EJB significantly reduces large parallel jobs' turnaround time
without sacrificing that of the smaller jobs. At the same time, EJB reduces system
fragmentation, thus improving utilization under heavy workloads.
5.1.2 Elasticity for On-demand and Batch Hybrid System
Motivated by real-world use case scenarios in national labs, we developed a technique
to jointly satisfy on-demand request and batch job performance. Specifically, we intro-
duced a service called Balancer between existing schedulers and underlying resources.
Similar to our design goal in EJB, the Balancer is also non-invasive, in that it does not
require changes to existing resource scheduling systems, nor does it require users to
adopt new interfaces.
We described an architecture and implementation for dynamic non-invasive resource
reassignment between two systems: a system providing resources on-demand and a sys-
tem providing resources based on availability, that balances their respective objectives
in terms of the number of satisfied on-demand requests and utilization. We proposed
three algorithms for balancing resources in this context: the Basic Algorithm providing
a baseline of our systems, the Hint Algorithm that models the behavior where experi-
mental users can register the upcoming need for on-demand cycles, and the Predictive
algorithm for cases where such advance notice is not possible.
Based on a real-life scenario representing two years worth of on-demand and batch
workloads at Argonne National Laboratory, we demonstrated that by using our model on
existing resources we could reduce the current investment in on-demand infrastructure
by 82%, while at the same time improving the mean batch wait time almost by an
order of magnitude (8x). Our large-scale experiments, driven by synthetic traces derived
from the production trace, show that the Balancer significantly reduces the need for
dedicated node reserves for on-demand requests of up to 10% of cluster capacity, and
simultaneously improves batch performance and overall utilization.
5.1.3 Elastic Resource Abstraction for Heterogeneous Resources
Managing dynamism when executing a scientific application on multiple heterogeneous
HPC resources is difficult due to the complexity of choosing resources and distributing
the application’s tasks over them, especially when performance is a concern.
This thesis offers three main contributions to the issue of characterizing multiple
heterogeneous and dynamic HPC resources: (1) abstractions to represent resource con-
figuration, capabilities, and availability, which hide heterogeneity and complexity; (2)
the implementation of these abstractions in middleware designed to facilitate executions
of scientific applications; (3) experiments demonstrating how data-driven analysis can
improve resource characterizations.
5.2 Future Research Directions
5.2.1 Improving Elasticity by Using Container Based Solutions
In recent years, the container technique has drawn much attention from the HPC com-
munity [95, 96]. Firstly, compared to traditional VM-based virtualization, containers
are lightweight: the start-up time for a container is sub-second, compared to minutes
for a VM. Secondly, containers have built-in environment and version control, which
greatly simplifies software deployment; they also allow root access within a container.
Thirdly, containers support fine-grained resource sharing using techniques such as
cgroups to control limits on resource usage, including CPU, memory, disk I/O, and
network. Although the isolation provided by containers is sometimes inadequate to meet
rigorous industry requirements, the security level they provide is mostly sufficient in
academic settings.
In the context of elastic scheduling, containers will improve existing solutions as well
as enable new use cases. Our Balancer approach can leverage containers to improve
usability. One method is to schedule containers on VMs: the Balancer negotiates
resources between batch and on-demand schedulers. In our current implementation,
OpenStack is used as the on-demand scheduler, and each on-demand request is scheduled
on a VM. Compared to containers, a VM is a coarse-grained unit of resource scheduling.
Containers bridge the gap between VMs and tasks: each task can be wrapped in a
container, and multiple containers can be quickly launched on a VM. In this case, the
user does not need to manage the task runtime environment on a VM.
Furthermore, we can allow containers to run directly on physical nodes. This use
case enables more fine-grained resource sharing. For example, when resources are idle in
an HPC cluster, we can quickly run HTC tasks to utilize the idle resources. This type of
cycle stealing does not need to go through the Balancer, since these HTC workloads
can be killed when a batch job is scheduled. This use case enables batch/on-demand/HTC
three-way resource sharing. On-demand is given high priority: on-demand requests can
reclaim idle nodes from the batch scheduler. At the other end of the spectrum, the HTC
workload has the lowest priority; it fills transient idleness in the cluster, further improving
overall utilization.
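As an illustration of the cgroup-backed limits this relies on, the following sketch uses
the Docker SDK for Python; the image name and limit values are arbitrary choices for
the example, not part of our implementation:

    import docker  # assumes the Docker SDK for Python (docker-py)

    client = docker.from_env()
    # Cgroup-backed limits: cap an HTC filler task at 1 CPU and 2 GiB so
    # it can soak up transient idle capacity without disturbing batch jobs.
    client.containers.run(
        "python:3.11-slim",
        ["python", "-c", "print('htc filler task')"],
        mem_limit="2g",            # memory cgroup limit
        nano_cpus=1_000_000_000,   # 1 CPU, in units of 1e-9 CPUs
        detach=True,
    )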
5.2.2 Unified Batch and On-demand Scheduler
In recent years, there has been a trend toward converging HPC and Cloud [97, 98, 99].
Most works focus on hybrid usage of HPC and Cloud. However, we envision a unified
HPC scheduler that supports different types of SLAs. Counterpart schedulers in data-
centers, such as Borg [72], support both time-critical and batch workloads. But as we
have discussed in Section 3.4, they treat batch as a second-class citizen, meaning that
batch jobs can be killed, which is unacceptable in HPC. Besides, Borg is not open source.
The HPC world needs its own all-in-one scheduler that can support both traditional
batch jobs and on-demand jobs.
5.2.3 Node-level Resource Partitioning and Sharing
At the single-node level, the broad adoption and ubiquitous usage of multi-core architec-
tures has led a single server to be viewed more and more like a cluster, and has required
more careful engineering to unleash the performance of a single node. Though general-
purpose CPUs have yet to attain thousands of cores per chip as [100] had predicted,
resource contention triggered by node-level multitasking has urged researchers to examine
smarter ways to co-locate multiple applications on the same node [101]. For a single
application, work like [102] has shown how to optimize application end-to-end performance
by functionally partitioning multi-cores. Moreover, the fusion of CPU and GPU [103] has
led many researchers to explore better approaches for coordinating CPUs and GPUs to
work together efficiently.
References
[1] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang
Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang,
and Xiaofei Chen. 18.9-pflops nonlinear earthquake simulation on sunway taihu-
light: Enabling depiction of 18-hz and 8-meter scenarios. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, SC ’17, pages 2:1–2:12, New York, NY, USA, 2017. ACM.
[2] Business Insider. This vertical farm in newark, new jersey, could be the key to
solving some of agriculture’s biggest problems, 2018.
[3] Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally,
Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, et al.
Exascale computing study: Technology challenges in achieving exascale systems.
Defense Advanced Research Projects Agency Information Processing Techniques
Office (DARPA IPTO), Tech. Rep, 15, 2008.
[4] TOP500.org. First us exascale supercomputer now on track for 2021, 2016.
[5] Nikolay A Simakov, Joseph P White, Robert L DeLeon, Steven M Gallo,
Matthew D Jones, Jeffrey T Palmer, Benjamin Plessinger, and Thomas R Furlani.
A workload analysis of nsf’s innovative hpc resources using xdmod. arXiv preprint
arXiv:1801.04306, 2018.
[6] Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in cloud
computing: What it is, and what it is not. In Proceedings of the 10th International
Conference on Autonomic Computing (ICAC 13), pages 23–27, San Jose, CA,
2013. USENIX.
[7] F. Liu and J. B. Weissman. Elastic job bundling: an adaptive resource request
strategy for large-scale parallel applications. In SC15: International Conference
for High Performance Computing, Networking, Storage and Analysis, pages 1–12,
Nov 2015.
[8] Feng Liu, Kate Keahey, Pierre Riteau, and Jon Weissman. Dynamically nego-
tiating capacity between on-demand and batch clusters. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage,
and Analysis, SC ’18, pages 38:1–38:11, Piscataway, NJ, USA, 2018. IEEE Press.
[9] M. Turilli, F. Liu, Z. Zhang, A. Merzky, M. Wilde, J. Weissman, D. S. Katz, and
S. Jha. Integrating abstractions to enhance the execution of distributed applica-
tions. In 2016 IEEE International Parallel and Distributed Processing Symposium
(IPDPS), pages 953–962, May 2016.
[10] Shantenu Jha, Murray Cole, Daniel S. Katz, Manish Parashar, Omer Rana, and
Jon Weissman. Distributed computing practice for large-scale science and engi-
neering applications. 25, 08 2013.
[11] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis,
Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John
Shalf, Samuel Webb Williams, et al. The landscape of parallel computing research:
A view from berkeley. Technical report, Technical Report UCB/EECS-2006-183,
EECS Department, University of California, Berkeley, 2006.
[12] Walfredo Cirne and Francine Berman. Using moldability to improve the perfor-
mance of supercomputer jobs. Journal of Parallel and Distributed Computing,
62(10):1571–1601, 2002.
[13] Walfredo Cirne and Francine Berman. When the herd is smart: aggregate be-
havior in the selection of job request. Parallel and Distributed Systems, IEEE
Transactions on, 14(2):181–192, 2003.
[14] Gladys Utrera, Siham Tabik, Julita Corbalan, and Jesus Labarta. A job scheduling
approach for multi-core clusters based on virtual malleability. In Euro-Par 2012
Parallel Processing, pages 191–203. Springer, 2012.
[15] Allen B Downey. Predicting queue times on space-sharing parallel computers.
In Parallel Processing Symposium, 1997. Proceedings., 11th International, pages
209–218. IEEE, 1997.
[16] Allen B Downey. Using queue time predictions for processor allocation. In Job
Scheduling Strategies for Parallel Processing, pages 35–57. Springer, 1997.
[17] Warren Smith, Valerie Taylor, and Ian Foster. Using run-time predictions to
estimate queue wait times and improve scheduler performance. In Job Scheduling
Strategies for Parallel Processing, pages 202–219. Springer, 1999.
[18] Walfredo Cirne and Francine Berman. A model for moldable supercomputer
jobs. In Parallel and Distributed Processing Symposium., Proceedings 15th In-
ternational, pages 8–pp. IEEE, 2001.
[19] Rich Wolski. Experiences with predicting resource performance on-line in com-
putational grid settings. ACM SIGMETRICS Performance Evaluation Review,
30(4):41–49, 2003.
[20] Hui Li, David Groep, Jeffrey Templon, and Lex Wolters. Predicting job start
times on clusters. In Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE
International Symposium on, pages 301–308. IEEE, 2004.
[21] Dan Tsafrir, Yoav Etsion, and Dror G Feitelson. Backfilling using system-
generated predictions rather than user runtime estimates. Parallel and Distributed
Systems, IEEE Transactions on, 18(6):789–803, 2007.
[22] Daniel Nurmi, John Brevik, and Rich Wolski. Qbets: queue bounds estimation
from time series. In Job Scheduling Strategies for Parallel Processing, pages 76–
101. Springer, 2008.
[23] Garrick Staples. Torque resource manager. In Proceedings of the 2006 ACM/IEEE
conference on Supercomputing, page 8. ACM, 2006.
[24] Andy B. Yoo, Morris A. Jette, and Mark Grondona. Slurm: Simple linux
utility for resource management. In Dror Feitelson, Larry Rudolph, and Uwe
Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pages
44–60, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
[25] Ahuva W. Mu’alem and Dror G. Feitelson. Utilization, predictability, workloads,
and user runtime estimates in scheduling the ibm sp2 with backfilling. Parallel
and Distributed Systems, IEEE Transactions on, 12(6):529–543, 2001.
[26] Feng Liu and Jon Weissman. Elastic job bundling: An adaptive resource request
strategy for large-scale parallel applications. Technical report, TR15-006, Depart-
ment of Computer Science and Engineering, University of Minnesota, 2015.
[27] Andre Luckow, Mark Santcroos, Ole Weidner, Andre Merzky, Pradeep Mantha,
and Shantenu Jha. P*: A Model of Pilot-Abstractions. In 8th IEEE International
Conference on e-Science 2012, 2012.
[28] David H Bailey, Eric Barszcz, John T Barton, David S Browning, Russell L Carter,
Leonardo Dagum, Rod A Fatoohi, Paul O Frederickson, Thomas A Lasinski, Rob S
Schreiber, et al. The nas parallel benchmarks. International Journal of High
Performance Computing Applications, 5(3):63–73, 1991.
[29] Futuregrid. futuregrid.org.
[30] Dominique LaSalle and George Karypis. Mpi for big data: New tricks for an old
dog. Parallel Computing, 40(10):754–767, 2014.
[31] Jason Ansel, Kapil Aryay, and Gene Coopermany. Dmtcp: Transparent check-
pointing for cluster computations and the desktop. In Parallel & Distributed
Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–12.
IEEE, 2009.
[32] Julien Adam, Jean-Baptiste Besnard, Allen D. Malony, Sameer Shende, Marc
Perache, Patrick Carribault, and Julien Jaeger. Transparent high-speed network
checkpoint/restart in mpi. In Proceedings of the 25th European MPI Users’ Group
Meeting, EuroMPI’18, pages 12:1–12:11, New York, NY, USA, 2018. ACM.
[33] Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
[34] pyss - the python scheduler simulator. https://code.google.com/p/pyss/.
[35] David J Lilja. Measuring computer performance: a practitioner’s guide. Cam-
bridge University Press, 2000.
[36] Dror G Feitelson. Metric and workload effects on computer systems evaluation.
Computer, 36(9):18–25, 2003.
[37] Adam Wierman. Revisiting the performance of large jobs in the M/GI/1 queue.
In Proceedings of the Forty-Fifth Annual Allerton Conference On Communication,
Control, and Computing, pages 607–614, 2007.
[38] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold
Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. The case for tiny tasks in
compute clusters.
[39] Jon B Weissman, Lakshman Rao Abburi, and Darin England. Integrated schedul-
ing: the best of both worlds. Journal of Parallel and Distributed Computing,
63(6):649–668, 2003.
[40] Rajesh Sudarsan and Calvin J Ribbens. Reshape: A framework for dynamic
resizing and scheduling of homogeneous applications in a parallel environment.
In Parallel Processing, 2007. ICPP 2007. International Conference on, page 44.
IEEE, 2007.
[41] Rajesh Sudarsan and Calvin J Ribbens. Scheduling resizable parallel applications.
In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Sym-
posium on, pages 1–10. IEEE, 2009.
[42] https://bitbucket.org/francis_liu/pyss.
[43] P. Marshall, K. Keahey, and T. Freeman. Improving utilization of infrastructure
clouds. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and
Grid Computing, pages 205–214, May 2011.
[44] David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer.
Seti@home: An experiment in public-resource computing. Commun. ACM,
45(11):56–61, November 2002.
[45] O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir. Decon-
structing amazon ec2 spot instance pricing. In 2011 IEEE Third International
Conference on Cloud Computing Technology and Science, pages 304–311, Nov
2011.
[46] Garrick Staples. Torque resource manager. In Proceedings of the 2006 ACM/IEEE
Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM.
[47] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The
portable batch scheduler and the maui scheduler on linux clusters. In Proceedings
of the 4th Annual Linux Showcase & Conference - Volume 4, ALS’00, pages 27–27,
Berkeley, CA, USA, 2000. USENIX Association.
[48] Irfan Habib. Virtualization with kvm. Linux J., 2008(166), February 2008.
[49] Kate Keahey, Pierre Riteau, Dan Stanzione, Tim Cockerill, Joe Manbretti, Paul
Rad, and Ruth. Paul. Chameleon: a Scalable Production Testbed for Computer
Science Research. In Contemporary High Performance Computing vol. 3. Ed. Jeff
Vetter. Springer, 2017.
[50] John R. Lange, Kevin Pedretti, Peter Dinda, Patrick G. Bridges, Chang Bae,
Philip Soltero, and Alexander Merritt. Minimal-overhead virtualization of a large
scale supercomputer. In Proceedings of the 7th ACM SIGPLAN/SIGOPS Interna-
tional Conference on Virtual Execution Environments, VEE ’11, pages 169–180,
New York, NY, USA, 2011. ACM.
[51] Bryce Allen, Rachana Ananthakrishnan, Kyle Chard, Ian Foster, Ravi Madduri,
Jim Pruyne, Stephen Rosen, and Steve Tuecke. Globus: A case study in software
as a service for scientists. In Proceedings of the 8th Workshop on Scientific Cloud
Computing, ScienceCloud ’17, pages 25–32, New York, NY, USA, 2017. ACM.
[52] Lavanya Ramakrishnan, Piotr T. Zbiegel, Scott Campbell, Rick Bradshaw,
Richard Shane Canon, Susan Coghlan, Iwona Sakrejda, Narayan Desai, Tina
Declerck, and Anping Liu. Magellan: Experiences from a science cloud. In
Proceedings of the 2Nd International Workshop on Scientific Cloud Computing,
ScienceCloud ’11, pages 49–58, New York, NY, USA, 2011. ACM.
[53] Qiming He, Shujia Zhou, Ben Kobler, Dan Duffy, and Tom McGlynn. Case study
for running hpc applications in public clouds. In Proceedings of the 19th ACM
International Symposium on High Performance Distributed Computing, HPDC
’10, pages 395–401, New York, NY, USA, 2010. ACM.
[54] I. Sadooghi, J. H. Martin, T. Li, K. Brandstatter, K. Maheshwari, T. P. P. de Lac-
erda Ruivo, G. Garzoglio, S. Timm, Y. Zhao, and I. Raicu. Understanding the
performance and potential of cloud computing for scientific applications. IEEE
Transactions on Cloud Computing, 5(2):358–371, April 2017.
[55] F. J. Clemente-Castello, B. Nicolae, R. Mayo, and J. C. Fernandez. Performance
model of mapreduce iterative applications for hybrid cloud bursting. IEEE Trans-
actions on Parallel and Distributed Systems, PP(99):1–1, 2018.
[56] M. Parashar, M. AbdelBaky, I. Rodero, and A. Devarakonda. Cloud paradigms
and practices for computational and data-enabled science and engineering. Com-
puting in Science Engineering, 15(4):10–18, July 2013.
[57] G. Fox and S. Jha. Conceptualizing a computing platform for science beyond 2020:
To cloudify hpc, or hpcify clouds? In 2017 IEEE 10th International Conference
on Cloud Computing (CLOUD), pages 808–810, June 2017.
[58] S. Niu, J. Zhai, X. Ma, X. Tang, and W. Chen. Cost-effective cloud hpc resource
provisioning by building semi-elastic virtual clusters. In 2013 SC - International
Conference for High Performance Computing, Networking, Storage and Analysis
(SC), pages 1–12, Nov 2013.
[59] A. Marathe, R. Harris, D. K. Lowenthal, B. R. de Supinski, B. Rountree, and
M. Schulz. Exploiting redundancy and application scalability for cost-effective,
time-constrained execution of hpc applications on amazon ec2. IEEE Transactions
on Parallel and Distributed Systems, 27(9):2574–2588, Sept 2016.
[60] Ishai Menache, Ohad Shamir, and Navendu Jain. On-demand, spot, or both: Dy-
namic resource allocation for executing batch jobs in the cloud. In 11th Interna-
tional Conference on Autonomic Computing (ICAC 14), pages 177–187, Philadel-
phia, PA, 2014. USENIX Association.
[61] Y. Gong, B. He, and A. C. Zhou. Monetary cost optimizations for mpi-based hpc
applications on amazon clouds: checkpoints and replicated execution. In SC15:
International Conference for High Performance Computing, Networking, Storage
and Analysis, pages 1–12, Nov 2015.
[62] R. Chard, K. Chard, K. Bubendorfer, L. Lacinski, R. Madduri, and I. Foster.
Cost-aware cloud provisioning. In 2015 IEEE 11th International Conference on
e-Science, pages 136–144, Aug 2015.
[63] Rich Wolski, John Brevik, Ryan Chard, and Kyle Chard. Probabilistic guar-
antees of execution duration for amazon spot instances. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, SC ’17, pages 18:1–18:11, New York, NY, USA, 2017. ACM.
[64] Simon Delamare, Gilles Fedak, Derrick Kondo, and Oleg Lodygensky. Spequlos: A
qos service for bot applications using best effort distributed computing infrastruc-
tures. In Proceedings of the 21st International Symposium on High-Performance
Parallel and Distributed Computing, HPDC ’12, pages 173–186, New York, NY,
USA, 2012. ACM.
[65] Marcus Carvalho, Walfredo Cirne, Francisco Brasileiro, and John Wilkes. Long-
term SLOs for reclaimed cloud computing resources. In ACM Symposium on
Cloud Computing (SoCC), pages 20:1–20:13, Seattle, WA, USA, 2014.
[66] Supreeth Shastri, Amr Rizk, and David Irwin. Transient guarantees: Maximizing
the value of idle cloud capacity. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, SC ’16, pages
85:1–85:11, Piscataway, NJ, USA, 2016. IEEE Press.
[67] Ying Yan, Yanjie Gao, Yang Chen, Zhongxin Guo, Bole Chen, and Thomas Mosci-
broda. Tr-spark: Transient computing for big data analytics. In Proceedings of the
ACM Symposium on Cloud Computing, SoCC ’16, pages 484–496. ACM, October
2016.
[68] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and
Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proceedings
of the 42nd Annual International Symposium on Computer Architecture,
ISCA ’15, pages 450–462, New York, NY, USA, 2015. ACM.
[69] Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur
Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo
Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. Morpheus: Towards
automated slos for enterprise clusters. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16), pages 117–134, Savannah, GA,
2016. USENIX Association.
[70] Ganesh Ananthanarayanan, Christopher Douglas, Raghu Ramakrishnan, Sriram
Rao, and Ion Stoica. True elasticity in multi-tenant data-intensive compute clus-
ters. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC
’12, pages 24:1–24:7, New York, NY, USA, 2012. ACM.
[71] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D.
Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-
grained resource sharing in the data center. In Proceedings of the 8th USENIX
Conference on Networked Systems Design and Implementation, NSDI’11, pages
295–308, Berkeley, CA, USA, 2011. USENIX Association.
[72] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric
Tune, and John Wilkes. Large-scale cluster management at google with borg. In
Proceedings of the Tenth European Conference on Computer Systems, EuroSys
’15, pages 18:1–18:17, New York, NY, USA, 2015. ACM.
[73] Michael Cardosa and Abhishek Chandra. Resource bundles: Using aggregation for
statistical wide-area resource discovery and allocation. In Proceedings of the 28th
International Conference on Distributed Computing Systems (ICDCS ’08), pages
760–768. IEEE, 2008.
[74] Ajaykrishna Raghavan, Abhishek Chandra, and Jon Weissman. Tiera: towards
flexible multi-tiered cloud storage instances. In Proceedings of the 15th Interna-
tional Middleware Conference, pages 1–12. ACM, 2014.
[75] Yogesh Simmhan and Lavanya Ramakrishnan. Comparison of resource platform
selection approaches for scientific workflows. In Proceedings of the 19th ACM In-
ternational Symposium on High Performance Distributed Computing, pages 445–
450. ACM, 2010.
[76] Sudharshan Vazhkudai, Jennifer M. Schopf, and Ian T. Foster. Predicting the
performance of wide area data transfers. In Proceedings of the 16th International
Parallel and Distributed Processing Symposium, IPDPS ’02, page 270, Washington,
DC, USA, 2002. IEEE Computer Society.
[77] Tevfik Kosar and Miron Livny. Stork: Making data placement a first class citizen
in the grid. In Proceedings of the 24th International Conference on Distributed
Computing Systems (ICDCS 2004), pages 342–349. IEEE, 2004.
[78] Chuang Liu and Ian Foster. A constraint language approach to matchmak-
ing. In Proceedings of the 14th International Workshop on Research Issues on
Data Engineering: Web Services for E-Commerce and E-Government Applica-
tions (RIDE’04), RIDE ’04, pages 7–14, Washington, DC, USA, 2004. IEEE
Computer Society.
[79] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat. Design and im-
plementation tradeoffs for wide-area resource discovery. In Proceedings of the
14th IEEE International Symposium on High Performance Distributed Computing
(HPDC-14), pages 113–124. IEEE, July 2005.
[80] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Ha-
zlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and
N. Wilkins-Diehr. Xsede: Accelerating scientific discovery. Computing in Sci-
ence & Engineering, 16(5):62–74, Sept.-Oct. 2014.
[81] Ruth Pordes, Don Petravick, Bill Kramer, Doug Olson, Miron Livny, Alain Roy,
Paul Avery, Kent Blackburn, Torre Wenaus, Frank Würthwein, Ian Foster, Rob
Gardner, Mike Wilde, Alan Blatecky, John McGee, and Rob Quick. The open
science grid. Journal of Physics: Conference Series, 78(1):012057, 2007.
[82] Stampede user guide. https://portal.tacc.utexas.edu/archives/stampede.
[83] B Bockelman, T Cartwright, J Frey, E M Fajardo, B Lin, M Selmeci, T Tannen-
baum, and M Zvada. Commissioning the htcondor-ce for the open science grid.
Journal of Physics: Conference Series, 664(6):062003, 2015.
[84] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Distributed
resource management for high throughput computing. In Proceedings of the 7th
IEEE International Symposium on High Performance Distributed Computing
(HPDC-7), page 140. IEEE, 1998.
[85] John L Gustafson and Quinn O Snell. Hint: A new way to measure computer
performance. In Proceedings of the Twenty-Eighth Hawaii International Conference
on System Sciences, volume 2, pages 392–401. IEEE, 1995.
[86] Ritu Garg and Awadhesh Kumar Singh. Adaptive workflow scheduling in grid
computing based on dynamic resource availability. Engineering Science and Tech-
nology, an International Journal, 18(2):256–269, 2015.
[87] Kousik Dasgupta, Brototi Mandal, Paramartha Dutta, Jyotsna Kumar Mandal,
and Santanu Dam. A genetic algorithm (ga) based load balancing strategy for
cloud computing. Procedia Technology, 10:340–347, 2013.
[88] Fatos Xhafa and Ajith Abraham. Computational models and heuristic methods
for grid scheduling problems. Future Generation Computer Systems, 26(4):608–621,
2010.
[89] Ajay Tirumala, Feng Qin, Jon Dugan, Jim Ferguson, and Kevin Gibbs. iperf:
Tcp/udp bandwidth measurement tool, January 2005.
[90] Ian Foster, Jonathan Geisler, Bill Nickless, Warren Smith, and Steven Tuecke.
Software infrastructure for the i-way high-performance distributed computing ex-
periment. In Proceedings of the 5th IEEE International Symposium on High
Performance Distributed Computing, pages 562–571. IEEE, 1996.
[91] Andrew S. Grimshaw, Wm. A. Wulf, and The Legion Team. The
legion vision of a worldwide virtual computer. Commun. ACM, 40(1):39–45, Jan-
uary 1997.
[92] Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the grid: En-
abling scalable virtual organizations. The International Journal of High Perfor-
mance Computing Applications, 15(3):200–222, 2001.
[93] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing
in practice: the condor experience. Concurrency and Computation: Practice and
Experience, 17(2-4):323–356, 2005.
[94] Chuang Liu and Ian Foster. A constraint language approach to grid resource
selection, January 2003.
[95] Charles Zheng and Douglas Thain. Integrating containers into workflows: A
case study using makeflow, work queue, and docker. In Proceedings of the 8th
International Workshop on Virtualization Technologies in Distributed Computing,
pages 31–38. ACM, 2015.
[96] Pankaj Saha, Angel Beltre, Piotr Uminski, and Madhusudhan Govindaraju. Eval-
uation of docker containers for scientific workloads in the cloud. In Proceedings
of the Practice and Experience on Advanced Research Computing, PEARC ’18,
pages 11:1–11:8, New York, NY, USA, 2018. ACM.
[97] Gabriel Mateescu, Wolfgang Gentzsch, and Calvin J Ribbens. Hybrid computing-
where hpc meets grid and cloud computing. Future Generation Computer Systems,
27(5):440–453, 2011.
[98] Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburuga-
muve. Big data, simulations and hpc convergence. In Tilmann Rabl, Raghunath
Nambiar, Chaitanya Baru, Milind Bhandarkar, Meikel Poess, and Saumyadipta
Pyne, editors, Big Data Benchmarking, pages 3–17, Cham, 2016. Springer Inter-
national Publishing.
[99] Thamarai Selvi Somasundaram and Kannan Govindarajan. Cloudrb: A frame-
work for scheduling and managing high-performance computing (hpc) applications
in science cloud. Future Generation Computer Systems, 34:47–65, 2014.
[100] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. Computer,
41(7):33–38, July 2008.
[101] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and qos-
aware cluster management. In Proceedings of the 19th International Conference
on Architectural Support for Programming Languages and Operating Systems, AS-
PLOS ’14, pages 127–144, New York, NY, USA, 2014. ACM.
[102] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng, X. Ma, Y. Kim, C. Engelmann,
and G. Shipman. Functional partitioning to optimize end-to-end performance
on many-core architectures. In 2010 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis, pages 1–12,
Nov 2010.
[103] Sparsh Mittal and Jeffrey S. Vetter. A survey of cpu-gpu heterogeneous computing
techniques. ACM Comput. Surv., 47(4):69:1–69:35, July 2015.