Elastic Scheduling in HPC Resource Management Systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Feng Liu
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Professor Jon Weissman
December 2018
© Feng Liu 2018
ALL RIGHTS RESERVED
Acknowledgements
I would like to express my utmost gratitude to my advisor, Professor Jon Weissman.
During the past 6 years and 4 months, there were countless occasions when I couldn't
deliver results on time. He showed tremendous patience and guided me through the
difficulties. He gave me valuable advice and hints and tolerated my mistakes. He spent
enormous effort advising me. I appreciate all of our discussions, separate meetings,
emails, and collaborative writings.
I would like to express my great gratitude and appreciation to Dr. Kate Keahey from
Argonne National Lab. She directed me with her incredible insights. She insists on
high standards of work, emphasizes details, and keeps the big picture in view. I have
gained a great deal of experience working under her direction, which will help me
become much more productive in my future career.
The work presented in Chapter 2 and Chapter 4 was sponsored by the Department
of Energy under the AIMES (Abstractions and Integrated Middleware for Extreme-Scale
Science) project (DE-FG02-12ER26115, DE-SC0008617, DE-SC0008651). Thanks to
the participants of the AIMES project: Shantenu Jha, Matteo Turilli, Andre Merzky,
Daniel S. Katz, Michael Wilde, Zhao Zhang, and Yadu Nand Babuji.
I would like to thank Dr. Pierre Riteau for the initial implementation of the Balancer
service in Chapter 3. The work presented in that chapter was supported by the U.S.
Department of Energy under DOE-LAB-14-1003 and by the NSF under award
NSF-1443080. Results presented in the chapter were obtained using the Chameleon
testbed supported by the National Science Foundation.
Finally, I would like to thank my parents. They always support me and encourage
me to pursue my goals.
Dedication
To my parents.
Abstract
High Performance Computing (HPC) aggregates the power of computer clusters to
tackle large problems, empowering science. HPC resource scheduling today faces
multiple challenges. Firstly, most HPC clusters are managed by queue-based batch
systems, whose schedulers maximize application run-time efficiency while sacrificing
response time and sometimes utilization. Secondly, HPC clusters reserved for on-demand
data analysis operate at low utilization. Thirdly, multiple heterogeneous and dynamic
HPC resources greatly complicate resource scheduling for distributed applications.
To solve these problems, this thesis presents several elastic scheduling approaches.
Elasticity means the ability to dynamically allocate resources based on workloads.
Elasticity is commonly supported in the cloud but lacking in HPC. Our approaches
include new scheduling algorithms and implementations of those algorithms as services.
Our services leverage existing techniques and are non-invasive, meaning that they
minimize changes to user interfaces.
We address the first problem using Elastic Job Bundling (EJB), a technique that
dynamically transforms a large batch job into multiple smaller subjobs so that the
subjobs will start early on immediately available resources. Simulation results show
that our approach reduces application mean turnaround time by up to 48%, reduces
resource fragmentation by up to 59%, and reduces priority inversions by 20%.
We address the second problem using Balancer, a technique that combines and
dynamically moves nodes between an on-demand cluster and a batch cluster. Our
results show that for a real-life scenario, our approach reduces the current investment
in the on-demand cluster by 82% while at the same time improving the mean batch wait
time by 8x.
We address the third problem using Bundle, a resource abstraction that represents
heterogeneous resource capacities and capabilities in a uniform way. We implement
Bundle as a service on 10+ heterogeneous HPC resources and use it to draw insights
about those resources.
Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction
1.1 Resource Scheduling in HPC Clusters
1.2 Challenges in HPC Resource Scheduling
1.3 Elasticity in HPC
1.4 Summary of Research Contributions
1.4.1 Adaptive Resource Request
1.4.2 Dynamic Resource Negotiation
1.4.3 Dynamic Resource Bundle
1.5 Outline

2 Elastic Job Bundling: An Adaptive Resource Request Strategy for Large-Scale Parallel Applications
2.1 Introduction
2.2 Elastic Job Bundling (EJB)
2.2.1 The Formation of Elastic Jobs
2.2.2 Running Applications on Elastic Jobs
2.2.3 Taming Unpredictability
2.2.4 Assumptions and Limitations
2.3 EJB Scheduling Algorithm
2.3.1 TargetJobArrivalEvent Handler
2.3.2 IdleJobSlotsAvailableEvent Handler
2.3.3 SubjobStartEvent Handler
2.3.4 TargetAppCompleteEvent Handler
2.4 Implementation
2.5 Trace-driven Simulation
2.6 Evaluation
2.6.1 Performance Metrics
2.6.2 Improving Elastic Job Turnaround Time
2.6.3 Migration Behavior
2.6.4 Multiple Elastic Jobs
2.7 Discussion
2.8 Related Work
2.8.1 Moldable Jobs
2.8.2 Malleable Jobs
2.9 Conclusion

3 Dynamically Negotiating Capacity Between On-demand and Batch Clusters
3.1 Introduction
3.2 Approach
3.2.1 Leases
3.2.2 Architecture
3.2.3 Algorithms
3.2.4 Implementation
3.3 Experimental Evaluation
3.3.1 Evaluating a Real-Life Scenario
3.3.2 Evaluating Balancer Algorithms
3.3.3 Elasticity Analysis
3.4 Related Work
3.5 Conclusion

4 The Bundle Service for Elastic Resource Scheduling in HPC Environments
4.1 Introduction
4.2 The Bundle Abstraction
4.3 Implementation
4.4 Experiments
4.4.1 HPC Cluster Workload Characterization
4.4.2 Grid Nodes Performance Heterogeneity
4.4.3 Grid Network Performance Variation
4.5 Related Work
4.6 Conclusion

5 Conclusion
5.1 Research Contributions
5.1.1 Elasticity for Parallel Batch System
5.1.2 Elasticity for On-demand and Batch Hybrid System
5.1.3 Elastic Resource Abstraction for Heterogeneous Resources
5.2 Future Research Directions
5.2.1 Improving Elasticity by Using Container Based Solutions
5.2.2 Unified Batch and On-demand Scheduler
5.2.3 Node-level Resource Partitioning and Sharing

References
List of Tables

2.1 Upper bounds on processors (Pmax) and runtime (Rmax) of the two types of immediately backfillable job slots.
2.2 Traces used in our simulation.
2.3 Increase ('+') or decrease ('-') percentages of mean wait, run, and turnaround time of elastic jobs compared to target jobs' baseline results.
2.4 Summary of migration-related statistics.
2.5 Before-and-after comparison: confidence intervals are calculated at the 95% confidence level.
2.6 Fragmentation: np is the average number of idle processors, % is the percentage of idle processors in the cluster.
3.1 Experimental results for the most challenging week: 24,177 batch jobs and 141 on-demand requests are submitted in each experiment. Wait times are measured in minutes and reserve values are given in nodes. For the dynamic case, the on-demand and batch utilization refer to the portions of utilization coming from on-demand and batch requests respectively.
4.1 BundleAgent supported platforms.
4.2 BundleAPI.
List of Figures

1.1 The process of (1) user submits resource request, (2) job scheduling, and (3) application execution in an HPC cluster.
1.2 A real-world week-long bursty workload from the Argonne National Lab's APS cluster.
2.1 Overview of Elastic Job Bundling.
2.2 Illustration of elastic jobs.
2.3 Mapping a parallel application's processes to an elastic job, including progress measurement.
2.4 Finding immediately usable resources under EASY backfilling.
2.5 Three cases for subjob submission in a waiting elastic job.
2.6 EJB system architecture.
2.7 Performance measurement of six NPB programs under over-subscription. All programs are compiled with problem size CLASS=C, with NPROCS=100 (bt, sp) and 128 (ft, is, lu, mg). Different problem sizes or NPROCS follow the same pattern.
2.8 Elastic job's overall performance and variations.
2.9 Sensitivity analysis of Omax, λ, and ∆.
2.10 Decision tree for elastic job selection.
2.11 Bounded slowdown: side-by-side view before and after EJB is added, grouped into elastic, non-elastic, and all jobs.
2.12 Linear regressions of tw over job size before and after EJB is added. Proximity to the x-axis indicates fairness.
2.13 Changing utilization: EJB is more resistant under high utilization.
3.1 High-level architecture.
3.2 Performance results of the Basic algorithm, five batch workloads and six on-demand workloads, R = 0, W = 0.
3.3 Performance results of the Basic algorithm with static reserve, three batch workloads, ρ = 10% on-demand workloads.
3.4 Performance results of the Hint algorithm, H = 15 min or H = 30 min, three batch workloads, six on-demand workloads.
3.5 Performance results of the Predictive algorithm, three batch workloads, six on-demand workloads.
3.6 Measurements of elasticity based on U77 and ρ = 20% workloads.
4.1 Overview of the Bundle layer.
4.2 An overview of the Bundle architecture; blue shaded components comprise the Bundle software.
4.3 Visualization of a month-long workload of the TACC Stampede HPC cluster.
4.4 Compute node performance clustering.
4.5 Network performance.
Chapter 1
Introduction
High Performance Computing, or HPC, generally refers to the practice of aggregating
the power of a cluster of computers to tackle large problems beyond the capacity of a
single node. Compared to general-purpose computing, HPC achieves high performance
by co-designing high-end computing, storage, and networking. From hurricane and
earthquake predictions to solving global hunger challenges, never before has HPC been
more crucial to empowering science for the benefit of humanity.
HPC allows scientists to simulate bigger models represented at finer scales at faster
speeds. Take the research area of earthquakes for example. Large earthquakes cause
devastating damage to human society. With the major advances occurring in HPC,
the ability to simulate the complex processes associated with major earthquakes helps
scientists push the frontiers of studies aimed at reducing seismic damage. For
instance, a work [1] published in 2017 ran large-scale nonlinear earthquake simulations
on Sunway TaihuLight – the world's fastest supercomputer in that year. The work
achieved over 15% of the supercomputer's peak performance, with the extreme cases
demonstrating a sustained performance of over 18.9 Pflops, enabling the simulation of
the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.
Not limited to scientific computing, the increasing adoption of HPC technologies
also advances innovation in emerging areas such as Big Data, the Internet of Things (IoT),
and Machine Learning (ML). One compelling example is AeroFarms, an IoT/ML-driven
vertical farm that controls plants' operating and growing environment at fine
granularity. The vertical farm gathers data on every factor from moisture and nutrients
to light and oxygen, and sends the data to HPC facilities optimized for machine learning.
HPC technologies enable complex decision-making, such as real-time quality control that
relies on diverse types of data. Empowered by the deep integration of IoT/HPC/ML,
the farm is capable of using up to 95% less water than traditional field farming and
significantly improving annual productivity [2].
Driven by real-world HPC applications, both academia and industry have invested
great effort in building the next generation of HPC systems, i.e., exascale computing [3],
which targets one exaflops (10^18 floating-point operations per second) by 2021 [4].
However, most emphasis has been placed on architecture and infrastructure, while HPC
resource scheduling has not evolved much compared to a decade ago. Almost all HPC
clusters today are managed by queue-based batch schedulers (e.g., SLURM, PBS, TORQUE)
to arbitrate resource sharing among applications. This type of scheduling aims to
maximize application run-time efficiency while sacrificing response time and sometimes
utilization, which hinders the ability of HPC systems to serve the needs of HPC users
under different workload patterns or to support richer use scenarios. In order to
support these needs, this thesis explores alternative approaches which balance the
aforementioned performance goals, namely efficiency, utilization, and response time.
1.1 Resource Scheduling in HPC Clusters
As Figure 1.1 displays, in an HPC cluster, when a user wants to run an application,
e.g. a set of tasks, the user would need to submit a resource request to the cluster’s
scheduler. The request essentially specifies (a) number of processors, and (b) estimated
runtime needed to execute the tasks. The scheduler responds to the resource request by
granting a temporary ownership of resources, taking place between a well-defined start
time and end time. Such a temporary ownership is defined as a job, which is the basic
scheduling unit in an HPC resource scheduler. The distinction between a job and an
application is that the former can be seen as a lease on the resources needed by the latter.
Figure 1.1: The process of (1) user submits resource request, (2) job scheduling, and (3) application execution in an HPC cluster.

From a resource efficiency perspective, a direct method to achieve high performance
for an application is to give the job corresponding to the application exclusive ownership
of resources for a continuous period of time – a strategy commonly known as space share,
as opposed to time share (or multi-tenancy), which allows multiple applications to
concurrently operate in a shared environment. Another important assumption with space
share is that an independent processor is assigned to every task within an application,
such that space sharing is realized at the task/processor level.
A batch scheduler arbitrates space share by managing a queue of jobs. It controls
the order in which jobs start running according to scheduling policies and resource
availability. A job can start only when its entire resources are allocated in the
processors × time 'shape', such that the application can execute without interruption
until it completes or the job runtime expires. This job model can be characterized as
'rigid-job' and 'all-or-nothing'. Rigid-job means that a job's size and duration are
not flexible. All-or-nothing describes the requirement that the scheduler allocate the
resources all at once; otherwise the job will have to keep waiting.
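To make the rigid-job, all-or-nothing model concrete, the following minimal Python sketch (our illustration; the function and variable names are hypothetical, not part of any scheduler's API) checks whether a job's full processors × time shape is available:

    def can_start(job_procs, job_runtime, free_procs, free_until, now):
        """All-or-nothing admission: a rigid job J = (P, R) starts only if
        P processors are free for its entire runtime R."""
        return (job_procs <= free_procs and          # enough processors now...
                now + job_runtime <= free_until)     # ...and for the whole duration

    # A 64-processor, 3600s job cannot start while only 32 processors are free,
    # no matter how long those 32 stay idle -- it keeps waiting.
    print(can_start(64, 3600, free_procs=32, free_until=float("inf"), now=0))  # False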
Finally, another noteworthy trend in the HPC world is that in recent years, a new
generation of scientific equipment and devices, ranging from advanced light sources to
geographic information systems (GIS), has become capable of generating large volumes of
experimental data that require rapid analysis. Unlike traditional batch applications,
these new applications are more time-sensitive, meaning that they require on-demand
resource availability for data analytics rather than tolerating long wait times. Moreover,
fast turnaround time for data processing enables fast iterations of scientific experiments,
which ultimately accelerates the speed of scientific discovery. To serve this new class of
applications, many HPC system operators are building on-demand service frameworks
that support fast resource access.
1.2 Challenges in HPC Resource Scheduling
To understand the challenges in HPC resource scheduling, we need to analyze the trade-
offs made by existing approaches. In order to achieve high performance, the design of
HPC resource scheduling systems makes trade-offs between efficiency, utilization, and
response time.
First of all, HPC scheduling trades off response time for high utilization and effi-
ciency. Because of the high cost of building and operating HPC clusters, high utilization
is usually the major scheduling concern, which means that the job queues are usually
oversubscribed by many users' job requests. As a result, job requests usually have to
wait a long time (hours to days) in queues until they can be started. Moreover, as de-
scribed in the previous section, HPC job scheduling applies rigid-job and all-or-nothing
– two strategies that exacerbate long job waiting times, since it is difficult to find suffi-
cient resources that are concurrently available for a given period of time, especially for
large jobs with long runtimes.
A recent study [5] conducted on 20 representative HPC clusters over the past five
years clearly reflected the long wait time issue. While the backlog of jobs shows
substantial fluctuation over time, the demand for resources, measured in core-years,
consistently exceeded the capacity of the existing resources. Despite the wide variation
in job wait times among the 20 clusters, large jobs have much longer wait times compared
to small jobs.
Second of all, the pursuit of high efficiency in HPC sometimes hinders utilization.
This type of trade-off can also be explained by the limitation of rigid jobs. When jobs'
sizes are rigid and predetermined at resource request time, they are unable to utilize
spontaneously available resources. If there are no other jobs to fill the idle resources,
those resources will be wasted, thus lowering utilization. At the large scale of today's
HPC clusters, a 1-2% loss in utilization could mean a significant waste of computational
power, energy, and money.
Figure 1.2: A real-world week-long bursty workload from the Argonne National Lab's APS cluster.
Third of all, HPC infrastructures operated to support on-demand availability trade
off utilization for low response time. Resources are over-provisioned according to peak
usage such that there are always free resources to fulfill on-demand requests. As a
real-world example shown in Figure 1.2, the APS cluster is a production on-demand
cluster operated in Argonne National Lab (ANL). This example shows a typical pattern
of on-demand workload: highly bursty but low in average utilization.
Based on the above analysis, we can summarize the challenges that HPC scheduling
needs to address as follows:
• Challenge 1: Batch jobs, especially large ones, tend to suffer long waiting time,
which hinders the usability of HPC systems;
• Challenge 2: Rigidity limits jobs’ opportunities to use spontaneous availability,
thus impairing utilization;
• Challenge 3: The support for on-demand availability significantly undermines
utilization.
This thesis addresses all three challenges by developing new elastic scheduling
methodologies – a strategy commonly implemented in cloud computing – to improve HPC
resource scheduling.
1.3 Elasticity in HPC
In the context of cloud computing, elasticity can be defined as “the ability of a system to
adapt to workload changes by automatically provisioning and deprovisioning computing
resources, such that at each point in time the available resources match the current
demand as closely as possible” [6]. In order to satisfy the conditions of elasticity, it
would not only require systems to quickly provision resources on demand, but also
require applications to be able to adapt to the resources available at run-time. However,
neither condition generally holds in the realm of HPC.
• Firstly, nearly all HPC facilities are operated by queue-based batch systems that
do not support on-demand resource provisioning, partly because preemption is
not allowed.
• Secondly, many HPC applications feature parallel computing tasks which require
a fixed number of processors throughout their runtime. This rigidity has greatly
limited the ability of HPC applications to dynamically adapt to system resource
availability at run-time. Breaking this limitation would be a solid step towards
bringing some degree of elasticity to the current HPC resource management paradigm.
Achieving elasticity in HPC systems would help address the key challenges listed
in the previous section. Specifically, if we can enable HPC applications to dynamically
adapt to available resources, they can take the opportunity to utilize fragmented
resources such that they can start sooner and make progress faster, which ameliorates
challenges 1 and 2 at the same time. Moreover, if HPC systems can support resource
provisioning in an on-demand manner, thus addressing challenge 3, it will make HPC
systems very appealing to users who not only desire high performance but also benefit
from cloud-like resource provisioning – the best of both worlds.
1.4 Summary of Research Contributions
This thesis presents my efforts in studying the methods of, and building systems
using, elastic scheduling approaches. At a high level, the thesis takes a problem-driven
approach: identifying problems by observing what users and system administrators are
facing in day-to-day operations. In the context of the thesis, users refer to those who run
applications on HPC systems, including scientific researchers, engineers, and students
from academic institutions.
At a high level, our approaches share several commonalities. We do not assume
that users have expert-level programming skills. We also minimize the changes to
user interfaces: the changes made by our solutions are transparent to users. Therefore,
we chose to implement our methods in the middleware layer – on top of existing systems.
With respect to evaluation, we adopt a trace-driven approach: we validate our
approaches using real-world traces collected from production HPC facilities.
1.4.1 Adaptive Resource Request
In today’s batch queue HPC cluster systems, the user submits a job requesting a fixed
number of processors. The system will not start the job until all of the requested re-
sources become available simultaneously. When cluster workload is high, large jobs
will experience long waiting times due to this policy. To solve this problem, we propose
a new approach that dynamically decomposes a large job into smaller ones to reduce
waiting time, and lets the application expand across multiple subjobs while continuously
achieving progress. This approach has three benefits: (i) application turnaround time
is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our
approach does not depend on job queue time prediction but exploits available backfill
opportunities. Simulation results have shown that our approach reduces application
mean turnaround time by up to 48%, reduces resource fragmentation by up to 59%,
and reduces priority inversions by 20% [7].
1.4.2 Dynamic Resource Negotiation
The recent improvements in experimental devices, ranging from light sources to sensor-
based deployments for Smart Cities projects, lead to the need to analyze more data
on-demand so that they can be effectively used in the management of the experimental
or observational cycle. This means that small, dedicated analysis clusters used by
many experimental communities are no longer sufficient and their users are increasingly
looking to expand their capacity by integrating HPC resources into their workflow. This
presents a challenge: how can we provide on-demand execution within HPC clusters
operated mostly as batch?
Our answer to this question is the design and evaluation of the Balancer: a service
that dynamically moves nodes between an on-demand cluster configured with cloud
technology (e.g., OpenStack) and a batch cluster configured with a batch scheduler (e.g.,
Torque) with changes to support on-demand resource reclamation. We propose
three algorithms for moving nodes between on-demand and batch partitions and evaluate
them experimentally both in the context of real-life traces representing two years of a
specific institutional need, and via experiments in the context of synthetic traces that
capture generalized characteristics of potential batch and on-demand traces. Our results
for the real-life scenario show that our approach reduces the current investment in on-
demand infrastructure by 82% while at the same time improving the mean batch wait
time almost by an order of magnitude (8x) [8].
1.4.3 Dynamic Resource Bundle
Large-scale distributed scientific applications are concurrently scheduled on multiple
HPC resources which are diverse in architectures and interfaces, and temporally vari-
ant in performance and workload. Executing an application on multiple heterogeneous
and dynamic resources is difficult due to the complexity of choosing resources and dis-
tributing the application's tasks over them, especially when the users and the developers
of the application lack a good understanding of the general properties and performance
of the dynamic resources. This thesis addresses this issue by devising uniform resource
abstractions called Resource Bundles, implementing them in middleware, and conducting
experimental evaluations to show the benefits of our methodology.
The abstractions represent characterizations of resource capacities and capabilities.
We collected resource information over a year of 10 diverse HPC clusters of XSEDE and
NERSC, and thousands of distributed servers of OSG. These resource characterizations
offer useful insights on how distributed applications should be coupled with multiple
heterogeneous resources [9].
1.5 Outline
Chapter 2 will present Elastic Job Bundling, a service that dynamically determines
resource requests and elastically executes parallel applications in existing HPC clusters.
Chapter 3 will present Balancer, a service that dynamically negotiates resources be-
tween a batch scheduler and an on-demand scheduler. Chapter 4 will present Bundle,
a service that supports elastic coupling between distributed applications and heteroge-
neous resources. Chapter 5 will conclude and discuss future research directions.
Chapter 2
Elastic Job Bundling: An
Adaptive Resource Request
Strategy for Large-Scale Parallel
Applications
2.1 Introduction
Scientific research today in areas such as fluid dynamics and climate modeling is largely
dependent on simulations which have large computational needs [10]. Parallel computers
are commonly used to address such problems of ever increasing scale [11]. With the rapid
growth of scientific parallel programs designed to execute simultaneously on hundreds to
thousands of processors, swiftly provisioning a large number of processors has become
more challenging.
Massively parallel supercomputers have long been the most popular platform for
executing large-scale scientific applications. Due to the high cost of these machines,
users usually space-share them by submitting individual job requests to the batch queue
system. Each job request contains the number of desired processors P and a run time
estimate R. Once a job is scheduled, it gains exclusive use of the P processors until
it finishes before R, or is killed when R expires.
Mapping each application's resource request to a P × R shape is convenient for users
to specify and simplifies batch scheduler design. However, this rigid scheme may also
cause the following problems: (i) when system workload is high, it is difficult to find
enough free processors for large jobs which leads to long waiting time; and (ii) when
most jobs are large, a comparatively small number of free processors cannot be efficiently
utilized, since these fragments are unusable for any waiting job. Giving higher priorities
to large jobs will not solve these problems, particularly in the event that the workload
is dominated by large jobs.
In this work, we propose a new technique addressing the queue waiting problem
called Elastic Job Bundling. When a large job of size P × R is waiting in the queue, we
decompose it into several smaller subjobs of size Px × Rx (Px < P) to reduce wait time.
This technique then manages the time overlap of subjobs to allow the application to
continuously execute and make progress.
In contrast to prior approaches such as [12, 13, 14], our technique: 1) does not require
any changes to the batch scheduler, 2) does not depend on queue time prediction, and
3) does not require any changes to the application (e.g. moldability or malleability).
We evaluate our approach using real-world workloads. Preliminary results reveal
that our approach:
• on average reduces target job waiting & turnaround time by up to 69% & 48%
respectively;
• on average reduces system-wide job waiting & turnaround time by up to 39% &
27% respectively;
• promotes fairness in terms of waiting time between large and small jobs;
• lowers system fragmentation by up to 59%.
2.2 Elastic Job Bundling (EJB)
Elastic job bundling (EJB) is a software layer that operates between parallel application
end-users and HPC batch systems (see Figure 2.1). The goal of EJB is to reduce the
turnaround time of parallel applications, especially those that demand a large number of
processors. EJB accepts ordinary job requests and transforms them into multiple smaller
subjobs which can start earlier than the original job. Applications initially start running
on these smaller subjobs with degraded performance due to over-subscription. During
run time, the application will dynamically expand onto processors subsequently acquired
by EJB through additional subjob requests, as more resources become available.

Figure 2.1: Overview of Elastic Job Bundling.
2.2.1 The Formation of Elastic Jobs
Traditionally, one parallel application A is bound to a single job J , with fixed processors
P and run time estimation R, which can be expressed as A ↦ J = (P, R). A batch
scheduler will either allocate all of the P × R resource or keep the job waiting. This
all-or-nothing job scheduling strategy can lead to inefficiency. Consider the example in
Figure 2.2a. Because all job requests are rigid, the three jobs experience long waiting
time despite the presence of many idle processors. Intuitively, by changing the “shapes”
of the waiting jobs in a way that they can adapt to the dynamic workload, we can not
only reduce queue waiting time, but also improve resource utilization.
EJB implements this idea as follows: EJB treats a job request sent to it as a target
job Jt, and the application bound to Jt as the target application, At ↦ Jt = (Pt, Rt).
EJB tries to improve Jt’s turnaround time by first decomposing Jt into several smaller
subjobs Jx = (Px, Rx), x = 1, . . . , n, Px < Pt. For example, if the jobs in Figure 2.2a
were submitted to EJB, it would treat those jobs as target jobs and decompose them to
smaller subjobs which can start much earlier and increase utilization (Figure 2.2b).
Figure 2.2: Illustration of elastic jobs. (a) Rigid monolithic jobs experience long waiting time; job priority J1 > J2 > J3. (b) Subjob decomposition: Jxy are subjobs decomposed from monolithic job Jx. (c) Elastic job composition: Jex is the elastic job corresponding to monolithic job Jx.
Second, EJB “bundles” the resource allocations from the independent subjobs to
create an integrated malleable job Je = (J1, J2, . . . , Jn), called an elastic job, as Fig-
ure 2.2c shows. Third, EJB runs the target application continuously on the elastic job,
At ↦ Je, which will be discussed in the next section. At any point in time, the number
of total processors allocated to Je will be ≤ Pt since we maintain Pt processes of At at
all times.
A subjob looks like an ordinary parallel job to the batch scheduler. The prefix
“sub” is only meant to articulate a composition relationship between subjobs and the
integrated elastic job. The notation introduced in this section is summarized as
follows:
• target job: Jt = (Pt, Rt);
• target application: At;
• subjob: Jx = (Px, Rx), x = 1, . . . , n;
• elastic job: Je = (J1, J2, . . . , Jn).
2.2.2 Running Applications on Elastic Jobs
When running a target application on an elastic job, the number of processors allocated
to all concurrently running subjobs can change. The total duration of an elastic job
can be divided into intervals. The number of processors in each interval stays the same.
However, EJB does not change the number of parallel processes in the application.
Instead, EJB adapts the target application to the elastic job through over-subscription
and migration. Thus, the application structure or logic need not change.
Given At ↦ Jt = (Pt, Rt), we know that At has Pt processes. By running At
exclusively on Jt, with one process per processor, the run time of At is Rt:

pAt(Pt) = Rt    (2.1)

Suppose At is compute-bound with a balanced workload, which is typical of many SPMD
applications. Under over-subscription, At is run on q processors, q < Pt. Under an even
distribution, each processor is time-shared by up to ⌈Pt/q⌉ processes, where each process
on the same processor is given an equal share of the CPU. In this case, the expected
performance degradation is proportional to ⌈Pt/q⌉, such that:

pAt(q) = ∆ · ⌈Pt/q⌉ · Rt    (2.2)

where ∆ is a penalty factor that models the severity of the performance degradation due to
over-subscription. In the ideal case, ∆ = 1 indicates that the performance degradation
is linearly proportional to the degree of over-subscription.
Obviously, one processor can only support a limited number of processes for over-
subscription, due to memory constraints or context switching cost. We denote by Omax
an upper bound on the degree of over-subscription. For simplicity in this study, we
assume Omax to be the same for different applications, such that q ∈ [⌈Pt/Omax⌉, Pt]. Our
technique is applicable to more complex degradation models or to differing values of
Omax, but these are the subject of future work.
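For illustration, the degradation model of Equation (2.2), together with the bound on q, can be rendered in a few lines of Python (a sketch under the stated assumptions, not EJB's implementation):

    import math

    def p_At(q, P_t, R_t, delta=1.0, O_max=8):
        """Estimated runtime of a P_t-process application on q processors
        (Equation 2.2): each processor time-shares up to ceil(P_t/q) processes."""
        assert math.ceil(P_t / O_max) <= q <= P_t    # q in [ceil(Pt/Omax), Pt]
        return delta * math.ceil(P_t / q) * R_t

    # A 100-process application taking 800s on 100 processors is expected to
    # take about 3200s on 25 processors (4x over-subscription, delta = 1).
    print(p_At(25, P_t=100, R_t=800))  # 3200.0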
When a new subjob Jx is added to Je, EJB migrates a subset of At’s running
processes to Jx’s processors, lowering the degree of over-subscription. Before a running
subjob Jy terminates, EJB must migrate all the processes running in Jy to Je’s other
continuing subjobs. This type of cross-subjob migration can be performed in bulk, such
that all the migrated processes are migrated concurrently. At stops making progress
during migration intervals. We first assume that each bulk migration interval has a
fixed maximum duration of λ seconds. To evaluate the impact of migration cost, we
can vary λ.
Je has two types of intervals: RUN and MIGRATE. Suppose there are l intervals
in Je. In interval k (k = 1, . . . , l), Je has qk concurrently usable processors and interval
length=Lk. Now we model At’s progress on Je as:
∑_k Lk / pAt(qk) = 100%, where k = 1, . . . , l and k.type = RUN    (2.3)
Equation (2.3) can be understood as follows: the completion of At requires that the
accumulated progress made by every interval sums to 100%. Based on Equation (2.3),
we can: (i) estimate At's progress at any point during its run time, (ii) estimate the
time it takes to achieve a certain amount of progress, and (iii) given At’s current progress
and upcoming intervals, estimate At’s completion time.
Figure 2.3: Mapping a parallel application's processes to an elastic job, including progress measurement.

We demonstrate this progress model in Figure 2.3. In this example, a 4-process At is
submitted with Jt = (4, 800s). Je contains 3 subjobs: J1 = (1, 1460s), J2 = (1, 1060s),
and J3 = (2, 440s). Je has 7 intervals, the durations of which are marked below the
interval number in the figure. In interval 1, since only J1 is running, Je has 1 processor.
Each of the 4 processes of At over-subscribes the same processor in a time-shared manner,
such that At makes progress at a rate of 1/4. By the end of interval 1, when J2 starts
running, At's progress is 12.5%. With J2's 1 processor added, Je has 2 processors.
Interval 2 is a MIGRATE interval. Suppose its length λ = 20s, within which processes 3
and 4 are migrated to J2. Then in interval 3, At makes progress at a rate of 1/2. By the
end of interval 3, when J3 starts running, At's progress is 37.5%. Since J3 terminates
before At's completion time, 2 MIGRATE intervals, 4 and 6, are added. Processes 2 and 4,
which were migrated to J3 in interval 4, must be migrated back to J1 and J2 in interval
6 before J3 terminates. There is no over-subscription in interval 5, by the end of which
At's progress is 87.5%. Interval 7 is the last interval, by the end of which At's progress
is 100%; At then completes. Je's runtime is the summation of its intervals: 1460s.
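The interval accounting in this example can be replayed directly from Equation (2.3). The following Python sketch (ours, for illustration only) reproduces the progress figures above:

    import math

    def replay_progress(P_t, R_t, intervals, delta=1.0):
        """Sum L_k / p_At(q_k) over RUN intervals (Equation 2.3);
        MIGRATE intervals contribute no progress."""
        progress = 0.0
        for kind, q_k, L_k in intervals:
            if kind == "RUN":
                p = delta * math.ceil(P_t / q_k) * R_t   # p_At(q_k), Equation (2.2)
                progress += L_k / p
        return progress

    # The Figure 2.3 example: Jt = (4, 800s); seven intervals totaling 1460s.
    ivals = [("RUN", 1, 400), ("MIGRATE", 2, 20), ("RUN", 2, 400),
             ("MIGRATE", 4, 20), ("RUN", 4, 400), ("MIGRATE", 2, 20),
             ("RUN", 2, 200)]
    print(replay_progress(4, 800, ivals))   # 1.0, i.e. 100% progress
    print(sum(L for _, _, L in ivals))      # 1460, Je's total runtime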
2.2.3 Taming Unpredictability
EJB needs to control the sizes of subjobs to enable them to be scheduled early, and
to ensure that they overlap in run time to allow for migration. However, accurate
queue wait time prediction is known to be a difficult problem despite many efforts in
this area [15, 16, 17, 18, 19, 20, 21, 22]. We address this challenge by controlling the
shape P × R of the subjobs such that they can be immediately scheduled to run on the
fragmented idle resources. For example, production schedulers such as TORQUE [23] or
SLURM [24] are capable of providing information on immediately available resources
through user interfaces such as showbf or slurmbf.
Figure 2.4: Finding immediately usable resources under EASY backfilling. (a) The EASY backfilling algorithm. (b) Immediately backfillable job slots: type-I and type-II.

Table 2.1: Upper bounds on processors (Pmax) and runtime (Rmax) of the two types of immediately backfillable job slots.

  Slot type    | Pmax             | Rmax
  Type-I slot  | extra processors | unlimited
  Type-II slot | free processors  | Tshadow − Tnow

At first glance, one may think that it would be difficult to find sufficient idle
resources, especially on HPC clusters that are often over-committed. However, we argue
that a large factor contributing to job waiting time is the shape of the queued
jobs, as in the example given in Figure 2.2a. Due to its wide deployment, we present
how EJB can work with EASY backfilling [25] and later evaluate its performance.
We now briefly revisit the EASY backfilling algorithm. Each time the scheduling
algorithm runs, EASY tries to maximize utilization at that point of time, while only
guaranteeing the start time of the first job in the queue. The example in Figure 2.4a
shows that at time "now", three jobs are waiting and the number of available free
processors is less than the number required by the 1st job. EASY first loops over the
running jobs in the
order of their expected termination time, until the available processors are sufficient for
the 1st job, when the 1st job is guaranteed to start. EASY calls this time the shadow
time Tshadow. If the available processors at Tshadow exceed the processors required by the
1st job, the surplus processors are called extra. As a second step, EASY finds backfillable jobs
according to the condition that they do not delay the 1st job. In our example, both
the 2nd and 3rd job do not satisfy this condition, so they will keep waiting. If any
lower-priority job satisfies the backfill condition, they will be selected as backfill jobs to
start immediately, and they may add unbounded delay to the 2nd and 3rd job in our
example.
Figure 2.4b shows the upper bound, in both the processor and time dimensions, of the
shape of backfillable jobs. Table 2.1 lists these upper bounds, which can be spatially
imagined as slots with height = Pmax and width = Rmax; we call these immediately
backfillable job slots. There are two types. In a type-I slot, Pmax = extra processors,
and there is no upper bound on Rmax. In a type-II slot, Pmax = free processors. Simply
speaking, jobs submitted to fill a type-I slot can run on a smaller number of processors
with unlimited runtime, while jobs submitted to fill a type-II slot can run on a larger
number of processors, but with limited runtime.
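To illustrate how these slots could be derived from scheduler state (the kind of information that showbf or slurmbf expose), consider the following sketch; the data structures are our own simplification, not an actual scheduler API:

    def backfillable_slots(free, running, first_job_procs, now):
        """Compute the type-I and type-II immediately backfillable job slots
        under EASY backfilling. `running` is a list of (procs, expected_end)."""
        avail = free
        for procs, end in sorted(running, key=lambda r: r[1]):
            avail += procs
            if avail >= first_job_procs:          # the 1st queued job can start here
                shadow_time = end
                extra = avail - first_job_procs   # surplus at the shadow time
                # Type-I: up to `extra` processors, unlimited runtime.
                # Type-II: up to `free` processors, must end before the shadow time.
                return {"type1": {"Pmax": extra, "Rmax": float("inf")},
                        "type2": {"Pmax": free, "Rmax": shadow_time - now}}
        return None  # the first job can already start; no shadow time exists

    slots = backfillable_slots(free=16, running=[(32, 500), (64, 900)],
                               first_job_procs=100, now=0)
    print(slots)  # type-I: Pmax=12, unlimited; type-II: Pmax=16, Rmax=900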
2.2.4 Assumptions and Limitations
In summary, we made the following assumptions for EJB:
1. EJB targets the optimization of large tightly-coupled (such as MPI) parallel ap-
plications. Embarrassingly parallel applications, or bags of tasks, are comparatively easier to
schedule, since they do not require co-scheduled subjobs, nor cross-subjob migra-
tions.
2. Target applications are compute-bound, and not memory-bound. Otherwise, a
large memory footprint will prohibit processor over-subscription.
3. In this work, we assume that the underlying batch scheduler runs the EASY
backfilling algorithm, without additional priority control policies.
2.3 EJB Scheduling Algorithm
EJB runs a heuristic event-driven scheduling algorithm executed at four types of events:
1. TargetJobArrivalEvent : A target job is submitted to the EJB scheduler (EJB-
sched).
2. IdleJobSlotsAvailableEvent : New idle job slots become available.
3. SubjobStartEvent : A subjob starts running.
4. TargetAppCompleteEvent : The target application has run to completion.
The time at which the event happens is called Tnow.
2.3.1 TargetJobArrivalEvent Handler
When EJB-sched receives a job request At 7→ Jt = (Pt, Rt), EJB-sched first needs to
check if the shape Pt × Rt can be scheduled immediately by the batch scheduler. For
this purpose, EJB-sched submits a special subjob J0 = (Pt, Rt) to the batch scheduler.
If J0 starts running immediately, then we are done. Otherwise, EJB-sched creates a
new Je for At. EJB initializes Je as follows:
• Status: Je.Stat = WAIT ;
• Maximum processors needed: Je.Pmax = Pt;
• Currently usable processors: Je.Pcurrent = 0;
• List of subjobs: Je.SubjobList = [J0];
• List of available intervals: an interval is a data structure having the following
information:
– Type - [RUN |MIGRATE],
– Processors - concurrently usable processors,
– StartT ime - when the interval starts,
– Duration - how long the interval lasts,
– SubjobList - subjobs running during the interval,
– MigrationP lan - valid for MIGRATE interval,
initially Je.IntervalList = [empty];
Figure 2.5: Three cases for subjob submission in a waiting elastic job.
• Current progress of At: At.Progress = 0%;
• At's estimated completion time: At.Tc = ∞.
J0 functions as a placeholder in the batch queue.
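For concreteness, the per-elastic-job state listed above could be rendered as Python dataclasses (a sketch mirroring the field list, not the actual EJB code):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Interval:
        type: str                      # "RUN" or "MIGRATE"
        processors: int                # concurrently usable processors q_k
        start_time: float              # when the interval starts
        duration: float                # how long the interval lasts
        subjob_list: List[str] = field(default_factory=list)
        migration_plan: Optional[dict] = None   # only for MIGRATE intervals

    @dataclass
    class ElasticJob:
        stat: str = "WAIT"             # Je.Stat
        p_max: int = 0                 # Je.Pmax = Pt
        p_current: int = 0             # Je.Pcurrent
        subjob_list: List[str] = field(default_factory=list)   # starts as [J0]
        interval_list: List[Interval] = field(default_factory=list)
        progress: float = 0.0          # At.Progress
        t_c: float = float("inf")      # At.Tc, estimated completion time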
2.3.2 IdleJobSlotsAvailableEvent Handler
When Je.Stat = WAIT
EJB-sched checks whether the available type-I slot S1 and/or type-II slot S2 are big
enough to run the entire At. There are three cases to check when satisfying this
condition, as Figure 2.5 shows: (i) submit one subjob which fits S1, (ii) submit one
subjob which fits S2, and (iii) submit two subjobs to fit S1 and S2 respectively.
EJB-sched estimates At.Tc, if feasible, for each of the three cases, and then submits
the subjobs that produce the shortest estimated completion time. If none of the three
cases is met, EJB-sched does nothing.
Case 1: if ∃S1 and S1.Pmax ≥ Pt/Omax, then submit subjob J1 = (P1, L1) with
P1 = Pt / ⌈Pt / S1.Pmax⌉ and L1 = pA(P1), giving At.Tc = Tnow + L1.

Case 2: if ∃S2 and S2.Pmax ≥ Pt/Omax, then running on P1 = Pt / ⌈Pt / S2.Pmax⌉
processors gives L1 = pA(P1). If L1 < S2.Rmax, then J1 = (P1, L1) and At.Tc = Tnow + L1.

Case 3: if (i) ∃ both S1 and S2, (ii) S1.Pmax ≥ Pt/Omax, and (iii)
Pt / ⌈Pt / S2.Pmax⌉ > Pt / ⌈Pt / S1.Pmax⌉, EJB-sched may simultaneously submit two
subjobs such that At will (i) run on both subjobs in L1, (ii) migrate processes from J2
to J1 in L2, and (iii) resume in L3 while running only on J1, such that:

• P1 = Pt / ⌈Pt / S1.Pmax⌉,
• P2 = Pt / ⌈Pt / S2.Pmax⌉ − P1,
• L1 = S2.Rmax − λ,
• L2 = λ,
• L3 = (100% − L1 / pA(P1 + P2)) · pA(P1),
• J1 = (P1, L1 + L2 + L3),
• J2 = (P2, L1),
• At.Tc = Tnow + L1 + L2 + L3.
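The decision among these cases is simply an argmin over the feasible completion-time estimates; a small sketch of the selection step (the numbers are made up for illustration):

    def best_schedule(candidates):
        """Pick the case (1-3) with the earliest estimated completion time.
        `candidates` maps a case name to (At_Tc, subjobs), or to None when
        that case's feasibility conditions fail."""
        feasible = {k: v for k, v in candidates.items() if v is not None}
        if not feasible:
            return None                  # none of the three cases is met: do nothing
        return min(feasible.items(), key=lambda kv: kv[1][0])

    choice = best_schedule({
        "case1": (5200.0, [("J1", 25, 3200)]),           # (At.Tc, [(name, P, R)])
        "case2": None,                                    # e.g. type-II slot too short
        "case3": (4300.0, [("J1", 25, 2300), ("J2", 25, 980)]),
    })
    print(choice[0])  # "case3" -- the earliest estimated completion time wins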
When Je.Stat = RUNNING
EJB-sched checks whether adding more resources to Je could advance At.Tc. This is
not always true because in order to increase speedup after adding more processors, Je
needs to pay the price of migration. EJB-sched will only decide to allocate more subjobs
to Je when the benefit outweighs the cost. EJB-sched needs to evaluate at most three
possible schedules based on resource availability and the current status of Je. Basically,
Je.IntervalList will be updated with the newly available processors. EJB-sched can
then re-estimate the new At.Tc according to the updated Je.IntervalList:
Case 1: submit a new subjob Jx = (Px, Rx) with Rx = new At.Tc − Tnow. This case
applies for a type-I available job slot, or a type-II slot when the slot’s Rmax is sufficiently
long. This case instantly triggers a migration in which processes in existing subjobs are
partially migrated to Jx. All the subsequent intervals will increase their qk by Jx.Px.
Case 2: submit a new subjob Jx = (Px, Rx) with Rx < new At.Tc − Tnow. This case
applies for a type-II job slot with small Rmax. Besides triggering an instant expansion
migration, this case will also schedule a shrinkage migration before Jx terminates. The
Processors (qk) associated with every interval in Je.IntervalList between Tnow and
Tnow + Rx will be incremented by Jx.Px.
Case 3: submit two new subjobs Jx = (Px, Rx) and Jx+1 = (Px+1, Rx+1), Jx will run
until the recalculated completion time and Jx+1 will terminate earlier. This case is a
combination of cases 1 and 2.
Many fine-grained optimizations, such as combining or removing migration intervals, are
considered in our algorithm. For brevity, we omit how the shapes of Jx and Jx+1
are determined and how At.Tc is recalculated; please refer to [26] for details. Based
on the evaluation results in the above cases, EJB-sched will choose the schedule that
can produce the earliest completion time. Whenever new resources become available,
EJB-sched will call this event handler unless Je has reached full parallelism Je.Pmax.
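The benefit-versus-cost test reduces to comparing completion-time estimates with and without the migration penalty. A simplified sketch using the Equation (2.2) model (our illustration; the real algorithm evaluates the full interval list):

    import math

    def should_expand(P_t, R_t, progress, q_now, q_new, mig_cost, delta=1.0):
        """Expand only if paying `mig_cost` seconds of migration still yields
        an earlier completion than continuing on the current q_now processors."""
        p = lambda q: delta * math.ceil(P_t / q) * R_t    # p_At(q)
        t_stay = (1.0 - progress) * p(q_now)
        t_move = mig_cost + (1.0 - progress) * p(q_new)
        return t_move < t_stay

    # With 50% of the work left, doubling 2 -> 4 processors easily covers a
    # 120s migration; near completion, the same migration is not worth it.
    print(should_expand(4, 800, progress=0.5,  q_now=2, q_new=4, mig_cost=120))  # True
    print(should_expand(4, 800, progress=0.98, q_now=2, q_new=4, mig_cost=120))  # False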
2.3.3 SubjobStartEvent Handler
When a subjob starts, EJB-sched performs process migration as scheduled. However, if
the place holder job J0 starts, based on A’s current progress, EJB-sched has the options
of (i) migrating all running processes to J0, or (ii) continuing execution on existing
subjobs and cancel J0, or (iii) restarting At on J0 and discarding currently achieved
progress. EJB-sched will choose the option which can produce earliest completion time.
2.3.4 TargetAppCompleteEvent Handler
When the application terminates earlier than the projected finish time At.Tc, EJB-sched
will cancel all running subjobs. EJB-sched will also cancel J0 if it is still in queue.
2.4 Implementation
Figure 2.6 presents the architecture of the EJB system. At a high level, the EJB system
consists of two parts: the EJB Manager and the EJB Worker. The EJB manager can be
launched on any machine which has a connection to the HPC cluster’s front node. Users
of the EJB system can submit job requests to EJB-sched through an interface similar to
the batch submission. EJB-sched runs the scheduling algorithm described in the last
section. EJB-sched interacts with the batch scheduler only through ordinary calls such
as show job queue status, submit job, and cancel job. In order to control and
manage the elastic jobs, scheduling operations for all elastic jobs are placed on an
Operation Queue. There are two types of Operations: launch, which submits the
application with a computed degree of over-subscription, and migrate-dest, which
decides two things:
the group of processes that will be migrated, and the destination subjob that will receive
the processes. The EJB Controller is in charge of sending these operations to the EJB
Workers running in each subjob. This mode of operation can be seen as similar to the
pilot job [27], in which a resource is first acquired by a pilot job, and then tasks are
scheduled into that resource. In our case, when a subjob starts, the EJB worker will
direct the target application to perform the scheduled operations in that subjob. There
will be only a small number of messages sent between the EJB Controller and EJB
Worker throughout an elastic job’s lifecycle.
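Sketched in Python, the two operation types might be represented as simple records placed on the Operation Queue (hypothetical structures, not the actual implementation):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Launch:
        """Submit the application on a subjob with a computed degree
        of over-subscription."""
        subjob: str
        oversubscription: int          # processes per processor

    @dataclass
    class MigrateDest:
        """Decide which group of processes migrates and which
        destination subjob receives them."""
        processes: List[int]
        dest_subjob: str

    # The EJB Controller pops operations off the queue and forwards each one
    # to the EJB Worker running inside the relevant subjob.
    op_queue = [Launch("J1", 4), MigrateDest([3, 4], "J2")]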
Over-subscription is supported by most MPI implementations, including OpenMPI,
MPICH, and their derivatives. For example, the OpenMPI run-time environment detects
over-subscription and sets MPI processes to degraded mode, which means that they
yield processors when idle. In order to validate our performance degradation model
of Equation (2.2), we use the NAS Parallel Benchmarks (NPB [28]) and measure the
over-subscription performance using the FutureGrid [29] testbed.
Figure 2.6: EJB system architecture.

Figure 2.7: Performance measurement of six NPB programs under over-subscription (normalized runtime versus degree of over-subscription, with reference lines ∆ = 1 and ∆ = 1.57). All programs are compiled with problem size CLASS=C, with NPROCS=100 (bt, sp) and 128 (ft, is, lu, mg). Different problem sizes or NPROCS follow the same pattern.

In Figure 2.7, NPB programs of fixed problem size and amount of parallelism were
run on fewer processors to produce over-subscription. We measure the end-to-end execu-
tion time at different over-subscription levels. The measured times are compared against
one process per core. We can observe sub-linear (ft, is), linear (bt), and super-linear
(lu, mg, sp) performance degradations in Figure 2.7. Our key finding is that degradation
correlates with scalability. For example, the two programs with super-linear degradation,
lu and sp, also achieve super-linear speedup in our experimental cluster. ft and is, both
show sub-linear degradation and also show sub-linear speedup. bt shows both linear
speedup and degradation. The only exceptional case is mg, which shows slightly sub-
linear speedup, but super-linear degradation. For this type of parallel program, a more
conservative estimate of ∆ is needed.
Another key point is that the maximum number of processes on a single physical
processor is limited by memory size. For example, a parallel application will be killed
if its memory usage exceeds the physical memory size. Memory-bound parallel
applications require out-of-core techniques such as [30] to be able to run with EJB. The
actual benefits in this situation depend on what level of over-subscription is workable
and how big the performance degradation is.
To enable migration, we use DMTCP [31], a user-level checkpoint/restart tool for
parallel applications including MPI. DMTCP requires neither re-compilation of the
application nor system privileges. Migration includes three steps: global checkpointing,
moving checkpoint images, and restart.
DMTCP supports checkpointing by adding a checkpoint management thread to ev-
ery process at start-up time. During checkpointing, all application processes are simul-
taneously suspended and the checkpoint images are written to disk by the DMTCP
checkpoint management threads. For compute nodes equipped with a shared file system
(as on FutureGrid), explicitly moving the image is not required. For restart, each EJB
Worker is directed to resume its own group of suspended processes, thus completing the
migration.
Thus on FutureGrid, the migration time is predominantly spent on checkpointing,
which is determined by the parallel application’s total memory usage and disk-write
speed. For example, in our experimental cluster, checkpointing an NPB bt program of
size C and 100 processes takes about 30 seconds, generating in total 3GB checkpoint
images at 100MB/s. On clusters supporting parallel I/O, the migration speed could be
greatly accelerated. Ultimately, the migration time grows sub-linearly with application
scale represented by number of processes [32]. For example, when the application scale
increases to 2000 processes, the checkpoint time increases to 280s. In subsection 2.6.2,
we run a sensitivity analysis varying the migration cost from 60 to 600 seconds.
2.5 Trace-driven Simulation
We simulated EJB using logs of real parallel workloads from production systems [33] to
assess feasibility. Table 2.2 lists the 4 selected traces used by our simulation. These 4
traces have been widely used by previous studies of parallel job scheduling algorithms.
Our simulator is based on PYSS [34] – an event-based scheduler simulator developed
by the Parallel Systems Lab at Hebrew University.

Table 2.2: Traces used in our simulation

Log Files                 CPUs    Jobs      Duration   Uti%
CTC-SP2-1996-3.1-cln      338     77,222    7/96-5/97  85.2%
SDSC-SP2-1998-4.1-cln     128     59,725    4/98-4/00  83.5%
SDSC-BLUE-2000-4.1-cln    1,152   243,314   4/00-1/03  76.7%
KTH-SP2-1996-2            100     28,489    9/96-8/97  70.4%

In order to emulate how EJB works in practice, the simulator's EasyBackfillScheduler,
which functions as the cluster batch scheduler, is kept unchanged. Job traces contain
both job walltime and runtime: the former is the user-estimated run length; the latter
is the application's actual run length recorded after it terminates, with walltime ≥
runtime. In simulation, the job's actual runtime is unknown to EJB-sched, which
calculates the projected completion time based on the job's walltime. However, the
simulator keeps track of the actual progress based on the runtime, and triggers the
TargetAppCompleteEvent once the actual progress is 100% (see Equation (2.3)).
In theory, any job in the trace can be submitted to EJB-sched. Nevertheless, jobs
requesting only a few processors cannot be further optimized through over-subscription.
If they experience long waiting times, that indicates a truly high system workload,
and our approach cannot find free slots under this condition. We set the minimum
P of an eligible elastic target job to 8. We set the following default values: the
maximum degree of over-subscription Omax = 8, the migration duration λ = 120 s,
and the performance degradation factor ∆ = 1. Note that λ should vary with
application scale; 120 s is a conservative value, corresponding to the cost of
checkpointing an application with 400 processes [32], which is the upper bound of
application size in all of our traces. We use this conservative value to ensure the
allotted migration time is sufficient, preventing unfinished migrations. In practice,
the algorithm should take a varying λ as input, determined by application size.
Furthermore, each trace's first 1% of jobs, as well as jobs that terminate after the
last job arrival, are excluded from the performance analysis. This is a commonly used
technique to reduce the impact of warm-up and cool-down effects.
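A minimal sketch of this filter, assuming job records sorted by submission time with
submit_time and end_time fields, might look as follows:

    # Sketch: drop warm-up and cool-down jobs from a trace before analysis.
    def filter_for_analysis(jobs):
        warmup = len(jobs) // 100            # skip the first 1% of jobs
        last_arrival = jobs[-1].submit_time  # cool-down boundary
        return [j for j in jobs[warmup:] if j.end_time <= last_arrival]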
2.6 Evaluation
We evaluate EJB through a series of experiments based on simulation. Our baseline for
comparison is a system scheduler that runs EASY Backfilling only. Overall, the results
reveal the following performance benefits of EJB:
• elastic job performance is significantly improved;
• non-elastic job performance is either not impacted or slightly improved;
• system fragmentation is reduced;
• fairness between jobs of different sizes is promoted.
We start by carefully choosing appropriate performance metrics (Section 2.6.1). We
then measure how elastic jobs are improved (Section 2.6.2) and characterize migration
behavior and its overhead (Section 2.6.3). Finally, we study cluster-wide performance
when co-scheduling many elastic jobs together (Section 2.6.4).
2.6.1 Performance Metrics
The elastic job's turnaround time (tt) is measured from when the target job is
submitted to EJB-sched to the point when the target application completes, which is
also when all subjobs terminate. Dividing the baseline tt by the elastic job's tt gives
the speedup of turnaround time:

    S_{tt} = \frac{\text{baseline } tt}{\text{elastic job } tt}    (2.4)
The elastic job’s waiting time (tw) is measured from the target job’s submission to the
start of the first subjob of the elastic job. The elastic job’s run time (tr) is measured
from the time the first subjob belonging to the elastic job starts, to the time the elastic
job's last subjob terminates. The elastic job's bounded slowdown (Slo) is defined as

    S_{lo} = \frac{\text{elastic job } tt}{\text{baseline } tr}    (2.5)
Notice that we don’t use elastic job tr in calculating slowdown, for the reason that
slowdown should be compared against the runtime on a dedicated system, without
over-subscription and migration. Bounded slowdown substitutes a job's baseline tr
with 10 s when tr ≤ 10 s; this prevents super-short jobs from generating very large
slowdown values.

Table 2.3: Increase ('+') or decrease ('-') percentages of the mean wait, run, and
turnaround time of elastic jobs compared to the target jobs' baseline results.

                        percentage change of mean
trace   target jobs   tw        tr        tt [95% conf. interval]
CTC     16,167        -50.4%    +29.3%    -33.9%  [-34.7%, -33.1%]
SDSC    14,329        -68.9%    +57.6%    -48.4%  [-49.5%, -47.3%]
BLUE    64,090        -59.8%    +26.9%    -36.1%  [-36.7%, -35.6%]
KTH     4,399         -66.5%    +34.9%    -37.8%  [-39.7%, -35.9%]
AVG                   -61.4%    +37.2%    -39.1%
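As an illustration, these per-job metrics reduce to a few lines of Python; the following
is a minimal sketch (times assumed in seconds), not the simulator's actual code.

    # Sketch of the per-job metrics defined above (times in seconds).
    def speedup_tt(baseline_tt, elastic_tt):
        # Equation (2.4): speedup of turnaround time.
        return baseline_tt / elastic_tt

    def bounded_slowdown(elastic_tt, baseline_tr, bound=10.0):
        # Equation (2.5) with the bound applied: use at least 10 s for the
        # baseline run time so super-short jobs do not produce huge values.
        return elastic_tt / max(baseline_tr, bound)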
We measure system fragmentation as the average number of idle processors in the
cluster while the batch queue is not empty. This measurement excludes the period
when all the jobs in the cluster have received resources, yet there are still unallocated
processors. For example, if jobs never wait, then system fragmentation will always be 0,
independent of idle processors. As another example, if the scheduler is able to perfectly
fill all resources with jobs, then the system fragmentation is also 0.
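Concretely, given periodic samples of the cluster state, this fragmentation measure
could be computed as in the following sketch (the sampling representation is an
assumption for illustration):

    # Sketch: fragmentation as the mean number of idle processors over
    # the instants when at least one job is waiting in the queue.
    def fragmentation(samples):
        # 'samples' is an assumed iterable of (idle_procs, queue_len)
        # pairs taken at regular intervals over the simulated period.
        idle_while_waiting = [idle for idle, qlen in samples if qlen > 0]
        if not idle_while_waiting:
            return 0.0  # jobs never waited, or resources were perfectly filled
        return sum(idle_while_waiting) / len(idle_while_waiting)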
2.6.2 Improving Elastic Job Turnaround Time
As a first step towards evaluating EJB’s performance, we isolate EJB’s impact on a
single target job by simulating one elastic job in each run of our simulator. We then
compare the elastic job performance against the baseline of the target job. Target jobs
are all the jobs which have P ≥ 8 and baseline tw > 0. Note that we do not need to
know tw accurately, but simply whether it is non-zero; we can determine this by
observing whether J0 starts immediately. As Table 2.3 shows, we simulated ≥ 100,000
such jobs across the four traces combined. Figure 2.8a provides a clearer visual view
of how turnaround time has been improved.
Elastic jobs' mean tt is 39.1% lower than the baseline value, with variations between
traces. As expected, EJB results in significantly shorter tw (61.4% lower) at the
expense of longer tr (37.2% higher) due to over-subscription and migration. Detailed
distributions of Stt are depicted in Figure 2.8b, which shows that most target jobs
benefit from being elastic. Some exceptionally well-performing jobs complete 1,000
times faster than before, a quarter of the target jobs' tt are unchanged, and < 3% of
the elastic jobs experience worse results.

Figure 2.8: Elastic jobs' overall performance and variations. (a) Side-by-side view of
how turnaround time improves by transforming target jobs into elastic jobs (mean job
turnaround time in seconds, split into tw and tr, baseline vs. elastic, per trace).
(b) Cumulative distribution functions (CDFs) of the speedup of turnaround time (Stt)
of all elastic jobs, per trace (log scale).
Next, we perform a sensitivity analysis on Omax, λ, and ∆. Figure 2.9 shows the
results for the CTC trace only, as the other traces reveal similar trends. First, in
Figure 2.9a we vary Omax ∈ {2, 4, 8, 16, 32, 64}: the larger Omax, the greater the
benefit of EJB, although once Omax reaches 16, further increases bring no evident
performance gains.

Table 2.4: Summary of migration-related statistics: mean number of subjobs and bulk
migrations per elastic job, migration duration as a percentage of elastic job run time,
and resource overhead as a percentage of processor × time.

trace   subjobs   migrations   migration duration   resource overhead
CTC     3.3       2.1          8.5%                 19.7%
SDSC    2.8       1.7          5.6%                 16.4%
BLUE    2.7       1.5          8.6%                 18.2%
KTH     2.7       1.6          6.4%                 24.1%
Second, we evaluate whether the performance improvements are sensitive to the
migration cost. In Figure 2.9b, we vary λ from 1 to 10 minutes. The performance is
not very sensitive to the migration cost. This can be explained by the fact that the
average number of migrations is small and the target job's tt (e.g., 9 hours on average
in the CTC workload) is much larger than the migration time.
In Figure 2.9c, we vary ∆ from 0.8 to 2.0. The elastic job mean tt is not very
sensitive to degradation. This is due to (i) the fact that many elastic jobs are able to
find enough processors in the later stages of their life cycle, eliminating the
over-subscription overhead afterwards, and (ii) the fact that tw still accounts for a
considerable proportion of tt even with EJB.
2.6.3 Migration Behavior
Table 2.4 characterizes elastic job overhead with respect to the number of migrations.
The subjobs column shows that on average each elastic job consists of about 3 subjobs,
and conducts bulk cross-subjob migrations approximately twice. Actually, more than
60% of the elastic jobs contain more than one subjob, and around 40% of the elastic
jobs have experienced at least one bulk migration. In very rare cases, the number of
subjobs and migrations can reach > 20. This shows that the performance gain of EJB
is not only a result of moldability, but also the result of migrations.
The migration duration column shows that migration durations on average account
for 5 − 8% of an elastic job’s run time. Furthermore, extra CPU resources may be
spent due to migration and over-subscription. The resource overhead column shows
that elastic jobs have a 16-24% resource overhead, measured in processor × time. A
main factor contributing to this is inaccuracy in the tr estimates. Based on
user-provided trs, the EJB algorithm may decide that it is beneficial to perform
additional migrations; however, the real tr of these elastic jobs may be much shorter,
making those migrations unnecessary. To address this issue in future work, we could
adopt an approach similar to [21], estimating tr more accurately from historical job
information and making migration decisions based on the adjusted tr.

Figure 2.9: Sensitivity analysis of Omax, λ, and ∆ (CTC trace): elastic job mean tt
percent reduction as a function of (a) the maximum degree of over-subscription,
(b) the migration cost, and (c) the performance degradation factor.
Figure 2.10: Decision tree for elastic job selection: a candidate job requesting ≥ 8
processors and experiencing > 0 wait time becomes an elastic job; all other jobs remain
non-elastic.
2.6.4 Multiple Elastic Jobs
In Section 2.6.2, we analyzed how EJB impacts single job performance. In this section,
we try to understand the comprehensive performance impact when many elastic jobs
coexist, in effect competing for resources with each other and with other non-elastic
jobs. The following simulations are meant to emulate real-world conditions when users
arbitrarily submit job requests to EJB-sched.
The impacts of EJB are measured on (i) elastic jobs, (ii) non-elastic jobs, and (iii) all
jobs. Determining the impact by measuring how jobs perform differently after
introducing EJB can be tricky: each job's performance in terms of tt or tr depends
largely on the background workload during its lifetime, and from a single job's
perspective, that background workload could be totally different if EJB were deployed.
We solve this dilemma by applying a statistical method called Before-and-After
Comparison [35], which is designed to evaluate whether the performance change from
adding new features to a system is statistically significant. In our context, the method
works as follows: for each workload, we run the simulation twice, before and after
enabling EJB. Then, for each performance metric, we have a pair of results
corresponding to each job's before and after case. Next, we calculate a confidence
interval for the mean of the differences of the paired values. If this confidence interval
does not include zero, then we can conclude with a certain confidence that there is a
statistically significant difference before and after introducing EJB.
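A minimal sketch of this paired confidence-interval computation, using SciPy (the
per-job metric arrays are assumed inputs), might look as follows:

    # Sketch: confidence interval for the mean of paired differences,
    # as used by the Before-and-After comparison.
    import numpy as np
    from scipy import stats

    def paired_ci(before, after, confidence=0.95):
        diffs = np.asarray(after) - np.asarray(before)
        mean = diffs.mean()
        sem = stats.sem(diffs)  # standard error of the mean difference
        half = sem * stats.t.ppf((1 + confidence) / 2, len(diffs) - 1)
        return (mean - half, mean + half)  # excludes 0 => significant change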
First, we simulate an extreme condition by submitting all jobs that request at least
8 processors to EJB-sched. Table 2.5 shows the Before-and-After comparison results.
Notice that the number of elastic jobs is different from that in Table 2.3. We use the
decision tree in Figure 2.10 to determine which jobs will become elastic.
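The decision tree amounts to a two-condition predicate; a sketch (parameter names
are illustrative):

    # Sketch of Figure 2.10: a candidate job becomes elastic only if it is
    # wide enough to exploit over-subscription and actually has to wait.
    def becomes_elastic(processors, wait_time, min_p=8):
        return processors >= min_p and wait_time > 0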
Table 2.5: Before-and-after comparison; confidence intervals are calculated at the 95%
confidence level.

Workload  Job type     Jobs     Mean tt before/after (change %), conf. interval   Mean tw before/after (change %), conf. interval   Mean tr before/after (change %)
CTC       elastic      21,035   40,276 / 35,512 (-11.8%), (-5,091, -4,437)        31,695 / 25,863 (-18.4%), (-6,165, -5,498)        8,581 / 9,649 (+12.4%)
CTC       non-elastic  59,123   19,268 / 17,874 (-7.2%), (-1,503, -1,285)         7,008 / 5,614 (-19.9%), (-1,503, -1,285)          12,260 / —
CTC       all          76,446   25,049 / 22,728 (-9.3%), (-2,442, -2,201)         13,801 / 11,186 (-18.9%), (-2,737, -2,493)        11,248 / 11,542 (+2.6%)
SDSC      elastic      18,790   58,235 / 40,888 (-29.8%), (-17,880, -16,813)      48,468 / 29,001 (-40.2%), (-20,016, -18,917)      9,767 / 11,887 (+21.7%)
SDSC      non-elastic  40,333   10,589 / 8,676 (-18.1%), (-2,052, -1,775)         5,412 / 3,499 (-35.3%), (-2,052, -1,775)          5,177 / —
SDSC      all          59,123   25,731 / 18,913 (-26.5%), (-7,021, -6,615)        19,096 / 11,604 (-39.2%), (-7,701, -7,282)        6,636 / 7,309 (+10.2%)
BLUE      elastic      135,302  16,863 / 14,881 (-11.8%), (-2,067, -1,899)        12,015 / 9,626 (-19.9%), (-2,475, -2,302)         4,848 / 5,254 (+8.4%)
BLUE      non-elastic  105,560  4,096 / 3,186 (-22.2%), (-941, -878)              1,109 / 199 (-82.0%), (-941, -878)                2,987 / —
BLUE      all          240,862  11,268 / 9,756 (-13.4%), (-1,562, -1,463)         7,235 / 5,495 (-24.1%), (-1,791, -1,690)          4,033 / 4,261 (+5.7%)
KTH       elastic      5,811    32,457 / 25,632 (-21.0%), (-7,347, -6,302)        22,137 / 13,673 (-38.2%), (-9,010, -7,918)        10,320 / 11,959 (+15.9%)
KTH       non-elastic  22,392   11,523 / 11,565 (+0.4%), (-56, 141)               2,982 / 3,024 (+1.4%), (-56, 141)                 8,541 / —
KTH       all          28,203   15,836 / 14,463 (-8.7%), (-1,509, -1,236)         6,929 / 5,218 (-24.7%), (-1,853, -1,568)          8,907 / 9,245 (+3.8%)
All jobs submitted to EJB-sched are candidate jobs. EJB-sched only transforms a
candidate job into an elastic job when the job's original shape cannot be started
immediately. The increase in the number of elastic jobs (e.g., from 16,167 in Table 2.3
to 21,035 in Table 2.5 for CTC) indicates that when we saturate the cluster with
elastic jobs, a greater number of jobs are identified as eligible for elasticity: the mutual
influence between elastic jobs causes jobs that were originally inelastic (because
tw = 0) to become elastic. Nevertheless, the mean turnaround time of elastic jobs is
significantly reduced.
Table 2.5 shows that for all the workloads except KTH, wide use of EJB not only
results in shorter tt for elastic jobs, but surprisingly improves the response time of
non-elastic jobs, and the improvement is statistically significant. For KTH, elastic jobs
are also significantly faster than before. Non-elastic jobs in KTH are on average 0.4%
slower after EJB is applied; however, this degradation is not statistically significant,
since its confidence interval (−56, 141) crosses zero.
The performance results measured by bounded slowdown, shown in Figure 2.11, are
consistent with the turnaround time results in Table 2.5. The maximum slowdown
(too large to be shown in the graph) experienced by the most unlucky job also
decreases. [36] indicates that the mean turnaround time and the mean slowdown are
dominated by long and short jobs, respectively; thus EJB is not biased toward either
type of job. In fact, we observe that large jobs with short tr benefit greatly from EJB.
These jobs previously suffered long waiting times due to the height of their original
shape; EJB enables them to start earlier, hence they complete
in less time.

Figure 2.11: Bounded slowdown: side-by-side view before and after EJB is added,
grouped into elastic, non-elastic, and all jobs (per trace).
To evaluate how EJB promotes fairness, we performed a linear regression of all jobs'
tw over job size (Figure 2.12). We admit that tw does not have a strictly linear
correlation with job size; however, the trend is that larger jobs tend to wait longer.
Indeed, large jobs are known to suffer more than small jobs under scheduling policies
that optimize mean tt or slowdown [37]. Comparing the slopes of the regression lines
before and after EJB is added, the slope of tw under EJB is flatter, indicating less
sensitivity to processor size (i.e., more fairness). We have also measured the total
number of priority inversions, which drops by about 20% when EJB is applied,
providing further evidence of fairness.
Table 2.6 shows the measurement of fragmentation as defined in Section 2.6.1. The
result shows that with EJB, average system utilization is higher when there are jobs
in the queue, indicating that EJB uses idle processors to help queued jobs start sooner.
Finally, in Figure 2.13 we measure EJB performance while synthetically decreasing
or increasing system utilization by changing the jobs' arrival rate. The results show
that when cluster utilization is low, EJB performs similarly to batch scheduling; in
clusters with high utilization, however, EJB performs significantly better.
Figure 2.12: Linear regressions of tw (seconds) over job size (number of processors)
before and after EJB is added; a regression line closer to the x-axis indicates greater
fairness.
Table 2.6: Fragmentation: np is the average number of idle processors; % is the
percentage of idle processors in the cluster.

Trace   before           after            change
        np      %        np      %
CTC     33.7    10.0%    22.4    6.6%     -34.0%
SDSC    13.8    10.8%    5.9     4.6%     -57.4%
BLUE    129.5   11.2%    53.8    4.7%     -58.0%
KTH     16.0    16.0%    6.6     6.6%     -58.8%
2.7 Discussion
We have shown that EJB reduces large job turnaround time with minimal impact on
small jobs. We attribute this interesting phenomenon to EJB, in effect, homogenizing
the system workload by decomposing large jobs into smaller ones. Compared to larger
jobs, smaller jobs allow schedulers to allocate resources more quickly and improve load
balance [38]. Ultimately the performance improvement comes from reduced system
fragmentation: when workload is high, EJB lowers the average size of jobs; when
workload is low, EJB generates additional jobs to exploit idle resources.
Figure 2.13: Changing utilization: EJB is more resistant to high utilization (mean tt
vs. utilization, with and without EJB, per trace).

Another point worth discussing is: on an EJB-ready HPC cluster, when should EJB
be activated? Our view is that EJB can be dynamically switched on/off according to
system workload. Users can be given the option of specifying whether they would like
to pay slightly more resource quota in return for faster turnaround time. When the
batch queue length exceeds a certain threshold, the administrator could decide to
enable EJB to reduce wait time.
2.8 Related Work
Characterized by different patterns of resource usage, parallel jobs are categorized
into three types. Rigid jobs require a fixed number of processors. Moldable jobs can
be executed on several processor sizes; the actual number of processors is determined
at the start and never changes. Malleable jobs may change the number of processors
during execution. Bringing flexibility to parallel jobs to adapt them to system workload
has been extensively studied. The essentials of these studies are twofold: first, the
mechanism that allows a parallel job to use a different number of processors; second,
the scheduling strategy employed, such as moldability or malleability. This section
briefly compares EJB with several representative approaches.
2.8.1 Moldable Jobs
Cirne's work in [12, 13] relies on applications being moldable and job waiting times
being predictable to improve moldable job turnaround time. It chooses the job size
that is expected to produce the shortest tw + tr. The merit of this approach is that
it requires no system changes. Nonetheless, estimating job waiting time can be very
error-prone. Also, many applications are not moldable; for example, some applications
can only be decomposed into restricted degrees of parallelism, such as powers of two.
Moreover, by definition, moldable jobs cannot grow to a larger resource footprint to
gain further speedup even when free resources become available after the job starts
running.
Commercial cluster schedulers like Moab support moldable job requests, in which
the user provides several options for job size and walltime; the scheduler chooses
whichever option can be met first. This is similar to our approach, but the application
must be moldable, and migration to enable expansion of parallelism is not supported.
We evaluated this situation by setting the migration cost to infinity, and the
performance was shown to be inferior to EJB due to the lack of adaptation to
additional resources.
2.8.2 Malleable Jobs
Malleable (or adaptive) jobs have the attractive property that they dynamically adapt
to system workload [39]. ReSHAPE [40, 41] is a framework that supports dynamically
changing the number of processors of iterative, structured (2-D decomposition)
applications, expanding or shrinking the processor count according to the system
workload to select the number of processors that yields the best efficiency. The merit
of this work is the implementation of a library capable of dynamically mapping data
onto different numbers of processors; however, the user needs to insert primitives into
the code to indicate resizing points, whereas our approach does not require application
modification. Tightly-coupled malleable applications are difficult to implement and
require runtime support at the system level. Utrera et al. [14] proposed a job
scheduling strategy based on virtual malleability: processes within the same node can
be over-subscribed to use fewer processors, such that the freed processors can be
allocated to queued jobs. However, their approach is based on the assumption that they can
deploy their own scheduler to control the cluster, while our approach does not require
any change to the system scheduler. Also, the within-node migration approach cannot
expand a running application onto other available physical nodes.
2.9 Conclusion
We have presented elastic job bundling (EJB), a new resource allocation strategy for
large parallel applications. EJB decouples the one-to-one binding between parallel appli-
cations and jobs, such that one application can run simultaneously on multiple smaller
jobs. By transforming one large job into multiple smaller ones, faster turnaround time
is possible, especially on HPC clusters with high workload. We simulated our
algorithm using real-world job traces and showed that EJB can (i) reduce target jobs'
mean turnaround time by up to 48%, (ii) reduce system-wide mean job turnaround
time by up to 27%, and (iii) reduce system fragmentation by up to 59%. We have also
presented an implementation that realizes this approach.
We have made the EJB code available on GitHub [42], so that anyone interested
can obtain the complete algorithm code and reproduce our experimental results.
Chapter 3
Dynamically Negotiating
Capacity Between On-demand
and Batch Clusters
3.1 Introduction
The recent improvements in experimental devices, ranging from light sources to sensor-
based deployments, lead not only to the generation of ever larger data volumes but to the
need to support time-sensitive execution that can be used effectively in the management
of experiments, observations, or other activities requiring quick response turnaround.
This means that the small, dedicated analysis clusters used by many experimental
communities are no longer sufficient, and their users are increasingly looking to expand
their capacity by integrating high performance computing (HPC) resources into their
workflows. This presents a challenge: how can we provide on-demand execution within
HPC clusters, which today are operated mostly in batch mode?
The inspiration for this project was provided by scientists from the Advanced Photon
Source (APS) at the Argonne National Laboratory (ANL). APS is currently operating
a cluster dedicated to experiment support: the execution of jobs run on the cluster has
to be completed in the shortest time possible; thus the need for dedicated resources.
However, as the experiments increasingly require greater processing power, an interest
arose in using HPC resources so long as they can be provisioned on demand in a cost-
effective manner and with environments suitable to APS computations. This conflicts
with the modus operandi of HPC resources today, which are usually available via batch
schedulers maximizing utilization and thus amortization of expensive resources, and do
not provide environment management. In this chapter, we propose a solution to this
use case.
This chapter presents the design and evaluation of the Balancer: a service that dy-
namically moves nodes between an on-demand cluster configured with cloud technology
(in our case OpenStack) and an on-availability cluster configured with a batch sched-
uler (in our case Torque) as the need for on-demand availability changes. The ability
to integrate commodity, generally used technologies was an important requirement of
our design. Another requirement was to make it as non-invasive as possible, i.e., to
not kill or checkpoint running batch jobs as we have done in [43], or rely on specialized
adjustments to scheduling policies of the existing tools. We propose three different al-
gorithms for moving nodes between the on-demand and batch partitions and evaluate
them, first experimentally in the context of real-life traces representing two years of a
specific institutional need, and then via experiments in the context of synthetic traces
that capture generalized characteristics of potential batch and on-demand traces.
Our results, based on a real-life scenario, show first that combining capacities and
workloads of on-demand and batch clusters can provide sufficient capacity to satisfy all
on-demand requests while reducing the dedicated portion of the cluster by 82%, im-
proving the mean batch wait time almost by an order of magnitude (8x), and improving
the overall utilization as well. Second, we show that in the general case we can
non-invasively support bursty on-demand workloads of up to 10% of the capacity of
the cluster they share with the batch workload, achieving higher combined utilization.
In summary, this chapter makes the following contributions:
• We describe an architecture and implementation for dynamic non-invasive resource
reassignment between two systems: a system providing resources on-demand and
a system providing resources based on availability, that balances their respective
objectives in terms of the number of satisfied on-demand requests and utilization.
• We propose three algorithms for balancing resources in this context: the Basic
algorithm providing a baseline of our systems, the Hint algorithm that models the
behavior where experimental users can register the upcoming need for on-demand
cycles, and the Predictive algorithm for cases where such advance notice is not
possible.
• We evaluate these algorithms for different Balancer behaviors, to understand what
workloads we can successfully balance under this system, using two years of traces
from an experimental and a mid-scale cluster at Argonne. We show that we can not
only support the existing use case with dedicated resources significantly reduced,
but also scale the bursty on-demand workload to up to 10% of the capacity of the
cluster it shares with the batch workload in a non-invasive way.
3.2 Approach
The inspiration for this project was provided by scientists from the Advanced Photon
Source (APS) at the Argonne National Laboratory (ANL). APS provides a facility for
experiments in many scientific domains. To support them, it operates a cluster dedicated
to experimental analytics; the execution of jobs on this cluster is typically critical and
has to be completed in the shortest time possible, and thus jobs typically run on
demand. Since the cluster is only used periodically, it is not well utilized. Recent
experiments demonstrated
great utility of using HPC resources on demand; however, those HPC resources are
typically managed by batch schedulers that ensure good utilization required to amortize
the cost of expensive resources but do not ensure on-demand access in a cost-effective
manner.
Infrastructure-as-a-Service cloud technologies, such as OpenStack, have been a pop-
ular solution for on-demand access as they also provide environment management via the
deployment of virtual machines (VMs) or containers. We propose to use those existing
cloud technologies and provide a system that will combine them with HPC schedulers
in a non-invasive way, by arbitrating resource assignment between them. Specifically,
the system will meet the following objectives:
• Inject on-demand and environment management for the on-demand resources into
batch clusters such that we can schedule as many on-demand requests as possible
with as little impact on utilization as possible (i.e., maximizing utilization while
minimizing the number of rejected on-demand requests).
• Provide a solution in terms of existing commodity frameworks for both on-demand
and batch, such as OpenStack or Torque, such that the user’s interface to those
systems does not change and the changes to the systems themselves are minimal
though flexible.
• The solution should be minimally invasive in terms of interference with the normal
operation of the batch scheduler. We will not e.g., kill or checkpoint/snapshot jobs
in order to make room for on-demand requests as we have done in [43] or rely on
the availability of specialized queues with smaller sized jobs that can be used for
backfilling [7].
3.2.1 Leases
To explain our approach, we will use the concept of a lease, defined as a temporary
ownership of resources, taking place between a well-defined start time and end time.
In this chapter, we will differentiate leases based on their start time; the end time may
be bounded (e.g., assumed or specified to last a specific amount of time) or unbounded
(used until terminated by an event).
We define two types of leases, one reflecting the concern of users who are interested
in controlling the start time of their computations, and the other reflecting the concern
of the providers, who are interested in optimizing the utilization of their resources:
• An on-demand lease starts within a window of time W after the request has
been made and may or may not have a defined end time. Time W is typically
understood to be short, e.g., under a minute or two, and may comprise actions
such as virtual machine deployment and boot. This startup time can be arbitrary
and can include some system management, e.g., terminating jobs in order to make
room for a lease. On-demand requests are the most common type of request in
compute clouds and are implemented by all major cloud providers.
• An on-availability lease starts whenever the provider makes resources available for
the lease. Examples of on-availability leases include resource assignments given
out by a batch scheduler, high throughput leases implemented by systems such
as SETI@home [44], or spot pricing leases implemented by Amazon EC2 [45].
Since the lease may not (and generally does not) start immediately, the request is
typically placed on a queue; the provider selects it for resource allocation based
on a variety of concerns that generally favor increasing utilization but may also
take other factors into account (e.g., EC2 spot pricing).
3.2.2 Architecture
Our approach is to soft-partition nodes in a large cluster into two scheduling pools, an
on-demand pool and an on-availability pool, and to implement a mechanism that will
dynamically move nodes from one pool to the other to maximize our objectives.
In keeping with our assumptions, both resource managers (on-demand and on-
availability) are independent of each other; nodes in the on-demand pool are managed
by an on-demand resource manager (ODRM) while nodes in the batch pool are managed
by an on-availability resource manager (OARM). The users of each resource manager
use their respective interfaces to request resources; they are not affected by the presence
of the other resource manager, except as by having some requests rejected or delayed
due to changes in resource availability.
Our architecture (Figure 3.1) consists of a service, called the Balancer, which nego-
tiates adjustments in the respective sizes of the on-demand and on-availability resource
pools with ODRM and OARM. Implementing the Balancer as a separate service, dis-
tinct from both resource managers, allows us to implement bilateral negotiation and
also to implement the system with minimal changes to both resource managers. The
Balancer understands the status of each node in the whole resource, as well as whether
at any given moment they belong to the ODRM or OARM pool. However, the Balancer
only manages which nodes belong to which pool; the scheduling decisions are left to
ODRM and OARM. The boundaries between the pools are re-evaluated by the Bal-
ancer on an ongoing basis, negotiating with each scheduler for the availability of nodes
in the respective clusters.
The on-demand pool contains a group of nodes called the reserve R, which may be
set to zero. The reserve represents nodes that cannot be moved to the batch pool and
is intended to ensure that the system has up-front capacity to schedule resources when
on-demand requests come in.

Figure 3.1: High-Level Architecture

Otherwise, the division between the on-demand and on-
availability parts of the cluster is fluid and constantly re-evaluated. Another parameter
of the system is the time window W , which defines how long the system can wait before
scheduling an on-demand request.
In the context of this chapter, the Balancer implements a simple one-way negotia-
tion which requests nodes from the OARM as needed; nodes from the ODRM are only
contributed by the resource manager itself. Under the current assumptions, execution
on the nodes in the on-availability pool has to be finished before they are contributed to
the Balancer; this means that the Balancer’s request for nodes from OARM to be con-
tributed in the allotted time may be unsuccessful. Ultimately, this negotiation protocol
can be extended to implement more complex constraints.
The interaction with OARM and ODRM takes place via the following interfaces.
Balancer Interface:
• request nodes(int n): request n additional nodes from the Balancer (the decision
of which specific nodes from the on-availability pool should be made available to
the ODRM is made by the Balancer)
• release nodes(node list): release specific nodes to the Balancer
• update nodes(node states): attempts to update status of specific nodes. Can
return an error if the status update is incorrect. This interface is used by OARM
when: (1) it attempts to run a job on a node it believes is in the OARM pool
and calls this interface to avoid a race condition with Balancer reclaiming this
node for ODRM at the same time; and (2) it finishes executing a job, giving the
opportunity for the Balancer to reclaim it if needed.
OARM Interface:
• reclaim node(nodename): reclaim a specific node, identified by its hostname,
from the OARM pool to the Balancer
• restore node(nodename): restore a specific node, identified by its hostname, from
Balancer to OARM
• get status of all nodes(): returns a list of data structures that for each node
describes if it is busy executing a job, free, or offline, and if it is executing a job
what is the remaining wall time.
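For illustration, these two interfaces could be captured in the following Python sketch;
the signatures mirror the lists above, while the concrete types are assumptions.

    # Sketch of the negotiation interfaces between the Balancer and OARM.
    from abc import ABC, abstractmethod

    class BalancerInterface(ABC):
        @abstractmethod
        def request_nodes(self, n: int) -> list:
            """Request n additional nodes; the Balancer picks which ones."""

        @abstractmethod
        def release_nodes(self, node_list: list) -> None:
            """Return specific nodes to the Balancer."""

        @abstractmethod
        def update_nodes(self, node_states: dict) -> None:
            """Report node status changes; may signal an invalid update."""

    class OARMInterface(ABC):
        @abstractmethod
        def reclaim_node(self, nodename: str) -> None:
            """Take the named node out of the OARM pool for on-demand use."""

        @abstractmethod
        def restore_node(self, nodename: str) -> None:
            """Return the named node to the OARM pool."""

        @abstractmethod
        def get_status_of_all_nodes(self) -> list:
            """Per-node status (busy/free/offline, remaining wall time)."""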
3.2.3 Algorithms
Basic algorithm
The objective of the Basic algorithm is to implement a simple mechanism whereby
the Balancer requests nodes from the OARM as on-demand requests come in and uses
reserve as well as wait time to “pad” availability. Algorithm 1 shows the pseudo-code.
At any point in time, a node is in one of four states: OD_Reserve, OD_Alloc,
OA_Idle, or OA_Busy.
R is the number of nodes that are statically reserved for the ODRM pool. When
an on-demand request comes in, the Balancer allocates nodes from: (1) OD_Reserve
nodes, (2) OA_Idle nodes, and (3) OA_Busy nodes whose jobs finish before time W.
A request is rejected if the Balancer cannot allocate n nodes before time W, or
immediately when W = 0.
Input: R (default = 0), W (default = 0)

Function request_nodes(n):
    nr ← number of nodes currently in OD_Reserve state
    ni ← number of nodes in OA_Idle state
    if nr ≥ n then
        allocate n OD_Reserve nodes
        change node state to OD_Alloc
        return node_list
    else if nr + ni ≥ n then
        reclaim_nodes(n − nr)
        change node state to OD_Alloc
        return node_list
    else if W = 0 then
        return Rejection
    else
        reclaim_nodes(ni)
        wait for W seconds
        foreach received update_nodes message do
            reclaim_nodes(1)
            if n nodes can be allocated then
                change node state to OD_Alloc
                return node_list
        if W expires then
            return Rejection
            reclaimed nodes are kept in OD_Reserve state for I seconds
            before release to the OARM pool

Function release_nodes(node_list):
    foreach node in node_list do
        change node state OD_Alloc → OD_Reserve
        if node is not statically reserved then
            keep node in OD_Reserve state for I seconds before release
            to the OARM pool

Algorithm 1: Basic algorithm
Hint algorithm
The Hint algorithm is a refinement of our original attempt at the Basic algorithm
and reflects the fact that in experimental communities it is often possible to determine
resource need within a short time (15-30 minutes), though it may not always be possible
to pinpoint it to a particular time days in advance. This allows us to implement a
dynamic reserve (i.e., a reserve that changes according to the situation).
Both functions request nodes and release nodes stay the same as in the Basic
algorithm. The Balancer introduces another interface, add reserve(H, N), directing
the Balancer to add N extra reserve nodes before time H. We essentially parameterize
the Hint algorithm by two parameters: the time H of the "advance notice" or "hint",
and the number of nodes N requested by the user or a third party. Note that no nodes
are statically reserved in the Hint algorithm; any node that has been in the
OD_Reserve state for more than I seconds is released to the OARM pool.
Predictive algorithm
The Predictive algorithm is another refinement of the Basic algorithm that also
implements the notion of a dynamic reserve, but for situations where advance notice is
not possible. The Predictive algorithm is run by an out-of-band predictor that invokes
the add reserve interface on behalf of the users. The predictor collects historical data
from the Balancer and uses this history to predict future on-demand requests.
Our Predictive algorithm is based on three observations about the arrival times of
on-demand requests. Firstly, the arrival times follow a strong diurnal pattern, which
can be explained by the interactive nature of the APS workload. Secondly, the arrival
times show moderate correlation between adjacent weeks: if there is a burst of
requests during several hours of one week, a similar burst is likely during the same
hours of the next week. Thirdly, there are sometimes bursts of requests during the
same hours of consecutive days. Our Predictive algorithm is described as follows:
The predictor divides each day into four 6-hour slots. At the end of each time slot,
the predictor queries the Balancer for how many nodes were requested during the time
slot. At the beginning of each time slot, the predictor invokes add reserve(0, N) to
reserve N nodes, where N is estimated from the peak number of nodes requested
during the same time slot of the last month, week, and day.
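One plausible reading of this estimate, as a sketch (the history accessor is a
hypothetical helper):

    # Sketch: reserve estimate for an upcoming 6-hour slot as the peak
    # number of nodes requested in the same slot of the previous month,
    # week, and day.
    def estimate_reserve(history, slot):
        return max(history.peak_requested(slot, days_ago=d) for d in (30, 7, 1))

The predictor would then call add reserve(0, estimate_reserve(history, slot)) at the
beginning of each slot.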
3.2.4 Implementation
Our implementation of the Balancer is configured to work with the Torque resource
manager [46] and the Maui cluster scheduler [47], used as OARM, and OpenStack (with
the KVM hypervisor [48]) as ODRM. It consists of a simple web service developed using
the Flask Python micro web framework, separately from either OpenStack or Torque. It
offers an HTTP endpoint capable of receiving resource requests as well as notifications
of resource status changes. To move nodes between the on-demand and on-availability
pools, the Balancer enables or disables them in Torque using the pbsnodes command
with arguments -o to disable or -c to enable.
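In Python, this amounts to shelling out to pbsnodes, e.g.:

    # Sketch: moving a node between pools by toggling it in Torque.
    import subprocess

    def disable_in_torque(node):
        # Mark the node offline so Torque stops scheduling onto it,
        # effectively lending it to the on-demand pool.
        subprocess.run(["pbsnodes", "-o", node], check=True)

    def enable_in_torque(node):
        # Clear the offline flag, returning the node to the batch pool.
        subprocess.run(["pbsnodes", "-c", node], check=True)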
The update nodes interface is implemented for Torque nodes as prologue and
epilogue scripts, which are triggered when a job starts and ends execution, respectively
(whether successfully or not). These notifiers make HTTP requests to the Balancer in
order to update its record of resource status, i.e., of nodes available for stealing. No
other changes were required to integrate Torque into the system.
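A minimal sketch of such an epilogue notifier follows; the Balancer URL and payload
shape are assumptions (Torque passes the job id as the script's first argument):

    #!/usr/bin/env python
    # Sketch of a Torque epilogue notifier: report job completion to the
    # Balancer so this node becomes a candidate for reclaiming.
    import socket
    import sys

    import requests

    BALANCER_URL = "http://balancer.example:5000/update_nodes"  # hypothetical

    if __name__ == "__main__":
        job_id = sys.argv[1]
        requests.post(BALANCER_URL, json={"node": socket.gethostname(),
                                          "job_id": job_id,
                                          "state": "job_finished"})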
In order to make OpenStack work with the Balancer, we made small modifications
to the OpenStack implementation: the scheduler (Nova) requests more resources from
the Balancer if it does not have enough available for scheduling the virtual machines
requested by on-demand users (using request nodes), and resources are released to the
Balancer when instances are terminated (using release nodes). We also had to fix
concurrency issues in the scheduler when using large wait times, which can make many
independent resource requests block and then resume execution at the same time.
3.3 Experimental Evaluation
We conduct our experimental evaluation in two stages. We first evaluate our approach
using the Basic algorithm in the context of a real-life scenario defined by two years
worth of traces reflecting the needs of on-demand and batch jobs at the Argonne National
Laboratory; this gives us insight into realistic demand and submission patterns. Second,
we use synthetic traces to generalize the problem and evaluate and compare the three
algorithms we formulated.
Our overall experimental methodology consisted of emulating the actual runs by
submitting traces of on-demand and batch requests to OpenStack and Torque configu-
rations respectively, on a cluster managed by the Balancer. The OpenStack submissions
use a mechanism called FakeDriver which, instead of launching a real VM, generates
the suitable internal events that track resource consumption. The Torque submissions
use a "sleep" script that runs for the duration of the job walltime.
3.3.1 Evaluating a Real-Life Scenario
To evaluate our approach we first ask the question: how would it fare under existing
shared on-demand/on-availability workloads in real-life computational centers? To an-
swer this question, we combined both workloads and resources of two systems used at
the Argonne National Laboratory (ANL). The on-demand side is represented by work-
loads ran on a small cluster in the Advanced Photon Source (APS) used for analytics
supporting real-time experiments; hence the need for immediate execution. The batch
side is represented by a general purpose mid-scale batch cluster in Laboratory Comput-
ing Resource Center (LCRC). Given this context, a more specific version of our question
is: if we combined both the on-demand/on-availability workloads and the resources cur-
rently executing those workloads under our approach, what advantages or disadvantages
would we observe?
To create a combined APS/LCRC workload we combined two years worth of job
execution traces from APS and LCRC (between 2013-10-06 and 2015-09-05). We first
mapped the job execution trace from APS onto on-demand VM deployment requests in
OpenStack as follows. At any APS job start/stop event, we evaluated how many one-
core jobs should be running and how many 16-core VMs would be needed to support
them, assuming that jobs would be tightly packed. If more or fewer VMs would be
needed as a result of a job start/stop request, an on-demand VM deployment request or
termination event would be generated. We then combined the APS on-demand trace and
the LCRC batch trace using the same start time for both, such that the VM deployment
requests are submitted to OpenStack and batch job requests are submitted to Torque.
To create a combined APS/LCRC cluster we proceeded as follows. The LCRC
cluster comprises 304 homogeneous 16-core nodes. The APS cluster consists of 57
heterogeneous multi-core nodes amounting to a total of 1092 cores. Since cores represent
the main scheduling concern in our experiment, we modeled APS capacity as 68 16-
core nodes (a total of 1088 cores, close to the actual 1092 capacity). The combined
APS/LCRC cluster is thus modeled as 304 + 68 = 372 16-core nodes.
We now set out to replay the combined APS/LCRC workload on a model of the
APS/LCRC combined cluster. Since we could not replay two years worth of traces in
real-time, we scaled down the experiment in space and time. To scale it in space, we
created an experimental environment that mapped each of the 372 combined cluster
nodes onto a Docker container, each with a unique hostname and IP address, connected
by an overlay network. An additional container represented the controller node. We
deployed the Docker containers on the Chameleon testbed [49] version 53, using the 24-
core 128 GiB RAM Xeon Haswell compute nodes, such that 24 containers were mapped
to each node. To scale the experiment in time, we mapped hours to minutes (i.e.,
accelerated 60x). Finally, we eliminated the ramp-up effect by preloading the cluster
with running jobs.
This still left us with a potentially very long experiment, so instead of replaying
two years worth of traces we focused on one week that would represent the greatest
challenge to our system. In the case of the batch trace, we defined "challenging" as low
average node availability across 60-second periods, measured every second. In the case
of the on-demand trace, we defined "challenging" as high total resource usage coming
from on-demand requests, calculated as a sum over the product of the time used by a
job and the number of cores on which the job was running. We picked the week with
the highest sum of usage and inverse of availability.
We now ran the experiments using traces from the most challenging week reflecting
the modifications above, such that the modified APS trace was submitted to OpenStack
and the LCRC trace was submitted to Torque. We measured the following qualities:
• Average utilization, defined as usage over time
• Mean batch wait time, defined as the time between when the job is submitted and
when it starts running
• Number of on-demand rejections, or reject rate, calculated as the ratio between
number of rejections and number of requests
Table 3.1 summarizes the results of this experiment in both static and dynamic
configurations. The shaded column in the static section reflects the existing scenario in
which the APS and LCRC clusters are separate: the LCRC cluster has 1002.8 minutes
mean batch wait time and the APS cluster has no rejections. A hypothetical scenario
where 100% of the combined resources are devoted to batch workload shows that the
lower bound of batch wait time for this trace is 122.5 min.
In the dynamic section of the table, we see the results of seven scenarios reflecting
different combinations of parameters R and W . We notice that the utilization of the
combined cluster improves by 4.8 to 5.6% across all dynamic scenarios, with mean
batch wait time decreasing by 85 to 88%; this is due to the fact that we can now utilize
the previously idle nodes of the dedicated on-demand cluster. However, there are 30
on-demand rejections when we choose R = 0 and W = 0; we can decrease them by
increasing either one of the parameters or both. From a practical perspective, the most
interesting observation is that the challenging week yielded no rejections for R = 12
nodes which corresponds to roughly 18% of the on-demand cluster: this means that
under the Basic algorithm, we could reduce our investment in hardware for the on-
demand cluster by 82% and still have all on-demand requests satisfied. Further, the
mean batch wait time under this scenario is almost the same as the lower bound for the
combined cluster established in the static column; this brings significant benefits to the
batch side as well.
Another scenario with no rejections occurs for R = 6 and W = 10; this means that
an on-demand request would execute within 10 minutes which is a relatively long wait
time in the context of this use case. This has been deemed not useful for our problem
formulation and thus we don’t explore on-demand wait time further in our experiments.
Table 3.1: Experimental results for the most challenging week: 24,177 batch jobs and
141 on-demand requests are submitted in each experiment. Wait times are measured
in minutes and reserve values are given in nodes. For the dynamic case, the on-demand
and batch utilization refer to the portion of utilization coming from on-demand and
batch requests, respectively.

                             Static (Baseline)       Dynamic
Dedicated batch nodes        372     304     0       --     --     --     --     --     --     --
Dedicated on-demand nodes    0       68      372     --     --     --     --     --     --     --
W (wait window, min)         --      --      --      0      5      10     0      5      10     0
R (reserve, nodes)           --      --      --      0      0      0      6      6      6      12
Combined utilization         84.4%   80.1%   1.25%   84.9%  85.7%  85.7%  85.3%  85.3%  85.3%  85.3%
Batch utilization            84.4%   78.8%   NA      84.5%  84.4%  84.4%  84.0%  84.1%  84.0%  84.0%
On-demand utilization        NA      1.25%   1.25%   0.38%  1.25%  1.25%  1.25%  1.25%  1.25%  1.25%
Batch wait time (min)        122.5   1002.8  NA      122.0  147.0  147.0  150.0  140.6  150.4  130.0
Rejections                   141     0       0       30     3      3      1      1      0      0
3.3.2 Evaluating Balancer Algorithms
We next asked the question: how does our system perform in a generalized scenario?
What would happen if the on-demand or on-availability workloads were different—
larger or composed of a different mix of applications than in our real-life scenario?
In general, we sought to discover the relationship between the cluster capacity, the
type of workload, and configuration parameters or Balancer algorithms we would need
to employ to accommodate the on-demand workload while running the on-availability
workload undisturbed. We answered these questions by generating synthetic traces
representing both on-demand and batch workloads and running experiments with those
traces. In order to preserve continuity with our real-life experiments, we continue to
use the 372-node cluster as a base and each experiment represents one week.
Generating Synthetic Batch Workloads
We create five synthetic batch workloads as follows:
The Mainstream workload represents the “mainstream” workload condition in
the LCRC cluster. The workload is derived by randomly sampling 1% of all the jobs in
the LCRC traces. We retain the node number, walltime, and runtime of each job. Each
job’s submission time is calculated as the time offset from the beginning of the week
which the job is selected from, so that they add up to one week’s worth of submissions.
Since the Mainstream workload has a lower utilization than the real-life workload
described in Section 3.3.1 (66.5% versus 78.8%), we also generated workloads with
higher utilizations of 77% and 88%; we name the original U66-Main (U66 for short)
and the others U77-Main and U88-Main, respectively. The higher-utilization workloads
are generated by injecting additional jobs into the U66-Main workload.
The Wide workload (U66-Wide) is designed to model a workload composed of rel-
atively large parallel jobs. We derive the Wide workload directly from the mainstream
workload (U66-Main) by doubling the number of nodes of each job and randomly remov-
ing approximately half of the jobs to maintain close to the same aggregate utilization.
The Narrow workload (U66-Narrow) is designed to represent a workload com-
posed of small parallel jobs. To generate it, we split each job from the mainstream
workload (U66-Main) into two smaller jobs, each requiring half the number of nodes
(thus the utilization stays the same).
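The derivations of the Wide and Narrow workloads from U66-Main can be sketched
as follows (representing jobs as dicts with a 'nodes' field is an assumption; timing
fields are carried over unchanged):

    # Sketch: deriving the Wide and Narrow workloads from U66-Main.
    import random

    def widen(jobs):
        # Double each job's width and keep roughly half the jobs,
        # leaving aggregate utilization approximately unchanged.
        return [dict(j, nodes=j["nodes"] * 2) for j in jobs
                if random.random() < 0.5]

    def narrow(jobs):
        # Split each job into two half-width jobs (utilization unchanged).
        halves = []
        for j in jobs:
            half = max(1, j["nodes"] // 2)
            halves += [dict(j, nodes=half), dict(j, nodes=half)]
        return halves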
Generating Synthetic On-demand Workloads
Similar to the synthetic batch workloads, we create synthetic on-demand workloads by
abstracting workload patterns from real-life traces. In particular, we seek to preserve
their burstiness, corresponding to periods when an APS experiment occurs and causes
demand for time-sensitive computation. We thus reuse the VM leases' submission
times and durations from the challenging week. Since the utilization (denoted by ρ) of
the challenging week's on-demand workload is 1.25%, we achieve higher utilizations by
multiplying the number of nodes in each lease by 2x, 4x, 8x, 16x, and 24x; thus the ρ
of the synthetic workloads equals 2.5%, 5%, 10%, 20%, and 30%, respectively. These
synthetic workloads preserve the burstiness of the real-life trace while exerting much
higher pressure on the Balancer. For example, when ρ = 30%, the peak arrival rate of
on-demand requests is 264 requests per minute.
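The scaling itself is a one-liner over the lease trace, sketched below (lease records as
dicts is an assumption):

    # Sketch: scale the on-demand trace's utilization by multiplying each
    # lease's node count; submission times and durations (and hence
    # burstiness) are preserved.
    def scale_leases(leases, k):
        return [dict(lease, nodes=lease["nodes"] * k) for lease in leases]

    # k in (2, 4, 8, 16, 24) yields rho of about 2.5%, 5%, 10%, 20%, 30%.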
Result Analysis of the Basic Algorithm
The experiments are similar to experiments in the previous section; we use various com-
binations of traces submitted to OpenStack and Torque respectively, having preloaded
the cluster with running jobs to mitigate the ramp-up effect. Figure 3.2 shows the
performance results of running the five batch workloads with six on-demand workloads
(x-axis) with the Basic algorithm (zero reserve). We skipped combinations for which
the sum of batch and on-demand workloads exceeds the capacity of the cluster.
Figure 3.2 shows that rejection rates are influenced by both the shape of batch jobs
in the trace and their density (i.e., batch utilization). While for the U66-Narrow trace
we don’t see rejections with ρ as high as 10%, this threshold drops to 5% as jobs become
wider, and the rejection rate stays firmly above zero for every ρ value for other traces.
Mean batch wait time follows a similar pattern as both smaller jobs and less utilization
make it easier for batch jobs to be scheduled. Our explanation for how batch job shape
affects performance is that it is easier for the batch scheduler to schedule narrower jobs
than wider ones. Thus jobs finish earlier such that more space will be left open when
on-demand requests arrive.
Figure 3.2: Performance results of the Basic algorithm for five batch workloads and six
on-demand workloads (x-axis: ρ, %), R = 0, W = 0: (a) reject rate (%), (b) batch wait
time (min), (c) combined utilization (%).
The utilization patterns strongly follow the utilization of the batch traces, although
all go up slightly as more on-demand jobs are added. Without any additional
configuration, our approach is thus able to support on-demand workloads demanding less than
10% of cluster capacity, depending on the shape and utilization of batch jobs. To put
this number in perspective, ρ from our real-life scenario was an order of magnitude
lower.
To make the Basic algorithm work for larger on-demand workloads, we need to use
the static reserve. We thus rerun the Basic algorithm with increasing R and observe the
trend of the rejection rate dropping. Note that since performance is determined mainly
by utilization rather than job shape, we use only the mainstream batch workload for
the rest of this chapter.
Figure 3.3 illustrates what happens for ρ = 10%. We see that a relatively small
increase in the value of R (30 or 60 nodes, depending on on-demand trace density) can
decrease the number of rejections by a significant factor. However, to reduce rejections
to or near zero, we need a reserve of 120 nodes, roughly a third of the cluster. The
negative effect of reserving more nodes is that batch job performance becomes significantly
worse (exponentially worse for ρ > 10%). This is reflected in the combined utilization,
which goes down with increased reserve for batch traces requiring higher capacity. A
high static reserve is thus a very expensive solution for accommodating on-demand
workloads higher than 10%; to find a better solution, we turn to the Hint algorithm.
Result analysis of the Hint algorithm
To run experiments with the Hint algorithm, we used a program that simulates a user
notification to the Balancer in advance of actual request arrival. We used two values
for this advance notice: 15 minutes and 30 minutes (H15 and H30, respectively).
Recall that our traces follow a real-time experimental pattern in which a user would be
able to make such a notification.
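We do not reproduce the Balancer's actual notification interface here; a minimal sketch
of what such a hint amounts to, with hypothetical names, is:

    import time

    def send_hint(balancer, nodes_needed, lead_time_min):
        # Hypothetical call: tell the Balancer that `nodes_needed` nodes
        # will be requested roughly `lead_time_min` minutes from now, so it
        # can start assembling a dynamic reserve of that size.
        balancer.register_hint(nodes=nodes_needed,
                               expected_arrival=time.time() + lead_time_min * 60)

    # H30 scenario: notify 30 minutes ahead of a 40-node on-demand burst.
    # send_hint(balancer, nodes_needed=40, lead_time_min=30)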
Figure 3.4 shows the rejection rates for the Hint algorithm. With ρ = 10% and given
a 30-minute hint, we get zero rejection rates for U66 and U77 and near zero (< 1%)
rejection rate for U88. With a slightly shorter advance notice of 15 minutes, we get
near zero (< 1%) rejection rate for U66 and low rejection rates (less than 4%) for U77
and U88. In comparison, the Basic algorithm evaluated in the previous section needed
[Figure 3.3 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: R (reserve nodes); series: U88, U77, U66.]
Figure 3.3: Performance results of the Basic algorithm with static reserve, three batch
workloads, ρ = 10% on-demand workloads.
a static reserve of 120 nodes, i.e., almost a third of the cluster, to achieve the same
rejection rate. A relatively accurate but short-term estimate of resource need can thus
be used to activate a dynamic reserve that is effectively equal to a static reserve of
120 nodes. Since a static reserve typically means purchasing and operating a cluster set
aside for on-demand experimental support, this observation has significant potential for
creating on-demand capacity.
At the same time, the impact on the batch workload is much lower: for U66, mean
batch wait time at R = 120 is over 9 hours, while the same measurement for a hint of
30 minutes is merely 50 minutes. This is because the dynamic reserve implemented by
the Hint algorithm acquires the nodes only when they are known to be needed and for
as long as they are needed. To understand how effective it is, we looked at how much
time nodes spent in the reserved state without being used. With the Basic algorithm at
R = 120, this time is 837,546 minutes, whereas in the H30 case it is only 34,695 minutes,
a reduction of 96%. This significantly increases the flexibility of the system, as nodes
are free to be allocated to the most pressing tasks.
Another benefit of the Hint algorithm is that it improves combined utilization; most
importantly, the combined utilization goes up rather than down, as in the case of a high
reserve. In particular, batch utilization stays approximately the same, meaning that
adding the on-demand workload did not hurt batch performance overall. The increased
utilization comes from more on-demand workload being scheduled, e.g., the biggest boost
occurs at (ρ = 30%, H30), where on-demand utilization increases by 5.1% compared to
(U66, ρ = 30%, Basic algorithm). This is also the first time the combined utilization goes
above 85% (for U66 and ρ = 30%), demonstrating that we can indeed better reconcile
the concerns of on-demand and on-availability workloads.
Result analysis of the Predictive algorithm
Sometimes, it is impossible to get a reliable estimate of an incoming bursty workload.
In those situations, we apply a heuristic algorithm as described in section 3.2.3 to
predictively adjust the reserve. To evaluate our algorithm, we first ran the predictor
offline using historical data and then ran live experiments using the predictor's output.
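Section 3.2.3 gives the actual heuristic; as a rough, illustrative stand-in, a reserve sized
from a moving average of recently observed on-demand demand captures the flavor of
such a predictor (all names below are ours):

    from collections import deque

    class MovingAverageReserve:
        # Illustrative stand-in for the Predictive algorithm: size the
        # dynamic reserve from a moving average of recent on-demand demand,
        # inflated by a headroom factor to absorb bursts.
        def __init__(self, window=12, headroom=1.5):
            self.samples = deque(maxlen=window)   # e.g. 12 five-minute bins
            self.headroom = headroom              # over-reserve factor

        def observe(self, nodes_requested):
            self.samples.append(nodes_requested)

        def reserve_size(self):
            if not self.samples:
                return 0
            avg = sum(self.samples) / len(self.samples)
            return int(avg * self.headroom)

The headroom factor is exactly the kind of over-reservation discussed below: it lowers
the reject rate at the cost of holding more nodes idle.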
Figure 3.5 shows that the Predictive algorithm performs as well as the Hint algorithm
in terms of reject rates. When the on-demand workload is low (<5%), batch wait
[Figure 3.4 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: ρ (%); series: U88, U77, and U66, each with H30 and H15.]
Figure 3.4: Performance results of the Hint algorithm, H = 15 min or H = 30 min,
three batch workloads, six on-demand workloads.
[Figure 3.5 appears here. Panels: (a) Reject Rate, (b) Batch Wait Time, (c) Combined
Utilization; x-axis: ρ (%); series: U88, U77, U66 under the Predictive algorithm.]
Figure 3.5: Performance results of the Predictive algorithm, three batch workloads, six
on-demand workloads.
time achieved by the Predictive algorithm is comparable to that of the Hint algorithm.
However, when the on-demand workload becomes higher, batch wait time is 1-4 times
longer than in the Hint algorithm (H = 30 min). This can be explained by the fact that,
without additional information, the predictor can only approximately estimate on-demand
request arrival times. To lower the reject rate, our Predictive algorithm over-reserves
nodes, indicating that it is significantly less efficient at estimating when the nodes will
be needed.
Figure 3.5 also shows that the differences in utilization compared to the Hint al-
gorithm are relatively small for low ρ and are driven primarily by batch utilization
and over-reserved nodes. Thus, with larger ρ and consequently more time spent in
reserve, the increase in on-demand utilization is not big enough to offset the drop in
batch utilization. Unless there is a predictor that can accurately predict user behavior
and make precise estimates of on-demand request arrival, the Predictive algorithm does
not perform as well as the Hint algorithm in balancing batch and on-demand performance,
even though it performs better than the Basic algorithm.
3.3.3 Elasticity Analysis
We extend our evaluation by modeling our Balancer approach under the elastic schedul-
ing framework. The Balancer algorithms (Basic + reserve, Hint, Predictive) aim at pro-
visioning resources in a batch-dominant cluster to match the demand of on-demand
workloads. To capture the precision of elasticity, we propose the following definitions
and metrics:
• ∑O is the accumulated amount of overprovisioned resources, e.g., reserved but
unused
• C is the number of resources in the entire system
• o = ∑O / C is the percentage of resources that have been overprovisioned (for the
on-demand workload) in the system
• F is the number of on-demand resource allocation requests that have been rejected
• f is the percentage of on-demand requests that have been rejected
Figure 3.6: Measurements of elasticity based on the U77 and ρ = 20% workloads.
The elasticity of a Balancer algorithm can be evaluated by the two metrics {o, f}. The
goal is to minimize both metrics, such that an algorithm satisfies all on-demand requests
without over-reserving resources. Out of the available workload combinations, we use one
combination (U77 batch + ρ = 20% on-demand) to demonstrate how our algorithms
achieve elasticity. The results are shown in Figure 3.6. First, the baseline algorithm
(static partitioning) results in f = 0, but at the cost of over-reserving resources at more
than 11% of the cluster's capacity. Then, for the Basic algorithm, starting from R = 0
(the point depicted at the upper-left corner of the graph), increasing R gradually reduces
f at the cost of increasing o. In essence, R can be seen as the knob we use to tune the
trade-off between f and o. Finally, the Hint algorithm performs nearly as well as the
optimal algorithm, with both f and o close to zero, which represents the best result the
Balancer approach can achieve.
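Computing the two metrics from experiment logs is straightforward; a minimal sketch,
assuming ∑O and C are accumulated in the same units (e.g., node-minutes over the
experiment), is:

    def elasticity_metrics(sum_o, capacity, rejected, submitted):
        # o: fraction of the system's resources held in reserve but unused
        #    (sum_o and capacity in the same units, e.g. node-minutes)
        # f: fraction of on-demand requests that were rejected
        o = sum_o / capacity
        f = rejected / submitted
        return o, f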
3.4 Related Work
Several research groups have been exploring the suitability of the cloud environment
for HPC applications in terms of virtualizing HPC execution (e.g. Palacios [50]), en-
abling a cloud interface for grid computing (e.g. Globus [51], Magellan [52]), and using
on-demand/on-availability leases [53, 54]. Other work has focused on combining HPC
and cloud systems in a hybrid environment to enable cloud bursting of HPC workload
from HPC clusters to the public cloud to meet deadline constraints of HPC applica-
tions [55, 56, 57]. To reduce the cost of using public clouds, a number of groups have
proposed using cheaper yet unreliable spot instances for executing HPC applications
[58, 59, 60, 61, 62, 63]. Spot instances suffer from volatility due to
price fluctuation and some work has proposed prediction methods to calculate statistical
availability guarantees [63]. Our work differs from the hybrid cloud paradigm in two
ways. First, it bursts on-demand applications to HPC clusters in a controlled fashion
to meet the requirements of both on-demand and HPC batch applications. Second,
unpredictable start times are avoided by reclaiming resources from the on-availability
HPC cluster to convert them to on-demand resources.
In the realm of executing mixed workloads, cluster and data center operators have
configured resource schedulers to improve facility utilization through co-scheduling of
multiple batch and latency-critical workloads. In the HPC community, prior efforts such
as Marshall et al. [43] target improving utilization of private IaaS clouds by opportunisti-
cally backfilling VMs on idle nodes which are not in use by on-demand requests, allowing
HTC workloads to run on backfilled VMs. The backfilled VMs do not support start time
constraints and are preemptible, unlike our work. SpeQuloS [64] explores providing QoS
for executing Bag-of-Tasks applications on opportunistic grids or cloud spot instances.
More recent works [65, 66] aim to improve the value of reclaimed cloud capacity by
providing Service Level Objectives (SLOs) or guarantees for their use. TR-spark [67]
proposed a big-data analysis framework customized to exploit transient cloud servers
for Spark applications. Other systems have addressed the problem of reconciling con-
flicts incurred at finer-grained resource sharing (e.g., Heracles [68] and Morpheus [69]).
Elastic schedulers such as [70] enable slots to be shared across applications to meet
their SLOs and improve utilization. In this environment, slots can be taken away from
loosely-coupled applications at run-time (e.g. Hadoop), but this is not applicable to
the HPC environment. Through combining batch and on-demand requests, the Bal-
ancer also achieves utilization improvements but within the operating constraints of an
HPC environment where batch applications are first-class citizens. The Balancer does
not preempt running batch jobs to enable on-demand job execution. Additionally, due
to the node-exclusive requirement of most HPC applications (unlike in a data center
environment), the Balancer does not consider node-level sharing between batch and
on-demand workloads, thus performance conflict is not a major concern in our work.
Finally, our work differs from prior work in that it enables resources to be dynami-
cally shared between batch and on-demand schedulers. Thus, it is not another cluster
scheduler that manages only the flow of jobs, but it also manages the flow of resources
from one class of service to another. Mesos [71] is most similar to our work. Mesos
adopts a two-level scheduling model in which (1) the master offers available resources
to frameworks (e.g., Torque), and each framework can either accept or reject an offer,
and (2) each framework scheduler schedules its own tasks onto the accepted resources.
The Mesos
master plays a similar role as the Balancer in cross-framework resource allocation. How-
ever, unlike the Balancer, Mesos does not support time-bounded resource allocation nor
performance-aware resource reclamation. Both mechanisms are critical in the Balancer’s
target environment. Similarly, Google’s Borg [72] and open-source Kubernetes enable
the co-scheduling of mixed workloads, but do not adhere to the specific constraints in
our HPC environment, namely, that batch jobs are the main tenant and run exclusively
on allocated nodes. Thus, the Balancer must operate with fewer degrees of freedom than
these general-purpose schedulers. For this reason, we opted to design a new scheduling
system targeted to HPC and on-demand environments.
3.5 Conclusion
We proposed a model reconciling the needs of on-demand and batch workloads within
one system in a non-invasive way, i.e., by operating on cycle stealing rather than disrupt-
ing job execution. The model consists of a lightweight Balancer service that dynamically
arbitrates resource usage between an on-demand and on-availability scheduling frame-
work and can be adapted to existing technologies, such as OpenStack or Kubernetes for
on-demand, or Torque or Slurm for batch.
Based on a real-life scenario representing two years’ worth of on-demand and batch
workloads at Argonne National Laboratory, we demonstrated that by using our model
on existing resources we could reduce the current investment in on-demand infrastruc-
ture by 82%, while at the same time improving the mean batch wait time almost by an
order of magnitude (8x). By exploring how our model behaves under various configura-
tions and workloads, we found that it performs best in scenarios where the on-demand
workload represents less than 10% of the overall capacity (our real-life usage example
needed only 1.25%). When trying to increase this limit, we found that a relatively short
(15 to 30 minutes) advance notice of resource need is as effective as placing a static
reservation on a third of the cluster, which has significant implications for resource
usage and cost. In cases where it is not possible to obtain such advance notice, a sim-
ple prediction algorithm provides a reasonable compromise, yielding near-zero rejection
rates with acceptable resource usage.
Chapter 4
The Bundle Service for Elastic
Resource Scheduling in HPC
Environments
4.1 Introduction
Large-scale scientific projects rely on distributed applications combining the resources of
multiple HPC infrastructures. These applications are designed to draw on the power of
heterogeneous computational, storage, and networking architectures, utilize different
geographical locations, and exploit the varying performance and availability of diverse
platforms. However, these design goals are extremely difficult to fully achieve due to
the complexity of aggregating the different resource scheduling mechanisms and interfaces
of the underlying infrastructures in a way that is transparent to users and application
developers.
Various HPC resources are heterogeneous in their architectures and interfaces, are
optimized for specific applications, and enforce tailored usage and fairness policies. In
conjunction with temporal variation of demand, this introduces resource dynamism, e.g.,
time-varying availability, queue time, load, storage space, and network latency. Hid-
ing these factors from an application executing on multiple heterogeneous and dynamic
resources is difficult due to the complexity of choosing resources and distributing the
application’s tasks over them.
The AIMES 1 project (DE-FG02-12ER26115, DE-SC0008617, DE-SC0008651) ad-
dresses the above limitations by integrating abstractions representing distributed ap-
plications, resources, and execution processes into a pilot-based middleware. The mid-
dleware provides a platform on which distributed applications can be specified and
executed on multiple resources and under different configurations.
In this chapter, we present our contribution to the AIMES project: a study of
devising resource abstractions and dedicated services that characterize resource capacity
and capabilities and use this information to improve resource scheduling decisions in
dynamic and heterogeneous HPC environments. We implement the abstraction and
service in middleware, and use experimental evaluation to show the benefits of our
methodology:
• Our resource abstraction uniformly and consistently describes the core properties
of distributed computing resources.
• Our dedicated service draws insights on how to make better scheduling decisions.
Section 4.2 discusses the abstraction we defined. Section 4.3 discusses the imple-
mentation of the abstraction into a service. Section 4.4 presents the experimental eval-
uations. In Section 4.5, we provide a brief summary of related work. Section 4.6 reviews
the presented work.
4.2 The Bundle Abstraction
Scientific applications deployed in distributed, heterogeneous environments rely on accu-
rate and comprehensive resource characterization to guide their resource couplings from
end to end. Conceptually, resource couplings involve two aspects: the static coupling
and the dynamic coupling. The coupling is static when users select resources based on
the characteristics including capacity, performance, policies, and cost. Frequently, this
process depends on the user's knowledge and past experience, and the decisions are made
on an ad hoc basis. On the other hand, the coupling is dynamic when a user or software
1 An Integrated Middleware Framework to Enable Extreme Collaborative Science
[Figure 4.1 appears here: applications sit atop the resource abstractions (resource
representation and a resource interface with query, monitor, and discover operations),
which span compute (with memory), network, and storage resources across local
clusters, HPC clusters, and clouds.]
Figure 4.1: Overview of the Bundle layer.
monitors resource status and triggers resource adjustments when needed. This proce-
dure is not routinely practiced due to resource heterogeneity and dynamism. Both
static and dynamic resource couplings require systematic ways of constructing resource
characterizations.
Prior works [73, 74] have shown that resource abstraction is a powerful methodology
for resource characterization and resource discovery. We propose the Bundle abstraction
(or, in short, Bundle) to bridge applications and heterogeneous resources via uniform
resource characterizations.
The core concept of the Bundle abstraction is the Resource Bundle, which contains
an integrated group of resources. A Resource Bundle can include multiple resource
types, including compute, storage, and network. A Resource Bundle does not own its
resource components – any resource component may be shared across multiple Resource
Bundles. A Resource Bundle provides a convenient handle for aggregated query and
monitoring.
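As a concrete illustration (the field names below are ours, not those of the
implementation in Section 4.3), a Resource Bundle can be modeled as a named grouping
of references to shared components:

    from dataclasses import dataclass, field

    @dataclass
    class ResourceBundle:
        # Illustrative model: a bundle references shared components rather
        # than owning them, so the same compute queue, storage system, or
        # network link can appear in several bundles at once.
        name: str
        compute: list = field(default_factory=list)
        storage: list = field(default_factory=list)
        network: list = field(default_factory=list)

    queue = ["stampede/normal"]                     # a shared component handle
    b1 = ResourceBundle("bundle-a", compute=queue)
    b2 = ResourceBundle("bundle-b", compute=queue)  # shares the same queue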
Bundle comprises two parts: (1) Resource Representation and (2) Resource Inter-
face, which are depicted in Figure 4.1. Resource representation characterizes hetero-
geneous resources with a large degree of uniformity, thus hiding complexity. Resource
representation models resources across three basic categories: compute, network, and
storage. Given that memory is mostly assigned together with processors, Bundle treats
memory as an attribute of the compute resource. Measurements that are meaningful
across multiple platforms are identified in each category. For example, the property
“setup time” of a compute resource means queue wait time on an HPC cluster or virtual
machine startup latency on a cloud [75].
The resource interface exposes information about resource availability and capa-
bilities via an API. Two query modes are supported: on-demand and predictive. The
on-demand mode offers real-time measurements, while the predictive mode offers fore-
casts based on historical measurements of resource utilization rather than queue waiting
time, which is extremely hard to predict accurately [21, 15, 22].
The resource interface exposes three types of interface: querying, monitoring, and
discovering. The query interface uses end-to-end measurements to organize resource
information. For example, the query interface can be used to inquire how long it would
take to transfer a file from one location to a resource and vice versa. Although file
transfer times are difficult to estimate [76], proper tools [77] are capable of providing
estimates within an order of magnitude, which are still useful.
The monitoring interface can be used to inquire about resource state and to choose
system events for which to receive notifications. For example, performance variation
within a cluster can be monitored so that when the average performance has dropped
below a certain threshold for a certain period, subscribers of such an event will be
notified. This may trigger subsequent scheduling decisions such as adding more resources
to the application.
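A minimal sketch of such a subscription, with hypothetical API names, might look as
follows:

    def on_perf_drop(event):
        # React to the notification, e.g. ask the application scheduler
        # for more resources on another platform.
        print("average performance below threshold on", event["resource"])

    # Hypothetical subscription: fire when average performance stays below
    # 80% of nominal for 10 minutes (names are assumptions, not the API):
    # monitor.subscribe(event="avg_perf_below", threshold=0.8,
    #                   duration_min=10, callback=on_perf_drop)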
The discovery interface, which is future work, will let the user request resources
based on abstract requirements so that a tailored bundle can be created. A language
for specifying resource requirements is being developed. This concept has been shown to
be successful for storage aggregates in the Tiera project [74], where resource capacities
and resource policies are specified in a compact notation. A similar concept of a
resource-matching language has also been adopted in related works [78, 79, 73].
[Figure 4.2 appears here: applications and a workload manager access the Bundle API;
a Bundle Manager aggregates Resource Bundles from BundleAgents deployed on HPC
clusters, workstations, grids, and clouds.]
Figure 4.2: An overview of the Bundle architecture; blue shaded components comprise
the Bundle software.
Table 4.1: BundleAgent supported platforms

  FutureGrid testbed clusters:  India, Xray, Hotel, Sierra, Alamo
  XSEDE HPC clusters:           Stampede, Trestles, Gordon, Blacklight
  NERSC HPC clusters:           Hopper
  OSG grids:                    Most of the open sites
4.3 Implementation
Bundle is implemented as a loosely-coupled distributed software system consisting of
four components (see Figure 4.2). The first component is BundleAgent, which is de-
signed to work with individual platforms. BundleAgent collects the configuration of
each platform and constantly monitors the dynamic status of each platform. Bundle
software deploys a BundleAgent instance on each platform. There are two types of
BundleAgents: the LocalBundleAgent and the RemoteBundleAgent. A LocalBundleAgent
runs inside the resource it monitors. If local deployment is prohibited by policy, a
RemoteBundleAgent is deployed on a remote site. We have implemented BundleAgent
on an expanded set of resources across multiple organizations including XSEDE [80],
NERSC, and OSG [81] (Table 4.1).
The second component in the Bundle architecture is the BundleManager, which controls
Table 4.2: BundleAPI

  General
    get_list       Get a list of all the accessible resources.
    get_config     Get the current configuration of a resource, including
                   number of nodes, queues, and policy constraints.
  Compute Resources
    get_workload   Get the real-time workload of a platform, including
                   node and job status.
  Network Resources
    get_bandwidth  Get the bandwidth of the network connection between
                   two distributed resources.
and aggregates information from BundleAgents. The BundleManager maintains historical
data and performs various kinds of data processing.
The third component in the Bundle architecture is the ResourceBundle, which combines
a group of resources that are used together to execute an application. A ResourceBundle
describes resources using properties that are commonly applicable across heterogeneous
platforms. For example, HPC clusters are partitioned into different job queues, where
each queue manages a certain number of nodes, whereas in a grid, nodes are partitioned
into different sites. So when queried for configuration and capacity, the ResourceBundle
organizes information for both HPC and grid platforms based on queue/site partitions.
The same abstraction is used to describe other resource properties such as resource
acquisition time, compute power, and data transfer time.
The fourth component of Bundle is the BundleAPI, a group of general interfaces that
support uniform interactions and operations on the underlying resources. Table 4.2 lists
the major interfaces supported by the BundleAPI. The design goal of the BundleAPI is
to be useful yet general enough for heterogeneous resources.
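A short sketch of how a scheduler might use these calls (only the call names come from
Table 4.2; the argument and return shapes are assumptions):

    def survey(bundle):
        # `bundle` stands for a client handle to the Bundle service.
        resources = bundle.get_list()
        for r in resources:
            cfg = bundle.get_config(r)      # nodes, queues, policy constraints
            load = bundle.get_workload(r)   # real-time node and job status
        # end-to-end bandwidth between two of the discovered resources
        return bundle.get_bandwidth(resources[0], resources[1])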
Figure 4.3: Visualization of a month-long workload of the TACC Stampede HPC cluster.
4.4 Experiments
We quantitatively and qualitatively analyze both static and dynamic resource infor-
mation along the following three aspects: (1) HPC cluster workload variation (subsec-
tion 4.4.1), (2) grid compute node performance heterogeneity (subsection 4.4.2), and
(3) wide-area network performance (subsection 4.4.3).
4.4.1 HPC Cluster Workload Characterization
For an HPC cluster that comprises homogeneous compute nodes, the main source of
dynamism is workload variation. For a large-scale HPC cluster concurrently shared by
many users, the workload variation is created by the aggregated usage of all the
applications. As a result, one must observe all the queues and jobs over long periods
to gain a good understanding of the HPC cluster. In other words, a snapshot of the
HPC cluster will not provide adequate information for drawing useful insights on how to
improve scheduling decisions, such as choosing the cluster on which to schedule an
application at a certain time such that the expected wait time is shorter.
Figure 4.3 demonstrates the workload characterization with one month of data col-
lected by a BundleAgent running on TACC's Stampede HPC cluster [82]. In 2014,
Stampede was the world's 7th largest supercomputer. The upper graph shows the wait
times of every job submitted during the month. Each job is represented by a solid circle,
with the radius representing the job's number of processors. The x coordinate of a job's
circle represents the job's submit time, and the y coordinate represents the job's wait
time. The lower graph displays workload intensity, measured by the combined node
hours requested by all the jobs submitted per hour.
The data reveals two phenomena: (1) the temporal correlation of wait times among
jobs submitted at adjacent times, and (2) the skew in job wait time distributions. Unlike
what we had originally expected, the job wait times show clear patterns instead of
randomness. The wait times form multiple peaks: job wait times quickly increase until
reaching the highest value of a peak, then gradually decrease until they return to normal
or hit the next peak. Combined with the workload intensity curves, we observe that the
wait-time peaks appear after workload bursts. Also, we observe many large jobs whose
wait times are significantly shorter compared to other jobs submitted at similar times.
This phenomenon indicates that the Stampede cluster prioritizes large jobs.
Based on the above observations, we draw the following conclusions regarding how to
make intelligent scheduling decisions on this HPC cluster. Firstly, to avoid excessively
long job wait times, workload intensity measured by the combined CPU hours requested
during the last several hours (e.g., 1 to 4 hours) is a good indicator: when that number
exceeds a certain threshold, job wait times will quickly increase. Secondly, because this
cluster prioritizes large jobs, when large jobs arrive in the queue, they will significantly
increase the overall job wait time expectations. Application schedulers can subscribe
with Bundle to receive event notifications when multiple large jobs are detected in the
queues, and use the events to de-prioritize this cluster when making scheduling decisions.
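A minimal sketch of the first indicator, assuming a job log of (submit time, nodes,
requested hours) records and a site-specific alarm threshold (all names are ours):

    def intensity(jobs, now_hr, window_hr=4):
        # jobs: iterable of (submit_hour, nodes, requested_hours) requests.
        # Returns node-hours requested in the last `window_hr` hours -- the
        # indicator suggested above; the threshold is site-specific.
        return sum(n * h for (t, n, h) in jobs
                   if now_hr - window_hr <= t <= now_hr)

    jobs = [(0.5, 64, 2.0), (3.2, 1024, 1.0), (3.9, 16, 8.0)]
    congested = intensity(jobs, now_hr=4.0) > 1500   # illustrative threshold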
4.4.2 Grid Nodes Performance Heterogeneity
For grids that comprise heterogeneous compute nodes, scheduling an application's tasks
on different groups of nodes will result in significant differences in the application's per-
formance. Bundle characterizes the performance of all the compute nodes within a grid,
so that the performance information can be provided to application schedulers to improve
scheduling quality. Specifically, modern grid job schedulers such as HTCondor [83]
allow users to specify the sites or nodes on which they want their applications to be
scheduled. The grid scheduler will guarantee the user requirements during the match-
making [84] process. However, most grid users cannot leverage this mechanism to select
suitable resources due to a lack of node information.
Bundle sends probes to all the compute nodes in OSG [81]. The probes collect con-
figuration information and run benchmarks on each compute node. Then we aggregate
the performance information by clustering the nodes into groups such that the nodes in
each group have similar performance. We chose the HINT benchmark [85] to measure
compute-node CPU/memory performance. The HINT benchmark measures multiple
aspects of a compute node's performance, including processor speed, precision, usable
memory size, and bandwidth. HINT suits our needs for two reasons. Firstly, the HINT
results can be summarized by a single number, "Net QUIPS" (Quality Improvement Per
Second). Bundle uses this number as a first-order measure of performance. Users can
query Bundle to return a list of nodes that produce at least a certain Net QUIPS value.
Secondly, HINT measures QUIPS as a function of time. The performance of multiple
nodes can be plotted in the same graph, so that nodes can be compared using their
entire QUIPS curves. For example, Figure 4.4 shows the QUIPS curves of 13 arbitrary
compute nodes in the UC site, one of the largest sites of OSG. The performance of each
node is reflected in a descending curve; the higher the measurements, the stronger the
performance. The graph shows that the 13 nodes can be clustered into three groups.
For example, given any one node, Bundle is able to find all the similar nodes with
comparable performance. Application schedulers can use this grouping information
to schedule tasks to a cluster of similar nodes with known performance. Predictable
or known performance is key to the success of many quality-of-service (QoS)-aware grid
schedulers [86, 87, 88].
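A crude sketch of the grouping step, using only the scalar Net QUIPS values rather
than the full curves (the names and tolerance are illustrative, not our implementation's):

    def group_by_quips(nodes, tolerance=0.15):
        # nodes: list of (name, net_quips) pairs. Nodes whose Net QUIPS
        # lie within `tolerance` (relative) of a group's smallest member
        # are placed in the same group.
        groups = []
        for name, q in sorted(nodes, key=lambda x: x[1]):
            if groups and q <= groups[-1][0][1] * (1 + tolerance):
                groups[-1].append((name, q))
            else:
                groups.append([(name, q)])
        return groups

    nodes = [("c1", 10.2), ("c2", 10.9), ("c3", 21.5), ("c4", 33.0)]
    print(group_by_quips(nodes))   # -> three groups: {c1, c2}, {c3}, {c4}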
4.4.3 Grid Network Performance Variation
Multi-site grids like OSG provide distributed facilities owned by different organizations.
For example, OSG today has more than a hundred active computational sites worldwide.
It is common practice for a user to routinely select several sites for regular use. A
[Figure 4.4 appears here: QUIPS curves (MQUIPS, HINT benchmark) versus time
(10 ns to 100 s) for 13 compute nodes (uct2-* and iut2-*) in the UC site, clustered
within the same domain based on CPU performance.]
Figure 4.4: Compute node performance clustering
major factor for scheduling to specific sites is cross-site network heterogeneity: some
sites are geographically closer to the user, or they have faster connection speeds to a
shared storage system. In OSG, an intermediate file system called Stash provides the
storage for large input/output data. First, the user uploads input data to Stash. Next,
the data is transferred to the compute node once a task is scheduled, and results are
sent back to Stash. In this 2-hop scenario, the network bandwidth between Stash and
each compute node is a determining factor for the application's performance.
In this experiment, we measure the network bandwidth between Stash and compute
nodes in OSG. As mentioned above, BundleAgent sends probes to every compute node.
The probes measure network bandwidth using iPerf [89]. Figure 4.5 shows the distri-
butions of network bandwidths across 9 different sites. The average bandwidths of these
sites range from 3 Mbit/s to 957 Mbit/s, a difference of more than 300x. Our results
also show large network performance variation between different nodes of the same site.
Bundle measures both cross-site and intra-site network heterogeneity. This information
is key to the performance of inter-domain grid schedulers such as [89].
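A minimal sketch of the per-site aggregation, assuming raw iPerf measurements arrive
as (site, node, bandwidth) records (names are illustrative):

    from statistics import mean

    def site_bandwidth(probes):
        # probes: list of (site, node, mbit_per_s) iPerf measurements.
        per_site = {}
        for site, node, bw in probes:
            per_site.setdefault(site, []).append(bw)
        # Mean per site, plus min/max to expose intra-site variation too.
        return {s: (mean(v), min(v), max(v)) for s, v in per_site.items()}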
4.5 Related Work
Scheduling distributed applications to multiple HPC resources is a well-known research
problem. For example, the I-Way [90], Legion [91], Globus [92], and HTCondor [93]
[Figure 4.5 appears here: distributions of network bandwidth (Mbit/s) between
OSG-Connect Stash and nodes of nine OSG sites (AGLT2, BNL-ATLAS, Crane, MWT2,
NWICG_NDCMS, UCD, UCSDT2, UTA_SWT2, Cinvestav).]
Figure 4.5: Network bandwidth between OSG-Connect Stash and nodes of different
OSG sites
frameworks integrated existing tools to run distributed applications on multiple re-
sources. The abstraction and service presented in this chapter differ from the resource
representation layers in these frameworks by assuming multiple resources with dynamic
availability and diverse capabilities.
The Bundle service is built upon related work. The service leverages some of the
work done in information collection [73], resource discovery [94, 79], and resource char-
acterization as it relates to job wait time prediction [22] and its difficulties [21]. For
the job wait time characterization, our work takes an alternative approach. Instead
of trying to predict the exact wait time, our approach predicts high-level trends using
temporal and workload correlations.
4.6 Conclusion
Elastically scheduling a distributed application requires information about the applica-
tion requirements, and resource availability and capabilities. This information is used
to choose a suitable set of resources on which to run the application executable(s) and
a suitable scheduling of the application tasks. When considering multiple resources and
distributed applications, bringing together application- and resource-level information
requires specific abstractions that have to uniformly and consistently describe the core
properties of distributed applications, those of computing resources, and those of the
execution of the former on the latter.
The Bundle abstraction bridges applications and diverse resources via uniform resource
characterizations. Despite the diverse interfaces and platforms, the core properties of
distributed computing resources, spanning compute, storage, and network, are used to
uniformly and consistently describe diverse resources. Bundle provides simple APIs
that are valid across different resource platforms.
We implemented the abstraction as a dedicated service and integrated the service
with other components of the AIMES middleware. The Bundle service is deployed on
10+ heterogeneous platforms to capture the dynamism of distributed resources. We ran
multiple experiments to evaluate the Bundle service from three different aspects: the
ability to characterize workload variations, node performance heterogeneity, and network
variations. We use the data to draw insights on how to improve scheduling decisions in
dynamic and heterogeneous HPC resources.
Chapter 5
Conclusion
HPC clusters offer large-scale computational resources to scientific applications that
solve large-scale problems. Very few scientific applications run on dedicated resources.
Instead, queue-based batch systems are devised to manage how expensive resources are
shared among users.
Under the fundamental assumption that every application should achieve the highest
performance, a batch scheduler exclusively assigns a fixed amount of resources to each
application for its entire runtime. Although this policy maximizes resource efficiency, it
leads to long queue waiting times for applications. It also creates resource fragmentation,
which lowers system utilization, wasting money and energy. These problems become
more critical when many large parallel applications wait for resource allocations
concurrently.
Different from traditional batch jobs, a new class of data-driven scientific applica-
tions requires on-demand resources. To support these needs, HPC facilities reserve dedi-
cated clusters for on-demand requests. Because of the highly bursty and low-on-average
workload pattern, the on-demand clusters suffer from low utilization.
To mitigate these three problems, this thesis adopts and adapts the elastic schedul-
ing strategy, which is commonly practiced in Cloud Computing but rarely in HPC.
We explore different trade-offs between performance, utilization, and response time, and
contribute new techniques to HPC resource scheduling.
5.1 Research Contributions
We contribute by devising new approaches that leverage existing, widely used tech-
nologies. Our new approaches focus on services that interact with both users and
existing resource management systems (RMS). The services are implemented in nonin-
vasive ways, meaning that they do not seek to change the user's interface to the RMS.
Our new approaches employ new heuristic and predictive algorithms. Our algorithms
provide tunable knobs, which make them flexible across different workloads and user
preferences.
5.1.1 Elasticity for Parallel Batch System
Based on the insights from our long-term HPC resource characterizations, we found
that rigid job scheduling creates long wait times and undermines utilization. Inspired by
the Cloud's elastic scheduling, we developed a technique called Elastic Job Bundling (EJB)
to break the rigid coupling between application and job. EJB is a user-level service
functioning as a proxy between the user and an existing batch system. EJB receives
original job requests and transforms large job requests into multiple smaller subjobs.
Using smaller subjobs, EJB dynamically requests resources based on immediate
availability, and uses the available resources to run large parallel applications with
downgraded performance. EJB trades off application performance for lower response
time and higher utilization.
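Chapter 2 gives the full algorithm; the following fragment only sketches the core split,
with assumed names:

    def split_request(total_nodes, free_now):
        # Illustrative EJB-style split: the first subjob takes what is
        # immediately available via backfill; the remainder is requested
        # later as one or more additional subjobs.
        if free_now <= 0:
            return [total_nodes]    # nothing free: fall back to one job
        first = min(total_nodes, free_now)
        rest = total_nodes - first
        return [first] + ([rest] if rest else [])

    print(split_request(256, 96))   # -> [96, 160]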
We proposed and formally defined the concept of an elastic job. We devised an
approach to run a tightly-coupled parallel application on multiple elastic jobs. This
approach leverages existing techniques, including processor over-subscription and process
migration. Furthermore, we proposed relying on the existing batch scheduler's backfilling
mechanism to acquire immediately available resources, such that our approach can
control the 'shape' and the start time of subjobs.
We designed the elastic scheduling algorithm, a heuristic event-driven scheduling
algorithm triggered by periodic free-resource detection and subjob callback functions.
We proposed the architecture of the EJB service software and ran experiments to
validate our over-subscription and migration models.
Finally, we ran simulations with production traces to evaluate our approach. Simu-
lation results show that EJB significantly reduces large parallel jobs' turnaround time
without sacrificing that of the smaller jobs. At the same time, EJB reduces system
fragmentation, thus improving utilization under heavy workloads.
5.1.2 Elasticity for On-demand and Batch Hybrid System
Motivated by real-world use case scenarios in national labs, we developed a technique
to jointly satisfy on-demand request and batch job performance. Specifically, we intro-
duced a service called Balancer between existing schedulers and underlying resources.
Similar to our design goal in EJB, the Balancer is also non-invasive, in that it does not
require changes to existing resource scheduling systems, nor does it require users to
adopt new interfaces.
We described an architecture and implementation for dynamic non-invasive resource
reassignment between two systems: a system providing resources on-demand and a sys-
tem providing resources based on availability, that balances their respective objectives
in terms of the number of satisfied on-demand requests and utilization. We proposed
three algorithms for balancing resources in this context: the Basic Algorithm providing
a baseline of our systems, the Hint Algorithm that models the behavior where experi-
mental users can register the upcoming need for on-demand cycles, and the Predictive
algorithm for cases where such advance notice is not possible.
Based on a real-life scenario representing two years worth of on-demand and batch
workloads at Argonne National Laboratory, we demonstrated that by using our model on
existing resources we could reduce the current investment in on-demand infrastructure
by 82%, while at the same time improving the mean batch wait time almost by an
order of magnitude (8x). Our large-scale experiments, driven by synthetic traces derived
from the production trace, show that the Balancer significantly reduces the need for
dedicated node reserves for on-demand requests of up to 10% of cluster capacity, and
simultaneously improves batch performance and overall utilization.
5.1.3 Elastic Resource Abstraction for Heterogeneous Resources
Managing dynamism when executing a scientific application on multiple heterogeneous
HPC resources is difficult due to the complexity of choosing resources and distributing
the application’s tasks over them, especially when performance is a concern.
This thesis offers three main contributions to the issue of characterizing multiple
heterogeneous and dynamic HPC resources: (1) abstractions to represent resource con-
figuration, capabilities, and availability, which hide heterogeneity and complexity; (2)
the implementation of these abstractions in middleware designed to facilitate executions
of scientific applications; (3) experiments demonstrating how data-driven analysis can
improve resource characterizations.
5.2 Future Research Directions
5.2.1 Improving Elasticity by Using Container Based Solutions
In recent years, the container technique has drawn much attention from the HPC com-
munity [95, 96]. Firstly, compared to traditional VM-based virtualization, containers
are lightweight: the start-up time for a container is sub-second, compared to minutes
for a VM. Secondly, containers have built-in environment and version control, which
greatly simplifies software deployment; they also allow root access within a container.
Thirdly, containers support fine-grained resource sharing using techniques such as
cgroups to control limits on resource usage, including CPU, memory, disk I/O, and
network. Although the isolation provided by containers is sometimes inadequate to meet
rigorous industry requirements, the security level they provide is mostly sufficient in
academic settings.
In the context of elastic scheduling, containers will improve existing solutions as well
as enable new use cases. Our Balancer approach can leverage containers to improve
usability. One method is to schedule containers on VMs: the Balancer negotiates
resources between batch and on-demand schedulers. In our current implementation,
OpenStack is used as the on-demand scheduler, and each on-demand request is scheduled
on a VM. Compared to containers, a VM is a coarse-grained unit of resource scheduling.
Containers bridge the gap between VMs and tasks: each task can be wrapped in a
container, and multiple containers can be quickly launched on a VM. In this case, the
user does not need to manage the task runtime environment on a VM.
Furthermore, we can allow containers to run directly on physical nodes. This use
case enables more fine-grained resource sharing. For example, when resources are idle in
an HPC cluster, we can quickly run HTC tasks to utilize the idle resources. This type of
cycle stealing does not need to go through the Balancer, since these HTC workloads
can be killed when a batch job is scheduled. This use case enables batch/on-demand/HTC
three-way resource sharing. On-demand is given high priority: on-demand requests can
reclaim idle nodes from the batch scheduler. At the other end of the spectrum, the HTC
workload has the lowest priority; it fills transient idleness in the cluster, further improving
overall utilization.
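As an illustration of the cgroup-backed limits this relies on, the following sketch uses
the Docker SDK for Python; the image name and limit values are arbitrary choices for
the example, not part of our implementation:

    import docker  # assumes the Docker SDK for Python (docker-py)

    client = docker.from_env()
    # Cgroup-backed limits: cap an HTC filler task at 1 CPU and 2 GiB so
    # it can soak up transient idle capacity without disturbing batch jobs.
    client.containers.run(
        "python:3.11-slim",
        ["python", "-c", "print('htc filler task')"],
        mem_limit="2g",            # memory cgroup limit
        nano_cpus=1_000_000_000,   # 1 CPU, in units of 1e-9 CPUs
        detach=True,
    )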
5.2.2 Unified Batch and On-demand Scheduler
In recent years, there has been a trend toward converging HPC and Cloud [97, 98, 99].
Most works focus on hybrid usage of HPC and Cloud. However, we envision a unified
HPC scheduler that supports different types of SLAs. Counterpart schedulers in data-
centers, such as Borg [72], support both time-critical and batch workloads. But as we
have discussed in Section 3.4, they treat batch as a second-class citizen, meaning that
batch jobs can be killed, which is unacceptable in HPC. Besides, Borg is not open source.
The HPC world needs its own all-in-one scheduler that can support both traditional
batch jobs and on-demand jobs.
5.2.3 Node-level Resource Partitioning and Sharing
At the single-node level, the broad adoption and ubiquitous usage of multi-core architec-
tures has led a single server to be viewed more and more like a cluster, and has required
more careful engineering to unleash the performance of a single node. Though general-
purpose CPUs have yet to attain thousands of cores per chip as [100] had predicted,
resource contention triggered by node-level multitasking has urged researchers to examine
smarter ways to co-locate multiple applications on the same node [101]. For a single
application, work like [102] has shown how to optimize application end-to-end performance
by functionally partitioning multi-cores. Moreover, the fusion of CPU and GPU [103] has
led many researchers to explore better approaches for coordinating CPUs and GPUs to
work together efficiently.
References
[1] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang
Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang,
and Xiaofei Chen. 18.9-pflops nonlinear earthquake simulation on sunway taihu-
light: Enabling depiction of 18-hz and 8-meter scenarios. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, SC ’17, pages 2:1–2:12, New York, NY, USA, 2017. ACM.
[2] Business Insider. This vertical farm in newark, new jersey, could be the key to
solving some of agriculture’s biggest problems, 2018.
[3] Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally,
Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, et al.
Exascale computing study: Technology challenges in achieving exascale systems.
Defense Advanced Research Projects Agency Information Processing Techniques
Office (DARPA IPTO), Tech. Rep, 15, 2008.
[4] TOP500.org. First us exascale supercomputer now on track for 2021, 2016.
[5] Nikolay A Simakov, Joseph P White, Robert L DeLeon, Steven M Gallo,
Matthew D Jones, Jeffrey T Palmer, Benjamin Plessinger, and Thomas R Furlani.
A workload analysis of nsf’s innovative hpc resources using xdmod. arXiv preprint
arXiv:1801.04306, 2018.
[6] Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in cloud
computing: What it is, and what it is not. In Proceedings of the 10th International
Conference on Autonomic Computing (ICAC 13), pages 23–27, San Jose, CA,
2013. USENIX.
[7] F. Liu and J. B. Weissman. Elastic job bundling: an adaptive resource request
strategy for large-scale parallel applications. In SC15: International Conference
for High Performance Computing, Networking, Storage and Analysis, pages 1–12,
Nov 2015.
[8] Feng Liu, Kate Keahey, Pierre Riteau, and Jon Weissman. Dynamically nego-
tiating capacity between on-demand and batch clusters. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage,
and Analysis, SC ’18, pages 38:1–38:11, Piscataway, NJ, USA, 2018. IEEE Press.
[9] M. Turilli, F. Liu, Z. Zhang, A. Merzky, M. Wilde, J. Weissman, D. S. Katz, and
S. Jha. Integrating abstractions to enhance the execution of distributed applica-
tions. In 2016 IEEE International Parallel and Distributed Processing Symposium
(IPDPS), pages 953–962, May 2016.
[10] Shantenu Jha, Murray Cole, Daniel S. Katz, Manish Parashar, Omer Rana, and
Jon Weissman. Distributed computing practice for large-scale science and engi-
neering applications. 25, 08 2013.
[11] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis,
Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John
Shalf, Samuel Webb Williams, et al. The landscape of parallel computing research:
A view from berkeley. Technical report, Technical Report UCB/EECS-2006-183,
EECS Department, University of California, Berkeley, 2006.
[12] Walfredo Cirne and Francine Berman. Using moldability to improve the perfor-
mance of supercomputer jobs. Journal of Parallel and Distributed Computing,
62(10):1571–1601, 2002.
[13] Walfredo Cirne and Francine Berman. When the herd is smart: aggregate be-
havior in the selection of job request. Parallel and Distributed Systems, IEEE
Transactions on, 14(2):181–192, 2003.
[14] Gladys Utrera, Siham Tabik, Julita Corbalan, and Jesus Labarta. A job scheduling
approach for multi-core clusters based on virtual malleability. In Euro-Par 2012
Parallel Processing, pages 191–203. Springer, 2012.
[15] Allen B Downey. Predicting queue times on space-sharing parallel computers.
In Parallel Processing Symposium, 1997. Proceedings., 11th International, pages
209–218. IEEE, 1997.
[16] Allen B Downey. Using queue time predictions for processor allocation. In Job
Scheduling Strategies for Parallel Processing, pages 35–57. Springer, 1997.
[17] Warren Smith, Valerie Taylor, and Ian Foster. Using run-time predictions to
estimate queue wait times and improve scheduler performance. In Job Scheduling
Strategies for Parallel Processing, pages 202–219. Springer, 1999.
[18] Walfredo Cirne and Francine Berman. A model for moldable supercomputer
jobs. In Parallel and Distributed Processing Symposium., Proceedings 15th In-
ternational, pages 8–pp. IEEE, 2001.
[19] Rich Wolski. Experiences with predicting resource performance on-line in com-
putational grid settings. ACM SIGMETRICS Performance Evaluation Review,
30(4):41–49, 2003.
[20] Hui Li, David Groep, Jeffrey Templon, and Lex Wolters. Predicting job start
times on clusters. In Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE
International Symposium on, pages 301–308. IEEE, 2004.
[21] Dan Tsafrir, Yoav Etsion, and Dror G Feitelson. Backfilling using system-
generated predictions rather than user runtime estimates. Parallel and Distributed
Systems, IEEE Transactions on, 18(6):789–803, 2007.
[22] Daniel Nurmi, John Brevik, and Rich Wolski. Qbets: queue bounds estimation
from time series. In Job Scheduling Strategies for Parallel Processing, pages 76–
101. Springer, 2008.
[23] Garrick Staples. Torque resource manager. In Proceedings of the 2006 ACM/IEEE
conference on Supercomputing, page 8. ACM, 2006.
[24] Andy B. Yoo, Morris A. Jette, and Mark Grondona. Slurm: Simple linux
utility for resource management. In Dror Feitelson, Larry Rudolph, and Uwe
Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pages
44–60, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
[25] Ahuva W. Mu’alem and Dror G. Feitelson. Utilization, predictability, workloads,
and user runtime estimates in scheduling the ibm sp2 with backfilling. Parallel
and Distributed Systems, IEEE Transactions on, 12(6):529–543, 2001.
[26] Feng Liu and Jon Weissman. Elastic job bundling: An adaptive resource request
strategy for large-scale parallel applications. Technical report, TR15-006, Depart-
ment of Computer Science and Engineering, University of Minnesota, 2015.
[27] Andre Luckow, Mark Santcroos, Ole Weidner, Andre Merzky, Pradeep Mantha,
and Shantenu Jha. P*: A Model of Pilot-Abstractions. In 8th IEEE International
Conference on e-Science 2012, 2012.
[28] David H Bailey, Eric Barszcz, John T Barton, David S Browning, Russell L Carter,
Leonardo Dagum, Rod A Fatoohi, Paul O Frederickson, Thomas A Lasinski, Rob S
Schreiber, et al. The nas parallel benchmarks. International Journal of High
Performance Computing Applications, 5(3):63–73, 1991.
[29] Futuregrid. futuregrid.org.
[30] Dominique LaSalle and George Karypis. Mpi for big data: New tricks for an old
dog. Parallel Computing, 40(10):754–767, 2014.
[31] Jason Ansel, Kapil Aryay, and Gene Coopermany. Dmtcp: Transparent check-
pointing for cluster computations and the desktop. In Parallel & Distributed
Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–12.
IEEE, 2009.
[32] Julien Adam, Jean-Baptiste Besnard, Allen D. Malony, Sameer Shende, Marc
Perache, Patrick Carribault, and Julien Jaeger. Transparent high-speed network
checkpoint/restart in mpi. In Proceedings of the 25th European MPI Users’ Group
Meeting, EuroMPI’18, pages 12:1–12:11, New York, NY, USA, 2018. ACM.
[33] Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
[34] pyss - the python scheduler simulator. https://code.google.com/p/pyss/.
[35] David J Lilja. Measuring computer performance: a practitioner’s guide. Cam-
bridge University Press, 2000.
[36] Dror G Feitelson. Metric and workload effects on computer systems evaluation.
Computer, 36(9):18–25, 2003.
[37] Adam Wierman. Revisiting the performance of large jobs in the M/GI/1 queue.
In Proceedings of the Forty-Fifth Annual Allerton Conference On Communication,
Control, and Computing, pages 607–614, 2007.
[38] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold
Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. The case for tiny tasks in
compute clusters.
[39] Jon B Weissman, Lakshman Rao Abburi, and Darin England. Integrated schedul-
ing: the best of both worlds. Journal of Parallel and Distributed Computing,
63(6):649–668, 2003.
[40] Rajesh Sudarsan and Calvin J Ribbens. Reshape: A framework for dynamic
resizing and scheduling of homogeneous applications in a parallel environment.
In Parallel Processing, 2007. ICPP 2007. International Conference on, page 44.
IEEE, 2007.
[41] Rajesh Sudarsan and Calvin J Ribbens. Scheduling resizable parallel applications.
In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Sym-
posium on, pages 1–10. IEEE, 2009.
[42] https://bitbucket.org/francis_liu/pyss.
[43] P. Marshall, K. Keahey, and T. Freeman. Improving utilization of infrastructure
clouds. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and
Grid Computing, pages 205–214, May 2011.
[44] David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer.
Seti@home: An experiment in public-resource computing. Commun. ACM,
45(11):56–61, November 2002.
[45] O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir. Decon-
structing amazon ec2 spot instance pricing. In 2011 IEEE Third International
Conference on Cloud Computing Technology and Science, pages 304–311, Nov
2011.
[46] Garrick Staples. Torque resource manager. In Proceedings of the 2006 ACM/IEEE
Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM.
[47] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The
portable batch scheduler and the maui scheduler on linux clusters. In Proceedings
of the 4th Annual Linux Showcase & Conference - Volume 4, ALS’00, pages 27–27,
Berkeley, CA, USA, 2000. USENIX Association.
[48] Irfan Habib. Virtualization with kvm. Linux J., 2008(166), February 2008.
[49] Kate Keahey, Pierre Riteau, Dan Stanzione, Tim Cockerill, Joe Manbretti, Paul
Rad, and Ruth. Paul. Chameleon: a Scalable Production Testbed for Computer
Science Research. In Contemporary High Performance Computing vol. 3. Ed. Jeff
Vetter. Springer, 2017.
[50] John R. Lange, Kevin Pedretti, Peter Dinda, Patrick G. Bridges, Chang Bae,
Philip Soltero, and Alexander Merritt. Minimal-overhead virtualization of a large
scale supercomputer. In Proceedings of the 7th ACM SIGPLAN/SIGOPS Interna-
tional Conference on Virtual Execution Environments, VEE ’11, pages 169–180,
New York, NY, USA, 2011. ACM.
[51] Bryce Allen, Rachana Ananthakrishnan, Kyle Chard, Ian Foster, Ravi Madduri,
Jim Pruyne, Stephen Rosen, and Steve Tuecke. Globus: A case study in software
as a service for scientists. In Proceedings of the 8th Workshop on Scientific Cloud
Computing, ScienceCloud ’17, pages 25–32, New York, NY, USA, 2017. ACM.
[52] Lavanya Ramakrishnan, Piotr T. Zbiegel, Scott Campbell, Rick Bradshaw,
Richard Shane Canon, Susan Coghlan, Iwona Sakrejda, Narayan Desai, Tina
Declerck, and Anping Liu. Magellan: Experiences from a science cloud. In
Proceedings of the 2Nd International Workshop on Scientific Cloud Computing,
ScienceCloud ’11, pages 49–58, New York, NY, USA, 2011. ACM.
[53] Qiming He, Shujia Zhou, Ben Kobler, Dan Duffy, and Tom McGlynn. Case study
for running hpc applications in public clouds. In Proceedings of the 19th ACM
International Symposium on High Performance Distributed Computing, HPDC
’10, pages 395–401, New York, NY, USA, 2010. ACM.
[54] I. Sadooghi, J. H. Martin, T. Li, K. Brandstatter, K. Maheshwari, T. P. P. de Lac-
erda Ruivo, G. Garzoglio, S. Timm, Y. Zhao, and I. Raicu. Understanding the
performance and potential of cloud computing for scientific applications. IEEE
Transactions on Cloud Computing, 5(2):358–371, April 2017.
[55] F. J. Clemente-Castello, B. Nicolae, R. Mayo, and J. C. Fernandez. Performance
model of mapreduce iterative applications for hybrid cloud bursting. IEEE Trans-
actions on Parallel and Distributed Systems, PP(99):1–1, 2018.
[56] M. Parashar, M. AbdelBaky, I. Rodero, and A. Devarakonda. Cloud paradigms
and practices for computational and data-enabled science and engineering. Com-
puting in Science Engineering, 15(4):10–18, July 2013.
[57] G. Fox and S. Jha. Conceptualizing a computing platform for science beyond 2020:
To cloudify hpc, or hpcify clouds? In 2017 IEEE 10th International Conference
on Cloud Computing (CLOUD), pages 808–810, June 2017.
[58] S. Niu, J. Zhai, X. Ma, X. Tang, and W. Chen. Cost-effective cloud hpc resource
provisioning by building semi-elastic virtual clusters. In 2013 SC - International
Conference for High Performance Computing, Networking, Storage and Analysis
(SC), pages 1–12, Nov 2013.
[59] A. Marathe, R. Harris, D. K. Lowenthal, B. R. de Supinski, B. Rountree, and
M. Schulz. Exploiting redundancy and application scalability for cost-effective,
time-constrained execution of hpc applications on amazon ec2. IEEE Transactions
on Parallel and Distributed Systems, 27(9):2574–2588, Sept 2016.
[60] Ishai Menache, Ohad Shamir, and Navendu Jain. On-demand, spot, or both: Dy-
namic resource allocation for executing batch jobs in the cloud. In 11th Interna-
tional Conference on Autonomic Computing (ICAC 14), pages 177–187, Philadel-
phia, PA, 2014. USENIX Association.
[61] Y. Gong, B. He, and A. C. Zhou. Monetary cost optimizations for mpi-based hpc
applications on amazon clouds: checkpoints and replicated execution. In SC15:
International Conference for High Performance Computing, Networking, Storage
and Analysis, pages 1–12, Nov 2015.
[62] R. Chard, K. Chard, K. Bubendorfer, L. Lacinski, R. Madduri, and I. Foster.
Cost-aware cloud provisioning. In 2015 IEEE 11th International Conference on
e-Science, pages 136–144, Aug 2015.
[63] Rich Wolski, John Brevik, Ryan Chard, and Kyle Chard. Probabilistic guar-
antees of execution duration for amazon spot instances. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, SC ’17, pages 18:1–18:11, New York, NY, USA, 2017. ACM.
[64] Simon Delamare, Gilles Fedak, Derrick Kondo, and Oleg Lodygensky. Spequlos: A
qos service for bot applications using best effort distributed computing infrastruc-
tures. In Proceedings of the 21st International Symposium on High-Performance
Parallel and Distributed Computing, HPDC ’12, pages 173–186, New York, NY,
USA, 2012. ACM.
[65] Marcus Carvalho, Walfredo Cirne, Francisco Brasileiro, and John Wilkes. Long-
term SLOs for reclaimed cloud computing resources. In ACM Symposium on
Cloud Computing (SoCC), pages 20:1–20:13, Seattle, WA, USA, 2014.
[66] Supreeth Shastri, Amr Rizk, and David Irwin. Transient guarantees: Maximizing
the value of idle cloud capacity. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, SC ’16, pages
85:1–85:11, Piscataway, NJ, USA, 2016. IEEE Press.
[67] Ying Yan, Yanjie Gao, Yang Chen, Zhongxin Guo, Bole Chen, and Thomas Mosci-
broda. Tr-spark: Transient computing for big data analytics. In Proceedings of the
ACM Symposium on Cloud Computing, SoCC ’16, pages 484–496. ACM, October
2016.
[68] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and
Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proceedings
of the 42nd Annual International Symposium on Computer Architecture,
ISCA ’15, pages 450–462, New York, NY, USA, 2015. ACM.
[69] Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur
Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo
Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. Morpheus: Towards
automated slos for enterprise clusters. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16), pages 117–134, Savannah, GA,
2016. USENIX Association.
[70] Ganesh Ananthanarayanan, Christopher Douglas, Raghu Ramakrishnan, Sriram
Rao, and Ion Stoica. True elasticity in multi-tenant data-intensive compute clus-
ters. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC
’12, pages 24:1–24:7, New York, NY, USA, 2012. ACM.
[71] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D.
Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-
grained resource sharing in the data center. In Proceedings of the 8th USENIX
Conference on Networked Systems Design and Implementation, NSDI’11, pages
295–308, Berkeley, CA, USA, 2011. USENIX Association.
[72] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric
Tune, and John Wilkes. Large-scale cluster management at google with borg. In
Proceedings of the Tenth European Conference on Computer Systems, EuroSys
’15, pages 18:1–18:17, New York, NY, USA, 2015. ACM.
[73] Michael Cardosa and Abhishek Chandra. Resource bundles: Using aggregation for
statistical wide-area resource discovery and allocation. In Proceedings of the 28th
International Conference on Distributed Computing Systems (ICDCS ’08), pages
760–768. IEEE, 2008.
[74] Ajaykrishna Raghavan, Abhishek Chandra, and Jon Weissman. Tiera: towards
flexible multi-tiered cloud storage instances. In Proceedings of the 15th Interna-
tional Middleware Conference, pages 1–12. ACM, 2014.
[75] Yogesh Simmhan and Lavanya Ramakrishnan. Comparison of resource platform
selection approaches for scientific workflows. In Proceedings of the 19th ACM In-
ternational Symposium on High Performance Distributed Computing, pages 445–
450. ACM, 2010.
[76] Sudharshan Vazhkudai, Jennifer M. Schopf, and Ian T. Foster. Predicting the
performance of wide area data transfers. In Proceedings of the 16th International
Parallel and Distributed Processing Symposium, IPDPS ’02, page 270, Washington,
DC, USA, 2002. IEEE Computer Society.
[77] Tevfik Kosar and Miron Livny. Stork: Making data placement a first class citizen
in the grid. In Proceedings of the 24th International Conference on Distributed
Computing Systems (ICDCS 2004), pages 342–349. IEEE, 2004.
[78] Chuang Liu and Ian Foster. A constraint language approach to matchmak-
ing. In Proceedings of the 14th International Workshop on Research Issues on
Data Engineering: Web Services for E-Commerce and E-Government Applica-
tions (RIDE’04), RIDE ’04, pages 7–14, Washington, DC, USA, 2004. IEEE
Computer Society.
[79] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat. Design and im-
plementation tradeoffs for wide-area resource discovery. In Proceedings of the
14th IEEE International Symposium on High Performance Distributed Computing
(HPDC-14), pages 113–124. IEEE, July 2005.
[80] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Ha-
zlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and
N. Wilkins-Diehr. Xsede: Accelerating scientific discovery. Computing in Sci-
ence & Engineering, 16(5):62–74, Sept.-Oct. 2014.
[81] Ruth Pordes, Don Petravick, Bill Kramer, Doug Olson, Miron Livny, Alain Roy,
Paul Avery, Kent Blackburn, Torre Wenaus, Frank Würthwein, Ian Foster, Rob
Gardner, Mike Wilde, Alan Blatecky, John McGee, and Rob Quick. The open
science grid. Journal of Physics: Conference Series, 78(1):012057, 2007.
[82] Stampede user guide. https://portal.tacc.utexas.edu/archives/stampede.
[83] B Bockelman, T Cartwright, J Frey, E M Fajardo, B Lin, M Selmeci, T Tannen-
baum, and M Zvada. Commissioning the htcondor-ce for the open science grid.
Journal of Physics: Conference Series, 664(6):062003, 2015.
[84] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Distributed
resource management for high throughput computing. In Proceedings of the 7th
IEEE International Symposium on High Performance Distributed Computing
(HPDC-7), page 140. IEEE, 1998.
[85] John L Gustafson and Quinn O Snell. Hint: A new way to measure computer
performance. In Proceedings of the Twenty-Eighth Hawaii International Conference
on System Sciences, volume 2, pages 392–401. IEEE, 1995.
[86] Ritu Garg and Awadhesh Kumar Singh. Adaptive workflow scheduling in grid
computing based on dynamic resource availability. Engineering Science and Tech-
nology, an International Journal, 18(2):256–269, 2015.
[87] Kousik Dasgupta, Brototi Mandal, Paramartha Dutta, Jyotsna Kumar Mandal,
and Santanu Dam. A genetic algorithm (ga) based load balancing strategy for
cloud computing. Procedia Technology, 10:340–347, 2013.
[88] Fatos Xhafa and Ajith Abraham. Computational models and heuristic methods
for grid scheduling problems. Future Generation Computer Systems, 26(4):608–621,
2010.
[89] Ajay Tirumala, Feng Qin, Jon Dugan, Jim Ferguson, and Kevin Gibbs. iperf:
Tcp/udp bandwidth measurement tool, January 2005.
[90] Ian Foster, Jonathan Geisler, Bill Nickless, Warren Smith, and Steven Tuecke.
Software infrastructure for the i-way high-performance distributed computing ex-
periment. In Proceedings of the 5th IEEE International Symposium on High
Performance Distributed Computing, pages 562–571. IEEE, 1996.
[91] Andrew S. Grimshaw, Wm. A. Wulf, and The Legion Team. The
legion vision of a worldwide virtual computer. Commun. ACM, 40(1):39–45, Jan-
uary 1997.
[92] Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the grid: En-
abling scalable virtual organizations. The International Journal of High Perfor-
mance Computing Applications, 15(3):200–222, 2001.
[93] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing
in practice: the condor experience. Concurrency and Computation: Practice and
Experience, 17(2-4):323–356, 2005.
[94] Chuang Liu and Ian Foster. A constraint language approach to grid resource
selection, January 2003.
[95] Charles Zheng and Douglas Thain. Integrating containers into workflows: A
case study using makeflow, work queue, and docker. In Proceedings of the 8th
International Workshop on Virtualization Technologies in Distributed Computing,
pages 31–38. ACM, 2015.
[96] Pankaj Saha, Angel Beltre, Piotr Uminski, and Madhusudhan Govindaraju. Eval-
uation of docker containers for scientific workloads in the cloud. In Proceedings
of the Practice and Experience on Advanced Research Computing, PEARC ’18,
pages 11:1–11:8, New York, NY, USA, 2018. ACM.
[97] Gabriel Mateescu, Wolfgang Gentzsch, and Calvin J Ribbens. Hybrid computing-
where hpc meets grid and cloud computing. Future Generation Computer Systems,
27(5):440–453, 2011.
[98] Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburuga-
muve. Big data, simulations and hpc convergence. In Tilmann Rabl, Raghunath
Nambiar, Chaitanya Baru, Milind Bhandarkar, Meikel Poess, and Saumyadipta
Pyne, editors, Big Data Benchmarking, pages 3–17, Cham, 2016. Springer Inter-
national Publishing.
[99] Thamarai Selvi Somasundaram and Kannan Govindarajan. Cloudrb: A frame-
work for scheduling and managing high-performance computing (hpc) applications
in science cloud. Future Generation Computer Systems, 34:47–65, 2014.
[100] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. Computer,
41(7):33–38, July 2008.
[101] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and qos-
aware cluster management. In Proceedings of the 19th International Conference
on Architectural Support for Programming Languages and Operating Systems, AS-
PLOS ’14, pages 127–144, New York, NY, USA, 2014. ACM.
[102] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng, X. Ma, Y. Kim, C. Engelmann,
and G. Shipman. Functional partitioning to optimize end-to-end performance
on many-core architectures. In 2010 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis, pages 1–12,
Nov 2010.
[103] Sparsh Mittal and Jeffrey S. Vetter. A survey of cpu-gpu heterogeneous computing
techniques. ACM Comput. Surv., 47(4):69:1–69:35, July 2015.